The frontier model roster in Thikaa
Every merchant account on Thikaa can pick its own LLM per bot. The current roster, as of April 2026: Claude Opus 4.5, GPT-4o, Gemini 3 Pro, Qwen VL, and Grok.
The question is never "which one is best." It is "which one is best for this specific bot at this price point at this latency."
How we actually test models for Arabic
Leaderboards measure averages over benchmarks, most of which are in English. We built an internal eval suite of ~600 real Arabic conversations drawn (anonymized) from the Thikaa fleet. Each conversation is graded on:
Dialect fluency
Does the reply sound like a native speaker of the customer's dialect, not a translated-from-English robot?
Instruction-following
If the bot was told "never quote a price — always show the catalog link," does it obey under pressure?
Hallucination
Did the bot invent a tracking number, a discount code, or a product feature?
Grounding on docs
When the answer is in the RAG bundle, does the model find it and cite it, or just wing it?
Voice handling
Given a messy Whisper transcript with dialect + background noise, does it still answer correctly?
Vision handling
Given a product photo, does it match to the right SKU or fabricate a similar one?
Latency
Time to first token and total response time under production load.
Cost per conversation
Token cost × average conversation length.
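The grading loop above can be sketched as a small aggregation step. This is a hypothetical illustration, not Thikaa's actual eval harness: the axis names mirror the criteria listed, and the cost line implements the "token cost × conversation length" idea by pricing average token usage per conversation.

```python
from dataclasses import dataclass

# Quality axes from the eval criteria above (latency and cost are
# tracked separately as raw measurements, not 0-1 grades).
AXES = [
    "dialect_fluency", "instruction_following", "hallucination",
    "grounding", "voice_handling", "vision_handling",
]

@dataclass
class ConversationEval:
    scores: dict            # axis -> grade in [0, 1], higher is better
    latency_ms: float       # total response time under load
    tokens_used: int        # prompt + completion tokens for the conversation

def aggregate(evals, price_per_1k_tokens):
    """Summarize one model's run: mean grade per axis, mean latency,
    and cost per conversation (average token usage x per-token price)."""
    n = len(evals)
    per_axis = {a: sum(e.scores[a] for e in evals) / n for a in AXES}
    avg_tokens = sum(e.tokens_used for e in evals) / n
    return {
        "axes": per_axis,
        "avg_latency_ms": sum(e.latency_ms for e in evals) / n,
        "cost_per_conversation": (avg_tokens / 1000) * price_per_1k_tokens,
    }
```

Running `aggregate` once per candidate model over the same ~600 conversations yields directly comparable rows, which is what makes the per-bot routing decisions later in this article possible.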
Claude Opus 4.5: the reasoner
Claude Opus 4.5 wins for us on three axes: (a) best at following long system prompts without drifting, (b) best at Arabic reasoning — especially when a customer's question is indirect or culturally loaded, and (c) the 200K context window lets us paste the entire business knowledge base plus the conversation history into a single prompt.
Concrete example: an Arabic-speaking customer asks "هل المنتج ده مناسب لابني اللي عنده حساسية من القطن؟" (does this product suit my son who is allergic to cotton?). Opus reads the catalog, sees the material composition, cross-references against the allergen note, and answers honestly — including saying "no" when that is the right answer. GPT-4o sometimes says "yes" to be polite.
Where Opus loses: pure latency (slightly slower first token) and price per million tokens (highest of the roster). We route high-stakes conversations — complaints, high-value orders, complex support — to Opus and let volume traffic go elsewhere.
GPT-4o: the default workhorse
GPT-4o is the model most merchants will start on and many will never need to leave. It handles Arabic well (not as natively as Opus, but comfortably conversational), it has excellent vision, it is fast, and the price point works for high-volume use.
Where GPT-4o shines: short-turn product Q&A, image-based triage, and English-dominant bots. Where it stumbles: very long multi-turn conversations (context gets fuzzy past ~50 turns) and deeply indirect Arabic phrasing, where Opus keeps up but GPT-4o hallucinates an interpretation.
Gemini 3 Pro: the context monster
The 1M token context window is not a gimmick — for a merchant with a 300-page product manual plus 50K past conversations, Gemini 3 Pro is the only model that can hold the whole thing in one prompt. That eliminates a lot of retrieval error because the model can see everything.
Vision on Gemini 3 Pro is genuinely strong for Arabic OCR — handwritten notes, stylized store signage, and layouts with mixed LTR/RTL text are where it pulls ahead. The drawback is the per-token cost at 1M context, so we use Gemini where the context actually justifies it rather than as a default.
Qwen VL + Grok: the specialists
Qwen VL (Alibaba) earns a spot for merchants who want vision at scale without the frontier price tag. Arabic text recognition is competitive, and the pricing lets high-volume bots (e.g., e-commerce customer service with 10K+ image questions a month) stay in budget.
Grok has a devoted following among technical merchants — SaaS support bots, code-adjacent use cases. For a florist in Riyadh, it is overkill. For a developer tools company running bilingual English-Arabic docs support, it is a legitimate contender.
How we actually route traffic in production
Merchants can pick a single model per bot. Power users set up routing rules:
Default model
GPT-4o for everyday messages. Fast, vision-capable, solid Arabic.
Complaint detection → Opus
If intent-detection flags a complaint, re-run the response through Claude Opus 4.5 before sending. Higher cost, but the politeness recovery rate is measurably better.
Image-heavy flows → GPT-4o or Gemini
Depending on language: English/mixed → GPT-4o; Arabic text in images → Gemini 3 Pro.
High-value orders → Opus
For orders above a merchant-configurable threshold, the reasoner gets the reply.
Circuit breaker
If the chosen provider returns 3 failures in a row, Thikaa automatically fails over to the second-choice model for 60 seconds.
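The routing rules above can be condensed into one decision function. This is a minimal sketch under assumed names: the model identifiers, the message fields (`intent`, `order_value`, `has_image`, `image_language`), and the failure-count map are all illustrative, not Thikaa's real API.

```python
def pick_model(msg, failures, high_value_threshold=500):
    """Route one incoming message per the rules above.

    msg:      dict describing the message (intent, order value, image info)
    failures: dict of model -> consecutive provider failures
              (circuit-breaker state; production also resets after 60s)
    """
    if msg.get("intent") == "complaint":
        choice = "claude-opus-4.5"      # complaints go to the reasoner
    elif msg.get("order_value", 0) > high_value_threshold:
        choice = "claude-opus-4.5"      # high-value orders too
    elif msg.get("has_image"):
        # Arabic text in images -> Gemini; English/mixed -> GPT-4o
        choice = "gemini-3-pro" if msg.get("image_language") == "ar" else "gpt-4o"
    else:
        choice = "gpt-4o"               # default workhorse
    # Circuit breaker: 3 consecutive failures -> fail over to second choice
    if failures.get(choice, 0) >= 3:
        choice = "gpt-4o" if choice != "gpt-4o" else "gemini-3-pro"
    return choice
```

Keeping the routing logic in one pure function like this makes it cheap to unit-test against recorded traffic before changing a threshold in production.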
The point: frontier models are commodities now. The value is in the routing, the guardrails, and the conversation memory — not in picking one "winner."
FAQ
Can I switch models without rebuilding my bot?
Yes. Model choice is per bot, in the bot settings. Flipping Claude Opus ↔ GPT-4o is a dropdown — the prompts, knowledge base, and flows stay the same.
Does Thikaa mark up model pricing?
No. We pass through provider pricing at cost for API usage and bill platform access at a flat monthly fee ($10–$40). Every new tenant gets $5 starter AI credit.
What happens if my chosen model has an outage?
Thikaa's circuit breaker auto-fails over to the secondary model for 60 seconds after 3 consecutive failures. You can configure the fallback order.
Can I bring my own API key?
Yes. The Enterprise plan supports bring-your-own-key (BYOK) for Claude, OpenAI, Gemini, and Qwen; the conversation still runs through Thikaa's orchestration, but billing goes direct to the provider.
Which model is cheapest per conversation?
GPT-4o mini is usually cheapest for short Q&A. Qwen VL is competitive at high volume. Claude Opus 4.5 is the most expensive but resolves complex issues in fewer turns, so total cost can be lower.
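The "fewer turns can beat a cheaper rate" point is simple arithmetic. The per-turn prices and turn counts below are made-up assumptions for illustration only, not real provider pricing:

```python
def cost_to_resolve(price_per_turn, turns):
    """Total conversation cost = per-turn cost x turns to resolution."""
    return price_per_turn * turns

# Hypothetical numbers: a cheap model that needs 30 turns to close an
# issue ends up costing more than a pricier model that closes it in 3.
cheap_many_turns = cost_to_resolve(0.002, 30)   # 0.06
pricey_few_turns = cost_to_resolve(0.015, 3)    # 0.045
```

The crossover point depends entirely on your traffic, which is why testing on your own conversations (below) beats any generic price table.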
Try every model on your own bot
Thikaa's $5 starter credit covers ~10,000 GPT-4o turns or ~2,000 Claude Opus turns. Test the models on your real customer conversations before committing.
Start Free Trial