Frontier LLMs

Why We Run Claude Opus 4.5 + GPT-4o in Production for Arabic Bots

Picking an AI model for an Arabic chatbot is not about leaderboards — it is about which model reads Khaleeji voice transcripts, follows a 20-page business brief, and refuses to invent a discount code. Here is what we run in production.

April 18, 2026 · 11 min read · Models · Claude · GPT-4o · Arabic

The frontier model roster in Thikaa

Every merchant account on Thikaa can pick its own LLM per bot. The current roster, as of April 2026:

Claude Opus 4.5 (Anthropic): 200K context · vision · strong instruction-following · best at Arabic reasoning
Claude Sonnet 4 (Anthropic): faster, cheaper Claude for high-volume bots; still 200K context
GPT-4o (OpenAI): 128K context · vision · fast · strong English + solid Arabic. Our default.
GPT-4o mini (OpenAI): cheapest vision-capable model; good for greetings and FAQ-heavy bots
o1 (OpenAI): reasoning model; useful for complex troubleshooting where accuracy beats latency
Gemini 3 Pro (Google): up to 1M context · excellent vision, including Arabic OCR
Qwen VL (Alibaba): strong Arabic text + competitive vision at high volume
Grok (xAI): fast · vision · good code + technical reasoning

The question is never "which one is best." It is "which one is best for this specific bot at this price point at this latency."

How we actually test models for Arabic

Leaderboards measure averages over benchmarks, most of which are English. We built an internal eval suite of ~600 real Arabic conversations, drawn (anonymized) from the Thikaa fleet. Each conversation is graded on:

Dialect fluency

Does the reply sound like a native speaker of the customer's dialect, not a translated-from-English robot?

Instruction-following

If the bot was told "never quote a price — always show the catalog link," does it obey under pressure?

Hallucination

Did the bot invent a tracking number, a discount code, or a product feature?

Grounding on docs

When the answer is in the RAG bundle, does the model find it and cite it, or just wing it?

Voice handling

Given a messy Whisper transcript with dialect + background noise, does it still answer correctly?

Vision handling

Given a product photo, does it match to the right SKU or fabricate a similar one?

Latency

Time to first token and total response time under production load.

Cost per conversation

Token cost × average conversation length.
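To make the rubric concrete, here is a minimal sketch of how a per-conversation grade and cost could be recorded and aggregated. All field names, the unweighted mean, and the pricing math are illustrative assumptions, not Thikaa's actual eval schema.

```python
from dataclasses import dataclass

# Hypothetical per-conversation eval record; field names mirror the
# eight axes above but are illustrative, not Thikaa's real schema.
@dataclass
class ConversationGrade:
    dialect_fluency: float        # 0-1: native-speaker feel, not translated-English
    instruction_following: float  # 0-1: obeys the system prompt under pressure
    hallucination_free: float     # 1.0 = no invented tracking numbers or codes
    grounding: float              # 0-1: cites the RAG bundle when the answer is there
    voice_handling: float         # 0-1: correct despite a messy Whisper transcript
    vision_handling: float        # 0-1: matched the right SKU from a photo
    first_token_ms: int           # latency: time to first token
    total_tokens: int             # for the cost calculation

def cost_per_conversation(total_tokens: int, usd_per_million: float) -> float:
    """Token cost x conversation length, as in the rubric above."""
    return total_tokens / 1_000_000 * usd_per_million

def quality_score(g: ConversationGrade) -> float:
    """Unweighted mean over the six quality axes (real weights are a per-bot choice)."""
    axes = [g.dialect_fluency, g.instruction_following, g.hallucination_free,
            g.grounding, g.voice_handling, g.vision_handling]
    return sum(axes) / len(axes)
```

A 2,000-token conversation on a model priced at $5 per million tokens works out to one cent, which is why the latency and cost axes matter as much as the quality ones at fleet scale.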

Claude Opus 4.5: the reasoner

Claude Opus 4.5 wins for us on three axes: (a) best at following long system prompts without drifting, (b) best at Arabic reasoning — especially when a customer's question is indirect or culturally loaded, and (c) the 200K context window lets us paste the entire business knowledge base plus the conversation history into a single prompt.

Concrete example: an Arabic-speaking customer asks "هل المنتج ده مناسب لابني اللي عنده حساسية من القطن؟" (does this product suit my son who is allergic to cotton?). Opus reads the catalog, sees the material composition, cross-references against the allergen note, and answers honestly — including saying "no" when that is the right answer. GPT-4o sometimes says "yes" to be polite.

Where Opus loses: pure latency (slightly slower first token) and price per million tokens (highest of the roster). We route high-stakes conversations — complaints, high-value orders, complex support — to Opus and let volume traffic go elsewhere.

GPT-4o: the default workhorse

GPT-4o is the model most merchants will start on and many will never need to leave. It handles Arabic well (not as native as Opus but comfortably conversational), it has excellent vision, it is fast, and the price point works for high-volume use.

Where GPT-4o shines: short-turn product Q&A, image-based triage, and English-dominant bots. Where it stumbles: very long multi-turn conversations (context gets fuzzy past ~50 turns) and deeply indirect Arabic phrasing where Opus keeps up but GPT-4o hallucinates an interpretation.

Gemini 3 Pro: the context monster

The 1M token context window is not a gimmick — for a merchant with a 300-page product manual plus 50K past conversations, Gemini 3 Pro is the only model that can hold the whole thing in one prompt. That eliminates a lot of retrieval error because the model can see everything.

Vision on Gemini 3 Pro is genuinely strong for Arabic OCR — handwritten notes, stylized store signage, and layouts with mixed LTR/RTL text are where it pulls ahead. The drawback is the per-token cost at 1M context, so we use Gemini where the context actually justifies it rather than as a default.

Qwen VL + Grok: the specialists

Qwen VL (Alibaba) earns a spot for merchants who want vision at scale without the frontier price tag. Arabic text recognition is competitive, and the pricing lets high-volume bots (e.g., e-commerce customer service with 10K+ image questions a month) stay in budget.

Grok has a devoted following among technical merchants — SaaS support bots, code-adjacent use cases. For a florist in Riyadh, it is overkill. For a developer tools company running bilingual English-Arabic docs support, it is a legitimate contender.

How we actually route traffic in production

Merchants can pick a single model per bot. Power users set up routing rules:

Default model

GPT-4o for everyday messages. Fast, vision-capable, solid Arabic.

Complaint detection → Opus

If intent-detection flags a complaint, re-run the response through Claude Opus 4.5 before sending. Higher cost, but the politeness recovery rate is measurably better.

Image-heavy flows → GPT-4o or Gemini

Depending on language: English/mixed → GPT-4o; Arabic text in images → Gemini 3 Pro.

High-value orders → Opus

For orders above a merchant-configurable threshold, the reasoner gets the reply.

Circuit breaker

If the chosen provider returns 3 failures in a row, Thikaa automatically fails over to the second-choice model for 60 seconds.
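The routing rules above can be sketched as a single dispatch function plus a failure counter. Model identifiers, the rule order, and the choice of Claude Sonnet 4 as the fallback are illustrative assumptions; only the 3-failure / 60-second breaker behavior comes from the text.

```python
# Hypothetical routing sketch for the rules above; not Thikaa's production code.
FAILURE_LIMIT = 3   # consecutive provider failures before the breaker opens
COOLDOWN_S = 60.0   # seconds to route to the second-choice model

class CircuitBreaker:
    def __init__(self) -> None:
        self.failures = 0
        self.open_until = 0.0

    def record(self, ok: bool, now: float) -> None:
        """Track consecutive failures; open the breaker on the third in a row."""
        if ok:
            self.failures = 0
        else:
            self.failures += 1
            if self.failures >= FAILURE_LIMIT:
                self.open_until = now + COOLDOWN_S
                self.failures = 0

    def is_open(self, now: float) -> bool:
        return now < self.open_until

def pick_model(intent: str, order_value: float, has_image: bool,
               image_lang: str, threshold: float,
               breaker: CircuitBreaker, now: float) -> str:
    primary = "gpt-4o"  # default workhorse for everyday messages
    if intent == "complaint" or order_value >= threshold:
        primary = "claude-opus-4.5"   # high-stakes traffic goes to the reasoner
    elif has_image and image_lang == "ar":
        primary = "gemini-3-pro"      # Arabic text in images
    # While the breaker is open, fail over to an assumed second-choice model.
    return "claude-sonnet-4" if breaker.is_open(now) else primary
```

The point of keeping the breaker separate from the routing logic is that the same failover applies whichever rule fired first.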

The point: frontier models are commodities now. The value is in the routing, the guardrails, and the conversation memory — not in picking one "winner."

FAQ

Can I switch models without rebuilding my bot?

Yes. Model choice is per bot, in the bot settings. Flipping Claude Opus ↔ GPT-4o is a dropdown — the prompts, knowledge base, and flows stay the same.

Does Thikaa mark up model pricing?

No. We pass through provider pricing at cost for API usage and bill platform access at a flat monthly fee ($10–$40). Every new tenant gets $5 starter AI credit.

What happens if my chosen model has an outage?

Thikaa's circuit breaker auto-fails over to the secondary model for 60 seconds after 3 consecutive failures. You can configure the fallback order.

Can I bring my own API key?

Yes. The Enterprise plan supports bring-your-own-key (BYOK) for Claude, OpenAI, Gemini, and Qwen: the conversation still runs through Thikaa's orchestration, but billing goes direct to the provider.

Which model is cheapest per conversation?

GPT-4o mini is usually cheapest for short Q&A. Qwen VL is competitive at high volume. Claude Opus 4.5 is the most expensive but resolves complex issues in fewer turns, so total cost can be lower.
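The "expensive model can be cheaper" claim is just per-turn price times turns-to-resolution. The per-turn prices below are illustrative placeholders, not real provider pricing:

```python
# Illustrative per-turn prices only, not real provider rates.
def total_cost(usd_per_turn: float, turns_to_resolution: int) -> float:
    """Cheapest-per-turn is not always cheapest-per-issue."""
    return usd_per_turn * turns_to_resolution

mini_cost = total_cost(0.0005, 12)  # cheap model, many clarifying turns
opus_cost = total_cost(0.0025, 2)   # pricier model, resolves quickly
```

With these placeholder numbers the cheap model costs $0.006 per resolved issue and the pricier one $0.005, which is the shape of the trade-off described above.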

Try every model on your own bot

Thikaa's $5 starter credit covers ~10,000 GPT-4o turns or ~2,000 Claude Opus turns. Test the models on your real customer conversations before committing.

Start Free Trial