Why voice notes dominate MENA WhatsApp
Walk through any market in Riyadh, Kuwait City, or Manama and watch people message each other. You will hear them — literally — because roughly 40% of WhatsApp messages in the Gulf arrive as voice notes rather than text. That number climbs for older customers, drivers behind the wheel, and anyone whose thumbs are busier than their mouth.
Arabic is an expressive spoken language. Typing "شلونكم اليوم، عندي استفسار بخصوص الطلب ٢٣٤٥" takes effort. Holding down the mic and saying it takes two seconds. Customers know which feels natural. Most platforms still pretend those voice notes do not exist — they reply with "please type your question" and lose the sale.
The real cost of ignoring voice
Here is what happens in a typical MENA business that runs a text-only chatbot:
- A customer at 9:47 PM sends a 25-second voice note asking about stock of a specific model.
- The bot replies: "Sorry, I cannot process voice messages. Please type your question."
- The customer does not type. They go to a competitor whose human agent listens to the voice note the next morning.
- The business counts the lost sale as "no response" — never knowing the intent was already there.
Across a 500-customer-per-day operation, this pattern typically costs between $6,000 and $14,000 in lost revenue per month, based on aggregated conversion data from Thikaa merchants in the Gulf. The customers are not cheap — they are just speaking.
How Thikaa hears voice
Thikaa runs the full Whisper large-v3 model on its own servers. Not the quantized turbo variant that every cloud API ships by default — we tried that first and it butchered Kuwaiti pronunciation to the point of comedy ("أبي أحجز موعد" became something about needing stones). We run the full model, on GPU, cache every transcription for 7 days, and pass the text into the selected LLM.
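The seven-day transcription cache can be pictured as a small time-to-live store keyed by a hash of the audio bytes. The sketch below is illustrative only: the class name `TranscriptCache`, the in-memory dictionary, and the injected `transcribe` callable are assumptions, not a description of Thikaa's actual storage layer.

```python
import hashlib
import time
from typing import Callable, Dict, Tuple

SEVEN_DAYS_SECONDS = 7 * 24 * 60 * 60  # retention period stated in the article


class TranscriptCache:
    """In-memory TTL cache keyed by the SHA-256 of the audio (hypothetical)."""

    def __init__(self, ttl_seconds: int = SEVEN_DAYS_SECONDS,
                 clock: Callable[[], float] = time.time) -> None:
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested without waiting
        self._entries: Dict[str, Tuple[float, str]] = {}

    def get_or_transcribe(self, audio: bytes,
                          transcribe: Callable[[bytes], str]) -> str:
        key = hashlib.sha256(audio).hexdigest()
        now = self._clock()
        hit = self._entries.get(key)
        if hit is not None and now - hit[0] < self._ttl:
            return hit[1]  # fresh entry: skip the expensive model call
        text = transcribe(audio)  # stand-in for the real speech-to-text call
        self._entries[key] = (now, text)
        return text
```

Keying on the audio hash rather than the message ID means a forwarded or re-sent note is transcribed only once within the retention window.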
The important part for merchants: there is no per-minute cost. Cloud transcription APIs charge per second; Thikaa bundles it into the platform fee. A shop that gets 3,000 voice notes a month pays zero for voice — and pays zero for the next 3,000.
The voice pipeline
- Voice note arrives: WhatsApp webhook delivers the .ogg audio to Thikaa.
- Dialect-aware transcription: Whisper large-v3 runs with Arabic language hint + optional dialect bias. Cached for 7 days.
- LLM reasoning: The transcript flows into the selected model (Claude Opus 4.5, GPT-4o, or Gemini 3 Pro) with your business context (catalog, hours, policies).
- Guardrails: Anti-hallucination checks strip any invented prices, tracking numbers, or discount codes.
- Reply: Bot answers in the customer's dialect, in under 30 seconds end to end.
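Of the five steps, the guardrails stage is the least self-explanatory. One way to picture it, as a minimal sketch and not Thikaa's actual checks: compare every price-like token in the draft reply against the amounts known from the merchant's catalog, and replace anything that cannot be verified. The function name, the regular expression, the currency list, and the placeholder text are all hypothetical.

```python
import re
from typing import Set

# Matches amounts such as "12", "12.5" or "12.500" followed by a currency code.
# The currency list is an illustrative assumption.
PRICE_PATTERN = re.compile(r"\b(\d+(?:\.\d+)?)\s?(KWD|SAR|AED|BHD|USD)\b")


def strip_unverified_prices(reply: str, known_prices: Set[str],
                            placeholder: str = "[price to be confirmed]") -> str:
    """Replace any price whose amount is not in the catalog-derived set."""

    def check(match: "re.Match[str]") -> str:
        amount = match.group(1)
        # Keep the original text only when the amount is a known catalog value.
        return match.group(0) if amount in known_prices else placeholder

    return PRICE_PATTERN.sub(check, reply)
```

The same allow-list idea would extend to tracking numbers and discount codes: anything the model emits that does not match a value retrieved from a real system is removed before the reply is sent.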
Dialects that actually matter
Modern Standard Arabic (فصحى) is the easy case. Real customers do not speak فصحى. They speak:
The default Whisper model handles all four, but Gulf voice notes hit hardest because Khaleeji reduces vowels heavily and drops final consonants — which trips up smaller models. Running the full large-v3 variant closes that gap.
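With the open-source `openai-whisper` package, a language hint and a dialect bias can be expressed through the `language` and `initial_prompt` arguments of `transcribe`. The sketch below assumes that package. The dialect prompt strings and helper names are illustrative, and the article does not say how Thikaa implements its own dialect bias.

```python
from typing import Any, Dict, Optional

# Hypothetical priming phrases: a short prompt in the target dialect nudges
# the decoder toward that dialect's vocabulary and spelling.
DIALECT_PROMPTS: Dict[str, str] = {
    "gulf": "شلونكم، أبي أحجز موعد",
    "egyptian": "إزيك، عايز أحجز ميعاد",
}


def build_transcribe_options(dialect: Optional[str] = None) -> Dict[str, Any]:
    """Build keyword arguments for whisper's transcribe() call."""
    options: Dict[str, Any] = {"language": "ar"}  # Arabic language hint
    prompt = DIALECT_PROMPTS.get(dialect or "")
    if prompt:
        options["initial_prompt"] = prompt  # optional dialect bias
    return options


def transcribe_voice_note(path: str, dialect: Optional[str] = None) -> str:
    """Transcribe one audio file with the full large-v3 checkpoint."""
    import whisper  # imported lazily: requires `pip install openai-whisper`

    # A real service would load the model once at startup, not per call.
    model = whisper.load_model("large-v3")
    result = model.transcribe(path, **build_transcribe_options(dialect))
    return result["text"]
```

Calling `transcribe_voice_note("note.ogg", dialect="gulf")` would run the full model with the Gulf prompt. An unknown dialect name falls back to the plain Arabic hint.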
What happens after transcription
Transcription is not the end — it is the cheap part. What matters is what the LLM does with it. In Thikaa the transcript flows straight into the same pipeline as typed text: retrieval (RAG over your catalog and FAQ), intent detection, tool calls (order lookup, appointment booking), and guardrails. The bot does not care whether the input came from the keyboard or the mic.
Concretely: a voice note saying "أبي أستفسر عن فرع البدع، كم يبعد عن السالمية" gets the same branch-lookup + driving-distance answer as if the customer typed it in English on the web widget.
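The "keyboard or mic" point can be made concrete with a small normalization step: voice notes are transcribed first, and everything downstream sees plain text. This is a sketch under assumptions. `InboundMessage`, `to_text`, and the injected `transcribe` callable are hypothetical names, not Thikaa's API.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class InboundMessage:
    """One customer message; exactly one of text or audio is expected."""
    text: Optional[str] = None
    audio: Optional[bytes] = None


def to_text(message: InboundMessage, transcribe: Callable[[bytes], str]) -> str:
    """Return the text that retrieval, intent detection and tools should see."""
    if message.text is not None:
        return message.text.strip()
    if message.audio is not None:
        # Voice input is converted once, here, and never special-cased again.
        return transcribe(message.audio).strip()
    raise ValueError("message has neither text nor audio")
```

Because the rest of the pipeline only ever receives the return value of `to_text`, adding a new input channel means adding one branch here and nothing else.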
Case study: a Kuwaiti home-services merchant
One of our merchants runs AC maintenance bookings across Kuwait. Before voice transcription: 40% of incoming WhatsApp messages were voice notes, and the team manually listened to each one at the start of every shift. Average response time: 4 hours. Lost-to-competitor rate: measurable but impossible to see directly.
After turning on voice understanding: average first-reply time dropped to 28 seconds. Same-day bookings rose by 2.3×. The team still listens to ~5% of voice notes (flagged by the bot for ambiguity), but the bulk is now fully automated, in the customer's own dialect.
FAQ
Does Thikaa charge per voice minute?
No. Voice transcription is included in all paid plans starting at $10/month. Whisper large-v3 runs on Thikaa's infrastructure — there is no OpenAI pass-through bill.
Which dialects are supported?
All major Arabic dialects: Gulf (Khaleeji), Egyptian, Levantine (Lebanese, Syrian, Jordanian, Palestinian), Maghrebi (Moroccan, Tunisian, Algerian), Iraqi, plus Modern Standard Arabic. English voice notes are handled too.
How long can a voice note be?
Currently up to 3 minutes (180 seconds) per note. Longer notes are rare and typically get routed to a human agent.
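The length rule amounts to a one-line routing decision. A minimal sketch, assuming the 180-second limit is inclusive and using hypothetical route labels:

```python
MAX_VOICE_NOTE_SECONDS = 180  # limit stated in this FAQ answer


def route_voice_note(duration_seconds: float) -> str:
    """Return "bot" for notes within the limit, "human" for longer ones."""
    if duration_seconds < 0:
        raise ValueError("duration cannot be negative")
    return "bot" if duration_seconds <= MAX_VOICE_NOTE_SECONDS else "human"
```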
Does it work with the official WhatsApp Business API?
Yes — both the official Cloud API and WhatsApp Direct (QR) paths deliver voice notes to Thikaa, and both run through the same transcription pipeline.
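On the official Cloud API path, voice notes appear in the webhook body as messages of type `audio`. The sketch below collects their media IDs. It assumes the `entry` → `changes` → `value` → `messages` payload layout from Meta's webhook documentation, and the function name is hypothetical. The QR path is not shown because the article does not describe its payload.

```python
from typing import Any, Dict, List


def extract_audio_media_ids(payload: Dict[str, Any]) -> List[str]:
    """Collect the media IDs of all audio messages in one webhook payload."""
    ids: List[str] = []
    for entry in payload.get("entry", []):
        for change in entry.get("changes", []):
            for message in change.get("value", {}).get("messages", []):
                if message.get("type") == "audio":
                    media_id = message.get("audio", {}).get("id")
                    if media_id:
                        ids.append(media_id)
    return ids
```

Each returned ID would then be exchanged for a download URL before the audio is handed to transcription.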
Can I review transcriptions?
Yes. Every transcription is stored against the conversation and visible to agents — useful for quality assurance and for training custom Q&A pairs from the actual language customers use.
Start answering voice notes today
Try Thikaa free for 14 days. Connect WhatsApp, upload your catalog, and your bot hears every voice note in Arabic from the first minute.