Why customers send photos instead of typing
There is a reason your WhatsApp inbox is full of photos: typing the details of a broken product, a long receipt number, or a specific item color takes far longer than snapping a picture. Customers have learned that pointing a camera is the fastest way to communicate, and they expect the business on the other end to understand.
Until recently, "understanding" meant a human agent opening the image and reading it. That workflow is now roughly 10× cheaper and 50× faster with frontier vision models — but only if they are wired into your inbox, which most platforms still have not done properly.
What vision models can actually read
Today's frontier vision models — GPT-4o, Claude Opus 4.5, Gemini 3 Pro, Qwen VL — read images well enough to run production support flows. Specifically:
Product recognition
Identify a specific SKU from a photo, including color, material, and packaging. Match against your catalog to check stock and price.
Receipt / invoice parsing
Extract date, merchant, line items, totals, and tax. Cross-reference against your order database to identify which order the customer is asking about.
Screenshot triage
Read error messages, UI states, and pricing pages from app or website screenshots. Detect which screen the customer is stuck on.
Document / ID verification
Read government IDs, trade licenses, visa stamps. (Deploy only where compliance allows — PII handling applies.)
Handwriting (Arabic + English)
Parse handwritten notes, including cursive Arabic and mixed-script text. Useful for prescription refills, handwritten order forms, and tailor measurements.
Comparison shopping
Customer sends a competitor's price screenshot — bot identifies the item, checks your catalog, and either matches the price (if policy allows) or explains the difference.
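Under the hood, each of the capabilities above starts the same way: the customer's image and their question travel to the model as one multimodal message. The helper below only builds that payload (no API call is made); the content shape follows the OpenAI-style chat schema, and other providers use similar but not identical field names.

```python
def build_vision_message(image_url: str, question: str, history: list) -> list:
    """Append a combined image + text turn to an existing conversation,
    in OpenAI-style multimodal chat format (payload sketch only)."""
    return history + [{
        "role": "user",
        "content": [
            {"type": "text", "text": question},
            {"type": "image_url", "image_url": {"url": image_url}},
        ],
    }]
```

The resulting list can be passed as the `messages` argument of a vision-capable chat completion call, with prior turns preserved so the model sees the full conversation alongside the image.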
Seven real support cases image AI solves
A broken appliance photo
Customer sends a photo of a damaged product. Bot identifies the model, pulls warranty status from the order DB, and either opens a replacement ticket or routes to the warranty team — in under 30 seconds.
A faded receipt
"I bought this two weeks ago — can I return it?" with a phone photo of a crumpled receipt. Bot extracts the order ID, checks the 30-day window, and either starts the return flow or explains why it is out of policy.
A competitor's price tag
Customer screenshots another store's price. Bot reads the SKU, checks your catalog, and either price-matches per your policy or explains the feature difference honestly.
A delivery address screenshot
Customer sends a map screenshot instead of typing an address. Bot extracts the location, validates against your delivery zones, and confirms shipping eligibility.
A screenshot of your own website
"I saw this on your site but cannot find it" — bot reads the product name from the screenshot, fetches the current product page link, and sends it back with stock status.
A handwritten tailoring measurement
Tailor customer photographs a handwritten measurement sheet. Bot extracts numbers, fills the order form, and confirms totals before the customer pays.
An Arabic-language error message
App screenshot shows an Arabic error string. Bot reads it, matches against known error codes, and serves the exact fix — in the customer's dialect.
Model selection: who sees best for your use case
Vision quality is not a single number. Different models win different jobs:
GPT-4o (OpenAI)
Best general-purpose vision + strong English OCR. Fast and well-priced. Default for most support flows.
Claude Opus 4.5 (Anthropic)
Best at reasoning about what is in an image — great for "why is this broken?" or "what is the customer actually asking?". A 200K-token context window means you can attach the image plus the full conversation history and full catalog.
Gemini 3 Pro (Google)
Strong on Arabic OCR, handwriting, and diagrams. Up to 1M token context — useful when you want to dump a whole product manual alongside the customer's image.
Qwen VL (Alibaba)
Competitive on Arabic text and Asian languages; good price-performance for high-volume flows.
Thikaa lets you pick a different model per bot. A merchant might route product-photo questions to GPT-4o and reasoning-heavy warranty cases to Claude Opus 4.5 — same inbox, different brains under the hood.
Privacy, cost, and where vision still fails
Privacy: images uploaded by customers contain PII. Thikaa stores them tenant-scoped, applies the same retention policy as text messages, and supports explicit consent flows for ID documents. Vision API calls are made to the chosen provider — the provider's data-use policy applies. For sensitive verticals (healthcare, finance) pick a model whose policy fits.
Cost: vision adds roughly $0.005–$0.02 per image at current provider rates. Thikaa passes this through at cost with no markup. The $5 starter credit every new tenant gets covers roughly 250–1,000 image lookups — plenty to test the workflow.
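The quoted range is straightforward division of the starter credit by the per-image rate, which you can sanity-check yourself:

```python
def lookups_per_credit(credit_usd: float, cost_per_image_usd: float) -> int:
    """How many image lookups a credit balance covers at a given per-image rate."""
    return int(credit_usd / cost_per_image_usd)
```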
Where vision fails: tiny receipts photographed at an angle, ultra-low-light photos, and badly cropped screenshots still confuse the best models. Thikaa flags low-confidence results so a human can step in before a wrong answer ships.
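A guardrail like that can be as simple as a threshold check on the extraction result. The sketch below is an assumption about how such a check could look — the confidence field, the 0.8 cutoff, and the `order_id` requirement are all hypothetical, not Thikaa's internal implementation.

```python
CONFIDENCE_THRESHOLD = 0.8  # assumed cutoff; tune per use case

def needs_human(extraction: dict) -> bool:
    """Escalate to a human when the vision result's self-reported
    confidence is low or a required field came back empty.
    (Both field names here are hypothetical.)"""
    low_confidence = extraction.get("confidence", 0.0) < CONFIDENCE_THRESHOLD
    missing_field = not extraction.get("order_id")
    return low_confidence or missing_field
```

Anything the check flags waits in the human queue instead of being auto-answered, so a blurry receipt produces a handoff rather than a wrong refund decision.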
FAQ
Do I need to build anything to enable image understanding?
No. If your bot is running on a vision-capable model (GPT-4o, Claude Opus 4.5, Gemini 3 Pro, Qwen VL) and a customer sends an image, Thikaa passes the image plus the conversation context to the model automatically.
What image formats are supported?
JPEG, PNG, WebP, and GIF (first frame). HEIC from iPhones is auto-converted. WhatsApp compresses images — the bot sees the same version the agent would see.
How do I control which model handles images?
From the bot settings, choose the AI provider per bot. Vision-capable models are labeled. You can override the default for specific flows (e.g., warranty flow uses Claude, product lookup uses GPT-4o).
Can the bot answer about multiple images in one message?
Yes. WhatsApp and Instagram albums are handled as multi-image messages. The model sees all images together and can reason across them ("is this the same product as the one you bought?").
What about video messages?
Video support is in beta — we transcribe the audio track and extract key frames. Fully general video understanding is coming later in 2026.
Let your bot see images today
Start a 14-day trial, pick a vision-capable model, and customers can send you product photos, receipts, and screenshots from the first minute.
Start Free Trial