Fifteen AI Experiments. Real Products. Real Numbers.
Every experiment on this page ran on production AI integrations across six live CipherBitz products. Latency measured. Accuracy benchmarked. Failure modes documented. No demo environments. No synthetic benchmarks. No vendor-provided numbers.
Fifteen Experiments. Zero Vendor Benchmarks.
Every result below came from our own production integrations. Latency numbers are P50 unless stated. Accuracy is measured against a manually labelled evaluation set, not model self-reporting.
LLM Prompt Caching — API Call Reduction
Structured JSON Extraction from Natural Language
Grounded City AI vs Base Gemini — Accuracy
Occasion-Based Outfit Recommendation Accuracy
Edge Inference for Query Classification
Gemini Auto-Tagging for Business Listings
Structured Resume Parsing — Field Extraction
AI Job-Candidate Match Scoring
n8n + Gemini — AI Node Reliability in Prod
AI SEO Meta Description — CTR vs Manual
Gemini 2.0 Flash vs 1.5 Flash — Production Swap
pgvector Embeddings for Product Search
AI Auto-Generated Business Descriptions
Voice-to-AI Search on Mobile — MNCJob
Gemini Nano — On-Device Inference
Every Experiment Follows The Same Five Rules.
AI experiments fail to produce useful knowledge when they are undisciplined. These five rules are the design constraints every experiment here follows — without exception.
"An AI evaluation that uses synthetic data produces synthetic insights. We measure on production traffic because production traffic is where the model actually fails."
The hypothesis has a number in it.
A hypothesis that reads 'AI will improve search results' is not a hypothesis — it is an aspiration. Every hypothesis here specifies a metric, a threshold, and a measurement period before any experiment code is written.
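As a minimal sketch of what "a hypothesis with a number in it" means in practice (field names and values here are illustrative, not an internal CipherBitz schema):

```python
# A hypothesis is declared as data before any experiment code is written.
# Every field below is illustrative, not a real internal record.
hypothesis = {
    "claim": "Edge classification cuts P50 routing latency below 120 ms",
    "metric": "p50_latency_ms",    # the single number being tested
    "threshold": 120,              # deploy only if the metric clears this
    "direction": "lower_is_better",
    "measurement_days": 14,        # fixed window, declared upfront
    "model": "gemini-1.5-flash",   # chosen before the experiment, not after
}

def is_valid_hypothesis(h: dict) -> bool:
    """A hypothesis without a metric, a threshold, and a window is an aspiration."""
    return all(k in h for k in ("metric", "threshold", "measurement_days"))
```

Declaring the threshold and window as data also makes rule three mechanical: when the window closes and the metric misses the threshold, the experiment is closed.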
Evaluation uses production data.
Synthetic benchmarks tell you how the model performs on synthetic data. We evaluate on anonymised production queries and production content — because model performance on real user behaviour is consistently different from performance on constructed evaluation sets.
Failure is planned for — not surprised by.
Each experiment defines upfront what failure looks like: the metric threshold below which the approach will not be deployed. When that threshold is not met, the experiment is closed formally — not kept running until the numbers eventually look better.
The model is chosen before the experiment — not after.
We do not run a prompt against all models and then choose the one that performed best to report. Model choice is declared in the hypothesis — and if we are comparing models, the comparison protocol is specified before running any evaluation.
Every experiment answers a real product question.
We do not run AI experiments for research interest alone. Each experiment is attached to a real product decision: should we deploy this feature, should we switch models, should we change the architecture. An experiment that does not inform a decision is not an experiment — it is entertainment.
How We Measure What Matters.
Accuracy is subjective without a baseline. Latency is meaningless without a percentile. Here is how we measure the two metrics that define production viability.
Measuring Accuracy
Every accuracy metric here is computed against a manually labelled ground-truth dataset. We do not use LLMs to judge LLM outputs ("LLM-as-a-judge") for primary metrics, because it introduces correlated biases.
- ✓ Min 200 samples per evaluation
- ✓ Drawn from the 90th percentile of query complexity
- ✓ Scored by human product owners
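A minimal sketch of how accuracy against a labelled set can be computed; the sample-size guard mirrors the 200-sample rule above, and exact match stands in for whatever scoring rubric a given experiment uses:

```python
def accuracy(predictions: list[str], labels: list[str], min_samples: int = 200) -> float:
    """Exact-match accuracy against a human-labelled ground truth.
    Refuses to report a number on an undersized evaluation set."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must be aligned")
    if len(labels) < min_samples:
        raise ValueError(f"need >= {min_samples} labelled samples, got {len(labels)}")
    correct = sum(p == l for p, l in zip(predictions, labels))
    return correct / len(labels)
```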
Measuring Latency
"Time to First Token" (TTFT) only matters for streaming UI. For programmatic integrations, total round-trip time (P50 and P95) dictates whether a feature blocks the main thread.
- ✓ Measured from client request to complete payload
- ✓ Includes network overhead and JSON parsing
- ✓ P50 used as baseline, P95 for circuit breakers
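The three rules above can be sketched in a few lines; the 600 ms breaker budget is an assumed value for illustration, not a product-specific constant:

```python
import statistics

def latency_percentiles(samples_ms: list[float]) -> tuple[float, float]:
    """P50 (baseline) and P95 (circuit-breaker input) from round-trip samples.
    Samples are measured client-side, request to fully parsed payload."""
    if not samples_ms:
        raise ValueError("no samples")
    # quantiles(n=100) yields the 1st..99th percentiles at indices 0..98
    q = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return q[49], q[94]

def breaker_open(p95_ms: float, budget_ms: float = 600.0) -> bool:
    """Trip the circuit breaker when tail latency blows the budget."""
    return p95_ms > budget_ms
```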
The Latency Budget
A 200ms API call becomes a 600ms user delay once orchestration, routing, and database lookups are factored in. This is our standard architecture latency map.
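A sketch of such a latency map; the hop names and millisecond figures are illustrative of how a 200 ms API call compounds to 600 ms, not measurements from a specific product:

```python
# Illustrative latency budget: the shape of the map we maintain,
# with hypothetical per-hop numbers.
LATENCY_BUDGET_MS = {
    "edge_routing": 40,
    "orchestration": 110,
    "db_context_lookup": 150,
    "llm_api_call": 200,      # the number most teams quote in isolation
    "json_validation": 60,
    "response_assembly": 40,
}

def total_user_delay(budget: dict[str, int]) -> int:
    """What the user actually waits for: every hop, not one API call."""
    return sum(budget.values())
```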
Why We Default To Google Gemini.
We are not married to any one vendor, but pragmatism dictates a default stack. After testing OpenAI, Anthropic, and Google DeepMind extensively, here is how we position them for production use cases.
Gemini 1.5 Flash
Powers 80% of our production integrations. Unbeatable latency-to-intelligence ratio for structurally constrained tasks. Massive context window (1M+) enables 'in-context lookup' architectures that replace complex RAG.
Gemini 1.5 Pro
Reserved for high-complexity analytical tasks where Flash falters. Slower and more expensive, but achieves 15-20% higher accuracy on multi-step reasoning and complex code generation tasks.
Claude 3.5 Sonnet
Our benchmark for code generation and deeply nuanced text writing. We use Claude internally for development, but rarely in production for clients due to Gemini's superior speed/cost for standard JSON-in/JSON-out tasks.
GPT-4o Mini
A highly capable model with excellent latency, but in our testing, Gemini Flash's 1M context window and strict JSON adherence make it our default choice at the same price tier.
Architectural Decisions Driven by Data.
Experiments are only useful if they change how you build. Here are the concrete architectural shifts we've made based on our lab results.
Long Context > Complex RAG
Gemini Flash's 1M context window makes 'in-context lookup' faster and more accurate than chunking and vectorized retrieval for medium-sized corpora.
Impact: Removed pgvector dependency for 3 internal tools.
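A minimal sketch of the 'in-context lookup' pattern, assuming a character-count guard as a rough proxy for the token budget (function and parameter names are our illustration):

```python
def build_in_context_prompt(question: str, documents: list[str],
                            max_chars: int = 3_000_000) -> str:
    """'In-context lookup': place the whole medium-sized corpus in the prompt
    and let the long-context model find the answer -- no chunking, no vectors.
    A 1M-token window comfortably fits a few MB of text."""
    corpus = "\n\n---\n\n".join(documents)
    if len(corpus) > max_chars:
        raise ValueError("corpus too large for in-context lookup; fall back to retrieval")
    return f"Answer using ONLY the documents below.\n\n{corpus}\n\nQuestion: {question}"
```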
Edge Classification First
Running a fast classifier at the edge (Cloudflare Workers AI) to route queries before hitting heavy LLM APIs reduces perceived latency by ~200ms.
Impact: Deployed on AskBLR to separate conversational intents.
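The routing idea can be sketched as follows. Real deployments use a small edge model for classification; keyword rules stand in here to keep the sketch self-contained, and the intent names are hypothetical:

```python
# Intents cheap enough to answer without the heavy LLM round trip.
FAST_PATH_INTENTS = {"greeting", "hours", "location"}

def classify_intent(query: str) -> str:
    """Stand-in for a lightweight edge classifier (e.g. a Workers AI text
    model). Keyword rules here are a placeholder, not the real classifier."""
    q = query.lower()
    if any(w in q for w in ("hello", "hey", "good morning")):
        return "greeting"
    if "open" in q or "hours" in q:
        return "hours"
    if "where" in q or "address" in q:
        return "location"
    return "conversational"

def route(query: str) -> str:
    """Only conversational intents pay the full LLM API latency."""
    return "fast_path" if classify_intent(query) in FAST_PATH_INTENTS else "llm_api"
```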
Strict JSON Validation
LLMs fail silently when generating structure. We enforce rigid JSON schema validation middleware on every API response before it hits the application tier.
Impact: Reduced downstream parsing errors from 4.2% to 0.01%.
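A stdlib-only sketch of the contract such middleware enforces; production systems would use a full JSON Schema validator, and the schema fields here are hypothetical:

```python
import json

# Illustrative schema: required fields and their expected Python types.
SCHEMA = {"title": str, "tags": list, "confidence": float}

def validate_llm_json(raw: str, schema: dict = SCHEMA) -> dict:
    """Reject any model output that is not exactly the structure we asked for.
    Silent structural drift becomes a loud, typed failure at the boundary."""
    payload = json.loads(raw)  # raises ValueError on malformed JSON
    for field, ftype in schema.items():
        if field not in payload:
            raise ValueError(f"missing field: {field}")
        if not isinstance(payload[field], ftype):
            raise TypeError(f"{field}: expected {ftype.__name__}")
    return payload
```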
What We Will Not Use AI For.
Understanding a technology's capabilities requires understanding its boundaries. These are the hard lines where we stop using AI and rely exclusively on deterministic engineering.
No Unsupervised Code Deployment
AI writes roughly 40% of our boilerplate and scaffolding. It deploys exactly 0% of our production code. Every line is reviewed by a Senior Engineer before merging.
No Generative UI
LLMs generating React components on the fly is an impressive demo that fails accessibility, performance, and brand consistency audits in production.
No Database Mutations
AI models can read from the database to contextualise answers. They are never granted UPDATE or DELETE permissions. State changes require deterministic code.
No Black-Box Prompts
A prompt is source code. It is version controlled, reviewed, and tested against a regression suite before deployment. Tuning is engineering, not magic.
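A sketch of what "a prompt is source code" can look like mechanically; the version tag, template, and structural check are illustrative (model-output regressions would run separately against the labelled evaluation set):

```python
# A prompt treated as source code: versioned and checked in CI before deploy.
PROMPT_VERSION = "tagger-v3.2"  # hypothetical version tag
PROMPT_TEMPLATE = (
    "Extract tags for this business listing.\n"
    'Return ONLY JSON: {{"tags": [...]}}.\n'
    "Listing: {listing}"
)

def render_prompt(listing: str) -> str:
    return PROMPT_TEMPLATE.format(listing=listing)

def regression_check() -> bool:
    """Structural regression suite: a deployed prompt must still carry its
    output contract and its input slot after any 'tuning' commit."""
    rendered = render_prompt("Corner bakery, open 7am")
    return "ONLY JSON" in PROMPT_TEMPLATE and "Corner bakery" in rendered
```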
Stop Experimenting. Start Integrating.
You have seen our data. If you have a specific AI integration challenge or want to test a hypothesis in your own product, let's talk engineering.
Or email our engineering lead directly: engineering@cipherbitz.com