GEMINI · N8N · PRODUCTION AI INTEGRATIONS

Fifteen AI Experiments.
Real Products.
Real Numbers.

Every experiment on this page ran on production AI integrations across six live CipherBitz products. Latency measured. Accuracy benchmarked. Failure modes documented. No demo environments. No synthetic benchmarks. No vendor-provided numbers.

15
AI experiments documented
Gemini
Primary model — Flash + Pro
87.5%
Best accuracy achieved in production
40ms
Lowest edge inference latency
AI Experiment Monitor — CipherBitz Labs (live evaluation panel)

Per-task confidence on production traffic:
Structured extraction: 94.2% · Intent classification: 88.7% · Local fact recall: 43.1%
Local fact recall requires grounding context.

Token stream demo — gemini-1.5-flash
Sample query: "What are the best restaurants near MG Road, Bengaluru for a business lunch?"
Latency: 340ms

Response latency (ms): P50 312 · P95 548 · P99 694
⚡ Flash: avg 320ms · ₹0.002/1k
◈ Pro: avg 1,240ms · ₹0.012/1k

Model benchmark:
gemini-1.5-flash: 88.7% · ₹0.002/1k · currently active
gemini-1.5-pro: 94.2% · ₹0.012/1k · overkill for most tasks
gemini-2.0-flash: 91.4% · ₹0.003/1k · testing now
gemini-nano: 62.1% · ₹0.0004/1k · not sufficient
ALL AI EXPERIMENTS

Fifteen Experiments.
Zero Vendor Benchmarks.

Every result below came from our own production integrations. Latency numbers are P50 unless stated. Accuracy is measured against a manually labelled evaluation set, not model self-reporting.

PRODUCTION
GEMINI

LLM Prompt Caching — API Call Reduction

gemini-1.5-flash
HYPOTHESIS
Caching recurring prompt patterns reduces Gemini API volume by 40%+ without degrading response quality.
KEY RESULT
Hit rate below the 40% target
Meaningful cost reduction nonetheless — caching deployed in production
P50: 0ms (cached) · Tokens: 0 (cached) · AskBLR
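
The pattern behind this experiment is simple enough to sketch. Below is a minimal, illustrative version, assuming an in-memory Map and a callGemini() helper (both hypothetical, not our production code); a production cache would more likely live in Redis or a KV store keyed the same way.

```ts
import { createHash } from "node:crypto";

const cache = new Map<string, string>();

// Normalise the prompt so trivially different phrasings hash identically.
function cacheKey(prompt: string): string {
  const normalised = prompt.trim().toLowerCase().replace(/\s+/g, " ");
  return createHash("sha256").update(normalised).digest("hex");
}

async function cachedCompletion(
  prompt: string,
  callGemini: (p: string) => Promise<string> // hypothetical API wrapper
): Promise<string> {
  const key = cacheKey(prompt);
  const hit = cache.get(key);
  if (hit !== undefined) return hit; // cache hit: ~0ms, 0 tokens billed

  const response = await callGemini(prompt); // full API round trip
  cache.set(key, response);
  return response;
}
```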
PRODUCTION
GEMINI

Structured JSON Extraction from Natural Language

gemini-1.5-flash
HYPOTHESIS
Gemini Flash with a strict JSON schema in the system prompt extracts structured data from freeform text at 94%+ accuracy.
KEY RESULT
94.2% accuracy
Exceeds hypothesis — now the standard approach for all extraction tasks
P50: 280ms · Avg 340 tokens · FreeBill, NammaHubballi
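
For readers who want the mechanics: the sketch below shows schema-constrained extraction using Google's @google/generative-ai Node SDK, which accepts a responseSchema in the generation config. The billing-style fields are illustrative, not our production schema.

```ts
import { GoogleGenerativeAI, SchemaType } from "@google/generative-ai";

const genAI = new GoogleGenerativeAI(process.env.GEMINI_API_KEY!);

// Constrain the model to emit JSON matching this schema, with no free text.
const model = genAI.getGenerativeModel({
  model: "gemini-1.5-flash",
  generationConfig: {
    responseMimeType: "application/json",
    responseSchema: {
      type: SchemaType.OBJECT,
      properties: {
        vendor: { type: SchemaType.STRING },
        amount: { type: SchemaType.NUMBER },
        dueDate: { type: SchemaType.STRING },
      },
      required: ["vendor", "amount"],
    },
  },
});

const result = await model.generateContent(
  "Extract billing fields from: 'Pay Acme Traders ₹4,500 by 12 March.'"
);
console.log(JSON.parse(result.response.text()));
```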
PRODUCTION
GEMINI

Grounded City AI vs Base Gemini — Accuracy

gemini-1.5-flash
HYPOTHESIS
Gemini grounded with curated city data answers hyperlocal questions at 90%+ accuracy vs base Gemini at ~40%.
KEY RESULT
90%+ vs 43%
Hypothesis confirmed — grounded model deployed in AskBLR and NammaHubballi
P50: 420ms · 1.2k tokens avg · AskBLR
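
Grounding here means injecting curated city data into the prompt context rather than fine-tuning. A minimal sketch, assuming a hypothetical lookupCityFacts() retrieval helper:

```ts
// Stubbed retrieval over the curated city dataset; the real lookup layer
// is not described in this document.
async function lookupCityFacts(_query: string): Promise<string[]> {
  return ["Example fact: MG Road metro station is on the Purple Line."];
}

// Build a grounded prompt: the model may answer only from supplied facts,
// which is what lifts hyperlocal accuracy from ~43% to the 90%+ range.
async function groundedPrompt(query: string): Promise<string> {
  const facts = await lookupCityFacts(query);
  return [
    "Answer using ONLY the context below. If it is insufficient, say so.",
    "CONTEXT:",
    ...facts.map((f, i) => `${i + 1}. ${f}`),
    `QUESTION: ${query}`,
  ].join("\n");
}
```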
PRODUCTION
GEMINI

Occasion-Based Outfit Recommendation Accuracy

gemini-1.5-flash
HYPOTHESIS
Gemini Flash generates contextually relevant outfit combinations from a catalog at 85%+ user acceptance rate vs random suggestion.
KEY RESULT
85%+ acceptance
Hypothesis met — 22% reduction in time-to-first-cart in an A/B test
P50: 390ms · 520 tokens avg · NextGirl
PRODUCTION
EDGE AI

Edge Inference for Query Classification

Cloudflare Workers AI
HYPOTHESIS
Lightweight classification at Cloudflare edge reduces latency for routing decisions by 200ms+ vs full Gemini API call.
KEY RESULT
280ms faster
Edge: 40ms vs Gemini: 320ms — deployed for intent routing in AskBLR
P50: 40ms · Classification only · AskBLR
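
A minimal Cloudflare Worker for this routing step looks roughly like the sketch below. The classifier model and response handling are assumptions for illustration (this page does not name our production classifier); the point is that env.AI.run executes at the edge before any Gemini call is made.

```ts
export interface Env {
  AI: Ai; // Workers AI binding, declared in wrangler.toml
}

export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    const { query } = await request.json<{ query: string }>();

    // Lightweight text classification at the edge (~40ms P50 in our tests).
    // Model choice here is illustrative.
    const result = await env.AI.run("@cf/huggingface/distilbert-sst-2-int8", {
      text: query,
    });

    // Route on the classifier output; only escalate to the Gemini API
    // (~320ms) when the query actually needs full generation.
    return Response.json({ classification: result });
  },
};
```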
PRODUCTION
GEMINI

Gemini Auto-Tagging for Business Listings

gemini-1.5-flash
HYPOTHESIS
Gemini Flash auto-categorises NammaHubballi listings at 90%+ accuracy vs manual tagging.
KEY RESULT
87% accuracy
3 points below the 90% target — deployed with a 12.5% human review queue for ambiguous cases
P50: 240ms · 180 tokens avg · NammaHubballi
PRODUCTION
GEMINI

Structured Resume Parsing — Field Extraction

gemini-1.5-flash
HYPOTHESIS
Gemini Flash extracts structured resume fields (experience, skills, education) from freeform PDF text at 91%+ accuracy.
KEY RESULT
91%+ accuracy
Hypothesis exceeded — parsing deployed in the MNCJob upload flow
P50: 560ms · 1.8k tokens avg · MNCJob
PRODUCTION
GEMINI

AI Job-Candidate Match Scoring

gemini-1.5-flash
HYPOTHESIS
Gemini scoring of job-resume semantic match produces rankings that correlate with recruiter decisions at r > 0.80.
KEY RESULT
Correlation under the r = 0.80 target
Sufficient for candidate shortlisting, not final ranking
P50: 890ms · 2.4k tokens avg · MNCJob
PRODUCTION
n8n AI

n8n + Gemini — AI Node Reliability in Prod

gemini-1.5-flash via n8n
HYPOTHESIS
n8n's Gemini AI node achieves 99%+ execution success rate in production pipelines with retry logic configured.
KEY RESULT
98.3% success rate
0.7% below the 99% target — root cause was rate limiting at peak, since mitigated
P50: 420ms · Variable tokens · CipherBitz Ops
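
n8n exposes retry settings on the node itself, so no custom code is required there, but the underlying pattern is worth sketching. The following is an illustrative exponential-backoff wrapper of the kind that mitigates peak-hour rate limiting; withRetry() is a hypothetical helper, not an n8n API.

```ts
async function withRetry<T>(
  fn: () => Promise<T>,
  maxAttempts = 4,
  baseDelayMs = 500
): Promise<T> {
  for (let attempt = 1; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (attempt >= maxAttempts) throw err; // out of retries: surface the error
      // Exponential backoff with jitter spreads retries away from the
      // rate-limit window instead of hammering it in lockstep.
      const delayMs = baseDelayMs * 2 ** (attempt - 1) * (0.5 + Math.random());
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
}
```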
INCONCLUSIVE
GEMINI

AI SEO Meta Description — CTR vs Manual

gemini-1.5-flash
HYPOTHESIS
AI-generated meta descriptions for FinCalc pages lift CTR by 10%+ vs manually written equivalents over 60 days.
KEY RESULT
CTR lift positive at day 38
Trending positive but insufficient data — 22 days remaining in the 60-day window
Ongoing · 220 tokens avg · FinCalc
INCONCLUSIVE
GEMINI

Gemini 2.0 Flash vs 1.5 Flash — Production Swap

gemini-2.0-flash
HYPOTHESIS
Gemini 2.0 Flash improves response quality by 5%+ on AskBLR queries vs 1.5 Flash at comparable cost and latency.
KEY RESULT
Quality delta unknown at day 7
Too early — 21 days of production data needed for statistical significance
P50: 290ms · Comparable tokens · AskBLR
INCONCLUSIVE
GEMINI

pgvector Embeddings for Product Search

text-embedding-004
HYPOTHESIS
pgvector semantic search improves NextGirl product recall by 25%+ vs keyword search at no additional infrastructure cost.
KEY RESULT
Recall lift below the 25% target (controlled test)
May need embedding-model tuning — experiment continuing
P50: 12ms (query) · Embeddings: 780ms · NextGirl
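
For context, the query path under test looks roughly like this sketch, assuming a products table with a precomputed embedding vector column and the node-postgres client; table and column names are illustrative.

```ts
import { Client } from "pg";

async function semanticSearch(queryEmbedding: number[], limit = 10) {
  const client = new Client({ connectionString: process.env.DATABASE_URL });
  await client.connect();

  // `<=>` is pgvector's cosine-distance operator: smaller means more similar.
  const { rows } = await client.query(
    `SELECT id, name, embedding <=> $1 AS distance
       FROM products
      ORDER BY embedding <=> $1
      LIMIT $2`,
    [`[${queryEmbedding.join(",")}]`, limit] // vector passed as '[x,y,...]' literal
  );

  await client.end();
  return rows;
}
```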
FAILED
GEMINI

AI Auto-Generated Business Descriptions

gemini-1.5-flash
HYPOTHESIS
Gemini generates accurate business descriptions for NammaHubballi listings from name and category alone at 85%+ accuracy.
KEY RESULT
54% factual accuracy
Hallucinated hours, contacts, and specialties — the model lacks ground truth for local facts
P50: 240ms · 320 tokens avg · Closed: Week 5
FAILED
GEMINI

Voice-to-AI Search on Mobile — MNCJob

Web Speech API + gemini-1.5-flash
HYPOTHESIS
Web Speech API + Gemini text search produces voice search experience with 90%+ recognition accuracy on mobile.
KEY RESULT
40%+ recognition failure rate on iOS Safari
An architectural constraint, not an AI one — experiment abandoned
iOS Safari 16.x · Closed: Week 3
FAILED
EDGE AI

Gemini Nano — On-Device Inference

gemini-nano
HYPOTHESIS
Gemini Nano running on-device achieves 75%+ of Gemini Flash response quality for simple classification tasks at zero API cost.
KEY RESULT
62.1% accuracy
12.9 points below the 75% hypothesis — insufficient for the product use case; API dependency maintained
P50: 180ms · On-device · Closed: Week 2
THE EXPERIMENT DESIGN STANDARD

Every Experiment Follows
The Same Five Rules.

AI experiments fail to produce useful knowledge when they are undisciplined. These five rules are the design constraints every experiment here follows — without exception.

"An AI evaluation that uses synthetic data produces synthetic insights. We measure on production traffic because production traffic is where the model actually fails."

1

The hypothesis has a number in it.

A hypothesis that reads 'AI will improve search results' is not a hypothesis — it is an aspiration. Every hypothesis here specifies a metric, a threshold, and a measurement period before any experiment code is written.

NOT: 'AI will improve tagging accuracy.' YES: 'Gemini Flash will tag NammaHubballi listings at 90%+ accuracy vs human labels on a 200-listing evaluation set within 7 days.'
2

Evaluation uses production data.

Synthetic benchmarks tell you how the model performs on synthetic data. We evaluate on anonymised production queries and production content — because model performance on real user behaviour is consistently different from performance on constructed evaluation sets.

MNCJob resume parsing was evaluated on 500 real uploaded resumes — not generated test documents — because real resumes have the formatting inconsistencies that matter.
3

Failure is planned for — not surprised by.

Each experiment defines upfront what failure looks like: the metric threshold below which the approach will not be deployed. When that threshold is not met, the experiment is closed formally — not kept running until the numbers eventually look better.

AI business description generation was closed when factual accuracy measured 54% against 85% hypothesis threshold. The failure was documented and archived — not iterated on indefinitely.
4

The model is chosen before the experiment — not after.

We do not run a prompt against all models and then choose the one that performed best to report. Model choice is declared in the hypothesis — and if we are comparing models, the comparison protocol is specified before running any evaluation.

The job-match scoring experiment declared 'gemini-1.5-flash' as the test model. Gemini Pro's higher accuracy was noted but not used to inflate the result — cost and latency are part of the evaluation.
5

Every experiment answers a real product question.

We do not run AI experiments for research interest alone. Each experiment is attached to a real product decision: should we deploy this feature, should we switch models, should we change the architecture. An experiment that does not inform a decision is not an experiment — it is entertainment.

The Cloudflare edge inference experiment answered: 'should AskBLR route classification through edge or Gemini API?' The answer (edge, 40ms vs 320ms) is now the production architecture.
MEASUREMENT METHODOLOGY

How We Measure What Matters.

Accuracy is subjective without a baseline. Latency is meaningless without a specified percentile. Here is how we measure the two metrics that define production viability.

Measuring Accuracy

Every accuracy metric here is measured against a manually labelled ground-truth dataset. We do not use LLMs to judge LLM outputs ("LLM-as-a-judge") for primary metrics, as that introduces correlated biases.

  • Min 200 samples per evaluation
  • Drawn from the 90th percentile of query complexity
  • Scored by human product owners
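
In code terms, the accuracy number reduces to exact-match scoring against that labelled set. The sketch below is illustrative (EvalSample and predict are hypothetical names), but it is the whole method: no model self-grading anywhere in the loop.

```ts
interface EvalSample {
  input: string;
  expected: string; // human-assigned ground-truth label
}

function accuracy(
  samples: EvalSample[],
  predict: (input: string) => string
): number {
  if (samples.length < 200) {
    throw new Error("Evaluation set is below the 200-sample minimum");
  }
  const correct = samples.filter(
    (s) => predict(s.input) === s.expected
  ).length;
  return correct / samples.length; // e.g. 0.887 is reported as 88.7%
}
```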

Measuring Latency

"Time to First Token" (TTFT) only matters for streaming UI. For programmatic integrations, total round-trip time (P50 and P95) dictates whether a feature blocks the main thread.

  • Measured from client request to complete payload
  • Includes network overhead and JSON parsing
  • P50 used as baseline, P95 for circuit breakers
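
A measurement harness consistent with those rules fits in a few lines; the fetch-based timing and nearest-rank percentile below are illustrative simplifications of what we actually run.

```ts
// Wall-clock round trip: request sent to payload fully parsed.
async function measureRoundTrip(url: string, body: unknown): Promise<number> {
  const start = performance.now();
  const res = await fetch(url, {
    method: "POST",
    headers: { "content-type": "application/json" },
    body: JSON.stringify(body),
  });
  await res.json(); // JSON parsing is inside the measured window
  return performance.now() - start;
}

// Nearest-rank percentile over collected samples.
function percentile(samplesMs: number[], p: number): number {
  const sorted = [...samplesMs].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length) - 1;
  return sorted[Math.max(0, rank)];
}

// percentile(samples, 50) is the baseline; percentile(samples, 95)
// feeds the circuit-breaker threshold.
```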

The Latency Budget

A 200ms API call becomes a 600ms user delay once orchestration, routing, and database lookups are factored in. This is our standard architecture latency map.

LLM INFERENCE: 280–800ms
Production Gemini 1.5 Flash API calls including network overhead.
EDGE INFERENCE: 30–60ms
Cloudflare Workers AI for lightweight classification before routing.
VECTOR SEARCH: 10–25ms
pgvector semantic queries against pre-computed embeddings.
ORCHESTRATION: 100–300ms
n8n workflow execution overhead per AI node.
USER CLIENT (Next.js App) → ROUTING (Edge AI, 40ms) → ORCHESTRATION (n8n / server, 150ms) → RETRIEVAL (pgvector, 15ms) → INFERENCE (Gemini API, 350ms)
MODEL SELECTION

Why We Default To
Google Gemini.

We are not married to any one vendor, but pragmatism dictates a default stack. After testing OpenAI, Anthropic, and Google DeepMind extensively, here is how we position them for production use cases.

Comparison axes: Structured extraction · RAG & grounding · Native multimodal · Edge deployment · Code generation
Google DeepMind

Gemini 1.5 Flash

The Production Workhorse

Powers 80% of our production integrations. Unbeatable latency-to-intelligence ratio for structurally constrained tasks. Massive context window (1M+) enables 'in-context lookup' architectures replacing complex RAG.

P50 Latency: 320ms
Cost / 1M Input: $0.075
Context Window: 1M tokens
Google DeepMind

Gemini 1.5 Pro

The Complex Reasoner

Reserved for high-complexity analytical tasks where Flash falters. Slower and more expensive, but achieves 15-20% higher accuracy on multi-step reasoning and complex code generation tasks.

P50 Latency: 1,200ms
Cost / 1M Input: $1.25
Context Window: 2M tokens
Anthropic

Claude 3.5 Sonnet

The Coding Standard

Our benchmark for code generation and deeply nuanced text writing. We use Claude internally for development, but rarely in production for clients due to Gemini's superior speed/cost for standard JSON-in/JSON-out tasks.

P50 Latency: 850ms
Cost / 1M Input: $3.00
Context Window: 200k tokens
OpenAI

GPT-4o Mini

The Fallback

A highly capable model with excellent latency, but in our testing Gemini Flash's 1M-token context window and stricter JSON adherence make it our default choice at the same price tier.

P50 Latency: 380ms
Cost / 1M Input: $0.150
Context Window: 128k tokens
ENGINEERING IMPACT

Architectural Decisions Driven by Data.

Experiments are only useful if they change how you build. Here are the concrete architectural shifts we've made based on our lab results.

🧠

Long Context > Complex RAG

Adopted for datasets < 500k tokens

Gemini Flash's 1M context window makes 'in-context lookup' faster and more accurate than chunking and vectorized retrieval for medium-sized corpuses.

Impact: Removed the pgvector dependency for 3 internal tools.

Edge Classification First

Standard Architecture

Running a fast classifier at the edge (Cloudflare Workers AI) to route queries before hitting heavy LLM APIs reduces perceived latency by ~200ms.

Impact: Deployed on AskBLR to separate conversational intents.

🚧

Strict JSON Validation

Mandatory

LLMs fail silently when generating structure. We run strict JSON-schema validation middleware on every API response before it reaches the application tier.

Impact: Reduced downstream parsing errors from 4.2% to 0.01%.
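
A validation gate of this kind can be sketched with zod (used here as an illustrative validator; this page does not name our production library). The schema fields are examples, not a real product contract.

```ts
import { z } from "zod";

// Example schema for one AI response shape; fields are illustrative.
const OutfitSuggestion = z.object({
  items: z.array(z.string()).min(1),
  occasion: z.string(),
  confidence: z.number().min(0).max(1),
});

function validateModelOutput(raw: string) {
  // JSON.parse throws on malformed JSON; safeParse rejects valid JSON
  // that does not match the schema. Either way, nothing silently passes.
  const parsed = OutfitSuggestion.safeParse(JSON.parse(raw));
  if (!parsed.success) {
    throw new Error(`Schema violation: ${parsed.error.message}`);
  }
  return parsed.data;
}
```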

HARD LIMITS

What We Will Not
Use AI For.

Understanding a technology's capabilities requires understanding its boundaries. These are the hard lines where we stop using AI and rely exclusively on deterministic engineering.

🚫

No Unsupervised Code Deployment

AI writes roughly 40% of our boilerplate and scaffolding. It deploys exactly 0% of our production code. Every line is reviewed by a Senior Engineer before merging.

🎨

No Generative UI

LLMs generating React components on the fly is an impressive demo that fails accessibility, performance, and brand consistency audits in production.

🗄️

No Database Mutations

AI models can read from the database to contextualise answers. They are never granted UPDATE or DELETE permissions. State changes require deterministic code.

📝

No Black-Box Prompts

A prompt is source code. It is version controlled, reviewed, and tested against a regression suite before deployment. Tuning is engineering, not magic.

WE BUILD THIS

Stop Experimenting.
Start Integrating.

You have seen our data. If you have a specific AI integration challenge or want to test a hypothesis in your own product, let's talk engineering.

Or email our engineering lead directly: engineering@cipherbitz.com