Fifteen AI Experiments. Real Products. Real Numbers.
Every experiment on this page ran on production AI integrations across six live CipherBitz products. Latency measured. Accuracy benchmarked. Failure modes documented. No demo environments. No synthetic benchmarks. No vendor-provided numbers.
Fifteen Experiments. Zero Vendor Benchmarks.
Every result below came from our own production integrations. Latency numbers are P50 unless stated. Accuracy is measured against a manually labelled evaluation set, not model self-reporting.
LLM Prompt Caching — API Call Reduction
Structured JSON Extraction from Natural Language
Grounded City AI vs Base Gemini — Accuracy
Occasion-Based Outfit Recommendation Accuracy
Edge Inference for Query Classification
Gemini Auto-Tagging for Business Listings
Structured Resume Parsing — Field Extraction
AI Job-Candidate Match Scoring
n8n + Gemini — AI Node Reliability in Prod
AI SEO Meta Description — CTR vs Manual
Gemini 2.0 Flash vs 1.5 Flash — Production Swap
pgvector Embeddings for Product Search
AI Auto-Generated Business Descriptions
Voice-to-AI Search on Mobile — MNCJob
Gemini Nano — On-Device Inference
Every Experiment Follows The Same Five Rules.
AI experiments fail to produce useful knowledge when they are undisciplined. These five rules are the design constraints every experiment here follows — without exception.
"An AI evaluation that uses synthetic data produces synthetic insights. We measure on production traffic because production traffic is where the model actually fails."
The hypothesis has a number in it.
A hypothesis that reads 'AI will improve search results' is not a hypothesis — it is an aspiration. Every hypothesis here specifies a metric, a threshold, and a measurement period before any experiment code is written.
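As a minimal sketch of what "a hypothesis with a number in it" means in practice (field names and values here are illustrative, not an internal CipherBitz schema):

```python
# A hypothesis is declared as data before any experiment code is written.
# Every field below is illustrative, not a real internal record.
hypothesis = {
    "claim": "Edge classification cuts P50 routing latency below 120 ms",
    "metric": "p50_latency_ms",    # the single number being tested
    "threshold": 120,              # deploy only if the metric clears this
    "direction": "lower_is_better",
    "measurement_days": 14,        # fixed window, declared upfront
    "model": "gemini-1.5-flash",   # chosen before the experiment, not after
}

def is_valid_hypothesis(h: dict) -> bool:
    """A hypothesis without a metric, a threshold, and a window is an aspiration."""
    return all(k in h for k in ("metric", "threshold", "measurement_days"))
```

Declaring the threshold and window as data also makes rule three mechanical: when the window closes and the metric misses the threshold, the experiment is closed.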
Evaluation uses production data.
Synthetic benchmarks tell you how the model performs on synthetic data. We evaluate on anonymised production queries and production content — because model performance on real user behaviour is consistently different from performance on constructed evaluation sets.
Failure is planned for — not surprised by.
Each experiment defines upfront what failure looks like: the metric threshold below which the approach will not be deployed. When that threshold is not met, the experiment is closed formally — not kept running until the numbers eventually look better.
The model is chosen before the experiment — not after.
We do not run a prompt against all models and then choose the one that performed best to report. Model choice is declared in the hypothesis — and if we are comparing models, the comparison protocol is specified before running any evaluation.
Every experiment answers a real product question.
We do not run AI experiments for research interest alone. Each experiment is attached to a real product decision: should we deploy this feature, should we switch models, should we change the architecture. An experiment that does not inform a decision is not an experiment — it is entertainment.
How We Measure What Matters.
Accuracy is subjective without a baseline. Latency is meaningless without a percentile. Here is how we measure the two metrics that define production viability.
Measuring Accuracy
Every accuracy metric here is computed against a manually labelled ground-truth dataset. We do not use LLMs to judge LLM outputs ("LLM-as-a-judge") for primary metrics, because it introduces correlated biases.
- ✓ Min 200 samples per evaluation
- ✓ Drawn from the 90th percentile of query complexity
- ✓ Scored by human product owners
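A minimal sketch of how accuracy against a labelled set can be computed; the sample-size guard mirrors the 200-sample rule above, and exact match stands in for whatever scoring rubric a given experiment uses:

```python
def accuracy(predictions: list[str], labels: list[str], min_samples: int = 200) -> float:
    """Exact-match accuracy against a human-labelled ground truth.
    Refuses to report a number on an undersized evaluation set."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must be aligned")
    if len(labels) < min_samples:
        raise ValueError(f"need >= {min_samples} labelled samples, got {len(labels)}")
    correct = sum(p == l for p, l in zip(predictions, labels))
    return correct / len(labels)
```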
Measuring Latency
"Time to First Token" (TTFT) only matters for streaming UI. For programmatic integrations, total round-trip time (P50 and P95) dictates whether a feature blocks the main thread.
- ✓ Measured from client request to complete payload
- ✓ Includes network overhead and JSON parsing
- ✓ P50 used as baseline, P95 for circuit breakers
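The three rules above can be sketched in a few lines; the 600 ms breaker budget is an assumed value for illustration, not a product-specific constant:

```python
import statistics

def latency_percentiles(samples_ms: list[float]) -> tuple[float, float]:
    """P50 (baseline) and P95 (circuit-breaker input) from round-trip samples.
    Samples are measured client-side, request to fully parsed payload."""
    if not samples_ms:
        raise ValueError("no samples")
    # quantiles(n=100) yields the 1st..99th percentiles at indices 0..98
    q = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return q[49], q[94]

def breaker_open(p95_ms: float, budget_ms: float = 600.0) -> bool:
    """Trip the circuit breaker when tail latency blows the budget."""
    return p95_ms > budget_ms
```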
The Latency Budget
A 200ms API call becomes a 600ms user delay once orchestration, routing, and database lookups are factored in. This is our standard architecture latency map.
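A sketch of such a latency map; the hop names and millisecond figures are illustrative of how a 200 ms API call compounds to 600 ms, not measurements from a specific product:

```python
# Illustrative latency budget: the shape of the map we maintain,
# with hypothetical per-hop numbers.
LATENCY_BUDGET_MS = {
    "edge_routing": 40,
    "orchestration": 110,
    "db_context_lookup": 150,
    "llm_api_call": 200,      # the number most teams quote in isolation
    "json_validation": 60,
    "response_assembly": 40,
}

def total_user_delay(budget: dict[str, int]) -> int:
    """What the user actually waits for: every hop, not one API call."""
    return sum(budget.values())
```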
Why We Default To Google Gemini.
We are not married to any one vendor, but pragmatism dictates a default stack. After testing OpenAI, Anthropic, and Google DeepMind extensively, here is how we position them for production use cases.
Gemini 1.5 Flash
Powers 80% of our production integrations. Unbeatable latency-to-intelligence ratio for structurally constrained tasks. Massive context window (1M+) enables 'in-context lookup' architectures that replace complex RAG.
Gemini 1.5 Pro
Reserved for high-complexity analytical tasks where Flash falters. Slower and more expensive, but achieves 15-20% higher accuracy on multi-step reasoning and complex code generation tasks.
Claude 3.5 Sonnet
Our benchmark for code generation and deeply nuanced text writing. We use Claude internally for development, but rarely in production for clients due to Gemini's superior speed/cost for standard JSON-in/JSON-out tasks.
GPT-4o Mini
A highly capable model with excellent latency, but in our testing, Gemini Flash's 1M context window and strict JSON adherence make it our default choice at the same price tier.
Architectural Decisions Driven by Data.
Experiments are only useful if they change how you build. Here are the concrete architectural shifts we've made based on our lab results.
Long Context > Complex RAG
Gemini Flash's 1M context window makes 'in-context lookup' faster and more accurate than chunking and vectorized retrieval for medium-sized corpora.
Impact: Removed pgvector dependency for 3 internal tools.
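A minimal sketch of the 'in-context lookup' pattern, assuming a character-count guard as a rough proxy for the token budget (function and parameter names are our illustration):

```python
def build_in_context_prompt(question: str, documents: list[str],
                            max_chars: int = 3_000_000) -> str:
    """'In-context lookup': place the whole medium-sized corpus in the prompt
    and let the long-context model find the answer -- no chunking, no vectors.
    A 1M-token window comfortably fits a few MB of text."""
    corpus = "\n\n---\n\n".join(documents)
    if len(corpus) > max_chars:
        raise ValueError("corpus too large for in-context lookup; fall back to retrieval")
    return f"Answer using ONLY the documents below.\n\n{corpus}\n\nQuestion: {question}"
```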
Edge Classification First
Running a fast classifier at the edge (Cloudflare Workers AI) to route queries before hitting heavy LLM APIs reduces perceived latency by ~200ms.
Impact: Deployed on AskBLR to separate conversational intents.
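The routing idea can be sketched as follows. Real deployments use a small edge model for classification; keyword rules stand in here to keep the sketch self-contained, and the intent names are hypothetical:

```python
# Intents cheap enough to answer without the heavy LLM round trip.
FAST_PATH_INTENTS = {"greeting", "hours", "location"}

def classify_intent(query: str) -> str:
    """Stand-in for a lightweight edge classifier (e.g. a Workers AI text
    model). Keyword rules here are a placeholder, not the real classifier."""
    q = query.lower()
    if any(w in q for w in ("hello", "hey", "good morning")):
        return "greeting"
    if "open" in q or "hours" in q:
        return "hours"
    if "where" in q or "address" in q:
        return "location"
    return "conversational"

def route(query: str) -> str:
    """Only conversational intents pay the full LLM API latency."""
    return "fast_path" if classify_intent(query) in FAST_PATH_INTENTS else "llm_api"
```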
Strict JSON Validation
LLMs fail silently when generating structure. We enforce rigid JSON schema validation middleware on every API response before it hits the application tier.
Impact: Reduced downstream parsing errors from 4.2% to 0.01%.
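A stdlib-only sketch of the contract such middleware enforces; production systems would use a full JSON Schema validator, and the schema fields here are hypothetical:

```python
import json

# Illustrative schema: required fields and their expected Python types.
SCHEMA = {"title": str, "tags": list, "confidence": float}

def validate_llm_json(raw: str, schema: dict = SCHEMA) -> dict:
    """Reject any model output that is not exactly the structure we asked for.
    Silent structural drift becomes a loud, typed failure at the boundary."""
    payload = json.loads(raw)  # raises ValueError on malformed JSON
    for field, ftype in schema.items():
        if field not in payload:
            raise ValueError(f"missing field: {field}")
        if not isinstance(payload[field], ftype):
            raise TypeError(f"{field}: expected {ftype.__name__}")
    return payload
```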
What We Will Not Use AI For.
Understanding a technology's capabilities requires understanding its boundaries. These are the hard lines where we stop using AI and rely exclusively on deterministic engineering.
No Unsupervised Code Deployment
AI writes roughly 40% of our boilerplate and scaffolding. It deploys exactly 0% of our production code. Every line is reviewed by a Senior Engineer before merging.
No Generative UI
LLMs generating React components on the fly is an impressive demo that fails accessibility, performance, and brand consistency audits in production.
No Database Mutations
AI models can read from the database to contextualise answers. They are never granted UPDATE or DELETE permissions. State changes require deterministic code.
No Black-Box Prompts
A prompt is source code. It is version controlled, reviewed, and tested against a regression suite before deployment. Tuning is engineering, not magic.
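A sketch of what "a prompt is source code" can look like mechanically; the version tag, template, and structural check are illustrative (model-output regressions would run separately against the labelled evaluation set):

```python
# A prompt treated as source code: versioned and checked in CI before deploy.
PROMPT_VERSION = "tagger-v3.2"  # hypothetical version tag
PROMPT_TEMPLATE = (
    "Extract tags for this business listing.\n"
    'Return ONLY JSON: {{"tags": [...]}}.\n'
    "Listing: {listing}"
)

def render_prompt(listing: str) -> str:
    return PROMPT_TEMPLATE.format(listing=listing)

def regression_check() -> bool:
    """Structural regression suite: a deployed prompt must still carry its
    output contract and its input slot after any 'tuning' commit."""
    rendered = render_prompt("Corner bakery, open 7am")
    return "ONLY JSON" in PROMPT_TEMPLATE and "Corner bakery" in rendered
```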
Stop Experimenting. Start Integrating.
You have seen our data. If you have a specific AI integration challenge or want to test a hypothesis in your own product, let's talk engineering.
Or email our engineering lead directly: engineering@cipherbitz.com