5 Reasons Your Demo Works But Production Crashes

Common patterns across AI, RAG, and ML projects — why does "it worked fine" fall apart in production?

Demo vs Launch

Demo: Good inputs + single run + someone watching

Launch: Bad inputs + repetition + edge cases + operations + accountability

Fail to recognize this difference, and the demo that earned applause will be rolled back within a week of launch.

1. Input Distribution Shifts

Demo set vs Reality

During demos, you pick examples that work well. In reality, you get typos, abbreviations, weird formats, and adversarial inputs.

Symptoms: Dramatic failures on specific cases. "90% average accuracy, so why are complaints flooding in?"

Remedies:

  • Shadow traffic to understand real input distribution
  • Canary deployment to expose only partial traffic first
  • Automated failure case collection loop
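
As a rough illustration of that last point, here is a minimal failure-case collection loop in Python. The names are hypothetical stand-ins: `run_pipeline` represents whatever your model/RAG call actually is, and the confidence threshold and JSONL path are placeholders to adapt.

```python
import json
import time

FAILURE_LOG = "failure_cases.jsonl"  # hypothetical path; point this at real storage

def run_pipeline(user_input: str) -> tuple[str, float]:
    """Stand-in for your real model/RAG call; returns (answer, confidence)."""
    return "stub answer", 0.42

def record_failure(user_input: str, output: str, reason: str) -> None:
    """Append a failed case so it can be triaged into the eval set later."""
    entry = {
        "ts": time.time(),
        "input": user_input,
        "output": output,
        "reason": reason,  # e.g. "low_confidence", "thumbs_down", "exception"
    }
    with open(FAILURE_LOG, "a", encoding="utf-8") as f:
        f.write(json.dumps(entry, ensure_ascii=False) + "\n")

def handle_request(user_input: str) -> str:
    answer, confidence = run_pipeline(user_input)
    if confidence < 0.5:  # illustrative threshold; tune against real traffic
        record_failure(user_input, answer, "low_confidence")
    return answer
```

Cases collected this way become tomorrow's regression set, which is how your test data gradually converges toward the real input distribution.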

2. Dependencies Multiply

Tools / Search / External APIs / Permissions / Network

In demos, all external services work perfectly. In production, APIs slow down, tokens expire, networks drop.

Symptoms: Retry storms, timeouts, partial failures. "It worked yesterday, why is it broken today?"

Remedies:

  • Time budget (cap on total request time)
  • Circuit breaker to prevent failure propagation
  • Graceful degradation (fallback paths when externals fail)
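
A minimal sketch of how these three remedies can fit together: a total time budget, a simple circuit breaker, and a fallback path. `call_search_api` and `generate` are hypothetical stand-ins for your external dependency and model call.

```python
import time

def call_search_api(query: str, timeout: float) -> str:
    """Stand-in for a real retrieval/external API call."""
    return f"context for {query!r}"

def generate(query: str, context: str) -> str:
    """Stand-in for the model call; works with or without retrieved context."""
    return f"answer to {query!r} (used context: {bool(context)})"

class CircuitBreaker:
    """Stop calling a dependency after repeated failures so one slow or
    broken service doesn't drag every request down with it."""

    def __init__(self, max_failures: int = 5, reset_after: float = 30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = 0.0

    def allow(self) -> bool:
        if self.failures < self.max_failures:
            return True
        # Half-open: let one trial request through after the cool-down.
        return time.monotonic() - self.opened_at >= self.reset_after

    def record(self, ok: bool) -> None:
        if ok:
            self.failures = 0
        else:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()

search_breaker = CircuitBreaker()

def answer(query: str, time_budget_s: float = 5.0) -> str:
    deadline = time.monotonic() + time_budget_s  # cap on total request time
    context = ""
    if search_breaker.allow():
        try:
            remaining = max(deadline - time.monotonic(), 0.1)
            context = call_search_api(query, timeout=remaining)
            search_breaker.record(ok=True)
        except Exception:
            search_breaker.record(ok=False)
    # Graceful degradation: answer without retrieval instead of failing outright.
    return generate(query, context)
```

The point is not this exact class but the shape: every external call gets a deadline, repeated failures open the circuit, and there is always a degraded-but-working path to return something.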

3. Evaluation Criteria Change

Accuracy → Trust / Accountability / Explainability

In demos, "correct = success". In production, "correct can still be problematic" and "wrong = major incident".

Symptoms: Accurate answers generating complaints. Legal team reaches out. "Who's responsible for this?"

Remedies:

  • Policies/guardrails (sensitive topics, PII)
  • Abstain option (refuse to answer when uncertain)
  • Evidence-first (show sources before conclusions)
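
One way to wire abstention and evidence-first output together, sketched below; `retrieve` and `generate_with_score` are hypothetical stand-ins for your retrieval and model calls, and the threshold is illustrative.

```python
def retrieve(query: str) -> list[str]:
    """Stand-in for a real retrieval call; returns candidate source snippets."""
    return [f"snippet about {query!r}"]

def generate_with_score(query: str, sources: list[str]) -> tuple[str, float]:
    """Stand-in for the model call; returns (draft answer, confidence score)."""
    return f"draft answer to {query!r}", 0.9

def answer_with_abstain(query: str) -> str:
    sources = retrieve(query)
    if not sources:
        # Abstain rather than guess when there is nothing to ground the answer on.
        return "I don't have enough reliable information to answer that."
    draft, confidence = generate_with_score(query, sources)
    if confidence < 0.7:  # illustrative threshold
        return "I'm not confident enough to answer this; please check with a human."
    # Evidence-first: show the sources before the conclusion.
    cited = "\n".join(f"- {s}" for s in sources)
    return f"Sources:\n{cited}\n\nAnswer: {draft}"
```

Guardrails for sensitive topics and PII would sit in front of this as a separate filter, so policy decisions stay auditable instead of being buried in the prompt.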

4. State/Cache/Concurrency Enter the Picture

Production means repetition

Demos run once and done. In production, the same question comes 1000 times, gets cached, and is processed concurrently.

Symptoms: Same question, different answers. Cache pollution. Race conditions.

Remedies:

  • Deterministic path (temperature=0, fixed seed)
  • Clear caching policy (when to cache, when to regenerate)
  • Idempotency guarantee (same request = same result)
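
Here is a sketch of how a deterministic path plus an explicit cache key makes repeated requests idempotent. The model name, prompt version, and in-memory dict are illustrative; a real deployment would use a shared cache.

```python
import hashlib
import json

CACHE: dict[str, str] = {}  # illustrative; use a shared cache (e.g. Redis) in production

def generate(query: str, temperature: float, seed: int) -> str:
    """Stand-in for the model call on the deterministic path."""
    return f"answer to {query!r}"

def cache_key(query: str, model: str, prompt_version: str) -> str:
    """Same normalized input + same model + same prompt version => same key."""
    normalized = " ".join(query.lower().split())
    raw = json.dumps([normalized, model, prompt_version])
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()

def answer(query: str) -> str:
    key = cache_key(query, model="my-model", prompt_version="v3")  # illustrative values
    if key in CACHE:
        return CACHE[key]
    # Deterministic path: temperature=0 and a fixed seed keep cached entries valid.
    result = generate(query, temperature=0.0, seed=42)
    CACHE[key] = result
    return result
```

Putting the prompt version in the key also gives you a natural invalidation point: bump the version and old entries simply stop being hit.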

5. Operations Begin

Monitoring / Alerts / Rollback / Hotfix

Demos have no operations. In production, alerts fire at 3 AM, and you discover something's been silently broken for a week.

Symptoms: Silent failures (wrong results, no error logs). Cost explosions (infinite retries).

Remedies:

  • Define SLO/SLI (success rate, latency, cost caps)
  • Set error budget (acceptable failure rate)
  • Design logging (track 0-hit, retry, fallback)
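
As a starting point, the logging side can be as small as a few counters and structured log lines; `retrieve` and `generate` below are again hypothetical stand-ins, and retries would be counted the same way as 0-hit and fallback.

```python
import logging
from collections import Counter

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("service")

METRICS = Counter()  # illustrative; export to Prometheus/StatsD/etc. in production

def retrieve(query: str) -> list[str]:
    """Stand-in for a real retrieval call."""
    return []

def generate(query: str, docs: list[str]) -> str:
    """Stand-in for the model call."""
    return f"answer to {query!r}"

def handle(query: str) -> str:
    METRICS["requests"] += 1
    docs = retrieve(query)
    if not docs:
        METRICS["zero_hit"] += 1  # SLI candidate: retrieval found nothing
        log.warning("zero-hit retrieval: %r", query)
    try:
        return generate(query, docs)
    except Exception:
        METRICS["fallback"] += 1  # SLI candidate: degraded answer served
        log.exception("generation failed, serving fallback for %r", query)
        return "Sorry, something went wrong. Please try again."
```

Once these counters exist, the SLO and error budget are just thresholds on top of them (for example, a fallback rate below some agreed percentage over 30 days), and the cost cap is one more counter tracking tokens or spend per request.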

Pre-Launch Checklist

☐ Do you have test data similar to real traffic?
☐ Is there a fallback when external dependencies fail?
☐ Are abstain conditions defined?
☐ Are sensitive-topic guardrails in place?
☐ Is the caching policy clear?
☐ Does the same input produce the same output?
☐ Are error logs being collected?
☐ Is there a cost cap?
☐ Is there a rollback procedure?
☐ Is there a designated person to contact during incidents?
If 3 or more items are ☐, you're not ready to launch.

Next in Series

  • Part 2: For Vibe Coders — "Why does it break when I deploy what worked locally?"
  • Part 3: For Teams/Organizations — "The real reason launches fail: Alignment, Accountability, Operations"
