Teardowns
Most AI pilots do not die in the model.
We once shipped a support-triage model that hit 94% accuracy in testing and got used by exactly nobody.
The model was never the problem. It lived in a separate tab, and the support agents already had eleven tabs open. We moved one inference call into the tool they already stared at all day, changed nothing about the model, and usage went from 0% to roughly 80% in a week. Most “AI did not work” stories are actually “we forgot it had to live where the work happens” stories.
This page is our logo wall replacement: anonymized autopsies, the numbers that mattered, and the fix. No named clients. No victory lap. Just the ways pilots fail when smart people confuse a demo with a working system.
Autopsy 01
The 94%-accurate bot nobody used.
A support team had a real triage problem. Tickets came in messy. Agents lost time deciding priority, routing, and whether a refund policy applied. The model did the narrow job well: on a held-out sample, it matched the human routing label 94% of the time. In a room, the demo looked obvious. In production, usage was zero.
The reason was embarrassing and useful. The AI lived one click too far away. Agents worked in the ticketing tool. The model lived in a separate tab with a login, a search field, and a copy button. The team already had the help center, CRM, refund sheet, queue view, chat, and internal notes open. “One more tab” was not one more tab. It was one more place to forget.
We did not retrain. We did not swap models. We moved the inference call into the existing ticket sidebar and wrote the result into the field the team already used. The label appeared where the handoff happened. Usage went from 0% to roughly 80% in the first week, and the remaining gap was mostly edge cases the team wanted humans to own.
The lesson is blunt: the last mile is a product problem, not a model problem. If the output is not inside the workflow, the model is theater with a login screen. We spend unusual time on the boring integration layer because that is where pilots quietly die.
Autopsy 02
When we deleted the vector database.
A fintech's AP team wanted “RAG” over policy docs, vendor rules, and approval notes. The first build had the shape everyone expects now: embeddings, chunks, a vector database, a reranker, and a chat box. It was expensive enough to notice and accurate enough to be dangerous. On a 120-question test set, the system often found a semantically similar paragraph and still missed the exact rule.
The corpus was not web scale. It was a few hundred pages, with stable headings, repeated terms, and document names people already knew. The fancy stack was solving the wrong problem. We rewrote the retrieval layer with good chunking, document-aware grep, strict citation rules, and a small model call only after the exact passages were found. Accuracy went up, cost went down, and the answer became easier for a controller to audit.
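A minimal sketch of that retrieval shape, not the production code. It assumes the docs use Markdown-style `#` headings, and every name in it (`Passage`, `chunk_by_heading`, `search`, `cite`) and the sample policy text are made up for illustration. The point is the order of operations: find exact passages first, keep the document and heading for the citation, and only then hand anything to a model.

```python
import re
from dataclasses import dataclass


@dataclass
class Passage:
    doc: str      # document name people already know
    heading: str  # nearest section heading, kept for the citation
    text: str


def chunk_by_heading(doc: str, body: str) -> list[Passage]:
    """One passage per heading. Assumes headings are lines starting with '#'."""
    passages, heading, lines = [], "(intro)", []
    for line in body.splitlines():
        if line.startswith("#"):
            if lines:
                passages.append(Passage(doc, heading, " ".join(lines)))
            heading, lines = line.lstrip("# ").strip(), []
        elif line.strip():
            lines.append(line.strip())
    if lines:
        passages.append(Passage(doc, heading, " ".join(lines)))
    return passages


def search(passages: list[Passage], terms: list[str]) -> list[Passage]:
    """Passages containing every term as a whole word, case-insensitive."""
    patterns = [re.compile(rf"\b{re.escape(t)}\b", re.IGNORECASE) for t in terms]
    return [
        p for p in passages
        if all(pat.search(p.heading + " " + p.text) for pat in patterns)
    ]


def cite(p: Passage) -> str:
    """Render a passage with its source so a reviewer can check it."""
    return f"[{p.doc} > {p.heading}] {p.text}"


# Invented sample document, for illustration only.
POLICY = """# Approval limits
Invoices over 5000 USD need controller approval.
# Vendor onboarding
New vendors need a W-9 before first payment.
"""

passages = chunk_by_heading("ap-policy.md", POLICY)
hits = search(passages, ["controller", "approval"])
for hit in hits:
    print(cite(hit))
```

If `search` returns nothing, the safer behavior in this shape is to say so and skip the model call, rather than let a model answer without a cited passage.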
The usable rule: reach for embeddings when meaning differs from wording, the language is varied, or the corpus is too large for exact matching to be sane. Start with plain keyword search when the docs are small, structured, and full of terms humans actually use. A boring baseline is not anti-AI. It is how you avoid paying rent on complexity.
Senior judgment sometimes looks like removing a tool from the diagram. We are not trying to make the architecture impressive. We are trying to move the number with the fewest pieces you will still understand after handoff.
A quick test: before approving a vector database, ask the vendor to beat keyword search on a blind set of real questions. If they cannot show the sample size, misses, and baseline, you are buying vibes.
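That test fits in a few lines. The sketch below assumes each system can be called as a function from question to answer and that you hold a labeled blind set; `score`, `compare`, and the four toy questions are placeholders, not real results. It reports the three things worth asking for: sample size, baseline, and the actual misses.

```python
def score(predict, questions):
    """Return (n, correct, misses) for one system on a labeled question set."""
    if not questions:
        raise ValueError("empty question set: there is nothing to measure")
    misses = [q for q in questions if predict(q["question"]) != q["expected"]]
    n = len(questions)
    return n, n - len(misses), misses


def compare(candidate, baseline, questions):
    """Score both systems on the same blind set and keep the candidate's misses."""
    n, cand_ok, cand_misses = score(candidate, questions)
    _, base_ok, _ = score(baseline, questions)
    return {
        "n": n,
        "candidate_accuracy": cand_ok / n,
        "baseline_accuracy": base_ok / n,
        "candidate_misses": [q["question"] for q in cand_misses],
    }


# Toy stand-ins: lookups that play the role of "vector stack" and "keyword search".
questions = [
    {"question": "q1", "expected": "a"},
    {"question": "q2", "expected": "b"},
    {"question": "q3", "expected": "c"},
    {"question": "q4", "expected": "d"},
]
keyword_search = {"q1": "a", "q2": "b", "q3": "c", "q4": "x"}.get
vector_stack = {"q1": "a", "q2": "x", "q3": "c", "q4": "x"}.get

report = compare(vector_stack, keyword_search, questions)
print(report)
```

Four questions is far too few to conclude anything; the toy set only shows the shape of the report.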
Eval Theater
How to spot fake accuracy.
A founder does not need to be a machine-learning researcher to interrogate an accuracy claim. You need eight plain checks, each named below for the red flag it catches. Use them on any vendor, including us. If the answers get slippery, the metric is probably decoration.
No holdout set
If the examples used to tune the system are also used to score it, the number is not evidence.
Cherry-picked wins
A demo that only shows successes is a sales reel. Ask for misses and borderline cases.
Training-data overlap
Public FAQs, old tickets, and copied examples can make recall look like reasoning.
No human spot-check
Someone who knows the work should review a sample and mark false confidence.
No baseline
“87% accurate” means little if a rules script, grep, or the current team already gets 85%.
No sample size
A single number without n is a magic trick. Ten examples and 1,000 examples are not the same claim.
Internal tests only
“It passed our tests” is not a method. Ask what failed, who graded it, and how labels were made.
Demo-safe workflow
If the demo path avoids messy inputs, permissions, handoffs, and exceptions, it has not been tested on your job yet.
We teach clients to catch this because clean measurement protects both sides. If we cannot win on a real baseline, you should know before a second invoice exists.
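The sample-size point is easy to make concrete. One common way to put error bars on an accuracy figure is the Wilson score interval; the sketch below uses it as an illustration, with a function name of our choosing and a 95% confidence level (z = 1.96) as the assumed default. It treats the examples as independent, which real test sets only approximate.

```python
import math


def wilson_interval(correct: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for an observed accuracy of correct / n."""
    if n <= 0 or not 0 <= correct <= n:
        raise ValueError("need n > 0 and 0 <= correct <= n")
    p = correct / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# The same headline "90% accurate", measured on very different sample sizes.
small = wilson_interval(9, 10)
large = wilson_interval(900, 1000)
print(f"n=10:   {small[0]:.2f} to {small[1]:.2f}")
print(f"n=1000: {large[0]:.2f} to {large[1]:.2f}")
```

Nine right out of ten is consistent with a true accuracy anywhere from about 60% to about 98%. Nine hundred out of a thousand narrows that to about 88% to 92%. Same headline, very different claim.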
Death certificate
The six ways AI pilots die.
Most failures are not mysterious. They fall into a small set of repeatable patterns. Naming the pattern early keeps the pilot small enough to save.
“Working” became opinion.
Symptom: every meeting debates quality. Root cause: nobody agreed on the number. Fix: a written metric contract on day one.
The mandate was “make us AI-first.”
Symptom: weeks of workshops, no shipped workflow. Root cause: scope was a slogan. Fix: one workflow, one number.
You rent the system forever.
Symptom: every change needs the vendor. Root cause: the code, prompts, evals, or weights never leave the vendor. Fix: take delivery of the assets.
The accuracy number was fake.
Symptom: the demo wins and production misses. Root cause: no real test set. Fix: a holdout set plus human spot-check.
It worked where nobody worked.
Symptom: good model, no usage. Root cause: the output lived outside the job. Fix: put it in the tool they already use.
Nobody owned it after handoff.
Symptom: three weeks later, it is stale. Root cause: no internal owner. Fix: name the owner before you start.
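Several of these fixes can live in one small, written artifact. The sketch below is one hypothetical shape for a metric contract: the class, its field names, and every value in the example are placeholders, and it assumes a metric where higher is better. What matters is that the number, the baseline, the holdout set, and the owner are written down before anything is built.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetricContract:
    workflow: str      # the one workflow in scope
    metric: str        # the one number, in plain words
    baseline: float    # what the current process scores today
    target: float      # what "working" means (higher is better here)
    holdout_size: int  # examples reserved for scoring, never used for tuning
    owner: str         # the internal person who owns it after handoff

    def problems(self) -> list[str]:
        """List the gaps that should block a pilot from starting."""
        issues = []
        if not self.owner.strip():
            issues.append("no named owner")
        if self.holdout_size <= 0:
            issues.append("no holdout set")
        if self.target <= self.baseline:
            issues.append("target does not beat the baseline")
        return issues

    def met(self, measured: float) -> bool:
        """True when the measured value reaches the agreed target."""
        return measured >= self.target


# Placeholder values, not figures from any real engagement.
contract = MetricContract(
    workflow="support ticket routing",
    metric="share of tickets where the model label matches the human label",
    baseline=0.80,
    target=0.90,
    holdout_size=500,
    owner="",
)
print(contract.problems())
```

An empty `problems()` list is not proof the pilot will work. It only means the argument about what "working" means has already happened.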
A better bet
Bring one ugly workflow.
We will tell you which death mode is most likely before we build. If the number is real and the owner exists, a two-week pilot is enough to learn something honest.