Anonymized case study

One number, agreed before we wrote a line of code.

Most AI case studies are logo theater. They hide the baseline, blur the sample size, and call a demo a transformation. This one shows the number, the denominator, and the caveat.

DPR Labs is an elite senior AI consulting studio for companies that want owned systems, not rented slop. We build inside your business, aim at one painful workflow, prove the result against a held-out set, then hand over the code, weights, prompts, evals, and operating notes. The client name below is withheld because the work touched live operations, but the method and numbers are the point.

10 min

Median handling time, down from 38 minutes

Illustrative / anonymized

61%

Exceptions resolved with no human in the loop

Illustrative / anonymized

n=2,431

Live freight exceptions measured over two weeks

Illustrative / anonymized

100%

Code, weights, retrieval rules, and eval harness owned by the client

Illustrative / anonymized

The problem

A 40-person logistics ops team was buried under exceptions.

The team handled freight-exception tickets: missed appointments, carrier status gaps, address mismatches, accessorial disputes, late pickup notices, and customer updates that had to be accurate before anyone moved freight. Volume averaged roughly 1,200 tickets per week. Median handling time was about 38 minutes. That meant thousands of operator minutes disappeared into repeat lookup work before the team even reached the judgment calls that deserved human attention.

The existing workflow was not stupid. It was just overloaded. Operators bounced between the ticketing system, carrier portals, old resolution notes, customer rules, and tribal knowledge. Senior people knew which exceptions could be resolved safely and which ones needed escalation, but that expertise lived in heads and scattered notes. Hiring more people would have added hands to the queue, not fixed the repeat decision loop.

The controversial decision was to reject the usual mandate: “automate logistics.” That is not a project. That is a fog machine. We agreed on one number before code: median handling time per exception. If that number did not move, the pilot failed.

The build

Classifier first. Retrieval second. No chatbot theater.

We built a classifier plus retrieval agent inside the ticketing system the team already used. The classifier sorted the exception type, confidence, and risk path. The retrieval layer pulled carrier context, prior resolutions, customer rules, and policy snippets. The agent then either resolved the ticket with a traceable reason or escalated it with the evidence already assembled.
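A minimal sketch of that resolve-or-escalate decision, for readers who want the shape of it. This is not the client's code. The thresholds, field names, and the `route` function are hypothetical, and the three actions mirror the labels used in the eval set (resolve, draft for review, escalate).

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    RESOLVE = "resolve"
    DRAFT_FOR_REVIEW = "draft_for_review"
    ESCALATE = "escalate"


@dataclass(frozen=True)
class Classification:
    exception_type: str  # e.g. "missed_appointment" (hypothetical label)
    confidence: float    # classifier confidence in [0, 1]
    high_risk: bool      # True when the risk path always needs a human


# Hypothetical thresholds; real values would be tuned on the held-out set.
RESOLVE_THRESHOLD = 0.90
DRAFT_THRESHOLD = 0.60


def route(c: Classification, evidence: list[str]) -> tuple[Action, str]:
    """Pick an action plus a readable reason to store on the ticket."""
    if c.high_risk:
        return Action.ESCALATE, f"{c.exception_type}: high-risk path, human required"
    if not evidence:
        return Action.ESCALATE, f"{c.exception_type}: no supporting evidence retrieved"
    if c.confidence >= RESOLVE_THRESHOLD:
        reason = f"{c.exception_type}: confidence {c.confidence:.2f}, {len(evidence)} sources"
        return Action.RESOLVE, reason
    if c.confidence >= DRAFT_THRESHOLD:
        return Action.DRAFT_FOR_REVIEW, f"{c.exception_type}: confidence {c.confidence:.2f}, needs review"
    return Action.ESCALATE, f"{c.exception_type}: confidence {c.confidence:.2f} too low"
```

The ordering is the design choice: risk and missing evidence are checked before confidence, so a confident classifier cannot talk its way past a rule that says a human must look.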

The important part was not the model call. The important part was the measurement harness. We created a 500-ticket held-out eval set before production rollout, with labels that reflected the actual operating decision: resolve, draft for review, or escalate. Every prompt, rule, retrieval change, and classifier adjustment had to beat that held-out set before it touched live work. The team could inspect the misses instead of trusting a vendor screenshot.
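The gate can be sketched in a few lines. This is an illustration under assumptions, not the harness we shipped: plain accuracy stands in for the scoring rule, and `evaluate` and `beats_current` are hypothetical names. A production harness would likely penalize an unsafe "resolve" more heavily than a cautious "escalate".

```python
from typing import Callable, Sequence

LABELS = {"resolve", "draft_for_review", "escalate"}

# One held-out example: (ticket fields, expected label).
Example = tuple[dict, str]


def evaluate(predict: Callable[[dict], str], held_out: Sequence[Example]):
    """Return (accuracy, misses) for one system version on the held-out set."""
    if not held_out:
        raise ValueError("held-out set is empty")
    misses = []
    for ticket, expected in held_out:
        got = predict(ticket)
        if got not in LABELS:
            raise ValueError(f"unknown label: {got!r}")
        if got != expected:
            # Keep the miss so a person can inspect it later.
            misses.append({"ticket": ticket, "expected": expected, "got": got})
    accuracy = (len(held_out) - len(misses)) / len(held_out)
    return accuracy, misses


def beats_current(candidate, current, held_out) -> bool:
    """Gate a change: it ships only if it scores strictly higher than what is live."""
    candidate_accuracy, _ = evaluate(candidate, held_out)
    current_accuracy, _ = evaluate(current, held_out)
    return candidate_accuracy > current_accuracy
```

Returning the misses alongside the score is what lets the team read the failures rather than take a single percentage on trust.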

We also refused to create a new app for operators to remember. The result appeared in the ticket sidebar, wrote back to the fields the team already used, and preserved a readable trail: source snippets, confidence, action taken, and why a human was still needed. Billion-dollar AI often looks less like a magic assistant and more like removing 12 tabs from a job no one should have to do by hand.

Results

The number moved: 38 minutes to 10 minutes.

Baseline was the team’s own pre-project measurement: roughly 1,200 freight-exception tickets per week, a 38-minute median handling time, and effectively 0% autonomous resolution. After rollout, measured across n=2,431 live exceptions over two weeks, median handling time fell to 10 minutes. That is 28 of 38 minutes gone, a reduction of roughly 74% on the single number we wrote down before work began.
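The arithmetic behind that reduction is simple enough to check by hand or in a few lines. The sketch below is illustrative: `pct_reduction` is a hypothetical helper, and the per-ticket times are invented to show how a median is taken, not drawn from the client's data.

```python
from statistics import median


def pct_reduction(baseline: float, after: float) -> float:
    """Percentage drop from baseline to after."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - after) / baseline * 100


# Invented per-ticket handling times in minutes, for illustration only.
sample_minutes = [4, 7, 10, 31, 52]
print(median(sample_minutes))           # 10
print(round(pct_reduction(38, 10), 1))  # 73.7
```

The median is the agreed metric because a handful of very long tickets cannot drag it around the way they would an average.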

Autonomous resolution moved from 0% to 61%. That does not mean the system pretended every ticket was safe. It means 61% met the agreed standard for no-human resolution, while the rest were escalated with context instead of dumped back into the queue cold. The operators still owned judgment. The machine owned repeat assembly, retrieval, and first-pass routing.

Ownership was part of the result. The client owns 100% of the code, weights, eval set, prompts, retrieval rules, and handoff notes. There is no per-seat rent on their own workflow and no vendor hostage situation if they want to retrain, fork, or fire us.

Why we show the caveat: this was one team over two weeks. The result is strong, but it is not a universal guarantee. Peak season, carrier mix changes, and new exception types will each call for eval-guided retraining. A number without its baseline, sample size, and limits is marketing, not proof.

What made it work

The system had a spine before it had a brain.

One metric: median handling time, not a dashboard designed to find a win after the fact.
One workflow: freight exceptions, not a vague corporate AI transformation program.
One held-out set: 500 tickets the build had to respect before touching production.
One place to work: the existing ticketing system, not another tab with a login screen.
One owner: the ops team inherited the assets, not a dependency on DPR Labs forever.

This is why we are anti-slop. Slop is not merely low-quality text. Slop is any system that cannot tell you what number it is moving, where the test set came from, who owns the result, or what happens when the world changes. Serious AI work is less glamorous and far more profitable: choose the bottleneck, write the metric, build the harness, ship into the workflow, measure honestly.

Mini case

A smaller support triage win we will not overclaim.

A support operations group wanted faster first-pass triage on inbound customer queries. Same discipline: one agreed metric, a held-out sample, delivery inside the existing support tool, and full ownership at handoff. The system drafted routing, pulled relevant account context, and suggested a first response for human review.

Drafting time dropped meaningfully on the initial sample, and the team kept the harness. We are deliberately not quoting a giant headline percentage because the sample was too small to defend publicly. That restraint is the point. If a case study cannot survive its own caveat, it should not be a case study.

Bring one ugly queue

Agree the number first. See the eval before you scale.

If you have a high-volume operational workflow and a metric you would stake a decision on, we will define success before code exists, prove it on your held-out data, and hand you the system you paid for.
