3 Takeaways From Building AI Agents in Freight Tech

I spent 2 years building AI agents at a freight tech startup. Production agents, real workflows, millions of conversations. Here's what stuck.

See like an LLM.

As an agent builder, you don't need to know every detail of RL post-training. But understanding how tokens flow through the model at inference time, how text generation actually works? That changes how you build.

Once you internalize the model's perspective, context engineering clicks. How else would you expect the model to understand what you want? This framing led us to foundational techniques like email thread compaction and dynamic metadata packing, all aimed at helping the model generalize accurately and produce better outcomes.
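To make the idea concrete, here's a minimal sketch of what thread compaction might look like. The strategy shown (keep the newest messages verbatim, truncate older ones) is an assumption for illustration, not the actual production implementation:

```python
def compact_thread(messages, keep_full=2, max_old_chars=200):
    """Compact an email thread before it goes into the prompt.

    Keeps the `keep_full` most recent messages verbatim and truncates
    older ones, so the model sees full detail where it matters most.
    """
    older, recent = messages[:-keep_full], messages[-keep_full:]
    compacted = [
        m[:max_old_chars] + ("…" if len(m) > max_old_chars else "")
        for m in older
    ]
    return compacted + recent
```

The point isn't this exact heuristic; it's that you decide what the model sees, token by token, instead of dumping raw history into the context window.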

Same intuition applies to task design. If you want a probabilistic model to do something on behalf of users at 95%+ accuracy, you don't hand it a complicated question. You strip it back to the simplest possible ask.

Andrej Karpathy nailed this with his task contractor analogy. Would you give a human contractor clear instructions that cover all your asks? Or convoluted ones that skip key details? Agents are no different.

Great models still make weird mistakes at scale.

Run a frontier LLM on millions of inputs for the same task. You'll see strange things at the edges.

For our email and voice agents, regression evals were the moat. Every weird failure got pinned and bolted onto our test suite. Over months, that built into a safety net we couldn't live without.

The compounding was satisfying. When a new failure hit (reported by a customer, caught in human-led data review), we'd add it to the suite, then run the full regression suite at repeat=10 and demand near pass^10 accuracy before shipping. That flywheel is what I point to when people ask how we built reliable prompts.
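A minimal sketch of that repeat-and-gate loop, since the idea generalizes well. The `agent` callable and the case shape here are hypothetical stand-ins, and a real harness would compare outputs with an LLM judge or fuzzy matcher rather than strict equality:

```python
def run_regression(cases, agent, repeat=10):
    """Run each case `repeat` times; a case passes only if ALL runs pass.

    Requiring every repetition to succeed approximates pass^repeat,
    which surfaces flaky, nondeterministic failures that a single
    run would miss.
    """
    failures = []
    for case in cases:
        ok = all(agent(case["input"]) == case["expected"] for _ in range(repeat))
        if not ok:
            failures.append(case["id"])
    return failures
```

Gate your deploy on `run_regression(...)` returning an empty list, and every pinned failure stays fixed forever.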

And the curated eval data gave us other superpowers: finetuning when warranted, onboarding new team members faster, aligning ground truth across the team.

Capability evals (crafting hard test sets) were less useful for us. Our agents had one job: facilitate the freight transaction. Since gpt-4o, LLMs have been able to handle what we were asking. Fixating on impossible inputs (garbled inbound emails, voice calls with huge transcription errors) wasn't worth it. Those failures are acceptable. Better to build evals around valid input data.

One more thing here: we were observability maximalists. Log everything. You can't fix what you can't see. When something breaks at 2am, you want to know exactly which step went sideways.
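In that spirit, a sketch of step-level tracing, assuming a hypothetical `traced` decorator wrapping each pipeline step (stdlib `logging` only; a production setup would ship these records to a real observability backend):

```python
import functools
import json
import logging
import time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("agent")


def traced(step_name):
    """Log outcome and latency for a pipeline step as structured JSON."""
    def wrap(fn):
        @functools.wraps(fn)
        def inner(*args, **kwargs):
            start = time.time()
            try:
                result = fn(*args, **kwargs)
                log.info(json.dumps({
                    "step": step_name,
                    "ok": True,
                    "ms": round((time.time() - start) * 1000),
                }))
                return result
            except Exception as exc:
                # Failures get logged too, so the 2am debugging session
                # starts with the exact step that went sideways.
                log.info(json.dumps({"step": step_name, "ok": False, "err": str(exc)}))
                raise
        return inner
    return wrap
```

Decorate each step (`@traced("classify_email")`, `@traced("draft_reply")`, and so on) and the trail of JSON lines tells you exactly where a run broke.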

I'm bullish on UIs that aren't chat.

We deployed our agents through email and voice. Not chat.

Chat is a trap for many agent products. (There's a reason Claude Code is still a CLI.)

If you do build a chat interface, which is sometimes the right call, know that you're parking yourself next to ChatGPT and Claude. Users have those benchmarks burned into their brains. Anything that feels even slightly worse is instantly obvious, and painful. You probably don't want that fight.

One more thing.

Most of what I put into practice traces back to a few great blogs (Hamel Husain, Eugene Yan), Karpathy (the goat 🐐), and tech Twitter (I refuse to call it X). If you're building agents, the knowledge is out there. The community is absurdly generous with what they share.