3 Takeaways From Building AI Agents in Freight Tech
I spent two years building AI agents at a freight tech startup. Production agents, messy business processes, millions of conversations. Here's what stuck.
See like an LLM.
Understanding how tokens flow through an LLM at inference time changes how you build.
While building email and voice agents, this framing was critical. Once you internalize the model's perspective, context engineering clicks. We focused on foundational techniques like improving email thread compaction and dynamic metadata packing, while retaining need-to-haves like prompt caching (🤑💸).
These efforts paid off in better generalization and more reliable outputs.
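To make the compaction idea concrete, here's a minimal sketch of the shape of the technique: keep the most recent messages verbatim and collapse older ones into a short summary block, so the prompt stays small and its prefix stays stable for caching. The function name, message format, and thresholds are illustrative, not our production code.

```python
def compact_thread(messages, keep_last=3, max_summary_chars=120):
    """Compact an email thread for prompting.

    messages: list of dicts like {"from": ..., "body": ...}, oldest first.
    Recent messages are kept verbatim; older ones become one-line summaries.
    """
    if len(messages) <= keep_last:
        return messages
    older, recent = messages[:-keep_last], messages[-keep_last:]
    summary_lines = []
    for m in older:
        # Collapse whitespace/quoted-reply artifacts, then truncate.
        body = " ".join(m["body"].split())
        summary_lines.append(f'{m["from"]}: {body[:max_summary_chars]}')
    summary = {
        "from": "system",
        "body": "Earlier in thread:\n" + "\n".join(summary_lines),
    }
    return [summary] + recent
```

In production you'd likely summarize with a model rather than truncate, but the structural point is the same: the agent sees a bounded, stable context instead of an ever-growing raw thread.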
Same intuition applies to task design. If a probabilistic model needs to do something on behalf of users at 98%+ accuracy, you really don't want to hand it a complicated question. Strip the task back to the simplest possible ask.
Andrej Karpathy nailed this with a "task contractor" analogy in his 2023 talk at Microsoft Build. Paraphrasing:
"Would you hand a contractor clear, complete instructions? Or vague ones that skip key details? The quality of your instructions shapes the quality of the outcome." — Andrej Karpathy
Agents are no different. Even if the use case is complex/open-ended, strive to make your prompting deliberate and aimed at generalization.
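One way to picture "strip the task back to the simplest possible ask": replace one open-ended mega-prompt with narrow questions that have constrained answer spaces, and keep the control flow in code. Everything here (the prompts, the `ask_llm` helper, the freight example) is a hypothetical sketch, not a real API.

```python
# One complex, open-ended ask (harder for a probabilistic model to nail):
COMPLEX = (
    "Read this email, figure out if it's a rate quote, extract the rate, "
    "decide if we should accept, and draft a reply."
)

# Simpler: one narrow question per call, constrained answer space.
IS_QUOTE = "Does this email contain a rate quote? Answer YES or NO.\n\n{email}"
EXTRACT_RATE = "Extract the quoted all-in rate in USD as a number only.\n\n{email}"

def handle_email(email: str, ask_llm) -> dict:
    """ask_llm: callable(prompt) -> str (a stand-in for any LLM client).

    Deterministic control flow lives in code; probabilistic judgment is
    confined to small, individually checkable asks.
    """
    if ask_llm(IS_QUOTE.format(email=email)).strip().upper() != "YES":
        return {"is_quote": False}
    rate = float(ask_llm(EXTRACT_RATE.format(email=email)))
    return {"is_quote": True, "rate": rate}
```

Each narrow ask is also trivially evaluable on its own, which matters once you start building regression suites.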
Great models still make weird mistakes at scale.
Run an LLM on millions of inputs for the same task, and you'll see strange things at the edges.
For our email and voice agents, regression evals were the key to remedying this problem.
Every weird failure we observed in internal review or from customer feedback got pinned and bolted onto our test suite. Over months, that built into a safety net we couldn't live without.
The compounding was satisfying.
And the curated eval data gave us other superpowers: finetuning when warranted, onboarding new team members faster, aligning ground truth across the team, the list goes on.
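The harness behind this can be tiny. A sketch of the pattern, with illustrative names and a made-up case format: every failure seen in review gets pinned as a case, and the suite only grows.

```python
import json

def load_cases(path):
    """Load pinned regression cases from a JSONL file, one case per line."""
    with open(path) as f:
        return [json.loads(line) for line in f]

def run_regression_suite(agent_fn, cases):
    """agent_fn: callable(input) -> output.
    cases: dicts like {"id": ..., "input": ..., "expected": ...}.
    Returns the list of failures for triage; empty means the net held.
    """
    failures = []
    for case in cases:
        got = agent_fn(case["input"])
        if got != case["expected"]:
            failures.append(
                {"id": case["id"], "got": got, "expected": case["expected"]}
            )
    return failures
```

Exact-match comparison is the simplest version; for free-text outputs you'd swap in a rubric check or an LLM judge, but the pin-and-rerun loop is the same.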
Crafting hard test sets (capability evals) was less useful for us. Our agents had one job: facilitate the freight transaction. Since gpt-4o, LLMs have been able to handle what we were asking. Fixating on impossible inputs (garbled inbound emails, voice calls with huge transcription errors) wasn't worth it. Those failures are acceptable. Better to build evals around valid input data.
One more thing here: we were observability maximalists. Log everything. You can't fix what you can't see. When something breaks at 2am, you want to know exactly which step went sideways.
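"Log everything" in practice meant structured, per-step records you can filter and replay. A minimal sketch (field names illustrative; in a real system this goes to your logging pipeline, not stdout):

```python
import json
import time
import uuid

def log_step(run_id, step, payload, log_fn=print):
    """Emit one structured record per agent step.

    step: a stable name like "parse_email", "llm_call", "send_reply".
    payload: the step's inputs/outputs, truncated as needed.
    Returns the record so callers can also attach it to traces.
    """
    record = {
        "ts": time.time(),
        "run_id": run_id,
        "step": step,
        "payload": payload,
    }
    log_fn(json.dumps(record))
    return record

# One run_id per conversation ties every step together,
# so a 2am failure is a single grep away.
run_id = str(uuid.uuid4())
log_step(run_id, "parse_email", {"subject": "Rate con #4821"})
```

The key design choice is the shared `run_id`: when something breaks, you pull the full step sequence for that one conversation instead of spelunking through interleaved logs.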
I'm bullish on UIs that aren't chat.
Chat is a trap for task agents powering business automations. We deployed our agents through email and phone channels instead.
If you do build a chat interface, which is perhaps the right call for other use cases, remember that you're parking yourself next to ChatGPT and Claude. Users have those benchmarks burned into their brains. Anything that feels even slightly worse is instantly obvious, and painful.
One more thing.
A lot of what I put into practice traces back to a few great blogs (Hamel Husain, Eugene Yan, Andrej Karpathy the goat 🐐). If you're building agents, the knowledge is out there. Start building.