What I've Learned Building With LLM APIs — Mark Phillips

A year ago I started treating LLMs as infrastructure. Not a demo, not a side project — infrastructure. That shift changed how I think about every part of the integration.

Here’s what actually mattered.

Prompts are code

The single biggest mistake I see: treating prompts as configuration. They’re not. A prompt is a function. It has inputs, it has expected outputs, and it breaks in ways you didn’t predict.

Write them like you write code:

Version control them
Test them with representative inputs
Document the failure modes you’ve found

When behavior shifts after a model update, you want to know exactly which prompt is responsible, not go spelunking through a blob of string templates.
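A minimal sketch of what "a prompt is a function" can look like in practice. The Prompt class, the SUMMARIZE_TICKET prompt, and the failure mode it lists are all hypothetical names made up for illustration; the point is only that a prompt gets a name, a version, declared inputs, and a place to record what has gone wrong with it.

```python
from dataclasses import dataclass
from string import Template


@dataclass(frozen=True)
class Prompt:
    """A prompt treated as a versioned function: named inputs in, text out."""

    name: str
    version: str
    template: Template
    known_failure_modes: tuple[str, ...] = ()

    def render(self, **inputs: str) -> str:
        # Template.substitute raises KeyError when an input is missing,
        # so a broken call site fails loudly instead of sending "$ticket".
        return self.template.substitute(**inputs)


# Hypothetical example prompt; the failure mode is illustrative only.
SUMMARIZE_TICKET = Prompt(
    name="summarize_ticket",
    version="2",
    template=Template(
        "Summarize the support ticket below in one sentence.\n"
        "Ticket:\n$ticket"
    ),
    known_failure_modes=("Very long tickets produce vague summaries",),
)

rendered = SUMMARIZE_TICKET.render(ticket="Login fails after password reset")
```

Because the prompt lives in one named, versioned object, a diff in version control shows exactly what changed, and a unit test can call render with representative inputs.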

Token costs compound fast

A simple chat feature that handles 1,000 requests per day can easily burn $50–$500/month depending on how much context you send with each request. Where you land in that range comes down largely to whether you’re sloppy with context.

Things that kill you:

Sending full conversation history when a summary suffices
Embedding large documents verbatim when retrieval would do
Not caching responses for near-identical queries
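For the caching point, here is a rough sketch under a narrow assumption: "near-identical" means queries that differ only in case or whitespace. ResponseCache and the call_model parameter are hypothetical names, and call_model stands in for whatever function actually hits your provider. Catching queries that are merely similar in meaning would need something like embedding similarity, which this does not attempt.

```python
import hashlib
from typing import Callable


class ResponseCache:
    """In-memory cache keyed on a normalized form of the query."""

    def __init__(self) -> None:
        self._store: dict[str, str] = {}

    @staticmethod
    def _key(query: str) -> str:
        # Lowercase and collapse whitespace so trivial variations share a key.
        normalized = " ".join(query.lower().split())
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

    def get_or_call(self, query: str, call_model: Callable[[str], str]) -> str:
        key = self._key(query)
        if key not in self._store:
            # Only pay for a model call on a cache miss.
            self._store[key] = call_model(query)
        return self._store[key]
```

In production you would likely add expiry and a size limit, and put the store somewhere shared rather than in process memory.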

Instrument your token usage from day one. Add it to your dashboards the same way you’d add latency or error rates. You will not regret this.
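To make the arithmetic concrete, here is a small sketch of a cost estimate plus a per-feature token counter. The prices are placeholders chosen for illustration, not any provider's real rates, and monthly_cost_usd and TokenMeter are hypothetical names. The token counts passed to record are assumed to come from the usage figures your provider reports with each response.

```python
from collections import defaultdict


def monthly_cost_usd(
    requests_per_day: int,
    input_tokens: int,
    output_tokens: int,
    usd_per_million_input: float,
    usd_per_million_output: float,
    days: int = 30,
) -> float:
    """Estimate monthly spend from average per-request token counts."""
    per_request = (
        input_tokens * usd_per_million_input
        + output_tokens * usd_per_million_output
    ) / 1_000_000
    return requests_per_day * days * per_request


class TokenMeter:
    """Accumulates token usage per feature so it can be exported to a dashboard."""

    def __init__(self) -> None:
        self.totals = defaultdict(lambda: {"input": 0, "output": 0, "requests": 0})

    def record(self, feature: str, input_tokens: int, output_tokens: int) -> None:
        bucket = self.totals[feature]
        bucket["input"] += input_tokens
        bucket["output"] += output_tokens
        bucket["requests"] += 1


# Placeholder prices (USD per million tokens), for illustration only.
PRICE_IN, PRICE_OUT = 1.0, 5.0

lean = monthly_cost_usd(1000, 500, 300, PRICE_IN, PRICE_OUT)
bloated = monthly_cost_usd(1000, 8000, 300, PRICE_IN, PRICE_OUT)
```

With these placeholder prices, the lean request (500 input tokens) works out to about $60 a month and the bloated one (8,000 input tokens) to about $285, for the same feature and the same traffic.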

Structured output is non-negotiable for integration

If you’re parsing free-text LLM output in production, you’ve introduced a failure mode that will eventually bite you. Use structured output — JSON schema, function calling, tool use — for anything that downstream code touches.

Free text is fine for surfaces where a human reads the response directly. For anything machine-consumed, demand structure.
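Even with structured output enabled on the provider side, it is worth validating what comes back before downstream code touches it. A minimal sketch using only the standard library; the EXPECTED_FIELDS schema, OutputError, and parse_model_output are hypothetical names for illustration, and a real project might use a JSON Schema validator instead.

```python
import json


class OutputError(ValueError):
    """Raised when model output does not match the expected structure."""


# Hypothetical schema: field name -> required Python type.
EXPECTED_FIELDS = {"category": str, "tags": list}


def parse_model_output(raw: str) -> dict:
    """Parse raw model output and check it against EXPECTED_FIELDS."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise OutputError(f"not valid JSON: {exc}") from exc

    if not isinstance(data, dict):
        raise OutputError("expected a JSON object at the top level")

    for field, expected_type in EXPECTED_FIELDS.items():
        if field not in data:
            raise OutputError(f"missing field: {field}")
        if not isinstance(data[field], expected_type):
            raise OutputError(
                f"field {field!r} should be {expected_type.__name__}, "
                f"got {type(data[field]).__name__}"
            )
    return data
```

The failure is now a single, catchable exception type at the boundary, which you can count, log, and retry on, rather than a stray KeyError somewhere deep in business logic.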

Latency is a UX problem, not a technical one

P50 latency for a GPT-4 call is roughly 2–5 seconds, and Claude is similar; the exact figure depends on the model and on how long the response is. Users will tolerate this if you design for it: streaming, skeleton states, perceived progress.

What they won’t tolerate: a frozen UI with no feedback for 4 seconds. That feels broken even if it isn’t.

Stream everything user-facing. What the user perceives is the time to the first token rather than the time to the full response, so the latency feels more like 0.5 seconds.
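The consuming side of streaming is a small loop. The sketch below does not call a real provider: fake_model_stream is a stand-in generator that yields one word at a time, and render_stream and emit are hypothetical names. It shows the two things worth having, which are handing each chunk to the UI as it arrives and measuring time to first chunk separately from total time.

```python
import time
from typing import Callable, Iterable, Iterator


def fake_model_stream(text: str, delay_s: float = 0.01) -> Iterator[str]:
    """Stand-in for a provider's streaming response (illustrative only)."""
    for word in text.split():
        time.sleep(delay_s)
        yield word + " "


def render_stream(chunks: Iterable[str], emit: Callable[[str], None]) -> dict:
    """Forward chunks to the UI as they arrive and record timing."""
    start = time.perf_counter()
    first_chunk_s = None
    parts = []
    for chunk in chunks:
        if first_chunk_s is None:
            # This is the number the user actually feels.
            first_chunk_s = time.perf_counter() - start
        emit(chunk)
        parts.append(chunk)
    return {
        "text": "".join(parts),
        "time_to_first_chunk_s": first_chunk_s,
        "total_s": time.perf_counter() - start,
    }
```

Swapping the fake generator for a real streaming response should leave render_stream unchanged, as long as the provider's client exposes the chunks as an iterable of strings or can be adapted to one.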

The eval problem is real

How do you know if your prompt got better or worse after you changed it? If your answer is “I tried a few examples and it seemed fine,” you have a problem.

Build a small eval suite before you ship. Even 20–30 representative inputs with expected outputs gives you a regression baseline. You don’t need a framework — a JSON file and a script that calls the API and checks outputs is enough.
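Something like the following is all the "JSON file and a script" needs to be. This is a sketch under assumptions: the cases are inlined as a string rather than read from a file, the check is a simple case-insensitive substring match, and stub_model stands in for the real API call. The names run_evals, must_contain, and stub_model are hypothetical.

```python
import json
from typing import Callable

# In practice this would live in its own file, e.g. a cases file in the repo.
CASES_JSON = """
[
  {"input": "Where is my order?", "must_contain": ["order"]},
  {"input": "I want a refund", "must_contain": ["refund"]}
]
"""


def run_evals(cases: list[dict], call_model: Callable[[str], str]) -> list[dict]:
    """Run every case through the model and return the ones that failed."""
    failures = []
    for case in cases:
        output = call_model(case["input"])
        missing = [
            s for s in case["must_contain"] if s.lower() not in output.lower()
        ]
        if missing:
            failures.append(
                {"input": case["input"], "missing": missing, "output": output}
            )
    return failures


def stub_model(prompt: str) -> str:
    """Placeholder for the real API call, so the script runs offline."""
    return f"Thanks for reaching out about: {prompt}"


failures = run_evals(json.loads(CASES_JSON), stub_model)
print(f"{len(failures)} failing case(s)")
```

Substring checks are crude, but they are enough to catch a regression after a prompt edit or a model update, and the check function can get smarter later without changing the shape of the suite.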

These aren’t revolutionary insights. They’re the basics that are easy to skip when you’re moving fast. The teams I’ve seen get burned by LLM integrations usually weren’t missing sophistication — they were missing discipline on the fundamentals.