Code Agent and Token Cost

May 11

This one runs a bit long — about ten minutes. If you're a heavy VibeCoding user or curious about how LLM billing actually works under the hood, it's worth finishing.

VibeCoding is becoming infrastructure for a lot of engineers. When you hit a rate limit or context ceiling in the middle of a long task, the quickest fix is obvious: throw $200 at a ChatGPT Pro subscription, throw another $200 at Claude Max, problem gone.

But suppose we don't just buy our way around this, and instead ask the real question: where the hell do all the tokens go? Aren't you curious? I sure am.

Let's start with billing. Most mainstream LLM APIs, including Anthropic's and OpenAI's, charge by token count, with separate per-token rates for input and output. Input is cheaper per token; output costs more. But in the specific context of Code Agents, the volume is lopsided the other way: input tokens are the overwhelming majority, often over 80% of total consumption, because every round resends the accumulated history.
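To make the billing split concrete, here's a minimal sketch of the arithmetic. The rates and token counts are made-up placeholders, not any provider's real prices, and `Pricing`, `request_cost`, and `input_share` are names I invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """Hypothetical USD rates per million tokens (placeholders, not real prices)."""
    input_per_mtok: float
    output_per_mtok: float


def request_cost(pricing: Pricing, input_tokens: int, output_tokens: int) -> float:
    """Cost of one request: input and output are billed at separate rates."""
    total = (input_tokens * pricing.input_per_mtok
             + output_tokens * pricing.output_per_mtok)
    return total / 1_000_000


def input_share(input_tokens: int, output_tokens: int) -> float:
    """Fraction of all tokens in the request that are input tokens."""
    return input_tokens / (input_tokens + output_tokens)


# A typical agent round: a big accumulated context in, a short tool call out.
pricing = Pricing(input_per_mtok=3.0, output_per_mtok=15.0)
cost = request_cost(pricing, input_tokens=50_000, output_tokens=1_000)
share = input_share(input_tokens=50_000, output_tokens=1_000)
print(f"cost=${cost:.3f}, input share={share:.1%}")
```

Even with output priced at five times the input rate in this toy setup, the input side accounts for most of the bill, simply because there is so much more of it.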

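Some providers offer a KV Cache Discount: when the prefix of a new request matches the prefix of a recent one, the matching portion hits the cache and is billed at a significant discount. The mechanism makes physical sense: the server can reuse the keys and values it already computed for that prefix instead of running prefill over it again. The catch is that it only applies to a shared prefix. Change something early in the input and everything after it misses.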
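Here's a rough sketch of how prefix-based cache billing behaves. It's a simplification under my own assumptions: real providers differ on minimum prefix length, cache granularity, and expiry, and the `cached_multiplier` of 0.1 is a placeholder rather than a quoted price. The function names are hypothetical.

```python
def common_prefix_len(prev_tokens, new_tokens):
    """Number of leading tokens that are identical in both requests."""
    n = 0
    for a, b in zip(prev_tokens, new_tokens):
        if a != b:
            break
        n += 1
    return n


def input_cost(prev_tokens, new_tokens, rate_per_mtok, cached_multiplier):
    """Bill the shared prefix at a discounted rate and the rest at full rate."""
    cached = common_prefix_len(prev_tokens, new_tokens)
    fresh = len(new_tokens) - cached
    return (cached * cached_multiplier + fresh) * rate_per_mtok / 1_000_000


# Integers stand in for token ids.
previous = list(range(1000))
appended = previous + list(range(1000, 1200))  # history untouched, 200 new tokens
edited = [-1] + appended[1:]                   # same length, first token changed

hit_cost = input_cost(previous, appended, rate_per_mtok=3.0, cached_multiplier=0.1)
miss_cost = input_cost(previous, edited, rate_per_mtok=3.0, cached_multiplier=0.1)
print(hit_cost, miss_cost)
```

The two requests are the same length, but changing a single token at the very start makes the whole thing bill at full rate.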

This will become a plot point. We'll come back to it.

Most Code Agents today, including Claude Code, still run on the classic ReAct framework. The original ReAct paper is over three years old. Its core loop: the model reasons through a CoT, produces a Tool Call, receives an Observation, then reasons again and produces the next Tool Call.
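The loop is small enough to sketch in a few lines. This is a toy version under my own assumptions, not how Claude Code or any real agent is implemented: `react_loop` is a made-up name, `scripted_model` is a stand-in for an actual LLM call, and the single `read_file` tool is fake.

```python
def react_loop(query, model, tools, max_steps=10):
    """Minimal ReAct-style loop: thought -> tool call -> observation, repeated."""
    context = [("user", query)]
    for _ in range(max_steps):
        # The model sees the ENTIRE context on every step.
        thought, action, arg = model(context)
        context.append(("thought", thought))
        if action == "finish":
            return arg, context
        observation = tools[action](arg)
        context.append(("action", f"{action}({arg})"))
        context.append(("observation", observation))
    return None, context  # gave up after max_steps


def scripted_model(context):
    """Hypothetical stand-in for an LLM: read one file, then finish."""
    seen = sum(1 for role, _ in context if role == "observation")
    if seen == 0:
        return ("I should read the file first.", "read_file", "app.py")
    return ("I have what I need.", "finish", "done")


tools = {"read_file": lambda path: f"<contents of {path}>"}
answer, context = react_loop("Fix the bug", scripted_model, tools)
print(answer, len(context))
```

Notice that nothing ever leaves `context`. That detail is where the cost discussion starts.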

It was an elegant design when it was proposed. But over the past three years, the agent research community has produced a lot of optimization paradigms — Plan Before Act, Hierarchical Planning, Task Decomposition with Memory… Code Agents have adopted basically none of them.

The arrogance isn't entirely unjustified — engineering stability will always outrank academic novelty at the product level. But that doesn't mean we can't look at the cost.

The cost lands on token consumption, and ultimately on the user. The worst offender is context management. The only way to describe it is: A Piece of SHIT.

Each loop iteration, the Agent appends the current Tool Call, the Observation, and the model's own CoT to the Context, which already holds the full user Query, then feeds the entire thing back unchanged on the next round. Under this model, the Context itself grows linearly with the number of rounds, but because the whole thing is resent every round, the cumulative input tokens you pay for grow quadratically. And most of it becomes historical noise that's nearly useless for whatever the model is actually trying to do right now.
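A quick back-of-the-envelope simulation shows the shape of the curve. The numbers are invented (a 1,000-token query, 2,000 tokens added per round) and `simulate` is just an illustrative helper.

```python
def simulate(rounds, query_tokens, tokens_per_round):
    """Return the context size sent each round and the total input tokens billed."""
    context = query_tokens
    total_input = 0
    sizes = []
    for _ in range(rounds):
        total_input += context       # the whole context is resent as input
        sizes.append(context)
        context += tokens_per_round  # tool call + observation + CoT appended
    return sizes, total_input


sizes_10, total_10 = simulate(10, query_tokens=1_000, tokens_per_round=2_000)
sizes_20, total_20 = simulate(20, query_tokens=1_000, tokens_per_round=2_000)

print(sizes_10[-1], total_10)  # 19000 100000
print(sizes_20[-1], total_20)  # 39000 400000
```

Doubling the rounds roughly doubles the final context size but quadruples the total input tokens. In closed form, with n rounds, query size q, and t tokens added per round, the total is n*q + t*n*(n-1)/2.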

When Context approaches the model's limit, Claude Code triggers Auto Compact — an interesting mechanism in itself. It's not a semantic summarization pass; it's a rule-based structural pruning at the linguistic level. Codex takes a blunter approach and just terminates the session.

Claude's context window is around 200K tokens. That sounds large until you've done a few dozen tool calls on any reasonably-sized codebase — then it's gone fast.

Solutions to this problem fall into two camps: Harness Level and Model Level.

Harness Level means engineering the Agent's runtime framework without touching the model itself. Two main approaches:

First, fine-grained Context management — actively filtering and compressing historical information, keeping only what's genuinely relevant to the current task. Compressing Observations is the primary lever.
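As a deliberately naive illustration of Observation compression (my own sketch, not a method from any particular paper or product), you can keep the most recent Observations intact and truncate the older ones. `compress_observations`, `keep_recent`, and `max_chars` are all hypothetical names and knobs.

```python
def compress_observations(context, keep_recent=2, max_chars=200):
    """Truncate old observations; leave the most recent ones and all other roles alone."""
    obs_indices = [i for i, (role, _) in enumerate(context) if role == "observation"]
    recent = set(obs_indices[-keep_recent:]) if keep_recent > 0 else set()
    compressed = []
    for i, (role, text) in enumerate(context):
        if role == "observation" and i not in recent and len(text) > max_chars:
            omitted = len(text) - max_chars
            text = text[:max_chars] + f"\n[... {omitted} characters omitted ...]"
        compressed.append((role, text))
    return compressed


history = [
    ("user", "Fix the failing test"),
    ("observation", "a" * 1000),
    ("thought", "Need to look at the test file."),
    ("observation", "b" * 1000),
    ("observation", "c" * 1000),
]
slim = compress_observations(history, keep_recent=1, max_chars=100)
print([len(text) for _, text in slim])
```

Real systems would want something smarter than character truncation, such as relevance filtering. Also note that this rewrites earlier entries in the Context, which changes the request prefix, and that matters for the cache discount mentioned earlier.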

Second, introducing a Plan Before Code paradigm — having the Agent complete an explicit planning pass before execution, reducing aimless exploratory Tool Calls and cutting token consumption by reducing loop iterations.

Papers along these lines have been coming out for about a year. The arrogant CC (Claude Code) has shown zero interest. To be fair, from an engineering standpoint, adding structural complexity always introduces potential side effects.

The standard academic benchmark for validating these methods is SWE-Bench. Hit good numbers there and you can publish. But SWE-Bench is fundamentally a closed evaluation with deterministic answers. Most real Code Agent usage is open-ended exploration — unfamiliar codebases, undefined requirements — far beyond what SWE-Bench covers. Academic proof doesn't translate cleanly to actual user experience.

Model Level is an entirely different angle: keep the Agent code unchanged, but use a better model. If it can solve your problem in 3 loops instead of 15, token consumption drops on its own, and because every loop resends the whole history, it drops by a lot more than 5x.
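Here's a small worked example of why fewer loops pays off so well. It assumes both models add the same number of tokens per round and only counts input tokens, which is a simplification; the figures and the helper name are mine.

```python
def cumulative_input_tokens(rounds, query_tokens, tokens_per_round):
    """Total input tokens when the full, growing context is resent every round."""
    return rounds * query_tokens + tokens_per_round * rounds * (rounds - 1) // 2


weaker = cumulative_input_tokens(15, query_tokens=1_000, tokens_per_round=2_000)
stronger = cumulative_input_tokens(3, query_tokens=1_000, tokens_per_round=2_000)

# How many times more expensive per token the stronger model could be
# before it stops being the cheaper option (input tokens only).
break_even_price_ratio = weaker / stronger

print(weaker, stronger, break_even_price_ratio)  # 225000 9000 25.0
```

Under these toy numbers, cutting loops by 5x cuts input tokens by 25x, so the better model could charge far more per token and still come out ahead.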

The most striking news on this front came from DeepSeek. DeepSeek V4's API pricing is near-disruptive: after two rounds of discounts, it lands at roughly one-tenth of the baseline price. At that level, almost no token compression technique can match the savings, because you can't realistically compress your way to a 10x reduction in tokens.

What's even more counterintuitive: because of KV Cache Discounts, some optimization approaches that reduce raw token count actually end up costing more, because rewriting earlier parts of the input changes the prefix and breaks the cache hit pattern. Surprising at first, but completely logical once you understand the billing mechanics.
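To see how that can happen, here's a toy comparison. Everything in it is an assumption for illustration: cached tokens billed at 10% of the input rate, a compressor that halves the context but rewrites it every round so nothing hits the cache (a worst case), and invented token counts. Both function names are hypothetical.

```python
def append_only_run(rounds, query_tokens, tokens_per_round,
                    rate_per_mtok, cached_multiplier):
    """History is only appended to, so last round's context is a cached prefix."""
    tokens_sent, cost = 0, 0.0
    previous_context, context = 0, query_tokens
    for _ in range(rounds):
        cached = previous_context
        fresh = context - cached
        tokens_sent += context
        cost += (cached * cached_multiplier + fresh) * rate_per_mtok / 1_000_000
        previous_context = context
        context += tokens_per_round
    return tokens_sent, cost


def compressed_run(rounds, query_tokens, tokens_per_round,
                   rate_per_mtok, keep_ratio):
    """Context is shrunk each round, but the rewrite means zero cache hits."""
    tokens_sent, cost = 0, 0.0
    raw_context = query_tokens
    for _ in range(rounds):
        sent = int(raw_context * keep_ratio)
        tokens_sent += sent
        cost += sent * rate_per_mtok / 1_000_000
        raw_context += tokens_per_round
    return tokens_sent, cost


raw_tokens, raw_cost = append_only_run(20, 1_000, 2_000, 3.0, 0.1)
slim_tokens, slim_cost = compressed_run(20, 1_000, 2_000, 3.0, 0.5)
print(raw_tokens, round(raw_cost, 4))
print(slim_tokens, round(slim_cost, 4))
```

In this setup the compressed run sends half as many tokens and still costs well over twice as much. A compressor that keeps a stable prefix would do better, so treat this as the pessimistic end of the range.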

Worth noting: some Claude Code developers are pretty dismissive of Harness Level approaches. They don't want to introduce complex context management at the harness layer. Their position is that whatever CC can't solve today, the next model will handle.

Maybe that's principled engineering conservatism. Maybe it's passing the buck.

There's another take: on a long enough timeline, obsessing over token counts is just a phase. A mentor of mine compared tokens to mobile data — we might be in the 3G era right now. When 5G arrives, nobody cares how much data a single request burns.

That analogy has some weight. Compute costs will keep falling. Context windows will keep growing; 1M-token windows have already started to appear. What feels like a bottleneck today might genuinely end up as a historical footnote from the transition period.

My own take: I'm keeping an open mind on all of this, but I lean Model Level right now. Partly hindsight: Harness Level approaches have been around for a year and none of them have made it into production at scale, which tells you something. Partly distribution: if Model Level solves the problem, it'll spread like DeepSeek did. People running low on tokens will find the better API on their own.

And from a vendor's perspective: token cost is temporary. Time and compute are forever.

If you have a take or a solution, reach out — happy to talk.