Cost / Efficiency

Tokens, Usage & Limits

Understanding the unit of AI work — and how to stop burning through it.

TL;DR

Every interaction with an AI model has a cost measured in tokens. Tokens are not a billing abstraction — they are the fundamental unit of what the model reads and writes. Understanding how tokens work, where they accumulate silently, and how usage limits impose real constraints on your workflows is the difference between an AI integration that scales and one that surprises you with a rate-limit wall at the worst possible moment.

What Tokens Actually Are

A token is a chunk of text: not a character and not a word, but something in between. Models process language by first converting raw text into a sequence of token IDs using a tokenizer trained alongside the model. The tokenizer groups common character sequences into single tokens, so frequently occurring words like "the" or "function" often map to a single token, while rare or constructed words get split into several.

As a practical rule of thumb for English prose: roughly 1 token per 4 characters, or about 750 words per 1,000 tokens. But this varies significantly:

# English prose — typical density
"The quick brown fox" = 5 tokens

# Code — often denser, more tokens per "word"
"getUserPreferences(userId)" = 7 tokens

# Whitespace and indentation count
"    return {" = 4 tokens  # indentation is tokenized too; exact counts vary by tokenizer

# Repeated boilerplate multiplies fast
A 200-line config file ≈ 800–1,200 tokens
Diagram showing how 'get user id' tokenizes into 3 tokens while 'getUserId()' tokenizes into 5 tokens — illustrating that camelCase code is more token-dense than prose.
Same character count, different token count. camelCase identifiers and punctuation split more aggressively than prose.

The practical implication: code, JSON, and structured data almost always consume more tokens per meaningful unit than natural language does. Attaching a 500-line file is not a neutral act.
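As a rough illustration of the 4-characters-per-token rule of thumb, a budget estimate can be sketched as follows. The function name `estimateTokens` is hypothetical, and the ratio is only a heuristic for English prose. Code and JSON usually run higher, so rely on your provider's tokenizer or token-counting endpoint when accuracy matters.

```javascript
// Heuristic only: roughly 4 characters per token for English prose.
// Real counts depend on the model's tokenizer.
function estimateTokens(text, charsPerToken = 4) {
  return Math.ceil(text.length / charsPerToken);
}

const prose = "The context window is the total number of tokens the model can process.";
console.log(estimateTokens(prose)); // a rough estimate, not an exact count
```

An estimate like this is good enough for deciding whether an attachment is "small" or "large" before sending it, which is usually the decision that matters.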

Input vs. Output Tokens

Every API call involves two distinct token pools. Input tokens are everything you send: the system prompt, any conversation history, documents, code snippets, and your current message. Output tokens are what the model generates in response.

Both count toward rate limits and, on paid APIs, both appear in your billing. Output tokens are typically priced at a multiple of input tokens — often 3–5× — because generation is computationally heavier than encoding. A model asked to write a 500-line module is doing something far more expensive than a model asked to review one.
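To make the input/output asymmetry concrete, here is a minimal cost sketch. The prices are placeholders chosen to give a 5× output multiple, not any provider's real list prices, and `estimateCostUSD` is a hypothetical helper.

```javascript
// ASSUMPTION: placeholder prices in USD per million tokens, not real list prices.
const ASSUMED_PRICES = { inputPerMTok: 3, outputPerMTok: 15 };

function estimateCostUSD(usage, prices = ASSUMED_PRICES) {
  const inputCost = (usage.inputTokens / 1_000_000) * prices.inputPerMTok;
  const outputCost = (usage.outputTokens / 1_000_000) * prices.outputPerMTok;
  return inputCost + outputCost;
}

// Reviewing a module: large input, small output.
const review = estimateCostUSD({ inputTokens: 8_000, outputTokens: 500 });
// Writing a module: small input, large output.
const write = estimateCostUSD({ inputTokens: 500, outputTokens: 8_000 });
console.log({ review, write });
```

With the same total token count, the write-heavy call comes out several times more expensive under these assumed prices.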

The conversation window is cumulative. Every message you send also carries every prior message. A 30-turn conversation is not 30 requests of equal size — it is 30 requests of rapidly growing size.

The Context Window

The context window is the total number of tokens the model can process in a single call — input and output combined. Modern models offer large windows (100k–200k tokens), and this can create a false sense of abundance. The limit is real: exceed it and the call fails. Approach it carelessly and you will pay for tokens that do not meaningfully improve the result.

Horizontal stacked bar diagram showing the anatomy of a 200,000-token context window: system prompt, conversation history, current message, and available output budget.
What actually fills the context window on a typical API call. History is the only segment that grows automatically — and it grows on every turn.
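A pre-flight check along these lines can catch an overfull request before the API rejects it. This is a sketch: the function names are hypothetical, the 200,000-token default mirrors the diagram above, and the token counts are assumed to come from your own estimates or the provider's token counter.

```javascript
// Tokens left for the response after all input segments are accounted for.
function remainingOutputBudget(parts, contextWindow = 200_000) {
  const inputTokens = parts.systemPrompt + parts.history + parts.currentMessage;
  return contextWindow - inputTokens;
}

// True when the requested output allowance still fits in the window.
function fitsContext(parts, maxOutputTokens, contextWindow = 200_000) {
  return remainingOutputBudget(parts, contextWindow) >= maxOutputTokens;
}

const call = { systemPrompt: 2_000, history: 150_000, currentMessage: 1_000 };
console.log(remainingOutputBudget(call)); // 47000
console.log(fitsContext(call, 4_096));    // true
```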

Where Tokens Accumulate Without You Noticing

Wasteful token consumption rarely comes from any single obvious source. It accumulates in patterns that feel harmless in isolation:

System Prompt Repetition

Your system prompt is sent with every single API call. A 2,000-token system prompt across 1,000 daily calls is 2 million tokens before a user types a word. If that prompt contains a 400-line style guide, a lengthy disclaimer, and three examples that only apply to one edge case — every call pays for all of it.

Unbounded Conversation History

Chat interfaces that replay the entire conversation thread on each turn are the most common source of runaway token usage. Turn 1 sends 500 tokens. Turn 20 sends 12,000 tokens, mostly history the model has already "seen." Without truncation or summarization, the size of each call grows linearly with the turn count, so the cumulative cost of the whole session grows roughly quadratically.

Stacked bar chart showing token cost per API call growing across 10 conversation turns. The cyan top segment (new message) stays constant while the blue bottom segment (history) grows dramatically each turn.
The new message is a constant cost. The history carried forward is not. By turn 10, the bulk of every API call is conversation you've already paid for.
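The arithmetic behind that growth is easy to check. The sketch below assumes, for simplicity, that every turn adds the same number of tokens to the transcript and that the full transcript is resent each time; both function names are hypothetical.

```javascript
// ASSUMPTION: each turn adds a fixed number of tokens to the transcript.
function inputTokensAtTurn(turn, tokensPerTurn) {
  return turn * tokensPerTurn;
}

// Total input tokens paid for across the whole session.
function cumulativeInputTokens(turns, tokensPerTurn) {
  let total = 0;
  for (let turn = 1; turn <= turns; turn++) {
    total += inputTokensAtTurn(turn, tokensPerTurn);
  }
  return total; // equals tokensPerTurn * turns * (turns + 1) / 2
}

console.log(inputTokensAtTurn(20, 600));     // 12000
console.log(cumulativeInputTokens(20, 600)); // 126000
```

At 600 tokens per turn, the twentieth call alone carries 12,000 input tokens, and the session as a whole has paid for 126,000, more than ten times what twenty independent 600-token calls would cost.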

Whole-File Attachments

Attaching a complete file when only a function signature or a specific section is relevant is one of the most common and most avoidable sources of waste. A 600-line service file attached because the user wants to modify one method contributes several thousand tokens of context, most of which the model will not use.

Redundant Examples

Three worked examples in a prompt may be necessary for the first call. They are rarely necessary for call 47. Prompts that were written for calibration but never trimmed for production continue paying their original cost indefinitely.

Techniques for Reducing Token Waste

Write Tight System Prompts

A system prompt should contain the minimum needed to reliably constrain the model's behavior. Audit it ruthlessly: if a line has not changed the output in any recent test, remove it. Instructions are not free insurance — they cost tokens on every call. Treat the system prompt as production code, not a scratchpad.

# Bloated — 180 tokens
You are an expert senior software engineer with 15 years of experience.
You write clean, well-documented, maintainable code. Always add JSDoc
comments. Always explain your reasoning. Be concise. Be thorough.
Never write incomplete implementations. Prefer readability over
cleverness. Do not use deprecated APIs...

# Tight — 40 tokens
Senior engineer. Return only the changed function. JSDoc on public
methods. No explanations unless asked.

Summarize or Truncate History

For long-running sessions, maintain a rolling summary rather than a full transcript. After every N turns, ask the model to compress the conversation into a paragraph of key decisions and context. Replace the raw history with that summary. You lose verbatim recall but preserve the signal — which is usually all that matters.
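One way to sketch the rolling-summary idea: keep the newest messages that fit a token budget and put a caller-supplied summary in front of them. Everything here is illustrative. `compactHistory` is a hypothetical helper, the 4-characters-per-token estimate is a heuristic, and the summary text is assumed to come from an earlier model call that is not shown.

```javascript
// Heuristic estimate: roughly 4 characters per token.
const estimateTokens = (text) => Math.ceil(text.length / 4);

// Keep the newest messages that fit within maxTokens. If anything was
// dropped and a summary is provided, prepend it as a stand-in.
function compactHistory(messages, maxTokens, summary) {
  const kept = [];
  let used = 0;
  // Walk backward so the most recent turns are preserved first.
  for (let i = messages.length - 1; i >= 0; i--) {
    const cost = estimateTokens(messages[i].content);
    if (used + cost > maxTokens) break;
    kept.unshift(messages[i]);
    used += cost;
  }
  const dropped = messages.length - kept.length;
  if (dropped === 0 || !summary) return kept;
  const summaryMessage = {
    role: "user",
    content: `Summary of earlier conversation: ${summary}`,
  };
  return [summaryMessage, ...kept];
}
```

Calling `compactHistory(history, 4000, summaryText)` before each request caps the history segment at roughly 4,000 tokens plus the summary. Whether the summary travels as a user message or inside the system prompt is a design choice; this sketch picks the former only to stay short.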

Send Only Relevant File Sections

Instead of attaching whole files, extract and send only the relevant function, class, or block. A file read operation that grabs lines 120–165 instead of lines 1–600 is not just cheaper — it focuses the model's attention and often produces better results because the irrelevant context cannot interfere.
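Extracting a line range is a one-liner in most environments. A minimal sketch, with `extractLines` as a hypothetical helper operating on file contents that have already been read into a string:

```javascript
// Return lines startLine..endLine (1-indexed, inclusive) of a source string.
function extractLines(source, startLine, endLine) {
  return source.split("\n").slice(startLine - 1, endLine).join("\n");
}

// Example: send only lines 120-165 of a file instead of all 600.
// const snippet = extractLines(fileContents, 120, 165);
```

In practice the range would come from a symbol lookup or a search for the function name rather than hard-coded numbers.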

Use Prompt Caching

Anthropic's API and some other providers support prompt caching: marking a stable prefix of the input (typically the system prompt plus any large reference documents) for server-side caching. On Anthropic's API, tokens read from the cache are billed at a significant discount, around 10% of the normal input rate, while the call that first writes the cache pays a modest premium. If your system prompt and reference documents don't change between calls, caching makes them nearly free after the first one.

// Anthropic Messages API request body (model and max_tokens omitted): cache the stable prefix
{
  "system": [
    {
      "type": "text",
      "text": "You are a code reviewer...\n\n[600-line style guide]",
      "cache_control": { "type": "ephemeral" }
    }
  ],
  "messages": [{ "role": "user", "content": "Review this PR..." }]
}
// Default cache TTL: 5 minutes. Each cache hit refreshes it, so in long
// sessions keep calls closer together than the TTL.
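To see what caching is worth for a given workload, a back-of-the-envelope comparison helps. This sketch assumes every call after the first hits the cache, a read multiplier of 0.1, and a write multiplier of 1.25; treat the multipliers and the price as placeholders to replace with your provider's current figures. `prefixCostUSD` is a hypothetical helper.

```javascript
// Cost of resending a stable prefix over `calls` requests (calls >= 1).
// ASSUMPTIONS: all calls after the first are cache hits; multipliers are
// placeholders to be replaced with your provider's published rates.
function prefixCostUSD(prefixTokens, calls, inputPerMTok, opts = {}) {
  const { readMultiplier = 0.1, writeMultiplier = 1.25 } = opts;
  const perCall = (prefixTokens / 1_000_000) * inputPerMTok;
  const uncached = perCall * calls;
  const cached = perCall * writeMultiplier + perCall * readMultiplier * (calls - 1);
  return { uncached, cached };
}

// A 2,000-token system prompt over 1,000 calls at a placeholder $3 per million.
console.log(prefixCostUSD(2_000, 1_000, 3));
```

Under these assumptions the cached total lands at about a tenth of the uncached one. If calls are spaced further apart than the cache TTL, more of them pay the write rate and the saving shrinks.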

Prefer Structured Output Over Explanation

Asking the model to return structured JSON or a minimal diff instead of a prose explanation followed by a code block dramatically reduces output tokens. "Return only the modified function" produces fewer tokens than "Explain your changes and then show the updated code." Output tokens are your most expensive category — minimize unnecessary verbosity in responses.

Organizing Projects for Token Efficiency

Token efficiency is not just a prompt-writing skill — it is a project organization decision. How you structure your codebase, your context files, and your AI workflows determines your baseline token cost before you write a single prompt.

Tiered Context Files

Not all context is equally relevant to every task. A CLAUDE.md or equivalent project context file works best when it is kept minimal and always-relevant. Information that is only needed for specific tasks — database schema details, API endpoint references, architecture decision records — belongs in separate files loaded on demand, not in the global context that every call imports.

project/
├── CLAUDE.md          # ≤300 tokens: always-on essentials
├── .context/
│   ├── schema.md      # loaded only when touching the DB layer
│   ├── api-ref.md     # loaded only when building integrations
│   └── style-guide.md # loaded only for prose/UI tasks
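A loader for this layout can be as simple as a lookup table. The sketch below is illustrative only: the tag names (`db`, `integration`, `ui`) and the function `contextFilesFor` are hypothetical, and the paths follow the tree above.

```javascript
// Files sent with every task.
const ALWAYS_ON = ["CLAUDE.md"];

// Files loaded only when a task carries the matching tag (tags are hypothetical).
const ON_DEMAND = new Map([
  ["db", ".context/schema.md"],
  ["integration", ".context/api-ref.md"],
  ["ui", ".context/style-guide.md"],
]);

function contextFilesFor(taskTags) {
  const extra = taskTags
    .filter((tag) => ON_DEMAND.has(tag))
    .map((tag) => ON_DEMAND.get(tag));
  // A Set removes duplicates if the same tag is passed twice.
  return [...ALWAYS_ON, ...new Set(extra)];
}

console.log(contextFilesFor(["db"])); // [ 'CLAUDE.md', '.context/schema.md' ]
```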

Prefer Narrow Task Scoping

A session that tries to accomplish ten things accumulates history across all of them. A session scoped to one well-defined task runs shorter, uses less context, and produces more coherent output. Start fresh sessions for distinct tasks rather than carrying everything in one growing thread. This is especially important in agentic workflows where the model reads files and executes tools — each operation compounds the running total.

Keep Reference Documents Stable and Cacheable

If your workflow consistently includes a large reference document — an API spec, a design system, a data dictionary — keep it in a stable, unchanging form. Frequent edits to cached documents invalidate the cache and reset the billing discount. Treat cacheable context the way you treat a CDN asset: stable content, versioned changes.

Use Model Selection Intentionally

Flagship models (Opus, GPT-4) can cost roughly 5–15× more per token than smaller models (Sonnet, GPT-4o-mini). Routing simple, well-defined tasks such as classification, extraction, and template filling to a smaller model eliminates a large fraction of cost, usually with little or no quality loss. Reserve expensive models for tasks that genuinely require deep reasoning or large context synthesis.
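Routing can start as a plain lookup before it becomes anything smarter. In this sketch the task-type strings and the tier labels `small` and `flagship` are placeholders; map them to whichever model IDs your provider currently offers.

```javascript
// Task types considered simple enough for a smaller model (illustrative list).
const SIMPLE_TASKS = new Set(["classification", "extraction", "template-filling"]);

// Returns a tier label, not a real model ID.
function pickModelTier(taskType) {
  return SIMPLE_TASKS.has(taskType) ? "small" : "flagship";
}

console.log(pickModelTier("classification")); // small
console.log(pickModelTier("architecture-review")); // flagship
```

Defaulting unknown task types to the larger model trades cost for safety; the opposite default is reasonable when most traffic is simple and quality is monitored.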

Usage Limits: What They Are and How They Work

API providers enforce usage limits to protect infrastructure stability. Hitting these limits in production is a reliability issue, not just an inconvenience. Understanding their mechanics is the first step to designing around them.

The Two Axes: TPM and RPM

Limits operate primarily on two dimensions:

Tokens per minute (TPM) — the total volume of input plus output tokens that can flow through the API in any 60-second window. This is the limit most teams hit first when working with large contexts or high request volumes.

Requests per minute (RPM) — the number of distinct API calls allowed in a 60-second window, regardless of token volume. Heavy use of lightweight prompts can hit RPM before TPM.

Both limits reset on a rolling window. A 429 response means the current window is exhausted — not that you are banned or that today's quota is gone. The practical implication: a brief wait (often 10–60 seconds) is usually sufficient.
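Many providers tell you how long to wait by sending a `Retry-After` header with the 429 response. The sketch below assumes the header value is a number of seconds and that headers arrive as a plain object with lowercase keys; `retryDelayMs` is a hypothetical helper, and anything it cannot parse falls back to a default.

```javascript
// Convert a Retry-After header (seconds) into milliseconds.
// Falls back when the header is missing or not a plain number.
function retryDelayMs(headers, fallbackMs = 10_000) {
  const raw = headers["retry-after"];
  if (raw === undefined || raw === null || raw === "") return fallbackMs;
  const seconds = Number(raw);
  if (!Number.isFinite(seconds) || seconds < 0) return fallbackMs;
  return Math.ceil(seconds * 1000);
}

console.log(retryDelayMs({ "retry-after": "12" })); // 12000
console.log(retryDelayMs({}));                      // 10000
```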

Daily and Monthly Limits

Some usage plans also include daily or monthly token or spend ceilings in addition to per-minute limits. These accumulate across all calls in the billing period. A runaway script or an accidentally unbounded loop can consume a significant portion of a daily limit in minutes. Monitoring daily consumption alongside per-minute rate is essential for production deployments.

Tier Progression

Most providers structure limits in tiers tied to account spending history or manual review. New accounts start at conservative limits. As spend accumulates — or after a limit-increase request is approved — the ceiling rises. This means that projects designed for scale should be planned for their target tier, not their starting tier. Validate production-scale throughput in staging before cutover.

Avoiding Timeouts and Rate-Limit Waits

The goal is not to avoid hitting limits at the exact boundary — it is to design systems that degrade gracefully and never stall user-facing work when a limit is reached.

Timeline chart showing token consumption rising rapidly over 22 seconds until hitting the 100% rate limit, triggering a 429 error and backoff period, then resetting at 60 seconds and resuming at a managed pace.
A 60-second rolling window. Burst traffic hits the limit fast; the fix is not a bigger limit but a smoother send rate.

Implement Exponential Backoff with Jitter

The standard approach: on receiving a 429, wait before retrying. The wait duration should double on each consecutive failure (exponential) with a randomized offset (jitter) to prevent thundering-herd behavior when multiple clients retry simultaneously.

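// Resolve after `ms` milliseconds.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Retry `fn` on HTTP 429 with exponential backoff plus jitter.
async function callWithBackoff(fn, maxAttempts = 5) {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (err) {
      // Rethrow anything that is not a rate limit, or when out of attempts.
      if (err.status !== 429 || attempt === maxAttempts - 1) throw err;
      const base = Math.min(1000 * 2 ** attempt, 30000); // 1s, 2s, 4s, ... capped at 30s
      const jitter = Math.random() * 1000; // up to 1s of random offset
      await sleep(base + jitter);
    }
  }
}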

Self-Throttle Before Hitting the Limit

The most reliable way to avoid 429 errors is to never reach the limit in the first place. Track your own token consumption over each 60-second window and add artificial delay when approaching the threshold, for example when you've consumed 85% of your per-minute allowance. This is more predictable than reactive backoff: requests wait briefly in your own code instead of failing and retrying.

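// Resolve after `ms` milliseconds.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Fixed 60-second window budget: a simple approximation of the provider's limit.
class TokenBudget {
  constructor(limitTPM, threshold = 0.85) {
    this.limit = limitTPM;
    this.threshold = threshold;
    this.used = 0;
    this.windowStart = Date.now();
  }

  // Start a new window once 60 seconds have passed.
  rollWindow() {
    if (Date.now() - this.windowStart >= 60_000) {
      this.used = 0;
      this.windowStart = Date.now();
    }
  }

  async reserve(estimatedTokens) {
    this.rollWindow();
    // Re-check after each wait: concurrent callers may have claimed the new window.
    // A request larger than the whole budget is let through on an empty window.
    while (this.used > 0 && this.used + estimatedTokens > this.limit * this.threshold) {
      await sleep(60_000 - (Date.now() - this.windowStart) + 100); // wait out the window
      this.rollWindow();
    }
    this.used += estimatedTokens;
  }
}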

Use Asynchronous / Queued Processing

For batch workloads — processing a list of items, generating content for a dataset, running analysis across a codebase — a queue with configurable concurrency is far more reliable than firing all requests at once. Set concurrency to a value where expected throughput stays below 70–80% of your TPM limit. The queue absorbs bursts and smooths the rate naturally.
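A queue with fixed concurrency does not need a library. The sketch below runs a hypothetical `worker` function over a list with at most `concurrency` calls in flight; the name `runWithConcurrency` is an assumption, and error handling is deliberately minimal (the first rejection rejects the whole run while the other lanes finish their current item).

```javascript
// Run `worker` over `items` with at most `concurrency` calls in flight.
// Results are returned in the same order as the input.
async function runWithConcurrency(items, worker, concurrency) {
  const results = new Array(items.length);
  let next = 0;

  // Each lane pulls the next unclaimed index until the list is exhausted.
  async function lane() {
    while (next < items.length) {
      const index = next++;
      results[index] = await worker(items[index], index);
    }
  }

  const laneCount = Math.min(concurrency, items.length);
  await Promise.all(Array.from({ length: laneCount }, () => lane()));
  return results;
}

// Usage (callModel is a placeholder for your own API wrapper):
// const summaries = await runWithConcurrency(documents, (doc) => callModel(doc), 4);
```

Wrapping the worker in a retry helper keeps a single 429 from failing the whole batch.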

Cache Idempotent Results

If the same prompt will produce the same output (code formatting, translation of a stable string, classification of a fixed label set), cache the response. A semantic cache keyed on the prompt content eliminates the API call entirely on repetition. For static reference tasks, even a simple key-value store with a TTL of several hours can eliminate a substantial fraction of calls.
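The simple key-value variant can be sketched in a few lines. Note that this is an exact-match cache keyed on the prompt string, not a semantic cache (which would need embeddings and a similarity threshold). `ResponseCache` is a hypothetical class, and the injectable clock exists only to make it testable.

```javascript
// Exact-match response cache with a time-to-live.
class ResponseCache {
  constructor(ttlMs, now = () => Date.now()) {
    this.ttlMs = ttlMs;
    this.now = now;
    this.entries = new Map();
  }

  // Returns the cached response, or undefined if missing or expired.
  get(prompt) {
    const entry = this.entries.get(prompt);
    if (!entry) return undefined;
    if (this.now() - entry.storedAt > this.ttlMs) {
      this.entries.delete(prompt);
      return undefined;
    }
    return entry.response;
  }

  set(prompt, response) {
    this.entries.set(prompt, { response, storedAt: this.now() });
  }
}

// Usage: check the cache first, call the API only on a miss.
// const hit = cache.get(prompt);
// const answer = hit ?? (await callModel(prompt)); // callModel is a placeholder
```

For long prompts, hashing the prompt before using it as a key keeps memory use down; the lookup logic stays the same.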

Parallelize Within Budget, Not Beyond It

Concurrency is not free. Running 20 parallel requests consumes tokens 20× as fast as running one. Design parallel workloads with the rate limit as an explicit constraint, not an afterthought: tokens per minute ≈ concurrency × requests per worker per minute × average tokens per request, and that product has to stay under the TPM limit. Work backward from the limit to choose your concurrency ceiling.
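Working backward from the limit can be sketched as a small calculation. The inputs are assumptions you would measure for your own workload (average tokens per request, average seconds per request), the headroom default of 0.8 is a placeholder, and `maxConcurrency` is a hypothetical helper.

```javascript
// Largest worker count whose combined throughput stays under
// `headroom` (a fraction between 0 and 1) of the TPM limit.
function maxConcurrency(tpmLimit, avgTokensPerRequest, avgRequestSeconds, headroom = 0.8) {
  const requestsPerWorkerPerMinute = 60 / avgRequestSeconds;
  const tokensPerWorkerPerMinute = requestsPerWorkerPerMinute * avgTokensPerRequest;
  const workers = Math.floor((tpmLimit * headroom) / tokensPerWorkerPerMinute);
  return Math.max(1, workers); // always allow at least one worker
}

// 100k TPM limit, 2,000 tokens per request, 6 seconds per request, 75% headroom.
console.log(maxConcurrency(100_000, 2_000, 6, 0.75)); // 3
```

Each worker in that example moves about 20,000 tokens per minute, so three workers use 60,000 of the 75,000 tokens the headroom allows.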

Use the Batch API for Non-Urgent Work

Anthropic and OpenAI both offer batch processing APIs for asynchronous, non-time-sensitive requests. Batch calls run at significantly reduced cost (often 50% discount) with a looser rate limit envelope, and results are delivered within a defined window (typically 24 hours). For evaluation runs, large-scale content generation, or any workload without a real-time user waiting on the response, the batch API is almost always the right choice.

A Practical Checklist

The following steps apply to most production AI integrations. Run through them before shipping, and revisit when costs or rate-limit incidents rise unexpectedly:

Token efficiency audit:
  [ ] System prompt reviewed and trimmed in the last 30 days
  [ ] Conversation history capped or summarized at a defined turn limit
  [ ] File attachments scoped to relevant sections, not full files
  [ ] Prompt caching enabled for stable system prompt + reference content
  [ ] Output format constrained (JSON / diff / function-only where possible)

Rate limit resilience:
  [ ] Exponential backoff with jitter on all API calls
  [ ] Proactive self-throttling at about 85% of TPM allowance
  [ ] Async queue with concurrency tied to rate limit budget
  [ ] Response caching for idempotent or repeated prompts
  [ ] Batch API used for offline / evaluation workloads

Project organization:
  [ ] Context files tiered: always-on vs. task-specific
  [ ] Sessions scoped to single tasks; fresh starts for distinct work
  [ ] Reference documents kept stable to maximize cache hit rate
  [ ] Model routing: small model for simple tasks, large for complex

Tokens are not an implementation detail. They are the unit of resource consumption for everything your AI integration does. Teams that understand them ship more reliable, more affordable systems, and stop being surprised when the bill arrives or the rate limiter fires.