Prompt Caching

A technique that stores and reuses computed representations of repeated prompt prefixes, reducing latency and cost for LLM requests that share a common context.

What is Prompt Caching?

Prompt caching is an optimization technique — supported natively by providers such as Anthropic and OpenAI — that allows the model or inference layer to cache the computed key-value (KV) representations of a prompt prefix. When subsequent requests reuse the same prefix, the cached computation is retrieved instead of recalculated, cutting both latency and token cost.
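
As a concrete illustration, here is a minimal sketch of opting into caching with Anthropic's Messages API, which marks the end of the cacheable prefix with a `cache_control` block (the model ID and document contents below are placeholders):

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

LONG_DOCUMENT = "..."  # a large, stable context shared across requests

response = client.messages.create(
    model="claude-3-5-sonnet-latest",  # placeholder; any caching-capable model
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": LONG_DOCUMENT,
            # Everything up to and including this block becomes the cached prefix.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": "Summarize the document."}],
)
```

OpenAI, by contrast, applies caching automatically to sufficiently long prompts (currently those over 1,024 tokens), so no request changes are needed there.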

Understanding Prompt Caching

LLMs process every token in a prompt from scratch on each call. For use cases with a large, stable system prompt or shared context (e.g., a 10,000-token document), reprocessing this prefix on every request is expensive. Prompt caching solves this by storing the KV state of the static portion after the first call.
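
Conceptually, the inference layer keys stored KV state by the exact prefix, along these lines (a toy sketch only: `compute_kv` stands in for the real per-token attention computation, and production systems match prefixes at the block level rather than hashing whole prompts):

```python
import hashlib

# Toy model of the inference layer's prefix cache: KV state keyed by an
# exact hash of the prefix token IDs.
kv_cache: dict[str, list[float]] = {}

def compute_kv(tokens: list[int]) -> list[float]:
    """Stand-in for the expensive per-token attention computation."""
    return [float(t) for t in tokens]

def process_prompt(prefix: list[int], suffix: list[int]) -> list[float]:
    key = hashlib.sha256(repr(prefix).encode()).hexdigest()
    if key not in kv_cache:
        kv_cache[key] = compute_kv(prefix)   # cache miss: pay full cost once
    prefix_kv = kv_cache[key]                # cache hit: reuse on later calls
    return prefix_kv + compute_kv(suffix)    # only the suffix is recomputed
```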

Key aspects of prompt caching include:

  1. Cache Prefix: The portion of the prompt that remains constant across requests — typically the system prompt or a large document.
  2. Cache Miss vs. Hit: The first request always incurs full compute (miss); subsequent requests with the same prefix retrieve the cached state (hit).
  3. TTL (Time-to-Live): Cached entries expire after a provider-defined window (e.g., about 5 minutes on Anthropic, refreshed on each use and with an optional 1-hour TTL; typically 5–10 minutes of inactivity on OpenAI, up to roughly an hour).
  4. Cost Reduction: Anthropic charges ~10% of the standard input token price for cache reads (the initial cache write costs ~1.25× the base price); OpenAI discounts cached input tokens by 50%. See the cost sketch after this list.
  5. Latency Reduction: Cache hits skip expensive KV computation, often cutting time-to-first-token by 40–80%.
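
To make the cost numbers concrete, here is a back-of-the-envelope calculation using the Anthropic rates above (~0.1× for cache reads, ~1.25× for the initial cache write; the per-token price is a placeholder):

```python
# Back-of-the-envelope savings for a 10,000-token cached prefix.
# The per-token price is a placeholder; substitute your model's actual rate.
PRICE_PER_INPUT_TOKEN = 3.00 / 1_000_000  # e.g., $3 per million input tokens

prefix_tokens = 10_000
requests = 1_000

without_cache = requests * prefix_tokens * PRICE_PER_INPUT_TOKEN

# The first request writes the cache at ~1.25x; the rest read it at ~0.1x.
with_cache = (
    prefix_tokens * PRICE_PER_INPUT_TOKEN * 1.25
    + (requests - 1) * prefix_tokens * PRICE_PER_INPUT_TOKEN * 0.10
)

print(f"without cache: ${without_cache:.2f}")  # $30.00
print(f"with cache:    ${with_cache:.2f}")     # ~$3.03
print(f"savings:       {1 - with_cache / without_cache:.0%}")  # ~90%
```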

When to Use Prompt Caching

  1. Large Static Contexts: Long system prompts, lengthy documents, or big codebases prepended to every request.
  2. High-Volume Applications: Chatbots, customer support agents, and coding assistants that share the same context across many users.
  3. Multi-Turn Conversations: Caching the conversation history up to the latest turn reduces compute on each new message (see the sketch after this list).
  4. RAG Pipelines: Caching retrieved context passages that are reused across similar queries.
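
For the multi-turn case, one common pattern with Anthropic's API is to move the cache breakpoint forward to the latest turn on each request, so the whole history up to that point is reused. A minimal sketch, assuming the placeholder model ID below (Anthropic caps the number of breakpoints per request, so the previous marker is removed first):

```python
import anthropic

client = anthropic.Anthropic()
history: list[dict] = []  # accumulated conversation turns

def send(user_text: str) -> str:
    # Drop the previous breakpoint so only the newest turn carries one
    # (providers cap the number of cache_control markers per request).
    for turn in history:
        if isinstance(turn["content"], list):
            turn["content"][0].pop("cache_control", None)

    history.append({
        "role": "user",
        "content": [{
            "type": "text",
            "text": user_text,
            # Cache everything up to and including this turn; the next
            # call extends the cached prefix instead of recomputing it.
            "cache_control": {"type": "ephemeral"},
        }],
    })
    response = client.messages.create(
        model="claude-3-5-sonnet-latest",  # placeholder model ID
        max_tokens=1024,
        messages=history,
    )
    reply = response.content[0].text
    history.append({"role": "assistant", "content": reply})
    return reply
```

In practice, the easiest way to verify hits is the response's `usage` block, which reports `cache_creation_input_tokens` and `cache_read_input_tokens`.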
