Tokens per second

A throughput metric measuring how many tokens an LLM can generate per second during inference.

What is Tokens per second?

Tokens per second is a throughput metric measuring how many tokens an LLM can generate per second during inference. In practice, it is one of the clearest ways to judge how fast a model streams output and how much work a serving stack can complete in real time. (docs.nvidia.com)
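At its simplest, TPS is generated tokens divided by the wall-clock time spent generating them. A minimal sketch in Python, with made-up numbers rather than output from any particular model:

```python
def tokens_per_second(num_output_tokens: int, elapsed_seconds: float) -> float:
    """Throughput = generated tokens / wall-clock generation time."""
    return num_output_tokens / elapsed_seconds

# Hypothetical run: 200 output tokens generated in 5.7 seconds.
print(f"{tokens_per_second(200, 5.7):.1f} tokens/sec")  # ~35.1
```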

Understanding Tokens per second

Teams use tokens per second to compare model speed, serving configurations, and deployment optimizations. NVIDIA’s benchmarking docs describe TPS as a standard LLM inference metric, and Hugging Face similarly treats token throughput as a core performance signal for generation systems. (docs.nvidia.com)

This metric can be reported per request, per user, or in aggregate across a cluster. It is closely related to latency metrics like time to first token and inter-token latency, but it focuses on steady-state generation speed rather than just how quickly the first output appears. Higher TPS usually means better responsiveness, lower queueing risk, and better hardware utilization, although the exact number depends on prompt length, batch size, context window, quantization, and decoding strategy. (docs.nvidia.com)
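Because time to first token and decode speed answer different questions, one common convention (an assumption here; tools differ) is to compute steady-state decode TPS from token arrival timestamps, excluding the wait for the first token:

```python
def decode_tokens_per_second(token_timestamps: list[float]) -> float:
    """Steady-state decode speed: tokens emitted after the first one,
    divided by the time between the first and last token."""
    if len(token_timestamps) < 2:
        raise ValueError("need at least two token timestamps")
    span = token_timestamps[-1] - token_timestamps[0]
    return (len(token_timestamps) - 1) / span

# Hypothetical stream: first token arrives at t=0.4 s (the TTFT),
# then one token every 25 ms.
stamps = [0.4 + 0.025 * i for i in range(200)]
print(f"{decode_tokens_per_second(stamps):.0f} tokens/sec")  # ~40
```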

Key aspects of Tokens per second include:

  1. Decode speed: How quickly the model emits output tokens once generation begins.
  2. Throughput: How much text the system can produce across one request or many concurrent requests.
  3. Serving efficiency: How batching, caching, and hardware choices affect observed speed.
  4. User experience: Faster TPS usually makes streaming responses feel more responsive.
  5. Benchmark context: TPS should be interpreted alongside latency, not in isolation.

Advantages of Tokens per second

  1. Easy to compare: It gives teams a simple way to compare models, GPUs, and inference stacks.
  2. Production relevant: It maps directly to how fast users receive generated text.
  3. Useful for tuning: It helps validate batching, quantization, and decoding changes.
  4. Capacity planning: It supports estimates for concurrency and expected load.
  5. Cost awareness: Higher throughput often translates to better cost per generated token, as the sketch after this list shows.
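To illustrate points 4 and 5, aggregate TPS converts directly into capacity and cost figures. The price and throughput below are hypothetical:

```python
gpu_cost_per_hour = 2.50   # hypothetical on-demand GPU price, USD
aggregate_tps = 1_200      # hypothetical cluster-wide throughput

tokens_per_hour = aggregate_tps * 3600
cost_per_million = gpu_cost_per_hour / tokens_per_hour * 1_000_000
print(f"${cost_per_million:.2f} per million generated tokens")  # ~$0.58
```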

Challenges in Tokens per second

  1. Measurement variance: Numbers change with prompt length, output length, and traffic mix.
  2. Metric ambiguity: Some tools count only output tokens, while others include more of the request path; a sketch after this list shows how far the numbers can diverge.
  3. Hardware dependence: The same model can show very different TPS across CPUs, GPUs, and serving engines.
  4. Tradeoff with quality: Aggressive optimizations can affect accuracy, formatting, or determinism.
  5. Benchmark mismatch: Lab results often differ from real user traffic.
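To make the ambiguity in point 2 concrete, here is a sketch with made-up numbers showing how the same request yields very different TPS depending on whether prompt tokens are counted:

```python
prompt_tokens = 1_000   # hypothetical request
output_tokens = 200
total_seconds = 6.0     # wall-clock time including prefill and decode

output_only_tps = output_tokens / total_seconds
total_tps = (prompt_tokens + output_tokens) / total_seconds

print(f"output-only:   {output_only_tps:.1f} tok/s")  # ~33.3
print(f"prompt+output: {total_tps:.1f} tok/s")        # 200.0
```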

Example of Tokens per second in action

Scenario: A support team runs a chat assistant that must answer in near real time.

They benchmark two deployment settings. Configuration A generates 35 tokens per second, while Configuration B generates 18 tokens per second. If most answers are 200 tokens long, the faster setup finishes much sooner and keeps the interface feeling fluid.
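The arithmetic behind that comparison is simple division:

```python
answer_tokens = 200
for name, tps in [("Configuration A", 35), ("Configuration B", 18)]:
    print(f"{name}: {answer_tokens / tps:.1f} s per answer")
# Configuration A: 5.7 s per answer
# Configuration B: 11.1 s per answer
```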

The team then checks TPS alongside time to first token and total request latency. That gives them a fuller picture of the experience, because a model can start quickly but still generate slowly, or generate quickly after a long wait.
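A sketch of that fuller measurement, assuming a client that yields tokens as they stream (the `fake_stream` generator below is a stand-in for a real streaming API):

```python
import time

def measure_stream(stream):
    """Collect TTFT, total latency, and decode TPS from one token stream."""
    start = time.perf_counter()
    stamps = [time.perf_counter() for _token in stream]
    ttft = stamps[0] - start
    total = stamps[-1] - start
    decode_tps = (len(stamps) - 1) / (stamps[-1] - stamps[0])
    return ttft, total, decode_tps

def fake_stream(n=100, ttft=0.05, gap=0.025):
    """Stand-in for a real client: 50 ms to first token, then 25 ms/token."""
    time.sleep(ttft)
    for i in range(n):
        yield f"tok{i}"
        time.sleep(gap)

ttft, total, tps = measure_stream(fake_stream())
print(f"TTFT={ttft * 1000:.0f} ms, total={total:.2f} s, decode={tps:.0f} tok/s")
```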

How PromptLayer helps with Tokens per second

PromptLayer helps teams connect throughput metrics like tokens per second with prompt changes, model comparisons, and evaluation results. That makes it easier to see whether a slower response comes from the prompt, the model, or the broader agent workflow, then iterate with confidence.

Ready to try it yourself? Sign up for PromptLayer and start managing your prompts in minutes.
