Tokens per Watt: The Real Metric for AI Inference Efficiency

When people compare AI chips, they usually reach for peak TFLOPS or peak TFLOPS per watt. Those numbers describe the best case: a chip running flat out with all its compute units fed. Real inference rarely looks like that. The metric that actually predicts what a model costs to serve is energy per token, or its inverse, tokens per watt: how many tokens per second the chip produces for each watt it draws, which works out to tokens per joule.
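As a minimal sketch of how the two metrics relate, the snippet below converts a measured throughput and power draw into tokens per joule and energy per token. The function names and the 50 tokens/s at 400 W operating point are hypothetical, chosen only to make the arithmetic visible; they do not describe any real chip.

```python
def tokens_per_joule(tokens_per_second: float, watts: float) -> float:
    """Tokens per second per watt; since 1 W = 1 J/s this equals tokens per joule."""
    return tokens_per_second / watts


def energy_per_token_joules(tokens_per_second: float, watts: float) -> float:
    """Inverse metric: joules spent to produce one token."""
    return watts / tokens_per_second


# Hypothetical operating point (an assumption, not a measurement).
throughput = 50.0   # tokens per second
power = 400.0       # watts

print(f"tokens per joule: {tokens_per_joule(throughput, power):.3f}")
print(f"energy per token: {energy_per_token_joules(throughput, power):.1f} J")
```

The two values are reciprocals of each other, so either one can be reported without losing information.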

Why peak TFLOPS/W is misleading

Peak efficiency assumes the compute is saturated. But generating a token from a large model at low batch is bandwidth-bound — the compute units sit idle, waiting on weights streaming in from memory. A chip can have spectacular peak TFLOPS/W and still deliver poor tokens per watt, because in the regime that matters it spends its energy moving data, not computing. The peak number measures the engine; tokens per watt measures the trip.
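One way to see why low-batch decode leaves compute idle is a simple roofline estimate: attainable throughput is the smaller of peak compute and memory bandwidth times arithmetic intensity. The sketch below assumes one multiply-accumulate (2 FLOPs) per weight per sequence in the batch and one byte per FP8 weight. The chip figures (1,000 TFLOP/s peak, 3 TB/s bandwidth) and the function name are hypothetical placeholders, not any vendor's specification.

```python
def decode_utilization(batch_size: int,
                       peak_flops: float,
                       mem_bandwidth: float,
                       bytes_per_weight: float = 1.0) -> float:
    """Fraction of peak compute usable during decode under a simple roofline model."""
    # Each weight read from memory feeds one multiply-accumulate (2 FLOPs)
    # per sequence in the batch, so intensity grows linearly with batch size.
    intensity = 2.0 * batch_size / bytes_per_weight  # FLOPs per byte moved
    attainable = min(peak_flops, mem_bandwidth * intensity)
    return attainable / peak_flops


# Hypothetical chip (assumed values for illustration only).
PEAK_FLOPS = 1.0e15      # 1,000 TFLOP/s peak compute
MEM_BANDWIDTH = 3.0e12   # 3 TB/s memory bandwidth

for batch in (1, 8, 64, 512):
    util = decode_utilization(batch, PEAK_FLOPS, MEM_BANDWIDTH)
    print(f"batch {batch:>3}: {util:.1%} of peak compute usable")
```

Under these assumed numbers, batch size 1 can use well under 1% of peak compute, which is why a peak TFLOPS/W figure says little about this regime.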

What actually drives energy per token

For a large language model, producing one token means reading every weight once. The energy of that read — pulling each weight across a memory interface — dominates the total. The multiply-accumulate that follows is comparatively free. So energy per token is, to first order, a measure of how far the weights have to travel and how much each trip costs. Lower the distance and the per-bit transfer energy, and tokens per watt rises in lockstep.
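The first-order model described above can be sketched as energy per token = number of weights × (bits per weight × transfer energy per bit + energy per multiply-accumulate). Every input below is a toy value chosen for illustration (a 1-billion-weight model, 10 pJ/bit for a distant memory, 0.1 pJ/bit for a nearby one, 0.5 pJ per MAC); none of them is a measured figure for any real device.

```python
PJ = 1e-12  # one picojoule in joules


def energy_per_token(n_weights: float,
                     bits_per_weight: int,
                     move_pj_per_bit: float,
                     mac_pj: float) -> tuple[float, float]:
    """Return (total joules per token, fraction spent on data movement)."""
    move_j = n_weights * bits_per_weight * move_pj_per_bit * PJ
    math_j = n_weights * mac_pj * PJ
    total_j = move_j + math_j
    return total_j, move_j / total_j


# Toy inputs (assumptions for illustration only).
N_WEIGHTS = 1e9
BITS = 8
MAC_PJ = 0.5

far_j, far_share = energy_per_token(N_WEIGHTS, BITS, move_pj_per_bit=10.0, mac_pj=MAC_PJ)
near_j, near_share = energy_per_token(N_WEIGHTS, BITS, move_pj_per_bit=0.1, mac_pj=MAC_PJ)

print(f"distant memory: {far_j * 1e3:.1f} mJ/token, {far_share:.1%} spent moving data")
print(f"nearby memory:  {near_j * 1e3:.2f} mJ/token, {near_share:.1%} spent moving data")
```

With the toy numbers, data movement accounts for nearly all the energy when memory is distant, and shrinking the per-bit transfer cost cuts energy per token almost proportionally until the arithmetic itself becomes a comparable share.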

Figure: energy per token on a log scale. Sophon's FP8 decode spends 16.3 mJ per token against a GPU's ~6,400 mJ, roughly 390× less energy.

What the numbers look like

On Sophon, weights are stored a few hundred nanometers above the compute, so a read costs femtojoules and an FP8 multiply-accumulate completes at 0.310 picojoules all-in. The result is roughly 16.3 millijoules to decode a token of an 80-billion-parameter model, against roughly 6.4 joules for an HBM-bound GPU at low batch. That is about a 390× difference in energy per token, and it comes almost entirely from removing the distance the data has to travel. In a datacenter, that ratio is the difference between a power budget that scales and one that does not.
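To make the ratio concrete, the sketch below takes the two energy-per-token figures quoted above and works out the ratio, the tokens-per-joule equivalents, and the decode power needed to sustain a given token rate. The 1,000 tokens/s target and the function name are hypothetical, and the calculation counts only the quoted decode energy, not cooling, networking, or host overhead.

```python
# Energy-per-token figures as quoted in the text.
SOPHON_J_PER_TOKEN = 16.3e-3   # 16.3 mJ
GPU_J_PER_TOKEN = 6.4          # 6.4 J, HBM-bound GPU at low batch


def decode_power_watts(joules_per_token: float, tokens_per_second: float) -> float:
    """Power needed to sustain a token rate at a given energy per token."""
    return joules_per_token * tokens_per_second


ratio = GPU_J_PER_TOKEN / SOPHON_J_PER_TOKEN
print(f"energy ratio: about {ratio:.0f}x")
print(f"Sophon: {1 / SOPHON_J_PER_TOKEN:.1f} tokens per joule")
print(f"GPU:    {1 / GPU_J_PER_TOKEN:.3f} tokens per joule")

# Hypothetical serving target (an assumption for illustration).
TARGET_TOKENS_PER_SECOND = 1000.0
for name, e in (("Sophon", SOPHON_J_PER_TOKEN), ("GPU", GPU_J_PER_TOKEN)):
    watts = decode_power_watts(e, TARGET_TOKENS_PER_SECOND)
    print(f"{name}: {watts:,.1f} W to decode {TARGET_TOKENS_PER_SECOND:,.0f} tokens/s")
```

The computed ratio lands slightly above 390, consistent with the rounded figure in the text.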

Peak TFLOPS measures the engine. Tokens per watt measures the trip.
PhantaField

Frequently asked questions

What does tokens per watt mean?

Tokens per watt measures how many tokens per second a chip generates for each watt it draws, which is equivalent to tokens per joule. Its inverse, energy per token, is the energy required to produce one token. Both capture real-world inference efficiency better than peak TFLOPS/W.

Why is energy per token better than TFLOPS per watt?

Peak TFLOPS/W assumes the compute is fully utilized. At low batch sizes, inference is bandwidth-bound and compute units sit idle waiting on memory, so peak efficiency overstates real performance. Energy per token reflects the actual cost of serving a model.

What makes AI inference energy-expensive?

Most of the energy goes to moving model weights from memory to the compute, not to the math. Generating each token requires reading the entire model once, so the dominant cost is data movement across the memory interface.

How efficient is Sophon in tokens per watt?

Sophon decodes an 80-billion-parameter model at about 16.3 millijoules per token, roughly 390× less energy than an HBM-bound GPU at low batch, because the weights are stored directly above the compute and barely move.