The KV cache is the working memory of every agent. Yantrion compresses it in place — so the attention kernel reads the dense form directly, and the freed VRAM goes straight back to serving.
When HBM fills, the field has had two relief valves. Both damage the agent. We built a third.
Forgets context
Drop older tokens to free space. The agent loses the thread — its first instruction is gone by the last tool call.
Corrupts IDs & args
Crush precision across the board. Numbers drift, identifiers transpose — exactly the bytes a tool-calling agent cannot afford to lose.
Keeps everything, smaller
Fewer bytes per token, lossless where it matters. Attention runs on the compressed state, so the decompression cost is never paid.
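The first two relief valves can be illustrated on toy data. The sketch below shows the general techniques (sliding-window eviction and uniform low-bit quantization), not any particular serving engine; the function names, the example session, and the sample values are all hypothetical.

```python
from collections import deque


def evict_oldest(tokens, capacity):
    """Sliding-window eviction: keep only the `capacity` most recent tokens."""
    window = deque(maxlen=capacity)  # a full deque drops items from the left
    for tok in tokens:
        window.append(tok)
    return list(window)


def quantize_roundtrip(values, bits):
    """Uniform symmetric quantization to `bits` bits, then dequantization."""
    levels = 2 ** (bits - 1) - 1  # e.g. 7 positive levels at 4 bits
    scale = max(abs(v) for v in values) / levels or 1.0  # guard for all-zero input
    return [round(v / scale) * scale for v in values]


if __name__ == "__main__":
    # Hypothetical agent session: one instruction followed by tool calls.
    session = ["SYSTEM: never touch prod", "tool_call_1", "tool_call_2", "tool_call_3"]
    print(evict_oldest(session, capacity=2))  # the first instruction is gone

    original = [0.1, -0.5, 1.0]
    restored = quantize_roundtrip(original, bits=4)
    print([abs(a - b) for a, b in zip(original, restored)])  # per-value error
```

Eviction loses whole tokens, while blanket quantization keeps every token but perturbs each value by up to half a quantization step.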
Working memory grows with every tool call until HBM is full. The engine collapses that footprint 4–10× and hands the reclaimed VRAM back to the scheduler immediately.
Because attention operates on the compressed representation, there's no decompress-then-compute tax — the density becomes throughput.
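To put the 4–10× range in concrete terms, here is a back-of-the-envelope sketch using the standard KV cache size formula for a transformer (one key and one value vector per token, per KV head, per layer). The model shape and session length are hypothetical, chosen only to make the arithmetic easy to follow.

```python
def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bytes_per_elem=2):
    """Uncompressed KV cache size: keys plus values for every token, head, and layer."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem * tokens


def reclaimed_bytes(baseline_bytes, ratio):
    """Bytes freed when the cache shrinks by `ratio` (e.g. 4.0 for 4x)."""
    return baseline_bytes - baseline_bytes / ratio


if __name__ == "__main__":
    GIB = 2 ** 30
    # Hypothetical model shape and session length (fp16 cache, 2 bytes per element).
    baseline = kv_cache_bytes(tokens=32_768, layers=32, kv_heads=8, head_dim=128)
    print(f"baseline: {baseline / GIB:.2f} GiB")
    for ratio in (4.0, 10.0):
        print(f"{ratio}x frees {reclaimed_bytes(baseline, ratio) / GIB:.2f} GiB")
```

With these assumed dimensions a single 32K-token session holds 4 GiB of cache, so 4× compression frees 3 GiB and 10× frees 3.6 GiB.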
Working memory — grows with every tool call.
The shipping gate
Each rung is a quality-vs-density operating point, measured on live agent cache. The gate is a needle-in-a-haystack retrieval score of 1.000 at 2.5× compression, measured inside the live vLLM / SGLang serving path rather than on a benchmark harness.
Pick safety (2.11×) or push long-session economics (9.95×, auto-tuned and quality-gated). The default for agents is Balanced, 4.0×.
The 2.5× shipping gate
retrieval 1.000 in live vLLM + SGLang
Measured on live agent cache — no rung ships until it passes the gate.
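A needle-in-a-haystack gate of this kind can be expressed in a few lines. The following is a minimal sketch assuming exact-match scoring over planted strings; the function names and the sample needles are hypothetical and do not describe the engine's internal test harness.

```python
def retrieval_score(expected, retrieved):
    """Fraction of planted needles that came back exactly as planted."""
    if not expected:
        raise ValueError("need at least one needle")
    # Needles with no corresponding answer count as misses.
    hits = sum(1 for want, got in zip(expected, retrieved) if want == got)
    return hits / len(expected)


def passes_gate(score, required=1.0):
    """A rung passes only if retrieval meets the required score."""
    return score >= required


if __name__ == "__main__":
    needles = ["order-48213", "sku-A7F9"]
    answers = ["order-48231", "sku-A7F9"]  # two digits transposed in the first ID
    score = retrieval_score(needles, answers)
    print(score, passes_gate(score))
```

Exact-match scoring is deliberately strict: a single transposed digit in an identifier counts as a full miss, which is the failure a tool-calling agent cannot tolerate.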
Production-hardened
Model- & hardware-agnostic
Prove it on your workload before you commit to anything.
One flag in vLLM / SGLang — no model change, no retraining, no wrapper. Live on your cluster in an afternoon.
Run your own traces side by side against the uncompressed baseline. Retrieval matches or beats it — you watch it happen.
Turn the flag off at any time and you're back to standard vLLM / SGLang. Nothing about your model or ops is rewritten.
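The side-by-side comparison can be as simple as diffing per-trace outcomes from the two runs. This sketch assumes each run is summarized as a mapping from a trace ID to whether retrieval succeeded; that record format and the function name are hypothetical.

```python
def regressions(baseline, compressed):
    """Trace IDs that succeeded on the baseline but not on the compressed run."""
    return sorted(
        trace_id
        for trace_id, ok in baseline.items()
        if ok and not compressed.get(trace_id, False)  # a missing trace counts as a failure
    )


if __name__ == "__main__":
    baseline_run = {"t1": True, "t2": True, "t3": False}
    compressed_run = {"t1": True, "t2": False, "t3": True}
    print(regressions(baseline_run, compressed_run))
```

An empty result means the compressed run matched or beat the baseline on every trace in the sample.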
If your agents have to be right and have to scale, let's put the engine on your traces — measured against your own baseline.
contact@yantrion.com