How the engine works

Algorithm and kernel co-design. Fewer bytes per token, turned into speed.

The KV cache is the working memory of every agent. Yantrion compresses it in place — so the attention kernel reads the dense form directly, and the freed VRAM goes straight back to serving.

Compression — not eviction or quantization.

When HBM fills, the field has had two relief valves. Both damage the agent. We built a third.

Evict

Forgets context

Drop older tokens to free space. The agent loses the thread — its first instruction is gone by the last tool call.

Quantize

Corrupts IDs & args

Crush precision across the board. Numbers drift, identifiers transpose — exactly the bytes a tool-calling agent cannot afford to lose.

Yantrion · compress

Keeps everything, smaller

Fewer bytes per token, losslessly where it matters. Attention runs on the compressed state, so the decompression cost is never paid.

Same tokens. A fraction of the bytes.

Working memory grows with every tool call until HBM is full. The engine collapses that footprint 4–10× and hands the reclaimed VRAM back to the scheduler immediately.

Because attention operates on the compressed representation, there's no decompress-then-compute tax — the density becomes throughput.
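To make the footprint arithmetic concrete, here is a minimal sketch using the standard KV-cache sizing formula (keys plus values, per layer, per KV head). The model shape below is a hypothetical example and not a Yantrion measurement; the 4× and 10× ratios are the two ends of the range quoted above.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_elem=2):
    """Uncompressed KV-cache bytes stored for one token.

    The leading factor of 2 covers keys and values; bytes_per_elem=2
    assumes a 16-bit cache dtype (fp16 / bf16).
    """
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem


def cache_gib(tokens, bytes_per_token, compression_ratio=1.0):
    """Cache footprint in GiB for a given context length and density."""
    return tokens * bytes_per_token / compression_ratio / 2**30


# Hypothetical model shape, chosen only for illustration.
per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128)
context = 128 * 1024  # one 128K-token session

for ratio in (1.0, 4.0, 10.0):
    print(f"{ratio:>4}x -> {cache_gib(context, per_token, ratio):.1f} GiB")
```

Under these assumptions a single 128K session drops from 16 GiB to between 4 GiB and 1.6 GiB, which is the VRAM a scheduler could hand to other requests.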

Today: HBM full

Working memory grows with every tool call.

Evict → forgets context
Quantize → corrupts IDs & args

Yantrion: 4–10× denser

Freed VRAM, returned to serving instantly
Attention runs on the compressed state
Decompression never paid
IDs exact · NIAH 1.000

The shipping gate

Four rungs. None ships until it's lossless where it matters.

Each rung is a quality-vs-density operating point, measured on live agent cache. The gate is retrieval 1.000 (needle-in-a-haystack) at 2.5×, inside the live vLLM / SGLang serving path — not on a benchmark harness.

Pick maximum safety (Conservative, 2.11×) or push long-session economics (Dense, 7.10×), up to the auto-tuned, quality-gated Calibrated rung (9.95×). The default for agents is Balanced, 4.0×.
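As an illustration of what a quality gate over operating points can look like, the sketch below picks the densest rung whose measured retrieval score still meets the gate. The rung names and ratios are the ones listed in this section; the function name and the scores passed to it are hypothetical inputs you would measure yourself, not Yantrion's tuning logic or results.

```python
# Operating points as listed on this page: (name, density ratio).
RUNGS = [
    ("Conservative", 2.11),
    ("Balanced", 4.0),
    ("Dense", 7.10),
    ("Calibrated", 9.95),
]


def densest_passing_rung(measured_scores, gate=1.0):
    """Return the (name, ratio) of the densest rung that meets the gate.

    measured_scores maps rung name -> retrieval score in [0, 1] from your
    own evaluation. Rungs without a score are treated as failing.
    Returns None when no rung passes.
    """
    passing = [
        (name, ratio)
        for name, ratio in RUNGS
        if measured_scores.get(name, 0.0) >= gate
    ]
    if not passing:
        return None
    return max(passing, key=lambda rung: rung[1])


# Placeholder scores for illustration only.
example_scores = {"Conservative": 1.0, "Balanced": 1.0, "Dense": 0.98}
choice = densest_passing_rung(example_scores)
```

With these placeholder scores the selection lands on Balanced, since Dense misses the 1.000 gate and Calibrated was never measured.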

The 2.5× shipping gate: retrieval 1.000 in live vLLM + SGLang

Conservative (maximum safety margin): 2.11×
Balanced (default for agents): 4.0×
Dense (long-session economics): 7.10×
Calibrated (auto-tuned, quality-gated): 9.95×

Measured on live agent cache — no rung ships until it passes the gate.

Production-hardened

5 / 5, zero leaks.

Concurrent serving at 32K / 64K / 128K
Runs under CUDA graphs
Mid-request VRAM freeing
Stress + abort paths
50-request soak — zero leaks

Model- & hardware-agnostic

By construction.

Attention families: 3 — incl. MLA (Kimi K2-class)
Vendors: AMD + NVIDIA
Frameworks: vLLM · SGLang
Retrieval (NIAH): 1.000 at the shipping gate

One flag. No lock-in.

Prove it on your workload before you commit to anything.

Install

One flag in vLLM / SGLang — no model change, no retraining, no wrapper. Live on your cluster in an afternoon.

Verify

Run your own traces side by side against the uncompressed baseline. Retrieval matches or beats it — you watch it happen.
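A side-by-side check can be as simple as scoring both runs on the same needle-in-a-haystack traces. The sketch below assumes you have already collected the model responses from a baseline run and a compressed run; the function names are hypothetical and the strings are toy data, not results.

```python
def retrieval_score(outputs, needles):
    """Fraction of responses that contain their planted needle verbatim.

    Exact substring matching is deliberate: a transposed ID or a drifted
    number counts as a miss.
    """
    if len(outputs) != len(needles):
        raise ValueError("outputs and needles must have the same length")
    if not needles:
        raise ValueError("need at least one trace")
    hits = sum(1 for out, needle in zip(outputs, needles) if needle in out)
    return hits / len(needles)


def compare_runs(baseline_outputs, compressed_outputs, needles):
    """Score both runs on the same traces and report them side by side."""
    base = retrieval_score(baseline_outputs, needles)
    comp = retrieval_score(compressed_outputs, needles)
    return {"baseline": base, "compressed": comp, "matches_or_beats": comp >= base}


# Toy traces for illustration only.
needles = ["order-7f3a91", "4417"]
baseline = ["The order ID is order-7f3a91.", "The access code is 4417."]
compressed = ["The order ID is order-7f3a91.", "The access code is 4417."]
report = compare_runs(baseline, compressed, needles)
```

Exact matching is the stricter choice for tool-calling agents, where an identifier that is almost right is still wrong.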

No lock-in

Turn the flag off at any time and you're back to standard vLLM / SGLang. Nothing about your model or ops is rewritten.

Run your agents on the engine.

If your agents have to be right and have to scale, let's put the engine on your traces — measured against your own baseline.

contact@yantrion.com