
Every agent session is gigabytes of working memory, pinned in GPU HBM for the life of the task. Yantrion makes that memory 4–10× denser — attention runs on the compressed state, IDs stay exact, nothing that matters is lost. Same model, the GPUs you already own.
One flag in the serving path · model- & hardware-agnostic
The problem
Every agent session is gigabytes of working memory, resident in HBM for the life of the task. The cache, not the chip, decides how many agents you can serve.
And as agents graduate from minutes-long chats to hours-long workers, the constraint only compounds — the compute sits idle, stranded behind a memory wall.
Whoever owns the density of agent memory owns the economics of the agentic era.
One GPU
The GPU isn't full. Its memory is: an 8-GPU node saturates HBM (288 GB, BF16 KV baseline) at ~350 concurrent agents with 64K-token contexts.
The breakthrough
The only relief valves the field had were lossy: eviction forgets context, and quantization can corrupt the IDs and arguments an agent depends on.
Yantrion is algorithm-and-kernel co-design: fewer bytes per token, turned into speed. Attention runs directly on the compressed state, so there is no decompression step to pay for, and the freed VRAM returns to serving instantly.
Agent memory, 4–10× denser — faster, cheaper, and nothing that matters is lost.
Working memory — grows with every tool call.
Proof, not promises
Four compression rungs, every one measured on live agent cache. Nothing ships until it clears the gate.
The 2.5× shipping gate
retrieval 1.000 in live vLLM + SGLang
Measured on live agent cache — no rung ships until it passes the gate.
Needle-in-a-haystack retrieval at 32K / 64K / 128K context on Kimi K2-class MLA at 2.5× compression, run inside the live vLLM / SGLang serving path.
Concurrent 32K / 64K / 128K requests under CUDA graphs, with mid-request VRAM freeing, stress, abort, and a 50-request soak test: zero leaks.
Attention families across AMD + NVIDIA — the engine is model- and hardware-agnostic by construction.
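For readers who want to picture what a needle-in-a-haystack retrieval score measures, here is a minimal sketch of the general technique. It is not Yantrion's test harness: the prompt wording, the `ID-` format, and the names `build_haystack`, `make_cases`, `retrieval_score`, `echo_model`, and `lossy_model` are all illustrative assumptions. In a real run, `generate` would be a call into your own serving endpoint, once with compression on and once with it off.

```python
import random
import re


def build_haystack(needle, filler, n_fillers, depth):
    """Place `needle` at a fractional depth (0.0 = start, 1.0 = end) among filler lines."""
    lines = [filler] * n_fillers
    lines.insert(int(depth * n_fillers), needle)
    return "\n".join(lines)


def make_cases(n_cases, n_fillers, seed=0):
    """Build (prompt, expected_id) pairs with the needle swept from start to end."""
    rng = random.Random(seed)
    cases = []
    for i in range(n_cases):
        secret = f"ID-{rng.randrange(10**8):08d}"  # hypothetical exact-match target
        needle = f"The ticket identifier is {secret}."
        depth = i / max(1, n_cases - 1)
        context = build_haystack(needle, "The sky was grey that day.", n_fillers, depth)
        cases.append((context + "\nWhat is the ticket identifier?", secret))
    return cases


def retrieval_score(generate, cases):
    """Fraction of cases where the exact identifier appears in the model output."""
    hits = sum(1 for prompt, expected in cases if expected in generate(prompt))
    return hits / len(cases)


def echo_model(prompt):
    """Stand-in for a lossless backend: returns the identifier verbatim."""
    match = re.search(r"ID-\d{8}", prompt)
    return match.group(0) if match else ""


def lossy_model(prompt):
    """Stand-in for a backend that corrupts one digit of the identifier."""
    exact = echo_model(prompt)
    return exact[:-1] + ("1" if exact[-1] == "0" else "0")


cases = make_cases(n_cases=5, n_fillers=200)
print(retrieval_score(echo_model, cases))   # 1.0
print(retrieval_score(lossy_model, cases))  # 0.0
```

Exact-match scoring is deliberately strict: an identifier that is off by one digit counts as a miss, which is the failure mode that matters for agents passing IDs and arguments to tools.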
What it's worth
Better agents per GPU — not just more of them. The economics fall out of the quality, not the other way around.
Kimi K2-class MLA · 64K sessions · 8-GPU node (288 GB HBM, BF16 KV baseline). Validation expanding across model families and context lengths.
IDs, numbers, and arguments preserved — live retrieval matches or beats the uncompressed baseline, demoed side by side.
No eviction, no truncation — hours-long agents keep their first instruction as crisply as their last tool call.
At $2/GPU-hr, an 8-GPU node costs $16 per hour. Spread over ~350 concurrent agents, that is ~4.6¢ per agent-hour; at 4–10× the concurrency it drops to 0.5–1.1¢.
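The per-agent-hour figures can be checked with a few lines of arithmetic. The sketch below uses only the numbers stated on this page ($2/GPU-hr, 8 GPUs, ~350 baseline agents, 4–10× density). It assumes concurrency scales linearly with memory density, meaning memory stays the binding constraint, so swap in your own measured concurrency. The function name `cost_per_agent_hour_cents` is illustrative.

```python
def cost_per_agent_hour_cents(gpu_hour_usd, gpus_per_node, concurrent_agents):
    """Node cost per hour, split evenly across concurrent agents, in cents."""
    node_hour_usd = gpu_hour_usd * gpus_per_node
    return 100.0 * node_hour_usd / concurrent_agents


GPU_HOUR_USD = 2.0     # stated price
GPUS_PER_NODE = 8      # stated node size
BASELINE_AGENTS = 350  # stated baseline concurrency at 64K sessions

baseline = cost_per_agent_hour_cents(GPU_HOUR_USD, GPUS_PER_NODE, BASELINE_AGENTS)

# Assumption: 4x or 10x denser memory yields 4x or 10x the concurrent agents.
dense_4x = cost_per_agent_hour_cents(GPU_HOUR_USD, GPUS_PER_NODE, BASELINE_AGENTS * 4)
dense_10x = cost_per_agent_hour_cents(GPU_HOUR_USD, GPUS_PER_NODE, BASELINE_AGENTS * 10)

print(f"baseline: {baseline:.1f} cents per agent-hour")    # 4.6
print(f"4x dense: {dense_4x:.1f} cents per agent-hour")    # 1.1
print(f"10x dense: {dense_10x:.1f} cents per agent-hour")  # 0.5
```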
Product roadmap
One flag in vLLM / SGLang — no model change, no retraining. 4–10× more concurrent agents on the GPUs you already own, AMD or NVIDIA.
Flat per-GPU license
Coding agents drown models in tool output. The refinery holds the bulk at compressed-memory cost and feeds the model only signal — your token bill shrinks, whichever model you run.
Pay per token refined
Sessions persist, migrate, and resume as compressed state — memory that outlives the request. Models co-trained to the engine: quality rises as compression deepens.
Early-access program
Getting started
Prove it on your workload first.
Install
One flag — live on your cluster in an afternoon.
Verify
Side by side — your traces vs. the uncompressed baseline.
No lock-in
Flag off anytime — standard vLLM / SGLang underneath.
Where we're going
The density engine of the agentic era — shipping in vLLM today.
Prompt → engineering design, running in production on the Engine — so we feel every byte before you do.
Million-token simulation, genomics, computational design — the workloads HPC has always lived on.
With math, algorithms, and silicon — the company we're building.
We love the math, the algorithms, and the kernels — that love is the company.
Prove it on your workload
We don't ask you to take the numbers on faith. Run the engine on your model and your hardware, compare retrieval and cost side by side with your uncompressed baseline, and switch it off if it doesn't hold. The claim is only as good as your measurement of it.
Write to us and we'll set up a measured pilot on your stack.