How the engine works

    Algorithm and kernel co-design. Fewer bytes per token, turned into speed.

    The KV cache is the working memory of every agent. Yantrion compresses it in place, so the attention kernel reads the compressed form directly and the freed VRAM goes straight back to serving.


    Compression — not eviction or quantization.

    When HBM fills, the field has had two relief valves. Both damage the agent. We built a third.

    Evict

    Forgets context

    Drop older tokens to free space. The agent loses the thread — its first instruction is gone by the last tool call.

    Quantize

    Corrupts IDs & args

    Crush precision across the board. Numbers drift, identifiers transpose — exactly the bytes a tool-calling agent cannot afford to lose.

    Yantrion · compress

    Keeps everything, smaller

    Fewer bytes per token, losslessly where it matters. Attention runs on the compressed state — decompression is never paid.

    Same tokens. A fraction of the bytes.

    Working memory grows with every tool call until HBM is full. The engine collapses that footprint 4–10× and hands the reclaimed VRAM back to the scheduler immediately.

    Because attention operates on the compressed representation, there's no decompress-then-compute tax — the density becomes throughput.
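To put the 4–10× range in concrete terms, here is a minimal back-of-envelope sketch. It uses the standard KV-cache sizing formula for MHA/GQA attention (keys plus values, per layer, per KV head); MLA-style models size their cache differently. The model shape, the fp16 cache, and the function names are illustrative assumptions, not Yantrion parameters or measurements.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_elem=2):
    """Uncompressed KV-cache bytes for one token (MHA/GQA layout)."""
    # Keys and values are both cached, hence the leading factor of 2.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem


def cache_gib(tokens, per_token_bytes, density=1.0):
    """Cache footprint in GiB for a session, divided by a density ratio."""
    return tokens * per_token_bytes / density / 2**30


# Hypothetical GQA model: 32 layers, 8 KV heads, head_dim 128, fp16 cache.
per_token = kv_bytes_per_token(32, 8, 128)      # 131072 bytes = 128 KiB
session_tokens = 128 * 1024                     # one 128K-token agent session

baseline = cache_gib(session_tokens, per_token)           # 16.0 GiB
at_4x = cache_gib(session_tokens, per_token, 4.0)         # 4.0 GiB
at_10x = cache_gib(session_tokens, per_token, 10.0)       # about 1.6 GiB

for label, gib in [("baseline", baseline), ("4x", at_4x), ("10x", at_10x)]:
    print(f"{label:>8}: {gib:5.2f} GiB cached, {baseline - gib:5.2f} GiB freed")
```

Under these assumptions, a single 128K-token session drops from 16 GiB to between roughly 1.6 and 4 GiB, which is the VRAM the scheduler would get back.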

    Today: HBM full. Working memory grows with every tool call.

    • Evict → forgets context
    • Quantize → corrupts IDs & args

    Yantrion: 4–10× denser

    • Freed VRAM returned to serving instantly
    • Attention runs on the compressed state
    • Decompression never paid
    • IDs exact · NIAH 1.000

    The shipping gate

    Four rungs. None ships until it's lossless where it matters.

    Each rung is a quality-vs-density operating point, measured on live agent cache. The gate is retrieval 1.000 (needle-in-a-haystack) at 2.5×, inside the live vLLM / SGLang serving path — not on a benchmark harness.
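As a rough illustration of what a "retrieval 1.000" gate means, the sketch below scores a set of needle-in-a-haystack trials and passes only if every needle is recovered. The `NeedleTrial` type, the substring-match rule, and the function names are assumptions made for this example; they are not Yantrion's actual harness.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NeedleTrial:
    expected: str  # the needle planted somewhere in the long context
    answer: str    # what the model returned when asked for it


def retrieval_score(trials):
    """Fraction of trials whose answer contains the exact needle."""
    if not trials:
        raise ValueError("at least one trial is required")
    hits = sum(1 for t in trials if t.expected in t.answer)
    return hits / len(trials)


def passes_gate(trials, required=1.0):
    """A gate of 1.000 tolerates no missed needle at all."""
    return retrieval_score(trials) >= required


trials = [
    NeedleTrial("order-7F3A91", "The order ID is order-7F3A91."),
    NeedleTrial("port 8443", "It listens on port 8443."),
    NeedleTrial("order-7F3A91", "The order ID is order-7F3A19."),  # transposed digits
]
print(retrieval_score(trials[:2]), passes_gate(trials[:2]))
print(passes_gate(trials))
```

The third trial shows why the threshold is exact: a single transposed identifier is enough to fail the gate.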

    Pick Conservative for safety (2.11×), or push long-session economics with Dense (7.10×) or Calibrated (9.95×, auto-tuned and quality-gated). The default for agents is Balanced, 4.0×.

    The 2.5× shipping gate: retrieval 1.000 in live vLLM + SGLang

    • Conservative (maximum safety margin): 2.11×
    • Balanced (default for agents): 4.0×
    • Dense (long-session economics): 7.10×
    • Calibrated (auto-tuned, quality-gated): 9.95×

    Measured on live agent cache. No rung ships until it passes the gate.
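To see how the four density figures could translate into capacity, here is a minimal sketch that estimates how many concurrent sessions fit in a fixed KV-cache budget at each rung. The density ratios come from the rungs above; the 40 GiB budget, the 16 GiB uncompressed session size, and the assumption that density applies uniformly to every session are hypothetical simplifications.

```python
import math

# Density ratios for the four rungs described on this page.
RUNGS = {
    "Conservative": 2.11,
    "Balanced": 4.0,
    "Dense": 7.10,
    "Calibrated": 9.95,
}


def sessions_that_fit(kv_budget_gib, session_gib_uncompressed, density):
    """Whole sessions that fit if each one shrinks by `density`."""
    if density <= 0 or session_gib_uncompressed <= 0:
        raise ValueError("density and session size must be positive")
    return math.floor(kv_budget_gib * density / session_gib_uncompressed)


# Hypothetical GPU: 40 GiB left for KV cache, 16 GiB per uncompressed session.
baseline_sessions = sessions_that_fit(40.0, 16.0, 1.0)
capacity = {name: sessions_that_fit(40.0, 16.0, d) for name, d in RUNGS.items()}

print("uncompressed:", baseline_sessions)
for name, count in capacity.items():
    print(f"{name}: {count}")
```

With these illustrative numbers, the same GPU goes from 2 sessions uncompressed to 5, 10, 17, and 24 sessions across the four rungs.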

    Production-hardened

    5 / 5, zero leaks.

    • Concurrent serving at 32K / 64K / 128K
    • Runs under CUDA graphs
    • Mid-request VRAM freeing
    • Stress + abort paths
    • 50-request soak — zero leaks

    Model- & hardware-agnostic

    By construction.

    • Attention families: 3, incl. MLA (Kimi K2-class)
    • Vendors: AMD + NVIDIA
    • Frameworks: vLLM · SGLang
    • Retrieval (NIAH): 1.000 at the shipping gate

    One flag. No lock-in.

    Prove it on your workload before you commit to anything.

    Install

    One flag in vLLM / SGLang — no model change, no retraining, no wrapper. Live on your cluster in an afternoon.

    Verify

    Run your own traces side by side against the uncompressed baseline. Retrieval matches or beats it — you watch it happen.
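One way to run such a side-by-side check is to replay the same trace with and without compression and diff the tool calls exactly. The sketch below assumes deterministic (greedy) decoding and that each run has been captured as a list of `{"name", "arguments"}` dictionaries; that record shape and the helper names are hypothetical, not a vLLM, SGLang, or Yantrion format.

```python
import json
from itertools import zip_longest


def canonical(call):
    """Serialize a tool call so key order cannot hide or fake a difference."""
    return json.dumps(call, sort_keys=True, separators=(",", ":"))


def mismatched_steps(baseline, candidate):
    """Indices where the candidate's tool call differs from the baseline's."""
    return [
        i
        for i, (b, c) in enumerate(zip_longest(baseline, candidate))
        # A missing step on either side counts as a mismatch.
        if b is None or c is None or canonical(b) != canonical(c)
    ]


baseline_run = [
    {"name": "get_order", "arguments": {"order_id": "A1B2C3"}},
    {"name": "refund", "arguments": {"order_id": "A1B2C3", "amount": 42.50}},
]
same_run = [
    {"name": "get_order", "arguments": {"order_id": "A1B2C3"}},
    {"arguments": {"amount": 42.50, "order_id": "A1B2C3"}, "name": "refund"},
]
drifted_run = [
    {"name": "get_order", "arguments": {"order_id": "A1B2C3"}},
    {"name": "refund", "arguments": {"order_id": "A1B3C2", "amount": 42.50}},
]

print(mismatched_steps(baseline_run, same_run))
print(mismatched_steps(baseline_run, drifted_run))
```

Exact comparison is deliberate here: a transposed identifier in one argument is the kind of failure a fuzzy similarity score would miss.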

    No lock-in

    Turn the flag off at any time and you're back to standard vLLM / SGLang. Nothing about your model or ops is rewritten.

    Run your agents on the engine.

    If your agents have to be right and have to scale, let's put the engine on your traces — measured against your own baseline.

    contact@yantrion.com
    Yantrion

    The density engine of the agentic era.

    Agent memory, 4–10× denser — on the GPUs you already own.

    © 2026 Yantrion, Inc. All rights reserved.

    Built at the metal, in the open — vLLM & SGLang.