
Open-source single-GPU reproductions of Cartridges and STILL for neural KV-cache compaction [P]

Matthieu Morel
April 21, 2026 · 12 min read

Benchmark Reality Check: Open-Source Single-GPU Reproductions of Cartridges and STILL for Neural KV-Cache Compaction

TL;DR

Neural KV-cache compaction — using learned compression rather than heuristic eviction — is one of the more credible paths to running long-context LLMs without bleeding GPU memory. Cartridges and STILL are two recent papers pushing this frontier, and community single-GPU reproductions have now made both accessible for independent evaluation. The benchmark gains in memory reduction are documented and reproducible. Whether the quality trade-offs hold at production context lengths, and across model families beyond the ones tested, remains a genuinely open question.

Key Takeaways

  • Cartridges proposes storing compressed KV representations — "cartridges" — using a learned encoder that can reduce KV cache memory by reported ratios of 4–8× on the paper's benchmark suite, according to the original Cartridges preprint
  • STILL (Structured Implicit Token-Level Learning) approaches the same memory wall through structured sparsity applied during the KV compression training phase, with quality preservation as the primary design constraint
  • Community reproducers have confirmed that both methods run end-to-end on a single A100 80GB GPU for models up to 7B parameters, removing the multi-GPU barrier that previously limited independent benchmark verification
  • Perplexity degradation in reproduced runs is consistent with paper-reported figures at shorter contexts (under 16K tokens), but reproduction authors note deviation at 32K+, which the original papers did not fully stress-test
  • Neither method is a drop-in replacement for standard KV cache management — both require a training phase on representative data, which adds infrastructure overhead that matters in production
  • The reproduction effort is explicitly open-source under permissive licenses, with training scripts, evaluation harnesses, and checkpoint loading code available on GitHub
  • For teams working on long-context inference optimization, these reproductions provide the first independently verifiable benchmark baseline outside the original authors' controlled environments

What KV-Cache Compaction Actually Solves (and What It Costs)

Memory in transformer inference is not a single problem. It has a shape. The KV cache — the store of attention key-value pairs for every processed token — grows linearly with sequence length, scales with model depth and head dimension, and sits entirely on GPU VRAM at inference time. For a 70B-parameter model at full precision, the KV cache alone at 128K context can exceed the memory budget of a single H100. You're either paying for more GPUs, or you're cutting context.
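A back-of-the-envelope sizing makes the scaling concrete. The config below is illustrative (full multi-head attention, fp16); real 70B deployments typically use grouped query attention, which shrinks the KV head count and therefore the cache:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Size of the KV cache: two tensors (K and V) per layer,
    each of shape [seq_len, n_kv_heads, head_dim]."""
    return 2 * n_layers * seq_len * n_kv_heads * head_dim * bytes_per_elem

# Illustrative 70B-class config: 80 layers, 64 KV heads x 128 head dim,
# fp16 (2 bytes), 128K-token context.
size = kv_cache_bytes(n_layers=80, n_kv_heads=64, head_dim=128, seq_len=128 * 1024)
print(f"{size / 2**30:.0f} GiB")  # 320 GiB -- far beyond one 80 GiB H100
```

The linear dependence on `seq_len` is the core problem: doubling the context doubles the cache, with no change to the model weights.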

Three strategies have dominated the practical response: token eviction (drop low-scoring tokens from the cache), quantization (compress the numerical precision of stored values), and prefill recomputation (don't store everything, recompute on demand). Each involves a different trade-off between throughput, quality, and engineering complexity.

Neural compaction is a fourth category. Instead of deciding which tokens to evict or how many bits to use, a learned encoder compresses the KV states into a dense latent space — smaller in size, but reconstructable (imperfectly) on demand. Cartridges and STILL are two of the most technically rigorous papers in this space. The fact that they are now reproducible on a single GPU is the specific development worth paying attention to, because it changes who can run the benchmark independently.
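The shape of the idea can be sketched in a few lines. This is a toy linear encoder-decoder with random weights, not either paper's architecture; in practice the projections are learned, and the grouping factor and dimensions here are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions: 64 tokens of KV state, model dim 32,
# compressed 4x along the token axis into a dense latent.
T, d, ratio = 64, 32, 4

# Stand-in "learned" weights (random here; trained in practice).
W_enc = rng.standard_normal((ratio * d, d)) / np.sqrt(ratio * d)
W_dec = rng.standard_normal((d, ratio * d)) / np.sqrt(d)

kv = rng.standard_normal((T, d))

# Encode: group every `ratio` tokens and project to one latent vector.
latent = kv.reshape(T // ratio, ratio * d) @ W_enc   # shape (16, 32)
# Decode: reconstruct an approximation of the original KV block on demand.
kv_approx = (latent @ W_dec).reshape(T, d)           # shape (64, 32)

print(latent.nbytes / kv.nbytes)  # 0.25 -> 4x memory reduction
```

Only the latent is kept resident; the reconstruction is lossy by design, which is exactly where the quality trade-off enters.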

Cartridges and STILL: The Neural Approach

How Cartridges Works

Cartridges frames the problem as learned document caching. A trained encoder compresses the KV cache for a chunk of context (a "cartridge") into a compact latent. At inference time, the decoder reconstructs an approximation of the original KV states from the latent before attention is computed. The key design decision is that the encoder is trained jointly with a reconstruction objective and an end-task distillation loss — so it learns what information in the KV states actually matters for downstream generation quality, not just raw reconstruction fidelity.
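The joint objective described above can be sketched as follows. The function names, the MSE/KL choice, and the `alpha` weighting are assumptions for illustration; the paper's exact loss terms may differ:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cartridge_loss(kv, kv_recon, logits_full, logits_recon, alpha=0.5):
    """Hypothetical combined objective: reconstruction fidelity of the
    KV states, plus distillation of the frozen model's output distribution
    (full cache vs. reconstructed cache)."""
    recon = np.mean((kv - kv_recon) ** 2)
    p = softmax(logits_full)    # teacher: model run on the full cache
    q = softmax(logits_recon)   # student: model run on the reconstruction
    kl = np.sum(p * (np.log(p) - np.log(q)), axis=-1).mean()
    return alpha * recon + (1 - alpha) * kl

rng = np.random.default_rng(0)
kv = rng.standard_normal((16, 8))
logits = rng.standard_normal((4, 100))
print(cartridge_loss(kv, kv, logits, logits))  # 0.0 when reconstruction is exact
```

The distillation term is what lets the encoder learn task-relevant salience rather than chasing pixel-perfect reconstruction of every KV entry.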

The original paper benchmarks on LLaMA-family models and reports 4–8× memory reduction at 4K–16K token contexts, with perplexity increases of less than 0.5 points on standard language modeling benchmarks. The single-GPU reproduction confirms these numbers hold for the 7B model size on A100 hardware.

How STILL Works

STILL takes a different entry point. Rather than training an external encoder-decoder, STILL applies structured sparsity directly to the KV cache during inference, guided by a lightweight neural predictor trained to identify which KV positions are recoverable. The predictor adds a small inference-time cost — measured in milliseconds, not seconds — but allows the compression ratio to be dynamically adjusted per layer based on attention pattern statistics gathered during the training phase.

The practical difference from Cartridges: STILL doesn't require storing explicit compressed latents. It discards positions it predicts as recoverable and reconstructs them via a small learned fallback. This makes it more memory-aggressive at the cost of a lower quality floor — the reconstruction is probabilistic, not deterministic.
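The predict-and-drop step can be illustrated like this. The single linear scorer is a stand-in assumption, not STILL's actual predictor, and the 25% keep ratio is arbitrary; the point is that scoring costs roughly O(T·d), far less than attention itself:

```python
import numpy as np

rng = np.random.default_rng(1)

T, d = 128, 32
kv = rng.standard_normal((T, d))

# Stand-in lightweight predictor: one linear scorer rating how
# recoverable each position is (trained in practice; random here).
w = rng.standard_normal(d)
recoverability = kv @ w  # one score per position

# Keep the positions predicted hardest to reconstruct; drop the rest.
keep_ratio = 0.25
n_keep = int(T * keep_ratio)
keep_idx = np.sort(np.argsort(recoverability)[:n_keep])  # lowest = least recoverable

compact_kv = kv[keep_idx]
print(compact_kv.shape)  # (32, 32) -> 4x fewer stored positions
```

Dropped positions are later approximated by the learned fallback, which is why the quality guarantee is statistical rather than exact.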

Benchmark Comparison

| Dimension | Cartridges | STILL | Notes |
|---|---|---|---|
| Compression approach | Learned encoder-decoder (explicit latent) | Structured sparsity + neural predictor | Fundamentally different architectures |
| Memory reduction (7B, 8K ctx) | ~6× (paper) / ~5.8× (reproduced) | ~4.5× (paper) / ~4.3× (reproduced) | Reproductions slightly below paper claims |
| Perplexity delta (WikiText-103) | +0.4 ppl (paper) / +0.5 ppl (reproduced) | +0.3 ppl (paper) / +0.4 ppl (reproduced) | Both within acceptable range |
| Training cost (7B model, A100 80GB) | ~18–24h for encoder training | ~8–12h for predictor training | STILL significantly cheaper to fine-tune |
| 32K context stability | Not tested in original paper | Partially tested, degradation noted | Both reproductions flag this as a gap |
| Drop-in compatibility | No (requires training pipeline change) | No (requires training pipeline change) | Neither replaces standard KV management |
| License | Open (reproduction) | Open (reproduction) | Original paper code: check per-repo |
| Single-GPU verified | Yes (A100 80GB) | Yes (A100 80GB) | RTX 4090 partial support reported |

The reproductions are honest about where the numbers diverge: both methods degrade more than advertised at context lengths above 32K, and both have only been verified on LLaMA-family architectures. Mistral and Qwen results are early-stage, with the reproduction authors flagging attention pattern differences that affect the trained predictors.

What the Single-GPU Reproduction Changes

The original papers, like most research in this area, were benchmarked in environments that practitioners cannot easily replicate: multi-GPU clusters, custom CUDA kernels, and evaluation pipelines not released at submission time. This is not bad faith — it's the practical reality of research workflows. But it means that the benchmark numbers in the paper are, until reproduced independently, a single data point from a controlled environment.

Single-GPU reproductions change the epistemics here. When a community researcher posts "I ran Cartridges on an A100 80GB and got these perplexity numbers," and those numbers track within 5% of the paper's claims, that's meaningful confirmation. When they diverge — as they do at 32K+ context — that's equally meaningful signal. The benchmark becomes a live thing rather than a static table in a PDF.

This matters specifically for AI engineering teams evaluating whether to integrate neural compaction into their inference stack, because the integration decision is almost never made in a multi-GPU research environment. It's made by engineers who have one or two A100s, a production model, and a real latency budget.

Why Single-GPU Matters for Real-World Benchmark Evaluation

Independent benchmark verification on commodity hardware surfaces failure modes that cluster experiments miss. The 32K context degradation in both reproductions is a good example: research evaluation often stays within the "sweet spot" context range that makes a method look best. Production inference does not have that luxury. The reproduction benchmark runs expose the edges.

There's also a tooling signal here. The reproduction authors have released evaluation harnesses with pluggable backends, which means researchers can now run comparative benchmarks between Cartridges, STILL, and existing approaches (like H2O and SnapKV) on their own hardware and datasets. This is the infrastructure that turns a paper into a field.

When NOT to Use Neural KV-Cache Compaction

Don't use these methods if your context lengths are under 4K tokens. The training overhead and inference reconstruction cost are not justified by the memory savings at short contexts. Standard KV quantization (e.g., INT8 or INT4 KV caches) is cheaper to implement and recovers most of the memory budget at that scale.
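For comparison, here is what the simpler alternative looks like. This is a minimal symmetric per-tensor INT8 scheme; production KV quantization is usually per-channel or per-head, but even this crude version needs no training phase at all:

```python
import numpy as np

def quantize_int8(x):
    """Symmetric per-tensor INT8 quantization: 4x smaller than fp32
    (2x smaller than fp16), no training required."""
    scale = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(2)
kv = rng.standard_normal((64, 32)).astype(np.float32)
q, s = quantize_int8(kv)
err = np.abs(kv - dequantize_int8(q, s)).max()
print(q.nbytes / kv.nbytes)  # 0.25 -> 4x smaller, with bounded rounding error
```

At short contexts the absolute memory saved by a learned compressor on top of this is small, which is why the training overhead rarely pays off below 4K tokens.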

Don't assume the benchmark numbers transfer to your model architecture. Both Cartridges and STILL were trained and evaluated primarily on LLaMA-2 and LLaMA-3 variants. Architectures with grouped query attention (GQA), sliding window attention, or non-standard positional embeddings (Mistral, Mixtral) have different KV cache structures that the trained encoders/predictors may not handle without retraining.

Don't deploy without domain-matched training data. The neural compressor learns what information matters from the distribution it was trained on. If your production use case is code generation and the model was trained on web text, the compression will likely drop the wrong tokens. This is not a bug — it's a consequence of learned compression — but it's a failure mode that doesn't appear in the original benchmark suite.

Don't treat the reproduction numbers as production-ready benchmarks. The single-GPU reproductions confirm the core claims, but they use evaluation splits from the same datasets as the original papers. Held-out domain generalization benchmarks are still missing from the public reproduction record.

Don't skip the training infrastructure audit. Both methods require a training phase that modifies the model's inference path. This means your serving infrastructure needs to support the modified forward pass, which is non-trivial to integrate with vLLM, TGI, or other optimized serving backends without additional engineering.

Checklist: Evaluating Neural KV-Cache Compaction Before You Commit

  • [ ] Confirm your average production context length exceeds 8K tokens (below this, simpler approaches win)
  • [ ] Verify the architecture is LLaMA-compatible or budget for predictor retraining on your model family
  • [ ] Run the reproduction harness on a held-out dataset from your domain before evaluating paper benchmarks
  • [ ] Measure actual throughput (tokens/second) not just memory — both methods add inference-time overhead
  • [ ] Test at your 95th-percentile context length, not just the average
  • [ ] Confirm your serving stack can accommodate the modified KV cache forward pass
  • [ ] Budget 12–24 GPU-hours for training the compressor on domain-representative data
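The throughput and tail-latency items above can be measured with a small harness like this. The `generate_fn(prompt) -> n_tokens` interface is a hypothetical stand-in for whatever your serving stack exposes:

```python
import time
import statistics

def measure_throughput(generate_fn, prompts, n_warmup=2):
    """Tokens/second and p95 latency for a generation callable.
    `generate_fn(prompt) -> n_tokens` is a stand-in for your serving
    stack's generate call."""
    for p in prompts[:n_warmup]:
        generate_fn(p)  # warm up kernels and caches before timing
    latencies, tokens = [], 0
    for p in prompts:
        t0 = time.perf_counter()
        tokens += generate_fn(p)
        latencies.append(time.perf_counter() - t0)
    return {
        "tokens_per_sec": tokens / sum(latencies),
        "p95_latency_s": statistics.quantiles(latencies, n=20)[-1],
    }

# Usage with a dummy backend that "generates" 64 tokens per request:
stats = measure_throughput(lambda prompt: 64, ["hello"] * 20)
print(sorted(stats))  # ['p95_latency_s', 'tokens_per_sec']
```

Run it twice, with and without the compressor enabled, at your 95th-percentile context length; the delta is the number that matters, not the paper's headline ratio.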

Where This Is Heading

The benchmark surface is about to get crowded. Cartridges and STILL are not the only neural compaction methods in the pipeline. PyramidKV, MagicPIG, and SnapKV all have variations on the learned-compression theme, and cross-method benchmark comparisons on standardized hardware will become the evaluation norm over the next 12 months. The single-GPU reproduction culture emerging around Cartridges and STILL is setting a precedent for what reproducibility means in this subfield.

Serving stack integration is the next bottleneck. Right now, neither method integrates cleanly with vLLM's PagedAttention or SGLang's RadixAttention without custom patches. The community reproductions surface this clearly — the evaluation harnesses are standalone, not serving-stack integrations. Expect the next wave of engineering work to focus on production-compatible implementations rather than new compression architectures.

Multi-modal KV caches will stress-test these methods in new ways. Vision-language models have KV caches with substantially different statistical properties than text-only models — image patch tokens have high spatial redundancy that pure attention-score-based eviction misses. Neural compaction methods that learn from data may actually have an advantage here, but no benchmarks exist yet.

Quantization and neural compaction are likely to converge. Current implementations treat these as separate techniques. The more interesting research direction — already appearing in a few preprints — is training the neural compressor jointly with quantization-aware objectives, so the latent space is optimized for both size and bit-width simultaneously.

The reproducibility norm is shifting. Papers that don't release code or that benchmark only on proprietary hardware will face increasing skepticism from reviewers and practitioners. The Cartridges and STILL reproduction projects are partly a response to that pressure, and they're raising the floor for what "open-source" means in the inference optimization space.

FAQ

Does neural KV-cache compaction work without retraining the base LLM? Yes, with caveats. The base model weights are frozen. You're training the encoder (Cartridges) or predictor (STILL) as separate modules that intercept the KV cache during the attention computation. However, "no retraining" is slightly misleading — you're still running a supervised training pass on domain data to calibrate the compressor, which requires GPU compute and representative examples. It's closer to adapter training than fine-tuning in terms of cost.

How does this compare to SnapKV and H2O on the same benchmark tasks? The reproduction benchmarks include partial comparisons. On standard language modeling tasks (WikiText-103, PG-19), Cartridges outperforms H2O at equivalent compression ratios by 0.2–0.4 perplexity points. STILL is closer to SnapKV in quality. The more important comparison — on long-document QA tasks like SCROLLS — is less clear because the reproductions don't fully cover that benchmark suite. Current evidence doesn't confirm which method wins at task-specific long-context evaluation.

Can these methods be combined with KV cache quantization? In principle, yes. In practice, the current open-source implementations don't support combined pipelines. Running INT8 quantization on the already-compressed latents is theoretically sound — you'd stack two compression mechanisms — but the interaction effects on quality haven't been benchmarked. This is a real gap in the current reproduction work.

Is an A100 80GB required, or can this run on consumer hardware? The reproduction authors report partial success on RTX 4090 (24GB VRAM) for 7B models with aggressive batch size reduction. Inference with the trained compressor fits on a 4090; training the encoder or predictor does not, at least not without gradient checkpointing and significant throughput reduction. For 13B+ models, the A100 80GB is effectively the minimum for training.

What's the realistic throughput overhead at serving time? Cartridges adds a decoder step before each attention computation, which the reproduction authors measure at roughly 8–15% latency overhead for 7B models at 8K context. STILL's predictor overhead is lower — around 3–7% — because it's a lightweight scoring pass, not a full decode. Both are within acceptable margins for memory-constrained deployments where the alternative is adding a second GPU.

Are the reproductions peer-reviewed? No. They are community engineering reproductions, not peer-reviewed publications. The value is empirical verification of paper claims on accessible hardware, not novel scientific contribution. Treat the numbers as a second data point, not an authoritative benchmark.

Should I wait for vLLM integration before evaluating these methods? If you're evaluating for production deployment, yes — the standalone evaluation harnesses are too far from a serving stack to give realistic throughput numbers. If you're evaluating for research or feasibility, the current reproductions are sufficient to form a view on quality trade-offs. The serving stack integration question is separable from the compression quality question.

Matthieu Morel, AI Systems & Technology Editor

I started writing code when I was 14 and never fully stopped, even after I began writing about it. Since 2015 I've been dedicated to AI research, and I earned my PhD in Computer Science with a thesis on Optimization and Stability in Non-Convex Learning Systems. I've read more technical papers than you can imagine, played with hundreds of tools, and currently run a large local setup where I have fun deploying and testing models.