Chameleon is an open-source, stateless AI LLM runtime that dynamically loads, executes, and unloads large language models on demand, routing each request to the optimal model and then freeing VRAM by unloading it.

Is Chameleon free and open source?

Yes. Chameleon is released under the MIT License as community-maintained open-source software, free to use for both commercial and private projects with no commercial licensing model.

How does Chameleon save VRAM?

Chameleon keeps no model permanently resident. It loads a model only to run a request, then unloads it to a blank state, and uses a bounded VRAM cache with configurable warm slots so you can trade memory for latency.
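
The memory-for-latency trade-off can be illustrated with a small least-recently-used cache. This is a minimal sketch, not Chameleon's actual code: the class name `WarmSlotCache`, the `warm_slots` parameter, and the `load`/`unload` callbacks are all hypothetical, and LRU eviction is an assumed policy that this listing does not confirm.

```python
from collections import OrderedDict
from typing import Callable, Generic, List, TypeVar

M = TypeVar("M")


class WarmSlotCache(Generic[M]):
    """Hypothetical bounded cache: at most `warm_slots` models stay loaded."""

    def __init__(self, warm_slots: int,
                 load: Callable[[str], M],
                 unload: Callable[[M], None]) -> None:
        if warm_slots < 0:
            raise ValueError("warm_slots must be >= 0")
        self.warm_slots = warm_slots
        self._load = load      # assumed to place the model in VRAM
        self._unload = unload  # assumed to free the model's VRAM
        self._resident: "OrderedDict[str, M]" = OrderedDict()

    def run(self, model_id: str, request: Callable[[M], str]) -> str:
        # Warm hit: reuse the resident model and mark it most recently used.
        if model_id in self._resident:
            self._resident.move_to_end(model_id)
            return request(self._resident[model_id])

        # Miss: evict least recently used models *before* loading, so the
        # number of loaded models never exceeds the configured bound.
        while self._resident and len(self._resident) >= self.warm_slots:
            _, evicted = self._resident.popitem(last=False)
            self._unload(evicted)

        model = self._load(model_id)
        if self.warm_slots > 0:
            self._resident[model_id] = model  # keep warm for later requests
            return request(model)
        try:
            return request(model)
        finally:
            self._unload(model)  # zero warm slots: back to a blank state

    def resident(self) -> List[str]:
        """Model ids currently kept warm, least recently used first."""
        return list(self._resident)
```

With `warm_slots=0` every request pays the full load cost and nothing stays in VRAM afterwards. Raising `warm_slots` keeps recently used models loaded, so repeat requests skip the load step at the price of held memory.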

Which inference backends does Chameleon support?

Chameleon uses a pluggable inference layer supporting llama-cpp-python, vLLM, Transformers, and ExLlamaV2, with a Rust control plane for routing and lifecycle management and gRPC-coordinated worker pools.
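
One common way to build a pluggable inference layer is a shared interface plus a registry of backend factories. The sketch below shows that general pattern only. `InferenceBackend`, `register_backend`, `run_once`, and the stand-in `EchoBackend` are hypothetical names, and no real llama-cpp-python, vLLM, Transformers, or ExLlamaV2 calls are made; a real adapter would wrap one of those libraries behind the same three methods.

```python
from typing import Callable, Dict, Optional, Protocol


class InferenceBackend(Protocol):
    """Hypothetical interface every backend adapter would implement."""

    def load(self, model_path: str) -> None: ...
    def generate(self, prompt: str, max_tokens: int) -> str: ...
    def unload(self) -> None: ...


# Registry mapping a backend name to a factory that builds an adapter.
BACKENDS: Dict[str, Callable[[], InferenceBackend]] = {}


def register_backend(name: str):
    """Decorator that records a backend class under `name`."""
    def decorator(factory):
        BACKENDS[name] = factory
        return factory
    return decorator


@register_backend("echo")
class EchoBackend:
    """Stand-in backend for illustration: returns the first words of the prompt."""

    def __init__(self) -> None:
        self.model_path: Optional[str] = None

    def load(self, model_path: str) -> None:
        self.model_path = model_path

    def generate(self, prompt: str, max_tokens: int) -> str:
        if self.model_path is None:
            raise RuntimeError("no model loaded")
        return " ".join(prompt.split()[:max_tokens])

    def unload(self) -> None:
        self.model_path = None


def run_once(backend_name: str, model_path: str, prompt: str,
             max_tokens: int = 16) -> str:
    """Load, generate, and always unload, whatever backend is selected."""
    backend = BACKENDS[backend_name]()
    backend.load(model_path)
    try:
        return backend.generate(prompt, max_tokens)
    finally:
        backend.unload()
```

Because callers only depend on the three-method interface, swapping backends becomes a matter of changing the registered name rather than the calling code.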

Chameleon

Overview / Description

Chameleon is an open-source AI LLM runtime that dynamically loads, runs, and unloads large language models on demand instead of keeping them permanently in memory. It is built for teams and organizations running multiple specialized LLMs on limited VRAM, who need different models for tasks like coding, reasoning, summarization, and chat without paying for idle GPU memory.

Its core workflow routes each request to the optimal model, loads that model, executes the request, then unloads it back to a blank state to free VRAM. Chameleon supports rules-based or ML-based intent classification for model routing, a bounded VRAM cache with configurable warm slots to trade memory for latency, and hot model registration without a restart.

Architecturally it pairs a Rust control plane that handles the gateway, routing logic, lifecycle management, and VRAM budgeting with a Python skills layer that loads and runs models through pluggable inference backends including llama-cpp-python, vLLM, Transformers, and ExLlamaV2. A multi-worker pool is coordinated over gRPC, built-in telemetry tracks metrics in SQLite, and a distributed mode adds Kubernetes support.

Chameleon is released under the MIT License as community-maintained open-source software with no commercial licensing model, making it free to use for both commercial and private projects.
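
Rules-based intent classification can be as simple as an ordered list of patterns, each mapped to a model. The following is a minimal sketch under stated assumptions: the `Rule` type, the `route` function, the regular expressions, and the model ids such as `coder-model` are illustrative inventions, not Chameleon's configuration format or bundled models.

```python
import re
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Rule:
    pattern: re.Pattern  # compiled regex tested against the prompt
    model_id: str        # hypothetical id of the model to route to


# Rules are checked in order; the first match wins.
RULES = (
    Rule(re.compile(r"\b(def|class|function|bug|compile|stack trace)\b", re.I),
         "coder-model"),
    Rule(re.compile(r"\b(summari[sz]e|shorten)\b", re.I),
         "summarizer-model"),
    Rule(re.compile(r"\b(prove|step by step|why)\b", re.I),
         "reasoning-model"),
)
DEFAULT_MODEL = "chat-model"  # fallback when no rule matches


def route(prompt: str, rules: Sequence[Rule] = RULES,
          default: str = DEFAULT_MODEL) -> str:
    """Return the model id of the first rule whose pattern matches the prompt."""
    for rule in rules:
        if rule.pattern.search(prompt):
            return rule.model_id
    return default
```

For example, `route("Fix the bug in this function")` selects `coder-model`, while a prompt that matches no rule falls through to `chat-model`. An ML-based classifier could replace `route` behind the same signature, taking a prompt and returning a model id.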

Used For

Running multiple LLMs on limited VRAM, dynamic model routing per task, on-demand model loading and unloading, optimizing GPU memory costs, self-hosting an LLM inference runtime

Pricing

Plan

Free