BestAIFor.com

Chameleon

Chameleon is a stateless LLM runtime designed for efficient multi-model deployments. Instead of keeping models resident in memory, it routes each inference request to the appropriate model, loads it just-in-time, runs the request, then fully unloads — leaving zero idle VRAM. This makes it practical to run a large roster of models on constrained hardware without the overhead of managing separate processes or restarting systems between model switches. It's particularly useful for developers and teams who need to serve multiple specialized models (e.g., coding, summarization, embeddings) from a single machine without the memory bloat that comes from keeping them all warm simultaneously.
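The load-on-request, run, then fully-unload cycle described above can be sketched roughly as follows. This is a minimal illustration only: the `JITRuntime` class, its `loaders` registry, and the `infer` method are hypothetical names, not Chameleon's actual API.

```python
# Hypothetical sketch of Chameleon-style just-in-time model routing.
# All names here are illustrative assumptions, not Chameleon's real interface.

class JITRuntime:
    """Routes each request to a model loaded only for that request's duration."""

    def __init__(self, loaders):
        # loaders maps a model name to a zero-argument function that
        # loads the model and returns a callable (prompt -> completion).
        self.loaders = loaders

    def infer(self, model_name, prompt):
        model = self.loaders[model_name]()  # just-in-time load
        try:
            return model(prompt)            # run the single request
        finally:
            del model                       # fully unload: no idle memory held


# Usage with stub "models" standing in for real weights:
runtime = JITRuntime({
    "echo": lambda: (lambda p: p),
    "summarize": lambda: (lambda p: p[:20] + "..."),
})
print(runtime.infer("echo", "hello"))
```

A real runtime would of course manage GPU memory, tokenization, and concurrency, but the core idea is the same: no model stays resident between requests, so many specialized models can share one machine's memory budget.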
