lmxd VRAM Daemon Crams Three AI Agents onto an 8GB GPU with Intelligent Memory Swapping

Running even a single local large language model on a consumer GPU can push hardware to its limits. But what if you need three AI agents working simultaneously on a single GeForce GTX 1080 with just 8GB of VRAM? A new open-source daemon called lmxd claims to make that possible by intelligently sharing GPU memory and swapping inference contexts on the fly. The project, surfaced in developer forums, challenges the assumption that serious local AI requires expensive, high‑VRAM hardware.

At its core, lmxd is a C++ background service that wraps around a single llama.cpp inference backend. Instead of each AI agent launching its own model instance—which would quickly exhaust available video memory—lmxd acts as a middleman. It admits multiple models against a virtual “VRAM ledger” and reuses the same underlying engine, drastically reducing the memory footprint. When three small agents load, they don't each consume a full 2–3 GB; they share the llama.cpp runtime and only keep resident the parts absolutely needed at any given moment.

The secret sauce is KV‑cache swapping. Every large language model maintains a key‑value cache that stores the context of the current conversation. Without it, the model would have to re‑process the entire history for every token, making long chats impossibly slow. But that cache can be several gigabytes for a typical 7‑billion‑parameter model. lmxd tracks precisely how much VRAM each agent’s KV cache occupies via its ledger. When agent A is idle and agent B needs to generate, lmxd saves agent A’s cache to system RAM—or even to an NVMe drive if RAM is tight—then loads agent B’s cache onto the GPU. The switch happens in under a second for small models, making the round‑robin feel nearly seamless.

Developer feedback from early testers highlights the practicality. One user reported running three distinct agents on a Windows 11 machine with the aging GTX 1080: a coding assistant using a 1.5‑billion‑parameter model, a general‑purpose chat agent with a 3‑billion‑parameter model, and a small RAG‑based search agent. All three stayed responsive as long as not all were prompted simultaneously. “It’s like having a GPU concierge,” the user wrote. “The ledger ensures we never oversubscribe VRAM, and the swaps happen fast enough that I don’t notice delays between agent turns.”

The VRAM ledger itself is a lightweight, in‑memory data structure that lmxd updates in real time. Each model that connects to the daemon declares its expected memory usage—model weights plus maximum KV cache size—and the ledger approves or denies requests based on available headroom. This avoids the dreaded out‑of‑memory errors that plague users who try to run multiple llama.cpp instances manually. When an agent finishes its turn and releases its cache, lmxd reclaims the space and marks it ready for the next agent. If a new model attempts to load but the ledger shows insufficient free VRAM, the daemon can either trigger an emergency swap of the least‑recently‑used agent or queue the request until space frees up.

Performance data shared in the original discussion paints an encouraging picture. On the GTX 1080 test system, swapping three 3‑billion‑parameter models between system RAM (32 GB DDR4) and the GPU added roughly 200–500 milliseconds of latency per agent switch. For a typical conversational interaction where an agent generates a response in 2–5 seconds, the overhead is negligible. However, if all three agents were asked to generate tokens simultaneously, the round‑robin scheduling would introduce noticeable pauses, as each agent must wait its turn to occupy the GPU. lmxd is designed for interactive use cases where only one agent is active at a time, making it a perfect fit for personal assistant setups, home automation controllers, or development environments where you juggle different AI tools.

Crucially, lmxd relies on llama.cpp’s support for half‑precision floating point and various quantization techniques. The project’s documentation recommends using 4‑bit or 5‑bit quantized models to keep weights small enough that multiple models can fit in the VRAM budget. For example, a 3‑billion‑parameter model quantized to 4 bits occupies around 1.8 GB, leaving room for one or two more of similar size within 8 GB after accounting for overhead. The daemon does not merge models or quantize them on the fly; it expects the user to provide pre‑quantized GGUF files. This keeps the tool simple and avoids the complexity of dynamic precision adjustment.

Security and isolation concerns are minimal because lmxd runs entirely on the local machine. Each agent is just a client connecting to the daemon over a Unix socket or TCP port. The agents never share context or data unless explicitly programmed to do so. The VRAM ledger ensures that a memory‑hungry agent cannot starve others—an accidental benefit for stability. Early adopters have noted that lmxd could become a foundation for more sophisticated AI orchestration on Windows, potentially allowing a single GPU to serve multiple users on a home network or power a multi‑modal assistant that switches between vision, text, and sound models.

Comparisons with other memory‑saving techniques are inevitable. Nvidia’s own unified memory can spill over to system RAM, but it often cripples performance due to high‑latency page faults. Projects like vllm and text‑generation‑inference focus on batching multiple requests through a single model, not on hosting multiple distinct models. lmxd fills a niche: running several small, independent models on a single GPU by treating VRAM as a pooled resource with aggressive caching. It is, in essence, a memory‑swap‑based task scheduler for AI inference.

Critics point out that the approach may not scale gracefully to larger models. A 13‑billion‑parameter model, even at 4 bits, needs about 7 GB of VRAM, leaving no room for a second model on an 8‑GB card. The daemon would then only be able to host one agent, defeating its purpose. However, lmxd’s developer hints at future support for “model streaming,” where only a subset of layers are kept on the GPU at a time, potentially enabling a 13‑billion‑parameter model to share space with a tiny 1‑billion‑parameter helper. That remains experimental and would require deeper integration with llama.cpp’s layer‑offloading mechanism.

For Windows enthusiasts, lmxd represents a democratizing step. Many PC gamers and hobbyists own older GPUs like the GTX 1080 or RTX 2060 (6 GB). Until now, local AI tinkerers often resigned themselves to running one model at a time or juggling models by hand. With lmxd, they can keep a lightweight coding autocompletion agent always loaded while a more capable chat model sleeps in RAM, ready to be swapped in within seconds. The setup is particularly appealing for those experimenting with AI‑powered automation in tools like Home Assistant, where different tasks (voice control, image analysis, text summarization) demand different models.

Installation on Windows is straightforward for those comfortable with the command line. After building lmxd from source or downloading a pre‑compiled binary, users start the daemon with a simple configuration file that specifies the path to the llama.cpp backend and the models to preload. Each agent is launched with a command‑line argument pointing to the daemon’s socket. From the agent’s perspective, it’s talking to a local API endpoint identical to llama.cpp’s server mode. The VRAM ledger and swapping logic are completely transparent.

Community enthusiasm has already sparked discussions about integrating lmxd with popular front‑ends like Ollama or LM Studio. If those projects adopt the daemon as a backend option, even non‑technical users could enjoy multi‑agent setups through a familiar GUI. The original forum thread includes snippets of a Python wrapper that mimics the OpenAI API, making it trivial for existing tools to connect. “I replaced the API base URL with localhost:8081 and my three agents started sharing my 1080 without a single code change in my app,” one developer reported.

Looking ahead, the lmxd paradigm could extend beyond Windows. Linux builds are already functional, and the daemon’s core design—a ledger‑based memory arbitrator—is platform‑agnostic. As on‑device AI grows more sophisticated, we may see similar techniques built into operating systems themselves. Imagine Windows 12 with a native “AI manager” that allocates GPU resources among Copilot, third‑party assistants, and gaming AI features in real time. lmxd proves the concept is viable today with commodity hardware.

The project is not without rough edges. Error handling when a swap fails—for instance, if system RAM is exhausted—needs improvement. Currently, the daemon may crash, requiring a manual restart. Documentation is sparse, and the build process demands familiarity with C++ and CMake. But for early adopters, the payoff is substantial: a $200 used GPU can now behave like a mini AI server hosting multiple personalities.

In the end, lmxd is a clever engineering hack that turns a hard limitation into a manageable constraint. It doesn’t break the laws of physics; it simply acknowledges that not every agent needs to be constantly resident in VRAM. By marrying a precise memory ledger with fast context swapping, it opens a new frontier for local AI experimentation. For Windows users hungry to push their aging hardware further, lmxd might just be the most important open‑source tool they install this year.