Microsoft Opens Windows AI Experimental LLM APIs to RTX 30-Series and Newer GPUs, 6GB VRAM Required

Microsoft has quietly unlocked experimental artificial intelligence APIs that let Windows 11 PCs with Nvidia GeForce RTX 30-series or newer GPUs run powerful local language models. The move, which requires at least 6GB of dedicated video memory, effectively extends some of the Copilot+ AI capabilities to a broader range of existing hardware.

Until now, the most advanced on-device AI features in Windows 11—such as Recall, Cocreator, and Windows Studio Effects—were exclusive to Copilot+ branded PCs, all of which pack neural processing units (NPUs) with 40 TOPS or more. Those machines debuted with Qualcomm’s Snapdragon X Elite and Plus processors, leaving tens of millions of RTX-equipped desktops and laptops on the sidelines. The new experimental APIs change that calculus.

A Quiet Expansion of Windows AI

The experimental APIs surfaced in late 2024 inside preview builds of the Windows App SDK. Developers who target these interfaces can tap into locally executing large language models without touching the cloud. Microsoft’s documentation describes them as “a set of experimental APIs to integrate on-device language models into your Windows applications.” The models run via DirectML, the hardware-accelerated machine learning layer that leverages both dedicated NPUs and GPUs.

On Copilot+ PCs, the APIs default to the integrated NPU for maximum power efficiency. On other systems with a supported RTX GPU and at least 6GB of VRAM, they fall back to the GPU. Microsoft explicitly lists Nvidia GeForce RTX 30-series and newer as compatible. RTX 20-series cards, despite having Tensor Cores, are not called out, nor are AMD Radeon or Intel Arc GPUs, at least not yet.

What the APIs Unlock

The experimental surface includes two core components: a text generation API and a chat completion API. Both accept prompts and return token streams, much like popular cloud-based LLM services. The underlying model appears to be a variant of Microsoft’s Phi Silica, a 3.3-billion-parameter transformer from the Phi family of small language models, optimized for on-device inference. Phi Silica was initially tailored for the Qualcomm NPU in Copilot+ PCs, but the new APIs bind to a broader DirectML backend.
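
To illustrate what consuming a token stream looks like from the application side, here is a minimal Python sketch. It does not call the Windows APIs: fake_token_stream is a hypothetical stand-in for a local model, and collect_stream is an illustrative helper name, not part of any SDK. The point is only the pattern of showing each token as it arrives while also assembling the full reply.

```python
from typing import Callable, Iterable, Iterator


def fake_token_stream(prompt: str) -> Iterator[str]:
    """Hypothetical stand-in for a local model: yields a canned reply token by token."""
    for token in ["Local ", "models ", "stream ", "tokens."]:
        yield token


def collect_stream(tokens: Iterable[str], on_token: Callable[[str], None]) -> str:
    """Forward each token to a callback as it arrives, then return the full text."""
    parts = []
    for token in tokens:
        on_token(token)       # e.g. append to a text box while generation continues
        parts.append(token)
    return "".join(parts)


shown = []                    # stands in for the UI; records tokens in arrival order
reply = collect_stream(fake_token_stream("Summarize this note"), shown.append)
print(reply)
```

In a real application the callback would update the interface, so the user sees text appear during generation instead of waiting for the complete response.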

Developers can already call the APIs from WinUI 3, C++, WPF, or WinForms applications. Early sample code shows a few dozen lines of C# invoking LanguageModel.TextGenerationPromptAsync to get a response. The model understands natural language, generates text, summarizes documents, and performs basic reasoning. Because inference runs entirely on-device, there is no network round trip, and prompts and outputs are not sent to a cloud service.

System Requirements and Fine Print

To use the experimental APIs on an RTX GPU, the following must be met:

Windows 11, version 22H2 or later (build 22621 or higher)
Windows App SDK 1.6 Experimental 2 or newer
Nvidia GeForce RTX 30-series, 40-series, or newer GPU
At least 6GB of dedicated VRAM
Latest Nvidia Game Ready or Studio Driver (R535 or later)
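
The checklist above can be expressed as a simple pre-flight check. The Python sketch below is illustrative only: SystemInfo and unmet_requirements are hypothetical names, and a real application would have to query these values from the OS and the driver. The Windows App SDK version is a project dependency and is not checked here. The thresholds are the ones listed in this article.

```python
from dataclasses import dataclass
from typing import List, Optional

# Thresholds taken from the requirements list in this article.
MIN_WINDOWS_BUILD = 22621   # Windows 11, version 22H2
MIN_RTX_SERIES = 30         # GeForce RTX 30-series or newer
MIN_VRAM_GB = 6
MIN_DRIVER_BRANCH = 535     # R535 or later


@dataclass
class SystemInfo:
    """Hypothetical container for values an app would query at startup."""
    windows_build: int
    rtx_series: Optional[int]   # 20, 30, 40, ...; None if no RTX GPU is present
    vram_gb: float
    driver_branch: int


def unmet_requirements(info: SystemInfo) -> List[str]:
    """Return a human-readable list of requirements the system does not meet."""
    problems = []
    if info.windows_build < MIN_WINDOWS_BUILD:
        problems.append(f"Windows build {info.windows_build} is below {MIN_WINDOWS_BUILD}")
    if info.rtx_series is None or info.rtx_series < MIN_RTX_SERIES:
        problems.append("GPU is not a GeForce RTX 30-series or newer")
    if info.vram_gb < MIN_VRAM_GB:
        problems.append(f"{info.vram_gb}GB VRAM is below {MIN_VRAM_GB}GB")
    if info.driver_branch < MIN_DRIVER_BRANCH:
        problems.append(f"Driver R{info.driver_branch} is older than R{MIN_DRIVER_BRANCH}")
    return problems


desktop = SystemInfo(windows_build=22631, rtx_series=40, vram_gb=8.0, driver_branch=550)
print(unmet_requirements(desktop) or "All listed requirements met")
```

Returning the full list of problems, instead of stopping at the first, lets an app tell the user everything that needs updating in one pass.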

On systems with less than 6GB VRAM or unsupported GPUs, the API calls will fail gracefully. Microsoft notes that performance may vary based on GPU speed, available VRAM, and system power state. The model loads into GPU memory on first use, consuming roughly 3–4GB of VRAM for the base model. That leaves enough headroom for light applications but may cause contention on 6GB cards if other GPU-intensive tasks are running.
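
The headroom concern is simple subtraction, sketched below in Python. The 3–4GB model footprint comes from the figure quoted above; the function name and the 2.5GB "other usage" value are illustrative assumptions, not measurements.

```python
from typing import Tuple

# Rough footprint range for the base model, as quoted in this article.
MODEL_FOOTPRINT_GB = (3.0, 4.0)


def vram_headroom_gb(total_vram_gb: float, other_usage_gb: float = 0.0) -> Tuple[float, float]:
    """Return (worst_case, best_case) free VRAM in GB after the model loads.

    A negative value means the model and the other workload do not fit together.
    """
    low_model, high_model = MODEL_FOOTPRINT_GB
    available = total_vram_gb - other_usage_gb
    return (available - high_model, available - low_model)


print(vram_headroom_gb(6.0))        # idle 6GB card
print(vram_headroom_gb(12.0))       # idle 12GB card
print(vram_headroom_gb(6.0, 2.5))   # 6GB card with a hypothetical 2.5GB already in use
```

On an idle 6GB card this leaves 2–3GB free, but a workload already holding 2.5GB pushes the worst case below zero, which is the contention scenario described above.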

How It Compares to Copilot+ PC Execution

On Copilot+ PCs, the same APIs route to the NPU, which is far more power-efficient. The Qualcomm NPU in a Snapdragon X Elite can sustain 40 TOPS while drawing just a few watts, allowing continuous AI workloads without noticeable battery drain. An RTX 30-series GPU, by contrast, may consume 50–150 watts under inference load, making it impractical for always-on background tasks on a laptop. For desktop users and plugged-in laptops, however, the GPU path delivers higher throughput and lower latency thanks to massive parallel compute.
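
A back-of-the-envelope Python calculation shows why the power gap matters on battery. The 50–150 watt GPU range is the figure quoted above. The 3-watt NPU draw (standing in for "a few watts") and the 70Wh battery are assumptions chosen for illustration, and the estimate ignores the rest of the system's power draw.

```python
def hours_on_battery(battery_wh: float, draw_watts: float) -> float:
    """Hours a battery would last if this accelerator were the only load."""
    return battery_wh / draw_watts


BATTERY_WH = 70.0            # assumption: a typical laptop battery capacity
NPU_WATTS = 3.0              # assumption: stand-in for "a few watts"
GPU_WATTS = (50.0, 150.0)    # range quoted in this article

print(f"NPU: {hours_on_battery(BATTERY_WH, NPU_WATTS):.1f} h")
print(f"GPU at 50 W: {hours_on_battery(BATTERY_WH, GPU_WATTS[0]):.1f} h")
print(f"GPU at 150 W: {hours_on_battery(BATTERY_WH, GPU_WATTS[1]):.1f} h")
```

Under these assumptions, continuous inference would drain the battery in well under two hours on the GPU, versus roughly a day on the NPU, which is why always-on background workloads favor the NPU path.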

There may also be a model-size difference. The Phi Silica model running on NPUs is reportedly tuned to 3.3 billion parameters with 4-bit quantization. The GPU-flavored model distributed via the experimental API may use a slightly larger configuration or a different quantization scheme to better utilize GPU Tensor Cores. Microsoft has not published detailed benchmarks yet.
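
For a sense of scale, the size of a model's weights is roughly the parameter count times the bits per weight. The Python sketch below applies that to the reported 3.3 billion parameters; the 8-bit and 16-bit rows are hypothetical alternatives for comparison, not configurations Microsoft has confirmed. The estimate covers weights only and excludes runtime buffers such as activations and cached context.

```python
def weight_footprint_gb(params: float, bits_per_weight: int) -> float:
    """Approximate size of the model weights in GB (1 GB = 1e9 bytes)."""
    return params * bits_per_weight / 8 / 1e9


PARAMS = 3.3e9  # parameter count reported for Phi Silica

for bits in (4, 8, 16):   # 4-bit is the reported NPU setting; 8 and 16 are hypothetical
    print(f"{bits}-bit weights: {weight_footprint_gb(PARAMS, bits):.2f} GB")
```

At 4 bits the weights alone come to about 1.65GB, and at 8 bits about 3.3GB, while 16-bit weights (about 6.6GB) would not fit on a 6GB card at all. That arithmetic suggests some form of quantization is likely on the GPU path too, though the exact scheme is not public.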

Who Benefits and Use Cases

This expansion serves three distinct audiences:

Developers can start building and testing local AI features without waiting for NPU hardware. Game modders, creative tool authors, and enterprise devs can integrate offline text generation, summarization, or contextual help into their apps right now on an RTX 30-series or newer card.
Enthusiasts and power users who own high-end gaming rigs with RTX 30-series or 40-series GPUs can experiment with local AI workloads that previously required a Copilot+ PC. They might run custom chatbots, local document Q&A tools, or AI-assisted coding environments entirely offline.
IT departments can evaluate on-device AI without purchasing new hardware. For sensitive environments where data cannot touch the cloud, an existing RTX GPU becomes a viable path to deploying local LLM capabilities.

Real-world example: a developer building a WinForms application that summarizes long technical documents could embed the chat completion API. The app runs entirely on the user’s RTX 3060 with 12GB VRAM, delivering near-instant results without internet connectivity.

The Fine Line Between Experimental and Stable

Microsoft labels the entire surface “experimental,” meaning the APIs may change or vanish in future Windows App SDK releases. There is no guarantee of backward compatibility, and Microsoft warns against shipping production applications built solely on these interfaces. The early feedback loop is clearly intended to shape a stable version that might eventually land in mainstream Windows releases.

This experimental nature carries risk. Developers who invest time now may face breaking changes when the APIs graduate. Yet the opportunity is compelling: Microsoft is essentially opening a side door to powerful local AI on hardware already in the wild, potentially making “Copilot+ features without a Copilot+ PC” a reality for tasks that don’t require the NPU’s always-on efficiency.
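
One common way to limit exposure to breaking changes is to keep experimental calls behind a small interface of your own, so only one class needs rewriting when the API changes. The Python sketch below shows the pattern in the abstract. TextGenerator, StubGenerator, and summarize are hypothetical names; the stub stands in for a wrapper around the experimental API, which is not called here.

```python
from typing import Protocol


class TextGenerator(Protocol):
    """App-owned interface; the rest of the app depends only on this."""

    def generate(self, prompt: str) -> str: ...


class StubGenerator:
    """Hypothetical placeholder for a wrapper around the experimental API.

    In a real app, this class would be the only code that touches the
    experimental surface, so a breaking change is contained here.
    """

    def __init__(self) -> None:
        self.last_prompt = ""

    def generate(self, prompt: str) -> str:
        self.last_prompt = prompt
        return "[stub summary]"


def summarize(document: str, generator: TextGenerator) -> str:
    """Application logic written against the interface, not the vendor API."""
    return generator.generate("Summarize the following text:\n" + document)


generator = StubGenerator()
print(summarize("Experimental APIs may change between releases.", generator))
```

The same seam also makes it straightforward to swap in a different backend later, or to test application logic without a supported GPU.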

What’s Missing and What’s Next

Not all Copilot+ AI features will come to RTX GPUs via these APIs. Recall, for example, relies on continuous screen semantic indexing—a workload that needs an NPU’s ultra-low-power characteristics to run persistently without crippling battery life or thermals. Windows Studio Effects and Cocreator also lean heavily on NPU-optimized inference pipelines. The experimental APIs focus on language generation only; vision and audio models remain out of scope.

Microsoft has hinted at broader GPU support beyond Nvidia. A DirectML roadmap suggests eventual compatibility with AMD RDNA3 and Intel Arc discrete GPUs, but no timeline is public. The present Nvidia-only requirement likely reflects the maturity of Nvidia’s DirectML driver support and the widespread availability of RTX Tensor Cores, which dramatically accelerate transformer inference.

Another open question is whether the model itself will be updatable. Currently, the APIs ship with a fixed Phi Silica variant. As Microsoft iterates on Phi, the baked-in model risks becoming stale. A future update could allow developers to bring their own fine-tuned models via ONNX or a similar format, but nothing has been announced.

Community Catalyst and Developer Adoption

The Windows developer community has received the news with cautious optimism. Early adopters on Reddit and GitHub note that even a laptop with a 6GB RTX 3060 can run the chat completion model at interactive speeds, generating 15–20 tokens per second. For simple tasks this compares favorably with cloud services, since there is no network overhead and prompts stay on the device.
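
To put the reported throughput in practical terms, the short Python calculation below converts tokens per second into wait time. The 15–20 tokens-per-second range is the community figure quoted above; the 300-token response length is an assumption for illustration.

```python
def generation_seconds(num_tokens: int, tokens_per_second: float) -> float:
    """Time to generate a response of the given length at a steady rate."""
    return num_tokens / tokens_per_second


RESPONSE_TOKENS = 300   # assumption: a response of a few paragraphs

slow = generation_seconds(RESPONSE_TOKENS, 15)   # 20.0 seconds
fast = generation_seconds(RESPONSE_TOKENS, 20)   # 15.0 seconds
print(f"{RESPONSE_TOKENS} tokens: {fast:.0f}-{slow:.0f} seconds")
```

A 15 to 20 second wait for a full response is tolerable mainly because tokens can be displayed as they are generated, so the user starts reading almost immediately.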

Some developers have already started prototyping offline assistants that hook into local file systems, much like early versions of Windows Copilot, but without the Copilot+ hardware requirement. The ability to run a reasonably capable LLM entirely on-device without paying per token has ignited discussions about a new wave of locally intelligent Windows applications.

Microsoft’s decision to gate the feature behind 6GB VRAM, rather than requiring an NPU, signals a pragmatic acknowledgment that discrete GPUs remain the most capable AI accelerators in the x86 ecosystem. With tens of millions of RTX 30-series and 40-series GPUs already in users’ hands, the addressable market is enormous.

Closing the Gap with Apple and Others

Apple has shipped local AI models through Core ML for years, and the M-series chips include a unified memory architecture that makes large on-device models feasible. Microsoft’s Windows AI strategy has lagged, fragmented between x86 CPUs, integrated GPUs, and the relatively nascent NPU ecosystem. By embracing discrete GPUs through DirectML, Microsoft takes a significant step toward parity.

The move also pressures AMD and Intel to improve their Windows AI drivers and optimize for DirectML. Intel’s Core Ultra processors have integrated NPUs, and AMD’s RDNA3 GPUs are DirectML-capable, but neither is currently supported by these experimental APIs. Competitive pressure from Nvidia’s already-working solution may accelerate support.

Practical Advice for Curious Users

For users wanting to test the experimental APIs today:

Update Windows 11 to at least build 22621.
Install the latest Nvidia driver (R535 or newer).
Grab the Windows App SDK 1.6 Experimental 2 from NuGet or the official download center.
Start a new WinUI 3 project in Visual Studio 2022.
Add the Microsoft.Windows.AI.Generative package.
Call LanguageModel.CreateAsync() and then use the text generation or chat completion methods.

A complete sample application is available on Microsoft’s Windows AI samples GitHub repository under the “PhiSilicaExperimental” folder.

Developers are urged to review the experimental API terms and plan for migration when the stable surface arrives. Even so, the immediate benefit—being able to embed a capable local LLM in a Windows app today—is too attractive to ignore.

The Bottom Line

Microsoft’s experimental AI APIs mark a quiet but pivotal expansion of Windows’ AI capabilities. By lowering the hardware bar from NPU-equipped Copilot+ PCs to widely owned RTX 30-series and newer GPUs, the company instantly multiplies the potential developer base for on-device AI. While the APIs remain in flux and their future shape is uncertain, they offer a tangible preview of a Windows ecosystem where powerful local language models are an OS-level commodity.

For users who invested in a high-end Nvidia GPU, the APIs turn that silicon into a practical AI co-processor—no new PC required. The wait for broader GPU support, non-experimental stability, and a richer model catalog continues, but the foundation has been laid. Microsoft is finally giving the PC its AI moment, and it starts with RTX.