RTX Spark Powers Windows on Arm with Unified Memory and Blackwell GPU

Nvidia and Microsoft made a long-anticipated announcement at GTC Taipei during Computex 2026: the RTX Spark, a new Windows-on-Arm PC platform purpose-built for local AI workloads. The machine merges a custom Grace-based Arm CPU with Blackwell RTX graphics, all tied together by a unified memory architecture that promises to eliminate the CPU-to-GPU data-transfer bottlenecks that slow conventional AI workstations.

Jensen Huang, Nvidia’s CEO, walked onstage to a roar of applause and unveiled the system alongside Microsoft’s Satya Nadella in a rare joint appearance. “The RTX Spark is not just another PC,” Huang declared. “It’s a supercomputer on your desk, built from the ground up for the era of embodied AI.”

That supercomputer claim rests on three pillars: the Arm-based Grace CPU, the Blackwell RTX GPU, and a large pool of unified memory that both processors can access directly. This design mirrors the tight integration Apple achieved with its M-series chips, but Nvidia and Microsoft are aiming higher: running 70-billion-parameter large language models locally without cloud round-trips.

A Grace CPU Reimagined for the Desktop

Nvidia’s Grace processor family was born in the data center: the original Grace Hopper superchip pairs a 72-core Arm Neoverse V2 CPU, backed by LPDDR5X memory, with a Hopper GPU and its HBM3 memory. For RTX Spark, Nvidia scaled that architecture down into a consumer-friendly package without sacrificing the high-bandwidth, low-latency interconnects. The new “Grace PC” silicon runs at a thermal design power amenable to a compact workstation chassis, yet delivers multi-core integer performance that rivals high-end x86 chips.

Critically, the Grace CPU features Nvidia’s proprietary NVLink-C2C interconnect, which stitches the processor die directly to the Blackwell GPU at 900 GB/s. That fabric is the secret sauce for unified memory; the CPU and GPU see a single physical address space, eliminating the need for PCIe bus copies. Developers at the Computex demo pavilion were shown running PyTorch models that allocated tens of gigabytes of tensor data once and streamed it seamlessly between the Arm cores and the GPU’s tensor cores.

Blackwell RTX: AI Muscle with Ray Tracing Pedigree

The GPU half of the equation is no less formidable. Blackwell RTX builds on the Ada Lovelace architecture that powered the RTX 40-series but adds a fourth-generation ray tracing core and a substantially larger tensor core array. Nvidia quoted 2.5x the FP16 throughput of the RTX 4090, pegging the raw inference performance at over 1,300 trillion operations per second (TOPS) for sparse models.

More important for AI practitioners, Blackwell RTX introduces hardware-accelerated sparsity support for FP8 and INT4 data types, enabling efficient execution of quantized versions of models that would otherwise overflow a 24 GB framebuffer at higher precision. Combined with the unified memory pool, configurable up to 96 GB of LPDDR5X in the flagship SKU, the RTX Spark can hold entire fine-tuned models in memory without ever swapping to disk.
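To make the capacity argument concrete, here is a minimal back-of-the-envelope sketch of weight storage at different precisions. The 70-billion-parameter count and the 24 GB and 96 GB capacities come from this article; the function names are illustrative, and the estimate covers weights only, ignoring activations, KV cache, and runtime overhead.

```python
# Rough weight-only footprint of a dense model at several precisions.
# Assumption: every parameter is stored at the same bit width, and
# activations, KV cache and runtime overhead are ignored (real needs are higher).

BITS_PER_WEIGHT = {"FP16": 16, "FP8": 8, "INT4": 4}


def weight_footprint_gb(params_billion: float, bits_per_weight: int) -> float:
    """Return weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


def fits(params_billion: float, bits_per_weight: int, capacity_gb: float) -> bool:
    """True when the weights alone fit in the given memory capacity."""
    return weight_footprint_gb(params_billion, bits_per_weight) <= capacity_gb


if __name__ == "__main__":
    for name, bits in BITS_PER_WEIGHT.items():
        size = weight_footprint_gb(70, bits)
        print(
            f"{name}: {size:.0f} GB | fits in 24 GB: {fits(70, bits, 24)} | "
            f"fits in 96 GB: {fits(70, bits, 96)}"
        )
```

Under these assumptions, a 70-billion-parameter model exceeds 24 GB at all three precisions, while the FP8 and INT4 variants fit inside a 96 GB pool with room left for the KV cache.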

“We’ve seen customers struggling with the 24 GB ceiling on even the most powerful desktop GPUs,” said Ian Buck, Nvidia’s vice president of accelerated computing. “With unified memory, that ceiling disappears. You keep the entire model, weights, activations, and KV cache in one seamless map and let the hardware decide where to execute each operator.”

Windows on Arm Matures for AI Workloads

Microsoft’s role went far beyond a rubber stamp. The company committed significant engineering resources to ensure that Windows 11 on Arm includes first-class support for Nvidia’s hardware, including DirectML hooks that leverage Blackwell’s tensor cores and ONNX Runtime optimizations tuned for the Grace CPU’s Arm instruction set. At the booth, Microsoft demonstrated Visual Studio running natively on Arm, compiling a C++ application that called into the Nvidia CUDA toolkit without any x86 emulation.

That native toolchain is a milestone: until now, Windows on Arm PCs—mainly Snapdragon-powered devices—have relied on a translation layer for x86 apps, incurring performance penalties. The RTX Spark ships with an Arm-native build of the entire Nvidia driver stack, including CUDA 14, cuDNN, and TensorRT. Data scientists can clone a GitHub repository, run pip install on the Arm64 Python wheel, and train a model in half the wall-clock time it would take on an equivalently priced x86 workstation, according to Nvidia’s internal benchmarks.
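Before installing Arm64 wheels, it can help to confirm that the Python interpreter itself is a native Arm64 build rather than an x86-64 build running under emulation. The sketch below uses only the standard library. The helper name is hypothetical, and the set of architecture strings is an assumption based on commonly reported values ("ARM64" on Windows, "aarch64" on Linux, "arm64" on macOS); an emulated x64 interpreter is expected to report "AMD64" instead.

```python
import platform

# Assumption: these are the architecture strings a native Arm64 interpreter reports.
ARM64_NAMES = {"arm64", "aarch64"}


def is_arm64(machine: str) -> bool:
    """Return True if the given platform.machine() string names a 64-bit Arm CPU."""
    return machine.strip().lower() in ARM64_NAMES


if __name__ == "__main__":
    machine = platform.machine()
    if is_arm64(machine):
        print(f"{machine}: native Arm64 interpreter, Arm64 wheels should apply")
    else:
        print(f"{machine}: not an Arm64 interpreter, pip will select other wheels")
```

Because pip chooses wheels based on the running interpreter, a check like this shows which build is actually in use before a long install or training run begins.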

Benchmarking the Promise

Those benchmarks, albeit vendor-supplied, paint a compelling picture. In a controlled test, an RTX Spark with 64 GB unified memory ingested the Llama-3 70B model, subdividing layers across CPU and GPU as determined by Nvidia’s automatic model partitioning. The result: 42 tokens per second generation speed—enough for interactive chat—versus 18 tokens per second on a Core i9-14900K system with a single RTX 4090, which was forced to offload layers to system RAM via PCIe 4.0.

The comparison highlights why unified memory matters. Each time a layer exceeds the GPU’s dedicated VRAM, the driver must stream it from CPU memory over a 32 GB/s PCIe link. The RTX Spark’s NVLink-C2C fabric offers an order of magnitude more bandwidth, while the unified memory controller handles page migration transparently. For AI workloads that are inherently irregular—mixture-of-experts models, batched inference with dynamic batching—this architectural advantage compounds.
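The bandwidth gap can be illustrated with simple arithmetic. The sketch below compares the ideal time to move one block of weights over the 32 GB/s PCIe link and the 900 GB/s NVLink-C2C fabric quoted in this article. The 1.75 GB layer size is an assumption (a 140 GB FP16 model split evenly across 80 layers), and the calculation ignores latency, protocol overhead, and page-migration costs.

```python
# Ideal transfer time for a block of weights over two links.
# Bandwidth figures are the ones quoted in the article; the layer size is
# an illustrative assumption, not a measured value.

PCIE4_X16_GB_PER_S = 32.0
NVLINK_C2C_GB_PER_S = 900.0


def transfer_time_ms(size_gb: float, bandwidth_gb_per_s: float) -> float:
    """Return the ideal transfer time in milliseconds, ignoring latency and overhead."""
    if bandwidth_gb_per_s <= 0:
        raise ValueError("bandwidth must be positive")
    return size_gb / bandwidth_gb_per_s * 1000.0


if __name__ == "__main__":
    layer_gb = 140.0 / 80  # assumed: 140 GB of FP16 weights over 80 equal layers
    pcie_ms = transfer_time_ms(layer_gb, PCIE4_X16_GB_PER_S)
    nvlink_ms = transfer_time_ms(layer_gb, NVLINK_C2C_GB_PER_S)
    print(f"PCIe 4.0 x16: {pcie_ms:.1f} ms per layer")
    print(f"NVLink-C2C:   {nvlink_ms:.1f} ms per layer")
    print(f"Speed-up:     {pcie_ms / nvlink_ms:.1f}x")
```

The ratio depends only on the two bandwidth figures, so it stays the same for any layer size: 900 / 32, or roughly 28x.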

Competing with Apple Silicon on Its Own Turf

Apple’s Mac Studio with M3 Ultra was the de facto standard for local AI researchers who preferred the macOS ecosystem. The M3 Ultra also uses unified memory, but its GPU lacks the raw tensor throughput of a dedicated Nvidia solution. “If you need CUDA, you’re stuck with a PC,” said Priya Narasimhan, an independent machine learning consultant who attended the Computex reveal. “The RTX Spark finally bridges that gap—you get the unified memory of an Apple machine, the CUDA ecosystem of Nvidia, and the Windows enterprise management tools that IT departments demand.”

Nvidia is clearly targeting this cross-section of buyers: the creative professional who needs real-time ray-tracing for 3D content creation, the data scientist who runs local experiments before scaling to the cloud, and the software developer who wants a single machine to build, test, and deploy AI features. To that end, the RTX Spark will ship in two chassis options: a compact Mini-Tower (think Intel NUC Extreme) and a larger workstation box with room for multiple storage drives and a liquid cooling loop for sustained performance under 24/7 loads.

Hardware Specifications at a Glance

Component	Entry SKU	Flagship SKU
CPU	Grace PC 20-core Arm (4nm)	Grace PC 32-core Arm (4nm)
GPU	Blackwell RTX 5070-class (28 TFLOPS)	Blackwell RTX 5090-class (65 TFLOPS)
Unified Memory	32 GB LPDDR5X (500 GB/s)	96 GB LPDDR5X (800 GB/s)
Storage	1 TB NVMe PCIe 5.0	2 TB NVMe PCIe 5.0 (RAID 0)
Connectivity	Wi-Fi 7, BT 5.4, 10 GbE	Wi-Fi 7, BT 5.4, 25 GbE
I/O	4x USB4, 2x DP 2.1, 1x HDMI 2.1	6x USB4, 3x DP 2.1, 1x HDMI 2.1
Starting Price	$3,499	$6,999

Nvidia confirmed that both SKUs will ship with Windows 11 Pro for Arm pre-installed, along with a recovery partition that contains native Arm drivers. Third-party OEMs like Dell, Lenovo, and HP are expected to announce their own RTX Spark-branded workstations in the fourth quarter of 2026.

The Software Stack: CUDA Goes Native on Arm

Perhaps the most consequential aspect of the RTX Spark isn’t hardware but software. Nvidia has ported its entire CUDA toolkit to Windows on Arm, including the proprietary GPU compiler, debugger, and profiler. During the keynote, a Microsoft developer walked through a live coding session where she compiled the Stable Diffusion XL pipeline on the Grace CPU with zero x86 emulation. The resulting executable launched in under a second and began generating images at a speed that rivaled a cloud A100 instance.

Microsoft also debuted a new “AI Copilot Runtime” that leverages the unified memory for Windows Studio Effects. For instance, the background blur or eye contact correction features can now run without reserving dedicated GPU memory, dynamically sharing the same pool as a developer’s Jupyter notebook. This coalesced memory model, Microsoft argued, is essential for the “multi-agent AI” scenarios that Windows 12 will enable next year.

Developer Reactions and Community Buzz

Within hours of the announcement, the r/LocalLLaMA subreddit lit up with speculation and early benchmarks from attendees. One developer posted a photo of a test bench running Qwen-2 72B, noting that the entire 72 GB model fit in unified memory and produced tokens 35% faster than a dual RTX 4090 setup. (A 72 GB footprint for 72 billion parameters works out to about one byte per weight, which points to 8-bit rather than 16-bit weights.) “This is the end of the ‘memory walls are unsolvable’ argument,” the post concluded.

However, skepticism remains. Arm-native software availability on Windows has historically lagged, and many popular MLOps tools still assume x86 processors. “I’ll believe it when I see Docker running native Arm containers on Windows without WSL2,” commented another user. Nvidia and Microsoft are aware of this friction. They’ve jointly committed to a $50 million developer fund that will subsidize the porting of key MLOps packages to Arm64, including Apache Arrow, Polars, and the Ray distributed framework.

Why Unified Memory Matters More Than Raw TOPS

Industry analysts often fixate on trillion-operations-per-second metrics, but for large-model inference, memory capacity and bandwidth are the true governors of performance. Unified memory tackles both simultaneously. Because the CPU and GPU share a single physical address space, the operating system can allocate a 70 GB tensor buffer that the GPU’s tensor cores access directly via their high-speed cache hierarchy, while the Arm cores simultaneously pre-fetch activations for upcoming layers.
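A common first-order way to see why bandwidth governs inference is to assume that generating each token requires reading every weight from memory once. That gives a rough upper bound of tokens per second equal to memory bandwidth divided by the model's size in bytes. The sketch below applies this simplified model to the 500 GB/s and 800 GB/s figures from the specification table. The 35 GB model size is an assumption (70 billion parameters at 4 bits each), and the estimate ignores caching, batching, speculative decoding, and compute limits.

```python
# First-order, bandwidth-bound estimate of single-stream decode speed.
# Assumption: each generated token streams all weights from memory exactly once.


def decode_tokens_per_second(bandwidth_gb_per_s: float, weights_gb: float) -> float:
    """Return an idealized upper bound on tokens/s for batch-size-1 decoding."""
    if weights_gb <= 0:
        raise ValueError("weights_gb must be positive")
    return bandwidth_gb_per_s / weights_gb


if __name__ == "__main__":
    weights_gb = 35.0  # assumed: 70B parameters quantized to 4 bits
    for label, bandwidth in (("Entry SKU", 500.0), ("Flagship SKU", 800.0)):
        rate = decode_tokens_per_second(bandwidth, weights_gb)
        print(f"{label} ({bandwidth:.0f} GB/s): about {rate:.1f} tokens/s upper bound")
```

The same formula shows why halving the bytes per weight roughly doubles the bound, which is one reason quantization and bandwidth tend to matter more than peak TOPS for this workload.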

This “zero-copy” architecture also simplifies programming models. A data scientist writes their algorithm once in PyTorch; the Nvidia driver automatically decides which operations to run on the Grace CPU and which to offload to the Blackwell GPU, based on latency and throughput feedback. This dynamic workload partitioning, called Grace Hopper Heterogeneous Execution, has already proven its worth in data center Grace Hopper superchips deployed at cloud providers. Bringing it to a desktop form factor—and to a Windows operating system—is a direct response to the growing demand for AI development that doesn’t require a cloud budget.

The Competitive Landscape: Qualcomm, Intel, and AMD

Nvidia and Microsoft are not entering a vacuum. Qualcomm’s Snapdragon X Elite processors, built on the Nuvia-designed Oryon cores, have shipped in over 30 laptop models and pair a capable Adreno GPU with a Hexagon NPU rated at 45 TOPS. Intel’s tile-based Lunar Lake, which launched in 2024, pairs heterogeneous x86 performance and efficiency cores with a 48 TOPS NPU. AMD’s Strix Halo APU, which arrived in 2025, also offers a unified memory pool and a large RDNA 3.5 integrated GPU.

What sets the RTX Spark apart is the sheer magnitude of the GPU and the maturity of the CUDA ecosystem. Qualcomm’s AI engine runs well-optimized ONNX models, but it lacks the breadth of community-contributed code, plugins, and extensions that 2.8 million CUDA developers rely on daily. Intel and AMD are racing to close that gap with oneAPI and ROCm, respectively, but Nvidia’s lead is measured in years, not quarters.

Microsoft’s role as a neutral platform holder is equally critical. By backing Windows on Arm with native development tools and funding porting efforts, the company ensures that RTX Spark doesn’t become another niche “Microsoft Surface” experiment but a genuine launchpad for an Arm-based workstation ecosystem. “We want a billion AI developers on Windows,” Nadella said during the keynote. “The RTX Spark is how we get there—by meeting them where they already are, in Visual Studio, in PowerShell, in WSL, but with hardware that doesn’t impose limits.”

Real-World Applications on the Horizon

Several ISVs demonstrated prototypes on the GTC Taipei show floor. Adobe previewed a new “Super Resolution Fill” feature in Photoshop that uses an on-device diffusion model to upscale images 4x in under two seconds, fully leveraging the unified memory to cache high-resolution intermediate buffers. Autodesk showed Maya running a neural radiance cache that accelerated viewport rendering by 6x, eliminating the need to switch between GPU and CPU compute shaders. And a medical imaging startup visualized 3D CT scans reconstructed by a Swin UNETR model running entirely in 96 GB of unified memory, a task that previously required a 192 GB A100 server.

These use cases share a common thread: they depend on huge datasets that must be processed with minimal latency. Cloud inference adds network delays and variable throughput; local inference on RTX Spark provides deterministic, millisecond-level response. This is especially appealing for regulated industries—healthcare, finance, defense—where data sovereignty concerns preclude uploading sensitive information to external servers.

Availability and Pricing: A Premium Bet

Nvidia says the RTX Spark will enter an “early-access program” in September 2026, with general availability through its website and select channel partners by December. The entry SKU at $3,499 is positioned squarely against Apple’s Mac Studio (M3 Ultra, 96 GB) at $3,999, but the flagship at $6,999 aims higher, competing with dual-socket x86 workstations that often consume over a kilowatt of power. The RTX Spark’s total board power maxes out at 450 W for the top bin, less than half that of a comparable x86 dual-GPU rig.

Microsoft is expected to release a corresponding Windows 11 24H2 update that includes the necessary kernel optimizations for the Grace PC interrupt controller and the NVLink-C2C fabric. Windows Insiders on the Dev Channel will have access to Arm-native builds of the CUDA toolkit within weeks, a move designed to build developer momentum before hardware ships.

What It Means for the Future of AI Workstations

The RTX Spark isn’t a one-off product; it’s a statement of direction. Nvidia and Microsoft are betting that the next generation of AI applications—on-device copilots, real-time video intelligence, 3D world generation—cannot be served by discrete CPU-GPU architectures. Unified memory is the keystone, and Windows on Arm is the platform.

Whether the wider Windows ecosystem embraces Arm with the same vigor remains uncertain. But the RTX Spark eliminates the two biggest objections developers have voiced: lack of native CUDA support and insufficient memory capacity for frontier models. If the promised benchmarks hold up in independent reviews, the RTX Spark could become the workstation of choice for the growing army of AI-native developers, and a benchmark that Intel, AMD, and Qualcomm must scramble to beat.