OpenAI VP Noam Brown: Inference-Time Scaling Will Make Memory Semiconductors the Next AI Battleground

OpenAI research vice president Noam Brown delivered a stark message to a packed symposium in Seoul on July 3, 2026: the future of artificial intelligence will be defined not by raw compute, but by memory. Speaking on the rapid evolution of frontier models, Brown explained that as AI systems spend longer “thinking” during inference (a trend known as inference-time scaling), the bottleneck shifts from processors to the speed and capacity of memory chips. The warning marks a major inflection point for the $160 billion semiconductor industry and carries direct implications for the next generation of Windows-powered AI experiences.

Brown is no stranger to paradigm shifts. He co-created the AI systems that mastered poker and Diplomacy, and his research on test-time compute helped lay the foundation for large language models that can reason step-by-step. His Seoul appearance was part of a government-backed symposium on AI and semiconductors, drawing top executives from Samsung Electronics, SK Hynix, and Microsoft’s Azure hardware division.

What Is Inference-Time Scaling?

Inference is the act of running a trained AI model to produce an answer. For years, the focus was on making inference faster—reducing latency so users get instant responses. That changed with the rise of chain-of-thought reasoning, where models break complex problems into intermediate steps, effectively “thinking” for longer before outputting a final result. OpenAI’s o-series models and similar agents from Google and Anthropic can spend seconds, minutes, or even hours refining their internal reasoning chains.

That extra thinking time multiplies the volume of temporary data the model must juggle. Each reasoning step generates new tokens, and every new token adds cached attention state (the key-value cache) that can grow far beyond what the original prompt required. Because producing each token also means re-reading that cache along with the model weights, a model that works through a hundred internal reasoning stages can move on the order of 100 times more data through memory than a simple one-shot answer. Compute remains essential, but if the data can’t be fed to the processors fast enough, the entire pipeline stalls. Memory becomes the new choke point.
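To put rough numbers on that growth, here is a minimal sketch of how cached attention state (the key-value cache) scales with the number of tokens a model has produced. The model shape used below (80 layers, 8 key-value heads, 128-dimensional heads, 16-bit values) is an assumed, illustrative configuration, not the architecture of any OpenAI model, and the function name is hypothetical.

```python
def kv_cache_bytes(num_tokens, num_layers=80, num_kv_heads=8,
                   head_dim=128, bytes_per_value=2):
    """Approximate key-value cache size in bytes for one sequence.

    Each token stores one key vector and one value vector per layer and
    per KV head. All default shapes are illustrative assumptions.
    """
    bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return num_tokens * bytes_per_token


if __name__ == "__main__":
    one_shot = kv_cache_bytes(500)        # short, direct answer
    reasoning = kv_cache_bytes(50_000)    # long reasoning trace
    print(f"one-shot:  {one_shot / 1e9:.2f} GB")
    print(f"reasoning: {reasoning / 1e9:.2f} GB")
    print(f"ratio:     {reasoning / one_shot:.0f}x")
```

Under these assumptions the cache grows linearly with token count, so a reasoning trace 100 times longer needs 100 times the bytes: roughly 0.16 GB versus roughly 16 GB for a single sequence.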

Why Memory Matters More Than Compute

Modern AI accelerators from NVIDIA, AMD, and custom cloud chips already pack enormous FLOPS. The limiting factor in real-world deployments is increasingly memory bandwidth and capacity. High-bandwidth memory (HBM), which stacks DRAM dies vertically and connects them with through-silicon vias, is the current gold standard. Each new HBM generation, such as HBM3E and HBM4, raises per-stack bandwidth substantially and allows larger models to run without offloading data to slower memory tiers or storage.
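A back-of-the-envelope estimate shows why bandwidth, rather than FLOPS, tends to set the pace when a model generates one token at a time. The sketch below is a simplified roofline-style calculation at batch size 1. The function name is hypothetical, and the model size, bandwidth, and peak-compute figures are round illustrative numbers rather than specifications of any particular chip.

```python
def decode_step_times(weight_bytes, kv_cache_bytes, flops_per_token,
                      mem_bandwidth_bytes_per_s, peak_flops_per_s):
    """Return (memory_time, compute_time) in seconds for one generated token.

    Simplifying assumption: each token requires streaming the model
    weights and the KV cache from memory exactly once.
    """
    memory_time = (weight_bytes + kv_cache_bytes) / mem_bandwidth_bytes_per_s
    compute_time = flops_per_token / peak_flops_per_s
    return memory_time, compute_time


if __name__ == "__main__":
    params = 70e9  # assumed parameter count
    mem_t, cmp_t = decode_step_times(
        weight_bytes=params * 2,        # 16-bit weights
        kv_cache_bytes=16e9,            # long reasoning context (assumed)
        flops_per_token=2 * params,     # about 2 FLOPs per parameter per token
        mem_bandwidth_bytes_per_s=3e12, # illustrative: 3 TB/s
        peak_flops_per_s=1e15,          # illustrative: 1 PFLOP/s
    )
    bound = "memory" if mem_t > cmp_t else "compute"
    print(f"memory: {mem_t * 1e3:.1f} ms, compute: {cmp_t * 1e3:.2f} ms "
          f"-> {bound}-bound")
```

With these inputs, moving the data takes about 52 ms per token while the arithmetic takes about 0.14 ms, so the processor spends most of each step waiting on memory. Batching many requests together narrows the gap, which this sketch deliberately ignores.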

Brown drew a direct line from inference-time scaling to memory demand. “When a model reasons for ten minutes instead of two seconds, the working set of data it needs to keep in near-compute memory explodes,” he told the symposium, according to attendees. “That doesn’t just require more memory; it requires much faster memory. Korean companies are uniquely positioned because they lead in both DRAM and HBM technology.”

His words immediately rippled through financial markets, with shares of Samsung and SK Hynix ticking up in after-hours trading. The logic is straightforward: if next-generation AI services require more memory chips, and pricier ones, memory makers capture a growing share of AI infrastructure spending. Bernstein analysts estimate that memory already accounts for 40% of a high-end AI server’s cost, and that share could rise above 60% as inference workloads dominate.

The Korean Memory Advantage

Samsung and SK Hynix control over 70% of the global DRAM market and an even larger slice of the nascent HBM segment. SK Hynix was the first to mass-produce HBM3E and supplies the memory for NVIDIA’s H200 and B200 GPUs. Samsung, after a rocky start, is regaining ground with its own HBM3E and has publicly demonstrated HBM4 prototypes with 2 TB/s bandwidth. Brown’s comments, while not an endorsement of any single supplier, validate the strategic bets Korean firms placed on advanced packaging and 3D stacking.

For Windows users, this supply chain matters. Microsoft’s Copilot+ PC initiative and the broader push toward local AI (running models directly on the device rather than in the cloud) hinge on efficient memory subsystems. Qualcomm’s Snapdragon X Elite, Intel’s Lunar Lake, and AMD’s Strix Point all integrate neural processing units (NPUs) that share memory with the CPU and GPU. If inference-time scaling trickles down to the edge, as analysts predict, laptops and desktops will need larger, faster memory pools to sustain reasoning-heavy tasks like document analysis, code generation, and real-time translation.

Microsoft has already started baking chain-of-thought reasoning into Copilot for Microsoft 365. A complex spreadsheet query might now trigger a model that thinks for 15 seconds rather than blurting out a guess. During that time, the system loads multiple intermediate datasets and stores partial computations. On a device with 16 GB of RAM, that could cause noticeable slowdowns unless memory architectures evolve. Expect future Windows hardware requirements to emphasize memory bandwidth alongside NPU TOPS.
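The 16 GB concern can be made concrete with a simple memory budget. The sketch below estimates how many tokens of reasoning context fit in RAM once the operating system, other applications, and the model weights are accounted for. Every figure is an assumption chosen for illustration (an 8-billion-parameter model at 4 bits per weight, 8 GB reserved for everything else, and a hypothetical 32-layer model shape), and the function name is made up for this example.

```python
def max_context_tokens(ram_bytes, reserved_bytes, weight_bytes,
                       kv_bytes_per_token):
    """Tokens of KV cache that fit after reserving RAM for the OS,
    other applications, and the model weights. Returns 0 if nothing fits."""
    free_bytes = ram_bytes - reserved_bytes - weight_bytes
    if free_bytes <= 0:
        return 0
    return free_bytes // kv_bytes_per_token


if __name__ == "__main__":
    GB = 10**9
    # Assumed shape: 32 layers, 8 KV heads, 128-dim heads, 16-bit values,
    # with one key and one value vector per token.
    kv_per_token = 2 * 32 * 8 * 128 * 2
    tokens = max_context_tokens(
        ram_bytes=16 * GB,
        reserved_bytes=8 * GB,   # OS and other apps (assumed)
        weight_bytes=4 * GB,     # 8B parameters at 4 bits each
        kv_bytes_per_token=kv_per_token,
    )
    print(f"KV bytes per token: {kv_per_token}")
    print(f"Context tokens that fit: {tokens}")
```

Under these assumptions about 30,000 tokens of context fit before the system has to start paging, which a long reasoning session over a large document could plausibly exhaust.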

The Data Center Ripple Effect

The real impact plays out in hyperscale cloud. Microsoft Azure, Amazon Web Services, and Google Cloud are racing to deploy servers packed with HBM-laden GPUs for inference. Azure’s “Project Forge,” detailed in leaked roadmaps, envisions clusters where memory scaling is decoupled from compute, allowing separate pools of high-capacity DRAM to serve multiple AI accelerators. Such designs would directly benefit from the kind of memory breakthroughs Brown alluded to.

Even more intriguing is the potential for Windows Server and hybrid AI workloads. Enterprises running private AI instances on Azure Stack HCI or Windows Server 2025 will face the same memory wall. Microsoft’s prompt engineering guidelines already encourage developers to use chain-of-thought techniques, which means many business workflows are, in effect, already practicing inference-time scaling. The server memory market, long stagnant, could see a renaissance as IT departments upgrade to meet the memory demands of reasoning workloads.

Noam Brown’s Track Record

Brown’s predictions carry weight because his career has consistently anticipated AI’s trajectory. As a PhD student at Carnegie Mellon, he co-created the Libratus poker bot, which beat top humans by thinking through each hand for minutes. That approach, giving the model more time to compute a decision, was the seed of test-time compute. Later at Meta, he led Cicero, an AI that negotiated and strategized in Diplomacy by blending language with planning. At OpenAI, he contributed to the o1 and o3 reasoning models that first showed dramatically improved performance with increased inference time on math and science benchmarks.

He is not an outsider speculating; he helps build the models that define the frontier. When he says memory will be the bottleneck, he’s speaking from firsthand experience of watching HBM capacity constraints delay experiments and limit batch sizes.

What Comes Next

The symposium’s host, Korea’s Ministry of Science and ICT, quickly seized on Brown’s remarks to promote a new ₩10 trillion ($7.5 billion) chip R&D package that includes subsidies for HBM4 development and advanced packaging fabs. SK Hynix CEO Kwak Noh-jung, in a panel with Brown, said the company is already designing “inference-optimized” memory architectures that prioritize sustained random-access bandwidth over sequential throughput, a shift necessary for the sporadic, bursty nature of reasoning workloads.

Meanwhile, Microsoft’s semiconductor division is evaluating memory-controller IP that could allow future Cobalt CPUs and Maia accelerators to directly address disaggregated memory pools over CXL (Compute Express Link). CXL 3.0, whose specification was published in 2022 and whose hardware is expected in 2027, promises cache-coherent memory sharing, potentially turning a server rack into a giant memory fabric. If successful, inference-time scaling could draw on a far larger pool of memory, though at a latency cost that, for now, only on-package HBM of the kind Korean suppliers lead in can offset.
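The trade-off between pooled capacity and latency can be illustrated with a weighted-average access time. The sketch below assumes a two-tier system in which some fraction of memory accesses is served from fast local memory and the rest from a remote pool. The latency values (100 ns local, 400 ns remote) are placeholders for illustration, not measurements of HBM or CXL hardware, and the function name is hypothetical.

```python
def effective_latency_ns(local_hit_fraction, local_ns=100.0, remote_ns=400.0):
    """Average access latency for a two-tier memory system.

    local_hit_fraction is the share of accesses served by fast local
    memory; the remainder goes to the slower pooled tier.
    """
    if not 0.0 <= local_hit_fraction <= 1.0:
        raise ValueError("local_hit_fraction must be between 0 and 1")
    return (local_hit_fraction * local_ns
            + (1.0 - local_hit_fraction) * remote_ns)


if __name__ == "__main__":
    for fraction in (1.0, 0.9, 0.5):
        latency = effective_latency_ns(fraction)
        print(f"{fraction:.0%} local -> {latency:.0f} ns average")
```

Even a modest share of remote accesses raises the average noticeably, which is why a large pooled tier still benefits from a fast local tier holding the hottest data.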

For Windows enthusiasts, the takeaway is clear: the AI revolution has moved from a compute race to a memory race. The next PC you buy might not be judged by its CPU speed or GPU teraflops, but by the speed and capacity of its memory subsystem. As models learn to think longer and harder, the silicon that remembers will be the one that matters most. Brown’s Seoul bombshell isn’t just a forecast for data-center operators; it’s a preview of the memory wars coming soon to your desktop.