Phison's 4,500-Hour Investigation Fails to Reproduce Windows 11's SSD-Killing Bug

Microsoft's August 2025 cumulative update for Windows 11, tracked as KB5063878, has prompted an investigation after multiple testers reported that sustained heavy writes can cause certain NVMe SSDs to vanish abruptly from the operating system. Yet the mystery deepened this week when Phison, a major controller maker whose chips are used in many of the affected drives, disclosed that it had spent more than 4,500 hours testing and failed to reproduce the bug a single time. The conflicting signals (consistent reproductions in hobbyist labs, clean bench results in a vendor-controlled environment) leave users and administrators caught between a genuine data-loss risk and a vendor's assurance that the problem isn’t widespread.

The incident first surfaced in late August, just days after Microsoft shipped its monthly servicing wave. Within the first week, a pattern emerged across multiple hardware forums and specialist outlets: users who pushed their SSDs with large, sequential writes—copying a 50 GB game folder, extracting a massive archive, or restoring a disk image—watched the target drive simply disappear. It would vanish from File Explorer, Device Manager, and even vendor-specific utilities like Corsair iCUE or Samsung Magician. A reboot sometimes brought it back; in other cases, the drive remained invisible, and files caught mid-write were left truncated or corrupted. Some unlucky users reported that the drive couldn’t be recovered without professional tools or a full firmware reflash.

Community testers quickly narrowed the trigger profile. The failures almost always occurred when the SSD was filled beyond 50–60% capacity, and the write operation was sustained for tens of gigabytes without pause. The victim drives tended to be cost-optimized, DRAM-less models—those that lean on the NVMe Host Memory Buffer (HMB) to borrow a chunk of system RAM for mapping tables and caching. Phison’s E19T and E21T controllers, along with a few from other vendors, figured prominently in early collations of affected hardware. One independent lab demonstrated the bug on a Corsair MP600 GS (Phison E21T) by repeatedly writing 40 GB files to a 1 TB drive that was 80% full; after three or four cycles, the drive would drop off the PCIe bus.
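The reported trigger profile can be expressed as a simple pre-flight check. The sketch below is illustrative only: the 50% fill and 40 GB write thresholds are assumptions loosely taken from the community figures above, not limits confirmed by Microsoft or Phison, and the function names are hypothetical.

```python
import shutil

# Assumed thresholds based on community reports; not vendor-confirmed.
FILL_THRESHOLD = 0.50            # fraction of capacity already in use
WRITE_THRESHOLD = 40 * 10**9     # bytes in one sustained write (about 40 GB)


def matches_reported_profile(used_fraction: float, write_bytes: int) -> bool:
    """Return True if a planned write resembles the reported trigger profile."""
    return used_fraction >= FILL_THRESHOLD and write_bytes >= WRITE_THRESHOLD


def used_fraction(path: str) -> float:
    """Fraction of the volume containing `path` that is currently in use."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


if __name__ == "__main__":
    fraction = used_fraction(".")
    risky = matches_reported_profile(fraction, 60 * 10**9)
    print(f"volume is {fraction:.0%} full; a 60 GB write matches the profile: {risky}")
```

A check like this cannot say whether a given drive is actually vulnerable; it only flags writes that look like the ones testers described.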

The systemic risk is not academic. When an NVMe drive disappears mid-write, the filesystem metadata—the NTFS master file table or journal—can be left in an inconsistent state. That can turn a single file loss into a cascade of directory corruption. In the worst cases, the drive’s internal NAND management structures may also be damaged, requiring a low-level reflash. For anyone who relies on a single large-volume backup or disk imaging operation, the August update introduces a material threat of irrecoverable data loss. And because the storage stack is a tight coupling of Windows’ storage drivers, the PCIe/NVMe subsystem, and thousands of lines of controller firmware, a small change in host timing or command ordering can surface a latent bug that had lain dormant for years.

Phison’s public response, posted in early September, both calmed and complicated the narrative. The Taiwan-based firm said it dedicated over 4,500 cumulative testing hours to the drives flagged as potentially impacted, spanning more than 2,200 test cycles. “We were unable to reproduce the reported issue, and no partners or customers have reported that the issue affected their drives at this time,” the company stated. Phison also warned that a fabricated document circulating online—a list of supposedly affected controllers—was entirely fake. The company urged users to ignore it and to practice good thermal management: use a heatsink or thermal pad on high-performance SSDs during extended workloads to avoid throttling.

Phison’s test campaign was thorough, but it does not close the case. The fact that a controller vendor with deep access to silicon, schematics, and diagnostic tooling couldn’t trigger the fault strongly suggests the bug is conditional—dependent on a narrow mix of firmware revision, drive usage state, motherboard BIOS version, and perhaps even the specific data pattern being written. Industrial labs can miss environmental permutations that exist in the field. Community testers, by contrast, had the luxury of trying dozens of motherboard and drive combinations, often with older or customized firmware that OEMs had tweaked. Phison’s clean bill of health is meaningful, but it doesn’t erase the growing pile of reproducible demonstrations from independent benches.

Microsoft, for its part, has been characteristically guarded. The company confirmed that it was “aware of these reports” and was “investigating with our storage partners.” It asked affected users to submit Feedback Hub logs and contact Support, emphasizing the need for telemetry to correlate host-side traces with controller logs. At the time of Phison’s statement, Microsoft said it had not detected a platform-wide telemetry signal indicating a spike in disk failures tied to the update. That absence is puzzling, given the community reproductions, but it aligns with Phison’s statement that no partners or customers had reported affected drives. The disconnect points to a likely explanation: the bug may be real but confined to a tiny fraction of Windows 11 machines, emerging only under a very specific workload and hardware combination.

Two technical hypotheses dominate the engineering discussion. The first is a host-driven regression in the NVMe storage stack. A change in how Windows 11 stages, flushes, or orders page-cache writes could alter the timing and cadence of NVMe commands. If a DRAM-less controller that depends on HMB receives a burst of commands in an unexpected order—say, while it is already busy relocating flash blocks—it might enter an unrecoverable hang state. The fact that the drive becomes unreadable at the PCIe level, not just within Windows, supports a low-level controller firmware fault triggered by a new host behavior. Past Windows 11 updates have indeed exposed similar HMB-related fragility on certain drive models, lending credence to this theory.

The second hypothesis centers on DRAM-less controller design and metadata pressure. SSDs without onboard DRAM rely on HMB to cache the flash translation layer (FTL) mapping tables. Sustained sequential writes stress these structures heavily; the controller must constantly update mapping entries while juggling SLC cache flushes, garbage collection, and wear leveling. If a host update changes the HMB allocation lifecycle—say, by altering how often the buffer is resized or reclaimed—it could push the controller past its resource limits, especially when the drive is nearly full and the SLC cache window is reduced. That could trigger a firmware race condition that hangs the controller and drops the SSD off the bus.
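A back-of-the-envelope calculation shows why a DRAM-less controller is under pressure here. The sketch assumes a flat page-level mapping table with 4 KiB granularity and 4-byte entries, plus a 64 MiB host memory buffer. These are commonly cited illustrative figures, not the actual parameters of any Phison controller.

```python
def mapping_table_bytes(capacity_bytes: int, page_bytes: int = 4096, entry_bytes: int = 4) -> int:
    """Size of a flat logical-to-physical table with one entry per mapped page.

    The 4 KiB page and 4-byte entry defaults are assumptions for illustration.
    """
    return (capacity_bytes // page_bytes) * entry_bytes


def hmb_coverage(capacity_bytes: int, hmb_bytes: int = 64 * 2**20) -> float:
    """Fraction of the mapping table that fits in the host memory buffer.

    The 64 MiB default is a hypothetical HMB allocation.
    """
    return min(1.0, hmb_bytes / mapping_table_bytes(capacity_bytes))


if __name__ == "__main__":
    capacity = 10**12  # a nominal 1 TB drive
    table = mapping_table_bytes(capacity)
    print(f"mapping table: {table / 2**20:.0f} MiB")
    print(f"HMB coverage:  {hmb_coverage(capacity):.1%}")
```

Under these assumptions the full table for a 1 TB drive is roughly 931 MiB, so the buffer holds only about 7% of it at a time. The rest must be paged in from flash, which is where a change in host-side buffer handling could plausibly hurt.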

Neither hypothesis is proven. Both are consistent with the fingerprint reported by community testers: large, sustained writes to a drive that is already more than half full cause a sudden, low-level device removal. Definitive attribution will require Microsoft and controller vendors to jointly analyze paired host traces and controller firmware logs, a painstaking process that is reportedly underway.

While the investigation grinds on, users and IT administrators need practical steps to protect data. The conservative stance is clear: back up everything before running large write operations on a Windows 11 system that has the August update installed. If you must move large files, consider splitting transfers into batches under 10 GB each, or directing them to an external drive so that the internal NVMe SSD is not the write target. For fleet admins, the update should be staged in a test ring that includes a representative sample of storage hardware. Run sustained write tests, such as CrystalDiskMark’s sequential write test or a simple robocopy of a 50 GB folder, while monitoring for any disappearance events. Microsoft’s deployment controls allow you to delay the patch until the investigation is resolved.
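Splitting a transfer into batches under 10 GB can be planned before any data moves. The following is a minimal sketch, assuming you already have a list of file names and sizes; `plan_batches` and its greedy grouping are hypothetical helpers, and nothing here is confirmed to avoid the fault.

```python
from typing import Iterable, List, Tuple

BATCH_LIMIT = 10 * 10**9  # 10 GB per batch, the figure suggested above


def plan_batches(files: Iterable[Tuple[str, int]], limit: int = BATCH_LIMIT) -> List[List[str]]:
    """Group (name, size_in_bytes) pairs, in order, into batches of at most `limit` bytes.

    A single file larger than `limit` is placed in a batch of its own,
    so that batch will exceed the limit.
    """
    batches: List[List[str]] = []
    current: List[str] = []
    current_size = 0
    for name, size in files:
        # Start a new batch when adding this file would exceed the limit.
        if current and current_size + size > limit:
            batches.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        batches.append(current)
    return batches


if __name__ == "__main__":
    demo = [("a.iso", 6 * 10**9), ("b.iso", 6 * 10**9), ("c.zip", 3 * 10**9)]
    for index, batch in enumerate(plan_batches(demo), start=1):
        print(index, batch)
```

Each batch can then be copied separately with a pause in between. Whether a pause is enough to let the controller settle is an open question, so treat this as a precaution rather than a fix.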

Check your SSD vendor’s support page regularly. Corsair, Sabrent, Seagate, and Western Digital have all issued firmware updates for DRAM-less models in the past year, often silently fixing bugs that only manifested under specific workloads. If a firmware update appears for your SKU, apply it only after a full backup. And if a drive does vanish mid-write, stop all activity immediately. Do not attempt to initialize, format, or repartition the disk. Instead, capture system event logs and vendor utility output, and if you need to preserve evidence for RMA or recovery, create a read-only forensic image using a tool like GNU ddrescue from a bootable USB. This preserves the possibility of data recovery and gives vendor engineers a snapshot of the corrupted state.
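Capturing the system event log can be scripted so that it happens before anything else touches the machine. The sketch below wraps the built-in Windows `wevtutil epl` (export-log) command. The helper names and the destination path are hypothetical, and the export itself only works on Windows.

```python
import subprocess
from pathlib import Path
from typing import List


def build_export_command(dest: Path, log_name: str = "System") -> List[str]:
    """Build the wevtutil command that exports an event log to an .evtx file."""
    return ["wevtutil", "epl", log_name, str(dest)]


def export_event_log(dest: Path, log_name: str = "System") -> None:
    """Run the export; raises CalledProcessError if wevtutil reports failure."""
    subprocess.run(build_export_command(dest, log_name), check=True)


if __name__ == "__main__":
    # Hypothetical destination; save to a drive other than the affected one.
    target = Path("system-after-dropout.evtx")
    print(" ".join(build_export_command(target)))
```

Write the exported file to a different physical drive from the one that disappeared, so that gathering evidence does not add writes to the affected disk.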

The industry’s response has strengths and gaps. On the positive side, rapid community triage produced reproducible test recipes within days, giving Microsoft and Phison a concrete target to investigate. Phison’s public testing campaign and Microsoft’s ongoing telemetry collection show that platform stakeholders are engaged. But the gaps are equally glaring. No joint root-cause analysis tying host traces to controller forensic logs has yet been published. Until that happens, attribution remains speculative. The variability across firmware SKUs, OEM configurations, and BIOS revisions means a universal reproduction test is elusive; industrial labs can keep missing the specific field permutation that triggers the bug. And the absence of a widespread RMA spike does not comfort the individual whose drive just ate a year’s worth of photos.

What to watch for next: The clearest signal that the investigation is nearing its end would be a joint Microsoft–vendor post-mortem, published on Microsoft’s Windows release-health dashboard, that correlates telemetry from both ends of the stack and reproduces the failure in a controlled lab using vendor tooling. Firmware advisories for specific branded SKUs distributed through vendors like Corsair or Seagate would be the next milestone; a firmware patch validated by both the SSD maker and Microsoft would effectively close the case for users who apply it. If the root cause turns out to be entirely host-side, Microsoft might issue a Known Issue Rollback (KIR) for the August update, automatically disabling the problematic storage-stack change on affected configurations. And a spike in verified field RMA statistics tied to the update, reported across multiple vendors, would confirm that the community reproductions were not just bench curiosities.

This episode is a textbook illustration of modern storage-stack fragility. The system behavior we take for granted (write a file, and the file is safe) is the product of co-engineering between Microsoft’s kernel storage team, the NVMe driver, platform firmware, and controller microcode. A single errant timing change can cascade into data loss. The early community reproductions are a serious warning signal, and they are reason enough for Microsoft and Phison to remain engaged. Phison’s extensive testing and inability to reproduce the fault reduce the likelihood of a single, simple universal bug, but they do not exonerate any party. The most plausible explanation is an interaction: a host behavior change exposed a latent firmware defect that only surfaces under narrow conditions. That class of bug is notoriously difficult to catch in a clean lab yet can recur in the field, where hardware and firmware combinations are far more varied.

For now, the prudent path is clear. Back up your data, avoid heavy sequential writes on updated Windows 11 machines where possible, and watch for coordinated advisories from your SSD vendor. The incident also reinforces two perennial lessons: backups are never optional, and test rings that include storage stress tests are essential for catching rare but high-impact regressions before they hit production. This story is still unfolding. When Microsoft and its partners finally release their joint findings, that will be the moment to move from anxiety to action. Until then, keep your data safe and your firmware up to date.