Microsoft Designer Envisions HoloLens Successor as AI-Powered Earbuds That See and Speak

A Microsoft principal designer has sketched out a radical departure from bulky mixed-reality headsets: a pair of AI-driven earbuds that use stereo cameras to understand your surroundings and speak answers back through a voice assistant. The concept, dubbed Copilot Veja, reimagines HoloLens not as another screen on your face but as an always-available, context-aware audio companion—one that could finally make mixed reality feel invisible.

Microsoft officially ended production of HoloLens 2 last year, marking a strategic retreat from manufacturing first-party headsets. Security updates continue through December 31, 2027, but the company has offered no public roadmap for a successor. Into that vacuum, Braz de Pina, a Microsoft designer focused on UI/UX, created Copilot Veja as an independent, fan-made vision of what might come next: a device that swaps visors and goggles for discreet, ear-worn wearables.

The Concept: Cameras on Your Ears, Not a Screen on Your Eyes

Copilot Veja proposes two earbud-like stems that clip behind each ear, each packing a camera, microphone, and physical controls. A dedicated Copilot button triggers the AI, a volume knob adjusts audio, a power button manages the device, and a camera shutter enables capture. The dual cameras work together to replicate stereoscopic depth perception, allowing the onboard AI to analyze the environment in three dimensions.

The critical design choice: no heads-up display. De Pina argues that most people already carry screens on their phones and smartwatches, making a visual overlay redundant. “With capable agentic AI, do I really need to see what the AI tells me? Or is it enough to just hear it?” he asked. The answer, for Copilot Veja, is voice. The device is designed to provide instant, contextual feedback via audio, turning Copilot into a perceptive assistant that can literally see what you see and talk you through tasks.

Why the Idea Matters Now: Two Trends Colliding

The mixed-reality market has stalled. High-end headsets like Apple Vision Pro and Meta’s Quest line have struggled to break beyond early adopters and enterprise pilots. Bulky form factors, social awkwardness, and high prices have kept headsets niche. Simultaneously, multimodal AI—models that fuse language, vision, and audio—has advanced rapidly. Copilot itself now processes images and speech in real time. Copilot Veja bridges these trends by betting that the next leap in mixed reality won’t require strapping a screen to your face; it will demand an AI that understands context and communicates as naturally as a human companion.

The concept aims to lower the barrier to adoption. Earbuds are already a mass-market form factor, accepted in public and comfortable for hours-long wear. By embedding AI perception into something familiar, Microsoft (or any company) could sidestep the social friction that doomed Google Glass and has confined HoloLens to factory floors.

The Strengths: Less Friction, More Intelligence

Copilot Veja’s design scores on several practical fronts:

Social acceptability: Earbuds don’t obscure your face or scream “wearable computer.” They blend in.
Audio-first UX: Voice is efficient for many tasks—navigation, quick reference, language translation, step-by-step repair guidance—and leaves your hands and eyes free.
Contextual superpower: AI that sees your environment can answer questions like “What’s this component?” or “How do I fix this leak?” without you needing to pull out a phone, snap a photo, and type a query.
Potential cost savings: Removing the expensive optical engines and displays of a headset could drop the bill of materials significantly, making the device accessible to more people.
Dual-purpose hardware: The cameras can capture first-person content for creators, field workers, or anyone who wants hands-free recording with instant AI-driven metadata and editing suggestions.

These strengths align with where consumer tech is heading. Voice assistants are already a familiar interface on phones and smart speakers. Adding vision to that assistant, while keeping the interaction voice-driven, could make AI truly ambient.

The Hard Problems: Why Copilot Veja Isn’t a Product Yet

Between the glossy renders and a shippable product lies a minefield of engineering and ergonomic challenges. Forum discussion and the original reporting both highlight obstacles that would require breakthroughs in hardware, software, and regulation.

Ergonomics and Comfort

Earbuds that house cameras, batteries, processors, and multiple physical controls will be larger and heavier than today’s audio-only buds. The ear stem must remain stable during movement and comfortable for extended wear—a tall order when you add stereo camera alignment, which demands precise positioning. A slight shift in fit could throw off depth perception. Testing across diverse ear shapes and activity levels is non-trivial.
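
To see why fit matters so much, consider the textbook pinhole stereo relation, in which depth equals focal length times baseline divided by disparity. The sketch below is illustrative only: the 800-pixel focal length and 0.15 m ear-to-ear baseline are assumed values, not Copilot Veja specifications, and the function names are hypothetical.

```python
def depth_from_disparity(focal_px: float, baseline_m: float, disparity_px: float) -> float:
    """Depth in meters from the pinhole stereo relation Z = f * B / d."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive")
    return focal_px * baseline_m / disparity_px


def depth_error(focal_px: float, baseline_m: float,
                true_depth_m: float, disparity_error_px: float) -> float:
    """Depth error (meters) caused by a disparity error, e.g. from a shifted stem."""
    true_disparity = focal_px * baseline_m / true_depth_m
    measured = depth_from_disparity(focal_px, baseline_m,
                                    true_disparity + disparity_error_px)
    return measured - true_depth_m


# Assumed, illustrative values (not from the concept itself).
FOCAL_PX = 800.0
BASELINE_M = 0.15

# Effect of a one-pixel disparity error at two different distances.
error_near = depth_error(FOCAL_PX, BASELINE_M, 2.0, 1.0)
error_far = depth_error(FOCAL_PX, BASELINE_M, 6.0, 1.0)
print(f"1 px error at 2 m: {error_near:+.3f} m")
print(f"1 px error at 6 m: {error_far:+.3f} m")
```

Under these assumptions, the same one-pixel misalignment costs a few centimeters at arm's-length distances but tens of centimeters across a room, which is why a stem that shifts in the ear would need continuous recalibration.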

Battery Life, Heat, and Compute

Continuous stereo vision and real-time AI inference are power-hungry operations. Running a Copilot-class model on-device would drain a tiny battery in minutes. Offloading compute to a paired smartphone or the cloud introduces latency and privacy risks. The alternative—integrating an ultra-low-power neural accelerator into each stem—increases cost and complexity while generating heat. Dissipating that heat from a device pressed against sensitive skin is a safety concern, not just a comfort issue.
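
A back-of-envelope calculation shows how quickly the numbers turn against always-on operation. Every figure below is an assumption chosen for illustration (a 0.2 Wh cell roughly in line with today's earbuds, 3 W for cameras plus continuous inference, 50 mW at idle), not a measured or published value, and the helper names are hypothetical.

```python
def average_power_w(active_w: float, idle_w: float, duty_cycle: float) -> float:
    """Average draw when the device is active for a fraction `duty_cycle` of the time."""
    if not 0.0 <= duty_cycle <= 1.0:
        raise ValueError("duty_cycle must be between 0 and 1")
    return idle_w + duty_cycle * (active_w - idle_w)


def runtime_minutes(capacity_wh: float, avg_power_w: float) -> float:
    """Runtime in minutes for a battery of `capacity_wh` at a constant average draw."""
    if avg_power_w <= 0:
        raise ValueError("avg_power_w must be positive")
    return capacity_wh / avg_power_w * 60.0


# Assumed, illustrative values.
BATTERY_WH = 0.2   # small earbud-class cell
ACTIVE_W = 3.0     # cameras plus continuous on-device inference
IDLE_W = 0.05      # audio playback and low-power sensing

always_on = runtime_minutes(BATTERY_WH, average_power_w(ACTIVE_W, IDLE_W, 1.0))
mostly_idle = runtime_minutes(BATTERY_WH, average_power_w(ACTIVE_W, IDLE_W, 0.05))
print(f"Always on: {always_on:.1f} min")
print(f"Active 5% of the time: {mostly_idle:.1f} min")
```

With these assumed inputs, continuous operation lasts about four minutes, while keeping the heavy pipeline active only 5% of the time stretches that to roughly an hour. The exact figures matter less than the ratio: average power, not peak capability, decides whether the device survives a commute.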

Sensing Quality and Robustness

The cameras must capture high-quality depth data across varying light conditions and fields of view limited by the ear’s geometry. Hair, clothing, and head movement can occlude lenses. Low-light performance, image stabilization, and calibration will be constant engineering fights. Without a visual HUD, the system must infer when to speak and when to stay silent—a tricky contextual awareness problem.

Privacy, Trust, and Regulation

Always-on cameras in a consumer wearable ignite immediate privacy alarms. People nearby may not consent to being recorded, and many jurisdictions restrict covert audio or video capture. The device would need unmistakable recording indicators (LEDs, haptics, audio cues), granular user controls to disable cameras or set geofenced restrictions, and on-device processing where possible to minimize raw video streams leaving the device. Building and proving those safeguards would be as much a design challenge as a technical one.

Interaction Paradigm Limitations

Voice is great for many tasks, but some mixed-reality workflows—spatial mapping, detailed repair overlays, architectural design—genuinely benefit from visual augmentation. Eliminating the display limits the device’s utility for those use cases. The system would need seamless fallbacks: for instance, automatically sending an annotated image to your phone when you need to see something too complex to describe verbally.

Where Copilot Veja Would Compete

The device wouldn’t compete directly with headsets like Vision Pro or XREAL glasses; it would carve a new niche at the intersection of several categories:

True wireless earbuds: Against AirPods, Galaxy Buds, and others, Veja would have to match audio quality, noise cancellation, and battery life while adding camera hardware.
Wearable cameras: Like GoPro or Snap Spectacles, but with an AI assistant as the primary differentiator.
AI assistant hardware: The physical embodiment of Copilot, tying into Microsoft’s broader AI ecosystem (Windows, Office, Edge).
Enterprise field tools: A lightweight alternative to HoloLens for guided workflows where audio instruction plus occasional visual capture suffices.

The unique value proposition is the fusion: an AI that sees context and talks back, in a package you already know how to wear.

What It Would Take to Move from Vision to Reality

Any vendor attempting something like Copilot Veja would need to follow a disciplined, phased approach. Based on the analysis from the community, the path might look like:

Modular prototyping: Build a development kit with detachable stems, high-quality external cameras, and a tethered phone for compute. Iterate on the core interaction model—query invocation, voice response style, depth sensing accuracy.
Privacy by design: Integrate physical camera shutters, hardware-level recording indicators, and per-app permission systems. Make local processing the default for basic recognition tasks.
Pilot in controlled environments: Start in enterprise settings—manufacturing, logistics, field service—where consent and data policies are already established, and the value of hands-free guidance is high.
Ergonomic validation: Conduct human factors studies across wide demographics, long sessions, and diverse activities. Solve the heat and battery puzzle with efficient accelerators and intelligent sampling (e.g., camera wake-on-motion).
Multimodal handoff design: Craft the experience so that when a user needs more than voice, the system smoothly transitions to a paired screen—phone, watch, or eventually a lightweight display.
Regulatory and social readiness: Proactively engage with privacy advocates, publish transparency reports, and design features like automatic face blurring to mitigate bystander concerns.
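
The "camera wake-on-motion" idea in the ergonomic validation step can be sketched with simple frame differencing: keep a tiny low-resolution sensor running and wake the full vision pipeline only when consecutive frames differ enough. This is a minimal illustration under assumed parameters; the threshold of 8 gray levels and all names are hypothetical, and a real device would more likely do this in dedicated low-power hardware.

```python
from typing import Sequence

# A frame is a grid of 8-bit grayscale pixel values (rows of ints, 0-255).
Frame = Sequence[Sequence[int]]


def mean_abs_diff(prev: Frame, curr: Frame) -> float:
    """Mean absolute per-pixel difference between two equally sized frames."""
    total = 0
    count = 0
    for row_prev, row_curr in zip(prev, curr):
        for a, b in zip(row_prev, row_curr):
            total += abs(a - b)
            count += 1
    if count == 0:
        raise ValueError("frames must not be empty")
    return total / count


def should_wake(prev: Frame, curr: Frame, threshold: float = 8.0) -> bool:
    """Wake the full pipeline only if the scene changed by at least `threshold`."""
    return mean_abs_diff(prev, curr) >= threshold


# Tiny 2x2 example frames.
still = [[10, 10], [10, 10]]
sensor_noise = [[11, 10], [10, 9]]
movement = [[10, 90], [80, 10]]

print(should_wake(still, sensor_noise))  # small change, stays asleep
print(should_wake(still, movement))      # large change, wakes up
```

The threshold trades battery life against responsiveness: set it too low and sensor noise wakes the device constantly, too high and slow changes in the scene go unnoticed.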

Risks That Could Kill the Idea

Even if the engineering hurdles are cleared, several systemic risks could derail adoption:

Ergonomic failure: If the device is uncomfortable after 30 minutes, users will simply switch back to their regular earbuds.
Privacy backlash: A scandal around covert recording could trigger bans and consumer rejection overnight.
Feature mismatch: For many professionals, a HUD is non-negotiable. An audio-only assistant may be perceived as a downgrade from existing headset-based tools.
Battery anxiety: A device that dies mid-shift or mid-commute is useless. Users won’t carry charging cases for another gadget.
Ecosystem lock-in: If Veja works only with Microsoft services, it limits the addressable market; if it tries to be platform-agnostic, the integration quality might suffer.

The Longer View: A Roadmap for Ambient AI

Despite the risks, Copilot Veja’s core insight is worth exploring. The next generation of personal computing may not be about bigger screens but about more perceptive, talkative AI that fades into the background. The concept sketches a world where:

A field technician asks, “What’s the torque spec on this bolt?” and hears the answer immediately.
A tourist walks through a museum and gets whispered facts about paintings without staring at a phone.
A commuter with visual impairment navigates a busy street with real-time audible cues.

These scenarios don’t require a headset; they need an AI that sees and speaks. Copilot Veja is not a product roadmap—it’s a provocation. It forces the industry to answer a fundamental question: if AI becomes truly perceptive, can we retire the screen and trust the voice?

The timeline for such a device is likely years, not months. Short-term, expect research prototypes and enterprise pilots. Medium-term, a niche consumer launch aimed at creators or specific professional workflows. Long-term, a seamless blend of audio Copilot with augmented glasses, once the technical and social barriers fall.

Final Analysis: The Value of a Question

Copilot Veja is a love letter to possibility, drawn by a designer who understands both the magic and the mess of hardware. It’s not officially endorsed by Microsoft, but it channels a real frustration: the best AI today lives in slabs of glass and metal, disconnected from our physical reality. By imagining an assistant that shares your view and speaks your language, de Pina has given the industry something more valuable than a product blueprint—he’s given it a new question to chase.

Whether Microsoft or a startup picks up the thread, the concept’s legacy may be in resetting expectations. Mixed reality doesn’t have to mean a visor; it can mean an AI companion that notices what you notice and helps without getting in the way. Copilot Veja isn’t the answer, but it’s a damn good question.