AI in Medical Education: Study Reveals Accuracy Gaps in ChatGPT, Gemini, and Copilot

A recent BMC Oral Health study evaluating the AI assistants ChatGPT-4, Google Gemini, and Microsoft Copilot found significant accuracy gaps in their knowledge of nasoalveolar molding, a specialized cleft lip and palate treatment. While ChatGPT-4 performed best overall, all three systems showed limitations that could affect medical education and patient care, highlighting the need for cautious use of general-purpose AI tools in specialized healthcare contexts.

A study published in BMC Oral Health tested three major AI assistants (ChatGPT-4, Google Gemini, and Microsoft Copilot) on specialized medical knowledge about nasoalveolar molding (NAM) and found significant accuracy gaps that could affect patient care and medical education. The research, which evaluated how accurately these systems answer questions about this cleft lip and palate treatment, gives the medical community a sobering assessment of where artificial intelligence currently stands in healthcare applications.

The Study Design and Methodology

The BMC Oral Health study employed a rigorous methodology to assess AI performance in medical education contexts. Researchers presented all three AI systems with identical questions about nasoalveolar molding, a specialized pre-surgical orthopedic technique used to reshape the gums, lips, and nostrils of infants born with cleft lip and palate before surgical repair. The questions covered fundamental concepts, clinical applications, procedural details, and potential complications of NAM therapy.

Each AI's responses were evaluated by a panel of craniofacial specialists using standardized scoring criteria that assessed accuracy, completeness, clinical relevance, and potential for misinformation. The evaluation considered whether the information provided would be suitable for healthcare professionals, patients, and caregivers seeking reliable medical guidance.
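As a rough illustration of how panel ratings of this kind could be combined, the sketch below averages each criterion across raters for a single AI response. The four criterion names come from the description above; the 1-to-5 scale, the function name aggregate_scores, and the example ratings are assumptions made for illustration and are not taken from the study.

```python
from statistics import mean

# Criteria named in the article. The 1-5 scale used below is an assumption.
CRITERIA = ("accuracy", "completeness", "clinical_relevance", "misinformation_risk")


def aggregate_scores(ratings):
    """Average each criterion across raters for one AI response.

    ratings: list of dicts, one per rater, mapping criterion name -> score.
    Returns a dict mapping criterion name -> mean score across raters.
    """
    if not ratings:
        raise ValueError("at least one rater is required")
    return {c: mean(r[c] for r in ratings) for c in CRITERIA}


# Hypothetical ratings from three raters for a single response (not study data).
example_ratings = [
    {"accuracy": 4, "completeness": 3, "clinical_relevance": 4, "misinformation_risk": 2},
    {"accuracy": 5, "completeness": 3, "clinical_relevance": 4, "misinformation_risk": 1},
    {"accuracy": 3, "completeness": 3, "clinical_relevance": 4, "misinformation_risk": 3},
]

summary = aggregate_scores(example_ratings)
print(summary)
```

Keeping the criteria separate, rather than collapsing them into one number, preserves the distinction between a response that is incomplete and one that is actively misleading.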

Performance Results: A Mixed Bag

ChatGPT-4: The Frontrunner with Limitations

ChatGPT-4 emerged as the most reliable performer among the three AI systems tested, demonstrating the highest overall accuracy in its responses about nasoalveolar molding. The OpenAI model showed particular strength in explaining basic NAM concepts and treatment timelines, providing information that generally aligned with current clinical guidelines.

However, even ChatGPT-4 displayed concerning limitations. The AI occasionally provided outdated information about specific NAM techniques and materials, and in some instances, offered recommendations that didn't reflect current best practices in cleft care. More troubling were the instances where ChatGPT-4 presented incorrect information with high confidence, a phenomenon known as "hallucination" that poses significant risks in medical contexts.

Google Gemini: Inconsistent but Promising

Google Gemini demonstrated variable performance across different aspects of NAM knowledge. The AI excelled at providing comprehensive overviews of the treatment process and connecting NAM to broader cleft care concepts, but struggled with technical specifics about appliance design and adjustment protocols.

Researchers noted that Gemini occasionally provided contradictory information within the same response, creating potential confusion for users seeking clear guidance. The model also showed gaps in understanding the timing and sequencing of NAM therapy relative to other aspects of cleft treatment, which could lead to misunderstandings about treatment planning.

Microsoft Copilot: The Specialist's Shortcomings

Microsoft Copilot, despite its integration with professional tools and platforms, showed the most significant knowledge gaps regarding nasoalveolar molding. The AI frequently provided incomplete information about NAM procedures and in several instances offered recommendations that contradicted established clinical protocols.

Particularly concerning was Copilot's performance on questions about NAM complications and management strategies. The system either provided overly simplistic answers or failed to address critical safety considerations, potentially leaving caregivers unprepared for common challenges during NAM treatment.

Critical Implications for Medical Education

The study's findings carry profound implications for how AI systems are integrated into medical education and patient care:

Reliability Concerns in Specialized Domains

The performance gaps observed across all three AI systems highlight a fundamental challenge: current large language models struggle with highly specialized medical knowledge, particularly in niche areas like craniofacial care. This limitation becomes critically important when healthcare professionals, students, or patients rely on these tools for accurate information.

Medical educators now face the challenge of teaching digital literacy alongside clinical skills, ensuring that future healthcare providers can critically evaluate AI-generated information rather than accepting it at face value. The study suggests that AI systems cannot yet replace traditional medical education resources or expert consultation in specialized domains.

Patient and Caregiver Education Risks

For families navigating cleft care, accurate information about treatments like nasoalveolar molding is essential for informed decision-making and successful treatment outcomes. The study raises red flags about patients and caregivers using general-purpose AI systems for medical guidance without professional verification.

The consequences of misinformation in this context can be significant—from unrealistic expectations about treatment duration and outcomes to improper appliance care that could compromise treatment effectiveness or even cause harm. Healthcare providers may need to proactively address this issue by directing patients to vetted educational resources.

The Technical Limitations Behind the Performance Gaps

Training Data Biases and Gaps

The accuracy issues observed in all three AI systems likely stem in large part from limitations in their training data. Nasoalveolar molding is a highly specialized area of medicine with relatively little published literature compared to more common medical conditions. With fewer high-quality examples to learn from, AI models are more prone to knowledge gaps and inaccuracies on the topic.

Additionally, medical knowledge evolves rapidly, and AI systems trained on static datasets may not reflect the most current clinical practices or research findings. The study noted several instances where AI responses referenced outdated techniques or materials no longer commonly used in contemporary NAM practice.

Context Understanding Challenges

All three AI systems demonstrated difficulties understanding the nuanced context of medical questions. For example, when asked about NAM complications, the systems often provided generic lists of potential issues without considering the specific context of infant care or the unique challenges of cleft treatment.

This limitation reflects a broader challenge in AI development: while language models excel at pattern recognition and information retrieval, they struggle with the deep contextual understanding required for complex medical decision-making.

Industry Responses and Development Directions

Microsoft's Healthcare AI Strategy

Following the study's publication, Microsoft has accelerated its healthcare-specific AI initiatives. The company is developing more specialized versions of Copilot trained on curated medical literature and vetted by clinical experts. Microsoft's recent partnerships with healthcare organizations aim to create AI tools that meet the rigorous accuracy standards required in medical contexts.

The tech giant is also investing in better verification systems that would flag potentially unreliable medical information and direct users to consult healthcare professionals for critical decisions.

Google's Medical AI Roadmap

Google has responded by enhancing Gemini's medical capabilities through improved training methodologies and specialized fine-tuning. The company is leveraging its access to vast medical literature databases and collaborating with academic medical centers to improve AI performance in specialized domains like cleft care.

Recent updates to Gemini include better citation of medical sources and improved confidence calibration, helping users understand when the AI might be uncertain about medical information.

OpenAI's Clinical Validation Efforts

OpenAI has initiated more rigorous clinical validation processes for ChatGPT in healthcare applications. The organization is working with medical professional societies to develop specialized versions of its models that undergo thorough testing before deployment in medical education contexts.

The company is also exploring ways to incorporate real-time medical literature updates into ChatGPT's knowledge base, addressing concerns about outdated information in rapidly evolving fields.

Practical Recommendations for Healthcare Professionals

Based on the study's findings, medical professionals should consider these guidelines when using AI tools:

Verification is Essential: Always verify AI-generated medical information against trusted clinical resources and current literature
Understand Limitations: Recognize that AI systems perform better with general medical knowledge than highly specialized topics
Use as Supplementary Tools: Treat AI assistants as starting points for inquiry rather than definitive sources
Patient Education Caution: Exercise extreme caution when recommending AI tools to patients for medical information
Stay Updated: Monitor the rapid evolution of medical AI capabilities and limitations

The Future of AI in Medical Education

Despite current limitations, the study authors express optimism about AI's potential role in medical education. The technology continues to improve rapidly, and specialized medical AI systems show promise for enhancing learning experiences when properly validated and integrated.

Key areas for future development include:

Specialized Medical Models: AI systems trained specifically on vetted medical literature and clinical guidelines
Real-time Knowledge Updates: Systems that can incorporate the latest research findings and clinical recommendations
Better Confidence Indicators: Improved mechanisms for AI systems to express uncertainty about medical information
Integration with Clinical Decision Support: Seamless incorporation of AI tools into existing clinical workflows and educational platforms

Conclusion: A Call for Cautious Optimism

The BMC Oral Health study serves as an important reality check about AI's current capabilities in medical education. While ChatGPT-4, Google Gemini, and Microsoft Copilot show impressive general knowledge, their performance in specialized medical domains like nasoalveolar molding reveals significant accuracy gaps that could impact patient care if not properly addressed.

For now, healthcare professionals and educators should approach these tools with cautious optimism—recognizing their potential while maintaining rigorous standards for medical accuracy. As AI technology continues to evolve, the medical community must play an active role in shaping its development to ensure these powerful tools enhance rather than compromise patient care and medical education.

The study ultimately underscores that in specialized medical domains, human expertise remains irreplaceable, and AI should serve as a complement to—not a replacement for—clinical judgment and professional knowledge.
