Purpose: This study aimed to evaluate and compare the responses of five large language models (LLMs) to questions on the pharmacologic management of acute dental pain in terms of comprehensiveness, scientific accuracy, clarity, relevance, and similarity of the information provided.
Methods: Twenty open-ended questions were posed to five LLMs, namely ChatGPT-4o, Gemini Advanced, Claude, Copilot, and Perplexity, and their responses were evaluated by two experts against American Dental Association guidelines. Responses were scored on a scale of 0 to 10, and similarity indices were assessed with the iThenticate program. Statistical analyses included the Friedman test and the Dunn post-hoc test, with the significance level set at p < 0.05.
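A minimal sketch of the type of analysis described above, assuming the per-question expert scores are available as one array of 0-10 ratings per model; the study does not name its statistical software, so the use of SciPy and the scikit_posthocs package (and all variable names and placeholder data below) is an assumption for illustration only.

```python
# Illustrative only: the kind of Friedman + Dunn analysis described in Methods.
# Placeholder data stands in for the two experts' 0-10 ratings per question.
import numpy as np
from scipy.stats import friedmanchisquare
import scikit_posthocs as sp

rng = np.random.default_rng(0)
models = ["ChatGPT-4o", "Gemini Advanced", "Claude", "Copilot", "Perplexity"]
# scores[i] = ratings for model i over the 20 questions (assumed placeholder data)
scores = [rng.integers(5, 11, size=20) for _ in models]

# Friedman test: non-parametric comparison of related samples
# (the same 20 questions are rated for every model).
stat, p = friedmanchisquare(*scores)
print(f"Friedman chi-square = {stat:.2f}, p = {p:.4f}")

# Dunn post-hoc pairwise comparisons with Bonferroni correction (alpha = 0.05).
posthoc = sp.posthoc_dunn(np.array(scores), p_adjust="bonferroni")
posthoc.index = posthoc.columns = models
print(posthoc.round(3))
```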
Results: Claude and ChatGPT-4o scored the highest for comprehensiveness, scientific accuracy, clarity, and relevance, while Copilot and Perplexity scored the lowest. Claude had the lowest similarity index (3 ± 5%) and ChatGPT-4o the highest (7 ± 8%). Statistical analysis showed significant differences among the five LLMs (p < 0.001): Claude, ChatGPT-4o, and Gemini Advanced performed comparably to one another and significantly outperformed Copilot and Perplexity.
Conclusion: In this study, Claude and ChatGPT-4o provided the most accurate and comprehensive responses; nevertheless, LLMs cannot replace clinical guidelines. These findings highlight the potential of LLMs to support clinicians while underscoring the scope for further improvement.