Purpose: This study aimed to compare the accuracy and reliability of four chatbot applications—ChatGPT o1, Google Gemini Advanced, DeepSeek R1, and Perplexity AI—in the context of dental traumatology.
Methods: Twenty-five dichotomous questions, derived from the 2020 guidelines of the International Association of Dental Traumatology (IADT), were administered to each chatbot by three independent researchers over a 10-day period. Each researcher asked each question three times per day, generating 90 responses per question for each chatbot. Responses were categorised as “correct,” “incorrect,” or “refer to a practitioner.” Accuracy rates and Fleiss’ Kappa values were calculated to assess performance and inter-response reliability.
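As an illustration of the metrics named above, the sketch below shows one way categorised response counts of this kind could be scored in Python; the counts and the use of statsmodels’ fleiss_kappa are illustrative assumptions, not the study’s actual analysis.

```python
# Minimal illustrative sketch (not the study's analysis script) of computing
# an accuracy rate and Fleiss' Kappa from categorised chatbot responses.
# The counts below are hypothetical placeholders, not data from the study.
import numpy as np
from statsmodels.stats.inter_rater import fleiss_kappa

# One row per question, one column per response category
# ("correct", "incorrect", "refer to a practitioner"); each cell is the
# number of the 90 responses to that question falling into that category.
counts = np.array([
    [82, 5, 3],
    [75, 10, 5],
    [90, 0, 0],
])

# Accuracy rate: proportion of all responses categorised as "correct".
accuracy = counts[:, 0].sum() / counts.sum()

# Fleiss' Kappa: agreement among repeated responses beyond chance,
# treating each repeated query as a "rater" of the question.
kappa = fleiss_kappa(counts, method="fleiss")

print(f"Accuracy rate: {accuracy:.3f}, Fleiss' kappa: {kappa:.3f}")
```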
Results: All chatbot models demonstrated high levels of accuracy. ChatGPT o1 yielded the highest accuracy rate (86.4%), followed by DeepSeek (84.0%), Perplexity (80.5%), and Google Gemini Advanced (80.2%). The highest Fleiss’ Kappa value was observed for the DeepSeek model (0.709), indicating the greatest internal consistency, while the Google Gemini Advanced model recorded the lowest value (0.185). Although DeepSeek and Perplexity exhibited relatively stronger reliability metrics, none of the models achieved complete consistency, and intra-platform variation was occasionally observed.
Conclusion: Contemporary chatbot models show substantial accuracy and improving reliability in responding to dental traumatology queries, suggesting their potential as clinical support tools. Nonetheless, further refinement and domain-specific optimisation remain necessary.