ASOHNS ASM 2026
Can ChatGPT-4 Vision match a custom otoscopic image classifier in analysing otoscopic images?
Poster

Talk Description

Institution: Sydney Medical School, University of Sydney - New South Wales, Australia

Aims: Artificial intelligence (AI) systems are increasingly applied to otoscopic image interpretation, but it is unclear whether general-purpose AI systems, including GPT-4 Vision, can match the accuracy of bespoke, task-trained convolutional neural networks (CNNs) developed using curated clinical datasets. This study compared the diagnostic accuracy of three image-based diagnostic models using otoscopic images alone: a domain-specific CNN classifier, a general-purpose model (GPT-4 Vision), and a hybrid system integrating both.

Methodology: Routine otoscopic images were prospectively collected by nurses during community hearing assessments. Ground-truth diagnoses were defined retrospectively by otolaryngologist review. A test dataset of 80 otoscopic images, without associated tympanometry or audiometry, was used to evaluate the three image-based diagnostic systems. Area under the curve (AUC), Cohen's kappa (κ), and diagnostic accuracy were assessed.

Results: The CNN ensemble classifier outperformed both the GPT-4o-mini vision model and the hybrid fusion model (CNN with GPT-4o-mini outputs) across all performance metrics. The CNN ensemble classifier achieved 87.5% accuracy (AUC=0.959, κ=0.829), showing strong agreement with ground truth. GPT-4o-mini performed poorly (accuracy=33.75%, AUC=0.540, κ=0.096), reflecting only slight agreement beyond chance. The hybrid fusion model achieved 77.5% accuracy (AUC=0.934, κ=0.695). Performance differences between all three models were statistically significant (p < 0.0001).

Conclusion: The CNN ensemble classifier demonstrated the highest diagnostic performance, underscoring the value of specialised classification models trained on otoscopic images. This suggests that general-purpose vision-language models such as GPT-4 Vision do not yet match the diagnostic accuracy of bespoke CNN classifiers. However, they may have potential for integrating multimodal clinical data, for example drawing inferences from demographic, symptom, or audiometry data, as an adjunct to image-based models.
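The abstract reports accuracy, AUC, and Cohen's kappa for each model but does not describe how the metrics or the hybrid fusion were implemented. The sketch below shows one plausible way to compute these metrics with scikit-learn and to combine CNN and GPT-4o-mini outputs by late fusion; the function names, the macro one-vs-rest AUC aggregation, and the weighted-average fusion scheme are assumptions for illustration only, not details from the study.

```python
# Minimal evaluation sketch (not the authors' code). Assumes a multi-class
# otoscopic diagnosis task with integer ground-truth labels and per-class
# probability scores for each of the 80 test images. Names such as
# evaluate_model, y_true, p_cnn, and p_gpt are illustrative placeholders.
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score, cohen_kappa_score

def evaluate_model(y_true, y_prob):
    """Return accuracy, AUC, and Cohen's kappa for one model.

    y_true : (n,) array of integer ground-truth labels
    y_prob : (n, n_classes) array of predicted class probabilities
    """
    y_pred = y_prob.argmax(axis=1)
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        # Macro-averaged one-vs-rest AUC is an assumption; the abstract
        # does not state how AUC was aggregated across classes.
        "auc": roc_auc_score(y_true, y_prob, multi_class="ovr", average="macro"),
        "kappa": cohen_kappa_score(y_true, y_pred),
    }

def fuse_probabilities(p_cnn, p_gpt, w=0.7):
    """Illustrative late fusion of CNN and GPT-4o-mini class probabilities.

    A simple weighted average is only one plausible scheme; the actual
    fusion method used in the study is not described in the abstract.
    """
    p = w * p_cnn + (1.0 - w) * p_gpt
    return p / p.sum(axis=1, keepdims=True)  # renormalise each row to sum to 1
```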
Authors

Dr Tony Lian, Dr Al-Rahim Habib, Dr Justin Eltenn, Dr Ravi Jain, A/Prof Narinder Singh