Hidden flaws behind expert-level accuracy of multimodal GPT-4 vision in medicine

Qiao Jin; Fangyuan Chen; Yiliang Zhou; Ziyang Xu; Justin M. Cheung; Robert Chen; Ronald M. Summers; Justin F. Rousseau; Peiyun Ni; Marc J. Landsman; Sally L. Baxter; Subhi J. Al'Aref; Yijia Li; Alex Chen; Josef A. Brejt; Michael F. Chiang; Yifan Peng; Zhiyong Lu

Recent studies indicate that Generative Pre-trained Transformer 4 with Vision (GPT-4V) outperforms human physicians in medical challenge tasks. However, these evaluations focused primarily on the accuracy of multiple-choice answers alone. Our study extends the current scope by conducting a comprehensive analysis of GPT-4V's rationales for image comprehension, recall of medical knowledge, and step-by-step multimodal reasoning when solving New England Journal of Medicine (NEJM) Image Challenges, an imaging quiz designed to test the knowledge and diagnostic capabilities of medical professionals. Evaluation results confirmed that GPT-4V performs comparably to human physicians in multiple-choice accuracy (81.6% vs. 77.8%). GPT-4V also performs well on cases that physicians answer incorrectly, with over 78% accuracy. However, we discovered that GPT-4V frequently presents flawed rationales in cases where it makes the correct final choice (35.5%), most prominently in image comprehension (27.2%). Despite GPT-4V's high accuracy on multiple-choice questions, our findings emphasize the necessity of further in-depth evaluation of its rationales before integrating such multimodal AI models into clinical workflows.

updated: Sat Aug 31 2024 23:51:14 GMT+0000 (UTC)

published: Tue Jan 16 2024 14:41:20 GMT+0000 (UTC)

arXiv

