AI in Medicine – Not Ready for Prime Time

Can we really trust AI in critical areas like medical image diagnosis? Not yet — on specialized diagnostic questions, some models perform even worse than random guessing.

The integration of large language models (LLMs) and large multimodal models (LMMs) into medical settings is becoming increasingly prevalent, but a recent study by researchers at the University of California, Santa Cruz and Carnegie Mellon University raises serious concerns about their reliability in high-stakes, real-world scenarios. The study reveals that even advanced models, including GPT-4V and Gemini Pro, perform poorly when asked to identify conditions and positions in medical images: under probing evaluation, accuracy dropped by an average of 42% across the tested models. The researchers introduced a new dataset, ProbMed, built from 6,303 images drawn from two widely used biomedical datasets, and subjected seven state-of-the-art models to this probing evaluation. The results are alarming: even the most robust model saw its accuracy drop by at least 10.52%. The study highlights the urgent need for more rigorous evaluation methodologies to ensure the accuracy and reliability of LMMs in real-world medical applications.
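To make the reported accuracy drops concrete, here is a minimal sketch of how a paired probing metric can be scored. This is an illustration under assumptions, not the actual ProbMed evaluation harness: it assumes each ground-truth question is paired with an adversarial counterpart (e.g. a question about a condition not present in the image), and a model is credited only when it answers both correctly. The function names and toy data are hypothetical.

```python
# Hypothetical paired "probing" accuracy metric (a sketch, not the real
# ProbMed code): a model scores on an image only if it answers BOTH the
# ground-truth question and its adversarial counterpart correctly.

def standard_accuracy(results):
    """Fraction of ground-truth questions answered correctly."""
    return sum(r["original_correct"] for r in results) / len(results)

def probing_accuracy(results):
    """Fraction of question pairs where both answers are correct."""
    return sum(r["original_correct"] and r["adversarial_correct"]
               for r in results) / len(results)

# Toy results for an imaginary model on four images.
results = [
    {"original_correct": True,  "adversarial_correct": True},
    {"original_correct": True,  "adversarial_correct": False},
    {"original_correct": True,  "adversarial_correct": False},
    {"original_correct": False, "adversarial_correct": True},
]

std = standard_accuracy(results)    # 0.75
probed = probing_accuracy(results)  # 0.25
drop = (std - probed) / std * 100   # relative drop in accuracy
print(f"standard={std:.2f} probed={probed:.2f} drop={drop:.1f}%")
```

The point of the pairing is that a model answering "yes" indiscriminately looks accurate on the original questions alone but collapses once the adversarial twin is required — which is how headline accuracies can fall by tens of percentage points under probing.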