AI Vision Showdown – The Multimodal Arena Leaderboard Revealed

The Multimodal Arena leaderboard reveals the top AI models in vision-related tasks.

The LMSYS organization has launched the “Multimodal Arena,” a new leaderboard comparing AI models on vision-related tasks. In just two weeks it has gathered over 17,000 user preference votes across more than 60 languages. OpenAI’s GPT-4o took the lead, followed closely by Anthropic’s Claude 3.5 Sonnet and Google’s Gemini 1.5 Pro, highlighting the fierce competition among tech giants in the multimodal AI space.

Notably, the open-source model LLaVA-v1.6-34B achieved scores comparable to proprietary models, suggesting a potential democratization of advanced AI capabilities. The leaderboard evaluates a wide range of tasks, from image captioning to meme interpretation, providing a comprehensive view of each model’s visual processing abilities.

However, the CharXiv benchmark from Princeton University offers a stark reality check: AI still lags well behind humans in complex visual reasoning, with GPT-4o achieving only 47.1% accuracy against human performance of 80.5%. This gap underscores both the challenges and the opportunities in advancing AI’s nuanced visual understanding, signaling the need for breakthroughs in AI architecture and training methods.










