Understanding the Challenge
The Arc Prize Foundation, co-founded by AI researcher François Chollet, has introduced a new test called ARC-AGI-2. It aims to measure the general intelligence of AI models through complex, puzzle-like problems, and so far most leading models have struggled with it. The test evaluates how well AI can adapt to novel situations rather than rely on patterns from past training data.
Key Details
- Reasoning models such as OpenAI’s o1-pro and DeepSeek’s R1 scored between 1% and 1.3%.
- Non-reasoning models, including GPT-4.5 and Claude 3.7 Sonnet, scored around 1%.
- A human baseline was established with over 400 participants achieving an average score of 60%.
- The new test emphasizes efficiency, requiring models to interpret patterns in real time rather than rely on memorization or brute computational force.
Why This Matters
The introduction of ARC-AGI-2 is significant because it offers a more refined approach to evaluating AI intelligence. The previous version, ARC-AGI-1, had limitations: in particular, it allowed models to succeed by applying raw computational power rather than demonstrating genuine adaptability. The new efficiency metric challenges developers to build AI that can learn and adapt cost-effectively. This shift matters as the tech industry seeks better benchmarks for assessing AI’s capabilities, especially in the context of artificial general intelligence. The Arc Prize 2025 contest further incentivizes innovation by encouraging developers to achieve high accuracy on a limited budget.