OpenAI has released a new open-source benchmark, called MLE-bench, designed to evaluate AI models' real-world machine learning (ML) capabilities. It uses 75 competitions drawn from Kaggle (Google's ML competition and community platform) to assess a model's ability to complete complex, end-to-end ML tasks, such as preparing datasets, training models, and running experiments, the kind of practical engineering work that real-world projects demand.
To ground the results, the benchmark draws on scores from Kaggle's public leaderboards, where human data scientists and ML engineers compete and earn gold, silver, and bronze medals, and compares AI models' submissions against those same medal thresholds.
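To make the medal-based grading concrete, here is a minimal sketch assuming a simple percentile-style rule; the `assign_medal` function, the threshold values, and the example leaderboard are illustrative only, not MLE-bench's actual grading code:

```python
def assign_medal(agent_score: float, leaderboard: list[float],
                 higher_is_better: bool = True) -> str | None:
    """Place agent_score on a human leaderboard and award a medal, if any.

    The cut-offs loosely mimic Kaggle's percentile-based medal rules for
    large competitions (top 10 entries gold, top 5% silver, top 10% bronze)
    and are simplified for illustration.
    """
    if higher_is_better:
        beaten_by = sum(1 for s in leaderboard if s > agent_score)
    else:
        beaten_by = sum(1 for s in leaderboard if s < agent_score)
    rank = beaten_by + 1
    n = len(leaderboard)

    if rank <= 10:
        return "gold"
    if rank <= max(1, int(0.05 * n)):
        return "silver"
    if rank <= max(1, int(0.10 * n)):
        return "bronze"
    return None


# Example: an agent scoring 0.97 on a 1,000-entry leaderboard lands at
# rank 60, inside the top 10% but outside the top 5%, so it earns bronze.
print(assign_medal(0.97, [0.5 + i * 0.0005 for i in range(1000)]))
```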
OpenAI's own o1-preview achieved at least a bronze-medal-level result in 16.9% of the competitions, showing that the model still has room to improve its adaptability and problem-solving abilities. The benchmark also showed that models perform better when allowed multiple attempts or given more time: for example, GPT-4o's medal rate rose from 8.7% to 11.8% when it was given 100 hours per competition instead of 24.
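The effect of multiple attempts can be illustrated with a small aggregation sketch; the `medal_rate` helper and the toy per-competition results below are hypothetical, not OpenAI's evaluation code:

```python
def medal_rate(results: dict[str, list[bool]], attempts: int) -> float:
    """Fraction of competitions where any of the first `attempts` runs medalled."""
    solved = sum(any(runs[:attempts]) for runs in results.values())
    return solved / len(results)


# Toy data: three competitions, each with three independent attempts,
# where True means that attempt earned at least a bronze medal.
results = {
    "competition-a": [False, True, False],
    "competition-b": [False, False, False],
    "competition-c": [True, True, False],
}

print(medal_rate(results, attempts=1))  # single attempt: 1/3 of competitions
print(medal_rate(results, attempts=3))  # best of three: 2/3 of competitions
```

Counting a competition as solved if any attempt earns a medal means the rate can only go up as the number of attempts grows, which is why repeated tries boost the headline numbers.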
Existing ML benchmarks are limited in that they measure individual skills in isolation, whereas MLE-bench focuses on holistic, end-to-end performance on tasks inspired by real-world challenges and scenarios.
OpenAI has also made MLE-bench open source to encourage collaboration, with the aim of accelerating progress toward safer, more reliable AI systems.