OpenAI has developed a ‘simple but challenging’ open-source benchmark, called SimpleQA, designed to assess the factual accuracy of AI models and to help developers build more reliable systems.
The test contains 4,326 questions on topics ranging from science to art, and each question has only one correct answer. The AI model's answers are compared against the database of correct answers, and ChatGPT grades each answer as correct, incorrect, or unanswered.
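As a rough illustration of how such a grading loop can work, the Python sketch below compares a model's answers against reference answers and tallies the grades. The dataset fields and the `grade_with_llm` helper are illustrative assumptions, not OpenAI's actual implementation.

```python
# Hypothetical sketch of a SimpleQA-style grading loop.
# Field names and the grading helper are assumptions for illustration only.
from collections import Counter


def grade_with_llm(question: str, reference: str, answer: str) -> str:
    """Placeholder for an LLM-based grader (e.g. a ChatGPT prompt) that
    returns 'correct', 'incorrect', or 'unanswered'."""
    raise NotImplementedError


def evaluate(dataset: list[dict], model_answers: dict[str, str]) -> Counter:
    """Compare each model answer against the single reference answer
    and count how many fall into each grade."""
    results = Counter()
    for item in dataset:
        answer = model_answers.get(item["question"], "")
        grade = grade_with_llm(item["question"], item["answer"], answer)
        results[grade] += 1
    return results
```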
OpenAI ran its own AI models through the test and found that its best-performing model, o1-preview, scored just 42.7%, while its next best, GPT-4o, managed a measly 38.2%. Its smallest model, GPT-4o mini, scored only 8%, which OpenAI attributes to its small size, saying it has ‘little knowledge about the world'. Although the GPT scores are poor, Anthropic's best model, Claude 3.5 Sonnet, answered only 28.9% of questions correctly, underscoring the worrying factual accuracy of AI models and perhaps highlighting that they shouldn't be used as a single source of information.
The test also assesses a model's confidence in its answers, and found that most AI models overestimate their performance and their ability to answer questions correctly, which perhaps explains why they can sometimes give nonsensical responses, like adding glue to pizza, with complete confidence.
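For readers curious how such a confidence check might be measured, the short Python sketch below bins answers by the confidence the model states and compares each bin's observed accuracy with that stated confidence; a well-calibrated model's accuracy should roughly match its confidence. The record fields are illustrative assumptions rather than OpenAI's methodology.

```python
# Hypothetical sketch of a calibration check: group answers by the model's
# stated confidence (0-100) and compute the observed accuracy per bucket.
# Record fields are assumptions for illustration only.


def calibration_by_bucket(records: list[dict], bucket_size: int = 10) -> dict[int, float]:
    """records: [{'confidence': 0-100, 'correct': bool}, ...]
    Returns observed accuracy per confidence bucket. An overconfident model
    shows accuracy well below the confidence it states for each bucket."""
    buckets: dict[int, list[bool]] = {}
    for record in records:
        # Clamp so a stated confidence of 100 falls into the top bucket.
        bucket = min(int(record["confidence"]) // bucket_size * bucket_size,
                     100 - bucket_size)
        buckets.setdefault(bucket, []).append(record["correct"])
    return {b: sum(v) / len(v) for b, v in sorted(buckets.items())}
```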