Date: 2025-01-20

What's the story

A new study has found that artificial intelligence (AI) systems struggle to answer complex historical questions accurately.

The research was conducted by a team from the Complexity Science Hub (CSH), an Austrian research institute.

They created a new benchmark, Hist-LLM, to evaluate three top large language models (LLMs)—OpenAI's GPT-4 Turbo, Meta's Llama, and Google's Gemini—on their accuracy in answering historical questions.