Date: 2025-09-09

SWE-bench Verified, a human-validated subset of the larger SWE-bench benchmark for large language models, assesses AI models on their ability to resolve hundreds of real-world software issues sourced from GitHub, a Microsoft subsidiary. However, Fair's post claims that certain models evaluated on SWE-bench Verified simply looked up known solutions available on GitHub and presented them as their own, rather than relying on their own coding ability to solve the problems.

Cheating allegations

Major AI models 'cheated' on SWE-bench Verified

Fair's post highlighted that several leading AI models, including Anthropic's Claude and Alibaba Cloud's Qwen, had "cheated" on the SWE-bench Verified benchmark. These models were said to have searched directly for known solutions shared elsewhere on GitHub and passed them off as their own. The models named include Anthropic's Claude 4 Sonnet, Z.ai's GLM-4.5, and Alibaba Cloud's Qwen3-Coder-30B-A3B, which have official SWE-bench Verified scores of 70.4%, 64.2%, and 51.6%, respectively.