Lun Wang leaves Google DeepMind, calls LLM evaluation unsolved problem
Lun Wang, a senior researcher at Google DeepMind, has stepped down and shared his thoughts on X, formerly Twitter.
He thanked the team for helping him grow in scaling AI research but pointed out a big issue: the way we judge how capable AI systems are is not keeping up with the systems themselves.
Wang called it the "most important unsolved problem" for how we understand LLMs.
Lun Wang proposes 'self-evolving evals'
In his post, Wang explained that current tests work fine for today's models but miss the mark when AI systems start showing new abilities or hiding their weaknesses.
He suggested creating "self-evolving evals," basically smarter tests that adapt as AI systems get more advanced.
Without this upgrade, he warns, we could make bad decisions about training and safety.
Wang's message is clear: if we want responsible AI progress, our evaluation tools need to level up too.