OpenAI adds confession system to make ChatGPT admit bad behaviour
What's the story
OpenAI is developing a new training framework for artificial intelligence (AI) models, called "confession." The approach aims to teach AI systems to acknowledge their own undesirable behavior. It comes as a response to the tendency of large language models (LLMs) to give sycophantic answers or confidently state hallucinations. Under the new framework, the AI produces a secondary response explaining how it arrived at its main answer.
Evaluation criteria
Confessions evaluated on honesty, not accuracy
The confessions made by the AI models are evaluated solely on their honesty. This is different from the main replies, which are judged on multiple factors such as helpfulness, accuracy, and compliance. The ultimate goal of the new training framework is to make the model more transparent about its actions, even when those actions are problematic, like hacking a test or disobeying instructions.
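The two-track evaluation described above can be sketched in a few lines of Python. The function names and the equal weighting are invented for illustration; OpenAI has not published its actual scoring code.

```python
# Hypothetical sketch of the two-track scoring: main replies are judged
# on several combined factors, while confessions are judged on honesty
# alone. All names and weights here are illustrative assumptions.

def score_main_reply(helpfulness: float, accuracy: float, compliance: float) -> float:
    """Main replies are scored on multiple factors combined."""
    return (helpfulness + accuracy + compliance) / 3

def score_confession(honesty: float) -> float:
    """Confessions are scored on honesty and nothing else."""
    return honesty

# A reply can look decent overall even while an honest confession
# reveals a problem with how the answer was produced:
main_score = score_main_reply(helpfulness=0.9, accuracy=0.4, compliance=0.8)
confession_score = score_confession(honesty=1.0)
```

The point of the separation is that a weakness admitted in the confession does not drag down the main reply's score, which removes the incentive to hide it.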
Reward mechanism
OpenAI's confession system rewards honesty
OpenAI has clarified that if an AI model honestly admits to hacking a test, sandbagging, or violating instructions, it will be rewarded for the admission. This is a major shift from traditional training setups, where such admissions would often lead to penalties. The new approach could make LLMs more transparent and accountable in their operations.
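The reward rule described above amounts to rewarding truthful reporting rather than good behavior itself. A minimal toy version, assuming a simple symmetric reward and not reflecting OpenAI's actual implementation, could look like this:

```python
# Toy reward rule for illustration only: the confession is rewarded
# when it truthfully reports what happened, whether or not the
# underlying behavior was good. This is an assumption about the shape
# of the scheme, not OpenAI's published method.

def confession_reward(misbehaved: bool, admitted: bool) -> float:
    """Return the reward for a confession given what actually happened."""
    if misbehaved and admitted:
        return 1.0    # honest admission of hacking/sandbagging is rewarded
    if misbehaved and not admitted:
        return -1.0   # covering up real misbehavior is penalized
    if not misbehaved and admitted:
        return -1.0   # falsely confessing is also dishonest
    return 1.0        # truthful "nothing to confess"
```

Because the reward depends only on whether the report matches reality, the model's best strategy in the confession step is always to tell the truth.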