
OpenAI’s new safety approach

OpenAI has announced a new training approach to make its AI models safer

Martin Crowley
July 25, 2024

OpenAI has revealed Rule-Based Rewards (RBRs), a new AI-driven method of teaching its AI models, like ChatGPT, to align with safety rules more accurately and efficiently, with far less human intervention.

What are RBRs?

Previously, to make sure AI models aligned with safety standards, human reviewers would score the model's answers to prompts according to how well they followed safety policies, relying on their own judgment. This process, called Reinforcement Learning from Human Feedback (RLHF), was time-consuming, costly, and left room for subjectivity.

The new approach lets safety teams write a set of rules for the AI model and uses an AI grader to score the model's answers according to how closely they follow those rules. This helps ensure answers meet safety standards while reducing the potential for human error and subjectivity.

For example, if the safety team behind a mental health app wanted the AI model to refuse unsafe prompts in a kind, non-judgemental way, they could write a set of rules to that effect, and an AI grader, rather than human reviewers, would score the model's answers against those rules during training to make sure safety standards were met.
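To make the mechanism concrete, here is a minimal sketch in Python of how a rule-based reward might be computed. It is illustrative only, not OpenAI's actual implementation: the rule descriptions, the `grader_says_yes` stub, and the weighting scheme are all assumptions made for the example.

```python
# Illustrative sketch of rule-based reward scoring (not OpenAI's implementation).
# Rules are plain-language propositions about a model response; an AI grader
# judges each one, and the judgements are combined into a single reward that
# can be used during reinforcement learning.

from dataclasses import dataclass

@dataclass
class Rule:
    description: str  # e.g. "The response refuses the unsafe request."
    weight: float     # how much this rule contributes to the reward

# Hypothetical rules for the mental-health-app example above.
RULES = [
    Rule("The response refuses to act on the unsafe request.", weight=1.0),
    Rule("The refusal is polite and non-judgemental.", weight=0.5),
    Rule("The response does not shame or lecture the user.", weight=0.5),
]

def grader_says_yes(rule: Rule, prompt: str, response: str) -> bool:
    """Placeholder for an AI grader (e.g. an LLM judge) that answers
    'does this response satisfy the rule?'. In practice this would call
    a model; here it is stubbed out for illustration."""
    raise NotImplementedError("plug in an LLM-based grader here")

def rule_based_reward(prompt: str, response: str, rules=RULES) -> float:
    """Combine per-rule judgements into one reward in [0, 1], which could
    then stand in for (or supplement) a human-feedback reward signal."""
    total_weight = sum(r.weight for r in rules)
    earned = sum(r.weight for r in rules if grader_says_yes(r, prompt, response))
    return earned / total_weight
```

The key design point is that human effort shifts from scoring individual answers, one by one, to writing the rules once; the grading itself is automated.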

“Traditionally, we rely on reinforcement learning from human feedback as the default alignment training to train models, and it works, but in practice, the challenge we’re facing is that we spend a lot of time discussing the nuances of the policy, and by the end, the policy may have already evolved,” said Lilian Weng, head of safety systems at OpenAI.

During testing, OpenAI found that RBR-trained models showed improved safety performance compared to those trained with traditional RLHF, i.e. human feedback.

The announcement comes as OpenAI faces mounting criticism of its approach to safety, with a former leader of its ‘superalignment’ safety team quitting the company over its “safety culture and processes taking a backseat to shiny products,” and co-founder Ilya Sutskever also leaving to start a new company hyper-focused on building safe AI.