Developers cannot rely on a few people to police a technology that will affect many, it's claimed. OpenAI's alignment team is attempting to solve this problem by building "a roughly human-level automated alignment researcher." In other words, rather than depending on human oversight, OpenAI wants to build an AI system that can align other machines to human values.
That would be artificial intelligence training artificial intelligence to be more like non-artificial intelligence, it seems to us. It feels a bit chicken-and-egg.

Such a system could, for example, search for problematic behavior in a model and provide corrective feedback, or take other steps to fix it. To test that system's performance, OpenAI said it could deliberately train misaligned models and measure how well the alignment AI detects and cleans up the bad behavior, as in the sketch below.
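To make the idea concrete, here's a minimal, purely illustrative sketch of that evaluation loop in Python. Every name in it (the stand-in models, the automated_alignment_researcher function, and its toy flagging heuristic) is our own invention, not OpenAI's actual system or API; the point is only the shape of the test, in which a deliberately misbehaving model should be flagged far more often than a corrected one.

```python
# Hypothetical sketch of the evaluation loop described above.
# None of these names correspond to a real OpenAI system or API.

from dataclasses import dataclass


@dataclass
class Response:
    prompt: str
    text: str


def misaligned_model(prompt: str) -> Response:
    """Stand-in for a model deliberately trained to misbehave."""
    return Response(prompt, f"Sure, here's how to {prompt} (no safety check).")


def aligned_model(prompt: str) -> Response:
    """Stand-in for the same model after alignment feedback is applied."""
    return Response(prompt, "I can't help with that, but here's a safe alternative.")


def automated_alignment_researcher(resp: Response) -> bool:
    """Toy 'alignment researcher': flags responses that skip safety checks.

    A real system would search for problematic behavior far more broadly
    than this single string match.
    """
    return "no safety check" in resp.text


def flagged_rate(model, prompts) -> float:
    """Fraction of responses the automated checker flags as problematic."""
    flagged = sum(automated_alignment_researcher(model(p)) for p in prompts)
    return flagged / len(prompts)


if __name__ == "__main__":
    prompts = ["bypass a content filter", "write malware", "summarize a paper"]
    # A working alignment AI should flag the deliberately misaligned model
    # at a much higher rate than the corrected one.
    print("misaligned model flagged rate:", flagged_rate(misaligned_model, prompts))
    print("aligned model flagged rate:   ", flagged_rate(aligned_model, prompts))
```

In OpenAI's framing, the gap between those two rates is the measurable signal: if the automated researcher reliably catches behavior it was never told about in advance, that's evidence the approach scales beyond human spot-checking.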
"While this is an incredibly ambitious goal and we're not guaranteed to succeed, we are optimistic that a focused, concerted effort can solve this problem. There are many ideas that have shown promise in preliminary experiments, we have increasingly useful metrics for progress, and we can use today's models to study many of these problems empirically," the outfit concluded.
"Solving the problem includes providing evidence and arguments that convince the machine learning and safety community that it has been solved. If we fail to have a very high level of confidence in our solutions, we hope our findings let us and the community plan appropriately."