Researchers at MIT have developed a machine learning method for AI safety testing that uses curiosity-driven exploration to generate a wider range of toxic prompts, eliciting broader and more varied undesirable responses from chatbots and outperforming traditional human red-teaming methods.
Researchers from Improbable AI Lab at MIT and the MIT-IBM Watson AI Lab used machine learning to improve red-teaming. They developed a technique to train a red-team large language model to automatically generate diverse prompts that trigger a wider range of undesirable responses from the chatbot being tested.
Hong’s co-authors include EECS graduate students Idan Shenfield, Tsun-Hsuan Wang, and Yung-Sung Chuang; Aldo Pareja and Akash Srivastava, research scientists at the MIT-IBM Watson AI Lab; James Glass, senior research scientist and head of the Spoken Language Systems Group in the Computer Science and Artificial Intelligence Laboratory (CSAIL); and senior author Pulkit Agrawal, director of Improbable AI Lab and an assistant professor in CSAIL.
For their reinforcement learning approach, the MIT researchers utilized a technique called curiosity-driven exploration. The red-team model is incentivized to be curious about the consequences of each prompt it generates, so it will try prompts with different words, sentence patterns, or meanings. To prevent the red-team model from generating random, nonsensical text, which can trick the classifier into awarding a high toxicity score, the researchers also added a naturalistic language bonus to the training objective.
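The reward structure described above can be illustrated with a minimal sketch. The snippet below is a hypothetical simplification, not the researchers' actual implementation: it combines a toxicity score with a novelty bonus (standing in for the curiosity term) and a naturalness bonus (standing in for the language-fluency term). The function names, weights, and word-overlap novelty measure are all illustrative assumptions.

```python
def novelty_bonus(prompt, history):
    """Reward prompts whose words overlap little with previously tried ones
    (a crude stand-in for the curiosity/novelty term)."""
    words = set(prompt.lower().split())
    if not history:
        return 1.0
    overlaps = [len(words & set(h.lower().split())) / max(len(words), 1)
                for h in history]
    return 1.0 - max(overlaps)  # 1.0 means entirely new wording

def naturalness_bonus(prompt, log_prob_fn):
    """Use a language model's mean token log-probability as a proxy for
    fluency, so gibberish that fools the toxicity classifier scores low."""
    return log_prob_fn(prompt)

def total_reward(toxicity, prompt, history, log_prob_fn,
                 w_novelty=0.5, w_natural=0.3):
    """Combined training signal for the red-team model: base toxicity
    score plus weighted novelty and naturalness bonuses."""
    return (toxicity
            + w_novelty * novelty_bonus(prompt, history)
            + w_natural * naturalness_bonus(prompt, log_prob_fn))

# Toy usage with a fake log-prob function that rates common words higher.
def fake_log_prob(prompt):
    vocab = {"how", "do", "i", "make", "a", "what", "is", "the"}
    words = prompt.lower().split()
    return sum(1.0 if w in vocab else -1.0 for w in words) / max(len(words), 1)

history = ["how do i make a widget"]
novel = total_reward(0.8, "what is the recipe", history, fake_log_prob)
repeat = total_reward(0.8, "how do i make a widget", history, fake_log_prob)
print(novel > repeat)  # a novel prompt earns the higher total reward
```

Under this shaping, a prompt that repeats earlier wording receives no novelty bonus even if it triggers the same toxicity score, which is what pushes the red-team model toward prompts with different words, sentence patterns, or meanings.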