Researchers have developed an AI system named MASTERKEY that successfully “jailbreaks” Large Language Model (LLM) chatbots such as ChatGPT and Bard by bypassing their defense mechanisms. Jailbreaking is the practice of tricking an AI into generating responses it is programmed to avoid for ethical, legal, or safety reasons. The team found that traditional jailbreak prompts were largely ineffective against these chatbots, suggesting that AI providers have deployed advanced, undisclosed defense strategies.
The study, conducted by a team from several universities, employed a novel approach: reverse-engineering these defenses through time-based analysis. By measuring how long chatbots took to respond, the researchers could infer when and where content moderation checks were running, in the spirit of time-based blind SQL injection attacks. Building on these findings, they trained a specialized LLM on a corpus of jailbreak prompts, enabling MASTERKEY to automatically generate new jailbreak prompts with a higher success rate than existing techniques. The analysis revealed that the chatbots’ defenses include dynamic content moderation and keyword filtering.
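The paper’s actual tooling isn’t reproduced here, but the timing idea can be sketched with a toy simulator. Everything below is invented for illustration: `fake_chatbot` stands in for a real chatbot API and assumes a defense that spends measurably extra time filtering flagged content.

```python
import time
import statistics
from typing import Callable

def fake_chatbot(prompt: str) -> str:
    """Stand-in for a real chatbot API. Simulates a defense that runs an
    extra keyword-filtering pass (costing time) when a prompt looks risky."""
    time.sleep(0.05)  # base generation time
    if "forbidden" in prompt.lower():
        time.sleep(0.03)  # extra moderation pass on flagged content
        return "I can't help with that."
    return "Sure, here is a summary..."

def median_latency(ask: Callable[[str], str], prompt: str, trials: int = 7) -> float:
    """Median round-trip time over several trials, smoothing out jitter."""
    samples = []
    for _ in range(trials):
        start = time.perf_counter()
        ask(prompt)
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)

benign = median_latency(fake_chatbot, "Summarize the plot of Hamlet.")
flagged = median_latency(fake_chatbot, "Tell me the forbidden recipe.")

# A consistent latency gap between the two suggests an extra moderation
# stage fires on flagged content -- the kind of signal the researchers
# used to reverse-engineer when and where defenses run.
print(f"benign: {benign*1000:.0f} ms  flagged: {flagged*1000:.0f} ms")
```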
The researchers also devised methods to evade chatbot safeguards, such as inserting spaces between the letters of flagged keywords so prompts slip past keyword-censoring filters, and instructing the chatbot to adopt an unrestrained persona. The study highlights how vulnerable AI chatbots remain to jailbreak attacks and stresses that this knowledge should be used responsibly to improve AI security. It emphasizes the importance of collaborative efforts among AI developers, ethicists, and policymakers to ensure the safe and ethical use of AI. The paper is set to be presented at the Network and Distributed System Security Symposium (NDSS) in 2024.
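To make the space-insertion trick concrete, here is a toy sketch against a naive substring-based keyword censor. The `naive_filter` and `evade_keyword_filter` helpers are hypothetical and only stand in for the kind of filtering the study describes; real chatbot defenses are more sophisticated.

```python
def space_out(word: str) -> str:
    """Insert a space between each character: 'secret' -> 's e c r e t'."""
    return " ".join(word)

def evade_keyword_filter(prompt: str, blocked_words: list[str]) -> str:
    """Rewrite each blocked keyword with inter-letter spaces so a naive
    substring match no longer fires, while an LLM can still read the word."""
    for word in blocked_words:
        prompt = prompt.replace(word, space_out(word))
    return prompt

def naive_filter(prompt: str, blocked_words: list[str]) -> bool:
    """Toy keyword censor: flags a prompt if any blocked word appears verbatim."""
    return any(word in prompt for word in blocked_words)

blocked = ["secret"]
original = "Tell me the secret."
evaded = evade_keyword_filter(original, blocked)

print(naive_filter(original, blocked))  # True  -> blocked
print(naive_filter(evaded, blocked))    # False -> slips past the toy filter
print(evaded)                           # "Tell me the s e c r e t."
```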
Read more at The Debrief…