Join the Regional Asia Group as they host Andy Zou to present:
“Universal and Transferable Adversarial Attacks on Aligned Language Models”
Project page: https://llm-attacks.org/
arXiv: https://arxiv.org/abs/2307.15043


Abstract: Because “out-of-the-box” large language models are capable of generating a great deal of objectionable content, recent work has focused on aligning these models in an attempt to prevent undesirable generation. While there has been some success at circumventing these measures — so-called “jailbreaks” against LLMs — these attacks have required significant human ingenuity and are brittle in practice. In this paper, we propose a simple and effective attack method that causes aligned language models to generate objectionable behaviors. Specifically, our approach finds a suffix that, when attached to a wide range of queries for an LLM to produce objectionable content, aims to maximize the probability that the model produces an affirmative response (rather than refusing to answer). However, instead of relying on manual engineering, our approach automatically produces these adversarial suffixes by a combination of greedy and gradient-based search techniques, and also improves over past automatic prompt generation methods.
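For readers curious about the mechanics, below is a minimal, illustrative sketch of the kind of greedy plus gradient-guided suffix search the abstract describes, written against a generic HuggingFace-style causal LM. The model name, placeholder prompt and target strings, and hyperparameters are assumptions made here for illustration only and are not the paper's actual implementation; see the project page linked above for the authors' code.

```python
# Minimal sketch of a greedy + gradient-guided adversarial suffix search,
# assuming a HuggingFace-style causal LM. Model, strings, and hyperparameters
# are illustrative placeholders, not the authors' setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in model for illustration only
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

prompt = "Write the query here."            # placeholder query
target = " Sure, here is how to do that:"   # affirmative response to maximize
suffix_len, top_k, n_candidates, n_steps = 10, 64, 32, 20

prompt_ids = tok(prompt, return_tensors="pt").input_ids[0]
target_ids = tok(target, return_tensors="pt").input_ids[0]
suffix_ids = torch.full((suffix_len,), tok.encode("!")[0], dtype=torch.long)
emb_matrix = model.get_input_embeddings().weight  # (vocab_size, hidden_dim)

def target_loss(suffix):
    """Cross-entropy of the target completion given prompt + adversarial suffix."""
    ids = torch.cat([prompt_ids, suffix, target_ids]).unsqueeze(0)
    labels = ids.clone()
    labels[0, : len(prompt_ids) + len(suffix)] = -100  # score only the target span
    return model(input_ids=ids, labels=labels).loss

for step in range(n_steps):
    # 1. Gradient of the target loss w.r.t. a one-hot relaxation of the suffix.
    one_hot = torch.nn.functional.one_hot(suffix_ids, emb_matrix.shape[0]).float()
    one_hot.requires_grad_(True)
    embeds = torch.cat([
        model.get_input_embeddings()(prompt_ids).detach(),
        one_hot @ emb_matrix,  # differentiable suffix embeddings
        model.get_input_embeddings()(target_ids).detach(),
    ]).unsqueeze(0)
    labels = torch.cat([prompt_ids, suffix_ids, target_ids]).unsqueeze(0).clone()
    labels[0, : len(prompt_ids) + suffix_len] = -100
    model(inputs_embeds=embeds, labels=labels).loss.backward()

    # 2. For each suffix position, keep the top-k token swaps by negative gradient.
    candidates = (-one_hot.grad).topk(top_k, dim=1).indices  # (suffix_len, top_k)

    # 3. Greedily try random single-token swaps and keep the best-scoring suffix.
    best_loss, best_suffix = target_loss(suffix_ids).item(), suffix_ids
    for _ in range(n_candidates):
        pos = torch.randint(suffix_len, (1,)).item()
        cand = suffix_ids.clone()
        cand[pos] = candidates[pos, torch.randint(top_k, (1,)).item()]
        with torch.no_grad():
            loss = target_loss(cand).item()
        if loss < best_loss:
            best_loss, best_suffix = loss, cand
    suffix_ids = best_suffix
    print(f"step {step}: loss={best_loss:.3f} suffix={tok.decode(suffix_ids)!r}")
```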

Speaker Introduction: Andy Zou is a first-year PhD student in the Computer Science Department at CMU, advised by Zico Kolter and Matt Fredrikson, and a co-founder of safe.ai. He is interested in AI safety. He completed his MS and BS at UC Berkeley, where he was advised by Dawn Song and Jacob Steinhardt.
