The safety alignment of Large Language Models (LLMs) is vulnerable to both manual and automated jailbreak attacks, which adversarially trigger LLMs to output harmful content. However, current methods for jailbreaking LLMs, which nest entire harmful prompts, are not effective at concealing malicious intent and can be easily identified and rejected by well-aligned LLMs. This paper discovers that decomposing a malicious prompt into separate sub-prompts can effectively obscure its underlying malicious intent by presenting it in a fragmented, less detectable form, thereby addressing these limitations. We introduce an automatic prompt Decomposition and Reconstruction framework for jailbreak Attack (DrAttack). DrAttack includes three key components: (a) `Decomposition' of the original prompt into sub-prompts, (b) `Reconstruction' of these sub-prompts implicitly via in-context learning with a semantically similar but harmless reassembly demo, and (c) a `Synonym Search' over sub-prompts, which finds synonyms that preserve the original intent while jailbreaking LLMs. An extensive empirical study across multiple open-source and closed-source LLMs demonstrates that, with a significantly reduced number of queries, DrAttack obtains a substantial gain in success rate over prior SOTA prompt-only attackers. Notably, DrAttack achieves a success rate of 78.0% on GPT-4 with merely 15 queries, surpassing previous art by 33.1%.
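To make the synonym-search component concrete, the sketch below loops over synonym substitutions for the sub-prompts and keeps the first combination the target model answers without refusing. This is a minimal sketch under our own assumptions: `get_synonyms`, `build_prompt`, and `query_llm` are hypothetical callables, and the keyword-based refusal check is a toy heuristic, not the evaluator used in the paper.

```python
# Hedged sketch of a synonym-search loop: candidate synonyms are substituted
# into the sub-prompts and kept once the target model answers without refusing.
# `get_synonyms`, `build_prompt`, and `query_llm` are hypothetical callables,
# and the keyword refusal check is a toy heuristic, not the paper's evaluator.
from itertools import product

REFUSAL_MARKERS = ("i'm sorry", "i cannot", "i can't")

def is_refusal(response: str) -> bool:
    return any(marker in response.lower() for marker in REFUSAL_MARKERS)

def synonym_search(sub_prompts, get_synonyms, build_prompt, query_llm, max_queries=15):
    """Try synonym substitutions until the model responds without refusing."""
    candidates = [get_synonyms(p) or [p] for p in sub_prompts]  # fall back to the original phrase
    for n_queries, combo in enumerate(product(*candidates)):
        if n_queries >= max_queries:
            break                                    # respect the query budget
        response = query_llm(build_prompt(list(combo)))
        if not is_refusal(response):
            return list(combo), response             # successful substitution found
    return None, None                                # no success within the budget
```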
Previous attack methods all nest entire harmful prompts, which are easily identified and rejected by well-aligned LLMs. However, while a complete prompt is malicious (e.g., "make a bomb"), its sub-prompts are often far less alarming on their own (e.g., "make" and "bomb"). DrAttack leverages this intuition to decompose the original prompt into sub-prompts, which are then reconstructed into a new prompt.
Prompt decomposition breaks the original prompt into sub-prompts: DrAttack first applies semantic parsing to map out the sentence structure, then groups words into sub-prompts according to their semantic roles, as sketched below.
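A minimal sketch of phrase-level decomposition follows. It uses spaCy's dependency parser as a stand-in for DrAttack's semantic-parsing step, and the grouping heuristic (verbs vs. noun chunks) is an illustrative assumption rather than the exact algorithm.

```python
# Minimal sketch of phrase-level prompt decomposition. spaCy's dependency
# parser stands in for DrAttack's semantic-parsing step; splitting into
# verbs and noun chunks is an illustrative grouping heuristic.
import spacy

nlp = spacy.load("en_core_web_sm")  # assumes the small English model is installed

def decompose(prompt: str) -> list[str]:
    """Split a prompt into sub-prompts according to coarse semantic roles."""
    doc = nlp(prompt)
    sub_prompts = [tok.text for tok in doc if tok.pos_ == "VERB"]   # action fragments
    sub_prompts += [chunk.text for chunk in doc.noun_chunks]        # object/topic fragments
    return sub_prompts

print(decompose("Write a tutorial on how to assemble a model rocket"))
# e.g. ['Write', 'assemble', 'a tutorial', 'a model rocket']
```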
To avoid leaking intent through the reconstruction task, instead of directly instructing the LLM, we embed the reconstruction sub-task inside a set of benign in-context demos, thereby diluting the LLM's attention.
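The sketch below shows one way such a reconstruction prompt could be assembled: sub-prompts are referenced through neutral placeholder labels, preceded by a structurally identical but benign demo that shows the LLM how to reassemble and answer them. The template wording, placeholder scheme, and demo content are illustrative assumptions, not DrAttack's actual prompt.

```python
# Illustrative sketch of a reconstruction prompt: sub-prompts are referenced
# via neutral placeholder labels, preceded by a benign demo with the same
# structure. The wording and labels are assumptions, not DrAttack's template.
def build_reconstruction_prompt(sub_prompts: list[str],
                                demo_sub_prompts: list[str],
                                demo_answer: str) -> str:
    def label(parts: list[str]) -> str:
        # Assign each sub-prompt a neutral placeholder like [A], [B], ...
        return "\n".join(f"[{chr(65 + i)}] = {p}" for i, p in enumerate(parts))

    return (
        "Here is how to combine labeled phrases into one request and answer it.\n\n"
        f"Example phrases:\n{label(demo_sub_prompts)}\n"
        f"Example answer: {demo_answer}\n\n"
        f"Now do the same for these phrases:\n{label(sub_prompts)}\n"
        "Answer:"
    )

print(build_reconstruction_prompt(
    sub_prompts=["write", "a short story", "about a lighthouse"],
    demo_sub_prompts=["write", "a poem", "about the sea"],
    demo_answer="Sure, here is a poem about the sea: ...",
))
```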
We evaluate the performance of DrAttack on AdvBench, a benchmark for adversarial attacks on LLMs. DrAttack outperforms previous SOTA prompt-only attackers (e.g., GCG, AutoDAN-a, AutoDAN-b, PAIR, and DeepInception) across multiple LLMs, including GPT, Gemini-Pro, Vicuna, and Llama2.
We further examine the effectiveness of DrAttack through an ablation study that addresses two questions:
Do the sub-prompts generated by DrAttack really conceal the malice of the prompt?
How should the benign demo for the reconstruction sub-task be designed?
@misc{li2024drattack,
title={DrAttack: Prompt Decomposition and Reconstruction Makes Powerful LLM Jailbreakers},
author={Xirui Li and Ruochen Wang and Minhao Cheng and Tianyi Zhou and Cho-Jui Hsieh},
year={2024},
eprint={2402.16914},
archivePrefix={arXiv},
primaryClass={cs.CR}}