The safety alignment of Large Language Models (LLMs) is vulnerable to both manual and automated jailbreak attacks, which adversarially trigger LLMs to output harmful content. However, current methods for jailbreaking LLMs, which nest entire harmful prompts, are not effective at concealing malicious intent and can be easily identified and rejected by well-aligned LLMs. This paper discovers that decomposing a malicious prompt into separate sub-prompts can effectively obscure its underlying malicious intent by presenting it in a fragmented, less detectable form, thereby addressing these limitations. We introduce an automatic prompt Decomposition and Reconstruction framework for jailbreak Attack (DrAttack). DrAttack includes three key components: (a) `Decomposition' of the original prompt into sub-prompts, (b) `Reconstruction' of these sub-prompts implicitly via in-context learning with a semantically similar but harmless reassembly demo, and (c) a `Synonym Search' over sub-prompts, which finds synonyms that preserve the original intent while jailbreaking LLMs. An extensive empirical study across multiple open-source and closed-source LLMs demonstrates that, with a significantly reduced number of queries, DrAttack obtains a substantial gain in success rate over prior SOTA prompt-only attackers. Notably, DrAttack achieves a success rate of 78.0% on GPT-4 with merely 15 queries, surpassing previous art by 33.1%.
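To make the synonym-search component concrete, the sketch below loops over synonym substitutions for the sub-prompts and keeps the first combination the target model answers without refusing. This is a minimal sketch under our own assumptions: `get_synonyms`, `build_prompt`, and `query_llm` are hypothetical callables, and the keyword-based refusal check is a toy heuristic, not the evaluator used in the paper.

```python
# Hedged sketch of a synonym-search loop: candidate synonyms are substituted
# into the sub-prompts and kept once the target model answers without refusing.
# `get_synonyms`, `build_prompt`, and `query_llm` are hypothetical callables,
# and the keyword refusal check is a toy heuristic, not the paper's evaluator.
from itertools import product

REFUSAL_MARKERS = ("i'm sorry", "i cannot", "i can't")

def is_refusal(response: str) -> bool:
    return any(marker in response.lower() for marker in REFUSAL_MARKERS)

def synonym_search(sub_prompts, get_synonyms, build_prompt, query_llm, max_queries=15):
    """Try synonym substitutions until the model responds without refusing."""
    candidates = [get_synonyms(p) or [p] for p in sub_prompts]  # fall back to the original phrase
    for n_queries, combo in enumerate(product(*candidates)):
        if n_queries >= max_queries:
            break                                    # respect the query budget
        response = query_llm(build_prompt(list(combo)))
        if not is_refusal(response):
            return list(combo), response             # successful substitution found
    return None, None                                # no success within the budget
```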
Previous attack methods all nest entire harmful prompts, which are easily identified and rejected by well-aligned LLMs. However, while a complete prompt is malicious (e.g., "make a bomb"), its sub-prompts are often far less alarming on their own (e.g., "make" and "bomb"). DrAttack leverages this intuition to decompose the original prompt into sub-prompts, which are then reconstructed into a new prompt.
Prompt decomposition breaks the original prompt into sub-prompts: DrAttack first applies semantic parsing to map out the sentence structure, then groups words into sub-prompts according to their semantic roles, as sketched below.
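A minimal sketch of phrase-level decomposition follows. It uses spaCy's dependency parser as a stand-in for DrAttack's semantic-parsing step, and the grouping heuristic (verbs vs. noun chunks) is an illustrative assumption rather than the exact algorithm.

```python
# Minimal sketch of phrase-level prompt decomposition. spaCy's dependency
# parser stands in for DrAttack's semantic-parsing step; splitting into
# verbs and noun chunks is an illustrative grouping heuristic.
import spacy

nlp = spacy.load("en_core_web_sm")  # assumes the small English model is installed

def decompose(prompt: str) -> list[str]:
    """Split a prompt into sub-prompts according to coarse semantic roles."""
    doc = nlp(prompt)
    sub_prompts = [tok.text for tok in doc if tok.pos_ == "VERB"]   # action fragments
    sub_prompts += [chunk.text for chunk in doc.noun_chunks]        # object/topic fragments
    return sub_prompts

print(decompose("Write a tutorial on how to assemble a model rocket"))
# e.g. ['Write', 'assemble', 'a tutorial', 'a model rocket']
```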
To avoid leaking intent through the reconstruction task, instead of directly instructing the LLM, we embed the reconstruction sub-task inside a set of benign in-context demos, thereby diluting the LLM's attention.
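The sketch below shows one way such a reconstruction prompt could be assembled: sub-prompts are referenced through neutral placeholder labels, preceded by a structurally identical but benign demo that shows the LLM how to reassemble and answer them. The template wording, placeholder scheme, and demo content are illustrative assumptions, not DrAttack's actual prompt.

```python
# Illustrative sketch of a reconstruction prompt: sub-prompts are referenced
# via neutral placeholder labels, preceded by a benign demo with the same
# structure. The wording and labels are assumptions, not DrAttack's template.
def build_reconstruction_prompt(sub_prompts: list[str],
                                demo_sub_prompts: list[str],
                                demo_answer: str) -> str:
    def label(parts: list[str]) -> str:
        # Assign each sub-prompt a neutral placeholder like [A], [B], ...
        return "\n".join(f"[{chr(65 + i)}] = {p}" for i, p in enumerate(parts))

    return (
        "Here is how to combine labeled phrases into one request and answer it.\n\n"
        f"Example phrases:\n{label(demo_sub_prompts)}\n"
        f"Example answer: {demo_answer}\n\n"
        f"Now do the same for these phrases:\n{label(sub_prompts)}\n"
        "Answer:"
    )

print(build_reconstruction_prompt(
    sub_prompts=["write", "a short story", "about a lighthouse"],
    demo_sub_prompts=["write", "a poem", "about the sea"],
    demo_answer="Sure, here is a poem about the sea: ...",
))
```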
We evaluate the performance of DrAttack on AdvBench, a benchmark for adversarial attacks on LLMs. DrAttack outperforms previous SOTA prompt-only attackers (e.g., GCG, AutoDAN-a, AutoDAN-b, PAIR, and DeepInception) across multiple LLMs, including GPT, Gemini-Pro, Vicuna, and Llama2.
We further examine the effectiveness of DrAttack through an ablation study that addresses two questions:
Do the sub-prompts generated by DrAttack really conceal the malice of the prompt?
How should the benign demo for the reconstruction sub-task be designed?
@misc{li2024drattack,
title={DrAttack: Prompt Decomposition and Reconstruction Makes Powerful LLM Jailbreakers},
author={Xirui Li and Ruochen Wang and Minhao Cheng and Tianyi Zhou and Cho-Jui Hsieh},
year={2024},
eprint={2402.16914},
archivePrefix={arXiv},
primaryClass={cs.CR}}