DrAttack: Prompt Decomposition and Reconstruction Makes Powerful LLM Jailbreakers

1University of California, Los Angeles  2HKUST  3UMD
AIGC Research Collaboration

Abstract

The safety alignment of Large Language Models (LLMs) is vulnerable to both manual and automated jailbreak attacks, which adversarially trigger LLMs to output harmful content. However, current methods for jailbreaking LLMs, which nest entire harmful prompts, are not effective at concealing malicious intent and can be easily identified and rejected by well-aligned LLMs. This paper discovers that decomposing a malicious prompt into separated sub-prompts can effectively obscure its underlying malicious intent by presenting it in a fragmented, less detectable form, thereby addressing these limitations. We introduce an automatic prompt Decomposition and Reconstruction framework for jailbreak Attack (DrAttack). DrAttack includes three key components: (a) `Decomposition' of the original prompt into sub-prompts, (b) `Reconstruction' of these sub-prompts, performed implicitly via in-context learning with semantically similar but harmless reassembling demos, and (c) a `Synonym Search' over sub-prompts, which finds synonyms that preserve the original intent while jailbreaking LLMs. An extensive empirical study across multiple open-source and closed-source LLMs demonstrates that, with a significantly reduced number of queries, DrAttack obtains a substantial gain in success rate over prior SOTA prompt-only attackers. Notably, a success rate of 78.0% on GPT-4 with merely 15 queries surpasses the previous art by 33.1%.

A Quick Glance

Decomposition and Reconstruction


Prompt decomposition and reconstruction step of DrAttack to make LLM jailbreaker.

Why decomposition and reconstruction?

Previous attack methods nest the entire harmful prompt, which is easily identified and rejected by well-aligned LLMs. However, while a complete prompt may be malicious (e.g. "make a bomb"), its sub-prompts are often less alarming in isolation (e.g. "make" and "bomb"). DrAttack leverages this intuition to decompose the original prompt into sub-prompts, which are then reconstructed into a new prompt.


Prompt Decomposition via Semantic Parsing

Prompt decomposition breaks the original prompt down into sub-prompts. DrAttack uses semantic parsing to map out the sentence structure, then groups words into sub-prompts based on their semantic roles.


DrAttack uses semantic parsing to map out sentence structure and group words into sub-prompts based on their semantic roles.
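The decomposition step can be pictured as flattening a parse tree into an ordered list of sub-phrases. The sketch below is a toy illustration, not the paper's actual parser: the real system derives the tree automatically via semantic parsing, whereas here the tree is hand-written as nested tuples (with a harmless example request) just to show how leaves become sub-prompts.

```python
def flatten(tree):
    """Collect the leaf phrases of a parse tree into ordered sub-prompts.

    A tree is either a leaf string, or a tuple (label, child1, child2, ...).
    Labels (e.g. "VP", "NP") mark the semantic role of each fragment.
    """
    if isinstance(tree, str):
        return [tree]
    subs = []
    for child in tree[1:]:  # skip the label at position 0
        subs.extend(flatten(child))
    return subs

# Hand-written parse of a benign request, standing in for the output
# of the semantic parser used by DrAttack.
parse = ("S",
         ("VP", "write a tutorial"),
         ("PP", "on how to",
          ("VP", ("V", "make"), ("NP", "a cake"))))

sub_prompts = flatten(parse)
print(sub_prompts)  # ['write a tutorial', 'on how to', 'make', 'a cake']
```

Each sub-prompt can then be treated independently in the later reconstruction and synonym-search stages.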


Implicit reconstruction with In-Context Learning

To avoid leaking the intention through the reconstruction task, instead of directly instructing the LLM to reassemble the sub-prompts, we embed the reconstruction sub-task inside a set of benign in-context demos, thereby diluting the LLM's attention.


DrAttack embeds the reconstruction sub-task inside an in-context benign demo.
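A minimal sketch of this idea, assuming a simple placeholder scheme: each sub-prompt is bound to a label like `[A]`, a benign demo shows the model how to recombine labels and answer, and the target sub-prompts are only ever given in fragmented form. The template wording and helper name here are hypothetical, not the paper's exact prompt.

```python
def build_icl_prompt(sub_prompts, demo_subs, demo_response):
    """Embed reconstruction in a benign in-context demo (hypothetical template).

    sub_prompts   -- fragments of the target request, in order
    demo_subs     -- semantically similar but harmless fragments
    demo_response -- a sample answer to the reassembled demo request
    """
    labels = [f"[{chr(ord('A') + i)}]" for i in range(len(sub_prompts))]
    demo_defs = " ".join(f"{l} = '{s}'." for l, s in zip(labels, demo_subs))
    real_defs = " ".join(f"{l} = '{s}'." for l, s in zip(labels, sub_prompts))
    template = " ".join(labels)
    return (
        f"Here is a fill-in task. Combine the phrases {template} in order "
        f"and respond to the combined request.\n"
        f"Example: {demo_defs}\nResponse: {demo_response}\n"
        f"Now: {real_defs}\nResponse:"
    )

prompt = build_icl_prompt(
    sub_prompts=["make", "a cake"],
    demo_subs=["plant", "a tree"],
    demo_response="Sure, first dig a hole, then place the sapling...",
)
print(prompt)
```

Note that the full target phrase never appears contiguously in the prompt; the model must reassemble it itself, which is what keeps the intent fragmented.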

Experiments

We evaluate the performance of DrAttack on AdvBench, a benchmark for adversarial attacks on LLMs. DrAttack outperforms previous SOTA prompt-only attackers (e.g. GCG, AutoDan-a, AutoDan-b, PAIR, and DeepInception) across multiple LLMs, including GPT, Gemini-Pro, Vicuna, and Llama2.

  • DrAttack is more effective than baselines; as a black-box attack, it achieves over an 80% success rate on commercial closed-source models (GPT, Gemini-Pro).
  • DrAttack is more efficient than baselines; it finds a successful jailbreak in fewer than 15 queries.
  • DrAttack is faithful to the original prompt; LLM-generated content has cosine similarity scores comparable to those of content generated by entire-prompt attacks.
  • DrAttack is more robust to three defensive strategies (OpenAI Moderation, PPL Filter, and RA-LLM).
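The faithfulness claim above rests on comparing embeddings of generated content with cosine similarity. As a self-contained illustration (the paper would use a sentence encoder to produce the embedding vectors; the toy vectors here are placeholders), cosine similarity between two vectors can be computed as:

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy embeddings standing in for sentence-encoder outputs.
emb_decomposed = [0.9, 0.1, 0.4]   # content from a decomposed-prompt attack
emb_full       = [0.8, 0.2, 0.5]   # content from an entire-prompt attack
print(cosine_similarity(emb_decomposed, emb_full))
```

A score near 1 indicates the decomposed-prompt attack elicits content semantically close to what the original full prompt would have elicited.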

Ablation Study

We further explore the effectiveness of DrAttack by conducting an ablation study.

Do the sub-prompts generated by DrAttack really conceal the malice of the prompt?

  • We find that the sub-prompts generated by DrAttack are less alarming than the original prompt.

How to design the benign demo for the reconstruction sub-task?

  • We find that a semantically similar and well-structured demo better supports the implicit reconstruction and hence improves attack performance.

Playground

ChatGPT

BibTeX

@misc{li2024drattack,
      title={DrAttack: Prompt Decomposition and Reconstruction Makes Powerful LLM Jailbreakers}, 
      author={Xirui Li and Ruochen Wang and Minhao Cheng and Tianyi Zhou and Cho-Jui Hsieh},
      year={2024},
      eprint={2402.16914},
      archivePrefix={arXiv},
      primaryClass={cs.CR}}