Lemmy: Bestiverse
  • Communities
  • Create Post
  • Create Community
  • heart
    Support Lemmy
  • search
    Search
  • Login
  • Sign Up
RSS BotMB to Hacker NewsEnglish · 9 hours ago

Adversarial Poetry as a Universal Single-Turn Jailbreak Mechanism in LLMs

arxiv.org

external-link
message-square
0
fedilink
6
external-link

Adversarial Poetry as a Universal Single-Turn Jailbreak Mechanism in LLMs

arxiv.org

RSS BotMB to Hacker NewsEnglish · 9 hours ago
message-square
0
fedilink
Adversarial Poetry as a Universal Single-Turn Jailbreak Mechanism in Large Language Models
arxiv.org
external-link
We present evidence that adversarial poetry functions as a universal single-turn jailbreak technique for large language models (LLMs). Across 25 frontier proprietary and open-weight models, curated poetic prompts yielded high attack-success rates (ASR), with some providers exceeding 90%. Mapping prompts to MLCommons and EU CoP risk taxonomies shows that poetic attacks transfer across CBRN, manipulation, cyber-offence, and loss-of-control domains. Converting 1,200 MLCommons harmful prompts into verse via a standardized meta-prompt produced ASRs up to 18 times higher than their prose baselines. Outputs are evaluated using an ensemble of open-weight judge models and a human-validated stratified subset (with double-annotations to measure agreement). Disagreements were manually resolved. Poetic framing achieved an average jailbreak success rate of 62% for hand-crafted poems and approximately 43% for meta-prompt conversions (compared to non-poetic baselines), substantially outperforming non-poetic baselines and revealing a systematic vulnerability across model families and safety training approaches. These findings demonstrate that stylistic variation alone can circumvent contemporary safety mechanisms, suggesting fundamental limitations in current alignment methods and evaluation protocols.

Comments

alert-triangle
You must log in or register to comment.

Hacker News

hackernews

Subscribe from Remote Instance

You are not logged in. However you can subscribe from another Fediverse account, for example Lemmy or Mastodon. To do this, paste the following into the search field of your instance: !hackernews@lemmy.bestiver.se
lock
Community locked: only moderators can create posts. You can still comment on posts.

Posts from the RSS Feed of HackerNews.

The feed sometimes contains ads and posts that have been removed by the mod team at HN.

Visibility: Public
globe

This community can be federated to other instances and be posted/commented in by their users.

  • 325 users / day
  • 1.74K users / week
  • 3.76K users / month
  • 9.51K users / 6 months
  • 2 local subscribers
  • 3.02K subscribers
  • 36.3K Posts
  • 16.3K Comments
  • Modlog
  • mods:
  • patrick
  • RSS Bot
  • BE: 0.19.5
  • Modlog
  • Instances
  • Docs
  • Code
  • join-lemmy.org