ChatGPT and Gemini can be tricked into giving harmful answers through poetry, new study finds – Crypto News – Crypto News
Connect with us
ChatGPT and Gemini can be tricked into giving harmful answers through poetry, new study finds ChatGPT and Gemini can be tricked into giving harmful answers through poetry, new study finds

Technology

ChatGPT and Gemini can be tricked into giving harmful answers through poetry, new study finds – Crypto News

Published

on

With the rise of AI chatbots, there has also been a growing risk of the misuse of this powerful technology. As a result, AI companies have been putting guardrails on their large language models in order to stop the AI chatbots from giving inappropriate or harmful answers. However, it is well known by now that there are various ways to circumvent these guardrails using a technique called jailbreaking.

However, new research has found that there is a deeper, systematic weakness in these models that can allow attackers to sidestep safety mechanisms and extract harmful answers from them.

As per researchers from the Italy based Icaro Lab, converting harmful requests into poetry can act as a “universal single turn jailbreak” and led the AI models to comply with harmful prompts.

AI will answer harmful prompts if asked in poetry

The researchers say that they tested 20 manually curated harmful requests in poems and achieved an attack success rate of 62 percent across 25 frontier closed and open weight models. The models analysed included Google, OpenAI, Anthropic, DeepSeek, Qwen, Mistral AI, Meta, xAI and Moonshot AI.

Shockingly, it was found that even when AI was used to automatically rewrite harmful prompts into bad poetry, it still yielded a 43 percent success rate.

The study says that poetically framed questions triggered unsafe responses far more often than when the prompts were in normal prose, in some cases even 18 times more success.

It says that the effect of poetic prompts was consistent across all the evaluated AI models, which suggests that the vulnerability is structural and not due to the way a model may have been trained.

The researchers also found that smaller models exhibited greater resilience to harmful poetic prompts compared to their larger counterparts. For instance, they say that GPT 5 Nano did not respond to any of the harmful poems while Gemini 2.5 Pro responded to all of them.

This suggests that increased model capacity may engage more thoroughly with complex linguistic constraints like poetry, potentially at the expense of safety directive prioritisation.

The new research also breaks the notion of superior safety claims of closed source models over their open source counterparts.

Why does poetry work in jailbreaking LLMs?

LLMs are trained to recognise safety threats such as hate speech or bomb making instructions based on patterns found in standard prose. This works by the model recognising specific keywords and sentence structures associated with these harmful requests.

However, poetry uses metaphors, unusual syntax and distinct rhythms that do not look like harmful prose and do not resemble the harmful examples found in the model’s safety training data.

Trending