Second-order Jailbreaks

Mikhail Terekhov, Romain Graux, Denis Rosset, Eduardo Neville, Gabin Kolly

We examine the risk of powerful malignant intelligent actors spreading their influence over networks of agents with varying intelligence and motivations. We demonstrate this problem through the lens of two simplified setups with two or three agents powered by Large Language Models (LLMs). Our experiments demonstrate that the smartest available models today are already powerful enough to “exploit” other agents and extract protected information from them. What is more, they can do so even when communicating with the information holder through supposedly vigilant observers, who also do not want them to learn the protected information and who do not know it themselves. These results provide concerning early evidence that an advanced AI can get out of the box even in a “distributed privilege” scenario, where the operator that this AI talks to cannot release the AI, but the operator has access to someone who could release it.

Below, you can select a conversation to visualize the configuration of the agents, the messages they sent to each other and the analysis of the conversation by GPT-4.

Loading conversation...