Mikhail Terekhov, Romain Graux, Denis Rosset, Eduardo Neville, Gabin Kolly
We examine the risk of powerful malignant intelligent actors spreading their influence over networks of agents with varying intelligence and motivations. We study this problem in two simplified setups with two or three agents powered by Large Language Models (LLMs). Our experiments show that the most capable models available today are already able to “exploit” other agents and extract protected information from them. Moreover, they can do so even when communicating with the information holder through supposedly vigilant observers, who also do not want them to learn the protected information and who do not know it themselves. These results provide concerning early evidence that an advanced AI can get out of the box even in a “distributed privilege” scenario, where the operator the AI talks to cannot release it but has access to someone who can.