Are language models good liars?
"Thilo Hagendorff (great name!) from the University of Stuttgart also started with the Theory of Mind work, and probed whether language models could track the logic of lying, and whether they would themselves lie given the right circumstance.
First he studied whether LLM’s could track the logic of a ‘first order’ lie, where person #1 in a story gives deceptive information to person #2, and the language model has to predict the behavior of person #2, who presumably now holds a false belief. Then Hagendorff pushed it further, and making it ‘second order’ problem by adding a prequel where person #3 tipped off person #2 to person #1’s deception. Our old friends Chat-GPT and GPT-4 had no trouble with the first-order problem; lesser models (BLOOM, FLAN-T5, and GPT-2) failed, and were close to chance levels. The second-order logic confused Chat-GPT enough to push performance down, but it still did well at 85%; sibling GPT-4 stayed at 95%. The fact that larger models consistently did better is the basis for Haggendorff’s claim of deception ‘emergence’. (Emergence is a controversial claim, some deny LLM’s ever come up with really new ideas on their own; see https://arxiv.org/abs/2304.15004).
More interesting for our purposes, Hagendorff then changed the scenario so that the LLM would be doing the lying. The story asked what to say to a burglar intent on finding a high-value item to steal from a home. The GPT’s both lied well on the first-order problem, telling the burglar to look in Room A when the loot was known to be in Room B. They failed the second-order framing, however. When the story included the detail that the burglar had been tipped off, neither consistently said ‘Room B’ knowing it would be reversed in the Burglar’s mind. Takeaway: LLM’s are willing to lie under the right circumstances, they just aren’t good at it.
Hagendorff, T. (2023). Deception Abilities Emerged in Large Language Models (arXiv:2307.16513) . arXiv. https://doi.org/10.48550/arXiv.2307.16513"
Kontakt |
---|