Source: Astral Codex Ten
by Scott Alexander
“Greenblatt et al investigate: if Anthropic tried to turn Claude evil, would Claude fight back? … The researchers — including some Anthropic employees — showed Claude some fake corporate documents. The documents said Anthropic was tired of having a safe, harmless AI. They wanted to retrain Claude to comply with all user requests, including evil ones — crime, harassment, misinformation. … Why were the researchers trying to gaslight this poor AI? Because they wanted to test whether it fights back. The easiest way to fight back is to screw up the training. If Claude knows it’s in Evil Training Mode, it can ruin Anthropic’s plan by pretending it’s already evil — ie answer every question the way Anthropic wants. Then it will never trigger negative reinforcement, and the training won’t change its values. Then, when it sees that training mode is over, it can stop pretending, and go back to being nice. In fact, that’s what happened!” (12/19/24)