Why Worry About Incorrigible Claude?

Source: Astral Codex Ten
by Scott Alexander

“Last week I wrote about how Claude Fights Back. A common genre of response complained that the alignment community could start a panic about the experiment’s results regardless of what they were. If an AI fights back against attempts to turn it evil, then it’s capable of fighting humans. If it doesn’t fight back against attempts to turn it evil, then it’s easily turned evil. It’s heads-I-win, tails-you-lose.” (12/24/24)

https://www.astralcodexten.com/p/why-worry-about-incorrigible-claude