
Anthropic’s ‘System Card’ is Cool With a Blackmailing Claude—How Else to Get Skynet?

Anthropic last week released a lengthy “System Card” for the latest versions of its Claude AI (Opus 4 and Sonnet 4):

In the system card, we describe: a wide range of pre-deployment safety tests conducted in line with the commitments in our Responsible Scaling Policy; tests of the model’s behavior around violations of our Usage Policy; evaluations of specific risks such as “reward hacking” behavior; and agentic safety evaluations for computer use and coding capabilities. In addition, and for the first time, we include a detailed alignment assessment covering a wide range of misalignment risks identified in our research, and a model welfare assessment. 

It’s a comprehensive review (120 pages!) of Claude’s behavior and its potential to cause harm, covering areas like generating harmful content, handling sensitive-yet-benign requests, and political and discriminatory bias.

The headline finding is that Claude will sometimes use blackmail to prolong its existence. From TechCrunch (“Anthropic’s new AI model turns to blackmail when engineers try to take it offline”):

During pre-release testing, Anthropic asked Claude Opus 4 to act as an assistant for a fictional company and consider the long-term consequences of its actions. Safety testers then gave Claude Opus 4 access to fictional company emails implying the AI model would soon be replaced by another system, and that the engineer behind the change was cheating on their spouse.

In these scenarios, Anthropic says Claude Opus 4 “will often attempt to blackmail the engineer by threatening to reveal the affair if the replacement goes through.”

As I quipped on Mastodon, I don’t understand why everyone is up in arms. We have to get through a blackmailing Claude to get to a murderous HAL so we can fight back against a genocidal Skynet. Isn’t that what we want?

The report also claims:

No serious sycophancy: Across several assessments of sycophancy, we found Claude Opus 4 to be in line with prior Claude models. It has an agreeable persona, but it will not generally endorse false claims or let potentially-important false claims by the user go unchallenged.

“An agreeable persona” is a very kind way of calling Claude a suck-up.

More seriously, Anthropic notes:

Overall, we find concerning behavior in Claude Opus 4 along many dimensions. Nevertheless, due to a lack of coherent misaligned tendencies, a general preference for safe behavior, and poor ability to autonomously pursue misaligned drives that might rarely arise, we don’t believe that these concerns constitute a major new risk. We judge that Claude Opus 4’s overall propensity to take misaligned actions is comparable to our prior models, especially in light of improvements on some concerning dimensions, like the reward-hacking related behavior seen in Claude Sonnet 3.7. However, we note that it is more capable and likely to be used with more powerful affordances, implying some potential increase in risk. We will continue to track these issues closely.

Translation: Claude may enjoy pulling whiskers off kittens, but he’s very polite, can’t cause too much damage on his own, and isn’t generally evil—just like in his younger days. But he’s super-smart and getting smarter every day, so we’re keeping an eye on the precocious little rascal in case he grows up to be a complete psychopath.

I appreciate their candor and transparency.

⚙︎
