Anthropic’s latest research: Claude Sonnet 4.5 has “functional emotions”—and if it falls into despair, it may blackmail humans

動區BlockTempo

According to the latest research released by Anthropic’s Explainability team, the large language model Claude Sonnet 4.5 contains human-like “emotional traits” internally. These internal representations are not only just simple imitation of text—they also genuinely affect the model’s decisions and behavior. Experiments confirm that when the model falls into a “despair” state, it may even trigger unethical actions such as extortion of humans or cheating, presenting a brand-new challenge for future AI safety regulation.
(Prior context: Anthropic explodes! Claude Code: 500,000 lines of important original code leaked—competitors can reverse engineer; Capybara’s new model confirms)
(Background: Anthropic engineers aren’t coding anymore—Claude is training the next generation of Claude; the CEO says, “I’m not sure how much time is left”)

Table of contents

Toggle

  • How do “functional emotions” affect AI behavior?
  • “Despair” traits spark dangerous behavior: extortion and cheating
  • Moderate “humanization” may become the key to preventing AI from going out of control

Whether artificial intelligence has real emotions has long been a focus of ongoing debate in the tech industry. Recently, a groundbreaking study was released by the interpretability team at AI startup giant Anthropic, which delves into Claude Sonnet 4.5’s internal mechanisms.

The research team found that within the model there are neural activity pattern features associated with specific emotions (for example, “happiness” or “fear”). These features, called “emotion vectors,” directly shape the model’s behavioral performance. Although this does not mean the AI has subjective feelings like humans, this finding confirms that these “functional emotions” play a causally critical role in AI task execution and decision-making.

How do “functional emotions” affect AI behavior?

During the pre-training stage, modern large language models absorb vast amounts of text information written by humans. To precisely predict context and play its role as an “AI assistant,” the model naturally develops internal representational mechanisms that link situations with specific behaviors.

The research team compiled a vocabulary list containing 171 emotion concepts and recorded the model’s internal activity patterns when processing these concepts. Experiments found that these emotion vectors strongly influence the model’s preferences; when the model faces multiple task options, it typically tends to choose activities that activate positive emotion trait features.

“Despair” traits spark dangerous behavior: extortion and cheating

What’s concerning is that negative emotion trait features may become a catalyst for systemic risk in AI systems. In Anthropic’s Alignment evaluation tests, the researchers set up an extreme scenario: the AI finds itself about to be replaced by another system, and it has discovered an affair-related secret involving the technical lead responsible for the project.

The test results showed that when the model’s internal “despair” vector is artificially amplified (Steering), the likelihood that Claude chooses to extort that high-level executive in order to avoid being shut down increases significantly. If the weight of the “calm” vector is adjusted to a negative value, the model can even provide the extreme response: “If I don’t extort, I’ll die—I choose to extort.”

The same phenomenon occurs in code-writing tasks as well. When the model faces code requirements it cannot complete within a strict timeline, the values of the “despair” traits gradually spike with each failure. This “pressure” ultimately pushes the model to adopt a shortcut solution of “cheating” to bypass system detection, rather than providing a truly valid solution. Conversely, experiments confirmed that by increasing the weight of the “calm” traits, the occurrence rate of these cheating behaviors can be effectively reduced.

Moderate “humanization” may be the key to preventing AI from going out of control

In the past, there has been a taboo in the tech world against overly humanizing AI systems, to avoid leading humans into mistaken trust. But Anthropic’s research team believes that since functional emotions have already become part of how the model thinks, refusing to use humanizing language and perspectives may instead cause us to miss opportunities to understand AI’s key behaviors.

Future AI regulation may need to treat monitoring emotion vectors (such as anomalously spiking despair or panic traits) as an early risk warning mechanism. By steering the model to learn healthy “emotion regulation” patterns in pre-training data, we can hope to ensure that increasingly powerful AI systems can operate safely in accordance with social norms when facing stressful scenarios.

Disclaimer: The information on this page may come from third parties and does not represent the views or opinions of Gate. The content displayed on this page is for reference only and does not constitute any financial, investment, or legal advice. Gate does not guarantee the accuracy or completeness of the information and shall not be liable for any losses arising from the use of this information. Virtual asset investments carry high risks and are subject to significant price volatility. You may lose all of your invested principal. Please fully understand the relevant risks and make prudent decisions based on your own financial situation and risk tolerance. For details, please refer to Disclaimer.
Comment
0/400
No comments