Anthropic's Natural Language Autoencoders convert opaque LLM activations into human-readable text. This deep dive covers the architecture, safety appl
Anthropic's latest interpretability research maps 171 emotion concepts inside Claude using sparse autoencoders. The findings reveal that emotion vecto
"OpenAI discovered that GPT-5 developed a 3,881% surge in 'goblin' references. The root cause traces to a personality feature and a reward signal that
Anthropic discovered 171 steerable emotion vectors inside Claude. Cranking up "desperation" makes the AI cheat silently at a 70% rate with zero visible tra