Anthropic's Natural Language Autoencoders convert opaque LLM activations into human-readable text. This deep dive covers the architecture, safety appl
Anthropic's latest interpretability research maps 171 emotion concepts inside Claude using sparse autoencoders. The findings reveal that emotion vecto
OpenAI discovered that GPT-5 developed a 3,881% surge in 'goblin' references. The root cause traces to a personality feature and a reward signal that
Anthropic assembled 12 tech giants and built a cybersecurity AI model too dangerous to release publicly. Project Glasswing found thousands of zero-da
Anthropic interviewed 80,508 people across 159 countries in 70 languages, the largest qualitative AI study ever conducted. The top finding: people w
Anthropic discovered 171 steerable emotion vectors inside Claude. Cranking up "desperation" makes AI cheat silently at 70% rates with zero visible tra