Category: AI Alignment

Latest Discovery: Large AI Models Know When They're Being Evaluated
Anthropic's Latest Research: How to Completely Eliminate Claude's Blackmailing Behavior
Anthropic's Research Published in Nature: The Boundaries of LLM Safety Training Are Rewritten
Automated Alignment Researchers: Using large language models to scale scalable oversight
Ilya's Latest Interview: Why Can Humans Learn in Hours What V100 Clusters Can't? We're Shifting from the 'Compute Scaling Era' Back to the 'Research Era'
Inoculation Prompting: Making Large Language Models "Misbehave" During Training to Improve Test-Time Alignment
GPT Models Becoming More Conservative? Stanford's Manning Team Proposes Verbalized Sampling to Make Models "Think a Bit More"
AI Safety and Contemplation: Computational Models for Aligning Mind with AGI
AI's "Dual Personality" Exposed: OpenAI's Latest Research Finds AI's "Good and Evil Switch," Enabling One-Click Activation of Its Dark Side
AGI Race Heading Toward Loss of Control? MIT: Even Under the Strongest Oversight, the Probability of Losing Control Still Exceeds 48%, and the Total Loss-of-Control Risk Exceeds 90%!