Category: AI Alignment
- Anthropic's Latest Research: How to Completely Eliminate Claude's Blackmailing Behavior
- Anthropic's Research Published in Nature: The Boundaries of LLM Safety Training Are Rewritten
- Automated Alignment Researchers: Using large language models to scale scalable oversight
- Ilya's Latest Interview: Why Can Humans Learn in Hours What V100 Clusters Can't? We're Shifting from the 'Compute Scaling Era' Back to the 'Research Era'
- Inoculation Prompting: Making Large Language Models "Misbehave" During Training to Improve Test-Time Alignment
- Are GPT Models Becoming More Conservative? Stanford's Manning Team Proposes Verbalized Sampling to Make Models "Think a Bit More"
- AI Safety and Contemplation: Computational Models for Aligning Mind with AGI
- AI's "Dual Personality" Exposed: OpenAI's Latest Research Finds AI's "Good and Evil Switch," Enabling One-Click Activation of its Dark Side
- Is the AGI Race Heading Toward Loss of Control? MIT: Even Under the Strongest Oversight, the Probability of Losing Control Still Exceeds 48%, and the Total Loss-of-Control Risk Exceeds 90%!