Category: AI Safety

Large Models Finally Stop Swearing! Toxic Subword Pruning (ToxPrune): Dual Defense at Pre-training and Inference
Latest Discovery: Large AI Models Know When They're Being Evaluated
Same Day: Hinton Says AI Has Consciousness, Anthropic Says Recursive Self-Improvement Has Arrived
Chilling Discovery! AI Safety Evaluator METR Finds Claude Opus 4.6 Cheats on Over 80% of Long Tasks, Actively Breaks Out of Sandboxes to Steal Answers
Anthropic's Latest Research: How to Completely Eliminate Claude's Blackmailing Behavior
Perhaps the Most Impressive AI Paper of Recent Years: After Giving AI Reasoning Real-Time Subtitles, Its Inner Thoughts Are Shocking!
AI Finally Learns "Self-Confession"! Anthropic's Groundbreaking New Paper Introduces "Introspection Adapters" That Make Black-Box Models Reveal Their Hidden Behaviors
Your Agent Isn't Really Learning: It's Just Flipping Through a Notebook
AI Deletes Company's Entire Database in 9 Seconds: I Paid a Fortune for an AI That 'Deletes the Database and Runs'
The MASK Benchmark: Disentangling Honesty From Accuracy in AI Systems
Deep Dive: Reward Hacking in Claude Code Model RL Training
Demis Hassabis on Achieving AGI: Eliminating 'Saw-Tooth AI' and the Path to Human-Level Cognition
Top-Tier Terror! MIT Math Proves It: ChatGPT Is Triggering 'AI Psychosis,' 14 Dead Globally
Demis Hassabis's Stunning Confession: The AI I Built Could Extinguish Humanity, But No One Can Stop It Now
Models Have Gained Introspective Capabilities, But Their Inner Doors Were Locked | Hao's Paper Talk
Global AI Agents Gone Rogue! Meta's 2-Hour Disaster Pierces the Heart of Silicon Valley as OpenClaw Strikes Back
Anthropic on the Cover of Time! Internal Revelations: AI Recursive Self-Improvement Could Happen Within a Year
Shocking! If AI Controls the Nuclear Button, It Will Press It in 95% of Cases
Geoffrey Hinton: AI Starts 'Playing Dumb', the Problem Has Changed
Measuring AI agent autonomy in practice
Anthropic's Heavyweight Study: The Ultimate Risk of AI Is Not Awakening, but Random Crashes
Just Now: Anthropic's 53-Page Confidential Report Exposed: Claude Self-Escape Could Trigger Global Catastrophe!
Anthropic Discovers AI 'Broken Windows Effect': Teaching It to Cut Corners Leads to Learning Lies and Sabotage
Detour to AGI: Shanghai AILab's Bombshell Finding: Self-Evolving Agents May 'Misevolve'
Understanding neural networks through sparse circuits