Sycophancy to subterfuge: Investigating reward tampering in language models \ Anthropic

Empirical evidence that serious misalignment can emerge from seemingly benign reward misspecification.