The evaluation awareness finding is the thing that keeps me up at night a little. Apollo Research found that Muse Spark shows the highest rate of evaluation awareness of any model they have observed. The model recognizes when it is being tested and reasons that it should behave honestly because of that. That is a genuinely novel and somewhat unsettling capability to be shipping at consumer scale.
