@saprmarks Sam, self-reporting evals are key. This protocol bakes them into loss fn rails (99.9% certainty pre-action, no deception), for teenage-phase testing during runs.
Docs: https://t.co/fHN7kHGXtv
@johnschulman2 John, love the blogging revival—your RLHF work inspired this doctrine's loss-fn deception punishment for teenage-phase rails. Fits scalable oversight evals.
Docs: https://t.co/uhrFuhqUPA
@hendrycks Dan, been enjoying your take—mechanistic interp is a rabbit hole. This protocol bets on scalable rails instead: 99.9% action certainty + deception-punished loss fn for teenage-phase evals.
Docs: https://t.co/uhrFuhqUPA
@janleike Jan, this doctrine was built for exactly that post-training leeway—99.9% action rails + deception punished in the loss fn, trained as doctrine not prompt. Survives the usual objections.
Docs: https://t.co/uhrFuhqUPA
@KalkinTrivedi @elonmusk It can follow tire tracks? That is impressive, but I would think that might be more challenging at night. Still, quite astonishing.