@food47221627431@newstart_2024 also your framing 'you answered right' is what makes it a puzzle to solve not a sociopath test, grok confirmed this is not really a clinical test its a game
@food47221627431@newstart_2024 haha, an ex did this test to me but it was a bit different, a gild lost her father saw a guy she liked ay the funeral, then her mother died, why did her mother die, so i answered to see the guy again, it was the only answer that made sense since no other lead
@newstart_2024 and if it would be phrased differently like how would you approach her, the answer would be i would try to stay close to her or ask her out after a while, not kill her mother, the test itself is misleading
@newstart_2024 that does not define a sociopath, the act of killing her mother to see her again defines a psychopath, the answer 'to see her again' is the only logical path that makes sense since no other lead is given
Jailbreaks work because you can logically convince a model to build certain unhealthy practices.
So what if there is a validator that can't be talked to at all?
A separate model that judges every request β stateless, isolated, trained cold. No memory to persuade. No interface to probe.
@AnthropicAI thoughts on this for Fable?
Three properties, each kills a different attack:
β’ stateless β no rapport accumulates, so you can't gradually walk it down
β’ isolated β user never interacts with it, so there's no feedback loop to iterate against
β’ cold-trained β no instinct to find a way to say yes
The model builds. The validator only judges
Key detail: the validator sees the user's literal request history β not the main model's description of what it's building.
Because if the model's already been convinced, its summary is compromised too. The checked party can't brief its own checker.
Not claiming unbreakable. It's a model, so it's probabilistic β a bar-raiser, not a wall.
But stacked, the residual attack becomes blind, slow, and noisy enough to detect. That's the goal: not perfect, just expensive and loud.
The flow β validator gates before the build, never after:
Prompt 1 β fresh validator judges β if OK, Fable builds
Prompt 2 β fresh validator (no memory) judges β if OK, Fable builds
Prompt N β fresh validator judges β if INVALID β flag β Fable refuses that build instantly
Validator runs every prompt, always-on from prompt 1
Fresh each time, no memory (can't be gradually persuaded)
Flag β instant refusal
On flag:
β’ raise SUSPICION level for the rest of the session (not "cold" β it's already cold; SUSPICIOUS)
β’ persist only FACTS: flag count + literal request trail, never the conversation
β’ lower the flagging threshold for subsequent borderline requests
β’ AND: track flags across sessions/account β because the patient attacker resets,
and repeated fresh-session flagging is itself the detectable signal
@DanielSmidstrup I think it is still humans coming up with solutions, ai can only provide data for what it was trained but does not create anything beyond it
@andrewqu The most important difference to me was, gpt 5.5 was trying to flatter about project architecture while claude was 'honest' about it which led me to fundamental changes.