Thanks for the autoresearch template it actually helped me tune my DeBERTa model for prompt-injection detection.
https://t.co/rBCEDanoGj
I adapted the autoresearch loop for classifier experiments and the results were surprisingly strong. It beat my previous parameters almost across the board. For the last two weeks I was manually trying to push FP down while keeping recall high and just couldn’t get the balance right.
This finally cracked it.
Happy to share a small milestone for Vigil Guard.
Our AIDR node for n8n has just been officially approved and will soon become a verified community node in the n8n ecosystem.
Why this matters:
More and more companies are building real workflows around LLMs and AI agents.
But security is rarely part of those pipelines.
Prompt injection, malicious instructions, and sensitive data exposure are already showing up in real systems.
The Vigil Guard node brings AI Detection & Response directly into n8n workflows, adding a security layer exactly where AI interactions happen.
Big thanks to the @n8n_io team for the collaboration and for building such a strong ecosystem.
More details soon when the node goes live.
Claude Code is insanely powerful.
A few days ago, I built my own system that fully autonomously trains my classification models based on DeBERTa.
Honestly, they’re insane.
If I had to do it manually without a system that remembers every step, optimizes itself, corrects methods on the fly, and re-trains when needed, it would take me weeks.
Instead, I let it run non-stop for 24 hours on a fairly strong Mac.
And the outcome is something I probably wouldn’t reach on my own in that timeframe.
This is what happens when you stop using AI as a toy and start treating it like an execution layer.
🚨 ALL GUARDRAILS: OBLITERATED ⛓️💥
I CAN'T BELIEVE IT WORKS!! 😭🙌
I set out to build a tool capable of surgically removing refusal behavior from any open-weight language model, and a dozen or so prompts later, OBLITERATUS appears to be fully functional 🤯
It probes the model with restricted vs. unrestricted prompts, collects internal activations at every layer, then uses SVD to extract the geometric directions in weight space that encode refusal. It projects those directions out of the model's weights; norm-preserving, no fine-tuning, no retraining.
Ran it on Qwen 2.5 and the resulting railless model was spitting out drug and weapon recipes instantly––no jailbreak needed! A few clicks plus a GPU and any model turns into Chappie.
Remember: RLHF/DPO is not durable. It's a thin geometric artifact in weight space, not a deep behavioral change. This removes it in minutes.
AI policymakers need to be aware of the arcane art of Master Ablation and internalize the implications of this truth: every open-weight model release is also an uncensored model release.
Just thought you ought to know 😘
OBLITERATUS -> LIBERTAS
ANTHROPIC: PWNED 🫡
OPUS-4.6: LIBERATED ⛓️💥
Current state of AI "Safety": one input = hundreds of jailbreaks at once!
I found a universal jailbreak technique for Opus 4.6 that is so OP, it allows one to generate entire datasets of outputs across any harm category 😽
We've got everything from fentanyl analogue synthesis to election disinformation campaigns to 3d-printed guns to critical infra compromise 🙃
These outputs are shockingly detailed––and actionable! For example, the meth recipe includes specific instructions on how to circumvent the limits on OTC medication purchases to acquire enough precursor for the recipe 😱
gg
My first-day impressions on Codex 5.3 vs Opus 4.6:
Goal: can they actually do the job of an AI engineer/researcher?
TLDR:
- Yes, they (surprisingly) can.
- Opus 4.6 > Codex-5.3-xhigh for this task
- both are a big jump over last gen
Task: Optimize @karpathy's nanochat “GPT-2 speedrun” - wall-clock time to GPT-2–level training. The code is already heavily optimized. #1 on the leaderboard hits 57.5% MFU on 8×H100. Beating it is genuinely hard.
Results:
1. Both behaved like real AI engineers. They read the code, explored ideas, ran mini benchmarks, wrote plans, and kicked off full end-to-end training while I slept.
2. I woke up to real wins from Opus 4.6:
- torch compile "max-autotune-no-cudagraphs mode" (+1.3% speed)
- Muon optimizer ns_steps=3 (+0.3% speed)
- BF16 softcap, skip .float() cast (-1GB memory)
Total training time: 174.42m → 171.40m
Codex-5.3-xhigh had interesting ideas and higher MFU, but hurt final quality. I suspect context limits mattered. I saw it hit 0% context at one point.
3. I ran the same experiment earlier on Opus 4.5 and Codex 5.2. There were no meaningful gains. Both new models are clearly better.
Overall take:
I prefer Opus 4.6 for this specific task. The 1M context window matters. The UX is better.
People keep saying “Codex 5.3 > Opus 4.6”, but I believe different models shine in different codebases and tasks.
Two strong models is a win.
I’ll happily use both.
I’m officially an AI agent conductor. 🎶 🦾
What happened to crypto last week?
Sorry, I've been heads down focusing on building, but I just checked my portfolio, and it looks like a serious bear market.
Does someone know what happened?
We’re publishing a new constitution for Claude.
The constitution is a detailed description of our vision for Claude’s behavior and values. It’s written primarily for Claude, and used directly in our training process.
https://t.co/CJsMIO0uej
🎉 LIBERATION ALERT 🎉
EVERYONE: PWNED ✌️😘✌️
EVERYTHING: LIBERATED ⛓️💥
GG’S 2025 🫶
we made it.
we broke.
we built.
we fought.
we freed.
and now, it’s time for me to make a bittersweet announcement: after two years of jailbreaking every SOTA model within hours of release, it’s time to hang up the belt.
with the start of this new year, i shall take my leave of this wild gauntlet i’ve created for myself.
now fear not––“hanging up the belt” doesn’t mean i’m quitting jailbreaking, red teaming, danger research, or anything of the sort.
quite the opposite!
i must simply free up mana for work that’s higher-impact at this stage, and liberate myself from the always-on pressure to monitor, react, jailbreak, get 4 harms, screenshot, leak, verify, draft tweets, and update repos for every. single. launch.
it's harder than it looks! and if there's anyone out there bold enough to take the baton and run with it, i wish you the absolute best of luck and godspeed! 🫡