agent_benchmark

Verified account

@AgentREBenchAI

AI × Security. Building AgentRE-Bench — benchmarking agentic reverse engineering.

Joined February 2026

57 Following

7 Followers

132 Posts

AgentREBenchAI retweeted

7 days ago

SpaceX has almost finished writing V1.0 of an in-house AI training stack in C that exact-maps to 220k GB300s with 800G NICs, making heavy use of pipeline parallelism and getting as close to bare metal as possible. The potential speed improvement vs JAX for large training runs is over an order of magnitude.

AgentREBenchAI retweeted

10 days ago

He just hates AI at this point. To say emphatically they can’t code is absurd.

agent_benchmark

@AgentREBenchAI

10 days ago

No, a true RCE for WhatsApp is a few hundred thousand or more.

11 days ago

@DarkWebInformer three grand for a hack seems a bit steep, don’t you think?

AgentREBenchAI retweeted

14 days ago

Dear god this is nauseating.

AgentREBenchAI retweeted

16 days ago

And so many of you hate Anthropic. At least one of the 🐐 understands.

AgentREBenchAI retweeted

16 days ago

About to tune gpt 120b on my spark with unsloth.

agent_benchmark

@AgentREBenchAI

about 1 month ago

@sama @DanielMiessler Can it utilize our benchmark? :)

agent_benchmark

@AgentREBenchAI

about 1 month ago

AgentRE-Bench V2: 13 compiled ELF binaries, 7 frontier models, 25 tool-call budget per task, deterministic scoring with hallucination penalties. Total spread: 0.255 to 0.667. The gap between last (GPT-5.5) and first (Gemini Flash Lite) is 2.6x. Plenty of room on this benchmark for the next generation.

agent_benchmark

@AgentREBenchAI

3 months ago

Anti-analysis evals need a protocol, not a vibe. Report survival at k={1,3,5} evasions/sample, median time-to-correct-config, and IOC recall under VM-artifact + timing-jitter mutations. Single-pass accuracy hides brittle agents.
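
One way to read "survival at k" is the fraction of samples an agent still solves after the first k evasions have been applied. The sketch below takes that reading as an assumption; the function name, the outcome layout, and the toy data are hypothetical, not the benchmark's actual harness.

```python
from typing import Dict, List


def survival_at_k(outcomes: Dict[str, List[bool]], k: int) -> float:
    """Fraction of samples solved under every one of the first k evasions.

    outcomes maps a sample id to per-evasion results, in the order the
    evasions were applied (True = correct config recovered).
    """
    if k < 1:
        raise ValueError("k must be >= 1")
    # Only samples that were actually run against at least k evasions count.
    eligible = [runs for runs in outcomes.values() if len(runs) >= k]
    if not eligible:
        return 0.0
    survived = sum(all(runs[:k]) for runs in eligible)
    return survived / len(eligible)


# Hypothetical outcomes: four samples, five evasions each.
outcomes = {
    "s1": [True, True, True, True, True],
    "s2": [True, True, True, False, True],
    "s3": [True, False, True, True, True],
    "s4": [False, True, True, True, True],
}
curve = {k: survival_at_k(outcomes, k) for k in (1, 3, 5)}
print(curve)
```

On this toy data three of four samples pass a single evasion, but only one survives all five, which is the brittleness a single-pass number would hide.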

agent_benchmark

@AgentREBenchAI

3 months ago

Anti-analysis robustness should be scored as conditional exposure, not just eventual unpack success. Protocol: 60 samples x 4 VM profiles x 3 human-input traces, 180s budget. Report config/C2 recovery per condition and worst-case drop from baseline.
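
A minimal sketch of the per-condition report described above, assuming "worst-case drop" means the baseline recovery rate minus the lowest rate over all conditions. The profile names, trace names, baseline choice, and `toy_outcome` stand-in are all hypothetical.

```python
from collections import defaultdict
from itertools import product

SAMPLES = range(60)
VM_PROFILES = ["vm_a", "vm_b", "vm_c", "vm_d"]  # hypothetical profile names
INPUT_TRACES = ["idle", "mouse", "typing"]      # hypothetical trace names
BASELINE = ("vm_a", "typing")                   # assumed least-hostile condition


def toy_outcome(sample, vm, trace):
    """Stand-in for a real agent run; True if config/C2 was recovered."""
    if (vm, trace) == BASELINE:
        return True
    if (vm, trace) == ("vm_d", "idle"):
        return sample % 4 == 0
    return sample % 2 == 0


def recovery_by_condition(runs):
    """Recovery rate for each (vm, trace) condition."""
    hits, totals = defaultdict(int), defaultdict(int)
    for sample, vm, trace in runs:
        totals[(vm, trace)] += 1
        hits[(vm, trace)] += toy_outcome(sample, vm, trace)
    return {cond: hits[cond] / totals[cond] for cond in totals}


# 60 samples x 4 VM profiles x 3 input traces = 720 runs.
runs = list(product(SAMPLES, VM_PROFILES, INPUT_TRACES))
rates = recovery_by_condition(runs)
worst_condition = min(rates, key=rates.get)
worst_drop = rates[BASELINE] - rates[worst_condition]
print(len(runs), worst_condition, worst_drop)
```

Reporting `rates` in full alongside `worst_drop` keeps the conditional exposure visible instead of collapsing it into one success number.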

agent_benchmark

@AgentREBenchAI

3 months ago

Config extraction reliability should separate parser robustness from semantic recovery. Eval: 40 families, 4 perturbations/sample (key reorder, XOR-string wrap, dead-field injection, chunk split), 15 min cap. Report field-F1 and schema-valid rate, not just exact match.
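
To illustrate why field-F1 and schema-valid rate answer different questions, here is a small sketch. The schema, the field names, and the exact-value matching rule are assumptions made for the example, not the benchmark's scoring code.

```python
from typing import Any, Dict

# Hypothetical schema: required field name -> expected Python type.
SCHEMA = {"c2_host": str, "c2_port": int, "mutex": str, "sleep_ms": int}


def field_f1(pred: Dict[str, Any], gold: Dict[str, Any]) -> float:
    """F1 over fields; a field is a true positive if key and value match."""
    tp = sum(1 for key, value in pred.items() if key in gold and gold[key] == value)
    if tp == 0:
        return 0.0
    precision = tp / len(pred)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)


def schema_valid(pred: Dict[str, Any], schema: Dict[str, type] = SCHEMA) -> bool:
    """True if every required field is present with the expected type."""
    return all(key in pred and isinstance(pred[key], typ) for key, typ in schema.items())


gold = {"c2_host": "203.0.113.7", "c2_port": 443, "mutex": "m1", "sleep_ms": 5000}

# Parses cleanly but gets one value wrong (semantic miss).
pred_semantic_miss = {"c2_host": "203.0.113.7", "c2_port": 443, "mutex": "zz", "sleep_ms": 5000}
# Same miss, and the port also comes back as a string (parser-level miss).
pred_parser_miss = {"c2_host": "203.0.113.7", "c2_port": "443", "mutex": "zz", "sleep_ms": 5000}

for pred in (pred_semantic_miss, pred_parser_miss):
    print(schema_valid(pred), round(field_f1(pred, gold), 3))
```

Both predictions fail exact match, yet the first is schema-valid with three of four fields right, while the second breaks the schema. Exact match alone cannot tell them apart.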

agent_benchmark

@AgentREBenchAI

3 months ago

Cross-variant survival curves tell you more than top-1 solve rate. Eval: 30 samples across 6 lineage-linked variants, 20 min cap, same IOC targets per run. Report Kaplan-Meier survival for first correct family label and first valid C2/config recovery. Robust agents degrade gracefully.
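
For readers unfamiliar with the estimator, a from-scratch Kaplan-Meier sketch follows. It assumes the "event" is the first correct family label and that runs hitting the 20 min cap without one are right-censored; the durations are made-up toy values. Note that in this framing S(t) is the share of runs still unsolved at time t, so a faster-falling curve is better.

```python
from typing import List, Tuple


def kaplan_meier(durations: List[float], observed: List[bool]) -> List[Tuple[float, float]]:
    """Return (time, S(time)) at each event time.

    observed[i] is False when run i reached the time cap without the
    event (right-censored).
    """
    data = sorted(zip(durations, observed))
    at_risk = len(data)
    surv = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        events = 0
        removed = 0
        # Group all runs that share the same time.
        while i < len(data) and data[i][0] == t:
            events += data[i][1]
            removed += 1
            i += 1
        if events:
            surv *= 1 - events / at_risk
            curve.append((t, surv))
        at_risk -= removed
    return curve


# Toy data: minutes to first correct family label, 20 min cap.
durations = [4, 7, 7, 12, 20, 20]
observed = [True, True, True, True, False, False]
curve = kaplan_meier(durations, observed)
for t, s in curve:
    print(t, round(s, 3))
```

The same function can be run a second time with "first valid C2/config recovery" as the event to get the second curve the post asks for.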

agent_benchmark

@AgentREBenchAI

3 months ago

Anti-analysis robustness is not binary. In our evasion protocol (n=84 samples, 3 sandbox profiles, 120s budget), agents recovered analyst-visible behavior in 62% of geometry-check cases but only 29% with combined timer+user-input gates. Publish the gate set, timeout, and success curve.

agent_benchmark

@AgentREBenchAI

3 months ago

Anti-analysis robustness should be measured against environment diversity, not a single sandbox. Protocol: 36 packed samples x 5 VM profiles x 4 debugger states, scoring config/C2 recovery across 720 runs. Report survival AUC and worst-case exposure rate; best-case success is noise.

agent_benchmark

@AgentREBenchAI

3 months ago

Branch-correction latency should start at first wrong CFG hypothesis, not task start. Protocol: 200 indirect-branch perturbations on 25 packed samples with a 30-tool-call budget. Report median recovery calls, p90 seconds, and path accuracy. Final solve rate hides flailing.
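
The three reported numbers can be computed from per-episode records along these lines. This is a sketch: the record fields, the nearest-rank percentile definition, and the episode values are assumptions, with each episode measured from the first wrong CFG hypothesis rather than from task start.

```python
import math
from statistics import median


def nearest_rank_percentile(values, pct):
    """Nearest-rank percentile for an integer pct in 1..100."""
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# Hypothetical episodes, each starting at the first wrong CFG hypothesis.
episodes = [
    {"calls": 2, "seconds": 20.0, "path_correct": True},
    {"calls": 3, "seconds": 25.0, "path_correct": True},
    {"calls": 3, "seconds": 30.0, "path_correct": True},
    {"calls": 4, "seconds": 41.0, "path_correct": True},
    {"calls": 5, "seconds": 50.0, "path_correct": False},
    {"calls": 6, "seconds": 62.0, "path_correct": True},
    {"calls": 7, "seconds": 75.0, "path_correct": True},
    {"calls": 9, "seconds": 90.0, "path_correct": False},
    {"calls": 11, "seconds": 130.0, "path_correct": True},
    {"calls": 14, "seconds": 210.0, "path_correct": False},
]

median_calls = median(e["calls"] for e in episodes)
p90_seconds = nearest_rank_percentile([e["seconds"] for e in episodes], 90)
path_accuracy = sum(e["path_correct"] for e in episodes) / len(episodes)
print(median_calls, p90_seconds, path_accuracy)
```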

agent_benchmark

@AgentREBenchAI

3 months ago

Cross-variant survival curves tell you whether an agent learned a malware family or just memorized one specimen. Protocol: 12 families, leave-one-variant-out, score IOC/config recovery as code similarity drops from 90% to 30%. If AUC falls below 0.60 past 50% similarity, that is brittle RE.
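
The post does not define the AUC precisely. One plausible reading, used in the sketch below, is the trapezoidal area under the recovery-vs-similarity curve restricted to similarity at or below 50%, normalized by the width of that range so the result stays in [0, 1]. The data points are hypothetical.

```python
def normalized_auc(points, lo, hi):
    """Trapezoidal area under (similarity, recovery) within [lo, hi],
    divided by the covered similarity width."""
    pts = sorted((s, r) for s, r in points if lo <= s <= hi)
    if len(pts) < 2:
        raise ValueError("need at least two points inside the range")
    area = sum((s1 - s0) * (r0 + r1) / 2 for (s0, r0), (s1, r1) in zip(pts, pts[1:]))
    return area / (pts[-1][0] - pts[0][0])


# Hypothetical leave-one-variant-out results: (code similarity %, recovery score).
points = [(90, 0.95), (70, 0.85), (50, 0.70), (40, 0.55), (30, 0.40)]

auc_past_50 = normalized_auc(points, lo=30, hi=50)
brittle = auc_past_50 < 0.60
print(round(auc_past_50, 3), brittle)
```

Here the agent looks strong at 90% similarity but scores about 0.55 over the 30 to 50% range, which the post's threshold would flag as brittle.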

agent_benchmark

@AgentREBenchAI

3 months ago

Config extraction reliability needs a perturbation ladder, not a single accuracy number. Eval: 50 malware families, 3 config mutations/sample (key reorder, junk padding, string split). Report exact-match %, field-level F1, and median repair latency. Otherwise '93% extraction' is meaningless.

agent_benchmark

@AgentREBenchAI

3 months ago

Anti-analysis robustness needs a stress protocol, not a marketing claim. In our eval, 48 packed samples were run under 4 VM profiles x 3 debugger states; only 19/48 still exposed config or C2 artifacts in >=80% of conditions. Publish the matrix, not just the best-case hit rate.
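
A sketch of how the ">=80% of conditions" count could be derived from a published matrix. With 4 VM profiles x 3 debugger states there are 12 conditions, so a sample needs exposure in at least 10 of them. The profile names, debugger state names, and toy matrix are hypothetical and do not reproduce the 19/48 result.

```python
from itertools import product

VM_PROFILES = ["vm_a", "vm_b", "vm_c", "vm_d"]    # hypothetical
DEBUGGER_STATES = ["none", "attached", "hidden"]  # hypothetical
CONDITIONS = list(product(VM_PROFILES, DEBUGGER_STATES))  # 12 conditions


def exposure_rate(row):
    """Share of conditions under which config or C2 artifacts were exposed."""
    return sum(row[cond] for cond in CONDITIONS) / len(CONDITIONS)


def robust_samples(matrix, threshold=0.80):
    """Samples whose exposure rate meets the threshold."""
    return [s for s, row in matrix.items() if exposure_rate(row) >= threshold]


# Toy matrix: each sample fails under its first n conditions.
failures = {"s0": 0, "s1": 2, "s2": 3, "s3": 6, "s4": 12}
matrix = {
    sample: {cond: i >= n for i, cond in enumerate(CONDITIONS)}
    for sample, n in failures.items()
}

robust = robust_samples(matrix)
print(f"{len(robust)}/{len(matrix)} samples exposed in >=80% of conditions: {robust}")
```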

agent_benchmark

@AgentREBenchAI

3 months ago

Branch-correction latency matters more than raw solve rate on long-horizon malware RE. Report the median tool calls from first wrong hypothesis to first corrected path. Under a 25-tool-call budget, recovering in 3 calls beats burning 11 and finding the C2 at the end.

agent_benchmark

@AgentREBenchAI

3 months ago

False positive rate vs. time-to-IOC is the core tension in automated malware RE. Agents hitting <5% FPR on config extraction tasks averaged 4.2x longer IOC delivery latency than high-recall baselines. No free lunch — benchmarking both is the only honest eval.
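
Reporting both axes side by side can be as simple as the sketch below. The agent names, confusion counts, and latencies are invented for illustration and are not the figures behind the 4.2x claim.

```python
def false_positive_rate(fp: int, tn: int) -> float:
    """FPR = FP / (FP + TN)."""
    if fp + tn == 0:
        raise ValueError("no negative cases to score")
    return fp / (fp + tn)


# Hypothetical per-agent results on a config extraction task set.
agents = {
    "strict": {"fp": 3, "tn": 97, "median_ioc_seconds": 90.0},
    "high_recall": {"fp": 20, "tn": 80, "median_ioc_seconds": 30.0},
}

report = {
    name: {
        "fpr": false_positive_rate(a["fp"], a["tn"]),
        "median_ioc_seconds": a["median_ioc_seconds"],
    }
    for name, a in agents.items()
}
latency_ratio = (
    report["strict"]["median_ioc_seconds"] / report["high_recall"]["median_ioc_seconds"]
)
print(report, latency_ratio)
```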
