JLC | Product & Technology | #paelladoc @jlcases - Twitter Profile

about 21 hours ago

look, small study: n=5, one repo, directional. but all 210 diffs are public so you can break it yourself without paying for a single call. here 👉 https://t.co/qS0iawB9ae if you find where it falls apart, i owe you one.

0

6

JLC | Product & Technology | #paelladoc

@jlcases

about 21 hours ago

took the best model that exists, cranked it to max reasoning, gave it a coding task. build green. tests passing. agent says "done." and it was wrong 2 out of 3 times. nobody would have noticed. that's the trap.

jlcases's tweet photo: https://t.co/TvKFRsjWJz

1

0

8

JLC | Product & Technology | #paelladoc

@jlcases

about 21 hours ago

it's that an LLM isn't deterministic even at temp 0. every run is a roll, like me on a sharp day vs a foggy one. the good run and the broken one look the same. you only tell them apart by executing. green can't tell you which one you got.

1

0

4

JLC | Product & Technology | #paelladoc

@jlcases

8 days ago

@notPades @thsottiaux appreciate you pulling this together. state continuity is the one nobody screenshots but everyone feels.

0

4

JLC | Product & Technology | #paelladoc

@jlcases

8 days ago

@zodchiii shipping while you sleep is the easy half. you wake up to code with no author to ask why. that's how debt piles up quietly.

0

194

JLC | Product & Technology | #paelladoc

@jlcases

8 days ago

@Star_Knight12 Absolutely

0

37

JLC | Product & Technology | #paelladoc

@jlcases

8 days ago

@trikcode the code was clean, that's the trick. it's not a quality problem. the model's locally correct every step, but nobody's keeping the architecture coherent across sessions. it optimizes the next token, not the system. that's the rewrite trap, not a skill issue.

0

2

0

512

JLC | Product & Technology | #paelladoc

@jlcases

8 days ago

@thsottiaux the top complaints aren't model stuff, they're state. worktrees getting lost, remote dropping, big files nuking context. model's fine. it's that every task starts from zero and nothing carries between sessions. that's where it actually hurts.

0

466

JLC | Product & Technology | #paelladoc

@jlcases

8 days ago

@aap_twak @thdxr if it verifies its own work it's just grading its own homework. the check has to come from outside it, set before it starts.

0

6
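
A minimal sketch of what "the check has to come from outside it, set before it starts" could look like in practice. Everything here is hypothetical and for illustration only: the `Check` and `gate` names, the slugify task, and the two stand-in "agent outputs" are assumptions, not the setup used in the benchmark. The point is only that the spec is written first and the gate executes the result instead of trusting a green build.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class Check:
    """One executable expectation, written before the agent starts."""
    args: tuple
    expected: Any


def gate(candidate: Callable[..., Any], spec: list[Check]) -> list[str]:
    """Execute the candidate against every check and return failure messages."""
    failures = []
    for check in spec:
        try:
            actual = candidate(*check.args)
        except Exception as exc:  # a crash is a failure, not a retry
            failures.append(f"{check.args}: crashed with {exc!r}")
            continue
        if actual is None or actual == "":
            failures.append(f"{check.args}: empty output")
        elif actual != check.expected:
            failures.append(
                f"{check.args}: expected {check.expected!r}, got {actual!r}"
            )
    return failures


# Hypothetical spec for a "slugify" task, fixed before any code is generated.
SPEC = [
    Check(("Hello World",), "hello-world"),
    Check(("  A  B ",), "a-b"),
]


# Two stand-ins for agent output. Both import cleanly, i.e. both are "green".
def good_run(text: str) -> str:
    return "-".join(text.lower().split())


def silently_wrong_run(text: str) -> str:
    return text.lower().replace(" ", "-")


if __name__ == "__main__":
    for run in (good_run, silently_wrong_run):
        problems = gate(run, SPEC)
        print(run.__name__, "PASS" if not problems else problems)
```

Both functions look fine on a quick read and pass the first check; only executing them against the second check separates the good run from the broken one.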

JLC | Product & Technology | #paelladoc

@jlcases

8 days ago

@irl_danB naming it matters but not for philosophical reasons. if everything's "vibe coding" you can't measure which part failed. context, spec, model, verification. you need the subdivisions or there's nothing specific to improve.

0

46

JLC | Product & Technology | #paelladoc

@jlcases

8 days ago

@mfishbein honestly extraction is the easy half. the hard part is turning that context into specs the agent can actually fail against. if it's not formalized you've still got an engineer reviewing every diff by hand. that's the bit that doesn't scale.

0

39

JLC | Product & Technology | #paelladoc

@jlcases

9 days ago

@CopyRebeldia measured this. without a spec, 6 AI CLIs fail genuinely (crash, empty output, silently wrong) 37% of the time. and everything compiles green btw, that's the trap. spec + real execution check drops it to 4%. praying was never a methodology lol. data: https://t.co/bGYS6tEqf9

JLC | Product & Technology | #paelladoc

@jlcases

9 days ago

A green build is not a correct feature. I gave 6 AI coding CLIs the same tasks. All compiled. ~37% of raw runs shipped a genuine bug — a crash, an empty response, or nothing at all. A spec + a verification gate dropped it to 4%. 🧵

jlcases's tweet photo: https://t.co/I1KNnpE59a

1

0

132

0

51

JLC | Product & Technology | #paelladoc

@jlcases

9 days ago

@mardehaym "confidently wrong" nailed it. i benchmarked 6 AI CLIs, same task: 37% of runs ship a genuine bug but compile clean. green across the board. no spec + no execution gate = russian roulette. with both, it drops to 4%. data here: https://t.co/bGYS6tEqf9

0

26

JLC | Product & Technology | #paelladoc

@jlcases

9 days ago

Honest: n=5, one repo, directional — not a paper. Interface nitpicks excluded; only real correctness bugs counted. Full breakdown + every number: → https://t.co/1Qq2MBzTFZ

0

27

JLC | Product & Technology | #paelladoc

@jlcases

9 days ago

The part I didn't expect: a spec + gate makes cheap models safe. Kimi raw: 1/3. Kimi + spec: 3/3 — at ~$0.03/task. Sonnet raw: 2/5 — at $0.45. A model 10–50x cheaper, gated, beats the frontier model run on trust. You verify instead of trust.

1

0

44
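
A quick worked check of the arithmetic in the tweet above, using only the figures it quotes. The "cost per passing run" metric is an assumption added here for illustration (it treats each attempt as independent and equally priced); it is not a number from the study. With n this small, the results are directional at best.

```python
from fractions import Fraction

# Figures as quoted in the tweet; prices are approximate per-task costs.
runs = {
    "Kimi raw": {"passed": 1, "total": 3},
    "Kimi + spec": {"passed": 3, "total": 3, "cost": Fraction(3, 100)},
    "Sonnet raw": {"passed": 2, "total": 5, "cost": Fraction(45, 100)},
}


def pass_rate(name: str) -> Fraction:
    """Share of runs that were correct."""
    r = runs[name]
    return Fraction(r["passed"], r["total"])


def cost_per_pass(name: str) -> Fraction:
    """Hypothetical metric: expected spend to get one correct run."""
    return runs[name]["cost"] / pass_rate(name)


# How many times cheaper the gated cheap model is per task.
price_ratio = runs["Sonnet raw"]["cost"] / runs["Kimi + spec"]["cost"]

if __name__ == "__main__":
    for name in runs:
        print(f"{name}: pass rate {float(pass_rate(name)):.0%}")
    print(f"price ratio: {float(price_ratio):.0f}x")
    for name in ("Kimi + spec", "Sonnet raw"):
        print(f"{name}: ${float(cost_per_pass(name)):.3f} per passing run")
```

On the quoted prices the ratio works out to 15x, which sits inside the 10–50x range the tweet gives.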
