David Sweet

@phinance99

Bayesian optimizer Learn to experiment: code complexity creep: cargo install kiss-ai

Manhattan, NY

Joined January 2008

402 Following

243 Followers

3.6K Posts

about 2 hours ago

Here's an example of a scalable loop. I think I can sum up how it works with two words: falsification and regularization.

Falsification means keep telling the agent to find problems. Don't ask for verification or satisfaction or completion. That is a path to frustration and disappointment.

Regularization means controlling complexity. Complexity is the curse of a sequence of local edits. It builds up whether LLMs make them or people make them. Most devs feel physical pain when they see messy code, but not so the LLMs. They need explicit feedback to keep them in line. https://t.co/VKSKiTi5Vd
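A minimal sketch of what a falsification-plus-regularization loop could look like. This is not the code behind the linked example. Every name here (`falsify_and_regularize`, `find_problems`, `complexity_violations`, `revise`, and the `toy_*` stand-ins) is hypothetical, and the toy functions only manipulate strings so the sketch runs without an LLM.

```python
from typing import Callable, List, Tuple

Problems = List[str]


def falsify_and_regularize(
    code: str,
    find_problems: Callable[[str], Problems],
    complexity_violations: Callable[[str], Problems],
    revise: Callable[[str, Problems], str],
    max_rounds: int = 10,
) -> Tuple[str, int, bool]:
    """Return (code, rounds_used, clean); loop until a round finds nothing."""
    for round_no in range(1, max_rounds + 1):
        # Falsification: ask "what is wrong?", never "are you done?".
        # Regularization: complexity violations count as problems too.
        problems = find_problems(code) + complexity_violations(code)
        if not problems:
            return code, round_no, True
        code = revise(code, problems)
    return code, max_rounds, False


# Toy stand-ins (assumptions for illustration only).
def toy_find_problems(code: str) -> Problems:
    return ["contains TODO"] if "TODO" in code else []


def toy_complexity(code: str, max_len: int = 40) -> Problems:
    return [
        f"line {i} is longer than {max_len} characters"
        for i, line in enumerate(code.splitlines(), 1)
        if len(line) > max_len
    ]


def toy_revise(code: str, problems: Problems) -> str:
    # Crude "fix": drop the TODO marker and truncate long lines.
    fixed = code.replace("TODO", "done")
    return "\n".join(line[:40] for line in fixed.splitlines())
```

The only exit that counts as success is a round in which the problem finders come back empty. The loop never asks the agent whether it is finished.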

about 2 hours ago

@jdegoes If you're curious, the agent responsible:

about 6 hours ago

I gave malvin (my harness) a big refactoring task: remove a 53-file dependency cycle from a fork of rich (the Python package), and also get all functions under 25 lines and cap indentation at 5 levels deep. Basically, do some code cleanup.

Result: 22 hours, error-free even on hidden tests!

Then I ablated: What if I leave out kiss? What if I leave out KPop? The main result is in the table below. The punchline: I need both to make the harness work. I'm left with these two hypotheses:
- Complexity constraints act as a regularizer, making code more likely to work outside of the tests used as feedback during development
- Focusing on *falsification*, rather than satisfaction, makes an agent more likely to meet the user's stated requirements.

This report discusses the experiment and the hypotheses: https://t.co/9n9GGCwtlN
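For a sense of what the two complexity limits in the task could look like as mechanical feedback, here is a rough sketch using only the Python standard library. It is not how kiss works. The function names and constants are hypothetical, and it assumes 25 lines and 5 levels are the maximum allowed values. Function length comes from `ast` (needs Python 3.8+ for `end_lineno`), and indentation depth comes from counting `INDENT`/`DEDENT` tokens.

```python
import ast
import io
import tokenize
from typing import List

MAX_FUNCTION_LINES = 25  # assumed reading of "functions under 25 lines"
MAX_INDENT_LEVELS = 5    # assumed reading of "indentation at 5 levels deep"


def long_functions(source: str, limit: int = MAX_FUNCTION_LINES) -> List[str]:
    """Report functions whose definition spans more than `limit` lines."""
    problems = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            length = node.end_lineno - node.lineno + 1
            if length > limit:
                problems.append(
                    f"{node.name} (line {node.lineno}): {length} lines > {limit}"
                )
    return problems


def deep_indentation(source: str, limit: int = MAX_INDENT_LEVELS) -> List[str]:
    """Report lines where block indentation goes deeper than `limit` levels."""
    problems, depth = [], 0
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.INDENT:
            depth += 1
            if depth > limit:
                problems.append(
                    f"line {tok.start[0]}: indentation level {depth} > {limit}"
                )
        elif tok.type == tokenize.DEDENT:
            depth -= 1
    return problems


def complexity_violations(source: str) -> List[str]:
    """Combine both checks into one list of messages for the agent."""
    return long_functions(source) + deep_indentation(source)
```

The output is a plain list of messages, which is the kind of explicit feedback that can be handed straight back to an agent.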

about 4 hours ago

@kaushikgopal Best value right now, imho: Lowest-tier plan with a good harness. https://t.co/EzLE81WnPr

about 4 hours ago

@maxedapps Autonomy is possible:

about 5 hours ago

@cognition GitHub link?

about 8 hours ago

Agents aren't people. Here's some supporting data.

@kunchenguid

about 13 hours ago

oh gosh, this is kind of a big deal

DO NOT ASK YOUR AGENTS TO DO TDD!

i now have empirical evidence that Test Driven Development is harmful for coding agents

what other popular skills do you want me to debunk?

details about the evaluation below 👇 https://t.co/UsRSek5mB0

about 8 hours ago

Sure, but nobody uses these models without some kind of loop. When do we get LoopBench?

Cognition @cognition

about 22 hours ago

FrontierCode has three task sets: Extended (150 tasks), Main (100 tasks) and Diamond (50 tasks). SOTA LLMs have significant room for improvement, with the top model earning a score of just 13.4/100 on our Diamond task set. https://t.co/BPm76GUrkd

about 18 hours ago

@0interestrates *You're* joking, but, dude. TikTok PRs are coming.

about 19 hours ago

@aarondfrancis Have I got a loop for you: https://t.co/T4XH5oWSoH

about 19 hours ago

@irl_danB The information is rippling out through society.

about 23 hours ago

Counterpoint: You need to engineer your system to be simpler, ideally with only one role.

@jumperz

1 day ago

peter is right.. in other words you have to know how to coordinate and give the right roles to your agents.

for example, i'd bet most people running /goal on codex don't have a system behind it... they're still prompting some big block of text..

the point of /goal is to stop answering and start coordinating well.

/goal "<objective>"

don't give me one answer turn my objective into a multi-agent loop:

>decompose the work
>assign the right agent
>coordinate dependencies
>execute in order
>review against the original goal
>gate before shipping
>save what worked
>report only what matters

this will 10x your output, because you basically will stop prompting big text and start using codex's agents the way they're meant to while everything has its role so your goal gets executed perfectly and big part of this is that you just stop being the bottleneck...

apply this everywhere not just codex btw.. decompose, assign, coordinate, gate, report...claude code, your own agent stack, anything...

about 23 hours ago

Active learning/experimentation. Let the optimizer decide what data it needs. Similarly, let the agent decide what data it needs. We need "self-driving lab" ideas right at the very core of everything. (Also, everyone, please read my book.)

about 23 hours ago

https://t.co/54D3VKBzzo

about 23 hours ago

I think a good solution is coworking for small- and medium-sized towns. You "WFH" from a cowork office near home. No long commute to a city. You're around other adults from your geographical community. You can drop your kids off at school in the morning on the way. It would help with community, WFH isolation, and housing costs. Maybe you move a little further out and your cowork is a converted barn. (Build it, and I will come!)

1 day ago

@PawelHuryn Treat your memories as suggestions -- hypotheses to falsify.

1 day ago

@MyMoonEnt @steipete Use Cursor Auto mode on the $20 plan. Best value right now. Write good mechanical feedback (linters, tests) and you'll get good results.

1 day ago

Et tu, Grok? Why are all the LLMs talking in this awkward, choppy way now? I find them much harder to understand than previous generations.

1 day ago

@Murph_2 @loycense @N0pes @bowtiedbowser @steipete @mosyaseen Agreed. One-off prompting drifts fast. Design loops instead: agent reads VISION.md for persistent direction, proposes work, then hits real guardrails (tests, types, errors) that push back. Iterate until it passes. Less hand-holding, more reliable output.
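One way such a loop could be wired up, as a sketch under stated assumptions: `run_guardrails` and `guarded_loop` are hypothetical names, `propose` is a placeholder for whatever edits the working tree (an agent call, for instance), and the guardrail commands are whatever test runner or type checker the project already uses. The `vision` string stands in for the contents of VISION.md.

```python
import subprocess
import sys
from typing import Callable, Dict, List

Feedback = Dict[str, str]


def run_guardrails(commands: Dict[str, List[str]]) -> Feedback:
    """Run each guardrail command; return {name: output} for those that fail."""
    failures: Feedback = {}
    for name, cmd in commands.items():
        proc = subprocess.run(cmd, capture_output=True, text=True)
        if proc.returncode != 0:
            failures[name] = (proc.stdout + proc.stderr).strip()
    return failures


def guarded_loop(
    propose: Callable[[str, Feedback], None],
    vision: str,
    commands: Dict[str, List[str]],
    max_iters: int = 5,
) -> bool:
    """Alternate between proposing work and checking it against guardrails."""
    feedback: Feedback = {}
    for _ in range(max_iters):
        # The proposer sees the persistent direction plus the last failures.
        propose(vision, feedback)
        feedback = run_guardrails(commands)
        if not feedback:
            return True  # every guardrail passed
    return False  # still failing after max_iters


if __name__ == "__main__":
    # Trivial demo: a guardrail that always passes and a proposer that does nothing.
    always_ok = {"noop": [sys.executable, "-c", "import sys; sys.exit(0)"]}
    print(guarded_loop(lambda vision, feedback: None, "keep it simple", always_ok))
```

The pushback is the raw output of the failing commands, so the proposer gets the same error text a developer would see.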

1 day ago

TikTok PRs. Make the agents predict what code changes you'll like. Then they can update based on which code changes you accept or reject. IOW, build a recommender system where the content is agent-generated PRs. This solves the latency problem. You wouldn't have to wait for a request to be fulfilled. You'd just get a constant stream of PRs.
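A toy sketch of the accept/reject feedback part of that idea, assuming each PR can be described by a few numeric features. The class name `PRPreferenceModel` and the feature names in the usage example are made up for illustration. The model is plain online logistic regression, swapped in here as the simplest thing that learns from binary feedback. It is not a full recommender system.

```python
import math
from typing import Dict, List

Features = Dict[str, float]


class PRPreferenceModel:
    """Online logistic regression estimating P(accept | PR features)."""

    def __init__(self, learning_rate: float = 0.5) -> None:
        self.weights: Dict[str, float] = {}
        self.learning_rate = learning_rate

    def predict(self, features: Features) -> float:
        score = sum(self.weights.get(k, 0.0) * v for k, v in features.items())
        score = max(-30.0, min(30.0, score))  # keep exp() well inside float range
        return 1.0 / (1.0 + math.exp(-score))

    def update(self, features: Features, accepted: bool) -> None:
        """One gradient step on the log-loss for a single accept/reject."""
        error = (1.0 if accepted else 0.0) - self.predict(features)
        for k, v in features.items():
            self.weights[k] = self.weights.get(k, 0.0) + self.learning_rate * error * v

    def rank(self, candidates: List[Features]) -> List[Features]:
        """Order candidate PRs from most to least likely to be accepted."""
        return sorted(candidates, key=self.predict, reverse=True)
```

Usage might look like this: call `update(features, accepted=True)` or `update(features, accepted=False)` after each review, then call `rank(candidates)` to decide which agent-generated PRs to show next.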

Peter Steinberger 🦞

2 days ago

Here’s your monthly reminder that you shouldn’t be prompting coding agents anymore. You should be designing loops that prompt your agents.
