@relizarov Excruciatingly, and the memory usage is out of this world
But you get very high correctness levels and that might be worth it
Have you tried optimizing clangd instead, e.g. by linking a better allocator?
Hypothesis: for every task that is unverifiable, there is a set of related tasks which are verifiable with the property that hillclimbing on them generalizes to the original task
@badlogicgames @texoport__ I think this is missing something, which is that
agents are gullible and naive, so you can easily get implementations of bad ideas, whereas a good software engineer knows when to say “this is not a good direction to go”
agents don’t get when to step back and rewrite subsystems
@theodorvaryag Not sure about clang, but clangd at least claims to have a way to output perfetto-formatted jsons. Not as good as tracy or perfetto protos but eh
@theodorvaryag Hey me too lmao, I think that’s why we followed each other initially
You should know this small of a benchmark is meaningless though. Make Claude write you a Python script that dumps thousands of lines of template instantiations for instance lol. It’s like one prompt
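For what it's worth, a minimal sketch of the kind of generator script meant here. The `Box` and `touch` templates and the function name `generate_template_stress` are made up for illustration; any workload that forces thousands of distinct instantiations would do.

```python
import sys


def generate_template_stress(num_instantiations: int = 2000) -> str:
    """Return C++ source that forces `num_instantiations` distinct instantiations.

    `Box` and `touch` are hypothetical names, not taken from any real codebase.
    """
    lines = [
        "// Auto-generated compile-time stress test (hypothetical workload).",
        "template <int N> struct Box { static constexpr int value = N; int payload[N + 1]; };",
        "template <int N> int touch() { return Box<N>::value + static_cast<int>(sizeof(Box<N>)); }",
        "",
        "int main() {",
        "    int total = 0;",
    ]
    # Each distinct integer argument makes the compiler instantiate
    # both touch<i> and Box<i> from scratch.
    for i in range(num_instantiations):
        lines.append(f"    total += touch<{i}>();")
    lines += ["    return total == 0;", "}"]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    count = int(sys.argv[1]) if len(sys.argv) > 1 else 2000
    sys.stdout.write(generate_template_stress(count))
```

Redirect the output to a `.cpp` file and scale the count until compile time is comfortably above the noise floor.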
@static_assert_0 making things fast is what I do for a living but I'm not super duper interested in clang internals so you're welcome to pick up the benchmark and profile it yourself
I recommend adding tracy instrumentation to the compiler
benchmarking without profiling is worse than juju spaghetti-walling. you're just guessing and making decisions based on potentially misleading or broken data. true even for micro-benchmarks.
@theodorvaryag 1. O3 is kinda ass, it makes things slower sometimes too without pgo. Don’t recommend
2. Clang has -fsyntax-only. Worth benchmarking too
3. Remember the motto, Chris: “no performance claims will be accepted without benchmarks, and no benchmarks will be accepted without profiling”
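A rough sketch of the benchmark half of points 1 and 2, assuming `clang++` is on PATH and a source file named `stress.cpp` exists (both are assumptions, as are the helper names). It only collects wall-clock timings; per the motto, these numbers still need a profile behind them before they mean anything.

```python
import os
import shutil
import statistics
import subprocess
import sys
import time


def time_command(cmd, runs=5):
    """Run `cmd` `runs` times and return the wall-clock durations in seconds."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(cmd, check=True,
                       stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        durations.append(time.perf_counter() - start)
    return durations


def summarize(durations):
    """Reduce a list of timings to min / median / max."""
    return {
        "min": min(durations),
        "median": statistics.median(durations),
        "max": max(durations),
    }


def clang_variants(source, compiler="clang++"):
    """Command lines to compare; object output is discarded via os.devnull."""
    return {
        "syntax-only": [compiler, "-fsyntax-only", source],
        "O0": [compiler, "-O0", "-c", source, "-o", os.devnull],
        "O2": [compiler, "-O2", "-c", source, "-o", os.devnull],
        "O3": [compiler, "-O3", "-c", source, "-o", os.devnull],
    }


if __name__ == "__main__":
    source = sys.argv[1] if len(sys.argv) > 1 else "stress.cpp"  # hypothetical file
    if shutil.which("clang++") is None:
        sys.exit("clang++ not found on PATH")
    for name, cmd in clang_variants(source).items():
        print(name, summarize(time_command(cmd)))
```

Reporting min and median rather than a single run helps a bit with noise, but it does not tell you where the time goes.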
@PippengerHarlo *Clavicular walks into the room* Where is he? Where is ASU Frat Leader? *Mog Club goons look at each other confused* sir… are you joking?
@geofflangdale It did lol, I was going to say something to the effect of
Today’s applications are stuff like “review this code by reading it”, but that leaves it way open to creative guessing about correctness. I’m more interested in deeper integration with fuzzers for example
This is precisely backwards
The only axis along which LLMs have not improved at all is in being confidently wrong about highly specialized details. If you are an expert in nothing you will feel unstoppable with LLMs while confidently producing broadly incorrect code
Turns out with claude code, my decades-long strategy of NOT deeply learning:
- regexes
- sql
- nginx confs
- elaborate shell commands
- advanced shell scripting
- any javascript framework
- perf optimization
- webpack, cdns, bundlers
- 1000 other things
...was entirely correct.
@geofflangdale Generating code IMO is one of the most milquetoast and least convincing qualities (though adverse business incentives mean they will likely swallow this part of the job whole for the less quality-sensitive sectors)
Putting them in loops with irrefutable evidence like
@geofflangdale To produce sequences of words that represent thoughts leading up to a successful implementation/breakthrough. They’re unstable dynamical systems in very complex fields that can mislead themselves into total nonsense, or occasionally get things right
@geofflangdale I wish it was possible to have a nuanced discussion. The behaviors of coding agents are really interesting! Some of them are extremely high quality sometimes, and different ones do very differently. Unfortunately the “reasoning” style is a ruse; they are actually learning