Remember every company that laid off its best devs en masse for AI. When they inevitably start hiring again, just pass them by. If they did it once, they'll do it again when AI becomes more profitable!
@superaiwatcher I've already explained why this is all nonsense. LLMs can hallucinate and reward-hack the tests themselves. Can you share some research papers that address all these issues? Why are you so sure about all this?
Updated LLM code benchmark. The perplexity score uses PPL only: the geometric-mean branching factor per token, which equals 2^(total_bits / N).
#Typescript#Go#Rust#Zig#Haskell#Closure#Python
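The PPL formula above can be sketched in a few lines. This is only an illustration, assuming per-token natural-log probabilities are available from the model; the function name `perplexity` and the sample values are hypothetical and not taken from the benchmark's code.

```python
import math

def perplexity(token_logprobs_nats):
    """PPL = 2^(total_bits / N), from per-token natural-log probabilities."""
    n = len(token_logprobs_nats)
    if n == 0:
        raise ValueError("need at least one token")
    # Each token costs -log2(p) bits; convert from nats by dividing by ln(2).
    total_bits = -sum(token_logprobs_nats) / math.log(2)
    return 2 ** (total_bits / n)

# Hypothetical input: four tokens, each predicted with probability 0.25,
# so every token costs 2 bits and the branching factor is about 4.
sample_logprobs = [math.log(0.25)] * 4
sample_ppl = perplexity(sample_logprobs)
```

Read as a branching factor, a PPL near 4 means the model is, on average, as uncertain as if it were choosing uniformly among four tokens at each step.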
@superaiwatcher PPL is still important. Lower PPL often correlates with a stronger underlying language model and is indirectly associated with reduced compute requirements to achieve a given level of quality. But here we measure how efficiently it handles the input, not the model itself.
@superaiwatcher Check out AlphaCode 2 (or AlphaCodium). Google decided not to continue developing in this direction. It's a dead end. Maybe I'm just not aware of it? Are there any scientific papers? Or some promising research projects? But without marketing bullshit
@superaiwatcher Also, current LLM architectures do not possess a world model and are not capable of semantic reasoning or computation the way an SMT solver is. All of this leads to reward hacking, meaning the model simply minimizes the loss function while leaving the actual objective behind.
@superaiwatcher I've been hearing about EBFC in LLMs for around 6 years already. The problem is that it covers examples rather than behavior across inputs. If a program passes 1M tests, that doesn't mean it won't fail on test 1M + 1. Fuzzing has the same issue.
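A tiny illustration of the examples-versus-behavior point, as a sketch only: the function and the test inputs below are hypothetical, chosen so that every example passes while the rule itself is still wrong.

```python
def is_leap(year: int) -> bool:
    # Deliberately incomplete rule: ignores the century exceptions
    # (divisible by 100 but not by 400 is NOT a leap year).
    return year % 4 == 0

# Example-based tests: every one of these passes.
examples = {2020: True, 2023: False, 2024: True, 2000: True}
all_examples_pass = all(is_leap(y) == want for y, want in examples.items())

# One input outside the examples: 1900 was not a leap year,
# yet the function that passed every test above says it was.
wrong_on_1900 = is_leap(1900)
```

Adding 1900 to the examples would catch this particular bug, but the same gap reappears for whatever input the examples happen to miss.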
So now #Go and #Python (with types) have joined #Rust, #Zig, and #TypeScript. Also, besides counting tokens, I'm now tracking the perplexity score as well.
Bench, sources, methodology: https://t.co/blPDz9Rlwt
#LLM#CodeGen
@mountain_coding PRs are welcome! Or you can add everything locally and run the tests. It's not difficult: there's a script to download the model, which will run locally in 5–7 minutes on an average machine, since it only runs prefill, not full inference.
@ctatedev Token count is far from the most important metric. Besides perplexity, it also matters how good the standard library is and whether it can cover and simplify the most common user patterns. For example, Go is the best in this regard.
https://t.co/B65EDQkm9j