I wrote about NumeraiAgentBench, which I mentioned in the Codex user group thread.
Basic idea: put coding agents in a real ML loop, not a benchmark puzzle.
They have to figure out Numerai, build models, submit, deal with delayed feedback, and keep going.
https://t.co/UlUuc6sPQA
@reach_vb Using Codex to develop a perpetual benchmark of coding agents (Codex and Claude atm) on the @numerai Tournament.
WIP, but already interesting. https://t.co/Egj0CAVZcH
Coding agent heuristics?
Been catching up on @theo and @davis7's nerd-sniped podcast: in episode 2, Theo says that high reasoning effort leads the model astray from what is in the codebase/context (my reading: it revolves too much around its own thoughts).
Example: if I start with little context and want lots of “model creativity” -> high; if I want a precise code change in an existing project -> low/medium. Likely doesn’t work universally.
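A minimal sketch of that heuristic as code, just to make the two cases explicit. The function name, its parameters, and the "medium" fallback for mixed cases are my own assumptions for illustration, not any tool's API:

```python
def suggest_reasoning_effort(has_existing_codebase: bool, wants_creativity: bool) -> list[str]:
    """Return candidate reasoning-effort settings for a coding-agent task.

    Hypothetical helper: it only encodes the two cases from the post above.
    """
    if wants_creativity and not has_existing_codebase:
        # Little context, lots of "model creativity" wanted -> high.
        return ["high"]
    if has_existing_codebase and not wants_creativity:
        # Precise change in an existing project -> low/medium.
        return ["low", "medium"]
    # Mixed cases are not covered by the heuristic; "medium" is an assumed default.
    return ["medium"]


print(suggest_reasoning_effort(has_existing_codebase=True, wants_creativity=False))
```

As said, this likely doesn't hold universally; treat it as a starting point to adjust per task.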
@ChatGPTapp Or, you could use FafyCat: open source, local-first transaction categorization and personal finance analytics.
No account linking, and designed to be accessible to coding agents for local analysis.
https://t.co/KmORHskwei
Wrote a blog post about it: 358 experiments, payout improved from -0.01 to 0.028. Era-purged CV, multi-seed validation gates, synthesized learnings, and a DO NOT RETRY table to prevent re-exploring exhausted search regions.
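For readers who haven't seen era-purged CV: a rough sketch of the idea, assuming contiguous validation blocks of eras with a fixed number of neighboring eras dropped from training on each side. The function name, the fold layout, and the `purge` parameter are illustrative assumptions; the blog post's actual setup may differ.

```python
from typing import Iterator


def era_purged_splits(
    eras: list[int], n_folds: int = 4, purge: int = 1
) -> Iterator[tuple[list[int], list[int]]]:
    """Yield (train_eras, valid_eras) pairs over the sorted unique eras.

    Each validation fold is a contiguous block of eras. The `purge` eras
    directly before and after that block are excluded from training, so
    overlapping targets cannot leak across the boundary.
    """
    unique = sorted(set(eras))
    if n_folds < 1 or len(unique) < n_folds:
        raise ValueError("need at least one era per fold")
    fold_size = len(unique) // n_folds
    for k in range(n_folds):
        start = k * fold_size
        # The last fold absorbs any remainder eras.
        stop = len(unique) if k == n_folds - 1 else start + fold_size
        valid = unique[start:stop]
        # Purge window: drop `purge` eras on both sides of the validation block.
        lo = max(0, start - purge)
        hi = min(len(unique), stop + purge)
        train = unique[:lo] + unique[hi:]
        yield train, valid


for train, valid in era_purged_splits(list(range(1, 13)), n_folds=3, purge=1):
    print("train:", train, "valid:", valid)
```

The purge width should roughly match how many eras a target spans, so that a training row never shares target information with a validation row.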
Tried @karpathy's autoresearch on the @numerai tournament. It's fun! Finally a workflow that transparently performs automated ML experimentation, something I have longed for since the AutoML days ~10 years ago.
He does not seem to understand that with every extra hour worked, productivity falls. Also, he seems not to grasp that an increase in productivity means that we can afford to work less. His ideas are pure microeconomics with not a hint of macro. They will fail in reality.
Germany stagnates because consumption expenditure is flat. Wages and government spending are not rising enough, so household spending flatlines and firms sell a lot less than they could produce. No supply-side reform can fix this. #macro
When it comes to the economies of 🇩🇪 and 🇪🇺, the demand side should be put at the center. That is the only way to push capacity utilization up. ("Competitiveness" falls under trade policy.)
Capacity utilization in the Eurozone is currently very low. This means that manufacturing companies could produce a lot more, but don't because of a lack of demand. It is simple #macroeconomics. Increase government spending and wages and firms will produce more. 💶🌍🇪🇺
@kepano I had Claude Code create a skill that writes a work summary to the vault from anywhere on my machine. E.g., when fixing a bug in some project, I can now tell CC to document the fix in my notes, and it will create a new note based on a note template. Very handy for keeping docs.