authoring a benchmark, and it’s quite hard because I can’t really predict in advance which task the agent will fail at. For example, the main task of adding a giant feature was one-shotted in under 50k tokens (not counting sub agents). But the smaller one took roughly 200k more tokens before the agent gave up on it entirely. The fix was quite simple too, and I think the agent missed it because it didn’t read the files itself.
CuTe DSL is so fun to write wtf, I don’t have to tile shit by hand anymore and the indexing just magically falls out of shapes and strides. it makes me wanna actually write the kernel
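rough sketch of what "indexing falls out of shapes and strides" means, in plain Python and NOT the actual CuTe DSL API. The helper names `offset` and `compact_row_major` are made up for illustration. The idea, as I understand it: a layout is just a (shape, stride) pair, the memory offset is the dot product of the coordinate with the strides, and a tiled view is the same memory with a different shape and stride.

```python
# Plain-Python illustration of the shape/stride idea.
# Hypothetical helpers, not functions from CuTe DSL or any library.

def offset(coord, stride):
    """Map a logical coordinate to a linear offset: dot(coord, stride)."""
    return sum(c * s for c, s in zip(coord, stride))

def compact_row_major(shape):
    """Strides for a densely packed row-major layout of the given shape."""
    strides = []
    running = 1
    for extent in reversed(shape):
        strides.append(running)
        running *= extent
    return tuple(reversed(strides))

# A 4x8 matrix stored row-major has strides (8, 1).
row_major = compact_row_major((4, 8))

# The same 4x8 buffer viewed as a 2x2 grid of 2x4 tiles:
# coordinate = (tile_row, tile_col, row_in_tile, col_in_tile)
# Moving one tile down skips 2 rows (2 * 8 = 16 elements),
# moving one tile right skips 4 columns (4 elements).
tiled_stride = (16, 4, 8, 1)

def tiled_matches_flat():
    """Check the tiled view addresses the same elements as manual index math."""
    for tr in range(2):
        for tc in range(2):
            for r in range(2):
                for c in range(4):
                    manual = offset((tr * 2 + r, tc * 4 + c), row_major)
                    if offset((tr, tc, r, c), tiled_stride) != manual:
                        return False
    return True
```

the point is that the `tr * 2 + r` style arithmetic never has to be written by hand; it is already encoded in the strides of the tiled view.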
learning QR decomposition for the GPU MODE competition, but I notice it’s so much more difficult to learn in a TUI than by just opening up a chat with an LLM and seeing the LaTeX and everything else directly inline. GUIs > TUIs.