I remembered that I have a favorite programming benchmark, the one I previously used to evaluate programming languages, and that I can now use it to evaluate coding models! I'm talking, of course, about the ICFP Programming Contest 2006 virtual machine.
So I asked models to read https://t.co/CTOdGyYx8T and to implement the UM virtual machine capable of running `sandmark.umz`.
I've made a few runs with a few models that are roughly in the same price range on Opencode Zen / Go: DeepSeek V4 Pro, GLM-5.1, Kimi-2.6, MiniMax-M3, Claude Haiku 4.5 and GPT-5.4 mini (well, DeepSeek is way cheaper, actually).
So, GLM, DeepSeek, Kimi: usually use C, usually get pretty close, but then invariably get confused by the "self-decompressing" wording on the web page, which makes them run `sandmark` not directly on their VM (as they should), but using `https://t.co/ltjWFRa4Al` (UM emulator written in UM) -- which, of course, is terribly slow. Then they get stuck trying random useless ideas to speed things up. I had to interrupt them at that point.
MiniMax-M3: usually uses C, is never confused about "self-decompressing", but invariably screws up op 13. Sometimes it manages to dig itself out of that hole and delivers good results; other times it finds a way to dig itself deeper with some other mistakes.
Haiku just did the whole thing in Python (with terrible performance) and with some additional error causing `sandmark` to terminate early without producing full output. Haiku then proudly declared that it had successfully completed its task.
And GPT-5.4 mini just one-shots the whole thing every time, in Go or Rust, pretty fast, too -- definitely making it look too easy. It's a clear winner, and it's not close.
I tried warning DeepSeek and Kimi about `https://t.co/ltjWFRa4Al`; after that, DeepSeek one-shotted the task pretty fast, too, while Kimi decided that it had to ignore my advice, and ended up stuck trying useless performance tricks again.
I tried warning MiniMax-M3 about op 13, but it found some new way to screw up the implementation.
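For context on why op 13 is an easy place to slip: as I read the UM spec, it is the one operator with its own bit layout. The sketch below shows only the decoding step, under that reading. The names `decode`, `a`, `b`, `c` and `value` are mine and purely illustrative, and this is not a reconstruction of what any of the models actually got wrong.

```python
# Minimal sketch of UM instruction decoding, based on my reading of the spec.
# All names here are illustrative; this is not a full VM.

def decode(word: int) -> dict:
    """Split one 32-bit UM instruction word into operator and operands."""
    op = (word >> 28) & 0xF  # operator number lives in the top 4 bits
    if op == 13:
        # Op 13 (load immediate) is special: register A sits in the 3 bits
        # right below the operator number, and the low 25 bits are the value.
        return {"op": op, "a": (word >> 25) & 0x7, "value": word & 0x1FFFFFF}
    # Every other operator packs registers A, B, C into the low 9 bits.
    return {
        "op": op,
        "a": (word >> 6) & 0x7,
        "b": (word >> 3) & 0x7,
        "c": word & 0x7,
    }


# Usage: build an op 13 word by hand and decode it back.
word = (13 << 28) | (5 << 25) | 12345
print(decode(word))
```

Decoding op 13 with the standard layout (taking A from bits 6-8) silently yields a wrong register instead of crashing, which would make that kind of bug hard to spot from the outside.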
So yes, we have our irreconcilable differences with China about rights/freedoms. It makes it easy to forget that it *is* one of the great civilizations, and that it's one of those that are very much based on a huge corpus of written texts -- which LLMs thrive on.
Case in point: I have a medical condition for which I find the Traditional Chinese Medicine view quite helpful. I would argue that it might be better to discuss that part with something like DeepSeek or Kimi than with whatever Western model is white-hot these days.
Makes me wonder what other topics are like this. I suppose Chinese models would be more familiar with e.g. Chinese classical Chan texts ("zen" before it came to Japan). I'd also expect them to understand those texts better, since they are notoriously hard to translate, and Chinese models would just have a deeper understanding of the language of the original, see more original comments, etc.
Of course, there must be many other topics like this.
Hot take: quite often, the most useful mode of using an AI coding agent is pair programming mode. That's strong evidence that the most useful mode of using a human coding agent is often also pair programming mode. The main difference is that humans are so damn expensive.
Do I enjoy the low-key ongoing struggle of keeping my everyday life and the space around me orderly and well-organized? No. But when it works, when I manage to get that predictable, efficient, frictionless flow -- do I enjoy *that* enough to make it all worth it? Also no. But ...
...still, when it works, when I'm not bogged down by the friction of the mundane, not decision-fatigued from a thousand everyday nanodecisions, free to give it all to pursuing my big life goals -- do I then get the amazing results that *are*, indeed, the ultimate joy? Also no.
@arkenoi How on earth did you, Petka, sink so low as to ask me, your battle commander, why the people jamming GPS aren't falling over themselves to build a system that is like GPS but can't be jammed?
@TheCinesthetic Rewatched this scene 4-5 times initially because I was absolutely sure there must be a moment there when shadows or background features behind Vader look like two giant mouse ears around his head. There isn't one. Unreal self-restraint from Gareth Edwards.
@adworse @BTobotras For that kind of thing I once built myself a setup that locally ran inotify + rsync to push the source to the remote, and at the same time transparently forwarded the nrepl socket to that same remote.
@jurbed I absolutely do agree, on some level, in some sense, that everyone deserves a lot of things. It's just that we very often don't get what we deserve. Which is sad, but has nothing to do with one's "right" to take something from others by force.
@oleksandr_now then you surrender your chance to influence which parts of "you" die and when and how and which ones get to keep on living. you can get a lot of mileage from properly dying just the right amount all the time.
@nikitonsky The idea is that the last Terminator has got her a promotion deal so good she'll either have no time to have a kid or too much money to raise him to be a good leader of the Resistance.
@TomasForgac Oh yeah, my wife does that. And if, God forbid, I hesitate to pick one for a few seconds, she's gonna get nervous and add another 7 options.