@WenjieYang00 Agreed, though I think at least part of the issue is aggressive post-training for tool use, which masks some vision capabilities.
Thanks for reading:)
For a while now, I've wanted to check how well frontier models can look at an image and handwrite on top of it. So I did a thing:
InkSlop: Vibe-coded benchmark for Spatial Reasoning with Digital Ink
Link and key points below👇
The finding: models really prefer tool use to actually looking. And when the tools are taken away, performance drops for most task and model combinations (e.g., 0% on mazes).
@giffmana Mostly Fig. 4, but sometimes Fig. 6 for a bunch of trivial parallel changes (because switching in tmux >> switching sessions in CC extension)
@NandoDF Depends on the performance on the task, whether this is the largest model, and how easy it is to synthesize the prompts, I guess?
SFT for the most part if the model has not seen a lot of this type of data before; distill a larger model, maybe with synthetic prompts if doable; RL otherwise?
Well, we actually did it. We digitized scent. A fresh summer plum was the first fruit and scent to be fully digitized and reprinted with no human intervention. It smells great.
Holy moly, I’m still processing the magnitude of what we’ve done. And yet, it feels like as we cross this finish line we are instantly at a new starting line. I’ll have more to share about what’s in store that we’re building on top of this.
A huge HUGE congrats to the entire team across scientific, engineering, operational, and creative disciplines. It takes a village named Osmo to do this.
I don’t know if this is embarrassing, but I carry the plum scent with me a lot of places and smell it constantly. It makes me smile.
I’m curious: do y’all want to smell it? If we made a limited-release fragrance of the first teleported scent and dedicated the proceeds to science, would you want it?
Today we describe a model taught to read and write so it can extract and digitize the strokes of handwriting without the need for specialized equipment. It then outputs realistic looking digital handwriting that can be handled like standard digital text. https://t.co/y3hj54ONkP
@francoisfleuret Isn't this somewhat identical to having one layer do f(x)=-x, and the second one writing down some completely new content there? Is the hypothesis that having zeroes in the mask makes it easier from the optimization perspective somehow?