if you're a vibe coder trying to promote your app, brace yourself, you'll see these on loop:
what are you building this weekend?
what are you building?
sell me your product in one sentence
sell me your product in one word
how would you explain your startup to a 5 year old
anyway, what are you building this weekend?
@supabase A picture is worth a thousand words. To an AI agent, the raw screenshot costs ~1,500 tokens and the same picture as annotated JSON costs ~700. Same thousand words, less than half the cost.
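A quick check of the arithmetic in that reply, as a rough sketch. The two token counts are the approximate figures quoted above, not measurements; real counts depend on the model, image size, and how much JSON you emit.

```python
# Approximate figures quoted in the post; treat them as assumptions, not benchmarks.
RAW_SCREENSHOT_TOKENS = 1500
ANNOTATED_JSON_TOKENS = 700


def cost_ratio(json_tokens: int, image_tokens: int) -> float:
    """Return the JSON cost as a fraction of the raw screenshot cost."""
    return json_tokens / image_tokens


ratio = cost_ratio(ANNOTATED_JSON_TOKENS, RAW_SCREENSHOT_TOKENS)
saved = RAW_SCREENSHOT_TOKENS - ANNOTATED_JSON_TOKENS

print(f"JSON is {ratio:.0%} of the screenshot cost, saving {saved} tokens per capture")
```

700 / 1,500 comes out just under half, which is where "less than half the cost" comes from.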
@zach__johnston @figma Figma pro here. why wait for figma though. best part of 2026 is you can build the exact tool yourself. i did (https://t.co/OVmSi1nS9j, screenshots not recording). a screen recorder is very doable solo
@ApplyWiseAi @Jchammond_ vision model, not OCR. it forms an impression of the pixels instead of extracting text, so it's usually right but not guaranteed, and it slips on dense or similar UI. wrote up the difference here: https://t.co/IyEVHrzMGq
@howdycarter You just described the schema I built https://t.co/OVmSi1nS9j around, screenshot + extracted text + source app + region + intent. The one Codex should cite back before acting: which element it thinks you meant. That's the assumption that's wrong most often on a busy screen.
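A minimal sketch of what a capture record with those five fields could look like. The field names, the `Region` shape, and the `cite_back_prompt` helper are all hypothetical, made up here for illustration; this is not the actual schema behind the linked tool.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class Region:
    # Pixel rectangle of the marked area (hypothetical shape).
    x: int
    y: int
    width: int
    height: int


@dataclass
class Capture:
    # The five pieces named in the post; names are illustrative assumptions.
    screenshot_path: str
    extracted_text: str
    source_app: str
    region: Region
    intent: str


def to_payload(capture: Capture) -> str:
    """Serialize the capture to JSON an agent could read instead of raw pixels."""
    return json.dumps(asdict(capture), indent=2)


def cite_back_prompt(capture: Capture) -> str:
    """Ask the agent to state which element it thinks was meant before acting."""
    return (
        "Before making changes, name the element you think this refers to.\n"
        f"App: {capture.source_app}\n"
        f"Marked text: {capture.extracted_text!r}\n"
        f"Intent: {capture.intent}"
    )


example = Capture(
    screenshot_path="capture.png",
    extracted_text="Save changes",
    source_app="Safari",
    region=Region(x=120, y=340, width=96, height=32),
    intent="make this button the primary style",
)

print(to_payload(example))
print(cite_back_prompt(example))
```

The cite-back step is the cheap part: one extra line in the prompt, and the wrong-element assumption surfaces before any edit is made.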
@GideonShalwick @0xShoosh Smart hack. The visual log is Codex rebuilding layout it could've been handed. You already annotate in a separate app. If those marks came in as data (element + position + intent), it'd skip the log and act on the exact element. That's the bit I build: https://t.co/ra5WkjYVPP
@markproduct Depends. 1-3 products, sure, build them yourself. But with money and 100 ideas you can't ship them all in parallel, you're one founder, not a render farm. A designer with taste earns their keep there. @hakan_ertann nailed it. Future = AI operators with taste :)
@robjama Screenshots cut the yapping, agreed. A marked one cuts the rest: a raw screenshot still makes the model guess which part you meant. Mark the element + one line of intent and there's nothing left to explain. CleanShot gets the picture, the mark removes the guess. https://t.co/ra5WkjYVPP
@ramonpiano_ you've got Shottr + Cursor + Codex but nothing bridging them. SlimSnap turns an annotated screenshot into JSON your agent reads, so it changes the exact element you marked instead of guessing from pixels. free, Mac: https://t.co/ra5WkjYVPP (I make it, so biased)
Ah, my mistake, I read it as the diff loop. Scenario 1 with 0 feedback calls actually makes the number more interesting: 270k for one page with no re-screenshotting at all, just the one-shot agent rebuilding the whole structure from a flat reference. That's the input cost, not the loop cost. And agreed on production looking different. I think that's the biggest lever right there: feed the agent structure instead of pixels and the one-shot gets cheaper before you even add a diff.
The part founders underestimate is that Claude is great at execution and terrible at reading your mind. On a busy screen it genuinely can't tell which button you mean. The ones getting real output learned to point at the exact element and spell out what changes. That's the "guidance" you're describing. A designer still wins on judgment. Claude wins on speed, but only once you're that specific.