Shout out for this project. I made an app last week that uses the core idea contained here. To drill into a topic, you send a VLM the whole image along with a zoomed-in second image that has a visible target overlay showing specifically where you are pointing. It works really well with the latest ChatGPT model since it can render infographics perfectly.
I wrote a program that lets you click on any image and get more information about, or an explanation of, the specific region you clicked. If you want text returned, you can use a Gemma model locally. If you want an amazing image returned, send it to Vision 2 from OpenAI and get back an infographic.
Works with handwritten text too.
In this example, clicking on a portion of the image (blue) returns more information about that particular topic (right).
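For anyone who wants to try the two-image idea, here is a rough sketch of the geometry only: given a click, work out which region to crop for the zoomed-in second image and where the target marker sits inside that crop. The names (`cropAroundClick`, `markerInCrop`, `cropSize`) are made up for illustration and are not taken from the app described above; drawing the overlay and calling the model are left out.

```typescript
interface Point { x: number; y: number; }
interface Rect { x: number; y: number; width: number; height: number; }

// Keep a value inside [min, max].
function clamp(value: number, min: number, max: number): number {
  return Math.min(Math.max(value, min), max);
}

// Hypothetical helper: a crop window of at most cropSize x cropSize pixels,
// centered on the click and shifted where needed so it stays inside the image.
function cropAroundClick(
  click: Point,
  imageWidth: number,
  imageHeight: number,
  cropSize: number,
): Rect {
  const width = Math.min(cropSize, imageWidth);
  const height = Math.min(cropSize, imageHeight);
  const x = clamp(Math.round(click.x - width / 2), 0, imageWidth - width);
  const y = clamp(Math.round(click.y - height / 2), 0, imageHeight - height);
  return { x, y, width, height };
}

// Hypothetical helper: the click position in crop coordinates, which is
// where the visible target marker would be drawn on the zoomed image.
function markerInCrop(click: Point, crop: Rect): Point {
  return { x: click.x - crop.x, y: click.y - crop.y };
}
```

Near an image edge the crop is shifted rather than shrunk, so the marker is no longer centered in the zoomed image; that is why the marker position is computed separately instead of assumed to be the middle of the crop.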
My dear front-end developers (and anyone who's interested in the future of interfaces):
I have crawled through depths of hell to bring you, for the foreseeable years, one of the more important foundational pieces of UI engineering (if not in implementation then certainly at least in concept):
Fast, accurate and comprehensive userland text measurement algorithm in pure TypeScript, usable for laying out entire web pages without CSS, bypassing DOM measurements and reflow
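To make the concept concrete, here is a minimal sketch of what laying out text in userland can look like: greedy line breaking driven by a caller-supplied width function, with no DOM reads. This is an illustration of the general idea only, not the announced library's API or algorithm; `breakLines`, `MeasureFn` and the fixed-width `monospace` measurer are hypothetical names and assumptions.

```typescript
// Assumed signature: returns the rendered width of a string in pixels.
type MeasureFn = (text: string) => number;

// Greedy line breaking: keep adding words to the current line until the
// next word would push it past maxWidth, then start a new line.
// A single word wider than maxWidth is placed on its own line unbroken.
function breakLines(text: string, maxWidth: number, measure: MeasureFn): string[] {
  const words = text.split(/\s+/).filter((w) => w.length > 0);
  const lines: string[] = [];
  let current = "";
  for (const word of words) {
    const candidate = current === "" ? word : current + " " + word;
    if (current !== "" && measure(candidate) > maxWidth) {
      lines.push(current);
      current = word;
    } else {
      current = candidate;
    }
  }
  if (current !== "") lines.push(current);
  return lines;
}

// Stand-in measurer for the example: every character is 8 pixels wide.
const monospace: MeasureFn = (text) => text.length * 8;

const lines = breakLines("the quick brown fox", 80, monospace);
// lines is ["the quick", "brown fox"]
```

In a browser, one option for a real measurer is the canvas `measureText(text).width` call, which returns a width without touching page layout. Real text layout also has to handle things this sketch ignores, such as kerning across word boundaries, hyphenation and bidirectional text.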
@_chenglou What about accessibility and WCAG? I tried the demo. The page had no structure. How will this work for people who rely on the assistive technology the browser provides?
Yesterday Google released a massive update to kv memory. 12 hours later there is a working implementation on YouTube that demonstrates the concept works. That's massively fast!!
I'm being asked to submit a proposal for a conference in four months.
What am I possibly going to say?