Find out more about me and my award-winning charity work creating Covid Tech Support, and have a play with an interactive demo of the in-browser AI I spent 5 years working on @ContextScout at https://t.co/6KFtJFpBBY
The Dev Mode MCP Server is now available in beta. Access Dev Mode data directly in your agentic coding workflow
→ VS Code with Copilot
→ Cursor
→ Windsurf
→ Claude Code
🚀 Exciting news! The Moon Lunar Landscape has been featured in an article on @SPACEdotcom! 🌕✨
https://t.co/TyvdgSh6qY
Bringing this lunar vision to life in LEGO bricks has been an incredible journey. Grateful for the support of this amazing community!
Excited to join @figma as a Product Manager, working on Dev Mode with a focus on Design to Code. Looking forward to advancing AI-empowered design-to-code workflows with Figma’s amazing tools. Thrilled to be part of such a talented team! #ProductManagement #AI #DesignToCode
"The Moon: Lunar Landscape" 🌖 a Lego Art space poster available now on @LEGOIdeas
Add your support https://t.co/sBTYhqKy4s and retweet to help get it to 10k and make it a real set! #LEGO #space #moon #AFOL #legoideas
You, Me and the Moon 🌙👩❤️👨🌃 My final entry in the Lego Ideas Picture Perfect Memories challenge. A brick built Polaroid of the moon hanging over a city skyline at night during a first date. If you like it, head over to the Lego Ideas page and comment 😊 https://t.co/zARLeO2fQl
@joshm @browsercompany Super excited to see this @joshm, my company attempted to do this 7-8 years ago but the technology simply wasn't there at the time (you can read more here https://t.co/lTOVpC88G3). Can't wait to get my hands on this.
Family Fireworks 🎆👪 My latest entry in the @LEGOIdeas Picture Perfect Memories challenge. A brick built Polaroid of fireworks exploding over excited onlookers. If you like it, head over to the Lego Ideas page and comment 😊 https://t.co/Ok8u9Ptd6O
#lego #ideas #legoideas #afol
Surf's Up! 🏄☀️ My new submission to the @LEGOIdeas Picture Perfect Memories challenge. A brick built Polaroid photo of the sun setting over crashing waves. What do you think?
Check it out on https://t.co/bAGEK0d8IH
#lego #legoideas #polaroid #afol #picture #perfect #memories
Generalist web agents may get here sooner than we thought---introducing SeeAct, a multimodal web agent built on GPT-4V(ision).
What's this all about?
> Back in June 2023, when we released Mind2Web (https://t.co/eF4ZzVrP7S) and envisioned the generalist web agent, a language agent that can work out of the box on any given website, my projection was that it would take at least several years to see such an agent that is anywhere near usable in practice.
> Why wouldn't I? The most powerful LLM at the time (and perhaps still today), GPT-4, was pretty terrible at this: its end-to-end success rate was around 2% (!!). The HTML of modern websites is too long and noisy for LLMs. It's like finding a needle in a haystack. And a long-horizon task can take 10+ actions, so an LLM needs to successfully find 10+ "needles" in a row (!!!) to complete a task.
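To make the "needles in a row" point concrete, here is a small sketch. It assumes every step succeeds independently with the same probability, which is a simplification that the post does not state, and the per-step accuracies below are illustrative values, not measurements from Mind2Web or any other benchmark.

```python
def task_success_rate(step_accuracy: float, num_steps: int) -> float:
    """Probability of getting every step right in a row.

    Assumes independent steps with identical accuracy (illustrative only).
    """
    if not 0.0 <= step_accuracy <= 1.0:
        raise ValueError("step_accuracy must be between 0 and 1")
    if num_steps < 0:
        raise ValueError("num_steps must be non-negative")
    return step_accuracy ** num_steps


# Hypothetical per-step accuracies for a 10-action task.
for p in (0.9, 0.7):
    print(f"per-step accuracy {p:.0%}, 10 steps -> {task_success_rate(p, 10):.1%}")
# per-step accuracy 90%, 10 steps -> 34.9%
# per-step accuracy 70%, 10 steps -> 2.8%
```

Under these assumptions, even a model that picks the right element 9 times out of 10 finishes only about a third of 10-step tasks, which is why end-to-end success falls so quickly on long-horizon tasks.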
What's changed in just a few months?
> Large multimodal models. The end of 2023 marked a major milestone for LMMs, with GPT-4V, Gemini, and many good OSS LMMs released.
> Multimodal web agents. Websites are designed to be visually rendered and consumed. Visuals are much cleaner and more intuitive than HTML, and 10x more efficient in terms of token counts. Plus, a pretty unique property of websites is that we have the correspondences between visual elements and HTML code! Such perfectly aligned multimodality is a gold mine for modeling.
> Online evaluation. The final piece of the secret recipe is online evaluation on live websites. Mind2Web initially only supported offline eval on cached websites. We developed a new tool to support running and evaluating web agents on live websites. Both LLMs and LMMs get a big boost, because now they don't have to follow exactly the reference plan in offline eval but are rather free to explore alternative plans to achieve the same goal.
SeeAct
> SeeAct is a generalist web agent built on LMMs like GPT-4V. Specifically, given a task on any website (e.g., “Compare iPhone 15 Pro Max with iPhone 13 Pro Max” on the Apple homepage), the agent first performs action generation to produce a textual description of the action at each step towards completing the task (e.g., “Navigate to the iPhone category”), and then performs action grounding to identify the corresponding HTML element (e.g., “[button] iPhone”) and operation (e.g., CLICK, TYPE, or SELECT) on the webpage.
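The two stages described above can be sketched roughly as follows. This is not SeeAct's code: the names `Element`, `GroundedAction`, `generate_action`, and `ground_action` are hypothetical, and both stages are replaced by trivial stubs (a canned description and a keyword match) where a real agent would query an LMM such as GPT-4V.

```python
from dataclasses import dataclass
from typing import List, Optional

# Operations named in the post; a real agent may support more.
OPERATIONS = ("CLICK", "TYPE", "SELECT")


@dataclass
class Element:
    tag: str   # e.g. "button"
    text: str  # visible label, e.g. "iPhone"


@dataclass
class GroundedAction:
    element: Element
    operation: str
    value: Optional[str] = None  # text to type or option to select, if any


def generate_action(task: str, step: int) -> str:
    """Stage 1, action generation (stub).

    A real agent would prompt an LMM with the task and a screenshot;
    here the textual description is canned for illustration.
    """
    canned = ["Navigate to the iPhone category"]
    return canned[step] if step < len(canned) else "STOP"


def ground_action(description: str, candidates: List[Element]) -> Optional[GroundedAction]:
    """Stage 2, action grounding (stub).

    Picks the first candidate whose label appears as a word in the
    description. A real agent would ask the LMM to choose the element
    and the operation.
    """
    words = description.lower().split()
    for element in candidates:
        if element.text.lower() in words:
            return GroundedAction(element=element, operation="CLICK")
    return None


candidates = [Element("button", "Mac"), Element("button", "iPhone")]
description = generate_action("Compare iPhone 15 Pro Max with iPhone 13 Pro Max", step=0)
action = ground_action(description, candidates)
print(f"[{action.element.tag}] {action.element.text} -> {action.operation}")
# [button] iPhone -> CLICK
```

Keeping the two stages as separate functions mirrors the split in the post: the description can be correct even when grounding picks the wrong element, so each stage can be evaluated on its own.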
Main results
> SeeAct can successfully complete up to 50% of tasks on live websites, substantially outperforming GPT-4 (20%) and FLAN-T5 (18%), if oracle action grounding is provided.
> However, grounding is still a major challenge. It turns out that GPT-4V can often accurately describe in text what action should be taken, but has trouble grounding the action to the exact HTML element and operation on the webpage. Existing grounding strategies like set-of-mark prompting turn out to be not very effective for web agents. Our best grounding strategy leverages the correspondences between visuals and HTML.
> SeeAct w/ GPT-4V shows many interesting capabilities such as speculative planning, world knowledge (e.g., airport codes), and some sort of "world model" (for websites at least), in that it can correctly predict the state transitions on a website (e.g., what would happen if I click this button).
Fun fact
Initially we were hoping to show that even GPT-4V would still be insufficient for generalist web agents and we may still need fine-tuning, but we kept getting blown away by its incredible capability as a web agent. Such pleasant surprises are why I enjoy doing AI research so much these days. I also look forward to testing Gemini Ultra and seeing whether its strong performance on MMMU would transfer.
Conclusion
Practically useful web agents could be coming soon. Buckle up and start thinking about what new applications will be enabled.
📌Website: https://t.co/r9v8eRSseY
📌Paper: https://t.co/SLESONX8rt
📌Code: https://t.co/79swHiS2J2
Work led by my amazing students @boyuan__zheng @BoyuGouNLP from @osunlp, joint with Jihyung Kil and @hhsun1. Hire them for internships!
Required reading for PMs further down the food chain in a corporate hierarchy looking to get things done: "Switch: How to change things when change is hard" by Chip and Dan Heath.
Time to herd some elephants... 🐘 #ProductManagement
Does anyone else use ChatGPT for rubber duck debugging of problems? Articulating my issue to it often leads me to the solution, plus I get GPT's insights 🦆💡 #ChatGPT #ProblemSolving
This app is Shazam, but to understand why a baby is crying.
It records and analyzes more than 20,000 sounds of crying babies, and gives you the reason why your baby is crying in just 5 seconds.
Just finished reading "The Build Trap" by @lissijean. It made a big impact on me, with practical insights I'll actually use. Highly recommend for anyone looking to rethink their own and their org's approach to #ProductManagement 👏