Winnie Yeung

@vionwinnie

MLE @ Adobe. Table tennis player and serving 2 cats daily.

Bay Area, CA

Joined June 2009

262 Following

50 Followers

326 Posts

vionwinnie retweeted

Andrew Lampinen @AndrewLampinen

23 days ago

What are the real problems to be solved in continual learning? In my latest post, I tackle this question — reviewing where I think the field went astray in the past, how language models changed things, and where the real challenges remain. 1/2

https://t.co/i2F3SibcHG

754

825

111K

vionwinnie retweeted

Yongyuan Liang @cheryyun_l

23 days ago

I have to say I totally agree that VLA training is a multi-objective optimization problem balancing VL capacity and action/trajectory generation...

173

108

25K

vionwinnie retweeted

AVB

@neural_avb

2 months ago

Extraordinary scenes on my TL

Guys there’s more than one way to optimize AI on a task. If you’re working on harnesses try to slowly add all these in your bag.

The classic way is to update the weights (RL)… The modern way is to optimize prompts/context (DSPy optimizers/GEPA)… and the hypermodern way is to self-evolve the codebase itself (auto research/alpha-evolve/darwin-godel variants)

All of them need an eval dataset of prompts/task scenarios, a rubric of success, and an initial forward pass (harness+model) to learn. They just update different things to get your system to better evals.

There’s nuance to each. There’s a time and place for all of them.
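A minimal sketch of the shared recipe described above: an eval set, a rubric, and a forward pass, where only the thing being updated changes. Here the updated object is the prompt, chosen by best eval score. `toy_model`, `EVAL_SET`, and the candidate prompts are hypothetical stand-ins, not DSPy or GEPA APIs.

```python
from operator import add, sub
from typing import List, Tuple

# Hypothetical eval dataset: (task, expected answer) pairs.
EVAL_SET: List[Tuple[str, str]] = [("2+2", "4"), ("3+5", "8"), ("10-7", "3")]

def toy_model(prompt: str, task: str) -> str:
    """Stand-in for harness + model: it only does arithmetic if the prompt says 'compute'."""
    if "compute" not in prompt.lower():
        return task  # a weak prompt makes the toy model echo the task back
    for symbol, fn in (("+", add), ("-", sub)):
        if symbol in task:
            left, right = task.split(symbol)
            return str(fn(int(left), int(right)))
    return task

def rubric(output: str, expected: str) -> float:
    """Rubric of success: exact match scores 1.0, anything else 0.0."""
    return 1.0 if output.strip() == expected else 0.0

def evaluate(prompt: str) -> float:
    """Forward pass over the whole eval set, averaged under the rubric."""
    return sum(rubric(toy_model(prompt, t), y) for t, y in EVAL_SET) / len(EVAL_SET)

def optimize_prompt(candidates: List[str]) -> Tuple[str, float]:
    """Simplest possible prompt optimizer: keep the best-scoring candidate."""
    best = max(candidates, key=evaluate)
    return best, evaluate(best)

best_prompt, best_score = optimize_prompt(["Repeat the input.", "Compute the result."])
print(best_prompt, best_score)
```

Swapping `optimize_prompt` for a weight update or a code mutation step would leave `EVAL_SET`, `rubric`, and `evaluate` untouched, which is the point the post makes.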

270

241

19K

vionwinnie retweeted

Oier Mees @oier_mees

3 months ago

The recording of @xiao_ted's guest lecture on "Three Eras of Robot Learning" at @ETH is now live on YouTube! He shares unique insights on scaling physical AGI, drawing from his experience at @GoogleDeepMind.
📽️ YouTube: https://t.co/9NiKJaUPyo
📚 Course: https://t.co/QJcfXJRfX8 https://t.co/6zSwCSuWSR

231

203

10K

vionwinnie retweeted

Tongzhou Mu 🤖🦾🦿 @tongzhou_mu

3 months ago

Everyone is talking about "World Models" for robotics, following the buzz from GTC 2026.

But the research landscape is shifting so fast it’s difficult to keep up.

In my view, here are the two dominant paradigms currently grounding the video world models in robot control.

---

Paradigm 1: Use the Video Model as a Simulator

The first major approach is using video world models to simulate reality. In this framework, the model predicts "what happens next" in either pixel space or latent space, conditioned on text prompts or robot actions. Much like traditional analytical simulators (e.g., IsaacSim, MuJoCo, ManiSkill), these learned simulators are used for data synthesis, planning, and evaluation.

1.1 Synthesizing Data for Policy Training

A representative work is DreamGen [1]. Given an initial frame and a language instruction, a fine-tuned video model synthesizes clips of a robot completing a task. An inverse dynamics model then labels these videos with actions to train a separate robot policy. GR00T N1 [2] uses a similar strategy. Alternatively, models can act as interactive simulators where agents (like UniSim [4]) or humans (like Interactive World Simulator [3]) generate data through interaction.
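As a rough illustration of the labeling step only, the sketch below treats a "frame" as a 1-D gripper position and uses a hand-written inverse dynamics function. Both are assumptions made for the example; a real inverse dynamics model is learned and works on images, and this is not DreamGen's implementation.

```python
from typing import List, Tuple

def inverse_dynamics(frame: float, next_frame: float) -> float:
    """Hypothetical IDM: infer the action that moved the scene from frame to next_frame.
    In this 1-D toy the action is simply the displacement between frames."""
    return next_frame - frame

def label_clip(frames: List[float]) -> List[Tuple[float, float]]:
    """Turn an action-free clip into (observation, action) pairs for policy training."""
    return [(f, inverse_dynamics(f, nxt)) for f, nxt in zip(frames, frames[1:])]

# A synthesized clip (stand-in for video model output) becomes a labeled dataset.
synthetic_clip = [0.0, 0.5, 1.5, 1.5]
dataset = label_clip(synthetic_clip)
print(dataset)
```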

Key Advantages: Thousands of hours of "synthetic experience" at a lower cost and the ability to safely simulate rare, dangerous edge cases.

1.2 Inference-Time Planning

Instead of following a fixed path, robots can use video models to "imagine" multiple future outcomes. In V-JEPA 2 [5], an action-conditioned video model evaluates different action sequences to find the best next step. This "imagination-based planning" is also a core theme in CLASP [6], SWIM [7], VLP [8], GPC [9], DreamDojo [10], and Cosmos Policy [11]. The challenge remains fitting this heavy computation into real-time control budgets.
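A minimal sketch of imagination-based planning, assuming a toy 1-D world where the state is a position and `world_model` is a hand-written stand-in for a learned action-conditioned model. Exhaustive search over a small discretized action set stands in for whatever optimizer a given paper actually uses.

```python
from itertools import product
from typing import Callable, Sequence

ACTIONS = (-1.0, -0.5, 0.0, 0.5, 1.0)  # hypothetical discretized action set

def world_model(state: float, action: float) -> float:
    """Hypothetical learned dynamics; here a toy point mass, x' = x + a."""
    return state + action

def imagined_cost(state: float, actions: Sequence[float], goal: float,
                  model: Callable[[float, float], float]) -> float:
    """Roll the model forward and sum the distance to the goal at every imagined step."""
    cost = 0.0
    for a in actions:
        state = model(state, a)
        cost += abs(goal - state)
    return cost

def plan_next_action(state: float, goal: float,
                     model: Callable[[float, float], float], horizon: int = 3) -> float:
    """Score every candidate action sequence in imagination, return the first action of the best."""
    candidates = product(ACTIONS, repeat=horizon)
    best = min(candidates, key=lambda seq: imagined_cost(state, seq, goal, model))
    return best[0]

# Receding-horizon control: plan, execute one action, observe, re-plan.
state, goal = 0.0, 2.0
for _ in range(5):
    state = world_model(state, plan_next_action(state, goal, world_model))
print(state)
```

Even this toy scores 125 sequences per control step, which hints at why fitting such search into a real-time budget is hard once each rollout is a video model call.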

1.3 Policy Evaluation

Video models allow us to test policies before they ever touch physical hardware. Veo Robotics [12] demonstrates that these models can accurately predict relative performance and perform "red teaming" to expose safety violations. This approach is also seen in IRASim [13], 1XWM [14], Ctrl-World [15], and others.
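To make "predicting relative performance" concrete, here is a toy sketch that ranks two policies inside a simulator and on "hardware". `learned_sim`, `real_world`, and both policies are hypothetical, and the matching ranking holds by construction in this example; it is not evidence about any real model.

```python
from typing import Callable, Dict, List

Policy = Callable[[float, float], float]
StepFn = Callable[[float, float], float]

def learned_sim(state: float, action: float) -> float:
    """Hypothetical world model: believes actions are applied perfectly."""
    return state + action

def real_world(state: float, action: float) -> float:
    """Hypothetical hardware: only 80% of each commanded motion is realized."""
    return state + 0.8 * action

def final_error(policy: Policy, step: StepFn, goal: float = 1.0, steps: int = 3) -> float:
    """Run a policy for a few steps and report the remaining distance to the goal."""
    state = 0.0
    for _ in range(steps):
        state = step(state, policy(state, goal))
    return abs(goal - state)

POLICIES: Dict[str, Policy] = {
    "lazy": lambda s, g: 0.1 * (g - s),
    "direct": lambda s, g: 1.0 * (g - s),
}

def rank(step: StepFn) -> List[str]:
    """Order policies from best to worst under a given environment."""
    return sorted(POLICIES, key=lambda name: final_error(POLICIES[name], step))

sim_ranking = rank(learned_sim)
real_ranking = rank(real_world)
print(sim_ranking, real_ranking)
```

The simulator gets the absolute errors wrong but still orders the policies correctly, which is the weaker property that policy evaluation relies on.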

Summary of Paradigm 1: While powerful, there is no "free lunch." These methods depend on prediction accuracy. Our physical world is complex, and teaching video models to handle every edge case without hallucinating physics remains a significant challenge.

---

Paradigm 2: Use the Video Model as a Policy

The second, more integrated paradigm is using the generative video model as the policy (decision-maker) itself. Because the native outputs are videos rather than robot actions, several methods have been developed to obtain control signals.

2.1 Generating Video and Action Jointly

A straightforward idea is to add an action decoder to the video model backbone and run video and action denoising jointly during inference. Representative works include DreamZero [16], Cosmos Policy [11], Motus [17], PAD [18], GR-1 [19], and GR-2 [20] (note that the GR series are not diffusion models). This method leverages the rich spatiotemporal priors of pre-trained models with minimal architecture changes.

2.2 Extracting Visual Representations for Action Generation

Rather than full generation, many methods use video models to extract deep visual representations to guide action generation. Example works include VPDD [21], VPP [22], UVA [23], UWM [24], Video Policy [25], and DiT4DiT [26]. A major advantage here is that you don’t necessarily need to run multiple denoising steps on giant models, making real-time control easier, though it remains unclear if the full potential of the video models is being utilized.

2.3 Open-loop Video Generation + Video-to-Action Translation

A rising trend involves generating a "desired future" video and using a separate inverse dynamics model to translate that video into actions. UniPi [27] pioneered this, followed by This&That [28], TesserAct [29], and 1XWM Self-Learning [30]. Some methods generate videos of humans completing tasks (Dreamitate [31], Gen2Act [32], LVP [33]) and translate those to robot actions. This approach allows video models to do exactly what they were trained for: video generation.

2.4 Closed-loop Video Generation + Video-to-Action Translation

Open-loop generation often leads to hallucinations: the model might "see" the robot picking up an apple that isn't actually there. Closed-loop generation avoids this by constantly conditioning on the latest real-world observations, replacing generated frames with real ones in the next call. Recently, mimic-video [34] and LingBot-VA [35] reached real-time speeds using KV caching and partial denoising. Most notably, the DVA [36] model released this month manages real-time generation with full video denoising, which means denoising pure noise all the way to clean video for every step. This approach seems really promising to me, because it reduces robot control to a problem of real-time video generation, which can directly benefit from large-scale video pre-training.
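A 1-D analogy of the open-loop versus closed-loop difference, assuming a generator whose internal model (`model_step`) is slightly wrong about the real world (`real_step`). Both functions and the 0.8 slip factor are hypothetical. The only change between the two modes is whether the next call conditions on the generator's own last prediction or on the latest real observation.

```python
def model_step(state: float, action: float) -> float:
    """What the generator believes will happen."""
    return state + action

def real_step(state: float, action: float) -> float:
    """Hypothetical real world: only 80% of each commanded motion is realized."""
    return state + 0.8 * action

def run(goal: float = 1.0, steps: int = 5, closed_loop: bool = True) -> float:
    """Return the final distance to the goal after executing the generated plan."""
    real = imagined = 0.0
    for t in range(steps):
        remaining = steps - t
        # Closed loop conditions on the real observation, open loop on the imagined one.
        context = real if closed_loop else imagined
        action = (goal - context) / remaining
        imagined = model_step(context, action)
        real = real_step(real, action)
    return abs(goal - real)

open_err = run(closed_loop=False)
closed_err = run(closed_loop=True)
print(open_err, closed_err)
```

In open-loop mode the generator's imagined state reaches the goal while the real state stops short; re-conditioning on real observations lets later actions absorb the accumulated error.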

---

To me, the key takeaway from this evolution is how we have begun bridging the gap between the digital and physical worlds. Instead of trying to manually program every physical law, we are leveraging the implicit physics embedded in billions of web videos.

Whether we use these models as simulators or as direct policies, the objective is the same: providing robots with a “physical common sense.” By reformulating robot control as a challenge of real-time video generation, we may be on the verge of a new scaling law for embodied intelligence.

[References in the comment]

578

641

39K

vionwinnie retweeted

Thariq

@trq212

3 months ago

https://t.co/45C3gKydTK

388

16K

44K

Winnie Yeung @vionwinnie

3 months ago

@KyleVedder @gpt_alex would love more pointers :)

vionwinnie retweeted

Kyle Vedder

@KyleVedder

3 months ago

im often asked “how do i break into robot learning without a phd?”

buy an SO-101 ($300) and do something interesting

understand the stack, train models, implement a paper, do serious on-robot evals, document your work

the pool of talent with real experience is v small https://t.co/VTv1Uhh2yJ

136K

vionwinnie retweeted

Maxime Labonne

@maximelabonne

4 months ago

A big thank you to @itsmaddox_j for the invitation, and to professors @niclane7 and @contactrika for having me! https://t.co/Bwzr3mL0Ip

169

305

11K

vionwinnie retweeted

Boris Cherny

@bcherny

5 months ago

I'm Boris and I created Claude Code. I wanted to quickly share a few tips for using Claude Code, sourced directly from the Claude Code team. The way the team uses Claude is different than how I use it. Remember: there is no one right way to use Claude Code -- everyone's setup is different. You should experiment to see what works for you!

924

51K

103K

vionwinnie retweeted

Sayak Paul

@RisingSayak

4 months ago

Editing images is a series of state transitions between the source image and the edited image that we want. Yet, the existing paradigm doesn't explicitly include any transitioning priors in the editing process. This becomes particularly evident for edits involving causal dynamics (e.g., refraction, deformation).

To model this kind of physics-informed information, we leverage the rich priors present in videos and introduce PhysicEdit 🔥

TL;DR: We fine-tune QwenImage Edit on a curated dataset of videos with reasoning traces and fixed-length transition queries to do solid physics-aware image editing!

In the process, we introduce a cool dataset "PhysicTran38K", consisting of 38K transition trajectories across five physical domains, and devise a method to provide supervision from it to QwenImage Edit.

Hop in to learn more ⬇️

361

212

51K

vionwinnie retweeted

Geoffrey Litt

@geoffreylitt

9 months ago

If you're thinking about AI-generated UIs, recommend checking out JELLY by @YiningCao3, @peilingjiang, and @HaijunXia. My favorite kind of work: both a compelling system/demo AND a bigger idea that people can build on!

talk video: https://t.co/Enk5MD1JSb
paper: https://t.co/FhTkoev9nP

tldr: vibe-coded UIs aren't ideal for users generating software, because it's hard to steer the generation and keep things consistent. They propose solving this by first generating a more structured model of the user's needs, including a data schema that the user can see/edit.

Then UIs get generated based on this schema, but it feels more like fluidly composing premade widgets in a task-specific way than building a new "application". Reminds me of @alexobenauer's work on an itemized OS and @jasonyuan's Mercury concept, as well as the Embark system that I worked on.

The demos feel compelling and magical, but there's also enough technical meat to see how this is actually feasible today with LLMs. Really cool.

Things I'm not so sure about:

- I like formality on demand: super unstructured representations (text, drawings) and only adding structure when needed. It seems like Jelly jumps straight to rigid relational models. Good fit for some tasks but not all. I wonder about fitting in less-structured bits and then structuring on-the-fly with LLMs. (As a mitigating factor: the fact that you can edit the schema live on the fly does help a lot, blurring the line between using and creating the software. And structure is really useful for things like different views of the same info)

- I'm curious how much the exposed schema ends up really being useful to users for understanding. Their own user study found the majority of users just relied on the UI rather than the schema. Feels like there's a lot more work to do here to achieve deeper interpretability. The challenge of "how do you tell users what software does without showing code" is endlessly deep...

307

353

22K

vionwinnie retweeted

elie

@eliebakouch

10 months ago

The technical report of @Meituan_LongCat LongCat-Flash is crazy good and full of novelty.
The model is a 560B passive ~27B active MoE with adaptive number of active parameters depending on the context thanks to the Zero-Computational expert.

1) New architecture
> Layers have 2 Attention blocks and both FFN and MoE, that way you can overlap the 2 all-to-all comms. (also it's only 28 layers but you have to take into account the 2 attention blocks).
> They add the zero-computational expert that tokens can choose and do nothing, kinda like a "sink" for easy tokens.
> For load balancing, they have a dsv3-like aux loss free to set the average real/fake expert per token. They apply a decay schedule to this bias update. They also do loss balance control.
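A rough sketch of the zero-computation expert idea from the list above, assuming scalar "tokens", made-up router scores, and that a zero expert simply passes its input through unchanged. The sizes, the `real_expert` function, and the pass-through behavior are assumptions for illustration, not details taken from the report.

```python
from typing import List, Sequence, Tuple

NUM_REAL_EXPERTS = 4   # hypothetical sizes, chosen only for the example
NUM_ZERO_EXPERTS = 2   # router slots 4 and 5 are zero-computation experts
TOP_K = 2

def real_expert(index: int, x: float) -> float:
    """Stand-in for an FFN expert; expert i just scales its input by (i + 1)."""
    return (index + 1) * x

def top_k(scores: Sequence[float], k: int) -> List[int]:
    """Indices of the k highest router scores."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]

def moe_forward(x: float, scores: Sequence[float]) -> Tuple[float, int]:
    """Return the mixed output and how many real (costly) experts were actually run."""
    chosen = top_k(scores, TOP_K)
    total = sum(scores[i] for i in chosen)
    output, real_calls = 0.0, 0
    for i in chosen:
        weight = scores[i] / total
        if i < NUM_REAL_EXPERTS:
            output += weight * real_expert(i, x)
            real_calls += 1
        else:
            output += weight * x  # zero-computation expert: assumed pass-through, no FFN
    return output, real_calls

hard_out, hard_calls = moe_forward(3.0, [0.4, 0.3, 0.1, 0.1, 0.05, 0.05])
easy_out, easy_calls = moe_forward(3.0, [0.05, 0.05, 0.05, 0.05, 0.5, 0.3])
print(hard_calls, easy_calls)
```

Because the top-k is fixed but some of the chosen slots cost nothing, the number of active parameters varies per token, which matches the "adaptive number of active parameters" description.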

2) Scaling
> They made changes to MLA/MoE to have variance alignment at init. The gains are pretty impressive in Figure 5, but i don't know to what extent this has impact later on.
> Model growth init is pretty cool, they first train a 2x smaller model and then "when it's trained enough" (a bit unclear here how many B tokens) they init the final model by just stacking the layers of the smaller model.
> They used the @_katieeverett, @Locchiu et al. paper to have hyperparameter transfer with SP instead of muP for the 2x smaller model ig.
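The model growth step can be pictured with a small sketch, assuming each layer is a plain dict of parameters and that "stacking" means repeating the whole small model's layer list. The layer format and the stacking order are assumptions for the example; the post does not spell them out.

```python
import copy
from typing import Any, Dict, List

Layer = Dict[str, Any]

def stack_layers(small_layers: List[Layer], times: int = 2) -> List[Layer]:
    """Initialize a deeper model by repeating the trained layers of a smaller one.
    Deep copies keep the duplicated layers independent once training resumes."""
    return [copy.deepcopy(layer) for _ in range(times) for layer in small_layers]

# Hypothetical trained half-depth model with three layers.
small_model = [{"name": f"layer{i}", "weights": [0.1 * i, 0.2 * i]} for i in range(3)]
grown_model = stack_layers(small_model)
print([layer["name"] for layer in grown_model])
```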

3) Stability
> They track Gradient Norm Ratio and cosine similarity between experts to adjust the weight of the load balancing loss (they recommend Gradient Norm Ratio <0.1).
> To avoid large activations, they apply a z-loss to the hidden state, with a pretty small coef (another alternative to qk-clip/norm).
> They set Adam epsilon to 1e-16 and show that you want it to be lower than the gradient RMS range.
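The epsilon point can be checked with the standard Adam update written out by hand. The sketch below computes the first bias-corrected step for a tiny gradient; the gradient magnitude of 1e-10 is an assumed value picked to sit below a common default epsilon of 1e-8.

```python
import math

def first_adam_update(grad: float, eps: float, lr: float = 1.0,
                      beta1: float = 0.9, beta2: float = 0.999) -> float:
    """Size of the first Adam step for a scalar gradient, with bias correction."""
    m = (1 - beta1) * grad
    v = (1 - beta2) * grad * grad
    m_hat = m / (1 - beta1)          # bias-corrected first moment, equals grad
    v_hat = v / (1 - beta2)          # bias-corrected second moment, equals grad**2
    return lr * m_hat / (math.sqrt(v_hat) + eps)

tiny_grad = 1e-10  # assumed gradient scale, smaller than eps=1e-8
for eps in (1e-8, 1e-16):
    print(f"eps={eps:g}: update = {first_adam_update(tiny_grad, eps):.6f} x lr")
```

When epsilon is larger than the gradient scale it dominates the denominator and the normalized step shrinks to about 1% of the learning rate; with epsilon well below the gradient scale the step stays close to the full learning rate.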

4) Others
> They train on 20T tokens for phase 1, "multiple T of tokens" for mid training on STEM/code data (70% of the mixture), 100B for long context extension without yarn (80B for 32k, 20B for 128k). The long context documents represent 25% of the mixture (not sure if it's % of documents or tokens, which changes a lot here).
> Pre-training data pipeline is context extraction, quality filtering, dedup.
> Nice appendix where they compare top_k needed for different benchmarks (higher MMLU with 8.32, lower GSM8K with 7.46). They also compare token allocation in deep/shallow layers.
> They release two new benchmarks Meeseeks (multi-turn IF) and VitaBench (real-world business scenario).
> Lots of details in the infra/inference with info on speculative decoding acceptance, quantization, deployment, kernel optimization, comms overlapping, etc.
> List of the different relevant papers in thread 🧵

850

148

741

250K

vionwinnie retweeted

Humphrey Shi

@humphrey_shi

11 months ago

New: Paper + Code: T2I-Copilot, a training-free multi-agent text-to-image system for agentic co-creation. ICCV 2025: https://t.co/gXCn6T8nVb

Multi-agent coding systems (e.g., Claude Code) are sweeping the world like a storm this summer. The success rests on a simple but fundamental idea in “Biological Scaling Laws”: intelligence emerges not only from scaling a single mind/model, but even more from effectively orchestrating many specialized models/agents into a society/civilization that can perceive, reason, plan, act, and collaborate.

The same multi-agent scaling principle applies to multimodal AI. We’re open-sourcing T2I-Copilot, a training-free, multi-agent generative AI system we’ve been building since last year. T2I-Copilot turns text-to-image into agentic co-creation and significantly boosts quality and controllability using open-source base models (e.g., FLUX.1-dev) with help from other multimodal agents, comparable to top industry APIs (Imagen 3, Recraft V3) at the time of submission in March 2025.

In T2I-Copilot, three agentic specialists (Input Interpreter → Generation Engine → Quality Evaluator) bridge human intent and model behavior with pre-generation disambiguation and post-generation iterative improvement.

On GenAI-Bench (VQAScore) we’re comparable to Imagen-3, and +6.17% over FLUX1.1-pro at ~16.6% of its cost. With human-in-the-loop, results improve further.

Open-Source Code & Paper: https://t.co/gXCn6T8nVb

The evolution of Agentic AI, and our human PhD students, never stops. We’ll keep iterating and add newer open-source models before @ICCVConference (Hawaii, Oct 2025). Big shout-out to @ChiehYun6, @flying_lynx, and Eric for spearheading this effort. 🐝🚀

vionwinnie retweeted

Howard Pinsky

@Pinsky

11 months ago

This is nuts! Harmonize just landed in the Photoshop beta and compositing will never be the same 🤯

134

227K

vionwinnie retweeted

Bilawal Sidhu

@bilawalsidhu

11 months ago

Photoshop’s new harmonize feature looks genuinely useful — effectively making complex compositing tasks just one click. Seems Adobe has productized Project Perfect Blend from their sneaks presentation.

539

263

48K

vionwinnie retweeted

Andrej Karpathy

@karpathy

about 1 year ago

A number of people asked if I can share the convo and yes sure - these were the 4 convos with my super noob swift questions lol:

1 starting the app https://t.co/TMyPAK2RhZ
2 enhancements https://t.co/vWnkwMrMe8
3 adding AppStorage to persist state over time https://t.co/NVxc7p1uVH
4 deploy to phone https://t.co/e4xo4cmcWR

and this is what it looks like late last night https://t.co/7B8Qp4L0gN

I'm already happily using it today for tracking, and will probably hack on it more on this fine sunday.

282

404K

vionwinnie retweeted

Andrej Karpathy

@karpathy

over 1 year ago

I was given early access to Grok 3 earlier today, making me I think one of the first few who could run a quick vibe check.

Thinking

✅ First, Grok 3 clearly has an around state of the art thinking model ("Think" button) and did great out of the box on my Settlers of Catan question:

"Create a board game webpage showing a hex grid, just like in the game Settlers of Catan. Each hex grid is numbered from 1..N, where N is the total number of hex tiles. Make it generic, so one can change the number of "rings" using a slider. For example in Catan the radius is 3 hexes. Single html page please."

Few models get this right reliably. The top OpenAI thinking models (e.g. o1-pro, at $200/month) get it too, but all of DeepSeek-R1, Gemini 2.0 Flash Thinking, and Claude do not.

❌ It did not solve my "Emoji mystery" question where I give a smiling face with an attached message hidden inside Unicode variation selectors, even when I give a strong hint on how to decode it in the form of Rust code. The most progress I've seen is from DeepSeek-R1 which once partially decoded the message.

❓ It solved a few tic tac toe boards I gave it with a pretty nice/clean chain of thought (many SOTA models often fail these!). So I upped the difficulty and asked it to generate 3 "tricky" tic tac toe boards, which it failed on (generating nonsense boards / text), but then so did o1 pro.

✅ I uploaded GPT-2 paper. I asked a bunch of simple lookup questions, all worked great. Then asked to estimate the number of training flops it took to train GPT-2, with no searching. This is tricky because the number of tokens is not spelled out so it has to be partially estimated and partially calculated, stressing all of lookup, knowledge, and math. One example is 40GB of text ~= 40B characters ~= 40B bytes (assume ASCII) ~= 10B tokens (assume ~4 bytes/tok), at ~10 epochs ~= 100B token training run, at 1.5B params and with 2+4=6 flops/param/token, this is 100e9 X 1.5e9 X 6 ~= 1e21 FLOPs.
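The estimate above can be reproduced step by step. Every number below comes from the post; the reading of 2+4 as roughly 2 FLOPs per parameter per token for the forward pass and 4 for the backward pass is the usual rule of thumb and is an assumption about what the post intends.

```python
# Back-of-the-envelope GPT-2 training compute, using only the figures quoted in the post.
dataset_bytes = 40e9              # 40GB of text, ~1 byte per character (ASCII assumed)
bytes_per_token = 4               # ~4 bytes per token
epochs = 10                       # ~10 passes over the data
params = 1.5e9                    # 1.5B parameters
flops_per_param_per_token = 2 + 4 # assumed split: forward (2) + backward (4)

tokens_per_epoch = dataset_bytes / bytes_per_token   # ~10B tokens
training_tokens = tokens_per_epoch * epochs          # ~100B tokens
total_flops = training_tokens * params * flops_per_param_per_token

print(f"tokens per epoch: {tokens_per_epoch:.1e}")
print(f"training tokens:  {training_tokens:.1e}")
print(f"total FLOPs:      {total_flops:.1e}")
```

The product comes to 9e20, which the post rounds to 1e21.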
Both Grok 3 and 4o fail this task, but Grok 3 with Thinking solves it great, while o1 pro (GPT thinking model) fails. I like that the model *will* attempt to solve the Riemann hypothesis when asked to, similar to DeepSeek-R1 but unlike many other models that give up instantly (o1-pro, Claude, Gemini 2.0 Flash Thinking) and simply say that it is a great unsolved problem. I had to stop it eventually because I felt a bit bad for it, but it showed courage and who knows, maybe one day... The impression overall I got here is that this is somewhere around o1-pro capability, and ahead of DeepSeek-R1, though of course we need actual, real evaluations to look at. DeepSearch Very neat offering that seems to combine something along the lines of what OpenAI / Perplexity call "Deep Research", together with thinking. Except instead of "Deep Research" it is "Deep Search" (sigh). Can produce high quality responses to various researchy / lookupy questions you could imagine have answers in article on the internet, e.g. a few I tried, which I stole from my recent search history on Perplexity, along with how it went: - ✅ "What's up with the upcoming Apple Launch? Any rumors?" - ✅ "Why is Palantir stock surging recently?" - ✅ "White Lotus 3 where was it filmed and is it the same team as Seasons 1 and 2?" - ✅ "What toothpaste does Bryan Johnson use?" - ❌ "Singles Inferno Season 4 cast where are they now?" - ❌ "What speech to text program has Simon Willison mentioned he's using?" ❌ I did find some sharp edges here. E.g. the model doesn't seem to like to reference X as a source by default, though you can explicitly ask it to. A few times I caught it hallucinating URLs that don't exist. A few times it said factual things that I think are incorrect and it didn't provide a citation for it (it probably doesn't exist). E.g. it told me that "Kim Jeong-su is still dating Kim Min-seol" of Singles Inferno Season 4, which surely is totally off, right? 
And when I asked it to create a report on the major LLM labs and their amount of total funding and estimate of employee count, it listed 12 major labs but not itself (xAI). The impression I get of DeepSearch is that it's approximately around Perplexity DeepResearch offering (which is great!), but not yet at the level of OpenAI's recently released "Deep Research", which still feels more thorough and reliable (though still nowhere perfect, e.g. it, too, quite incorrectly excludes xAI as a "major LLM labs" when I tried with it...). Random LLM "gotcha"s I tried a few more fun / random LLM gotcha queries I like to try now and then. Gotchas are queries that specifically on the easy side for humans but on the hard side for LLMs, so I was curious which of them Grok 3 makes progress on. ✅ Grok 3 knows there are 3 "r" in "strawberry", but then it also told me there are only 3 "L" in LOLLAPALOOZA. Turning on Thinking solves this. ✅ Grok 3 told me 9.11 > 9.9. (common with other LLMs too), but again, turning on Thinking solves it. ✅ Few simple puzzles worked ok even without thinking, e.g. *"Sally (a girl) has 3 brothers. Each brother has 2 sisters. How many sisters does Sally have?"*. E.g. GPT4o says 2 (incorrectly). ❌ Sadly the model's sense of humor does not appear to be obviously improved. This is a common LLM issue with humor capability and general mode collapse, famously, e.g. 90% of 1,008 outputs asking ChatGPT for joke were repetitions of the same 25 jokes. Even when prompted in more detail away from simple pun territory (e.g. give me a standup), I'm not sure that it is state of the art humor. Example generated joke: "*Why did the chicken join a band? Because it had the drumsticks and wanted to be a cluck-star!*". In quick testing, thinking did not help, possibly it made it a bit worse. ❌ Model still appears to be just a bit too overly sensitive to "complex ethical issues", e.g. 
generated a 1 page essay basically refusing to answer whether it might be ethically justifiable to misgender someone if it meant saving 1 million people from dying. ❌ Simon Willison's "*Generate an SVG of a pelican riding a bicycle*". It stresses the LLMs ability to lay out many elements on a 2D grid, which is very difficult because the LLMs can't "see" like people do, so it's arranging things in the dark, in text. Marking as fail because these pelicans are qutie good but, but still a bit broken (see image and comparisons). Claude's are best, but imo I suspect they specifically targeted SVG capability during training. Summary. As far as a quick vibe check over ~2 hours this morning, Grok 3 + Thinking feels somewhere around the state of the art territory of OpenAI's strongest models (o1-pro, $200/month), and slightly better than DeepSeek-R1 and Gemini 2.0 Flash Thinking. Which is quite incredible considering that the team started from scratch ~1 year ago, this timescale to state of the art territory is unprecedented. Do also keep in mind the caveats - the models are stochastic and may give slightly different answers each time, and it is very early, so we'll have to wait for a lot more evaluations over a period of the next few days/weeks. The early LM arena results look quite encouraging indeed. For now, big congrats to the xAI team, they clearly have huge velocity and momentum and I am excited to add Grok 3 to my "LLM council" and hear what it thinks going forward.

I was given early access to Grok 3 earlier today, making me, I think, one of the first few who could run a quick vibe check.

Thinking
✅ First, Grok 3 clearly has a thinking model around the state of the art ("Think" button), and it did great out of the box on my Settlers of Catan question:

"Create a board game webpage showing a hex grid, just like in the game Settlers of Catan. Each hex grid is numbered from 1..N, where N is the total number of hex tiles. Make it generic, so one can change the number of "rings" using a slider. For example in Catan the radius is 3 hexes. Single html page please."

Few models get this right reliably. The top OpenAI thinking models (e.g. o1-pro, at $200/month) get it too, but all of DeepSeek-R1, Gemini 2.0 Flash Thinking, and Claude do not.
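For a sense of what the prompt asks for, here is a minimal Python sketch of the counting and numbering logic only (no HTML or slider). It assumes axial `(q, r)` hex coordinates and a convention where the center tile is ring 0; the function names are hypothetical and not from the post. Under that convention the standard 19-tile Catan board corresponds to `rings=2`.

```python
def hex_ring_coords(rings: int) -> list:
    """Axial (q, r) coordinates of every hex within `rings` steps of the center."""
    coords = []
    for q in range(-rings, rings + 1):
        # Clamp r so that the implied third cube coordinate (-q - r) stays in range.
        r_min = max(-rings, -q - rings)
        r_max = min(rings, -q + rings)
        for r in range(r_min, r_max + 1):
            coords.append((q, r))
    return coords


def numbered_tiles(rings: int) -> dict:
    """Number the tiles 1..N, where N = 1 + 3 * rings * (rings + 1)."""
    return {i + 1: coord for i, coord in enumerate(hex_ring_coords(rings))}


if __name__ == "__main__":
    for rings in range(4):
        print(rings, len(numbered_tiles(rings)))
```

A slider in the web page would simply feed a new `rings` value into the same enumeration and redraw.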

❌ It did not solve my "Emoji mystery" question where I give a smiling face with an attached message hidden inside Unicode variation selectors, even when I give a strong hint on how to decode it in the form of Rust code. The most progress I've seen is from DeepSeek-R1 which once partially decoded the message.
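The post does not spell out the encoding, so the following Python sketch is only one plausible scheme, stated as an assumption: each byte of the message maps to one of the 256 Unicode variation selectors (U+FE00 to U+FE0F for bytes 0 to 15, U+E0100 to U+E01EF for bytes 16 to 255), and the selectors are appended after a visible base character. The helper names `hide` and `reveal` are hypothetical.

```python
from typing import Optional


def byte_to_selector(b: int) -> str:
    """Map a byte (0..255) to a variation selector code point (assumed scheme)."""
    if b < 16:
        return chr(0xFE00 + b)          # VS1..VS16
    return chr(0xE0100 + (b - 16))      # VS17..VS256


def selector_to_byte(ch: str) -> Optional[int]:
    """Inverse mapping; returns None for characters that are not variation selectors."""
    cp = ord(ch)
    if 0xFE00 <= cp <= 0xFE0F:
        return cp - 0xFE00
    if 0xE0100 <= cp <= 0xE01EF:
        return cp - 0xE0100 + 16
    return None


def hide(base: str, message: str) -> str:
    """Append the UTF-8 bytes of `message` to `base` as invisible selectors."""
    return base + "".join(byte_to_selector(b) for b in message.encode("utf-8"))


def reveal(text: str) -> str:
    """Collect every variation selector in `text` and decode the bytes as UTF-8."""
    data = bytearray()
    for ch in text:
        b = selector_to_byte(ch)
        if b is not None:
            data.append(b)
    return data.decode("utf-8")


if __name__ == "__main__":
    secret = hide("\U0001F60A", "hello")
    print(len(secret), reveal(secret))
```

The hidden payload renders as a single emoji, which is part of why a model has to reason about code points rather than visible text to decode it.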

❓ It solved a few tic tac toe boards I gave it with a pretty nice/clean chain of thought (many SOTA models often fail these!). So I upped the difficulty and asked it to generate 3 "tricky" tic tac toe boards, which it failed on (generating nonsense boards / text), but then so did o1-pro.

✅ I uploaded the GPT-2 paper. I asked a bunch of simple lookup questions, and all worked great. Then I asked it to estimate the number of training FLOPs it took to train GPT-2, with no searching. This is tricky because the number of tokens is not spelled out, so it has to be partially estimated and partially calculated, stressing all of lookup, knowledge, and math. One example estimate is 40GB of text ~= 40B characters ~= 40B bytes (assume ASCII) ~= 10B tokens (assume ~4 bytes/tok), at ~10 epochs ~= 100B token training run, at 1.5B params and with 2+4=6 FLOPs/param/token, this is 100e9 × 1.5e9 × 6 ~= 1e21 FLOPs. Both Grok 3 and 4o fail this task, but Grok 3 with Thinking solves it great, while o1-pro (GPT thinking model) fails.
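The back-of-the-envelope estimate above can be reproduced in a few lines of Python. Every input is one of the rough assumptions stated in the post (40 GB of ASCII text, ~4 bytes per token, ~10 epochs, 1.5B parameters, 6 FLOPs per parameter per token); the function name is hypothetical.

```python
def estimate_training_flops(
    dataset_bytes: float = 40e9,          # ~40 GB of text, 1 byte per character
    bytes_per_token: float = 4.0,         # rough tokenizer compression
    epochs: float = 10.0,                 # assumed number of passes over the data
    params: float = 1.5e9,                # GPT-2 parameter count
    flops_per_param_per_token: float = 6.0,  # 2 forward + 4 backward
) -> float:
    """Return the estimated total training FLOPs under the stated assumptions."""
    tokens_per_epoch = dataset_bytes / bytes_per_token   # ~10B tokens
    training_tokens = tokens_per_epoch * epochs          # ~100B tokens
    return training_tokens * params * flops_per_param_per_token


if __name__ == "__main__":
    print(f"{estimate_training_flops():.1e} FLOPs")
```

The product comes to 9e20, which rounds to the ~1e21 FLOPs quoted in the post.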

I like that the model *will* attempt to solve the Riemann hypothesis when asked to, similar to DeepSeek-R1 but unlike many other models that give up instantly (o1-pro, Claude, Gemini 2.0 Flash Thinking) and simply say that it is a great unsolved problem. I had to stop it eventually because I felt a bit bad for it, but it showed courage and who knows, maybe one day...

The impression overall I got here is that this is somewhere around o1-pro capability, and ahead of DeepSeek-R1, though of course we need actual, real evaluations to look at.

DeepSearch
Very neat offering that seems to combine something along the lines of what OpenAI / Perplexity call "Deep Research", together with thinking. Except instead of "Deep Research" it is "Deep Search" (sigh). It can produce high quality responses to various researchy / lookupy questions that you could imagine have answers in articles on the internet, e.g. a few I tried, which I stole from my recent search history on Perplexity, along with how it went:

- ✅ "What's up with the upcoming Apple Launch? Any rumors?"
- ✅ "Why is Palantir stock surging recently?"
- ✅ "White Lotus 3 where was it filmed and is it the same team as Seasons 1 and 2?"
- ✅ "What toothpaste does Bryan Johnson use?"
- ❌ "Singles Inferno Season 4 cast where are they now?"
- ❌ "What speech to text program has Simon Willison mentioned he's using?"

❌ I did find some sharp edges here. E.g. the model doesn't seem to like to reference X as a source by default, though you can explicitly ask it to. A few times I caught it hallucinating URLs that don't exist. A few times it said factual things that I think are incorrect and it didn't provide a citation for it (it probably doesn't exist). E.g. it told me that "Kim Jeong-su is still dating Kim Min-seol" of Singles Inferno Season 4, which surely is totally off, right? And when I asked it to create a report on the major LLM labs and their amount of total funding and estimate of employee count, it listed 12 major labs but not itself (xAI).

The impression I get of DeepSearch is that it's roughly on par with Perplexity's DeepResearch offering (which is great!), but not yet at the level of OpenAI's recently released "Deep Research", which still feels more thorough and reliable (though still nowhere near perfect, e.g. it, too, quite incorrectly excludes xAI as a "major LLM lab" when I tried with it...).

Random LLM "gotcha"s

I tried a few more fun / random LLM gotcha queries I like to try now and then. Gotchas are queries that are specifically on the easy side for humans but on the hard side for LLMs, so I was curious which of them Grok 3 makes progress on.

✅ Grok 3 knows there are 3 "r" in "strawberry", but then it also told me there are only 3 "L" in LOLLAPALOOZA. Turning on Thinking solves this.
✅ Grok 3 told me 9.11 > 9.9 (common with other LLMs too), but again, turning on Thinking solves it.
✅ A few simple puzzles worked ok even without thinking, e.g. *"Sally (a girl) has 3 brothers. Each brother has 2 sisters. How many sisters does Sally have?"*. By comparison, GPT-4o says 2 (incorrectly).
❌ Sadly the model's sense of humor does not appear to be obviously improved. This is a common LLM issue with humor capability and general mode collapse; famously, 90% of 1,008 outputs asking ChatGPT for a joke were repetitions of the same 25 jokes. Even when prompted in more detail away from simple pun territory (e.g. give me a standup), I'm not sure that it is state of the art humor. Example generated joke: "*Why did the chicken join a band? Because it had the drumsticks and wanted to be a cluck-star!*". In quick testing, thinking did not help; possibly it made it a bit worse.
❌ Model still appears to be a bit too sensitive to "complex ethical issues", e.g. it generated a 1 page essay basically refusing to answer whether it might be ethically justifiable to misgender someone if it meant saving 1 million people from dying.
❌ Simon Willison's "*Generate an SVG of a pelican riding a bicycle*". It stresses the LLM's ability to lay out many elements on a 2D grid, which is very difficult because LLMs can't "see" like people do, so it's arranging things in the dark, in text. Marking as fail because these pelicans are quite good, but still a bit broken (see image and comparisons). Claude's are best, but imo I suspect they specifically targeted SVG capability during training.
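The first three gotchas in the list above have mechanical ground truths, which a short Python sketch can confirm. The helper names are hypothetical, and the Sally answer assumes that all brothers share the same set of sisters, which includes Sally herself.

```python
from decimal import Decimal


def count_letter(word: str, letter: str) -> int:
    """Case-insensitive count of `letter` in `word`."""
    return word.lower().count(letter.lower())


def sallys_sisters(sisters_per_brother: int) -> int:
    """Each brother's sisters include Sally, so she has one fewer sister than that."""
    return sisters_per_brother - 1


if __name__ == "__main__":
    print(count_letter("strawberry", "r"))        # letters "r" in strawberry
    print(count_letter("LOLLAPALOOZA", "L"))      # letters "L" in LOLLAPALOOZA
    print(Decimal("9.11") > Decimal("9.9"))       # decimal comparison, not version numbers
    print(sallys_sisters(2))                      # Sally's own sisters
```

This gives 3 for "strawberry", 4 for "LOLLAPALOOZA" (so an answer of 3 is wrong), `False` for 9.11 > 9.9, and 1 sister for Sally.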

Summary. As far as a quick vibe check over ~2 hours this morning goes, Grok 3 + Thinking feels somewhere around the state of the art territory of OpenAI's strongest models (o1-pro, $200/month), and slightly better than DeepSeek-R1 and Gemini 2.0 Flash Thinking. This is quite incredible considering that the team started from scratch ~1 year ago; this timescale to state of the art territory is unprecedented. Do also keep in mind the caveats: the models are stochastic and may give slightly different answers each time, and it is very early, so we'll have to wait for a lot more evaluations over the next few days/weeks. The early LM arena results look quite encouraging indeed. For now, big congrats to the xAI team. They clearly have huge velocity and momentum, and I am excited to add Grok 3 to my "LLM council" and hear what it thinks going forward.

664

17K

vionwinnie retweeted

Matthew Berman

@MatthewBerman

over 1 year ago

OpenAI just dropped a paper that reveals the blueprint for creating the best AI coder in the world. But here’s the kicker: this strategy isn’t just for coding—it’s the clearest path to AGI and beyond. Let’s break it down 🧵👇

https://t.co/NJwbq4kxRs

150

929

11K

Winnie Yeung @vionwinnie

over 1 year ago

@iScienceLuvr Hoping to read the blog post once the website is back up :)
