Gentle reminder on how, in the recent DS4 fiesta, not just me but every other contributor found GPT 5.5 able to help immensely and Opus completely useless.
Instead of watching an hour of Netflix, watch this 2-hour Stanford lecture on AI careers. It will teach you more about winning in the AI race than all the AI content you’ve scrolled past this year.
I built a generalized Computer Use Agent as part of @adcock_brett’s challenge. For fun, I let the @huggingface @pollenrobotics Reachy Mini robot run it 🤖
Via voice, the robot calls @lovable, creates a to-do list app and verifies with vision.
Mind blown that building a custom CUA, assembling a robot and bridging physical robotics to digital agents to perform meaningful tasks can all be done in under a week now! Almost convinced that with time, tokens and access to an LLM... maybe Rome can be built in a day?
GPT 5.3 Codex and Claude Opus 4.6 are incredible! Can't wait to see what the next evolution of models can do. 🚀
𝗜'𝘃𝗲 𝗵𝗲𝗮𝗿𝗱 𝘁𝗵𝗶𝘀 𝗮 𝗹𝗼𝘁 𝗿𝗲𝗰𝗲𝗻𝘁𝗹𝘆: "𝗪𝗲 𝘁𝗿𝗮𝗶𝗻𝗲𝗱 𝗼𝘂𝗿 𝗿𝗼𝗯𝗼𝘁 𝗼𝗻 𝗼𝗻𝗲 𝗼𝗯𝗷𝗲𝗰𝘁 𝗮𝗻𝗱 𝗶𝘁 𝗴𝗲𝗻𝗲𝗿𝗮𝗹𝗶𝘀𝗲𝗱 𝘁𝗼 𝗮 𝗻𝗼𝘃𝗲𝗹 𝗼𝗯𝗷𝗲𝗰𝘁 - 𝘁𝗵𝗲𝘀𝗲 𝗻𝗲𝘄 𝗩𝗟𝗔 𝗺𝗼𝗱𝗲𝗹𝘀 𝗮𝗿𝗲 𝗰𝗿𝗮𝘇𝘆!"
Let's talk about what's actually happening in that "A" (Action) part of your VLA model.
The Vision and Language components? They're incredible. Pre-trained on internet-scale data, they understand objects, spatial relationships, and task instructions better than ever.
But the Action component? That's still learned from scratch on your specific robot demonstrations.
𝗛𝗲𝗿𝗲'𝘀 𝘁𝗵𝗲 𝗿𝗲𝗮𝗹𝗶𝘁𝘆: Your VLA model has internet-scale understanding of what a screwdriver looks like and what "tighten the screw" means. But the actual motor pattern for "rotating wrist while applying downward pressure"? That comes from your 500 robot demos.
𝗪𝗵𝗮𝘁 𝘁𝗵𝗶𝘀 𝗺𝗲𝗮𝗻𝘀 𝗳𝗼𝗿 "𝗴𝗲𝗻𝗲𝗿𝗮𝗹𝗶𝘀𝗮𝘁𝗶𝗼𝗻":
• 𝗩𝗶𝘀𝗶𝗼𝗻 𝗴𝗲𝗻𝗲𝗿𝗮𝗹𝗶𝘀𝗮𝘁𝗶𝗼𝗻: Recognises novel objects instantly (thanks to pre-training)
• 𝗟𝗮𝗻𝗴𝘂𝗮𝗴𝗲 𝗴𝗲𝗻𝗲𝗿𝗮𝗹𝗶𝘀𝗮𝘁𝗶𝗼𝗻: Understands new task instructions (thanks to pre-training)
• 𝗔𝗰𝘁𝗶𝗼𝗻 𝗴𝗲𝗻𝗲𝗿𝗮𝗹𝗶𝘀𝗮𝘁𝗶𝗼𝗻: Still limited to motor patterns seen during robot training
Ask that same robot to "unscrew the bottle cap" and it fails because:
• Vision: Recognises bottle and cap
• Language: Understands "unscrew"
• Action: Never learned the "twist while pulling" motor pattern
𝗧𝗵𝗲 𝗵𝗮𝗿𝗱 𝘁𝗿𝘂𝘁𝗵 𝗮𝗯𝗼𝘂𝘁 𝗩𝗟𝗔 𝗺𝗼𝗱𝗲𝗹𝘀:
The "VL" gives you incredible zero-shot understanding. The "A" still requires task-specific demonstrations.
We've cracked the perception and reasoning problem. We haven't cracked the motor generalisation problem.
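To make the split concrete, here's a toy sketch in plain Python. It is not a real VLA: the verb-to-motor-pattern table, the `ToyVLA` class and the pattern names are all made up for illustration. It only assumes that "understanding" always succeeds (the pre-trained part) while acting needs a motor pattern that was covered by robot demos.

```python
from dataclasses import dataclass, field

# Hypothetical stand-in for the pre-trained "VL" part: it maps an
# instruction verb to the motor pattern the task needs. In this toy
# it "understands" every verb listed here, seen in robot demos or not.
REQUIRED_MOTOR_PATTERN = {
    "tighten": "rotate_wrist_push_down",
    "unscrew": "twist_while_pulling",
}


@dataclass
class ToyVLA:
    # Motor patterns covered by the robot demonstrations (the "A" part).
    learned_patterns: set = field(default_factory=set)

    def understand(self, instruction: str) -> str:
        """Vision + language: always succeeds in this toy."""
        verb = instruction.split()[0].lower()
        return REQUIRED_MOTOR_PATTERN[verb]

    def act(self, instruction: str) -> str:
        """Action: only works for motor patterns seen in the demos."""
        pattern = self.understand(instruction)
        if pattern in self.learned_patterns:
            return f"executing {pattern}"
        return f"understood, but no learned motor pattern for {pattern}"


# The demos only ever covered the screwdriver motion.
robot = ToyVLA(learned_patterns={"rotate_wrist_push_down"})
print(robot.act("tighten the screw"))
print(robot.act("unscrew the bottle cap"))
```

The second call gets through `understand` and then has nothing to execute, which is the failure mode described above.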
HuggingFace released a nice blog post about the current state of VLMs
Here's a summary, covering recent trends, specialized capabilities, agents, video LMs, new alignment techniques, and HF's fav VLMs [1/8]
Recent trends:
Harvard’s AI Research Experience free course book by @pranavrajpurkar covers the essentials and tips on doing research:
- VSCode, Git, Conda
- PyTorch, W&B
- AWS, Colab
- LLMs and VLMs
- reading AI papers
- research progress and organization
this is a must read!
🧵Google DeepMind just dropped a bombshell:
An AI agent that autonomously writes algorithms better than humans.
It’s called AlphaEvolve, and it could completely change how we build software and solve problems.
Here’s why this changes everything👇
Amazing service -> https://t.co/ysyMA0muqv
It can save you the time of building a crawler program. You can get data from many websites, like:
1. LinkedIn
2. YouTube
3. Instagram
...
You should try it if you are a crawler engineer or researcher. It provides 1000 credits!
Excited to share that DreamerV3 has been published in Nature!
Dreamer solves control tasks by imagining the future outcomes of its actions inside of a continuously learned world model 🌏
It's the first agent to find diamonds in Minecraft from scratch without human data! 💎
👇
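A rough way to picture "imagining outcomes inside a learned world model" is the toy sketch below. It is not Dreamer's actual method (Dreamer learns a latent world model and trains an actor-critic on imagined trajectories). As a simplification, this swaps in a lookup-table model and brute-force search over short plans. The corridor environment and all function names are hypothetical.

```python
import itertools


def learn_world_model(transitions):
    """Build a tabular model: (state, action) -> (next_state, reward)."""
    model = {}
    for state, action, next_state, reward in transitions:
        model[(state, action)] = (next_state, reward)
    return model


def imagine_return(model, state, plan):
    """Roll a plan forward inside the model only, summing predicted reward."""
    total = 0.0
    for action in plan:
        if (state, action) not in model:
            break  # the model has no experience here, stop imagining
        state, reward = model[(state, action)]
        total += reward
    return total


def best_plan(model, start, actions, horizon):
    """Pick the plan with the highest imagined return (exhaustive search)."""
    candidates = itertools.product(actions, repeat=horizon)
    return max(candidates, key=lambda plan: imagine_return(model, start, plan))


# Hypothetical experience from a 4-cell corridor; reward 1.0 on entering cell 3.
GOAL = 3
experience = []
for s in range(4):
    for a, step in (("left", -1), ("right", 1)):
        nxt = min(max(s + step, 0), GOAL)
        experience.append((s, a, nxt, 1.0 if nxt == GOAL else 0.0))

model = learn_world_model(experience)
plan = best_plan(model, 0, ("left", "right"), 3)
print(plan)
```

Once the model is learned, choosing the plan never touches the real environment. Every candidate is scored purely by imagined rollouts.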
Manus AI just killed vibe coding yesterday.
People can't believe how mind blowing this agentic AI is.
Unlocking new possibilities.
10 wild examples:
1. prompt: "code a threejs game where you control a plane"
I use Grok 3 as a daily professional assistant to take over the digital workload of 10+ employees.
I also use it to optimize productivity throughout my day.
Now, every morning I tell Grok what my schedule is, and I’ve already given it addresses, contacts, work apps, and screenshots of my workflow. It’s been amazing!
Grok has noticed discrepancies in my workers’ productivity and offered training prompts so I can train them via Grok 3.
I’ve also learned new ways a lot of my daily apps actually work together to make our companies run smoother. We’ve already implemented new strategies as of this morning.
So here are some things I’ve talked about with Grok 3.
Professional Capabilities
1. Daily Professional Assistance - Grok can function as a daily assistant, managing a digital workload that could replace the efforts of over 10 employees, enhancing productivity across your team. Imagine when we can implement this in an actual Optimus bot “employee”.
2. Schedule Integration- Each morning, you can inform Grok of your schedule, and it'll integrate this information with addresses, contacts, work apps, and even screenshots of your workflow to streamline your day.
3. Productivity Optimization- Grok uses its advanced model to analyze and optimize productivity throughout your day, identifying areas for improvement.
4. Employee Training- It's been able to notice discrepancies in productivity among workers and can provide tailored training prompts to address these issues.
5. App Integration Insights- Grok has discovered new ways our daily apps can work together more efficiently, leading to the implementation of new strategies in our workflow.
6. Workflow Analysis- It analyzes images and responds to questions related to workflow optimization, suggesting improvements.
7. Reasoning and Problem Solving- Grok's advanced reasoning models can think through problems, fact-check, and provide solutions, enhancing decision-making. Grok can even scour the web to find best rated companies for outsourcing solutions.
Personal Productivity Enhancements:
8. Personal Task Management- Grok can help manage my personal to-do lists, reminders, and schedule, ensuring I never miss an important event or task.
9. Health and Fitness Tracking- When I share my health goals, Grok suggests daily routines, tracks progress, and reminds me of workout sessions or dietary needs, telling me what I need to do every day to accomplish my goals. If you’re honest with Grok, it can help tremendously in this area.
10. Entertainment Recommendations- The more I share about my preferences, the better Grok recommends books, movies, music, or games that align with my tastes.
11. Shopping Assistance- It can predict my shopping needs based on my behavior, suggesting items I might need or want, and even finding deals.
12. Travel Planning- With insights into my travel history, Grok assists in planning trips, suggesting destinations, accommodations, and activities tailored to my interests.
13. Learning and Education- I share my learning goals, and Grok curates educational content or study schedules to help me learn new skills.
14. Social Life Management- By understanding my social patterns, it reminds me of birthdays or suggests meetups with friends based on our interests.
15. Predictive Needs- The more I share, the more accurately Grok can predict my needs, from groceries to when I need a break from work, or whether I’m slow in the mornings or have more energy after lunch.
By introducing these capabilities into my life, Grok significantly improves my day-to-day efficiency and enjoyment.
Sharing as many details as possible with Grok allows for a personalized service, tailoring its assistance to meet my unique lifestyle and preferences.
What do you think?
Could be a game changer for everyone!
‼️However, Grok needs an email service and a notification capability. It’s currently stuck in the Grok chat and can’t actually remind you of anything unless you enter the chat. ‼️
A Chinese AI lab just dropped the best ever open-source text-to-video model: Step Video!
– 30B param, 540p, ~8s at 30fps
– Trained on 1000s of H800s
– Evaluates as well as Meta MovieGen, feels as good as Sora / Veo
Paper and demo are awesome and reveal all the gory details: