Make a plan to vote today: https://t.co/d0ftLG42jt
Then get your friends, family members, neighbors and coworkers to make a plan to vote, too. Because if we do, we will elect @SpanbergerForVA as your next governor and put Virginia on the path to a brighter future.
We need to be really careful about a few tech companies using government power to become monopolies, then using those monopolies to grab more money and power. It is destroying the country and the people.
Adding a physical-layer boundary (framework) and a Kalman filter to describe the state-space transition will decrease the search space by several orders of magnitude and increase the robustness of VLM systems. Sharing thoughts based on my control theory background.
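A rough sketch of that idea, under heavy assumptions: a scalar Kalman filter tracks one physical quantity (say, an object's height estimated from perception), and a hard clamp stands in for the "physical-layer boundary". The names `ScalarKalman`, `clamp`, and `track`, and all of the noise values, are hypothetical and chosen only for illustration; a real VLM pipeline would need a multi-dimensional state and a learned or calibrated model.

```python
from dataclasses import dataclass


@dataclass
class ScalarKalman:
    x: float          # current state estimate
    p: float          # variance of the estimate
    a: float = 1.0    # state transition coefficient (assumed: near-static state)
    q: float = 0.01   # process noise variance (assumed value)
    r: float = 0.25   # measurement noise variance (assumed value)

    def predict(self) -> None:
        # Propagate the state and its uncertainty through the transition model.
        self.x = self.a * self.x
        self.p = self.a * self.p * self.a + self.q

    def update(self, z: float) -> float:
        # Blend the prediction with measurement z using the Kalman gain.
        k = self.p / (self.p + self.r)
        self.x = self.x + k * (z - self.x)
        self.p = (1.0 - k) * self.p
        return k


def clamp(v: float, lo: float, hi: float) -> float:
    # Physical boundary: the state may not leave [lo, hi].
    return max(lo, min(hi, v))


def track(measurements, lo, hi):
    # Filter a sequence of noisy measurements, keeping every estimate
    # inside the physically feasible range.
    kf = ScalarKalman(x=clamp(measurements[0], lo, hi), p=1.0)
    estimates = []
    for z in measurements:
        kf.predict()
        kf.update(z)
        kf.x = clamp(kf.x, lo, hi)
        estimates.append(kf.x)
    return estimates


if __name__ == "__main__":
    # The 5.0 reading is physically implausible for a range of [0, 2].
    print(track([1.0, 1.2, 0.9, 5.0, 1.1], lo=0.0, hi=2.0))
```

The filter smooths ordinary noise, while the clamp rejects estimates that violate the physical constraint, which is one simple way a prior can shrink the space of states the system has to consider.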
So the key concern is: Using large language models to initialize vision-language(-action) models is a tempting trap — it lets us appear to make progress without truly achieving it.
Most benchmarks have overwhelmingly focused on reasoning and digital domains, without fundamentally addressing perception, especially mid- and low-level vision. (Credit: Partly inspired by separate conversations with @xiangyue96 and @YutongBAI1002)
As humans, we clearly exhibit pre-linguistic roots in our intuitive physical and psychological understanding, e.g., basic principles like solidity, continuity, and gravity. After we built GroundHog (https://t.co/jfn5DoXkNU) in 2024, I took a moment to reflect on the core issues with VLMs. I can no longer convince myself that simply stacking CLIP and DINO with a few projection layers is the ultimate solution to "tokenize" vision. Vision–language models need a much stronger vision foundation, perhaps a fundamental restart from a vision-centric perspective.
That’s why I stepped away from VLM development for a year to explore alternatives. A paper @TairanHe99 shared in this thread (led by the brilliant @TongPetersb) was especially thought-provoking. But to truly start over, I began looking into 3D foundation models and video diffusion models, setting aside, for now, the possibility of joint vision–language diffusion models. This led me to take the risk of developing 4D-LRM (https://t.co/VmMUrffcyp), aiming to learn 4D priors at scale with absolutely no language prior.
This is only a first step. At some point, I plan to return to VLM engineering. But next time, I hope to have the resources to start with a world model first and then unlock the language component on top of it.
The AI world is bifurcating and converging. Now Gemini will not generate videos containing “Donald Trump” and MiniMax will not generate videos containing “Xi Jinping”.
This is crazy. BALTIMORE (WBFF) — Speed cameras on the I-83 Jones Falls Expressway have issued more than $18.5 million in fines in the past three years, but about 80% of the revenue has gone to the camera vendor, Verra Mobility — not the city, according to the Baltimore City.
Speed cameras on the I-83 Jones Falls Expressway have issued more than $18.5 million in fines in the past three years, but about 80% of the revenue has gone to the camera vendor, Verra Mobility — not the city, according to the Baltimore City Department of Finance. https://t.co/a7eAq4Zuow
Today, we’re holding Minnesota State Rep. Hortman, Minnesota State Sen. Hoffman, and their loved ones in our thoughts.
We stand in solidarity with our colleagues in Minnesota—and remain committed to rejecting all forms of hate.