Congratulations @twelve_labs, @seo_minjoon and @ai_den_lee on the achievement! Folks, check out the tech report on Pegasus-1 17B. We asked the model some really interesting questions, and the results are not too shabby ❤️
More to come!
We're excited to share the technical report of Pegasus-1, our 17B-parameter VLM, setting new benchmarks in video understanding.
It surpasses larger models like Gemini Pro and Ultra in video conversation, QA, summarization, and temporal understanding.
https://t.co/HHOovLyin1
“Whenever I think about meeting founders, one of my questions I ask is, is this a problem they're obsessed with?” says NEA Partner @lucktm. “And when I met Twelve Labs, 100%, that came through.”
@twelve_labs co-founder @_jae_lee joins Tiffany as part of our Founder Forward interview series, where NEA speaks with the leaders of the startups we’ve partnered with about the technology and market trends driving their businesses.
Check out the full conversation below.
https://t.co/GkipysE8N6
#ai #llm #data
Have you ever done a documentary?
…they asked
nope! but there’s a first time for everything
playing hypewoman to @_jae_lee @twelve_labs for Korean TV, stay tuned
We've been working with @twelve_labs for more than a year now. These guys took a different strategy than most other AI startups and focused on gen AI and video. That bet is paying off now! Think "RAG but for video"
Ready to unlock powerful insights from your video content? Join @twelve_labs and Backblaze TODAY at 10 a.m. PT! Learn how to seamlessly integrate AI for deep video analysis with Backblaze B2. Expect practical tips and real-world examples!
Register now: https://t.co/yfiHorMCFJ
Pegasus-v1
This technical report introduces Pegasus-1, a multimodal language model specialized in video content understanding and interaction through natural language. Pegasus-1 is designed to address the unique challenges posed by video data, such as interpreting
Join us with @twelve_labs on April 24 to learn how to extract actionable insights from your video content stored in Backblaze B2 using AI.
Register free: https://t.co/teiOoT7fwE
Let me clear a *huge* misunderstanding here.
The generation of mostly realistic-looking videos from prompts *does not* indicate that a system understands the physical world.
Generation is very different from causal prediction from a world model.
The space of plausible videos is very large, and a video generation system merely needs to produce *one* sample to succeed.
The space of plausible continuations of a real video is *much* smaller, and generating a representative chunk of those is a much harder task, particularly when conditioned on an action.
Furthermore, generating those continuations would be not only expensive but totally pointless.
It's much more desirable to generate *abstract representations* of those continuations that eliminate details in the scene that are irrelevant to any action we might want to take.
That is the whole point behind the JEPA (Joint Embedding Predictive Architecture), which is *not generative* and makes predictions in representation space.
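To make "predictions in representation space" concrete, here is a minimal sketch, not the actual JEPA implementation. It only shows where the loss is computed: between a predicted embedding and a target embedding, never between pixels. The linear encoders, the dimensions, the squared-error distance, and every name below are assumptions chosen for illustration; real systems use deep networks and a training loop.

```python
import random

def linear(weights, x):
    """Apply a weight matrix (a list of rows) to a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]

def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((p - q) ** 2 for p, q in zip(a, b)) / len(a)

def random_matrix(rows, cols, rng):
    """Random weights standing in for a trained network (illustration only)."""
    return [[rng.gauss(0.0, 1.0 / cols ** 0.5) for _ in range(cols)]
            for _ in range(rows)]

rng = random.Random(0)
INPUT_DIM, EMBED_DIM = 16, 4  # hypothetical sizes: 16 "pixels", 4-d embedding

context_encoder = random_matrix(EMBED_DIM, INPUT_DIM, rng)
target_encoder = random_matrix(EMBED_DIM, INPUT_DIM, rng)
predictor = random_matrix(EMBED_DIM, EMBED_DIM, rng)

def jepa_style_loss(context, target):
    """Compare embeddings, not raw inputs: nothing here reconstructs pixels."""
    predicted_embedding = linear(predictor, linear(context_encoder, context))
    target_embedding = linear(target_encoder, target)
    return mse(predicted_embedding, target_embedding)

# Stand-ins for an observed video chunk and the chunk that follows it.
context = [rng.gauss(0.0, 1.0) for _ in range(INPUT_DIM)]
target = [rng.gauss(0.0, 1.0) for _ in range(INPUT_DIM)]
loss = jepa_style_loss(context, target)
```

The point of the sketch is that the loss only ever sees 4-dimensional embeddings, so any detail of the 16-dimensional input that the encoders discard cannot affect it.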
Our work on VICReg, I-JEPA, V-JEPA, and the work of others show that Joint Embedding architectures produce much better representations of visual inputs than generative architectures that reconstruct pixels (such as Variational AE, Masked AE, Denoising AE, etc.).
When using the learned representations as inputs to a supervised head trained on downstream tasks (without fine-tuning the backbone), Joint Embedding beats generative.
See the results table from the V-JEPA blog post or paper:
https://t.co/mfLvtvk8jj
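The evaluation protocol mentioned above (a supervised head trained on frozen representations, with no fine-tuning of the backbone) can be sketched roughly as follows. This is a toy illustration under stated assumptions, not the V-JEPA evaluation code: the "backbone" is a fixed hand-written feature map, the data is synthetic, the head is a plain logistic regression, and all names are hypothetical.

```python
import math
import random

# Hypothetical frozen backbone: fixed weights that training never touches.
BACKBONE_WEIGHTS = ((1.0, 0.5), (-0.5, 1.0), (0.3, -0.8))

def backbone(x):
    """Map a 2-d input to a 3-d representation using the frozen weights."""
    return [math.tanh(sum(w * v for w, v in zip(row, x)))
            for row in BACKBONE_WEIGHTS]

def make_data(n_per_class, rng):
    """Synthetic two-class data: Gaussian blobs around (-2, -2) and (2, 2)."""
    data = []
    for label, center in ((0, -2.0), (1, 2.0)):
        for _ in range(n_per_class):
            data.append(([rng.gauss(center, 0.5), rng.gauss(center, 0.5)], label))
    return data

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_head(features, labels, steps=200, lr=0.5):
    """Full-batch gradient descent on a logistic-regression head only."""
    w, b, n = [0.0] * len(features[0]), 0.0, len(features)
    for _ in range(steps):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for f, y in zip(features, labels):
            err = sigmoid(sum(wi * fi for wi, fi in zip(w, f)) + b) - y
            for i, fi in enumerate(f):
                grad_w[i] += err * fi / n
            grad_b += err / n
        w = [wi - lr * gi for wi, gi in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

def accuracy(w, b, features, labels):
    """Fraction of examples where the head's decision matches the label."""
    correct = sum(
        int((sum(wi * fi for wi, fi in zip(w, f)) + b > 0) == bool(y))
        for f, y in zip(features, labels)
    )
    return correct / len(labels)

rng = random.Random(0)
train, test = make_data(40, rng), make_data(20, rng)
train_x, train_y = [backbone(x) for x, _ in train], [y for _, y in train]
test_x, test_y = [backbone(x) for x, _ in test], [y for _, y in test]

head_w, head_b = train_head(train_x, train_y)  # only the head is learned
test_accuracy = accuracy(head_w, head_b, test_x, test_y)
```

Because the backbone never changes, the head's test accuracy measures only how useful the frozen representations are for the task, which is the quantity such comparisons rely on.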
Our co-founder @soyoungacorn gave a talk at the #SFAIMeetup event we sponsored earlier this month.
Give it a watch if you are curious to learn about our journey from the incubation at 🇰🇷 cyber command!
Our latest article from @le_james94 reviews how far video understanding research has come, what potential remains untapped, and where it is headed in the future.