I just came back from a trip with 250+ videos and photos in my camera roll. I wanted to post a video capturing the best moments, but manually scrubbing through gigabytes of raw footage to find the perfect clips and photos is a massive, time-consuming bottleneck.
So, I'm building a solution: Watch It. If you're interested, follow along to watch me build this. 🚀
Here is the first look:
The goal for Watch It is to give everyone a professional video editor in their pocket. This Beat Engine is the "brain" that makes it possible.
If you’re building in the GenAI/Video space, I’d love to hear how you’re handling AV-sync. Drop a comment! 🚀 #BuildInPublic #AI #ComputerVision
Most AI-generated videos feel "uncanny" or boring. Why?
Because the visual energy is decoupled from the audio intent. If the beat drops but the camera doesn't move, the human brain flags it as low-quality.
I just finished the "Beat Engine" for Watch It. It's designed so that it doesn't just listen for beats; it analyzes musical intent.
The lyrics extractor is also ready. Lyrics analysis is important for capturing the emotional feel of each asset and placing the right asset at the right time. For example, a nice sunset photo lands with far more emotional impact if the lyric line says something like 'sun goes down'. Spoiler: the lyrics extractor is just an LLM call 😊
The engine now outputs a Normalized Intensity Score.
This score acts as a bridge between the raw audio signal and the Video Model. Instead of just "Cut here," the engine sends instructions like:
High Intensity: Trigger a high-magnitude Zoom Punch + Flash.
Medium Intensity: Execute a hard cut to a new perspective.
Low Intensity: Maintain the shot but adjust the camera "drift."
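Here is a rough sketch of how that mapping could look in Python. This is not the actual Beat Engine code: the `Beat` type, the min-max normalization, the 0.75 / 0.4 thresholds, and the instruction names are all assumptions I made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Beat:
    time_s: float  # beat timestamp in seconds
    energy: float  # raw, unnormalized energy at the beat (hypothetical input)


def normalize(energies):
    """Min-max scale raw energies into the range [0, 1]."""
    lo, hi = min(energies), max(energies)
    if hi == lo:
        # A flat track has no contrast, so treat every beat as low intensity.
        return [0.0 for _ in energies]
    return [(e - lo) / (hi - lo) for e in energies]


def instruction_for(score, high=0.75, medium=0.4):
    """Map a normalized score to an editing instruction (thresholds are assumed)."""
    if score >= high:
        return "zoom_punch_flash"
    if score >= medium:
        return "hard_cut"
    return "adjust_drift"


def plan_transitions(beats):
    """Return (time, score, instruction) for every beat."""
    if not beats:
        return []
    scores = normalize([b.energy for b in beats])
    return [(b.time_s, round(s, 2), instruction_for(s)) for b, s in zip(beats, scores)]


if __name__ == "__main__":
    demo = [Beat(0.5, 2.0), Beat(1.0, 6.0), Beat(1.5, 10.0)]
    for row in plan_transitions(demo):
        print(row)
```

Normalizing per track means a quiet acoustic song and a loud EDM track both use the full range of effects. Whether that is the right call is something I'd still want to test.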
The result? Transitions should track the music's energy instead of firing the same cut on every beat.
The Fourier Transform is hard to wrap your head around.
Most explanations are buried in complex math and dense equations. But I just found a blog post that explains the FFT with pure, vector-based visual intuition.
If you work with data, signals, or ML, bookmark this. It's worth a read.
The resource is in the comments.
This is a heavy multimodal challenge, and I'm building the entire pipeline in public.
I’ll be posting my dev logs, architecture breakdowns, interesting insights, and the actual friction of making audio and vision models play nice.
Follow along to watch me build this. 🚀
The hardest (and most fun) part I'm tackling right now is the "Matchmaker" agent.
I'm building it to parse lyrics, pair high-impact visuals with the right audio cues, and inject smart transitions and motion effects so the final output actually feels like it was paced by a human editor.
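To make the lyric-to-visual pairing concrete, here is a deliberately naive sketch in Python. It is not how the Matchmaker agent works: it just counts word overlap between a lyric line and per-asset tags. The file names, the tags, and the stopword list are hypothetical, and a real version would more likely use embeddings or an LLM.

```python
import re

# Tiny, assumed stopword list so filler words don't count as matches.
STOPWORDS = {"the", "a", "an", "and", "of", "in", "on", "to", "is"}


def tokens(text):
    """Lowercase a lyric line and return its set of meaningful words."""
    return {w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS}


def best_asset(lyric_line, assets):
    """Return the asset whose tags overlap most with the lyric line, or None."""
    words = tokens(lyric_line)
    best_name, best_score = None, 0
    for name, tags in assets.items():
        score = len(words & {t.lower() for t in tags})
        if score > best_score:
            best_name, best_score = name, score
    return best_name


if __name__ == "__main__":
    # Hypothetical assets with hand-written tags.
    assets = {
        "IMG_001.jpg": ["sunset", "beach", "sun"],
        "VID_014.mp4": ["city", "night", "traffic"],
    }
    print(best_asset("When the sun goes down", assets))
```

Exact word overlap misses obvious pairs like "dusk" and "sunset", which is exactly why this part needs something smarter than keyword matching.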
Hot take:
Image models won’t win on aesthetics alone anymore.
They’ll win on:
• fidelity (don’t touch what matters)
• controllability (edit specific regions)
• consistency (multi-image coherence)
On those axes, ChatGPT Images 2.0 looks very strong.
And that’s what production teams actually care about.
I’ve been using Nano Banana 2 for product creatives.
Biggest issue I kept hitting:
It re-renders text on the product itself.
If your product is text-heavy (labels, packaging),
the model subtly changes it and breaks brand accuracy.
Just tried ChatGPT Images 2.0.
And this is the first thing that stood out:
- It preserves product text far more reliably.
No unwanted “creative reinterpretation” of labels.
This is a bigger deal than it sounds.
Because for real-world pipelines:
• Packaging text = compliance
• Branding = non-negotiable
• Even small changes = unusable asset
Most models still fail here.
Second big unlock:
Localized editing actually works.
You can:
→ Change background
→ Adjust composition
→ Keep product untouched
Earlier models struggled with this balance.
What this means in practice:
You can now:
• Keep product fidelity
• Iterate on creatives around it
• Avoid full regeneration loops
That’s a major workflow improvement.
So is it “better than Nano Banana 2”?
Too early to say definitively. Gonna need more testing.
But for:
→ Product-heavy creatives
→ Text-sensitive assets
→ E-commerce pipelines
ChatGPT Images 2.0 already feels like a step ahead.
The real shift:
We’re moving from
“generate everything again”
to:
“edit precisely without breaking what matters”
That’s where these models start becoming production-ready.
As someone building in the AI space, the GPT image to mask feature is a total game-changer. I've been holding out for reliable mask support since Nano Banana 1. Precise control over generation is finally here! 🚀🪄 #MachineLearning #genai #agenticAI