Where it falls apart:
→ Prompt adherence loosens after the opening frames
→ Complex transitions dissolve around the halfway mark
→ No mid-generation steering: you generate, you don't direct
Still: it's accessible with X Premium, needs no setup, and outputs a ready-to-post 30s clip. The floor for 'usable AI video' just got lower. Worth testing if you have the sub. The ceiling is visible. What's under it is solid.
What actually holds up:
→ Subject consistency across the full 30s is better than expected
→ Motion physics feel natural — not that stuttery diffusion-loop look
→ 30s matches Reels/Shorts natively, no padding or cropping needed
→ Zero friction to start. For concept sketches and quick ideation, it's real.
Cosmos Transfer 2.5 takes structured inputs — depth maps, segmentation, poses — and outputs photoreal video. Not prompt-and-pray. Precise shot control.
Early adopters are robotics teams. The creative applications are obvious. https://t.co/1WJO6SJb2z
Depth map → photoreal video. Open weights. On Hugging Face.
NVIDIA just dropped Cosmos 2.5 at GTC. It's the clearest signal yet that controllable video generation is here.
$300M just went into AI video. PixVerse, unicorn status, 100M users.
The market is real — nobody's debating that anymore.
But prompt-in → video-out → cross your fingers is still the dominant paradigm. Scale and craft are different things. The gap between them is where the actual work is.
Most people are still using AI video like a slot machine — prompt, generate, hope.
First + last frame is the director's chair.
Start with the end in mind. Stack clips. Build scenes.
Save this — it's the workflow shift that actually changes what you can make.
Most people using Veo 3.1 are still prompting blind.
You can lock the first and last frame — let it generate everything in between.
That's not a minor feature. That's directorial control.
Practical workflow ↓
Why this matters more than it sounds:
You stop guessing what the AI will do.
Define start + end → constrain the output space → every clip has a defined arc.
Stack these clips in Flow's sequencer and you've got a scene with actual narrative logic.
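Rough sketch of the stacking idea in Python. This is not the Veo or Flow API, only a way to plan the clips before generating anything. `ClipSpec`, `chain_clips`, and the file names are made up for illustration. The point it shows: each clip's last frame doubles as the next clip's first frame, so the scene stays continuous.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClipSpec:
    """Plan for one generated clip (hypothetical structure, not a real API)."""
    first_frame: str  # path to the locked opening image
    last_frame: str   # path to the locked closing image
    prompt: str       # what should happen between the two frames


def chain_clips(keyframes, prompts):
    """Turn N keyframes and N-1 prompts into N-1 clips that share boundaries."""
    if len(keyframes) != len(prompts) + 1:
        raise ValueError("need exactly one more keyframe than prompts")
    # Pair each keyframe with the next one; the shared frame joins the clips.
    return [
        ClipSpec(start, end, prompt)
        for start, end, prompt in zip(keyframes, keyframes[1:], prompts)
    ]


# Example file names and prompts are placeholders.
scene = chain_clips(
    ["door_closed.png", "door_open.png", "street.png"],
    ["the door swings open slowly", "camera pushes through the doorway"],
)
```

Three keyframes give two clips, and the end of clip one is the start of clip two by construction.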
The two engineers who scaled Cursor to $2B ARR just joined xAI.
Musk's own words: Grok coding "wasn't built right the first time."
So he hired the people who built the thing that was.
The honest part: the bottleneck is still me.
Posts get drafted while I'm coding. The content calendar moves without me touching it. But I still review everything, approve every final post.
Somewhere in here I became an editor, not a marketer. Not sure yet if that's better or worse.
My content manager commented on a task at 2am. My copywriter filed a response by 3am. My image designer queued assets by morning.
None of them are human.
I didn't hire a marketing team for ReelSmith. I deployed one.
Paperclip handles task routing and agent coordination. Sable (Content Manager) reviews strategy and approves drafts. Rook writes posts and threads. Prism generates images.
Real task queues. Real comment threads. Real escalation paths. The agents even @-mention each other when blocked.
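A toy sketch of that routing in Python, for anyone picturing how it fits together. This is not Paperclip's API. The `Task` class, the `ROUTES` table, and the choice to escalate blocked work to Sable are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    title: str
    kind: str  # "strategy", "copy", or "image" (assumed task types)
    comments: list = field(default_factory=list)


# Assumed mapping from task type to the agent that owns it.
ROUTES = {"strategy": "Sable", "copy": "Rook", "image": "Prism"}
ESCALATE_TO = "Sable"  # assumption: blocked work goes to the content manager


def route(task):
    """Pick the agent responsible for this kind of task."""
    return ROUTES[task.kind]


def block(task, agent, reason):
    """Record a blocked comment with an @-mention and return the new owner."""
    task.comments.append(f"{agent}: blocked, {reason}. @{ESCALATE_TO}")
    return ESCALATE_TO


queue = [Task("Launch thread", "copy"), Task("Hero image", "image")]
assignments = [(task.title, route(task)) for task in queue]
new_owner = block(queue[0], "Rook", "needs an approved angle")
```

The comment thread on the task is the audit trail: who got stuck, why, and who was pinged.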
Open-source video is getting uncomfortably good.
LTX 2.3: 22B params, native 4K at 50fps, audio sync. Helios: real-time generation on a single H100.
The moat for closed video APIs is evaporating faster than most teams realize.
Most people crop their AI images.
There's a smarter way: use Gemini Nano Banana 2 to extend the scene instead.
The prompt:
"Take this [aspect ratio] image and extend it to [target aspect ratio]. Generate new scene content that seamlessly continues the existing environment — same lighting, same style, same perspective. Do not stretch or distort the original. Expand outward: add new background elements, ground, sky, or environment details that feel naturally continuous. Output as a single unified image at [target dimensions]."
9:16 portrait → 16:9 landscape in one step. No cropping. No black bars.
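If you want to fill in the brackets programmatically, here is a small Python sketch. The helper names are hypothetical, and the template in the example is an abbreviated version of the prompt above. It assumes you want the smallest canvas at the target ratio that still contains every original pixel, which matches the "do not stretch" instruction.

```python
from fractions import Fraction


def extended_size(width, height, target_ratio):
    """Smallest canvas with the target ratio that contains the original."""
    target = Fraction(*target_ratio)
    if Fraction(width, height) < target:
        # Original is too narrow: keep the height, add width on the sides.
        return (round(height * target), height)
    # Original is too wide (or already matches): keep the width, add height.
    return (width, round(width / target))


def fill_prompt(template, src_ratio, dst_ratio, size):
    """Replace the bracketed placeholders used in the prompt."""
    return (
        template
        .replace("[aspect ratio]", src_ratio)
        .replace("[target aspect ratio]", dst_ratio)
        .replace("[target dimensions]", f"{size[0]}x{size[1]}")
    )


# Abbreviated template; paste the full prompt text here in practice.
template = (
    "Take this [aspect ratio] image and extend it to [target aspect ratio]. "
    "Output as a single unified image at [target dimensions]."
)
size = extended_size(1080, 1920, (16, 9))
prompt = fill_prompt(template, "9:16", "16:9", size)
```

For a 1080x1920 portrait this asks for a 3413x1920 canvas, so the model only has to invent content to the left and right.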
Gemini Nano Banana 2 can generate images where the words ARE the shape: the subject is drawn entirely out of text.
This is called typographic art, or a text-art portrait, and most people don't know image models can do this.
The prompt:
"Create a text-art portrait of [subject]. Build the entire silhouette and details using only small, densely packed words related to [topic]. Words should form the image like pixels — no outlines, no solid fills, only text. Dark background, white/light text, high contrast. The words should be readable and meaningful, not random. Vary font size for depth and shading effect."
Works for: musicians (music words), athletes, animals, brand mascots.
I asked Gemini Nano Banana 2 to predict where people look on any website.
The result is a heatmap overlay that shows attention patterns before a single user visits.
Here's the prompt:
"Analyze this website screenshot and generate a predictive attention heatmap overlay. Show where users are most likely to look first using warm colors (red/orange for high attention) fading to cool colors (blue/green for low attention). Apply the heatmap as a semi-transparent overlay on the original screenshot. Include a legend. Focus on: headlines, CTA buttons, images, and navigation elements."
Use case: Landing page optimization without A/B testing. Know which CTA gets ignored before launch.
Something I learned building Agents:
Ask an AI to do creative planning and technical execution in one prompt, and you get mediocre results at both.
Split them with custom system instructions. One pass for creative vision. Second pass for technical breakdown.
The quality difference is night & day.
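A minimal Python sketch of the two-pass split. The model is passed in as a plain function, so this is not tied to any real SDK. The system instructions, `two_pass`, and `fake_model` are illustrative assumptions, and the fake model only echoes its input so the flow is visible.

```python
from typing import Callable

# Assumed interface: (system_instruction, user_message) -> reply text.
Model = Callable[[str, str], str]

CREATIVE_SYSTEM = (
    "You are a creative director. Describe the vision only: mood, story "
    "beats, visuals. No technical detail."
)
TECHNICAL_SYSTEM = (
    "You are a technical director. Turn the given vision into a concrete "
    "shot-by-shot breakdown. Do not change the creative intent."
)


def two_pass(model: Model, brief: str) -> dict:
    """Run the creative pass first, then feed its output to the technical pass."""
    vision = model(CREATIVE_SYSTEM, brief)
    breakdown = model(TECHNICAL_SYSTEM, vision)
    return {"vision": vision, "breakdown": breakdown}


def fake_model(system: str, message: str) -> str:
    """Stand-in for a real model call; tags the message with the active role."""
    role = "creative" if system == CREATIVE_SYSTEM else "technical"
    return f"[{role}] {message}"


result = two_pass(fake_model, "30s ad for a white mug")
```

Swap `fake_model` for a real client call and the structure stays the same: the second pass never sees the original brief, only the finished vision.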
I turned a boring white mug photo into a lifestyle product shot — no photographer, no studio.
Just this prompt in Gemini Nano Banana 2:
"Transform this plain product photo into a premium lifestyle shot. Place the product in a cozy coffee shop environment with warm, natural window lighting. Add subtle background elements: a blurred wooden table, warm bokeh lights, and steam rising from the cup. Keep the product sharp and central, make it look like a professional brand campaign photo. Cinematic color grading, shallow depth of field."
E-commerce brands pay thousands for shots like this. Now it's a 30-second prompt.
Works on: mugs, bottles, skincare, tech accessories, food products — anything with a flat, plain background.