Demystifying Vision Language Action (VLA) models! 🤖🚗
We believe the best way to grow is to learn from each other. In a recent internal session, Naren and Navvrat took the team through the mechanics of VLA architecture.
Watch as Naren explains how these advanced models take visual and text inputs, process them through an LLM "brain," and output precise trajectory waypoints for autonomous navigation.
A huge shoutout to Rithika and Denina for organizing these "Thirsty Thursdays" all about AI innovation. ! 🙌
#FastCodeAI #MachineLearning #ArtificialIntelligence #TechInnovation #VLAModels #TeamCulture
“Compute the ground truth.”
Fifty-odd minutes into the class, my voice is the only one I can hear.
A tiled wall of black rectangles and muted mics stares back at me. Everyone seems tired. Or bored. Or both.
We’ve gone from handwritten digits to linear classifiers to softmax and cross-entropy. I’ve drawn the loss curve three times. I’ve said “scalar” and “vector” so often that even I’m slightly sick of the words.
I pause, take a breath, and ask the question again.
“So. What is the loss a function of?”
Silence.
Zoom silence is different from classroom silence. In a room, you at least get shuffling, eye contact, guilt. On Zoom, you get a frozen grid and the suspicion that half your class has gone to make Maggi.
“Okay,” I say, smiling into the void. “I’m going to start picking names, or this becomes a monologue.”
I scroll through the participant list.
“Rohit?”
There’s a tiny delay, then a click. His mic unmutes.
“Yes, sir.”
“Can you tell me, in your own words, what the loss takes as input?”
There’s a pause. I can almost hear the gears turning on the other side of the screen.
“It’s… the output,” he says slowly. “And your… ground truth that we have computed. Both can be vectors. And the loss is a scalar.”
I should be happy. This is, structurally, almost perfect. Output, ground truth, vectors, scalar. He’s paying attention.
But my mind has already stopped at four words.
Ground truth that you compute.
Something in me tightens. I could let it go.
But this is a first class. They’re going to build the rest of their understanding on the phrases they pick up today.
I choose.
“You don’t compute the ground truth,” I hear myself say, more sharply than I intended. “Vocabulary is important. You label it. Ground truth is something the world gives you, or a human annotates.” I soften my tone.
“So the loss is a function of output and target. The target is our ground truth label. We don’t compute it from the model. We bring it from outside.”
“Yes, sir,” he says.
Because in the end, that is how we compute anything together: not just with code and numbers, but with the vocabulary we share.
How we communicate.
How we connect.