Last week we introduced the Streaming API along with a live demo so you can see it in action.
By popular request, we've now written a blog post explaining how the demo works.
Full blog post: https://t.co/J6sdO3vnqQ
Demo: https://t.co/zx4kawshOY
When we launched Inter-1 a month ago, we said streaming inference was next.
Today it's live!
The Streaming API brings the full set of Inter-1 capabilities to live video, delivering events as the conversation unfolds.
Read the full write-up: https://t.co/TLSXo3Aq3s
Gave a talk at @clawcon on how we use Noodle (our OpenClaw coworker) at @InterhumanAI
Side quest from the talk: we're giving away a Mac mini.
Build something with the Interhuman API, post it on X, tag @InterhumanAI.
Winner announced June 1.
👉 https://t.co/ks2esFF3Ir
⇨ @InterhumanAI is a social intelligence API that enables AI products to understand human behavior by analyzing signals like hesitation, engagement, and confusion across voice, facial expressions, body language, and text.
“Social signal” and non-verbal social intelligence can be abstract.
Real Talk Studio uses Interhuman AI to evaluate not just what people say, but how they show up: empathy, composure, and real engagement under pressure.
In this video, Toby from Real Talk Studio is demoing how non-verbal feedback works in practice.
Try it yourself: https://t.co/dTW51HJlgo
Full case study: https://t.co/ct9FTFpjjz
Most AI still reduces people to a few “emotions.”
Real interaction is a stream of social signals: hesitation, confusion, engagement, skepticism, stress, expressed through words, voice, and body.
Inter‑1 doesn’t guess a single label. It reads 12 social signals grounded in concrete cues across video, audio & text (gaze shifts, pauses, prosody, wording, posture, etc.).
We built a formal ontology of these signals + hundreds of cues, and trained Inter‑1 to reason over that structure.
From “emotion detection” to evidence‑grounded social signal understanding.
Full write‑up: https://t.co/jlaPNYJdqJ
When you show an AI model a video of someone speaking and ask what's going on, most outputs look like a summary of what was said.
But in many cases, they miss a very important part:
Someone pauses before answering.
Looks away mid-sentence.
Changes tone slightly on a key point.
Those signals shape how the message is received.
They often carry more meaning than the words themselves.
Inter-1 is built to capture that layer.
It processes video, audio, and text together, in temporal alignment, and detects social signals like hesitation, confusion, engagement, and uncertainty.
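To make "temporal alignment" concrete, here is a minimal sketch of one way cues from different modalities could be grouped by time. It is an illustration only, not Inter-1's method: the function names `align_cues` and `multimodal_windows`, the fixed 2-second window, and the sample cues are all assumptions made for this example.

```python
from collections import defaultdict


def align_cues(cues, window_s=2.0):
    """Bucket (timestamp_s, modality, cue) tuples into fixed time windows.

    Hypothetical helper: fixed-size windows are an assumption chosen for
    simplicity, not a description of how Inter-1 aligns modalities.
    """
    windows = defaultdict(list)
    for timestamp_s, modality, cue in cues:
        windows[int(timestamp_s // window_s)].append((modality, cue))
    # Return windows in chronological order.
    return {index: windows[index] for index in sorted(windows)}


def multimodal_windows(cues, window_s=2.0):
    """Return indices of windows where more than one modality produced a cue."""
    aligned = align_cues(cues, window_s)
    return [
        index
        for index, items in aligned.items()
        if len({modality for modality, _ in items}) > 1
    ]


# Made-up sample cues, for illustration only.
sample = [
    (0.4, "text", "filler word"),
    (1.1, "audio", "long pause"),
    (1.6, "video", "gaze shift"),
    (5.2, "audio", "pitch rise"),
]
```

With this sample, the text, audio, and video cues in the first two seconds fall into the same window, while the later pitch rise stands alone. Co-occurring cues like these are the kind of evidence a transcript-only view cannot line up.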
Inter-1 was trained on a dataset that combines in-the-wild video with targeted synthetic data to cover a wider range of signals, modalities, and interaction settings.
One challenge with this work is data. A lot of datasets in affective computing are organized around basic emotion labels, which weren’t a good fit for the interaction signals we wanted to model. So we built our own to train Inter-1.
For each social signal, it also returns a structured rationale showing which cues it used across modalities and how they contributed to the result. So instead of just getting a label, you can see how the model arrived at it.
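As a rough picture of what a structured rationale could look like in client code, here is a minimal sketch. The `Cue` and `SignalResult` classes, their field names, and the numeric `weight` are hypothetical and do not reflect the actual Interhuman API response schema; the sample values are made up.

```python
from dataclasses import dataclass, field


@dataclass
class Cue:
    modality: str      # "video", "audio", or "text"
    description: str   # human-readable cue, e.g. "pause before answering"
    weight: float      # hypothetical contribution score, not a real API field


@dataclass
class SignalResult:
    signal: str                              # e.g. "hesitation"
    cues: list[Cue] = field(default_factory=list)

    def by_modality(self) -> dict[str, list[str]]:
        """Group cue descriptions by the modality they came from."""
        grouped: dict[str, list[str]] = {}
        for cue in self.cues:
            grouped.setdefault(cue.modality, []).append(cue.description)
        return grouped

    def top_cue(self) -> Cue:
        """Return the cue with the largest (hypothetical) contribution."""
        return max(self.cues, key=lambda cue: cue.weight)


# Made-up example values, for illustration only.
result = SignalResult(
    signal="hesitation",
    cues=[
        Cue("audio", "pause before answering", 0.5),
        Cue("video", "gaze shifts away mid-sentence", 0.3),
        Cue("text", "hedged wording", 0.2),
    ],
)
```

Keeping the cues next to the label means a product can show why a signal was reported, for example by listing `result.by_modality()` alongside the signal name.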
Inter-1 is built on a different ontology: 12 social signals, derived from behavioral science research on how humans communicate intent, engagement, affect, and relational dynamics through verbal, paraverbal, and nonverbal channels.
A lot of current systems work mainly from transcripts. That captures the words, but it misses a lot of the signal: pauses, timing, tone, gaze, posture, and how behavior changes over the course of an interaction.