Julia Kim

@_juliakeem

Co-founder & CEO @OpenGraph_Labs, building multimodal data infrastructure | FFC W26 @pearvc | Zhejiang University

Korea

Joined October 2022

212 Following

1K Followers

86 Posts

Pinned Tweet

Julia Kim

@_juliakeem

3 months ago

Data can’t just be outsourced 🤯
To iterate fast, robotics teams must own their data infrastructure.
Introducing SyncField: turnkey data infrastructure for in-the-wild data collection (Best for UMI-style & Embodied human)
#Robotics #UMI #DataCollection

136

15K

_juliakeem retweeted

Youngsun Wi @WiYoungsun

13 days ago

TactAlign was accepted to RSS 2026! Huge thanks to the reviewers for their thoughtful feedback. See you on the other side of the world 🤓🙌

13K

_juliakeem retweeted

Jerry Han

@JerryHan_og

about 1 month ago

Physical AI needs human data, but human data capture is still way too hard.

Not because pressing record is hard. Because the moment you add cameras + sensors, everything gets messy:
Every device has its own clock.
Streams can silently fail.
Recording health has to be checked.
Start / stop has to line up.
Synchronization has to be solved after.

SyncField Desktop turns it into one workflow.
Auto-discover cameras + sensors.
Connect streams with aliases.
Drag, arrange, and monitor panels.
Record everything in one click.
Review synchronized playback.
Get frame-aligned data on disk.

No handclaps. No LED flashes. No sync scripts. No file wrangling.
Just humans doing real tasks, captured cleanly.

If you're working on human data for Physical AI, reach out: https://t.co/MR4Mljaut5
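
As a rough illustration of the "streams can silently fail" point in the post above, here is a minimal Python sketch of a recording health check. It is not SyncField's implementation: the function name `find_dropouts`, the `tolerance` factor, and the sample timestamps are all assumptions made for this example.

```python
def find_dropouts(timestamps, expected_hz, tolerance=1.5):
    """Return (start, end) pairs where consecutive samples are too far apart.

    timestamps  -- sorted sample times in seconds (hypothetical input format)
    expected_hz -- nominal sample rate of the stream
    tolerance   -- a gap longer than tolerance * nominal period counts as a dropout
    """
    period = 1.0 / expected_hz
    gaps = []
    for prev, cur in zip(timestamps, timestamps[1:]):
        if cur - prev > tolerance * period:
            gaps.append((prev, cur))
    return gaps


# A 10 Hz stream that went quiet between t=0.2 s and t=0.7 s.
camera_ts = [0.0, 0.1, 0.2, 0.7, 0.8]
print(find_dropouts(camera_ts, expected_hz=10))
```

A check like this only flags gaps after the fact. A live monitor would apply the same comparison to the time since the last received sample.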

489

Julia Kim

@_juliakeem

about 1 month ago

@Kensuke_ee_JP @rayanboukhanifi was really nice meeting you! see you in sf soon!

_juliakeem retweeted

@Kensuke_ee_JP

about 1 month ago

Small Shenzhen meetup hosted with @rayanboukhanifi is done. Glad that some people showed up, even though I know very few people in Shenzhen! https://t.co/87dKKboXHX

Julia Kim

@_juliakeem

about 2 months ago

Contact-rich manipulation depends on contact information (tactile sensing and force magnitude) and its importance grows with dexterity

Chris Paxton

@chris_j_paxton

about 2 months ago

When it comes to manipulation tasks, tactile and force data are really important.

14K

_juliakeem retweeted

Jerry Han

@JerryHan_og

about 2 months ago

https://t.co/cmQBKGe1Cl

Julia Kim

@_juliakeem

2 months ago

World models aren't just bigger video models.
What we truly need:
(1) multimodal environments
(2) structure-based reasoning (geometry, physics, affordances, spatial & symbolic reasoning)
(3) physics-aware interactions
(4) continuous real-world data loops

Fan-Yun Sun ✈️CVPR

@sunfanyun

2 months ago

@chrmanning and I went on @latentspacepod to talk about world models. https://t.co/xAtMrXCNU3

56K

11K

Julia Kim

@_juliakeem

2 months ago

@junfanzhu98 @BostonDynamics @Stanford @AGIBOTofficial @intbotai @BytedanceTalk @Google @moonlake @Rivian @Meta @Samsung @UCBerkeley @cruise @encord_team @ManycoreTech @OpenGraph_Labs @neuralmotion @AMD @nvidia @aurorafeng_01 @FusionFundVC @BoostVC it was a really productive saturday

270

_juliakeem retweeted

Junfan Zhu 朱俊帆 ✈️ CVPR

@junfanzhu98

2 months ago · San Francisco

📖Robotics World Model Reading Club #01 Summary
@BostonDynamics, @Stanford, @AGIBOTofficial, @intbotai, @BytedanceTalk, @Google, @moonlake,
@Rivian, @Meta, @Samsung, @UCBerkeley, @Cruise, @encord_team, @ManycoreTech, @OpenGraph_Labs, @neuralmotion, @AMD, @nvidia, @oysterecosystem, @Zoom, @FusionFundVC, @BoostVC, @yzilabs...

policy learning→WM
VLA: observation→action
WAM: latent world→future trajectory→controllable action
→Shift=reactive mapping→controllable simulation
@nvidia Gr00t (7B, high mem efficiency on Thor)≈DreamDojo-style WAM. Bottleneck is NOT scale, but missing unified interface across perception–geometry–physics–action.

🧠 Representation
Pixel space is redundant & non-geometric.
Trend→Explicit 3D backbone:
point cloud/mesh
object+sub-object representations
geometry-aware tracking (contact, affordance)
Point-flow pipeline:
detect→sample keypoint→track→dynamic graph
Core tradeoff=which points&density (motion saliency/affordance attn)

🌍 4D Reconstruction→Unified Latent
@GoogleDeepMind D4RT encodes video→temporally consistent latent field:
geometry+motion+visibility unified
Outputs: point clouds, 3D tracks, full reconstruct (300× faster)
❗Gap: no shared latent across:
vision/geometry/semantics/action/physics

⚙️ Physics Gap Sim2Real
Gap=physics, not vision:
discontinuous contact
deformable objects (∞ DoF)
non-differentiable friction
Engineering fails: brittle collision meshes, unstable contact
Solutions:
learned physics proxy
hybrid pipeline
convex decomposition (geometry → collision proxy, ~5× speedup)

🎥 Video Pretrain≠Interaction
Video=strong prior but no counterfactuals
Missing: force, depth, tactile, proprioception
→can't answer: what if act differently

⏱️ Control≠Inference
Real world=high-freq loop
action chunking
latent action
FastWAM (train with rollout, infer without)
KV-cache (AutoGaze)
👉control selects feasible trajectory, not full future modeling
Thor is good, but LLM scaling≠robotics scaling

📉 Data
No “robotics internet”:
sim/video/teleop/factory logs fragmented
no unified labeling or metrics
Reality:
factories use fixed primitives
generalization often unnecessary
Bitter lesson: data flywheel>pipelines (but robotics lacks one)

🦾 Embodiment Gap
manipulation→full-body intelligence
loco-manipulation+gaze+coordination
Need cross-embodiment align (space, action, kinematics)

🔁 Sim2Real Pipeline
human data→semantics→geometry→collision proxy→sim→fine-tuning
Unsolved: deformables, contact stability, long horizon

🧩 Paper
VQVAE (discrete latent)
VL-JEPA (predictive align)
token pruning (efficiency)
recursive models (depth reuse)
multi-path exploration (GRPO)

⚡ Infra→SLM
Real-time stack (LLM infra too slow)
→WM must compress into SLMs
Future=small, domain-specialized, grounded models

🧪Bottlenecks
no unified representation
no data flywheel
inference–control mismatch
physics
fragmented embodiment

Reality can't be scraped like internet.
It must be sensed, interacted, simulated.
👉 Goal: jointly optimize representation+simulation+action under physics constraints
💡minimal sufficient representation?
can video DiT become WAM?
vertical SLM inevitable?
robotics ImageNet moment?
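
The "action chunking" item under Control≠Inference in the summary above can be sketched in a few lines of Python. This is a toy illustration, not code from any of the systems mentioned: `run_chunked`, the `policy(obs, k)` signature, and the counter-based environment are assumptions made for the example. A slow model is queried once per chunk, and its actions are replayed at control rate in between.

```python
def run_chunked(policy, observe, act, steps, chunk_size):
    """Run a control loop that queries the policy only when its chunk is used up.

    policy(obs, k) -- hypothetical slow model returning k future actions
    observe()      -- returns the current observation
    act(a)         -- sends one action to the robot (runs every control tick)
    Returns the number of policy calls made.
    """
    queue = []
    calls = 0
    for _ in range(steps):
        if not queue:
            queue = list(policy(observe(), chunk_size))
            calls += 1
        act(queue.pop(0))
    return calls


# Toy environment: the observation is a step counter, actions are logged.
state = {"t": 0}
executed = []

def observe():
    return state["t"]

def act(action):
    executed.append(action)
    state["t"] += 1

def toy_policy(obs, k):
    return [obs + i for i in range(k)]

calls = run_chunked(toy_policy, observe, act, steps=10, chunk_size=4)
print(calls, executed)
```

With 10 control ticks and a chunk size of 4, the policy is called 3 times instead of 10. The trade-off is that actions inside a chunk do not react to new observations.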

321

287

61K

Julia Kim

@_juliakeem

2 months ago

@s_wistreich Love this work! amazing @s_wistreich

221

Julia Kim

@_juliakeem

2 months ago

data is being collected in regions where robots won’t be deployed anytime soon due to low labor costs, while the environments where deployment is actually viable remain largely inaccessible and require smarter, more strategic approaches to unlock

Jacob Zietek

@JacobZietek

2 months ago

Robotics has spent decades optimizing for research. Deployment requires a completely different kind of person: operators, industrialists, and outsiders the field typically ignores. There's a wave of people who want to build in robotics. The field doesn't know what to do with them. New essay, Robotics Needs Fewer Roboticists* below 👇

405

198

76K

538

_juliakeem retweeted

OpenGraph Labs 🧤 @OpenGraph_Labs

3 months ago

Excited to share that @OpenGraph_Labs has been accepted into @NVIDIA’s Inception Program 🚀 Our mission is to build reliable infrastructure for multimodal data capture, powering the next generation of robotics & world models 🌎

https://t.co/cXMBokeSey

_juliakeem retweeted

Jerry Han

@JerryHan_og

3 months ago

World models can predict the next frame. They can't predict the next touch.
That's the gap visuo-tactile world models will close.

Is the robot gripping hard enough?
Is the surface rigid or soft?
When exactly does contact begin and end?
Vision doesn't know. Tactile does.

We built @OpenGraph_Labs to capture what cameras miss.
Egocentric RGB × 5-finger multi-taxel tactile gloves.
Frame-synced. Calibrated. In-the-wild.
No lab setups. No scripted pick-and-place. Just humans doing real tasks in real stores.

Watch the exact moment contact happens. The pressure map lights up in sync.
Every touch. Every frame. 👇

118

13K

Julia Kim

@_juliakeem

3 months ago

Yeah, that’s true. Their gloves and the data collected from them are compatible with their robots. I’m also betting on human data but only when it’s captured as high-quality multimodal data. What I’m working on is a multimodal data capture tool that helps collect high-quality, in-the-wild data while keeping different sensors time synced and handling issues like sensor drift automatically

Julia Kim

@_juliakeem

3 months ago

@bercankilic 🤝🤝🤝

498

Julia Kim

@_juliakeem

3 months ago

Robotics & world models require real-world multi-sensory data at scale. But collecting vision, tactile, and IMU data simultaneously is much harder than it sounds.

Each sensor runs at its own frequency, latency, and clock domain. Integrating them means dealing with hardware quirks, driver inconsistencies, and constant timestamp drift.

This is fundamentally a synchronization problem. And it gets harder as more modalities are added and tasks become longer-horizon, because temporal misalignment compounds: the model loses the causal structure of what happened and when.

We learned this the hard way building our own pipelines. That experience led us to build a unified platform for multimodal capture, one that handles time alignment, hardware abstraction, and data integrity from day one.

@OpenGraph_Labs built "SyncField - Multimodal Data Capture System", which:
▪️ Supports any hardware configuration (multiple cameras + tactile + IMU)
▪️ Automatically synchronizes all modalities
▪️ Produces output that is fully time-aligned and ready to train on

It already powers humanoid robotics teams, data collection companies, and university research labs.

If your team is collecting multimodal robotics data, we'd love to talk. (now onboarding teams one by one)
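
To make the synchronization problem in the post above concrete, here is a minimal Python sketch under stated assumptions. It is not how SyncField works internally, which the post does not describe. It assumes paired (device time, host time) samples are available for each sensor, models clock offset and drift as a straight line, and then matches samples across streams by nearest timestamp. The names `fit_clock_map` and `align_nearest` and all numbers are hypothetical.

```python
from bisect import bisect_left


def fit_clock_map(device_ts, host_ts):
    """Least-squares fit of host ≈ a * device + b (a = drift rate, b = offset)."""
    n = len(device_ts)
    mean_d = sum(device_ts) / n
    mean_h = sum(host_ts) / n
    var_d = sum((d - mean_d) ** 2 for d in device_ts)
    cov = sum((d - mean_d) * (h - mean_h) for d, h in zip(device_ts, host_ts))
    a = cov / var_d
    b = mean_h - a * mean_d
    return a, b


def align_nearest(ref_ts, other_ts, max_gap):
    """For each reference time, return the index of the nearest sample in the
    sorted list other_ts, or None if the nearest one is farther than max_gap."""
    matches = []
    for t in ref_ts:
        i = bisect_left(other_ts, t)
        candidates = [j for j in (i - 1, i) if 0 <= j < len(other_ts)]
        best = min(candidates, key=lambda j: abs(other_ts[j] - t), default=None)
        if best is not None and abs(other_ts[best] - t) > max_gap:
            best = None
        matches.append(best)
    return matches


# An IMU whose clock starts 5 s behind the host and runs 0.1% slow.
a, b = fit_clock_map([0.0, 1.0, 2.0, 3.0], [5.0, 6.001, 7.002, 8.003])

# Match camera frames to tactile samples once both are on the host clock.
camera_ts = [0.0, 0.033, 0.066]
tactile_ts = [0.001, 0.031, 0.2]
print(align_nearest(camera_ts, tactile_ts, max_gap=0.01))
```

The `max_gap` threshold matters: without it, a frame with no nearby tactile sample would be silently paired with a stale one, which is the kind of misalignment the post warns about.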

295

199

17K

_juliakeem retweeted