@_cmd8 No, we haven’t open sourced on GH yet, but you can download it over pypi. We just have a prebuilt wheel since we also use rust under the hood.
https://t.co/jzmTr1AKQf
@JustJerry121@jeremyparkphd We’re building mm-ctx for richer multi-modal context management for LLMs/VLMs. There’s a ton of open-source projects around context management, but they fail miserably when you’re looking to build multimodal agents.
If you're building visual agents and you're at #CVPR2026 this year, come find us.
Get a sneak peek at the next generation of @vlmrun's visual agent Orion. We'll be announcing it and demoing live for the first time.
If you're in Denver, stop by the meetup Thursday night. Details and signup in the original post from @jeremyparkphd.
@vlmrun is hosting a Visual Agents Meetup in Denver during CVPR!
We've been building visual coding agents and releasing open-source tools like mm-ctx (multimodal context for agents), and are looking to connect with others working in this space.
Stop by Thursday night to share what you're working on and see our new visual agent in action.
If you're attending CVPR, feel free to reach out! Happy to connect during the week.
Please share this with colleagues who will be at CVPR!
📅 Thursday, June 4th
🕕 6-8 PM MDT
📍 Downtown Denver
Hosted by: @vlmrun, @dineshredy, @jeremyparkphd
Signup link in the comments 👇
@vlmrun is hosting a Visual Agents Meetup in Denver during CVPR!
We've been building visual coding agents and releasing open-source tools like mm-ctx (multimodal context for agents), and are looking to connect with others working in this space.
Stop by Thursday night to share what you're working on and see our new visual agent in action.
If you're attending CVPR, feel free to reach out! Happy to connect during the week.
Please share this with colleagues who will be at CVPR!
📅 Thursday, June 4th
🕕 6-8 PM MDT
📍 Downtown Denver
Hosted by: @vlmrun, @dineshredy, @jeremyparkphd
Signup link in the comments 👇
Excited to share that mm-ctx is now live on @huggingface Spaces! Try it in the browser via an interactive terminal without installing anything: https://t.co/UgRLJDPHB8
mm-ctx – fast, multimodal context for agents.
LLM-based agents handle text fine, but as soon as a directory contains images, videos, or PDFs with visual content, they struggle to understand the full context.
mm-ctx is meant to feel familiar: the Unix tools we already love (find/cat/grep/wc), rebuilt for file types LLMs can't read natively and designed to work with agents via the CLI.
- mm grep "invoice #1234" ~/Downloads searches across PDFs and returns line-numbered matches
- mm cat <document>.pdf returns a metadata description of the file
- mm cat <photo>.jpg returns a caption of the photo
- mm cat <video>.mp4 returns a caption of the video
A few things we obsessed over:
⚡ Speed: Rust core for the hot paths
🏠 Local-first, BYO model: Uses any OpenAI-compatible endpoint: Ollama, vLLM/SGLang, LMStudio with any multimodal LLM (Gemma4, Qwen3.5, GLM-4.6V).
🔗 Composable: stdin + structured outputs
🤖 Drops into any agent via mm-cli-skills: Claude Code, Codex, Gemini CLI, OpenClaw.
We’d love to hear your feedback! Especially on the CLI and what file types and workflows you would like to see next.
I made a rock climbing tool using computer vision!
I prompted @vlmrun's visual agent Orion to segment all of the blue bouldering holds, and it did a good job! It is interesting that now we can prompt VLMs to segment all of the holds, rather than creating a new dataset from scratch to train a model.
With holds detection + pose estimation, I can show how each hold gets activated as a hand or foot uses it. Once we touch the final hold with both hands, the route is completed, and I show the overall path of my torso midpoint.
A tool like this could help climbers understand their movement better. I’m still very much a beginner at bouldering, so I could use all the help I can get 🤣
There are definitely things to improve, but overall I’m encouraged by this first demo 🙂
Let me know what you think in the comments!
Models used:
- @vlmrun's Orion for segmentation
- ViTPose+ Huge for pose estimation (via @huggingface 🤗)
- RT-DETR for person detection (via @huggingface 🤗)
Shoutout to Daniel Reiff and his bouldering + computer vision project for the inspiration!
Manually parsing handwritten intake forms can be slow and prone to error, while VLM Run's HIPAA-ready API allows you to extract the same details in seconds. In this tutorial by @jeremyparkphd, learn how to use VLM Run to extract structured JSON from handwritten healthcare documents at scale.
Through this walkthrough, you will learn how to:
- Upload documents in the Requests tab and run them against your saved skills
- Enable confidence scores and grounding to see exactly where each field came from in the original document
- Edit incorrect extractions and provide feedback to improve extraction over time
- Run the same workflow programmatically via the VLM Run API as shown in Google Colab
Announcing Orion Skills! 🚀
Rather than rewriting prompts every time you want to define a specific task, you can now package all of that knowledge into a reusable skill.
Why skills?
- Reusable: Create a skill once, reference it from any endpoint (image, document, video, audio, agent)
- Versionable: Pin a specific skill version for reproducible results, or use "latest" to always get the newest revision
- Composable: Pass multiple skills in a single request, or combine them with custom schemas
Unlike purely text-based skills, we have reimagined what skills mean for visual agents and how to codify visual workflows into skills.
Try skills in chat today!
And check out this skills creation tutorial by @jeremyparkphd 👇