Eagle2.5 natively supports long-context inputs without using any compression module.
Eagle2.5-8B has:
• achieved SOTA on 6 out of 10 long video benchmarks
• beaten GPT-4o (0806) on 3/5 video tasks
• beaten Gemini 1.5 Pro on 4/6 video tasks
• achieved a SOTA result on the Hour-long video benchmark.
So much great work lately from Nvidia, the "King of American Open-source AI"!
- Crossed 1,000 total public repositories on @huggingface (820 models, 249 datasets & 57 spaces) & almost 60,000 followers
- Current #1 trending model on HF with LocateAnything and #5 trending with PiD
- Announced that they're adopting the @linuxfoundation OpenMDW framework
- Released Cosmos 3, Omnimodal World Models for Physical AI & Alphamayo 2 Super, an open model for autonomous driving
- Announced the upcoming release of Nemotron 3 & work on Nemotron 4
Thank you @nvidia for all the work you're doing for the ecosystem and open-source AI. Can't wait for the next few months!
We present Parallel Box Decoding, which improves both decoding efficiency and localization accuracy for vision-language grounding. Please check out more examples and the demo on the project page: https://t.co/okB0s8uSvm
This #CVPR2026 paper from our research team is trending #1 on @HuggingFace 🤗
Meet LocateAnything: a vision-language detection model that rethinks bounding box prediction. For AI agents and robots, “seeing” is only useful if a model can pinpoint where something is fast enough to act.
Trained on 138M high-quality samples, LocateAnything decodes bounding boxes in parallel instead of one coordinate at a time, improving localization accuracy while dramatically increasing throughput for visual grounding and detection.
Project page: https://t.co/O7JMe8tzFM
Meet Nemotron 3 Nano Omni 👋
Our latest addition to the Nemotron family is the highest efficiency, open multimodal model with leading accuracy.
30B parameters. 256K context length. 🧵👇
One of our roles in LLM/VLM research at NVIDIA is to explore effective data recipes for training large-scale models and share them with the public, an area where transparency has been limited, as seen with models like Gemini, GPT-4o, and Qwen-VL. The Eagle2 project aligns closely with this mission. In this work, we have openly detailed our findings in curating the datasets to develop a frontier VLM, and we’re glad to see that the community is finding these contributions valuable.
I did not notice this until just now. Thank you @andimarafioti for the recommendation! Very glad that even though Eagle 2 is not our latest work, people still find it very useful.
🥇Our NVIDIA Llama Nemotron Nano VL model is #1 on the OCRBench V2 leaderboard.
Designed for advanced intelligent document processing and understanding, this model extracts diverse info from complex documents with precision, all on a single GPU.
📗 Get the technical details on the newest Nemotron model ➡️ https://t.co/C03WxEBoNG
📝 Try out the NVIDIA NIM ➡️ https://t.co/DOrJdYDbMG
Cool paper from @nvidia
Prior methods for training LLMs for tool use rely on imitation or distilled reasoning, limiting generalization.
Nemotron-Research-Tool-N1 uses rule-based reinforcement learning.
It trains models with binary rewards evaluating only tool call structure and correctness, enabling autonomous reasoning.
📌 Binary format and correct tool call reward teaches autonomous reasoning over imitation.
📌 Binary rule-based reward prevents reward hacking, boosting real-world generalization (80.38 percent Live BFCL).
📌 Using binary rewards on structure and tool call leverages SFT data without detailed reasoning steps.
----------
Methods Explored in this Paper 🔧:
→ The model uses a structured reasoning and action output format.
→ A binary reward checks adherence to this format and exact match of parsed tool calls to ground truth.
→ Training uses the Group Relative Policy Optimization (GRPO) algorithm on processed datasets.
→ Nemotron-Research-Tool-N1-7B achieved 84.82 percent accuracy on BFCL and 81.28 percent on API-Bank, outperforming GPT-4o.
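The binary reward described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the `<think>` / `<tool_call>` tag names, the JSON encoding of tool calls, and the function names are all assumptions made for the example.

```python
import json
import re

# Assumed output format for this sketch: reasoning in <think> tags,
# followed by a JSON list of tool calls in <tool_call> tags.
PATTERN = re.compile(
    r"\s*<think>(.+?)</think>\s*<tool_call>(.+?)</tool_call>\s*",
    re.DOTALL,
)


def canonical(call):
    """Serialize a tool call with sorted keys so key order does not matter."""
    return json.dumps(call, sort_keys=True)


def binary_reward(response: str, ground_truth: list) -> float:
    """Return 1.0 only if the format is respected AND the parsed tool calls
    exactly match the ground truth; otherwise return 0.0."""
    match = PATTERN.fullmatch(response)
    if match is None:
        return 0.0  # format violated
    try:
        calls = json.loads(match.group(2))
    except json.JSONDecodeError:
        return 0.0  # tool calls are not parseable
    if not isinstance(calls, list):
        return 0.0
    predicted = sorted(canonical(c) for c in calls)
    expected = sorted(canonical(c) for c in ground_truth)
    return 1.0 if predicted == expected else 0.0


# Hypothetical example data.
truth = [{"name": "get_weather", "arguments": {"city": "Paris"}}]
good = (
    "<think>The user wants the weather in Paris.</think>\n"
    '<tool_call>[{"arguments": {"city": "Paris"}, "name": "get_weather"}]</tool_call>'
)
wrong_args = good.replace("Paris", "Rome")
no_format = '[{"name": "get_weather", "arguments": {"city": "Paris"}}]'

print(binary_reward(good, truth))        # format ok, calls match
print(binary_reward(wrong_args, truth))  # format ok, calls differ
print(binary_reward(no_format, truth))   # calls match, format missing
```

Note that nothing in the reward inspects the content of the reasoning itself: only the structure and the final tool calls are scored.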
------------
Paper - arxiv.org/abs/2505.00024v1
Paper Title: "Nemotron-Research-Tool-N1: Tool-Using Language Models with Reinforced Reasoning"
Tool-using LLMs can learn to reason—without reasoning traces.
🔥 We present Nemotron-Research-Tool-N1, a family of tool-using reasoning LLMs trained entirely via rule-based reinforcement learning—no reasoning supervision, no distillation.
📄 Paper: https://t.co/HCeMBaIE7f
💻 Code: https://t.co/4ql0gn71qK
(Please consider giving us a ⭐️ to stay updated on the upcoming code release!)
🧠 Why this matters:
Existing tool-call models rely heavily on supervised reasoning traces from stronger models—costly, brittle, and often imitative. We ask:
Can LLMs learn to reason directly from tool success signals?
📦 What we did:
– Train Qwen2.5-7B/14B with simple binary reward on tool-call correctness + reasoning format in R1-style
– No reasoning traces needed
– Evaluate on BFCL, API-Bank, and ACEBench
– Also study the role of SFT, RL, and widely adopted SFT-then-RL recipes in training Tool-Calling models.
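With a 0/1 reward, the learning signal in R1-style training typically comes from comparing several sampled responses to the same prompt. A minimal sketch of group-relative advantage normalization (as used in GRPO-style methods) is below; the function name, the use of the population standard deviation, and the `eps` value are choices made for this illustration, not details taken from the paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against the other samples for the same prompt.

    rewards: rewards for a group of responses sampled from one prompt.
    eps: small constant (assumed value) to avoid division by zero.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Mixed group: correct responses get positive advantage, incorrect negative.
mixed = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# Uniform groups: every response is equally good or bad, so there is
# nothing to contrast and all advantages are zero.
all_wrong = group_relative_advantages([0.0, 0.0, 0.0, 0.0])
all_right = group_relative_advantages([1.0, 1.0, 1.0, 1.0])

print(mixed)
print(all_wrong)
print(all_right)
```

One consequence worth noting: prompts where every sample succeeds or every sample fails contribute no gradient under this normalization, so the useful training signal comes from prompts the model gets right only some of the time.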
📈 Key findings:
– Tool-N1-7B/14B clearly outperform GPT-4o and open baselines on all benchmarks
– Widely adopted SFT+RL paradigm doesn’t necessarily lead to better performance than Pure RL.
– Binary reward > fine-grained reward, esp. for real-world queries
– Scaling works: bigger = better gains under our RL setup
🌟 Takeaway:
Reasoning doesn’t have to be taught. With just a binary signal, LLMs can learn to reason and act.
Tool-N1 sets a new direction for scalable, supervision-light training of tool-calling models.
Thank you AK!
Excited to introduce Eagle 2.5, NVIDIA’s latest vision-language model that brings strong long-context capabilities across both image and video understanding — all with just 8B parameters.
Most existing VLMs struggle with high-res inputs and long video contexts. Eagle 2.5 is designed to tackle both — supporting up to 512 video frames and trained jointly on image + video data.
We introduce a new benchmark-scale dataset, Eagle-Video-110K, with over 110K annotated samples, including QA, localization, and summarization. Videos range from a few minutes to 3 hours — pushing the limits of long-form visual reasoning.
Key techniques:
• Information-First Sampling: spatially aware, quality-preserving frame selection
• Mixed image-video training for generalization
• Progressive long-context recipes up to 128K tokens
• Optimized decoding and inference for efficient deployment
Strong results across the board:
• 6 out of 10 SOTA on long video benchmarks
• Outperforms GPT-4o (0806) on 3/5 video tasks
• Outperforms Gemini 1.5 Pro on 4/6 video tasks
• Matches or beats Qwen2.5-VL-72B on multiple key datasets
• Strong image understanding with consistent improvement over Eagle 2, matching Qwen2.5-VL.
Evaluated on:
• Video-MME
• MVBench
• Charades-STA
• 1-Hour Video QA
• EgoSchema
• MLVU, LVBench, and more…
These tasks stress-test long-form visual understanding with dense supervision and temporal reasoning.
Model, demo, and dataset will be released soon.
Explore the project here: https://t.co/084U086jR0
Code: https://t.co/jGIQU45YBT
Tech Report: https://t.co/w1hGgJMwAw
We're excited to contribute toward long-context, general-purpose VLMs — and would love to hear your feedback or ideas for collaboration.
Excited to announce GR00T N1, the world’s first open foundation model for humanoid robots! We are on a mission to democratize Physical AI. The power of a general robot brain, in the palm of your hand - with only 2B parameters, N1 learns from the most diverse physical action dataset ever compiled and punches above its weight:
- Real humanoid teleoperation data.
- Large-scale simulation data: we are open-sourcing 300K+ trajectories!
- Neural trajectories: we apply SOTA video generation models to “hallucinate” new synthetic data that features accurate physics in pixels. Using Jensen’s words, “systematically infinite data”!
- Latent actions: we develop novel algorithms to extract action tokens from in-the-wild human videos and neural generated videos.
GR00T N1 is a single end-to-end neural net, from photons to actions:
- Vision-Language Model (System 2) that interprets the physical world through vision and language instructions, enabling robots to reason about their environment and instructions, and plan the right actions.
- Diffusion Transformer (System 1) that “renders” smooth and precise motor actions at 120 Hz, executing the latent plan made by System 2.
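The split between a slow planner and a fast action generator can be pictured as a dual-rate control loop. The sketch below is purely illustrative: only the 120 Hz action rate comes from the announcement, while the 10 Hz planning rate, the function names, and the dictionary-based stand-ins for both models are assumptions for this example and do not reflect the actual GR00T N1 code.

```python
ACTION_HZ = 120  # from the announcement: System 1 emits motor actions at 120 Hz
PLAN_HZ = 10     # ASSUMPTION for this sketch; the post does not state System 2's rate
STEPS_PER_PLAN = ACTION_HZ // PLAN_HZ


def system2_plan(observation, instruction):
    """Hypothetical stand-in for the vision-language model (slow, deliberate)."""
    return {"instruction": instruction, "planned_at": observation["tick"]}


def system1_act(plan, observation):
    """Hypothetical stand-in for the diffusion transformer (fast, reactive)."""
    return {"tick": observation["tick"], "plan_from": plan["planned_at"]}


def run(seconds, instruction):
    """Run the loop: replan occasionally, act on every tick."""
    actions = []
    plan = None
    for tick in range(seconds * ACTION_HZ):
        observation = {"tick": tick}
        if tick % STEPS_PER_PLAN == 0:
            # Refresh the latent plan at the slower rate.
            plan = system2_plan(observation, instruction)
        # Emit one action per tick, conditioned on the most recent plan.
        actions.append(system1_act(plan, observation))
    return actions


actions = run(1, "pick up the cup")
print(len(actions))                             # actions emitted in one second
print(len({a["plan_from"] for a in actions}))   # distinct plans used in one second
```

Under these assumed rates, each plan is reused for `STEPS_PER_PLAN` consecutive action steps before being refreshed.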
We deploy N1 on the GR1 robot, the 1X Neo robot, and a large collection of simulation benchmarks. N1 achieves up to a +30% boost in diverse manipulation tasks for household and industrial settings.
While humanoid robots are the main focus of N1, our model also supports cross-embodiment. We finetune it to work on the $110 HuggingFace LeRobot SO100 robot arm! Open robot brain runs on open hardware. Sounds just right.
Let’s solve robotics, together, one token at a time.
Links to our Whitepaper, Github repo, HuggingFace model, and open dataset page in the thread: 🧵
Mr. @pmddomingos
This is a country whose leader blatantly says "We lied, we cheated, we stole… we had entire training courses." And so there are conceited clowns like you spreading China hate everywhere.
Your self-imagined star-spangled awesomeness doesn't change the fact that Chinese researchers have become a major force in the AI community. China is also leading in industry general autonomy, robotics and AI applications. Your word can't change this fact and the successes don't come with fraud.
If you think there’s a problem with this, there’s a problem with you.
Thank you AK! @_akhaliq
This is just the beginning of a long journey, as we focused more on the model design space with multi-encoders, and fair comparisons under controlled settings. More will come in future versions! 🧵[1/n]
Try our model & demo:
GitHub: https://t.co/ps3KCQGPZD
HuggingFace: https://t.co/VfMpC7cSLB
Report: https://t.co/KYbHnSxh0D
We have also worked on transformer-based diffusion (link below) and video diffusion (https://t.co/NDmTTjzS5E). However, we did them in two different projects. :)
Congrats to OpenAI for proving scaling-up still works for video synthesis.
My co-authors are presenting P-Flow at NeurIPS on Thursday at 5pm! We'd love to chat about generative models, audio synthesis and understanding! We are also hiring, including for internships, researchers with expertise in multimodal LLMs!
@tariqafridi16 NVIDIA used to have a PhD Residency Program, but it is not active now. We also have an internship program. Sometimes we extend an internship if the project is interesting but not finished, and for some long-term projects we may extend it further.
I am attending #NeurIPS2023 between Dec. 10th and Dec. 16th. We are recruiting researchers to work on multi-modal models and DL for graphics. Would love to chat if you're interested. We are the team that invented DLSS and Megatron-LM at NVIDIA.