// MiMo-V2.5-Pro-Base: 1T Parameter MoE Language Model //
The base variant of Xiaomi's flagship MoE language model with 1.02T total parameters and 42B active parameters, featuring hybrid attention architecture and up to 256K context length.
Key highlights:
- Massive scale: 1.02T total parameters, 42B active per token
- Hybrid attention: Interleaves Sliding Window Attention and Global Attention with 6:1 ratio and 128 sliding window
- Multi-Token Prediction: 3 lightweight MTP modules using dense FFNs for 3x output speed during inference
- Long context: Supports up to 256K tokens (Pro variant extends to 1M)
- Efficient training: FP8 mixed precision, native 32K sequence length, trained on 27T tokens
- Strong benchmarks: 88.4 BBH, 89.4 MMLU, 99.6 GSM8K, 86.2 MATH, 75.6 HumanEval+
- MIT licensed: Open source with permissive licensing for commercial use
By combining sparse MoE architecture with hybrid attention and multi-token prediction, it delivers frontier-level reasoning and coding performance at high token efficiency—making it suitable for demanding agentic and long-horizon tasks.
🤗 Model
https://t.co/1PoXYgCwBf
// Qwen3.6-27B-DFlash: Block Diffusion Drafter for Qwen3.6 //
A speculative decoding draft model for Qwen3.6-27B that uses a lightweight block diffusion model for parallel drafting, enabling faster inference without quality loss.
Key highlights:
- Block diffusion drafting: Uses bidirectional attention with mask tokens instead of autoregressive draft models
- Target model pairing: Designed specifically for Qwen/Qwen3.6-27B (must be used together)
- Multi-backend support: Compatible with vLLM nightly builds and SGLang (PR branch)
- Easy deployment: Single speculative config flag for vLLM; standard SGLang speculative algorithm flag
- Still training: Model is under active training; inference engine support may evolve with architectural changes
- Causal SWA layers: Includes architectural changes that may affect compatibility with some inference engines
By replacing traditional autoregressive draft models with a block diffusion approach, it enables higher acceptance rates and greater speedups than existing speculative decoding methods when paired with the Qwen3.6-27B target model.
📄 Paper
https://t.co/E55JZXSzMW
🤗 Model
https://t.co/OeGXBnP88n
🔗 Repo
https://t.co/NBxDN9oGY9
// Lark CLI: Official Lark/Feishu CLI Tool //
The official command-line interface for the Lark/Feishu open platform, designed for both human users and AI agents with 200+ commands and 22 structured AI Agent Skills.
Key highlights:
- Agent-native design: 22 structured Skills compatible with Claude Code, Codex, and other AI tools
- Wide coverage: 14 business domains including Messenger, Docs, Sheets, Calendar, Mail, Tasks, Meetings
- Three-layer architecture: Shortcuts (+) → API Commands → Raw API calls (2500+ endpoints)
- AI-friendly output: Concise parameters, smart defaults, structured formats (JSON, table, CSV, ndjson)
- Secure by default: Input injection protection, terminal output sanitization, OS-native keychain storage
- Identity switching: Execute commands as user or bot with `--as` flag
- Dry-run support: Preview requests before execution for safe automation
- Schema introspection: Inspect any API method's parameters and response structure
- MIT licensed: Open source with bilingual documentation (English & Chinese)
By providing structured AI Agent Skills alongside traditional CLI commands, it enables AI coding agents to operate Lark/Feishu workspaces with zero extra setup—automating everything from calendar scheduling to document creation to meeting summaries.
🔗 Repo
https://t.co/d5hpnKjrgN
// OmniVoice: High-Quality Voice Cloning TTS for 600+ Languages //
A massively multilingual zero-shot text-to-speech model built on a diffusion language model-style architecture, supporting voice cloning, voice design, and fine-grained control.
Key highlights:
- 600+ languages supported: Broadest language coverage among zero-shot TTS models
- Voice cloning: State-of-the-art quality from short 3-10 second reference audio clips
- Voice design: Control voices via speaker attributes (gender, age, pitch, dialect, accent, whisper)
- Fine-grained control: Non-verbal symbols ([laughter], [sigh]) and pronunciation correction via pinyin/phonemes
- Fast inference: RTF as low as 0.025 (40x faster than real-time)
- Local inference: Runs on NVIDIA GPU or Apple Silicon (MPS) without cloud dependencies
- Web UI included: Interactive Gradio demo via `omnivoice-demo` command
- Batch inference: Multi-GPU distributed inference support for large-scale TTS tasks
By combining a clean diffusion language model architecture with the broadest multilingual coverage available, it enables high-quality speech synthesis for low-resource languages that most commercial TTS services do not support.
🔗 Repo
https://t.co/mR7pMPTBhH
📄 Paper
https://t.co/riJ7iDU4hh
🤗 Model
https://t.co/2KNTOgEb7P
// TEMPO: Scaling Test-time Training for Large Reasoning Models //
Test-time training (TTT) adapts model parameters on unlabeled test instances during inference, extending capabilities beyond offline training — but existing methods plateau quickly as self-generated reward signals drift and diversity collapses. TEMPO introduces a framework that interleaves policy refinement with periodic critic recalibration on labeled data, formalized through the Expectation-Maximization (EM) algorithm, revealing that prior TTT methods are incomplete variants missing this crucial recalibration step.
Key highlights:
• Interleaves policy refinement on unlabeled questions with periodic critic recalibration on a labeled dataset
• Formalizes the alternating procedure via EM algorithm — tightening the evidence lower bound (ELBO) for sustained improvement
• Improves OLMO3-7B on AIME 2024 from 33.0% to 51.1% and Qwen3-14B from 42.3% to 65.8%
• Maintains high diversity while scaling with additional test-time compute — solving the plateau problem
TEMPO reframes TTT as a principled EM procedure rather than an ad-hoc adaptation trick, showing that periodic recalibration is what enables sustained gains. This work establishes a scalable path for reasoning models to continuously improve at inference time without external human labels.
📄 Paper
https://t.co/9U8Fjc93sW
💻 Code
https://t.co/ix1wAuzIxT
// Tstars-Tryon 1.0: Robust and Realistic Virtual Try-On for Diverse Fashion Items //
Virtual try-on has seen rapid advances through image generation and editing, but existing methods struggle with real-world complexity — extreme poses, lighting variations, motion blur, and diverse garment types. Tstars-Tryon 1.0 is a commercial-scale system from Alibaba that delivers robust, photorealistic results across challenging in-the-wild conditions while maintaining near real-time inference speed for seamless deployment.
Key highlights:
• Maintains high success rate across extreme poses, severe illumination variations, motion blur, and other in-the-wild conditions
• Delivers highly photorealistic results preserving garment texture, material properties, and structural characteristics with minimal AI artifacts
• Supports flexible multi-image composition (up to 6 reference images) across 8 fashion categories with coordinated identity and background control
• Heavily optimized for inference speed — near real-time generation for commercial deployment
• Deployed at industrial scale on the Taobao App, serving millions of users with tens of millions of requests
Tstars-Tryon 1.0 bridges the gap between research-quality virtual try-on and production-grade reliability. By releasing a comprehensive benchmark alongside an industrially deployed model, this work sets a new standard for robustness and realism in fashion AI — moving beyond controlled datasets to real-world conditions that actually matter for e-commerce.
📄 Paper
https://t.co/FV3b7Ry4UZ
🤗 Dataset
https://t.co/BerP1k9kjH
// GFT: From Imitation to Reward Fine-Tuning with Unbiased Group Advantages and Dynamic Coefficient Rectification //
Large language models are typically post-trained using supervised fine-tuning (SFT) followed by reinforcement learning (RL), but unifying efficient knowledge injection with robust generalization remains an open challenge. This work provides a training-dynamics analysis showing SFT can be interpreted as policy gradient optimization with extremely sparse implicit rewards and unstable inverse-probability weighting — leading to single-path dependency, entropy collapse, and gradient explosion. Group Fine-Tuning (GFT) addresses these intrinsic limitations through two novel mechanisms.
Key highlights:
• Reveals SFT as a special case of policy gradient with sparse implicit reward and unstable weighting — explaining its failure modes
• Group Advantage Learning constructs diverse response groups and derives normalized contrastive supervision to alleviate reward sparsity
• Dynamic Coefficient Rectification adaptively bounds inverse-probability weights to stabilize optimization while preserving efficient knowledge injection
• Consistently surpasses SFT-based methods across benchmarks
• Yields policies that integrate more smoothly with subsequent RL training
GFT reframes post-training as a unified optimization problem rather than disjointed SFT-then-RL stages. By diagnosing and fixing the fundamental instability of SFT through group advantages and dynamic rectification, this work offers a theoretically grounded path toward more stable and effective LLM alignment.
📄 Paper
https://t.co/JSMg3lDsEC
💻 Code
https://t.co/F4HaAtXYfO
// QuantCode-Bench: Evaluating LLMs on Executable Algorithmic Trading Strategies //
A benchmark for evaluating the ability of LLMs to generate executable trading strategies from textual descriptions, using a four-stage nested validation pipeline built around the Backtrader framework.
Key highlights:
- 400 tasks: Collected from Reddit, TradingView, StackExchange, GitHub, and synthetic sources across easy/medium/hard difficulties
- Four-stage pipeline: Compilation → Backtest execution → Trade presence → LLM judge semantic alignment
- Domain-specific challenge: Requires financial logic, specialized API knowledge, and behaviorally valid code
- Two evaluation settings: Single-turn (one-shot) and agentic multi-turn (up to 10 iterations with feedback)
- Single-turn results: Best frontier models achieve ~70-76% Judge Pass
- Agentic results: Best models reach 95-98% with iterative feedback and repair
- Key finding: Syntax is solved; the challenge is operationalizing trading logic and API usage
By formalizing trading strategy generation as a sequence of nested requirements rather than a single pass/fail metric, it reveals that current models struggle not with code syntax but with translating financial intent into behaviorally valid implementations—making it a distinct class of domain-specific code generation tasks.
📄 Paper
https://t.co/aXJ4J7z5CT
💻 Code
https://t.co/6Z26hI7UgD
// VLA Foundry: Pretrained LLM, VLM, and VLA Checkpoints //
A unified training framework from Toyota Research Institute for building Vision-Language-Action models, enabling progressive pretraining from LLM to VLM to VLA with shared infrastructure.
Key highlights:
- Unified training pipeline: Single codebase for LLM, VLM, and VLA training stages
- Progressive pretraining: Train LLM first, then fine-tune to VLM, then to VLA
- Pretrained checkpoints: Released models including Foundry-LLM-1.2B, Foundry-VLM-1.3B, Foundry-VLA-1.7B, and Foundry-Qwen3VLA-2.1B
- Database integration: VLA Foundry Database for exploring and filtering training data
- Tutorial support: Jupyter notebooks for training LLM, VLM, and VLA models
- MIT licensed: Open-source with permissive licensing for research and commercial use
By providing a single, modular framework that covers the full progression from language-only to vision-language to vision-language-action models, it lowers the barrier for robotics researchers to train and experiment with embodied AI models on custom datasets.
🔗 Repo
https://t.co/yKAugb4zxO
🤗 Models
https://t.co/PotkmaIb2b
// MultiWorld: Scalable Multi-Agent Multi-View Video World Models //
A unified framework for multi-agent, multi-view world modeling that enables precise control of multiple agents while maintaining cross-view consistency through a shared 3D-aware global state.
Key highlights:
- Multi-Agent Condition Module (MACM): Uses Agent Identity Embedding and Adaptive Action Weighting to associate actions with correct agents
- Global State Encoder (GSE): Aggregates multi-view observations into a 3D-aware global state for coherent view synthesis
- Flexible scaling: Supports arbitrary numbers of agents and camera views without architectural changes
- Parallel inference: Decomposes multi-view simulation into parallel single-view generation with shared global context
- Long-horizon support: Autoregressive chunk generation with state updates for horizons exceeding 2x training length
- Two datasets: Multi-player game (ItTakesTwo) and multi-robot manipulation (RoboFactory) with variable agent/view configs
By encoding observations into a compact global environment state rather than treating views independently, it enables scalable parallel generation of multi-view videos where each perspective remains anchored to a consistent shared world—outperforming baselines in video fidelity, action-following, and cross-view consistency.
📄 Paper
https://t.co/BPfmvzZhFp
// Kimi-K2.6 Deployment Guide //
Official deployment documentation for Moonshot AI's Kimi-K2.6 model, providing example configurations for vLLM, SGLang, and KTransformers inference engines.
Key highlights:
- vLLM support: Available in nightly wheels; TP8 on H200 with tool-call and reasoning parsers
- SGLang stable: Supported in v0.5.10+ without nightly builds; same TP8 configuration
- KTransformers integration: CPU+GPU heterogeneous inference achieving 640 tok/s prefill on 8x L20
- LoRA fine-tuning: KT+LLaMA-Factory setup for SFT at 44.55 tok/s on 2x 4090
- Tool calling: Requires `--tool-call-parser kimi_k2` flag
- Reasoning mode: Enabled by default; requires `--reasoning-parser kimi_k2` for correct processing
- Architecture note: Same architecture as Kimi-K2.5; deployment methods directly reusable
By providing verified deployment commands across multiple inference engines—from high-throughput GPU clusters to CPU-offloading setups—it enables production deployment of the 1T parameter MoE model with thinking mode and tool use capabilities.
🔗 Guide
https://t.co/kegNds8CdG
// Qwen3.5-Omni Technical Report //
A large-scale omni-modal model with hundreds of billions of parameters, Hybrid Attention MoE architecture, and native real-time streaming capabilities across text, audio, image, and video.
Key highlights:
• 🏗️ Hybrid Attention MoE: Efficient long-sequence inference for both Thinker and Talker modules
• 📏 256K context: Supports 10+ hours of audio or 400 seconds of 720P video at 1 FPS
• 🗣️ ARIA alignment: Dynamic text-speech unit alignment for stable, natural streaming speech synthesis
• 🌍 Multilingual: Speech recognition in 113 languages, synthesis in 36 languages with emotional nuance
• 🎬 Audio-visual grounding: Script-level structured captions with temporal sync and scene segmentation
• 💻 Audio-Visual Vibe Coding: Emergent capability to generate executable code from audio-visual instructions
• 🏆 SOTA results: Surpasses Gemini-3.1 Pro across 215 audio and audio-visual benchmarks
By unifying multimodal understanding and generation under a single end-to-end architecture with streaming-first design, it pushes the boundaries of real-time omni-modal interaction—from voice dialogue to video reasoning to autonomous agentic behavior.
📄 Paper
https://t.co/6yP4B3RLK0
🔗 API
https://t.co/NZUz9Oo4kt
// OneVL: One-Step Latent Reasoning and Planning with Vision-Language Explanation //
A unified VLA and World Model framework for autonomous driving that compresses chain-of-thought reasoning into latent tokens supervised by both language and visual decoders, achieving SOTA accuracy at answer-only latency.
Key highlights:
• 🧠 Dual-decoder supervision: Language decoder reconstructs text CoT + visual world model decoder predicts future frames
• ⚡ One-step inference: All latent tokens prefilled in a single parallel pass—no autoregressive decoding
• 🏆 First latent CoT to beat explicit CoT: Surpasses token-by-token reasoning across 4 benchmarks
• 🎯 Causal dynamics: Latent space internalizes road geometry, agent motion, and environmental change
• 🎬 Three-stage training: Progressive alignment with trajectory, language, and visual objectives
• 🚗 Real-time ready: Matches answer-only prediction speed while maintaining reasoning quality
By forcing the latent space to internalize causal driving dynamics through joint language and world-model supervision, it demonstrates that tighter compression produces more generalizable representations than verbose token-by-token reasoning—enabling real-time autonomous driving with deep reasoning.
📄 Paper
https://t.co/9hXe4Yllmq