ForProduction

// MiMo-V2.5-Pro-Base: 1T Parameter MoE Language Model //

The base variant of Xiaomi's flagship MoE language model with 1.02T total parameters and 42B active parameters, featuring hybrid attention architecture and up to 256K context length.

Key highlights:
- Massive scale: 1.02T total parameters, 42B active per token
- Hybrid attention: Interleaves Sliding Window Attention and Global Attention at a 6:1 ratio with a 128-token sliding window
- Multi-Token Prediction: 3 lightweight MTP modules using dense FFNs for 3x output speed during inference
- Long context: Supports up to 256K tokens (Pro variant extends to 1M)
- Efficient training: FP8 mixed precision, native 32K sequence length, trained on 27T tokens
- Strong benchmarks: 88.4 BBH, 89.4 MMLU, 99.6 GSM8K, 86.2 MATH, 75.6 HumanEval+
- MIT licensed: Open source with permissive licensing for commercial use

By combining sparse MoE architecture with hybrid attention and multi-token prediction, it delivers frontier-level reasoning and coding performance at high token efficiency, making it suitable for demanding agentic and long-horizon tasks.

🤗 Model https://t.co/1PoXYgCwBf
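To make the 6:1 interleaving and the 128-token window concrete, here is a minimal Python sketch. It is not MiMo's actual implementation: the function names are hypothetical, the ordering (six sliding-window layers followed by one global layer) is an assumption, and so is the convention that the window counts the current token.

```python
def layer_types(num_layers, swa_per_global=6):
    """Label each layer 'swa' or 'global'.

    Assumption: every group of `swa_per_global` sliding-window layers
    is followed by one global-attention layer.
    """
    period = swa_per_global + 1
    return ["global" if (i + 1) % period == 0 else "swa"
            for i in range(num_layers)]


def can_attend(query_pos, key_pos, layer_type, window=128):
    """Causal visibility of key_pos from query_pos in one layer.

    Global layers see every earlier token; sliding-window layers see
    only the most recent `window` positions (current token included).
    """
    if key_pos > query_pos:
        return False  # causal: never look ahead
    if layer_type == "global":
        return True
    return query_pos - key_pos < window


pattern = layer_types(14)  # two full 6:1 groups
```

Under these assumptions only one layer in seven keeps a key-value cache that grows with the full context, which is the usual motivation for this kind of hybrid layout.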

ForProduction @ForProduction

about 1 month ago

// Qwen3.6-27B-DFlash: Block Diffusion Drafter for Qwen3.6 //

A speculative decoding draft model for Qwen3.6-27B that uses a lightweight block diffusion model for parallel drafting, enabling faster inference without quality loss.

Key highlights:
- Block diffusion drafting: Uses bidirectional attention with mask tokens instead of autoregressive draft models
- Target model pairing: Designed specifically for Qwen/Qwen3.6-27B (must be used together)
- Multi-backend support: Compatible with vLLM nightly builds and SGLang (PR branch)
- Easy deployment: Single speculative config flag for vLLM; standard SGLang speculative algorithm flag
- Still training: Model is under active training; inference engine support may evolve with architectural changes
- Causal SWA layers: Includes architectural changes that may affect compatibility with some inference engines

By replacing traditional autoregressive draft models with a block diffusion approach, it enables higher acceptance rates and greater speedups than existing speculative decoding methods when paired with the Qwen3.6-27B target model.

📄 Paper https://t.co/E55JZXSzMW
🤗 Model https://t.co/OeGXBnP88n
🔗 Repo https://t.co/NBxDN9oGY9
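The "without quality loss" claim rests on the verification step of speculative decoding, which can be sketched in a few lines of Python. This is the generic greedy-verification scheme, not DFlash's code: `verify_block` is a hypothetical name, and the sketch assumes the target model has already scored the whole drafted block in one parallel pass.

```python
def verify_block(draft_tokens, target_tokens):
    """Greedy verification of one drafted block.

    target_tokens[i] is the target model's greedy choice at position i,
    given the context plus draft_tokens[:i], so it holds one more entry
    than draft_tokens. Returns the tokens that are actually emitted.
    """
    accepted = []
    for i, tok in enumerate(draft_tokens):
        if tok != target_tokens[i]:
            break  # first disagreement ends the accepted prefix
        accepted.append(tok)
    # Always emit one target token: the correction at the mismatch,
    # or a bonus token when the whole draft was accepted.
    accepted.append(target_tokens[len(accepted)])
    return accepted


partial = verify_block([5, 7, 9], [5, 7, 2, 4])
full = verify_block([5, 7, 9], [5, 7, 9, 4])
```

Every emitted token is one the target model would have chosen itself, so the output matches plain greedy decoding; the drafter only changes how many tokens are confirmed per target pass.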

ForProduction @ForProduction

about 1 month ago

// Lark CLI: Official Lark/Feishu CLI Tool //

The official command-line interface for the Lark/Feishu open platform, designed for both human users and AI agents with 200+ commands and 22 structured AI Agent Skills.

Key highlights:
- Agent-native design: 22 structured Skills compatible with Claude Code, Codex, and other AI tools
- Wide coverage: 14 business domains including Messenger, Docs, Sheets, Calendar, Mail, Tasks, Meetings
- Three-layer architecture: Shortcuts (+) → API Commands → Raw API calls (2500+ endpoints)
- AI-friendly output: Concise parameters, smart defaults, structured formats (JSON, table, CSV, ndjson)
- Secure by default: Input injection protection, terminal output sanitization, OS-native keychain storage
- Identity switching: Execute commands as user or bot with `--as` flag
- Dry-run support: Preview requests before execution for safe automation
- Schema introspection: Inspect any API method's parameters and response structure
- MIT licensed: Open source with bilingual documentation (English & Chinese)

By providing structured AI Agent Skills alongside traditional CLI commands, it enables AI coding agents to operate Lark/Feishu workspaces with zero extra setup, automating everything from calendar scheduling to document creation to meeting summaries.

🔗 Repo https://t.co/d5hpnKjrgN

ForProduction @ForProduction

about 1 month ago

// OmniVoice: High-Quality Voice Cloning TTS for 600+ Languages //

A massively multilingual zero-shot text-to-speech model built on a diffusion language model-style architecture, supporting voice cloning, voice design, and fine-grained control.

Key highlights:
- 600+ languages supported: Broadest language coverage among zero-shot TTS models
- Voice cloning: State-of-the-art quality from short 3-10 second reference audio clips
- Voice design: Control voices via speaker attributes (gender, age, pitch, dialect, accent, whisper)
- Fine-grained control: Non-verbal symbols ([laughter], [sigh]) and pronunciation correction via pinyin/phonemes
- Fast inference: RTF as low as 0.025 (40x faster than real-time)
- Local inference: Runs on NVIDIA GPU or Apple Silicon (MPS) without cloud dependencies
- Web UI included: Interactive Gradio demo via `omnivoice-demo` command
- Batch inference: Multi-GPU distributed inference support for large-scale TTS tasks

By combining a clean diffusion language model architecture with the broadest multilingual coverage available, it enables high-quality speech synthesis for low-resource languages that most commercial TTS services do not support.

🔗 Repo https://t.co/mR7pMPTBhH
📄 Paper https://t.co/riJ7iDU4hh
🤗 Model https://t.co/2KNTOgEb7P
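The link between an RTF of 0.025 and "40x faster than real-time" is simple arithmetic, shown here as a small Python sketch. The function names are hypothetical and the timings are made-up illustration values, not OmniVoice measurements.

```python
def real_time_factor(synthesis_seconds, audio_seconds):
    """RTF = time spent generating / duration of the generated audio."""
    return synthesis_seconds / audio_seconds


def speedup_over_real_time(rtf):
    """An RTF below 1 means faster than real time; the speedup is 1 / RTF."""
    return 1.0 / rtf


# Illustration only: 10 s of audio synthesized in 0.25 s.
rtf = real_time_factor(0.25, 10.0)
speedup = speedup_over_real_time(rtf)
```

With these illustrative numbers the RTF comes out at 0.025 and the speedup at 40, matching the figures quoted in the post.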

ForProduction @ForProduction

about 1 month ago

// TEMPO: Scaling Test-time Training for Large Reasoning Models //

Test-time training (TTT) adapts model parameters on unlabeled test instances during inference, extending capabilities beyond offline training, but existing methods plateau quickly as self-generated reward signals drift and diversity collapses. TEMPO introduces a framework that interleaves policy refinement with periodic critic recalibration on labeled data, formalized through the Expectation-Maximization (EM) algorithm, revealing that prior TTT methods are incomplete variants missing this crucial recalibration step.

Key highlights:
• Interleaves policy refinement on unlabeled questions with periodic critic recalibration on a labeled dataset
• Formalizes the alternating procedure via the EM algorithm, tightening the evidence lower bound (ELBO) for sustained improvement
• Improves OLMO3-7B on AIME 2024 from 33.0% to 51.1% and Qwen3-14B from 42.3% to 65.8%
• Maintains high diversity while scaling with additional test-time compute, solving the plateau problem

TEMPO reframes TTT as a principled EM procedure rather than an ad-hoc adaptation trick, showing that periodic recalibration is what enables sustained gains. This work establishes a scalable path for reasoning models to continuously improve at inference time without external human labels.

📄 Paper https://t.co/9U8Fjc93sW
💻 Code https://t.co/ix1wAuzIxT
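For readers unfamiliar with the EM framing, the textbook decomposition behind "tightening the ELBO" is reproduced below. The symbols are generic and not taken from the paper: x is observed data, z a latent variable, q an auxiliary distribution, and θ the model parameters. How TEMPO assigns the policy and the critic to these roles is defined in the paper and is not reconstructed here.

```latex
\log p_\theta(x)
  = \underbrace{\mathbb{E}_{q(z)}\!\left[\log p_\theta(x, z) - \log q(z)\right]}_{\mathrm{ELBO}(q,\,\theta)}
  + \mathrm{KL}\!\left(q(z)\,\|\,p_\theta(z \mid x)\right)

\text{E-step: } q^{(t+1)} = \arg\max_{q}\ \mathrm{ELBO}\!\left(q, \theta^{(t)}\right)
\qquad
\text{M-step: } \theta^{(t+1)} = \arg\max_{\theta}\ \mathrm{ELBO}\!\left(q^{(t+1)}, \theta\right)
```

Because the KL term is non-negative, the ELBO never exceeds the log-likelihood. The E-step closes the gap by shrinking the KL term, and the M-step then raises the bound itself; skipping either step leaves the other optimizing a loose or stale bound.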

ForProduction @ForProduction

about 1 month ago

// Tstars-Tryon 1.0: Robust and Realistic Virtual Try-On for Diverse Fashion Items //

Virtual try-on has seen rapid advances through image generation and editing, but existing methods struggle with real-world complexity: extreme poses, lighting variations, motion blur, and diverse garment types. Tstars-Tryon 1.0 is a commercial-scale system from Alibaba that delivers robust, photorealistic results across challenging in-the-wild conditions while maintaining near real-time inference speed for seamless deployment.

Key highlights:
• Maintains high success rate across extreme poses, severe illumination variations, motion blur, and other in-the-wild conditions
• Delivers highly photorealistic results preserving garment texture, material properties, and structural characteristics with minimal AI artifacts
• Supports flexible multi-image composition (up to 6 reference images) across 8 fashion categories with coordinated identity and background control
• Heavily optimized for inference speed, with near real-time generation for commercial deployment
• Deployed at industrial scale on the Taobao App, serving millions of users with tens of millions of requests

Tstars-Tryon 1.0 bridges the gap between research-quality virtual try-on and production-grade reliability. By releasing a comprehensive benchmark alongside an industrially deployed model, this work sets a new standard for robustness and realism in fashion AI, moving beyond controlled datasets to real-world conditions that actually matter for e-commerce.

📄 Paper https://t.co/FV3b7Ry4UZ
🤗 Dataset https://t.co/BerP1k9kjH

ForProduction @ForProduction

about 1 month ago

// GFT: From Imitation to Reward Fine-Tuning with Unbiased Group Advantages and Dynamic Coefficient Rectification //

Large language models are typically post-trained using supervised fine-tuning (SFT) followed by reinforcement learning (RL), but unifying efficient knowledge injection with robust generalization remains an open challenge. This work provides a training-dynamics analysis showing SFT can be interpreted as policy gradient optimization with extremely sparse implicit rewards and unstable inverse-probability weighting, leading to single-path dependency, entropy collapse, and gradient explosion. Group Fine-Tuning (GFT) addresses these intrinsic limitations through two novel mechanisms.

Key highlights:
• Reveals SFT as a special case of policy gradient with sparse implicit reward and unstable weighting, explaining its failure modes
• Group Advantage Learning constructs diverse response groups and derives normalized contrastive supervision to alleviate reward sparsity
• Dynamic Coefficient Rectification adaptively bounds inverse-probability weights to stabilize optimization while preserving efficient knowledge injection
• Consistently surpasses SFT-based methods across benchmarks
• Yields policies that integrate more smoothly with subsequent RL training

GFT reframes post-training as a unified optimization problem rather than disjointed SFT-then-RL stages. By diagnosing and fixing the fundamental instability of SFT through group advantages and dynamic rectification, this work offers a theoretically grounded path toward more stable and effective LLM alignment.

📄 Paper https://t.co/JSMg3lDsEC
💻 Code https://t.co/F4HaAtXYfO
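The inverse-probability weighting mentioned above follows from the identity that the gradient of log π equals the gradient of π divided by π, so a target token with tiny probability gets a very large weight. The Python sketch below illustrates the two ideas named in the highlights under loud assumptions: the mean/std normalization is the common group-relative form, not necessarily GFT's unbiased estimator, and the fixed cap `max_weight` is a hypothetical stand-in for the paper's adaptive bound.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of sampled responses.

    Assumption: plain mean/std normalization, used here only to show
    how a group turns sparse rewards into contrastive signals.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def bounded_inverse_prob_weight(prob, max_weight=10.0):
    """Cap the 1/pi weight that appears in the SFT gradient.

    `max_weight` is a hypothetical fixed cap for illustration.
    """
    return min(1.0 / prob, max_weight)


advantages = group_advantages([1.0, 0.0, 0.0, 1.0])
capped = bounded_inverse_prob_weight(0.001)   # raw weight would be 1000
uncapped = bounded_inverse_prob_weight(0.5)   # below the cap, unchanged
```

The advantages sum to zero within the group, so responses are pushed up or down relative to their peers rather than toward a single reference path.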

ForProduction @ForProduction

about 1 month ago

// QuantCode-Bench: Evaluating LLMs on Executable Algorithmic Trading Strategies //

A benchmark for evaluating the ability of LLMs to generate executable trading strategies from textual descriptions, using a four-stage nested validation pipeline built around the Backtrader framework.

Key highlights:
- 400 tasks: Collected from Reddit, TradingView, StackExchange, GitHub, and synthetic sources across easy/medium/hard difficulties
- Four-stage pipeline: Compilation → Backtest execution → Trade presence → LLM judge semantic alignment
- Domain-specific challenge: Requires financial logic, specialized API knowledge, and behaviorally valid code
- Two evaluation settings: Single-turn (one-shot) and agentic multi-turn (up to 10 iterations with feedback)
- Single-turn results: Best frontier models achieve ~70-76% Judge Pass
- Agentic results: Best models reach 95-98% with iterative feedback and repair
- Key finding: Syntax is solved; the challenge is operationalizing trading logic and API usage

By formalizing trading strategy generation as a sequence of nested requirements rather than a single pass/fail metric, it reveals that current models struggle not with code syntax but with translating financial intent into behaviorally valid implementations, making it a distinct class of domain-specific code generation tasks.

📄 Paper https://t.co/aXJ4J7z5CT
💻 Code https://t.co/6Z26hI7UgD
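The nested structure of the four stages can be sketched as a short-circuiting pipeline in Python. This is an illustration of the idea, not the benchmark's harness: only the compilation check is real (Python's built-in `compile`), while the backtest, trade-presence, and judge checks are placeholder stubs that do not call Backtrader or any LLM.

```python
from typing import Callable, List, Tuple

Stage = Tuple[str, Callable[[str], bool]]


def compiles(code: str) -> bool:
    """Stage 1: the generated strategy must at least be valid Python."""
    try:
        compile(code, "<strategy>", "exec")
        return True
    except SyntaxError:
        return False


def run_nested_pipeline(code: str, stages: List[Stage]) -> List[str]:
    """Run stages in order and stop at the first failure.

    Later stages are only meaningful if earlier ones pass, which is
    what makes the requirements nested rather than independent.
    """
    passed = []
    for name, check in stages:
        if not check(code):
            break
        passed.append(name)
    return passed


# Placeholder stages (assumptions for illustration only).
demo_stages: List[Stage] = [
    ("compilation", compiles),
    ("backtest", lambda code: True),           # stub: pretend it ran
    ("trades", lambda code: "buy" in code),    # stub: crude trade check
    ("judge", lambda code: True),              # stub: pretend judge agrees
]

result = run_nested_pipeline("self.buy()", demo_stages)
```

Reporting the last stage reached, rather than a single pass/fail bit, is what lets the benchmark separate syntax failures from failures of trading logic.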

ForProduction @ForProduction

about 1 month ago

// VLA Foundry: Pretrained LLM, VLM, and VLA Checkpoints //

A unified training framework from Toyota Research Institute for building Vision-Language-Action models, enabling progressive pretraining from LLM to VLM to VLA with shared infrastructure.

Key highlights:
- Unified training pipeline: Single codebase for LLM, VLM, and VLA training stages
- Progressive pretraining: Train LLM first, then fine-tune to VLM, then to VLA
- Pretrained checkpoints: Released models including Foundry-LLM-1.2B, Foundry-VLM-1.3B, Foundry-VLA-1.7B, and Foundry-Qwen3VLA-2.1B
- Database integration: VLA Foundry Database for exploring and filtering training data
- Tutorial support: Jupyter notebooks for training LLM, VLM, and VLA models
- MIT licensed: Open-source with permissive licensing for research and commercial use

By providing a single, modular framework that covers the full progression from language-only to vision-language to vision-language-action models, it lowers the barrier for robotics researchers to train and experiment with embodied AI models on custom datasets.

🔗 Repo https://t.co/yKAugb4zxO
🤗 Models https://t.co/PotkmaIb2b

ForProduction @ForProduction

about 1 month ago

// MultiWorld: Scalable Multi-Agent Multi-View Video World Models //

A unified framework for multi-agent, multi-view world modeling that enables precise control of multiple agents while maintaining cross-view consistency through a shared 3D-aware global state.

Key highlights:
- Multi-Agent Condition Module (MACM): Uses Agent Identity Embedding and Adaptive Action Weighting to associate actions with correct agents
- Global State Encoder (GSE): Aggregates multi-view observations into a 3D-aware global state for coherent view synthesis
- Flexible scaling: Supports arbitrary numbers of agents and camera views without architectural changes
- Parallel inference: Decomposes multi-view simulation into parallel single-view generation with shared global context
- Long-horizon support: Autoregressive chunk generation with state updates for horizons exceeding 2x training length
- Two datasets: Multi-player game (ItTakesTwo) and multi-robot manipulation (RoboFactory) with variable agent/view configs

By encoding observations into a compact global environment state rather than treating views independently, it enables scalable parallel generation of multi-view videos where each perspective remains anchored to a consistent shared world, outperforming baselines in video fidelity, action-following, and cross-view consistency.

📄 Paper https://t.co/BPfmvzZhFp

ForProduction @ForProduction

about 1 month ago

// Kimi-K2.6 Deployment Guide //

Official deployment documentation for Moonshot AI's Kimi-K2.6 model, providing example configurations for vLLM, SGLang, and KTransformers inference engines.

Key highlights:
- vLLM support: Available in nightly wheels; TP8 on H200 with tool-call and reasoning parsers
- SGLang stable: Supported in v0.5.10+ without nightly builds; same TP8 configuration
- KTransformers integration: CPU+GPU heterogeneous inference achieving 640 tok/s prefill on 8x L20
- LoRA fine-tuning: KT+LLaMA-Factory setup for SFT at 44.55 tok/s on 2x 4090
- Tool calling: Requires `--tool-call-parser kimi_k2` flag
- Reasoning mode: Enabled by default; requires `--reasoning-parser kimi_k2` for correct processing
- Architecture note: Same architecture as Kimi-K2.5; deployment methods directly reusable

By providing verified deployment commands across multiple inference engines, from high-throughput GPU clusters to CPU-offloading setups, it enables production deployment of the 1T-parameter MoE model with thinking mode and tool use capabilities.

🔗 Guide https://t.co/kegNds8CdG
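As a rough illustration of how the two parser flags and the TP8 setting fit together, the Python sketch below assembles a `vllm serve` command line. It is not copied from the guide: the model path is a placeholder, `--tensor-parallel-size` is the general vLLM flag for tensor parallelism, and `--enable-auto-tool-choice` is added on the assumption (from general vLLM usage, not from the post) that automatic tool calling needs it. Check the linked guide for the verified command.

```python
import shlex


def vllm_serve_command(model_path, tensor_parallel_size=8):
    """Build an argv list for serving the model with vLLM.

    `model_path` is a placeholder; the two kimi_k2 parser flags are the
    ones quoted in the post.
    """
    return [
        "vllm", "serve", model_path,
        "--tensor-parallel-size", str(tensor_parallel_size),  # TP8
        "--enable-auto-tool-choice",       # assumption: general vLLM usage
        "--tool-call-parser", "kimi_k2",   # from the post
        "--reasoning-parser", "kimi_k2",   # from the post
    ]


command = vllm_serve_command("<model-path>")
command_line = shlex.join(command)  # shell-safe string for logging
```

Keeping the command as a list and joining it only for display avoids quoting mistakes when the same arguments are later passed to `subprocess.run`.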

ForProduction @ForProduction

about 1 month ago

// Qwen3.5-Omni Technical Report //

A large-scale omni-modal model with hundreds of billions of parameters, Hybrid Attention MoE architecture, and native real-time streaming capabilities across text, audio, image, and video.

Key highlights:
• 🏗️ Hybrid Attention MoE: Efficient long-sequence inference for both Thinker and Talker modules
• 📏 256K context: Supports 10+ hours of audio or 400 seconds of 720P video at 1 FPS
• 🗣️ ARIA alignment: Dynamic text-speech unit alignment for stable, natural streaming speech synthesis
• 🌍 Multilingual: Speech recognition in 113 languages, synthesis in 36 languages with emotional nuance
• 🎬 Audio-visual grounding: Script-level structured captions with temporal sync and scene segmentation
• 💻 Audio-Visual Vibe Coding: Emergent capability to generate executable code from audio-visual instructions
• 🏆 SOTA results: Surpasses Gemini-3.1 Pro across 215 audio and audio-visual benchmarks

By unifying multimodal understanding and generation under a single end-to-end architecture with streaming-first design, it pushes the boundaries of real-time omni-modal interaction, from voice dialogue to video reasoning to autonomous agentic behavior.

📄 Paper https://t.co/6yP4B3RLK0
🔗 API https://t.co/NZUz9Oo4kt

ForProduction @ForProduction

about 1 month ago

// OneVL: One-Step Latent Reasoning and Planning with Vision-Language Explanation //

A unified VLA and World Model framework for autonomous driving that compresses chain-of-thought reasoning into latent tokens supervised by both language and visual decoders, achieving SOTA accuracy at answer-only latency.

Key highlights:
• 🧠 Dual-decoder supervision: Language decoder reconstructs text CoT + visual world model decoder predicts future frames
• ⚡ One-step inference: All latent tokens prefilled in a single parallel pass, with no autoregressive decoding
• 🏆 First latent CoT to beat explicit CoT: Surpasses token-by-token reasoning across 4 benchmarks
• 🎯 Causal dynamics: Latent space internalizes road geometry, agent motion, and environmental change
• 🎬 Three-stage training: Progressive alignment with trajectory, language, and visual objectives
• 🚗 Real-time ready: Matches answer-only prediction speed while maintaining reasoning quality

By forcing the latent space to internalize causal driving dynamics through joint language and world-model supervision, it demonstrates that tighter compression produces more generalizable representations than verbose token-by-token reasoning, enabling real-time autonomous driving with deep reasoning.

📄 Paper https://t.co/9hXe4Yllmq
