Today we're releasing Mellum2: our first "serious" LLM.
This is a 12B A2.5B MoE LLM pre-trained on ~11T tokens and post-trained with RLVR.
I'm proud to be leading the team that has been working on it for the last 6 months.
We are releasing base/SFT/RL checkpoints along with a tech report.
Writing at the intersection of math, data science, and operations research, Berend Markhorst presents an accessible and comprehensive guide to Benders' decomposition. https://t.co/BgXdHoJz11
“Mastering NLP From Foundations to Agents” by Lior Gazit and Meysam Ghaffari, from @PacktPublishing @PacktDataML
👉➡️ https://t.co/fnOUOFRuyu ⬅️👈
Learn this:
•Engineer NLP systems from ML foundations to LLM architectures
•Implement RAG pipelines, routing layers, and agent workflows
•Fine-tune and align LLMs using LoRA, RLHF, and DPO methods
•Design production-grade AI systems with governance and safety
How CopilotKit Is Redefining the Agentic AI Stack in 2026
For years, AI inside software meant a chat widget bolted onto the corner of an application. You typed, the model responded with text, and you manually translated that output into whatever you actually needed it to do. It was useful the way a calculator is useful: functional, but fundamentally passive. CopilotKit, a Seattle-based startup co-founded by Atai Barkai and Uli Barkai, has spent the last two years arguing that this chat-widget model is broken, and in 2026 the developer community is agreeing loudly.
- AG-UI completes the agentic protocol stack by handling the agent-to-UI interaction layer that MCP and A2A leave unaddressed, with first-party SDKs across LangGraph, CrewAI, Mastra, Agno, and Pydantic AI, and community SDKs now live for Go, Kotlin, Dart, Java, Rust, Ruby, and C++.
- AIMock ships one zero-dependency mock server for the entire agentic call chain — 11 LLM providers, MCP, A2A, vector DBs, search — with record-and-replay, daily drift detection, and chaos testing built in.
- Pathfinder is a self-hosted MCP knowledge server that indexes docs, code, Notion pages, Slack, and Discord into hybrid vector-keyword search, with pluggable embeddings that need no external API key.
- The three tools together target the three production blockers — knowledge retrieval, testing reliability, and runtime persistence — that demo-quality agents consistently fail to address.
- CopilotKit's vendor-neutral, self-hostable design means teams can adopt any single layer without being locked into a proprietary runtime or forced to rebuild their existing stack.
Full analysis: https://t.co/eOxovDdjtW
GitHub repo: https://t.co/YDv9rhIu4T
@CopilotKit #ai #aiagent #agenticai
Most agent frameworks today are stitching together reasoning models with external orchestration layers. Qwen3.7-Max takes a different position — train the agent capability into the model itself.
Alibaba just introduced Qwen3.7-Max
Here's what's actually interesting:
→ 1M-token context window — up from 256K on Qwen3.6 Max Preview
→ Extended-thinking mode with visible chain-of-thought reasoning trace
→ 1,000+ tool calls executed autonomously in an internal kernel optimization test
→ 35 hours of sustained autonomous execution on a single complex task
→ 56.6 on the Artificial Analysis Intelligence Index — #5 overall, ahead of Gemini 3.5 Flash
→ #13 in Text Arena (1,475 Elo), #7 in Math, #9 in Expert Prompts
Full analysis: https://t.co/qSLp3fta9c
Other technical details ⤵
@Alibaba_Qwen
MIT researchers developed “Insum,” a technique for speeding up computations on datasets replete w/zeros.
It rewrites Einstein summation (“einsum”) operations to avoid inefficient handling of zeros, improving memory efficiency & performance: https://t.co/JXLLk3yuLe
Most vector search libraries make you train a codebook before indexing anything.
That's not a search tool — it's a data dependency. turbovec just removed it entirely.
It's a Rust-built vector index with Python bindings, built on Google Research's TurboQuant algorithm — a data-oblivious quantizer that requires zero training and zero data passes.
Here's what's actually interesting:
→ 10 million documents: 31 GB as float32 vs. 4 GB with turbovec (roughly 8x smaller), and up to 16x compression at 2-bit
→ Beats FAISS IndexPQFastScan by 12–20% on ARM across every configuration
→ On x86, wins every 4-bit config by 1–6% against FAISS
→ Zero codebook training — add vectors, they're indexed immediately
→ Fully local, no data egress — drop-in for LangChain, LlamaIndex, and Haystack
The core idea: after applying a random rotation, every coordinate follows a known Beta distribution — regardless of input data. That makes the quantization boundaries computable from math alone, not from your dataset.
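To make the "boundaries from math alone" point concrete, here is a minimal pure-Python sketch. It assumes unit-norm input vectors and uses equal-probability bins, which is a simplification chosen for illustration; the actual TurboQuant codebook may be built differently. The function names (`quantile_boundaries`, `random_orthogonal`, `quantize`) are hypothetical and are not part of the turbovec API.

```python
import bisect
import math
import random


def coord_density(x, d):
    # Unnormalised density of one coordinate of a uniformly random unit
    # vector in R^d: proportional to (1 - x^2)^((d - 3) / 2) on [-1, 1].
    # Equivalently, (x + 1) / 2 follows Beta((d - 1) / 2, (d - 1) / 2).
    return (1.0 - x * x) ** ((d - 3) / 2.0)


def quantile_boundaries(d, bits, grid=20001):
    """Equal-probability bin edges, computed from d and bits alone."""
    xs = [-1.0 + 2.0 * i / (grid - 1) for i in range(grid)]
    pdf = [coord_density(x, d) for x in xs]
    cdf = [0.0]  # trapezoidal cumulative integral of the density
    for i in range(1, grid):
        cdf.append(cdf[-1] + 0.5 * (pdf[i] + pdf[i - 1]) * (xs[i] - xs[i - 1]))
    levels = 2 ** bits
    edges, j = [], 0
    for k in range(1, levels):
        target = cdf[-1] * k / levels
        while cdf[j] < target:
            j += 1
        edges.append(xs[j])
    return edges


def random_orthogonal(d, rng):
    """Random orthogonal matrix via Gram-Schmidt on Gaussian rows."""
    rows = []
    for _ in range(d):
        v = [rng.gauss(0.0, 1.0) for _ in range(d)]
        for u in rows:
            dot = sum(a * b for a, b in zip(v, u))
            v = [a - dot * b for a, b in zip(v, u)]
        norm = math.sqrt(sum(a * a for a in v))
        rows.append([a / norm for a in v])
    return rows


def quantize(vec, edges):
    # Each coordinate becomes the index of the bin it falls into.
    return [bisect.bisect_right(edges, x) for x in vec]


rng = random.Random(0)
d, bits = 64, 2
edges = quantile_boundaries(d, bits)      # no dataset involved
rotation = random_orthogonal(d, rng)
one_hot = [1.0] + [0.0] * (d - 1)         # a deliberately skewed input
rotated = [sum(r * x for r, x in zip(row, one_hot)) for row in rotation]
codes = quantize(rotated, edges)
print(edges, codes[:8])
```

Note that `edges` is computed before any vector is seen, which is the sense in which such a quantizer needs no codebook training.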
Full analysis with Guide: https://t.co/RcUvsavLvi
Repo: https://t.co/dmcGErIfbT
#ai #python #aiinfrastructure #data #ml
#1 best-seller in AI on Amazon...
"Agentic Coding with Claude Code: The everyday developer's guide to agentic coding with Claude Code"
𝗚𝗲𝘁 𝗶𝘁 𝗵𝗲𝗿𝗲: https://t.co/jKP3cE9HgV via @PacktDataML
𝗪𝗵𝗮𝘁 𝘆𝗼𝘂 𝘄𝗶𝗹𝗹 𝗹𝗲𝗮𝗿𝗻:
❇️Design agentic coding workflows in the terminal and IDE using Claude Code
🔷Build custom automations with reusable slash commands and hooks
❇️Use Claude Code with a Next.js project to implement AI-driven workflows
🔷Create persistent AI memory using Claude Code memory files
❇️Apply MCP for structured context sharing across tools and agents
🔷Design multi-agent systems using subagents and orchestration patterns
❇️Enforce coding standards using project documentation and context control
🔷Scale AI pair programming while keeping code maintainable
Most LLM inference optimization forces a choice: fast drafting with a weak auxiliary model, or accurate generation with full standard autoregressive (AR) decoding. NVIDIA researchers just built a third option into the weights themselves.
They released Nemotron-Labs-Diffusion, a 3B/8B/14B model family trained on a joint AR-diffusion objective that supports three decoding modes from one checkpoint: standard AR, parallel diffusion decoding, and self-speculation, where the same model drafts and verifies without any auxiliary head.
Here's what's actually interesting:
→ Self-speculation achieves 5.99× tokens per forward (TPF) relative to Qwen3-8B, with comparable accuracy on a 10-task benchmark
→ Average acceptance length: 6.82 (with LoRA) vs. 2.75 for Eagle3 and 4.24 for Qwen3-9B-MTP — same draft length of 31
→ AR and diffusion objectives peak at the same loss coefficient (α=0.3) and improve together — they don't compete for model capacity
→ Speed-of-light analysis shows a theoretical ceiling of 7.60× TPF at block length 32; current confidence-based sampling realizes only ~3×, leaving headroom for better samplers
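The draft-then-verify loop behind self-speculation can be sketched with a toy model. This is a generic greedy speculative-decoding check, assumed here for illustration rather than taken from the paper; `ar_next`, `draft_block`, and the injected drafting errors are all hypothetical stand-ins for the model's AR and diffusion modes. A real system would verify all draft positions in one forward pass, whereas this sketch loops for clarity.

```python
def ar_next(context):
    # Toy "AR mode": the next token is a deterministic function of the context.
    return (sum(context) * 31 + len(context)) % 50


def draft_block(context, k, error_every=5):
    # Toy "diffusion mode": propose k tokens at once, deliberately wrong
    # at some positions so that verification has something to reject.
    out, ctx = [], list(context)
    for _ in range(k):
        tok = ar_next(ctx)
        if (len(ctx) + 1) % error_every == 0:
            tok = (tok + 1) % 50  # injected drafting error
        out.append(tok)
        ctx.append(tok)
    return out


def speculative_step(context, k):
    """Accept draft tokens while they match the AR choice, then add one more."""
    accepted, ctx = [], list(context)
    for tok in draft_block(context, k):
        expected = ar_next(ctx)
        if tok != expected:
            accepted.append(expected)  # verifier's token replaces the mismatch
            break
        accepted.append(tok)
        ctx.append(tok)
    else:
        accepted.append(ar_next(ctx))  # whole draft accepted: bonus token
    return accepted


def generate(prompt, n_tokens, k):
    ctx, steps = list(prompt), 0
    while len(ctx) - len(prompt) < n_tokens:
        ctx.extend(speculative_step(ctx, k))
        steps += 1
    return ctx[len(prompt):len(prompt) + n_tokens], steps


def ar_generate(prompt, n_tokens):
    ctx = list(prompt)
    for _ in range(n_tokens):
        ctx.append(ar_next(ctx))
    return ctx[len(prompt):]


prompt = [1, 2, 3]
tokens, steps = generate(prompt, 40, k=8)
print("tokens per verification step:", len(tokens) / steps)
```

The output matches plain AR decoding token for token; only the number of verification steps changes, which is what the acceptance-length and TPF figures above measure.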
Full analysis: https://t.co/tJdGfHjCFr
Paper: https://t.co/LdEz01hEQt
Model weights: https://t.co/eP2MJs1GT8
Technical details: https://t.co/TQ84fmKFP5
@PavloMolchanov @NVIDIAAI @nvidia @YongganFu @xieenze_jr @MardaniMorteza @songhan_mit @jankautz
Most translation models are audio pipelines with a TTS layer bolted on at the end. That's not simultaneous interpretation, and Alibaba's Qwen team just built a clear technical case for the difference.
They released Qwen3.5-LiveTranslate-Flash: a real-time multimodal translation model that processes audio and video frames simultaneously, clones the original speaker's voice in the output, and covers 60 input languages at 2.8 seconds of latency.
No turn-detection. No generic synthesis voice replacing the speaker.
Here's what's actually interesting:
→ Vision-enhanced comprehension reads lip movements, gestures, and on-screen text alongside audio — robust in noisy or degraded audio environments
→ Semantic unit prediction (processing by "reading units") commits to output segments mid-sentence, enabling continuous streaming without waiting for full utterances
→ Real-time voice cloning replicates the original speaker's voice profile from a single spoken sentence
→ Dynamic keyword configuration lets you inject domain-specific glossaries at runtime — brand names, medical terms, legal vocabulary
→ FLEURS and CoVoST2 benchmarks: outperforms major commercial alternatives across multilingual speech translation tasks
Full analysis: https://t.co/gVorchcSuU
Technical details: https://t.co/R3QQurGlB9
@Alibaba_Qwen #tts #audioai #voiceai #ai @Ali_TongyiLab
Deep Learning with C++ — Design and deploy neural networks using CUDA for high-performance AI in C++
Get the book at https://t.co/RzMRhYihTE from @PacktPublishing @PacktDataML
We at Marktechpost have been building a GitHub repository of 300+ hands-on Jupyter notebooks covering the tools, models, and frameworks that actually matter for AI Agents and Agentic AI.
Here's what's inside:
→ LLM fine-tuning, RAG pipelines, and agentic workflows — end to end
→ Notebooks for open-source models: LLaMA, Mistral, Qwen, Gemma, and more
→ Covers LangChain, LlamaIndex, HuggingFace, vLLM, and the full modern stack
→ Every notebook is runnable — Google Colab links included
→ Updated continuously as new models and frameworks drop
The goal was simple: if you read about something on Marktechpost, you should be able to run it the same day.
300+ notebooks. Zero paywalls.
https://t.co/B8Z6nRou83
Most "privacy-preserving" AI memory just masks sensitive values with ***. That breaks the task. The cloud can't draft your doctor's email if the blood pressure reading is gone.
MemTensor just proposed a different approach — and it actually holds up under benchmarking.
They introduced MemPrivacy, a framework that runs a lightweight on-device model to detect private spans, replaces them with semantically typed placeholders like <Health_Info_1> before anything leaves the device, and restores the original values locally after the cloud responds. The cloud reasons on structure. It never sees the actual data.
Here's what's actually interesting:
→ Four-level privacy taxonomy (PL1–PL4) from general preferences to immediately exploitable credentials — user-configurable per session
→ MemPrivacy-4B-RL hits 85.97% F1 on MemPrivacy-Bench vs. 78.41% for Gemini-3.1-Pro and 68.99% for GPT-5.2 on privacy span extraction
→ Utility loss across LangMem, Mem0, and Memobase stays within 1.6% at PL2–PL4 protection — irreversible masking causes drops up to 41.87%
→ Models run at 0.6B, 1.7B, and 4B parameters with sub-2-second per-message latency on-device
The core insight: privacy protection and semantic utility don't have to trade off — if you replace values with typed structure instead of blank masks.
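The mask-then-restore flow described above can be sketched in a few lines. This is only an illustration: the regex detectors stand in for the on-device model, and the `Email` label, the `mask` and `restore` helpers, and the sample messages are hypothetical. Only the `<Health_Info_1>` placeholder style comes from the post.

```python
import re

# Hypothetical stand-in detectors; the real system uses an on-device model.
PATTERNS = {
    "Health_Info": re.compile(r"\b\d{2,3}/\d{2,3} mmHg\b"),
    "Email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
}


def mask(text):
    """Replace private spans with typed placeholders; keep the mapping local."""
    mapping, counters = {}, {}

    def make_sub(label):
        def sub(match):
            counters[label] = counters.get(label, 0) + 1
            key = f"<{label}_{counters[label]}>"
            mapping[key] = match.group(0)
            return key
        return sub

    for label, pattern in PATTERNS.items():
        text = pattern.sub(make_sub(label), text)
    return text, mapping


def restore(text, mapping):
    """Put the original values back after the cloud has responded."""
    for key, value in mapping.items():
        text = text.replace(key, value)
    return text


message = "My blood pressure was 150/95 mmHg, email me at sam@example.com."
masked, mapping = mask(message)

# Simulated cloud reply that reasons over the placeholders only.
cloud_reply = "Dear doctor, my reading was <Health_Info_1>. Reply to <Email_1>."
final = restore(cloud_reply, mapping)
print(masked)
print(final)
```

Because each placeholder carries a type, the cloud model can still write a sensible sentence around it, which a blank `***` mask would not allow.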
Full analysis: https://t.co/Zn62GYvv7G
Paper: https://t.co/Y2NEV8Mam6
Model Weights: https://t.co/VC3Rn6Iap7
@ModelScope2022 #ai #data #privacy #model #llm
Most "4-bit training" results come from small models on short token horizons because the format breaks before you can validate it. That's not pretraining — and NVIDIA just drew a clear line between the two.
They introduced the first public 4-bit pretraining run at multi-trillion-token scale — a 12B hybrid Mamba-Transformer (Nemotron-Nano-12B-v2-Base architecture) trained on 10 trillion tokens in NVFP4, a microscaling format with 16-element blocks, E4M3 block scales, and an FP32 per-tensor scale, with downstream accuracy closely tracking an FP8 baseline.
Here's what's actually interesting:
→ MMLU-Pro 5-shot: 62.58% (NVFP4) vs 62.62% (FP8). MMLU 76.57 vs 77.36. GSM8K CoT 92.27 vs 89.08. Validation loss within 1% of FP8 in the stable phase
→ Recipe = selective BF16 (~16% of linear layers) + 16×16 Random Hadamard Transforms on Wgrad inputs + 2D 16×16 weight scaling + stochastic rounding on gradients. Ablations show all four are required
→ Only linear-layer GEMMs run in NVFP4 — attention, embeddings, normalization, master weights, gradients, and optimizer states stay in BF16/FP32
→ On an 8B model, MXFP4 needed 1.36T tokens (+36%) to match NVFP4's loss at 1T tokens
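A rough sketch of block-scaled 4-bit quantization with the two rounding modes mentioned above. It uses the FP4 E2M1 magnitude grid and 16-element blocks, but it keeps each block scale as a plain Python float and omits the FP32 per-tensor scale, so it is a simplification and not NVIDIA's recipe. The function names are hypothetical.

```python
import math
import random

# Magnitudes representable in FP4 E2M1 (the sign is stored separately).
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
BLOCK = 16


def round_nearest(mag):
    return min(E2M1_GRID, key=lambda g: abs(g - mag))


def round_stochastic(mag, rng):
    # Round up with probability proportional to the distance from the lower
    # neighbour, so the rounded value equals mag in expectation.
    lo = max(g for g in E2M1_GRID if g <= mag)
    hi = min(g for g in E2M1_GRID if g >= mag)
    if hi == lo:
        return lo
    return hi if rng.random() < (mag - lo) / (hi - lo) else lo


def quantize_block(block, rng=None):
    """Quantize-dequantize one block; pass rng to use stochastic rounding."""
    amax = max(abs(x) for x in block)
    scale = amax / 6.0 if amax > 0 else 1.0  # map the block maximum to 6.0
    out = []
    for x in block:
        mag = min(abs(x) / scale, 6.0)
        q = round_stochastic(mag, rng) if rng else round_nearest(mag)
        out.append(math.copysign(q, x) * scale)
    return out


def fake_quantize(values, rng=None):
    out = []
    for i in range(0, len(values), BLOCK):
        out.extend(quantize_block(values[i:i + BLOCK], rng))
    return out


rng = random.Random(0)
weights = [rng.gauss(0.0, 1.0) for _ in range(64)]
nearest = fake_quantize(weights)
stochastic = fake_quantize(weights, rng)
print(max(abs(a - b) for a, b in zip(weights, nearest)))
```

Round-to-nearest minimizes the error of each element, while stochastic rounding trades a little per-element error for zero bias on average, which is one plausible reading of why the recipe applies it to gradients.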
Full Analysis: https://t.co/IXByEJbZuJ
Paper: https://t.co/VRkTaIXApx
@NVIDIAAI @ctnzr