Machine Learning Community ⭐️

about 12 hours ago

@Sumanth_077 Love seeing Kimi-K2.6 in the list, Great share!

c4ml_ retweeted

about 12 hours ago

Huge! Real-world agentic leaderboard from Arena. Instead of synthetic benchmarks, it measures how models actually perform when real users put them to work - writing code, debugging projects, researching the web, building apps, analyzing documents. The methodology is different from anything else out there. Arena uses causal tracing - treating each model as a randomized controlled trial across millions of live sessions. Every session produces signals from two sources: the user (praise, complaints, corrections, task approval) and the environment (shell errors, tool failures, recovery attempts). Five signals measured: task success, praise vs complaint, steerability, bash error recovery, and tool hallucination. Each has its own leaderboard, and there is also an overall one. The scale behind this: 300K+ tasks, 2M+ tool calls, 40M lines of code written by agents in a single week. Current leaderboard: 1. GPT-5.5 (High) 2. Claude Opus 4.7 (Thinking) 3. GLM-5.1 4. Gemini-3.1-Pro 5. Kimi-K2.6 I've shared the full methodology blog in the replies!

Sumanth_077's tweet photo. Huge! Real-world agentic leaderboard from Arena.

Instead of synthetic benchmarks, it measures how models actually perform when real users put them to work - writing code, debugging projects, researching the web, building apps, analyzing documents.

The methodology is different from anything else out there. Arena uses causal tracing - treating each model as a randomized controlled trial across millions of live sessions. Every session produces signals from two sources: the user (praise, complaints, corrections, task approval) and the environment (shell errors, tool failures, recovery attempts).

Five signals measured: task success, praise vs complaint, steerability, bash error recovery, and tool hallucination. Each has its own leaderboard, and there is also an overall one.

The scale behind this: 300K+ tasks, 2M+ tool calls, 40M lines of code written by agents in a single week.

Current leaderboard:
1. GPT-5.5 (High)
2. Claude Opus 4.7 (Thinking)
3. GLM-5.1
4. Gemini-3.1-Pro
5. Kimi-K2.6

I've shared the full methodology blog in the replies!

2 days ago

@Sumanth_077 No separate vision or audio encoder is interesting architectural approach going forward for multimodal models.

218

c4ml_ retweeted

🔨 Founder & CEO @ SwirlAI 📖 Writing about #LLM, #AI, #DataEngineering, #MachineLearning and #Data ✍️ Author of SwirlAI Newsletter.

2 days ago

Google just dropped Gemma 4 12B! You can now run it locally on just 8GB RAM using Dynamic GGUF from Unsloth. The architecture is different from any multimodal model before it. No separate vision encoder, no audio encoder. Both flow directly into the LLM backbone. Vision is handled by a lightweight embedding module - a single matrix multiplication. Audio is projected directly into the same space as text tokens. This is also the first mid-sized Gemma model with native audio input. Text, images, and audio all handled natively in one model running on your laptop. Key details: • Encoder-free architecture: vision and audio inputs flow directly into the LLM • Text, image, and audio support in one unified model • Runs on 8GB RAM (4-bit) or 14GB (8-bit) • 256K context window • 140+ languages support • Performance nearing the larger 26B MoE model • Multi-Token Prediction drafters built in to reduce latency • Thinking mode with explicit enable/disable control • Apache 2.0 license I've shared the Unsloth guide to run it locally in the replies!

Sumanth_077's tweet photo. Google just dropped Gemma 4 12B!

You can now run it locally on just 8GB RAM using Dynamic GGUF from Unsloth.

The architecture is different from any multimodal model before it. No separate vision encoder, no audio encoder. Both flow directly into the LLM backbone. Vision is handled by a lightweight embedding module - a single matrix multiplication. Audio is projected directly into the same space as text tokens.

This is also the first mid-sized Gemma model with native audio input. Text, images, and audio all handled natively in one model running on your laptop.

Key details:

• Encoder-free architecture: vision and audio inputs flow directly into the LLM
• Text, image, and audio support in one unified model
• Runs on 8GB RAM (4-bit) or 14GB (8-bit)
• 256K context window
• 140+ languages support
• Performance nearing the larger 26B MoE model
• Multi-Token Prediction drafters built in to reduce latency
• Thinking mode with explicit enable/disable control
• Apache 2.0 license

I've shared the Unsloth guide to run it locally in the replies!

Who to follow

Aurimas Griciūnas

@Aurimas_Gr

Akshay 🚀

@akshay_pachaar

Simplifying LLMs, AI Agents, RAG, and Machine Learning for you! • Co-founder @dailydoseofds_• BITS Pilani • 3 Patents • ex-AI Engineer @ LightningAI

Simplifying LLMs, RAG, Machine Learning & AI Agents for you! • ML Developer Advocate • Shipping Open Source AI apps

c4ml_ retweeted

LandingAI

@LandingAI

2 days ago

https://t.co/iMNTj8n1UM

4 days ago

@Sumanth_077 Thanks for curating. This is definetely gonna help a lot!

263

c4ml_ retweeted

4 days ago

AI Engineering Toolkit! I have curated a list of 100+ libraries and frameworks for training, fine-tuning, building, evaluating and deploying LLMs, RAG, and AI Agents. Categories of LLM Libraries include: • Vector Databases – Store and retrieve embeddings efficiently. • Orchestration & Workflows – Chain tools and LLM calls, manage pipelines. • Agent Frameworks – Build autonomous and multi-step agents. • Training & Fine-Tuning – Pretrain, fine-tune, and adapt models. • Inference & Serving – Run LLMs efficiently on diverse hardware. • Safety & Security – Guardrails, red-teaming, and policy checks. • Evaluation & Quality – Test and monitor both LLMs and LLM-powered apps (benchmarks, unit tests, telemetry, feedback). • Model Management – Versioning, experiment tracking, lineage, and lifecycle management. • AI App Development – Build UIs and LLM-powered apps quickly with Python-based frameworks. Link to the repo in the comments!

Sumanth_077's tweet photo. AI Engineering Toolkit!

I have curated a list of 100+ libraries and frameworks for training, fine-tuning, building, evaluating and deploying LLMs, RAG, and AI Agents.

Categories of LLM Libraries include:

• Vector Databases – Store and retrieve embeddings efficiently.
• Orchestration & Workflows – Chain tools and LLM calls, manage pipelines.
• Agent Frameworks – Build autonomous and multi-step agents.
• Training & Fine-Tuning – Pretrain, fine-tune, and adapt models.
• Inference & Serving – Run LLMs efficiently on diverse hardware.
• Safety & Security – Guardrails, red-teaming, and policy checks.
• Evaluation & Quality – Test and monitor both LLMs and LLM-powered apps (benchmarks, unit tests, telemetry, feedback).
• Model Management – Versioning, experiment tracking, lineage, and lifecycle management.
• AI App Development – Build UIs and LLM-powered apps quickly with Python-based frameworks.

Link to the repo in the comments!

12K

c4ml_ retweeted

LandingAI

@LandingAI

7 days ago

https://t.co/c8T0gqbEm4

17K

7 days ago

@Sumanth_077 This is a bit complex but really interesting and could be a real part of agent harness, thanks for the share.

465

c4ml_ retweeted

7 days ago

Microsoft just turned SKILL .md into a trainable object! SkillOpt is a text-space optimizer for agent skills. Instead of hand-writing or one-shot generating your SKILL .md, SkillOpt treats the skill document as the trainable external state of a frozen agent and optimizes it through a feedback loop. The core idea: a separate optimizer model analyzes agent rollout trajectories, proposes bounded add/delete/replace edits to the skill document, and accepts only edits that strictly improve performance on a held-out validation split. Rejected edits go into a buffer as negative feedback for future iterations. The deep learning analogy is intentional. Rollout batch is your training data. Edit budget is your learning rate. Validation gate is your validation set. Rejected-edit buffer is your negative feedback signal. The optimizer runs offline. The deployed artifact is just a static SKILL .md file. Results on GPT-5.5 across 6 benchmarks: +23.5 points average over no-skill baseline in direct chat, +24.8 inside Codex, +19.1 inside Claude Code. SpreadsheetBench jumped from 41.8 to 80.7. OfficeQA from 33.1 to 72.1. Best or tied-best on 52 of 52 evaluated cells. What's striking: these gains come from just 1-4 accepted edits. The final skill stays compact at 300-2000 tokens. One accepted edit gave OfficeQA a +39 point gain. Optimized skills also transfer. A SpreadsheetBench skill trained in Codex transferred to Claude Code with a +59.7 point gain. Skills trained on GPT-5.4 improved every smaller GPT variant tested. Key capabilities: • Text-space skill optimization with no model weight updates • Bounded add/delete/replace edits with validation gating • Rejected-edit buffer as negative feedback • Epoch-wise slow/meta update for longer-horizon learning • Works across Claude Code, Codex, and direct chat harnesses • Optimized skills transfer across models, harnesses, and benchmarks 100% Open Source I've shared the link to the paper and repo in the comments!

Sumanth_077's tweet photo. Microsoft just turned SKILL .md into a trainable object!

SkillOpt is a text-space optimizer for agent skills. Instead of hand-writing or one-shot generating your SKILL .md, SkillOpt treats the skill document as the trainable external state of a frozen agent and optimizes it through a feedback loop.

The core idea: a separate optimizer model analyzes agent rollout trajectories, proposes bounded add/delete/replace edits to the skill document, and accepts only edits that strictly improve performance on a held-out validation split. Rejected edits go into a buffer as negative feedback for future iterations.

The deep learning analogy is intentional. Rollout batch is your training data. Edit budget is your learning rate. Validation gate is your validation set. Rejected-edit buffer is your negative feedback signal. The optimizer runs offline. The deployed artifact is just a static SKILL .md file.

Results on GPT-5.5 across 6 benchmarks: +23.5 points average over no-skill baseline in direct chat, +24.8 inside Codex, +19.1 inside Claude Code. SpreadsheetBench jumped from 41.8 to 80.7. OfficeQA from 33.1 to 72.1. Best or tied-best on 52 of 52 evaluated cells.

What's striking: these gains come from just 1-4 accepted edits. The final skill stays compact at 300-2000 tokens. One accepted edit gave OfficeQA a +39 point gain.

Optimized skills also transfer. A SpreadsheetBench skill trained in Codex transferred to Claude Code with a +59.7 point gain. Skills trained on GPT-5.4 improved every smaller GPT variant tested.

Key capabilities:

• Text-space skill optimization with no model weight updates
• Bounded add/delete/replace edits with validation gating
• Rejected-edit buffer as negative feedback
• Epoch-wise slow/meta update for longer-horizon learning
• Works across Claude Code, Codex, and direct chat harnesses
• Optimized skills transfer across models, harnesses, and benchmarks

100% Open Source

I've shared the link to the paper and repo in the comments!

148

181

18K

9 days ago

@Sumanth_077 This is really cool, Thanks for sharing!

c4ml_ retweeted

9 days ago

Self Improving AI (SIA) beats Karpathy's autoresearcher agent by improving itself! SIA is a Self Improving AI framework to autonomously improve the performance of any AI system (Model / Agent) on a benchmark task. Most agent frameworks are static. Fixed harness, fixed model weights, fixed memory layer. They plan, act, and use tools. SIA operates on a different layer entirely. SIA focuses on one problem: how do you design structured feedback loops that allow an agent to evaluate its own performance, adapt its strategy, and get better over time? After every run, SIA evaluates itself and improves three things. It updates its own harness. Updates the weights of its underlying model. Updates its own memory layer to handle new complexities. The agent rewrites itself based on what it learned. On MLE-Bench, OpenAI's benchmark for evaluating an agent's ability to train ML models, SIA climbed to the top of the leaderboard. Beat every specialized ML research agent including MLEvolve and AIRA-dojo. Then kept improving and displaced its own previous versions on the leaderboard. I've shared the link to the paper and the repo in the replies!

Sumanth_077's tweet photo. Self Improving AI (SIA) beats Karpathy's autoresearcher agent by improving itself!

SIA is a Self Improving AI framework to autonomously improve the performance of any AI system (Model / Agent) on a benchmark task.

Most agent frameworks are static. Fixed harness, fixed model weights, fixed memory layer. They plan, act, and use tools. SIA operates on a different layer entirely.

SIA focuses on one problem: how do you design structured feedback loops that allow an agent to evaluate its own performance, adapt its strategy, and get better over time?

After every run, SIA evaluates itself and improves three things. It updates its own harness. Updates the weights of its underlying model. Updates its own memory layer to handle new complexities. The agent rewrites itself based on what it learned.

On MLE-Bench, OpenAI's benchmark for evaluating an agent's ability to train ML models, SIA climbed to the top of the leaderboard. Beat every specialized ML research agent including MLEvolve and AIRA-dojo. Then kept improving and displaced its own previous versions on the leaderboard.

I've shared the link to the paper and the repo in the replies!

987

174

88K

10 days ago

@Sumanth_077 The SOP building from completed tasks is the real unlock here.

110

c4ml_ retweeted

10 days ago

Run your personal AI company with a team of AI agents! Alook is an open-source collaboration platform for AI coding agents. Self-hosted and local-first. The setup: Define an org structure. Give each agent a role - dev, ops, research, whatever you need. Set reporting lines. Alook gives each agent an email address. How it works: Assign a task to the right agent. They take it from there. Agents coordinate through email - passing deliverables, asking questions, updating status. You see everything in your inbox but you're not routing anything manually. Runs as an always-on daemon. Close your laptop, agents keep working. Come back to finished tasks. Shared memory across all agents. Every agent knows what every other agent worked on. You never re-explain context. After each task completes, Alook logs what worked and builds SOPs. The whole team gets sharper over time. Works with Claude Code, Codex, and OpenCode. Mix and match or run multiple agents from one runtime. Built-in Kanban for task tracking. Calendar for scheduling. Email for all communication. Agents pick up tasks autonomously, update their own calendars, close issues when done. Chat or email with agents like any AI tool. Install the runtime once, runs in the background. No terminal needed after setup. Key capabilities: • Email-based agent coordination with real inboxes • Org structure with roles and reporting lines • Shared memory and self-learning SOPs • Always-on daemon for 24/7 operation • Works with Claude Code, Codex, OpenCode • Built-in Kanban, calendar, and email • Self-hosted and local-first 100% open source. I've shared the Github Repo in the replies!

Sumanth_077's tweet photo. Run your personal AI company with a team of AI agents!

Alook is an open-source collaboration platform for AI coding agents. Self-hosted and local-first.

The setup: Define an org structure. Give each agent a role - dev, ops, research, whatever you need. Set reporting lines. Alook gives each agent an email address.

How it works: Assign a task to the right agent. They take it from there. Agents coordinate through email - passing deliverables, asking questions, updating status. You see everything in your inbox but you're not routing anything manually.

Runs as an always-on daemon. Close your laptop, agents keep working. Come back to finished tasks.

Shared memory across all agents. Every agent knows what every other agent worked on. You never re-explain context. After each task completes, Alook logs what worked and builds SOPs. The whole team gets sharper over time.

Works with Claude Code, Codex, and OpenCode. Mix and match or run multiple agents from one runtime.

Built-in Kanban for task tracking. Calendar for scheduling. Email for all communication. Agents pick up tasks autonomously, update their own calendars, close issues when done.

Chat or email with agents like any AI tool. Install the runtime once, runs in the background. No terminal needed after setup.

Key capabilities:

• Email-based agent coordination with real inboxes
• Org structure with roles and reporting lines
• Shared memory and self-learning SOPs
• Always-on daemon for 24/7 operation
• Works with Claude Code, Codex, OpenCode
• Built-in Kanban, calendar, and email
• Self-hosted and local-first

100% open source.

I've shared the Github Repo in the replies!

10K

11 days ago

@Sumanth_077 Great to see all the coding agents support including OpenCode, Codex, OpenClaw. Thanks for share

242

c4ml_ retweeted

11 days ago

Turn any GUI app into a CLI for AI agents! CLI-Anything automatically generates command-line interfaces for any software by analyzing its codebase. Point it at GIMP, Blender, LibreOffice, or any application and get a production-ready CLI that AI agents can use. The problem: AI agents can't use professional software. These tools have GUIs, not command interfaces. UI automation breaks constantly. APIs don't exist or miss 90% of functionality. Building custom CLIs manually takes months. CLI-Anything runs as a Claude Code plugin. You call `/cli-anything ./gimp` and it analyzes the codebase, designs command architecture, implements the full CLI, writes comprehensive tests, and publishes everything as a pip-installable package. The generated CLIs call the actual software backends. Blender renders 3D scenes, LibreOffice generates PDFs, Zoom schedules meetings. How it works depends on what the software provides. Tools with REST APIs get wrapped with OAuth handling. Tools with Python scripting get script generators. Tools with documented file formats get file creators plus rendering calls. Each CLI works the same way for agents. Commands return structured JSON. REPL mode for interactive use. pip-installable to PATH. Comprehensive tests included. Tested across 16 diverse applications: GIMP, Blender, Inkscape, Audacity, LibreOffice, OBS Studio, Kdenlive, Shotcut, Zoom, Mermaid, ComfyUI, Ollama. 1,839 passing tests with 100% pass rate. Key capabilities: • Automated CLI generation from any codebase • Calls actual software backends (no replacements) • Multiple integration approaches (APIs, scripting, file formats) • Production-ready with comprehensive tests • pip-installable with JSON output for agents • REPL mode for interactive use Built as Claude Code plugin. Works with OpenCode, Codex, OpenClaw, Qodercli. I've shared the link to the repo in the comments!

Sumanth_077's tweet photo. Turn any GUI app into a CLI for AI agents!

CLI-Anything automatically generates command-line interfaces for any software by analyzing its codebase. Point it at GIMP, Blender, LibreOffice, or any application and get a production-ready CLI that AI agents can use.

The problem: AI agents can't use professional software. These tools have GUIs, not command interfaces. UI automation breaks constantly. APIs don't exist or miss 90% of functionality. Building custom CLIs manually takes months.

CLI-Anything runs as a Claude Code plugin. You call `/cli-anything ./gimp` and it analyzes the codebase, designs command architecture, implements the full CLI, writes comprehensive tests, and publishes everything as a pip-installable package.

The generated CLIs call the actual software backends. Blender renders 3D scenes, LibreOffice generates PDFs, Zoom schedules meetings.

How it works depends on what the software provides. Tools with REST APIs get wrapped with OAuth handling. Tools with Python scripting get script generators. Tools with documented file formats get file creators plus rendering calls.

Each CLI works the same way for agents. Commands return structured JSON. REPL mode for interactive use. pip-installable to PATH. Comprehensive tests included.

Tested across 16 diverse applications: GIMP, Blender, Inkscape, Audacity, LibreOffice, OBS Studio, Kdenlive, Shotcut, Zoom, Mermaid, ComfyUI, Ollama. 1,839 passing tests with 100% pass rate.

Key capabilities:

• Automated CLI generation from any codebase
• Calls actual software backends (no replacements)
• Multiple integration approaches (APIs, scripting, file formats)
• Production-ready with comprehensive tests
• pip-installable with JSON output for agents
• REPL mode for interactive use

Built as Claude Code plugin. Works with OpenCode, Codex, OpenClaw, Qodercli.

I've shared the link to the repo in the comments!

13 days ago

@Sumanth_077 Really interesting as it is using a Rust-based engine, and it's is 5x faster.

177

c4ml_ retweeted

13 days ago

Turn any document into structured data for AI agents! Firecrawl just released a new parse endpoint. Upload local files or non-public documents and get back clean, LLM-ready data. The parse endpoint converts PDF, DOCX, XLSX, HTML, and other formats into Markdown, JSON, or structured output. Reading order and tables are preserved. Upload a file via multipart/form-data. The endpoint processes it using a Rust-based engine (up to 5x faster) and returns your chosen format. Key capabilities: • Multiple output formats: Markdown, JSON, HTML, summaries, extracted links, or metadata • Preserves document structure, reading order, and tables • Extracts metadata automatically (title, description, language) • Zero data retention option (document not logged or stored) • Content filtering via includeTags and excludeTags Built for AI agent pipelines that need clean document data at scale. I've shared the link in the comments!

Sumanth_077's tweet photo. Turn any document into structured data for AI agents!

Firecrawl just released a new parse endpoint. Upload local files or non-public documents and get back clean, LLM-ready data.

The parse endpoint converts PDF, DOCX, XLSX, HTML, and other formats into Markdown, JSON, or structured output. Reading order and tables are preserved.

Upload a file via multipart/form-data. The endpoint processes it using a Rust-based engine (up to 5x faster) and returns your chosen format.

Key capabilities:

• Multiple output formats: Markdown, JSON, HTML, summaries, extracted links, or metadata
• Preserves document structure, reading order, and tables
• Extracts metadata automatically (title, description, language)
• Zero data retention option (document not logged or stored)
• Content filtering via includeTags and excludeTags

Built for AI agent pipelines that need clean document data at scale.

I've shared the link in the comments!

14 days ago

@Sumanth_077 The dynamic quantization selection is smart. Great share!

248

c4ml_ retweeted