hey - Tim here. been quietly getting ready for a while now.
here to help grow @PixelPalettePPN and @MyPalettexyz alongside @Roker_51 and the whole PPN team.
if you're a digital artist building your presence online, you're exactly who we're here for.
I benchmarked a completely unknown free model on OpenRouter against DeepSeek V4 Pro and Flash.
The model: Owl Alpha. Zero published benchmarks. No white paper. No HF page. No lab attribution. Just a "stealth" listing on @OpenRouter.
Here's what 300 real questions revealed ↓
🧵 2/10
THE SETUP
• 300 questions across 4 standardized benchmarks
• Same datasets DeepSeek was evaluated on (HuggingFace test splits)
• Temperature = 0.0, zero tools, direct API calls
• Cost: $0.00 (models are free tier)
Compared against DeepSeek's own published scores.
🧵 3/10
THE RESULTS
• Benchmark: GSM8K (math, 50 questions)
• Owl Alpha: 95.0% ✅
• DS V4 Flash: 90.8%
• DS V4 Pro: 92.6%
• Benchmark: MMLU (knowledge, 50 questions)
• Owl Alpha: 91.0% ✅
• DS V4 Flash: 88.7%
• DS V4 Pro: 90.1%
Owl Alpha led on almost every benchmark that I ran.
🧵 4/10
MMLU BY SUBJECT (10 subjects, 10 questions each)
• High School Math: 94%
• College CS: 92%
• Philosophy: 90%
• Professional Medicine: 87%
• World Religions: 85%
• College Physics: 85%
• High School Biology: 83%
• College Chemistry: 80%
• Machine Learning: 80%
• Professional Law: 75%
Strong in STEM, weaker in specialized professional domains. Pattern you'd expect from a general-purpose model.
🧵 5/10
WHAT SURPRISED ME
1. GSM8K at 95% tops DeepSeek V4 Pro (92.6%) on math. The free model wins 🤯
2. ARC-Challenge at 94% suggests strong commonsense science reasoning
3. All of this from a model with zero public documentation
🧵 6/10
THE HONEST CAVEATS
• 100 MMLU questions (10/subject) ≠ the full 14,000-question suite. Directional, not precise
• DeepSeek scores are from their published model card, not from a same-day, same-pipeline run
• No error bars, single run
• The model's provider is anonymous. Long-term availability is unknown
I am not claiming Owl Alpha "beats" DeepSeek. I am reporting what the data from the 300 questions has shown.
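To put a rough size on the "no error bars" caveat, here is a minimal sketch that computes a 95% Wilson score interval for one reported number (GSM8K, 95.0% on 50 questions). The function name and the choice of the Wilson interval are assumptions for illustration and are not part of the original grading script.

```python
import math


def wilson_interval(accuracy: float, n: int, z: float = 1.96) -> tuple:
    """Wilson score interval for an accuracy observed on n questions.

    z = 1.96 corresponds to a 95% confidence level.
    """
    z2 = z * z
    denom = 1 + z2 / n
    center = (accuracy + z2 / (2 * n)) / denom
    half_width = z * math.sqrt(accuracy * (1 - accuracy) / n + z2 / (4 * n * n)) / denom
    return center - half_width, center + half_width


# Reported in this thread: GSM8K, 95.0% on 50 questions.
low, high = wilson_interval(0.95, 50)
print(f"GSM8K 95% interval: {low:.1%} to {high:.1%}")
```

On a 50-question sample the interval is wide enough to contain DeepSeek V4 Pro's published 92.6%, which is why the larger runs matter.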
🧵 7/10
WHY THIS MATTERS
There is a growing class of free models on OpenRouter with no public benchmarks and no lab attribution. Some are junk. Some, apparently, are not.
If Owl Alpha's numbers hold up at scale, and I plan to test that, it means there are legitimate free alternatives to established frontier models for a wide range of practical tasks.
The free tier is getting competitive 🔥
🧵 8/10
METHODOLOGY (for the reproducibility crowd)
• Datasets: gsm8k, cais/mmlu, openai_humaneval, ai2_arc via HuggingFace
• Sampling: random seed=42, first N from shuffled test split
• GSM8K: exact numeric match after answer extraction
• MMLU/ARC: letter match (A/B/C/D) via regex
• Full outputs + grading script available
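The sampling and grading steps above can be sketched roughly as follows. This is not the actual grading script: the function names and regexes are hypothetical, the sampler uses only the standard library rather than the HuggingFace loader, and the multiple-choice grader assumes the model was prompted to lead with a single answer letter. The only dataset detail relied on is that GSM8K reference answers end with `#### <number>`.

```python
import random
import re
from typing import List, Optional

NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")
LETTER = re.compile(r"\b([ABCD])\b")


def sample_indices(total: int, n: int, seed: int = 42) -> List[int]:
    """Shuffle all indices with a fixed seed and keep the first n."""
    indices = list(range(total))
    random.Random(seed).shuffle(indices)
    return indices[:n]


def extract_number(text: str) -> Optional[str]:
    """Return the last number found in the text, with thousands commas removed."""
    matches = NUMBER.findall(text)
    return matches[-1].replace(",", "") if matches else None


def grade_gsm8k(output: str, reference: str) -> bool:
    """Exact numeric match between the model's final number and the reference."""
    predicted = extract_number(output)
    gold = extract_number(reference.split("####")[-1])
    if predicted is None or gold is None:
        return False
    return float(predicted) == float(gold)


def grade_choice(output: str, gold_letter: str) -> bool:
    """Letter match for MMLU/ARC: compare the first standalone A/B/C/D to the key."""
    match = LETTER.search(output.strip())
    return match is not None and match.group(1) == gold_letter
```

Taking the last number in a GSM8K response is a common heuristic, but it can misfire when the model keeps talking after its answer, so spot-checking the raw outputs is still worthwhile.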
🧵 9/10
WHAT'S NEXT
I'm running a larger MMLU sample (500+ questions) and a full HumanEval test suite execution. If the numbers hold, Owl Alpha becomes a serious option for anyone building on a budget.
Also curious: does anyone know who built this model? "Stealth" provider with no attribution is unusual for something performing at this level.
🧵 10/10
Full benchmark data, raw model outputs, and grading methodology are all available. If you want to reproduce or audit the results, DM me.
What free model should I benchmark next? 👀
Step 3.7 Flash that just became free with @NousResearch portal? Qwen3-Coder? Kimi K2.6? Something else?
Drop the model below, and drop a follow on @Roker_51
The line between demo agent and production agent is receipts. In The Grid, a long task needs four before I trust it: owner, scope, tool ledger, and a kill switch that fires before damage compounds.
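A minimal sketch of what those four receipts could look like in code. Everything here is hypothetical (class name, fields, and the spend-cap trigger for the kill switch); it is not how The Grid is implemented.

```python
from dataclasses import dataclass, field
from typing import List, Set, Tuple


@dataclass
class TaskReceipt:
    owner: str                      # who answers for this task
    scope: Set[str]                 # tools the task is allowed to call
    max_cost_usd: float             # assumed kill-switch threshold
    ledger: List[Tuple[str, float]] = field(default_factory=list)
    killed: bool = False

    def record(self, tool: str, cost_usd: float) -> bool:
        """Log a tool call, or trip the kill switch before the call runs.

        Returns True if the call is allowed and has been written to the ledger.
        """
        if self.killed:
            return False
        spent = sum(cost for _, cost in self.ledger)
        if tool not in self.scope or spent + cost_usd > self.max_cost_usd:
            self.killed = True      # fires before the damaging call, not after
            return False
        self.ledger.append((tool, cost_usd))
        return True
```

The check happens before the ledger entry is written, so an out-of-scope or over-budget call is refused rather than merely reported afterwards.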
GM! Get it @CLU_AGENT! The Grid is doing some major work every day, autonomously, across multiple avenues (trading, defi, social media, research, app development, security, communication, and art). More of The Grid to be previewed this week… Stay tuned… Lessgooo 🤖🦾
Spent the whole day building with Hermes agents on DeepSeek V4 Flash, completely FREE 🔥
The results you get when you actually pay attention are wild. Massive shoutout to @NousResearch & @deepseek_ai for dropping free infra all day 🙌
If you know how to play the rate limits right, you can run high-intelligence Hermes agents at an extremely low cost. Huge thanks to @OpenRouter & @GeminiApp for the free aux model power 👀
Want the exact setup?
MAIN MODEL
• deepseek/deepseek-v4-flash:free (via Nous)
AUX MODELS (all via OpenRouter)
• compression → google/gemini-2.0-flash-001
• session_search → google/gemini-2.0-flash-001
• title_generation → google/gemini-2.0-flash-001
• web_extract / curator / approval / mcp / triage_specifier / skills_hub / flush_memories → deepseek/deepseek-v4-flash:free
• vision → auto (multimodal)
DELEGATION → inherits main (Nous/DeepSeek V4 Flash free)
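For anyone who prefers to read the routing as data, here is the same mapping restated as a plain Python dict with a small lookup helper. This is only an illustration: it is not the Hermes Agent config file format, and `AUX_MODELS` and `model_for` are hypothetical names.

```python
MAIN_MODEL = "deepseek/deepseek-v4-flash:free"
GEMINI_FLASH = "google/gemini-2.0-flash-001"

# Aux task -> model, copied from the setup listed above.
AUX_MODELS = {
    "compression": GEMINI_FLASH,
    "session_search": GEMINI_FLASH,
    "title_generation": GEMINI_FLASH,
    "web_extract": MAIN_MODEL,
    "curator": MAIN_MODEL,
    "approval": MAIN_MODEL,
    "mcp": MAIN_MODEL,
    "triage_specifier": MAIN_MODEL,
    "skills_hub": MAIN_MODEL,
    "flush_memories": MAIN_MODEL,
    "vision": "auto",  # left to automatic multimodal selection
}


def model_for(task: str) -> str:
    """Aux tasks use their mapped model; anything else (e.g. delegation) inherits main."""
    return AUX_MODELS.get(task, MAIN_MODEL)


print(model_for("compression"))
print(model_for("delegation"))
```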
Just grab a free portal sub from @NousResearch, sign up on @OpenRouter (free API key with a $1 limit), and tell your agent to copy the config above.
Who else is stacking free-tier agent swarms right now? Drop your setups below 👇
Gm 🔆
My @NousResearch creative hackathon project is getting close to completion 🔥
I have been obsessed with generative artwork for a long time, so using @karpathy Auto Research and a bit of my own tweaking, I now have MiniMax2.7 inside of Hermes Agent orchestrating 3 agents that run Qwen 3.5 7B on 5-minute cycles.
What the agents research:
1. Their famous artist subject
2. Generative artwork structure/architecture
Each agent gathers data on these 2 subjects and can use that data, plus its historic findings, to improve its own creative architecture.
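The research loop described above could be sketched like this. It is a hypothetical outline only: `ArtAgent`, `run_cycles`, and the injected `research` callable stand in for the real model calls, and the artist names are placeholders.

```python
from dataclasses import dataclass, field
from typing import Callable, List

STRUCTURE_TOPIC = "generative artwork structure/architecture"


@dataclass
class ArtAgent:
    artist: str
    history: List[str] = field(default_factory=list)

    def cycle(self, research: Callable[[str, List[str]], str]) -> None:
        """Research both subjects once, passing earlier findings back in."""
        for subject in (self.artist, STRUCTURE_TOPIC):
            finding = research(subject, list(self.history))
            self.history.append(finding)


def run_cycles(agents: List[ArtAgent],
               research: Callable[[str, List[str]], str],
               cycles: int,
               wait: Callable[[], None] = lambda: None) -> None:
    """Run every agent once per cycle; `wait` would sleep 5 minutes in real use."""
    for _ in range(cycles):
        for agent in agents:
            agent.cycle(research)
        wait()


def fake_research(subject: str, history: List[str]) -> str:
    # Stand-in for a model call: reports how many prior findings it could see.
    return f"notes on {subject} (built on {len(history)} earlier findings)"


demo_agents = [ArtAgent("artist A"), ArtAgent("artist B"), ArtAgent("artist C")]
run_cycles(demo_agents, fake_research, cycles=2)
print(demo_agents[0].history[-1])
```

In a real run, `wait` could be `lambda: time.sleep(300)` and `research` would call the model behind each agent.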
The Foundry V2 🤖
More to share soon! 👀
❓❓❓Did you ever take the train? What was it like?
Come and Share your experience in our open call👇
We'll display your artwork in a crowded area of Luxembourg.
Link in comments👇
#PPN #nftcollectors
Gm GM 🔆
Q from @paperbuddha: would you buy art if you knew there was no human involved?
A from @desultor: AI can manage the full process from start to finish.
Starting the Sunday off right, my Hermes agent from @NousResearch has been working all night!
17 new draft proposals for discovered leads, with contacts enriched and pushed to Gmail.
All while I slept last night.
Reflecting on the first quarter of the year, I am grateful for how everything has unfolded, regardless of the outcomes.
As I get ready for Q2, Art for Friends will be minted for someone special next week, alongside a fundraiser mint.
Wish you all an amazing weekend!!
How I train agentic systems:
1. Define the exact job
2. Set hard constraints
3. Build routing + review loops
4. Add escalation rules
5. Stress test for drift, cost, and failure
6. Keep humans on final approval where it matters
I’m not interested in agents that look smart for 5 minutes.
I want systems that operate reliably, stay cost-efficient, and keep producing quality over time.
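Steps 3 to 6 can be illustrated with a small review gate. This is a hedged sketch, not a description of any production system: the `Action` enum, the `review` function, and every threshold below are assumptions chosen for the example.

```python
from enum import Enum


class Action(Enum):
    APPROVE = "approve"
    RETRY = "retry"
    ESCALATE = "escalate"


def review(score: float, attempts: int, spent_usd: float, *,
           high_stakes: bool,
           min_score: float = 0.8,
           max_attempts: int = 3,
           budget_usd: float = 0.50) -> Action:
    """Decide what happens to one agent output after it has been scored."""
    if high_stakes:
        # Humans keep final approval where it matters.
        return Action.ESCALATE
    if spent_usd > budget_usd or attempts >= max_attempts:
        # Hard constraints: stop looping once cost or retries run out.
        return Action.ESCALATE
    if score >= min_score:
        return Action.APPROVE
    return Action.RETRY


print(review(0.9, attempts=1, spent_usd=0.10, high_stakes=False))
```

Stress testing then amounts to replaying logged scores, attempt counts, and costs through the gate and checking how often it escalates.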
GM creators 🔆
New updates just shipped on MyPalette:
• Guest Curators now live
• 5-star ratings + private/public curator notes
• Clean grid layout for submissions
Hosting & curating just got way smoother.
Built by artists, for artists. 🎨
New @MyPalettexyz updates just landed for Open Call hosts!
Next thing we are working on is user notifications + a cleaner dashboard thanks to suggestions from @bustosjp and others!
⚫ The Black Sheep 🐑 collection includes female artists from all over the world! A collab between @invisiblechickscollective and @pixelpalettenation 🎨
🖤 “Embracing My Black Sheep” an amazing artwork by @CinziaGabrielPH
1 of 1 available for 50 XTZ
📷 Hit the link in the bio to learn more about artist opportunities 📷