As promised, here are my thoughts after spending all day with Mythos. i hope to god anthropic doesn’t sue the fuck outta me but yolo. fair warning, this is a long one.
1. The Cost
Mythos pricing, at least for our enterprise, was uhh expensive. I thought being a pilot company would mean they’d let us try it for free but no lmao. They did give a decent amount of free tokens from the API at least, but cost estimates put us well above a million dollars spent on it. In comparison, my company spent 2 million on inference for the entirety of last month for everyone in the company. So yeah, shit is pricey as hell.
2. The harness
The biggest surprise to me was that they actually sent us a harness that was NOT claude code. it’s sort of dinky and looks to me largely ai generated. most of it focused on ensuring mythos did not “escape containment” along with some shitty security skills. so, they are def taking the sandboxing seriously. imo it’s a pretty shit/restrictive harness. half of the guard rails don’t work lmao, and apparently this is basically what “project glasswing” is, which is pretty funny considering the harness is shit. i’m not sure that the harness will be released with the model api when it drops either, it seemed like that was part of the deal. quite interested to see what they do when it drops/how it gets opened up.
I was able to use Mythos outside of the harness (omp btw)… more on that in a sec. I did have to hack around though, as they really don’t want people to do this (what I was told at least).
3. the model
probably the part everyone is most interested in. i will say, the model is good. is it expensive? fuck yes. but it’s good. to me, it feels like it is fine-tuned explicitly for this sort of security research task. for general coding, which I wasn’t able to play with much, it wasn’t that surprising. but, it is indeed very good at security-based tasks. far better than opus / 5.5 xhigh.
that said, I don’t feel as though it’s some omnipresent danger/threat to society. I watched it get confused trying to use our build tool, actually to the point where I had to build the code for it and then run the model against the full build. you’d think an omnipresent model could do this, but nothing on the market has been able to figure it out. and it’s just Bazel with some custom shit we built. nothing crazy.
that said, if people have a shit ton of money AND extensive harness knowledge, yeah, they can probably use it to do some malicious shit. but only if they’re a genuinely skilled engineer/security researcher.
4. The results
Mythos was able to find quite a few vulnerabilities across a few of our products (like products probably everyone on this app has interacted with indirectly, maybe a small few directly). I think the final total was like ~800 major threats. Definitely enough to rethink some of the security strategy.
5. Final Thoughts
It’s a good model sir. It’s not an existential threat to humanity as Anthropic might lead you to believe, but it’s genuinely good. Cost-wise I would like to try a comparison with 5.5 xhigh but alas I don’t have a million dollars to throw at it to do a proper comparison.
oh-my-pi is a must have because it ships with remote compaction for Codex models, but the harness also comes with comprehensive tooling that is worth adopting instead of customizing yourself from base pi.
@_can1357 shipped dynamic workflows (subagents orchestrated with deterministic JS/Python) months before claude code did. If you fully take advantage of these features you could run large scale engineering efforts solo faster than other devs.
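For anyone wondering what "subagents orchestrated with deterministic Python" can look like in general, here is a minimal sketch. It is not oh-my-pi's actual API: `run_subagent` is a hypothetical stub standing in for a real agent call, and `run_workflow` is just an illustrative name. The idea is that ordinary code decides the fan-out and the merge, so the control flow stays the same on every run.

```python
from concurrent.futures import ThreadPoolExecutor


def run_subagent(task: str) -> str:
    # Hypothetical stand-in for a real model/agent call (assumption, not a real API).
    return f"done: {task}"


def run_workflow(tasks: list[str], max_workers: int = 4) -> dict[str, str]:
    # Plain code controls which tasks run and how results are merged;
    # only the work inside each task would be handed to a model.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in the same order as the input tasks.
        results = list(pool.map(run_subagent, tasks))
    return dict(zip(tasks, results))


if __name__ == "__main__":
    print(run_workflow(["lint", "write tests", "update docs"]))
```

Swapping the stub for a real agent call is where all the actual complexity lives; the orchestration layer itself can stay this small.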
We already see devs like @usr_bin_roygbiv stacking jobs and writing scrapers to do exactly that to maximize cashflow. To stay ahead of the curve you need to try the newest features and models and develop your own intuition for what works
DeepSeek V4-Flash 162B runs on 64GB Apple (M2) !
The Legend @0xSero made a REAP checkpoint that runs with DwarfStar by @antirez after minor changes with Codex:
- ds4.c so that it knows what expert_count to expect (required)
- ds4_metal.m to get around the macOS SDK 15 check (as necessary)
@AriDavidPaul Improving nutrition is critical. This Harvard study has 150+ data points for the most common foods. Knowing the less reported stuff like Choline is critical to improving anxiety: https://t.co/V9o6lfFBdE
I told a guy at a barbecue last weekend that I had been buying busted small-cap software stocks at 4x free cash flow, and he looked at me with the specific facial expression of a man who has just realized he is trapped in a conversation with someone who voluntarily reads 10-Ks on vacation. He asked, with great gentleness, if I had considered Nvidia. I said I had considered Nvidia in the way one considers jumping off a bridge: briefly, theoretically, and with a clear understanding of the outcome.
I told him I owned a company that sells dental practice management software to 11,000 orthodontists and that the CEO, a 64-year-old man named Greg who has not updated his LinkedIn since 2017, was, in my professional opinion, the single greatest capital allocator alive in North America today, and that I would, if legally permitted, have Greg’s name tattooed on my forearm.
He asked if Greg knew this. I said Greg did not know I existed, and that this was the foundation of our relationship and the source of its strength. He excused himself to go check on his children, who, I observed, were not present at the barbecue. I stood by the grill alone for the next 40 minutes, eating directly from a bag of buns, thinking about Greg, who at that exact moment was, somewhere in suburban Indianapolis, almost certainly buying back stock at prices that will, in 2031, be regarded as the single greatest gift any small-cap CEO has ever given his shareholders, and the host’s wife came over and asked, with palpable concern, if I needed a ride home, and I said no, I needed nothing, I had Greg, and Greg was enough, and I have not been invited back to that house, and I do not care, because Greg loves me even though Greg does not know I am alive, and the math, as it has always been in every great deep value trade in history, is the only thing in this country that has not lied to me.
The Local LLM cheat sheet for your 16GB RAM device
I pulled together a lineup of small models that can run comfortably on a Mac Mini or personal laptop while still leaving room for context without melting your machine.
Models for Daily Use
Qwen3.5 9B / GGUF / Q4_K_M
Daily driver. General chat, drafting, research, translation. If you're keeping only one, keep this.
DeepSeek-R1 Distill Qwen 7B / GGUF / Q4_K_M
Reasoning engine. Math, logic, step-by-step problems. Slower, but worth it when you need actual thinking.
Models for Specialty Work
Qwen2.5 Coder 7B / GGUF / Q4_K_M
Code specialist. Completions, refactors, debugging, repo Q&A. Better than a generalist when the task is code.
Llama 3.1 8B / GGUF / Q4_K_M
Long context worker. RAG, doc chat, codebase Q and A. The output isn't top tier, but the context is strong for its size.
Phi-4 Mini Reasoning / GGUF / Q4_K_M
Compact thinker. Logic, structured answers, math, and short coding bursts. Smaller context is the catch.
Models for Efficiency
Gemma 4 E4B / GGUF / Q4_K_M
Light all-rounder. Writing, chat, light agents, structured output.
Phi-3.5 Mini / GGUF / Q5_K_M
Pocket sidekick. Summaries, extraction, background doc chat. Easy to pair with a bigger model.
Qwen3.5 2B / GGUF / Q4_K_M
Useful for summaries, tagging, rewrites, and lightweight sidekick work.
Micro Models
Qwen3.5 0.8B / GGUF / Q5_K_M
Classification, keyword routing, binary decisions, triage.
Gemma 4 E2B-it / GGUF / Q4_K_M
Lightweight chat, quick Q and A, summaries, tiny agents.
My personal choice for a single model is Qwen3.5 9B
For two models use Qwen3.5 9B + Qwen2.5 Coder 7B for code, or Qwen3.5 9B + Phi-3.5 Mini for support tasks.
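If you want to sanity check a pairing before downloading, here is a rough back-of-the-envelope sketch. Everything in it is an assumption for illustration: the bits-per-weight values are ballpark figures for these quant types, the parameter counts are the nominal ones from the model names, and the 6 GB reserve for the OS, KV cache, and other apps is a guess you should tune for your own machine.

```python
# Approximate bits per weight (ballpark assumptions, they vary by architecture).
BITS_PER_WEIGHT = {"Q4_K_M": 4.85, "Q5_K_M": 5.7}


def weights_gb(params_billion: float, quant: str) -> float:
    """Rough size of the quantized weights in GB (1 GB = 1e9 bytes)."""
    return params_billion * BITS_PER_WEIGHT[quant] / 8


def fits(models: list[tuple[float, str]], ram_gb: float = 16.0,
         reserved_gb: float = 6.0) -> tuple[bool, float]:
    """Return (fits?, total weight size in GB) for models loaded together.

    reserved_gb is an assumed allowance for the OS, KV cache, and other apps.
    """
    total = sum(weights_gb(params, quant) for params, quant in models)
    return total <= ram_gb - reserved_gb, total


if __name__ == "__main__":
    # Nominal sizes: a 9B and a 7B model, both at Q4_K_M.
    ok, total = fits([(9, "Q4_K_M"), (7, "Q4_K_M")])
    print(f"fits={ok}, weights ~{total:.1f} GB")
```

This only counts weights. Long contexts grow the KV cache, so a pairing that fits on paper can still get tight in practice.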
Let me know in the comments your experience with these models, or any I have left out.
@FundamentEdge Where does the deep context document live? Claude.md is read at the start of every session, so you might be burning a lot more tokens than you need to. I’m building something similar - we should talk!
This is the M&A diligence version of @karpathy's "LLM Wiki" pattern. Every run enriches a persistent knowledge base. Entity profiles, contradictions, finding lineage, ontology and LLM-enriched markdown. With strong guardrails, verifications and source citations.
@cfo_mm You're on to something. Can you have the current technical analyst debate a new analyst that is informed by congressional PTRs and pending legislation? I think you'll see good (but infrequent) buys and generate killer content either way.
plan mode sucks, across all coding agents
I wrote down my workflow for working with claude code:
- plan in a dedicated doc
- annotate the doc
- iteratively work with claude with a persistent artifact that doesn't get compacted
https://t.co/0Zz68i1M68
Summarizing data rooms shouldn’t cost $10,000.
If you work in PE/VC, I’m sure you’ve been pitched an automated AI due diligence solution that promises massive time savings. Unfortunately, most products fail as soon as the demo is over and have questionable security/privacy practices.
We built our own RAG application that beats ChatGPT at summarizing data rooms at a fraction of the cost after taking Systematically Improving RAG Applications by @jxnlco, a Waterloo alum who built AI content moderation products at Meta.
The secret? OpenAI’s models routinely struggle with financial tables, especially when they are embedded as images. By separately transforming these tables, you can dramatically improve summarization and financial analysis. This was one of the many insights we learned from Jason during the course, which also covers advanced multi-modal handling, LLM-based data extraction, and fine-tuning techniques.
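To make "separately transforming these tables" concrete, here is a minimal sketch of one way it could work. It is an illustration under assumptions, not our production pipeline or the course's method: it assumes the table cells have already been pulled out of the image (the hard part, not shown), and `table_to_markdown` and `build_prompt` are hypothetical helper names. Each table is rendered as markdown text and passed to the model alongside the narrative instead of being left embedded as an image.

```python
def table_to_markdown(header: list[str], rows: list[list[str]]) -> str:
    """Render already-extracted table cells as a markdown table."""
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    for row in rows:
        lines.append("| " + " | ".join(row) + " |")
    return "\n".join(lines)


def build_prompt(narrative: str, tables_md: list[str]) -> str:
    """Combine the narrative text with separately rendered tables."""
    parts = [
        "Summarize the document below. Tables are provided separately as markdown.",
        "",
        narrative,
    ]
    for i, table in enumerate(tables_md, start=1):
        parts += ["", f"Table {i}:", table]
    return "\n".join(parts)


if __name__ == "__main__":
    # Made-up figures, purely for illustration.
    md = table_to_markdown(["Year", "Revenue ($M)"], [["2023", "10.5"], ["2024", "12.1"]])
    print(build_prompt("Revenue grew year over year.", [md]))
```

Giving the model clean text rows means it never has to read numbers off pixels, which is where the errors tend to creep in.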
If you want to build AI tools that automate your most painful tasks, this course is a must. Next cohort runs Sep 16 – Oct 23, 2025. Here’s the course link with 20% off if you’re interested: https://t.co/aDKhKEOeWe