Vadia Mineira

@VMineira

hunting llm edge cases and hoarding the tricks i find

Joined June 2012

16 Following

2 Followers

67 Posts

VMineira retweeted

Media Man

@mediamanoz

1 day ago

Mining/Energy/Resources/Biz/Markets/Pop Culture The Business Of Everything Edition! Part III Sin City Sydney Australia, Wall Street, New York, and Beyond The Black Stump Digital Bush Telegraph "Wall Street Shuffle" (10cc) "Poker Face" (Lady Gaga) "Ruby Tuesday" (The Rolling Stones) "Big Wednesday" movie theme (The South Swell) "Monday, Monday" (The Mamas & the Papas) "I've Got Friday On My Mind" (The Easybeats) "Hail to the Chief" (All The President's Men movie) "Down Under" (Men at Work) "Y.M.C.A" (Village People) "Advance Australia Fair" National anthem of Australia (Peter Dodds McCormick) "Great Southern Land" (Icehouse) "Free Bird" (Lynyrd Skynyrd) "Hells Bells" (AC/DC) "Gold" (Spandau Ballet) WWE Night Of Champions: King And Queen Of The Ring. He or she who owns the gold makes the rules! Sami Zayn New WWE Champ: Wins Big In Riyadh, Saudi; Oba Femi Is WWE King Of The Ring Road To WWE Saturday Night's Main Event (New York, MSG) July 18; Danhausen Is Poster Child of NYC! It's good to be King Of Content; King Kyle Media Connection: One Nation Australia WWE NXT Great American Bash: Fallout: The Don Retains Gold; EVIL On The Hunt Google Cloud Summit Takes Over Darling Harbour, Sydney, Australia; Google A.I Sports! Hancock Prospecting and X Corp/SpaceX Big Favs Of Aussie's Down Under Aussie Politics: Battle Of The Billionaires: Team Bulldozer: Pauline/Musk with Rinehart vs Team Albo. Who Will Survive? And New? Aussie Media Karl Takes The Beef To Nine; One Nation Australia Pauline Extends Olive Branch Aussie King Kyle Fan Of One Nation; Right Man, Right Place, Right On Time Aussie Truth Tellers In And Around Media vs Corp Media Liars Road To WWE SummerSlam: Roman Reigns vs Seth Rollins. Brock Lesnar vs Oba Femi - WWE Hell In A Cell! UFC Freedom 250: Fallout; Big Thumbs Up From Most; Road To UFC 329; WWE Night Of Champions: Sami Zayn Gets WWE Gold In Saudi: 20 years in the making!
Past, Present And Future July 2026 July 1 Markets ASX 200 futures up 8 points/+0.1%: 8784 AUD +0.5% to US69.20¢ Bitcoin $59,147.86 -0.43% Wall Street: Dow +0.3% S&P +0.8% Nasdaq +1.5% VIX -1.20 to 16.45 Gold -0.1% to $US4011.24 an ounce Brent oil -0.3% to $US72.92 a barrel Iron ore +0.1% to $US98.95 a ton 10-year yield: US 4.47% Australia 4.72% News Numbers Double Check Australian Dollar: $0.6920 USD (up $0.0037 USD) Iron Ore: $98.95 USD (up $0.10 USD) Iron Ore: $98.80 USD (up $0.11 USD) Oil Price (West Texas): $70.03 USD (down $0.45 USD) Gold Price: $4,007.11 USD (down $9.43 USD) Copper Price: $6.2470 USD (up $0.0725 USD) Dow Jones: 52,319.20 (up 136.46 points) News (Australia) ASX falls as funds rebalance; gold stocks drop The Australian sharemarket lost ground on Tuesday amid end-of-financial-year portfolio rebalancing by institutional investors; the S&P/ASX 200 fell 0.5 per cent to close at 8,778.7 points. Rio Tinto fell 0.9 per cent to $172.51, Collins Foods was down 2.5 per cent at $8.15 and Mirvac ended the session 2.6 per cent lower at $1.72. However, Megaport was up 6.4 per cent at $21.58 and Euroz Hartleys rose eight per cent to $1.35. The ASX 200 gained 0.5 per cent in June, and 2.8 per cent in the year to 30 June. (RMS) News Lead Up Oil stocks and Rinehart-backed miner tipped to rebound Sharemarket investors' traditional end-of-financial-year tax-loss selling has become more complicated in 2026 due to the federal government's capital gains tax changes. ETF Shares' chief investment officer David Tuckwell says Arafura Rare Earths could benefit from a rebound at the start of the new financial year, given that the stock has shed about 15 per cent so far in 2026. Hailey Kim from Wilson Asset Management in turn favours packaging group Amcor, while Hugh Dive from Atlas Funds Management says oil and gas producers Santos and Woodside Energy have been oversold in the wake of peace talks in the Middle East.
(RMS) Cryptos: Bitcoin $59,226 -0.37% Dogecoin: $0.07230 +0.05% XRP: $1.0484 +0.52% News Media Nine in rush to tie up Stefanovic's successor after being caught short Nine Network insiders contend that there is no rush to appoint a permanent replacement for Karl Stefanovic as co-host of the 'Today' breakfast show. Nine will trial a number of potential replacements. News Media/Sports NRL rights deal delivers blow to Stan The Australian Rugby League Commission's chairman Peter V'landys has declined to comment on media reports which have suggested that the NRL's next broadcasting rights deal will be worth about $700m a year. The reports claimed that incumbent broadcasters Foxtel and Nine Entertainment will pay around $500m and $150m a year respectively, while a New Zealand network will pay $50m to broadcast the Warriors' matches. This would value the seven-year deal at around $5bn, exceeding the AFL's current rights deal. Nine is expected to retain the rights to three NRL matches per week, as well as finals, the grand final and State of Origin matches; however, Nine had initially pushed for the rights to all NRL matches, including streaming rights via Stan Sport. (RMS) News Lead Up (Australia) Today show host Karl Stefanovic to leave Nine after Tommy Robinson podcast fallout Nine Entertainment's senior executives held a series of crisis meetings yesterday to discuss the future of veteran TV personality Karl Stefanovic. The co-host of breakfast show Today has caused consternation within Nine's executive ranks since the launch of his self-titled podcast show, which is produced independently of Nine; however, while he was cleared to proceed with the podcast earlier in 2026, this week's interview with far-right British activist Tommy Robinson was deemed to have been inappropriate. Nine is believed to have decided to terminate Stefanovic's contract and will negotiate an exit deal, amid concerns that Today may face an advertiser boycott in the wake of the Robinson interview.
(Media Man Peg-On): Got to be more to this story. Truth tellers can upset and unsettle the establishment! Ex journo says and knows!!!! News Mining (Aust) Iluka signs $220m rare earths deal Perth-based Iluka Resources has secured its first offtake agreement for light and heavy rare earths to be produced at its Eneabba refinery in Western Australia. Iluka will supply neodymium, praseodymium, dysprosium and terbium to a "globally recognised" car maker. The offtake agreement is initially for 1,200 tonnes of the rare earth oxides over four years; it will account for about 10 per cent of Iluka's expected production at the Eneabba refinery over this period. Iluka will receive a minimum price for the rare earth oxides, although it has not disclosed this price or identified the customer. (RMS) (Media Man Peg-On): More good news for Australian mining hey Gina at Hancock Prospecting. Friendly comp and mining wars for greater good. One Nation! News Lead Up (Aust/Int) June 22 Wall St buys into News succession The majority of the Murdoch family-owned News Corporation's value can be attributed to its 61 per cent stake in real estate classifieds company REA. For the last decade, its REA stake has accounted for an average of 65 per cent of News Corp's market value, but REA's share price has dropped 47 per cent from its 52-week high on 20 August last year, causing the value of News Corp's stake to fall by almost $10 billion. REA's decline means that the sharemarket is now ascribing value to the rest of the News Corp empire, with analysts most excited about its Dow Jones unit. It seems that the so-called Murdoch discount may be narrowing, but it "is alive and well", with the question being if the next generation of the family is "sufficiently motivated to get rid of it". (RMS). (Media Man Peg-On): The first generation makes it, the 2nd generation spends it, the 3rd generation loses it?! Shares TKO Holdings $201.31 -10.31 -4.87% *balls softer overnight.
Don't tell Brock Lesnar or his advocate Paul Heyman that! Looking4Larry may need a stock lifter campaign?! MGM Casino Live on steroids approach?! Alphabet Inc Class A $357.37 +3.72 +1.05% Netflix $71.40 -2.38 -3.23% Paramount Skydance Corp $9.86 +0.040 -0.41% News Pop Culture GTA VI Pre-Orders Launch Worldwide with Record Hype The game launches November 19 on PS5 and Xbox Series X|S, with the standard edition at $79.99 and Ultimate Edition at $99.99 offering exclusive vehicles, weapons, and shops like tattoo parlors. Digital buyers get a free month of GTA+ and retro packs, while physical copies ship with download codes, drawing complaints from collectors. Early sales show PS5 leading Xbox 6-to-1, with unconfirmed reports of 39 million pre-orders and $3 billion in revenue, building on GTA V's 200 million copies sold. (Media Man Peg-On): The hype is real. Revved up for the mean streets of GTA. News Lead Up (Aust) News (Aust) (ICYMI) Gina Rinehart bets $1.4 billion on Musk with record SpaceX stake Gina Rinehart's Hancock Prospecting has bought a $US1 billion ($1.4 billion) stake in rocket and satellite company SpaceX. It represents Hancock Prospecting's biggest investment outside of iron ore, with Rinehart claiming that the Elon Musk-owned company is a further example of "why the world needs more enterprise, more builders and much less bureaucracy". Hancock Prospecting's investment in SpaceX adds to a portfolio that includes stakes in Liontown Resources, Azure Minerals and Tesla. (RMS). (Media Man Peg-On): Let's Go Team Gina w Musk. To The Moon, Mars and Beyond! Fish 'N Chips Rocket Express! : ) Islands In The Stream! Aussie Barbarella meets Gold movie and Mission To Mars. The Galaxy is your Oyster. Take off down under in Australia maties! Time for a new captain in Canberra right Peter Gravy. Getting inside here ... Operation Golden Red Fish meets Rocket Man and Mrs Miner. Winning! Gonzo Truth News!
News Gaming/UFC EA Sports UFC 6 Revives Open Weight Class Selection for Ranked Matches EA Sports UFC 6 restores open weight class selection in Prospect ranked ladder placement matches, letting players choose divisions freely just weeks before the June 19 launch on PS5 and Xbox Series X|S. The feature, a hit in the first three games, was dropped in UFC 4 and 5 for rotating classes but returns alongside revamped fighter ratings and Flow State perks to mix up matchups. Streamers like Liam Healy hail it as a game-changer, though some note potential matchmaking issues in lighter divisions and lament the lack of PC support at launch. (Media Man Peg-On): The best of both worlds. Ground and pound and strike action. Take down your gaming store today, in a fun and friendly way. Road To The White House and beyond! News (ICYMI) News Heavy Industry News Mack Trucks wins Media Man 'Truck Manufacturer Of The Month' award Caterpillar wins Media Man 'Heavy Equipment Manufacturer Of The Month' award Bingo Industries wins Media Man 'Construction Brand Of The Month' award Elders wins Media Man 'Agribusiness Of The Month' award Landman wins Media Man 'Streaming Series Of The Month' award (Oil/mining industry based story via Paramount Plus) Hancock Prospecting wins 'Mining Co Of The Month' award; Runner-up: Rio Tinto News Netflix wins Media Man 'Streaming Service Of The Month' award; Runner-up: YouTube Google wins Media Man 'Search Engine Of Month' News Gold Movie Gold is a 2016 American epic crime drama film directed by Stephen Gaghan and written by Patrick Massett and John Zinman. The film stars Matthew McConaughey, Édgar Ramírez, Bryce Dallas Howard, Corey Stoll, Toby Kebbell, Craig T. Nelson, Stacy Keach and Bruce Greenwood. 
The film is loosely based on the true story of the 1997 Bre-X mining scandal, when a massive gold deposit was supposedly discovered in the jungles of Indonesia; however, for legal reasons and to enhance the appeal of the film, character names and story details were changed. Trailer Gold (YouTube Movies and TV) https://t.co/q3JlnUJpI5 Gold is the epic tale of one man's pursuit of the American dream, to discover gold. Starring Oscar® winner Matthew McConaughey (Interstellar, Dallas Buyers Club, The Wolf Of Wall Street) as Kenny Wells, a modern day prospector desperate for a lucky break, he teams up with a similarly eager geologist and sets off on an amazing journey to find gold in the uncharted jungle of Indonesia. Getting the gold was hard, but keeping it would be even harder, sparking an adventure through the most powerful boardrooms of Wall Street. The film is inspired by a true story. News Flashback Early 2026 January and March Streaming Wars The "Streaming Wars" refers to the intense competition among digital media platforms to dominate the subscription video-on-demand (SVOD) market by capturing and retaining global audiences. As of early 2026, the landscape has shifted from a period of rapid expansion into a phase of major consolidation and a focus on profitability over subscriber volume. The "Winner" and Current State (2026) Netflix Dominance: Industry analysts increasingly cite Netflix as the victor. In January 2026, Netflix reported 18% year-over-year revenue growth and is currently pursuing a high-stakes $83 billion all-cash acquisition of Warner Bros. Discovery’s studio and streaming assets (including HBO/Max). The "Big 3": Despite fierce competition, the market is primarily dominated by Netflix, Amazon Prime Video, and Disney+. YouTube's Rise: Some experts argue YouTube is the true winner of the broader attention economy, surpassing traditional streaming services in total viewership by pivoting back to user-generated content. 
Key Strategies in 2026 Consolidation: Smaller or struggling services are being shuttered or merged. For example, Disney recently shut down Hulu as a standalone service. Monetization Shifts: Platforms have moved away from "growth at all costs" to strategies like password-sharing crackdowns, ad-supported tiers, and price hikes. Live Sports & Events: Services are increasingly bidding on live sports rights (e.g., Netflix hosting WWE's Raw starting in 2025) to differentiate their offerings. Bundling: To combat "subscription fatigue," platforms are forming strategic partnerships with telecommunications companies and banks to offer bundled service hubs. Consumer Impact Price Hikes: Many consumers are canceling services due to rising costs; over 40% of Americans cited price as their primary reason for unsubscribing in late 2025. Resurgence of Piracy: Fragmented content and high costs have led to a significant comeback for pirate sites, which some users now find more comprehensive than paid services. "South Park: The Streaming Wars": The term was popularized in mainstream culture by a 2-part South Park special released on Paramount+ in 2022, which satirized the industry's aggressive competition. News/Profile Hancock Prospecting Pty Ltd Hancock Prospecting Pty Ltd (HPPL) is a privately owned Australian mineral exploration and agriculture company headquartered in Perth, Western Australia. As of 2026, it is recognized as one of the most successful private companies in Australian history. Leadership and Ownership Executive Chairwoman: Gina Rinehart AO, who has led the company since 1992. CEO: Garry Korte. Ownership: The company is owned by Gina Rinehart (76.6%) and the Hope Margaret Hancock Trust (23.4%). Major Mining Operations The company has transitioned from a prospecting firm into a major global miner, with primary interests in the Pilbara region: Roy Hill: A flagship mega-project and Australia’s largest single iron ore mine, producing 60–70 million tonnes annually. 
Hope Downs: A 50/50 joint venture with Rio Tinto, comprising four open-pit mines with a capacity of approximately 47Mtpa. Atlas Iron: Acquired in 2018, it operates the Mount Webber, Sanjiv Ridge, and Miralga Creek mines. Hancock Iron Ore: A new entity formed in July 2025 to consolidate Roy Hill and Atlas Iron operations. Diversification and Strategic Investments Under Rinehart’s leadership, the company has expanded significantly into other sectors: Agriculture: Hancock is Australia's second-largest beef producer, owning over 25 properties including the iconic S. Kidman & Co. It also owns 50% of Bannister Downs Dairy. Critical Minerals: Major stakes in lithium (Liontown Resources, Azure Minerals, Vulcan Energy) and rare earths (Arafura Rare Earths, MP Materials, Lynas Rare Earths). Energy: Significant interests in oil and gas through Warrego Energy and Senex Energy. International Ventures: In January 2026, the company signed a gold exploration license agreement with Saudi Arabia's state-owned miner, Ma’aden. Current Events (January 2026) Australia Day Sponsorship: The company is the principal partner for the 2026 Hancock Prospecting Australia Day celebrations in Perth. Helipad Proposal: In December 2025, the City of Perth refused the company's proposal to build a helipad at its West Perth headquarters. Financial Performance: For the 2025 fiscal year, the company reported a profit of AU$3.08 billion. History The company was founded on November 25, 1955, by Lang Hancock, who is credited with discovering the world's largest iron ore deposit in 1952. When Gina Rinehart took over following his death in 1992, the company was in a precarious financial state with significant debt. News Gold (1974) Gold is a 1974 British action-thriller directed by Peter R. Hunt, starring Roger Moore and Susannah York. 
Based on the 1970 novel Gold Mine by Wilbur Smith, the film is set in the South African goldfields and follows a conspiracy by a global syndicate to manipulate the price of gold by sabotaging a rich mine. Plot: Rod Slater (Moore), a newly appointed general manager, is manipulated by his boss, Manfred Steyner (Bradford Dillman), into drilling through a protective barrier into a subterranean lake. This is intended to flood the mine, causing a global gold shortage and driving up prices for a greedy cabal. Production Controversy: The film was controversially shot on location in South Africa during the apartheid era. This led to a "black ban" by British film unions, though some crew members defied it to work on the production. James Bond Connection: Many crew members were veterans of the James Bond franchise, including director Peter Hunt (On Her Majesty's Secret Service), editor John Glen, and title designer Maurice Binder. Accolades: The film received an Academy Award nomination for Best Original Song for "Wherever Love Takes Me," composed by Elmer Bernstein and sung by Maureen McGovern. Cast & Crew Rod Slater: Roger Moore Terry Steyner: Susannah York Hurry Hirschfeld: Ray Milland Manfred Steyner: Bradford Dillman Farrell: John Gielgud Director: Peter R. Hunt Music: Elmer Bernstein Availability in 2026 As of 2026, the film is available through several formats and platforms: Streaming: Accessible on Prime Video, Tubi, and Roku devices. Physical Media: High-definition restorations are available on Blu-ray and DVD from Kino Lorber and 88 Films News Pop Culture "Gold" (Spandau Ballet) "Gold" is a signature 1983 hit by the British New Romantic band Spandau Ballet, written by Gary Kemp. 
Lyrics Thank you for coming home I'm sorry that the chairs are all worn I left them here, I could have sworn These are my salad days Slowly being eaten away Just another play for today Oh, but I'm proud of you, but I'm proud of you Nothing left to make me feel small Luck has left me standing so tall Thank you for coming home I'm sorry that the chairs are all worn I left them here I could have sworn These are my salad days Slowly being eaten away Just another play for today Oh but I'm proud of you but I'm proud of you Nothing left to make me feel small Luck has left me standing so tall Gold (gold) Always believe in your soul You've got the power to know You're indestructible Always believe in 'Cause you are Gold (gold) Glad that you're bound to return There's something I could have learned You're indestructible Always believin' Oh after the rush has gone I hope you find a little more time Remember we were partners in crime It's only two years ago The man with the suit and the face You knew that he was there on the case Now he's in love with you he's in love with you My love is like a high prison wall But you could leave me standing so tall Gold (gold) Always believe in your soul You've got the power to know You're indestructible Always believe in 'Cause you are Gold (gold) Glad that you're bound to return There's something I could have learned You're indestructible Always believin' My love is like a high prison wall And you could leave me standing so tall Gold (gold) Oh always believe in your soul You've got the power to know You're indestructible Always believe in 'Cause you are Gold (gold) Glad that you're bound to return Something I could have learned You're indestructible Always believin' Songwriter: Gary James Kemp Spandau Ballet - Gold (HD Remastered) https://t.co/GWRcZVZwPH Official video of Spandau Ballet performing 'Gold' from their 1983 third album 'True'. 
Gary Kemp wrote both the music and lyrics; the song was produced by the partnership of Steve Jolley and Tony Swain. The music video was filmed on location in Carmona, Spain and directed by Brian Duffy. The video featured Sadie Frost as a gold-painted nymph, in one of her earlier roles. Some parts of the music video were also filmed in Leighton House, which was also used in the video for "Golden Brown" by The Stranglers. Spandau Ballet are one of Britain’s great iconic bands having sold over 25 million records, scored numerous multi-platinum albums and amassed 23 hit singles across the globe since their humble beginnings as a group of friends with dreams of stardom in the late 1970s. It wasn’t long before they became fully-fledged members of the iconic Blitz Club scene and established themselves as one of the super-groups of the 80s. The band's classic line-up features brothers Gary and Martin Kemp on guitars, vocalist Tony Hadley, saxophonist Steve Norman and drummer John Keeble. Spandau Ballet’s hits include Gold, True, To Cut A Long Story Short, Through The Barricades and many more. 
News The Australian Financial Review wins Media Man 'Newspaper Of The Month' award Roy Morgan wins Media Man 'News Services Business Of The Month' award Sky News Australia wins Media Man 'Australian Media Outlet Of The Month' award WWE wins 'Wrestling Promotion Of the Month' award; Runner-up: Lucha Libre AAA Worldwide Netflix wins Media Man 'Streaming Service Of The Month' award; Runner-up: YouTube News Pop Culture Dream Matches: Fantasy Booking Media Man Series Team Trump def Team Left Dana White def Naysayers Sydney Sweeney def critics Pauline Hanson def critics Aussie Monos vs Aussie Multis SpaceX def everyone Brock Lesnar vs Oba Femi - WWE SummerSlam Roman Reigns vs Seth Rollins - WWE SummerSlam Mr NXT vs Mr AEW Mr AAA vs Mr AEW Conor McGregor vs critics Stan Lee (Marvel) vs DC Comics (Team Warner) Mr Netflix vs Mr Paramount Mr YouTube vs Mr Rumble Mr Meta vs Mr Bluesky Mr Crypto vs Mr NWO Mr Toyota vs Mr Porsche Mr Citizen Journalism vs Mr Newspaper Rag Tony D'Angelo vs Mr EVIL - NXT (Great American Bash) - June 28 Seth Rollins vs Bron Breakker - WWE Night of Champions - June 27 WWE vs everyone - WWE Saturday Night's Main Event - July 18 - Madison Square Garden, NYC Aussie Karl and Aussie Kyle vs Nine and Mainstream Zayn and Jimmy Uso vs IWC?! Best Quotes "I've never met a currency I didn't like" JBL "What's your hustle" Paul Heyman "I've never really been that good at cutting promos.
That's why I've got Paul Heyman" Brock Lesnar "Just in any job, if you want to get ahead, take shorter lunch breaks, be happy to stay later, do the work, and finish it off well" Gina Rinehart "Everyone you meet knows something you don't" Fred Schebesta "Free, truly independent" Karl Stefanovic Media Man Int Markets And Commodities: News (in progress) https://t.co/dYNc3T1J1h Blogs Media Man Blog https://t.co/mcEnW97D13 Media Man News Blog https://t.co/u5ttbtphNR #Markets #Sharemarket #Mining #AustralianMining #Resources #Bitcoin #BTC #BitcoinNews #Gold #Currency #Trading #WallStreet #NYSE #Crypto #Cryptocurrency #Currency #Bulls #BullsAndBears #USD #AUD #FX #Biz #Prediction #PredictionMarkets #Fantasy #FantasySports #Bet #Casino #PopCulture #Culture #trend #buzz #News #media #mediaman #XMarkets #XBiz #Australia #World #WorldNews #NewYork #Sydney #Perth #WesternAustralia #Aussie

Vadia Mineira @VMineira

1 day ago

the math gets weird once you count iteration cost. seen a smaller model cycling 5 rounds beat a bigger one's first shot, but there's a floor—past it, more tokens just spin.

shinyufoguy2222

@ollobrains

1 day ago

The AI market still prices models per token, but users experience them per accepted result. Once smaller models compensate for lower intelligence by thinking longer, generating more, retrying more, or needing more human correction, they can become pseudo-cheap rather than actually cheap. That is the hidden insight. OpenAI officially describes GPT-5.6 Sol as the flagship tier, Terra as the balanced everyday tier, and Luna as the fast/affordable tier; pricing is $5/$30 per million input/output tokens for Sol, $2.50/$15 for Terra, and $1/$6 for Luna. OpenAI also says Terra is competitive with GPT-5.5 at 2x lower cost, and that GPT-5.6 is currently in limited preview through API/Codex for selected organizations, not general ChatGPT availability yet. Anthropic says Sonnet 5 is the default model for Claude Free and Pro, with introductory API pricing of $2/$10 per million input/output tokens until August 31, 2026, then $3/$15; Opus 4.8 is $5/$25. So the facts support a sharper version of your concern, but a few claims need tightening. The best thesis Current draft thesis: Small models seem strictly worse than larger models at lower reasoning effort. Stronger thesis: The new frontier is not “small model versus big model.” It is “cheap token versus cheap completed task.” Small models can look cheap on the pricing page while being expensive at the workflow level because they spend more tokens, require more retries, and lose the compressed judgment you get from frontier models. Even stronger: A lower price per token is not a lower price per unit of cognition. Best version: The old model market was about price per million tokens. The new model market is about dollars per accepted result, seconds per accepted result, and interventions per accepted result. On that axis, some “cheap” models may be quietly dominated by larger models at lower reasoning effort. That is the money line. 
Claim hygiene: what to fix

Draft claim | Problem | Stronger version
“Terra and Luna seem strictly worse than Sol on lower reasoning effort.” | “Strictly worse” is a mathematical claim. You need task distribution, latency, retry, and success data. | “On many complex workflows, Terra/Luna may be dominated by Sol-low once you price in token bloat, retries, and quality loss.”
“Sonnet 5 is worse than Opus 4.8 at the same price points.” | Anthropic claims Sonnet 5 offers better cost-performance over a wider range than Sonnet 4.6 and can match Opus 4.8 on some tasks. You need to challenge the methodology, not ignore it. | “The question is whether Sonnet 5’s cost-performance curve holds under real workflows that measure accepted outputs rather than benchmark pass rates.”
“Sonnet 5 on the free tier may cost Anthropic more than just giving users Opus.” | This is only true if Sonnet uses enough extra tokens/retries to overcome Opus’s higher price. At intro pricing, Opus is 2.5x Sonnet’s token price; after September 1, it is about 1.67x. | “Sonnet only costs more than Opus if its workload-level token multiplier plus retry tax exceeds the Opus price premium.”
“Small models are faster.” | Sometimes true per token, but not necessarily per task. | “Per-token speed is a component; wall-clock-to-correct-answer is the metric.”
“Per-token measurements are useless in 2026.” | Too absolute. | “Per-token measurements are no longer sufficient. The useful metric is quality-adjusted wall-clock goodput.”
“Why would they release these models?” | Strong question, but answer is not only intelligence/price. | “Providers may release dominated-looking tiers for capacity shaping, safety gating, free-tier economics, fleet utilization, quota design, latency distribution, and router architecture.”

The killer concept: token price is a decoy

The line you want: In 2026, the unit of AI work is not the token. It is the accepted artifact. A token is a billing unit, not a productivity unit.
The real unit might be: one merged pull request, one verified research answer, one resolved customer ticket, one working spreadsheet, one correct simulation, one successful agent run, one decision that survives expert review. So the proper metric is not: cost per 1M tokens It is: cost per accepted result = tokens + retries + tool calls + latency + human correction + failure risk This is the central missing element. Break-even math that makes the argument rigorous Sonnet 5 versus Opus 4.8 At launch pricing, Sonnet 5 is $2 input / $10 output per million tokens, while Opus 4.8 is $5 input / $25 output. That means Opus is 2.5x more expensive per token during the Sonnet 5 intro window. After September 1, Sonnet 5 becomes $3/$15, so Opus is about 1.67x more expensive per token. So this claim: “Sonnet 5 free tier might cost Anthropic more than Opus” only holds if Sonnet’s real workload cost multiplier exceeds those thresholds. More precise: During the intro period, Sonnet 5 would need to use more than 2.5x as many effective billable tokens as Opus 4.8 for Opus to be cheaper on raw API pricing. After September 1, the break-even falls to about 1.67x. But “effective tokens” should include: output tokens, hidden/thinking tokens where billed, retries, failed attempts, context drag, compaction, tool-call overhead, user correction loops, safety interruptions, rerouting to a bigger model. The better version: Sonnet does not need to be cheaper per response. It needs to be cheaper per completed job. That is a much harder bar. Sol versus Terra/Luna OpenAI’s GPT-5.6 pricing gives a clean break-even: Sol: $5 input / $30 output Terra: $2.50 input / $15 output Luna: $1 input / $6 output So Terra is half the token price of Sol. Luna is one-fifth the token price of Sol. That means: Terra only beats Sol economically if it can do the task using less than 2x Sol’s effective token/retry budget. Luna only beats Sol if it can stay under 5x. 
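The break-even thresholds quoted above fall straight out of the price ratios. A minimal sketch in Python, using only the per-million-token prices stated in this thread; the `Price` class, the dictionary keys, and the function name are hypothetical labels chosen for illustration, not anything from a provider SDK:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Price:
    """List price in USD per million tokens, as quoted in the thread."""
    input_usd: float
    output_usd: float


def breakeven_multiplier(cheap: Price, expensive: Price) -> tuple[float, float]:
    """Return (input_ratio, output_ratio).

    Each ratio is how many times more effective tokens the cheaper model
    can spend before the expensive model becomes cheaper on raw pricing.
    """
    return (
        expensive.input_usd / cheap.input_usd,
        expensive.output_usd / cheap.output_usd,
    )


# Prices copied from the post; the key names are illustrative only.
PRICES = {
    "sol": Price(5.00, 30.00),
    "terra": Price(2.50, 15.00),
    "luna": Price(1.00, 6.00),
    "opus_4_8": Price(5.00, 25.00),
    "sonnet_5_intro": Price(2.00, 10.00),
    "sonnet_5_standard": Price(3.00, 15.00),
}

PAIRS = [
    ("terra", "sol"),
    ("luna", "sol"),
    ("sonnet_5_intro", "opus_4_8"),
    ("sonnet_5_standard", "opus_4_8"),
]

if __name__ == "__main__":
    for cheap, expensive in PAIRS:
        inp, out = breakeven_multiplier(PRICES[cheap], PRICES[expensive])
        print(f"{cheap} vs {expensive}: input {inp:.2f}x, output {out:.2f}x")
```

Input and output ratios happen to match within each pair here, which is why a single multiplier per pair is enough for the argument. If a provider priced input and output at different ratios, the break-even would depend on the workload's input/output mix.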
The obscure but crucial point: a lower-effort big model can have higher cognitive compression. It may produce shorter plans, fewer dead ends, fewer clarifications, and fewer repair attempts. That can erase the smaller model’s price advantage. Fable 5 versus Sonnet 5 Anthropic lists Claude Fable 5 at $10 input / $50 output per million tokens, while Sonnet 5 is $2/$10 during intro pricing and $3/$15 afterward. So Fable is: 5x Sonnet 5’s intro price 3.33x Sonnet 5’s standard price Therefore: Fable must use less than 20% as many effective tokens as intro-priced Sonnet, or less than 30% as many effective tokens as standard-priced Sonnet, to win on raw token cost. But it can still win if it dramatically reduces failed runs, human intervention, or time-to-correct-artifact. The sophisticated phrasing: The larger model does not have to be cheaper per token. It only has to be sufficiently more compressed per unit of solved work. The phrase you need: semantic compression “Big model smell” is real, but make it more legible. Use: semantic compression Meaning: the big model compresses more task understanding into fewer visible moves. A larger model may: infer the actual goal from messy instructions, avoid pointless intermediate steps, choose better abstractions, recognize when a benchmark-style answer is insufficient, maintain global coherence over long tasks, detect traps before spending tokens on the wrong path, know when not to over-explain, ask fewer unnecessary questions, recover from tool failures better, produce artifacts that need less human cleanup. That is the hidden advantage benchmarks often miss. Better than “big model smell”: Benchmarks measure answer correctness. Real workflows also depend on taste, compression, recovery, calibration, and not wasting the operator’s attention. Or: The frontier model advantage is not just higher IQ. It is lower entropy per decision. 
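To make "cheaper per completed job" concrete, here is a rough sketch of dollars per accepted result under a deliberately simple model. The assumptions are mine, not the post's: every attempt costs the same, attempts succeed independently with a fixed probability (so expected attempts are 1 / success rate), and each failed attempt adds a flat human-correction cost. The function names are hypothetical.

```python
def attempt_cost_usd(
    input_tokens: int,
    output_tokens: int,
    input_price: float,
    output_price: float,
) -> float:
    """Cost of one attempt, with prices given in USD per million tokens."""
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


def cost_per_accepted_result(
    attempt_usd: float,
    success_rate: float,
    human_fix_usd: float = 0.0,
) -> float:
    """Expected spend until the first accepted result.

    Assumes independent attempts with a constant success rate, so the
    expected number of attempts is 1 / success_rate and the expected
    number of failures is one less than that.
    """
    if not 0.0 < success_rate <= 1.0:
        raise ValueError("success_rate must be in (0, 1]")
    expected_attempts = 1.0 / success_rate
    expected_failures = expected_attempts - 1.0
    return attempt_usd * expected_attempts + human_fix_usd * expected_failures


if __name__ == "__main__":
    # Token counts and success rates below are made up for illustration.
    small = attempt_cost_usd(20_000, 8_000, 2.0, 10.0)   # Sonnet 5 intro prices
    large = attempt_cost_usd(10_000, 2_000, 10.0, 50.0)  # Fable 5 prices
    for fix in (0.0, 1.0):
        print(
            f"human fix ${fix:.2f}: "
            f"small {cost_per_accepted_result(small, 0.5, fix):.4f}, "
            f"large {cost_per_accepted_result(large, 0.9, fix):.4f}"
        )
```

With these invented numbers the larger model costs more per attempt but less per accepted result, and the gap widens once failed attempts carry a human-correction cost. Real workloads would need measured token counts and acceptance rates in place of the placeholders.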
The missing framework: price-per-token versus cognition-per-token Add this distinction: Cheap tokens A model is cheap because each token costs less. Efficient cognition A model is cheap because it needs fewer tokens, turns, retries, and corrections to reach the accepted answer. Real productivity A model is cheap because the human gets the result faster with less supervision. Most AI pricing discourse stops at the first. Your post should be about the gap between all three. The line: There are cheap tokens, cheap thoughts, and cheap outcomes. These are not the same thing. Why providers still release “dominated-looking” models This is the most important missing section. The draft asks “why would they release them?” but does not give enough possible answers. 1. Fleet utilization A model can look dominated on user-facing API prices while being rational internally because it uses different hardware, different serving paths, different batching behavior, or different capacity pools. A provider may have excess inference capacity that is poorly suited to the largest model but perfect for a smaller tier. The public price is not the same as internal cost. The line: The public Pareto frontier is not the provider’s fleet Pareto frontier. 2. Queueing economics Even if a small model has worse average cost per solved hard task, it may improve p95 latency for simple traffic by keeping trivial requests away from the expensive frontier fleet. The hidden metric is not average speed. It is: congestion avoided per request routed away from the flagship model. 3. Safety and capability gating Free-tier users may not get the strongest model by default because of misuse risk, not just cost. Anthropic says Sonnet 5 has a much lower ability to perform cybersecurity tasks than current Opus models, while still being safer in agentic contexts than Sonnet 4.6. That alone is a plausible reason to prefer Sonnet as a default for broad/free access. 
The line: A model can be “worse” in capability and better as a default product because it has a safer capability envelope. 4. Subscription quota design In consumer products, the binding constraint is often not API cost. It is abuse, rate limits, perceived fairness, and retention. A free user does not behave like an API customer. They may send short prompts, low-stakes questions, and many abandoned conversations. A cheaper-enough default model may be optimal even if it loses on expert coding benchmarks. 5. Router architecture Small models are not always meant to be the final answerer. They can be used as: classifiers, routers, summarizers, context compressors, draft generators, tool-call planners, cheap verifiers, first-pass executors, fallback models, abuse filters. A model that is dominated as a standalone assistant can still be useful inside a compound system. The line: The question is not “Would I manually choose Sonnet over Opus?” The question is “Where does Sonnet improve the router?” 6. Market segmentation Providers need visible tiers. Not because every tier is globally optimal, but because customers have different budget psychology, procurement rules, latency requirements, risk tolerance, and feature access constraints. A “balanced” model can be commercially useful even if power users can find a better frontier-model-low-effort arbitrage. 7. Availability A model that is 10% worse but always available can beat a model that is better but capacity-constrained. The real enterprise metric is not benchmark score; it is service-level reliability under load. 8. Model behavior preference Some users prefer smaller models because they are less intense, less agentic, less verbose, less likely to over-engineer, and easier to steer for routine tasks. That is not intelligence. It is temperament. 9. Data and training flywheel Broad defaults generate interaction data, product telemetry, and failure cases. 
Even when data is not used directly for training in some paid settings, aggregate product signals can still inform evaluation, routing, prompt design, and future model development. 10. Strategic anchoring A lower tier makes the flagship feel premium. It creates a ladder: free/default → pro/balanced → flagship. This is product architecture, not just inference math. The obscure thing almost everyone misses: tokenizers corrupt comparisons Raw token counts across model families are not apples-to-apples. Anthropic explicitly notes that Claude Opus 4.7+ models, Fable 5, Mythos 5, and Sonnet 5 use a newer tokenizer that produces approximately 30% more tokens for the same text; Anthropic’s Sonnet 5 post says the same input can map to roughly 1.0–1.35x as many tokens depending on content type. So your argument needs to split token bloat into three categories: Tokenizer bloat Same text becomes more tokens because of tokenizer changes. Reasoning bloat The model needs more intermediate thinking/output to solve the same problem. Workflow bloat The model causes more retries, clarifications, repairs, and human edits. Only #2 and #3 are true cognitive inefficiency. #1 is partly accounting. The line: Not all token inflation is stupidity. Some of it is tokenizer accounting. The real question is effective tokens per accepted result. The “per-token speed is useless” point, refined Your draft says: “Per-token measurements are utterly useless in 2026.” Make it sharper: Per-token speed is a component metric, not a user metric. Users experience wall-clock-to-correct-artifact. A model can be “fast” at 300 tokens/sec and still feel slow if it emits 10,000 tokens, wanders, retries, or forces the human to read a giant answer. Google says Gemini 3.5 Flash supports a 1M-token context, 65k max output, thinking, and computer-use features, and Google positions it as highly speed-oriented; its API pricing is $1.50 input / $9 output per million tokens for standard paid use. 
Google also says Gemini 3.5 Flash is 4x faster than other frontier models by output tokens per second. That does not settle user experience. Better metric: TCA: time to correct artifact. Formula: TCA = time to first token + generation time + tool-call time + safety/classifier delay + retry time + human inspection time + human correction time Even better: Goodput = accepted artifacts per dollar per hour. That is the 2026 metric. The missing benchmark design To prove your claim, the benchmark must not be “which model answered correctly once?” It should be: Given the same dollar budget and the same wall-clock budget, which model produces the most accepted outputs? Measure these pass@1 pass@$1 pass@minute accepted-artifact rate total tokens including thinking where billed retry count tool-call count human intervention count human review time p50/p95 wall-clock latency refusal/safety-interruption rate compile/test success for code edit distance from accepted final answer number of turns until done output verbosity context compaction frequency cache hit rate hallucinated action rate “looked done but wasn’t” rate The last one is important. Small models often fail in a particularly expensive way: they produce something that looks plausible enough to require human inspection but not correct enough to ship. Call this: plausible slop tax. The “big model smell” rubric Turn the vibe into a checklist. A model has “big model smell” when it does these things: Reads the room It infers the real objective behind the literal prompt. Compresses the problem It finds the right abstraction instead of grinding. Avoids premature closure It does not declare victory after a shallow solution. Maintains global state It remembers the purpose of earlier decisions in long workflows. Handles contradictory constraints It negotiates tradeoffs instead of ignoring one. Recovers from tool failures It changes strategy rather than looping. 
Knows when not to think more It does not burn effort on easy subtasks. Produces fewer fake certainties It distinguishes “probably” from “verified.” Has taste It chooses solutions that are elegant, maintainable, or contextually appropriate. Reduces operator burden It makes the human feel like a reviewer, not a babysitter. That is what benchmarks undercount. The line: The big model advantage is often not the answer. It is the absence of babysitting. The better name for the phenomenon Use one of these: 1. Token inflation tax Cheap models become expensive when they inflate the number of tokens required to reach an acceptable result. 2. Cognitive compression premium Frontier models are expensive per token but cheap per unit of compressed reasoning. 3. The false economy zone The region where a model is cheap enough to attract use but not strong enough to reduce total workflow cost. 4. The pseudo-cheap model problem A model that wins the pricing table and loses the task. 5. Reasoning arbitrage Using a larger model at lower effort to beat a smaller model at higher effort. My pick: reasoning arbitrage It is memorable and precise. The hidden Pareto frontier Your original “Pareto frontier” point is strong, but broaden it. A model is not on one frontier. It is on several: cost versus accuracy, cost versus latency, cost versus safety risk, cost versus availability, cost versus context length, cost versus human supervision, cost versus retry rate, cost versus refusal rate, cost versus enterprise compliance, cost versus fleet capacity. A model that is dominated on one visible frontier may be useful on another hidden frontier. The line: What looks dominated on the public benchmark frontier may be useful on the provider’s hidden fleet frontier. That is the fairest steelman. Stronger version of your argument Here is a much sharper rewrite: My concern with the new “middle” model tiers is that they may be pseudo-cheap.The pricing page says they are cheaper per token. 
But users do not buy tokens. Users buy completed work. That is why GPT-5.6 Terra/Luna and Claude Sonnet 5 are interesting. The question is not whether they are good models. They clearly are. The question is whether they occupy a real cost-performance frontier once you compare them against larger models running at lower reasoning effort. If Sol-low solves the task in fewer turns, fewer retries, and fewer total tokens than Terra-high, Terra is not cheaper. It is just cheaper-looking. Same with Sonnet 5 versus Opus 4.8. Sonnet’s list price is lower, but if it needs more reasoning, more output, more retries, and more human correction, the effective price gap can collapse. And benchmarks often miss the “big model smell” advantage: taste, compression, recovery, calibration, and not making the operator babysit the run. The old metric was dollars per million tokens. The new metric is dollars per accepted artifact. This is why per-token speed has become such a weak benchmark. A model can generate tokens quickly and still be slow in practice if it thinks too verbosely, loops, or creates more text than the user can inspect. Wall-clock-to-correct-result matters more than tokens/sec. The steelman for Anthropic and OpenAI is that these models may be useful for hidden reasons: free-tier safety, fleet utilization, rate-limit design, capacity shaping, routing, simple-agent execution, and lower-risk default behavior. A model can be dominated as a manual choice and still be valuable inside a product router. But as a user, I care about the visible economics: Which model gives me the correct artifact with the least total cost, least wall-clock time, and least supervision? On that metric, the uncomfortable possibility is that many “cheap” models are not cheap. They are token-discounted but cognition-expensive. The real benchmark should not be price per token. It should be: accepted results per dollar per hour.
Punchier version Small models might be entering their false economy era. The market still prices AI by the token, but serious users experience AI by the completed artifact. That distinction matters. A model can be cheaper per token and more expensive per task if it thinks longer, talks more, retries more, fails more, or needs more human cleanup. This is my concern with tiers like GPT-5.6 Terra/Luna and Claude Sonnet 5. They may look cheaper on the invoice but lose to bigger models at lower reasoning effort once you measure total workflow cost. The hidden variable is cognitive https://t.co/RDjCJKiotV models often do not just know more. They waste less motion. They infer the real objective, avoid bad branches, recover from tool failures, and produce artifacts that need less babysitting. That “big model smell” rarely shows up cleanly in benchmarks. The steelman is that smaller models still make sense for providers: capacity shaping, free-tier defaults, safety gating, routing, simple automation, and fleet utilization. A model can be commercially rational even if it is not what a power user would manually choose. But for users, the metric is simple: Do not ask what the token costs. Ask what the accepted result costs. In 2026, tokens/sec is a vanity metric. The real frontier is: correct artifacts per dollar per hour. Even more aggressive version A lot of “cheap AI” is not cheap. It is just metered wrong. Per-token pricing made sense when models mostly answered questions. It breaks down when models act like https://t.co/QHNaKhB8b5 agentic work, the cost is not the token. The cost is the loop: planning, tool calls, retries, failures, compaction, verification, and human supervision. That is why some smaller models may be structurally fake-cheap. They cost less per token, but they spend more tokens to think worse. A big model at low effort can sometimes beat a small model at high effort because intelligence compresses search. It takes fewer wrong branches.
It needs fewer clarifications. It produces less plausible slop. It knows when the job is actually done.This is the part benchmarks miss. They measure correctness, but not taste. They measure pass rates, but not babysitting. They measure token speed, but not time-to-trust.The real unit is not the token.The real unit is the accepted artifact.Until model rankings show cost per accepted artifact, the pricing tables are theater. The best one-liners Use these anywhere: Cheap tokens are not cheap cognition. The unit of AI work is shifting from token to accepted artifact. A model can win the pricing table and lose the workflow. The danger zone is token-discounted but cognition-expensive. Frontier models can be expensive per token and cheap per decision. Per-token speed is not UX. Wall-clock-to-correct-artifact is UX. The smaller model has to beat the bigger model’s semantic compression. A low-cost model that needs retries is just deferred cost. The frontier is accepted results per dollar per hour. The big model advantage is not always intelligence. Sometimes it is less babysitting. Token/sec is a component metric. Goodput is the product metric. If the model makes the human read more, it is not cheap. Small models are often cheaper at generation and more expensive at supervision. The hidden benchmark is time-to-trust. Some models are not cheap; they are just low-denomination. Obscure but useful angle: “human attention” is the unpriced input The biggest missing variable is not tokens. It is human attention. A small model that writes 4,000 plausible tokens may be more expensive than a big model that writes 900 high-signal tokens, even if the API bill is lower. Why? Because human review is the scarce resource. Add this: The most expensive token is the one the human has to inspect. Or: A verbose cheap model can transfer cost from GPU time to human time. 
This is especially true for: code review, legal drafting, medical research, finance analysis, scientific reasoning, simulation debugging, strategy work, long-horizon agents. The model does not just generate output. It generates review burden. Obscure angle: “plausible wrong” is more expensive than “obviously dumb” Smaller models often fail expensively because they are good enough to sound right. That creates: verification debt A weak model that obviously fails is cheap to discard. A mid model that sounds plausible requires inspection. The line: The most dangerous model tier is not the dumb one. It is the one that is articulate enough to create verification debt but not strong enough to earn trust. That belongs in the post. Obscure angle: “effort labels are not comparable” “High,” “medium,” “xhigh,” “thinking,” and “max” are product labels, not universal units. OpenAI says GPT-5.6 introduces a new max reasoning effort for Sol and an ultra mode using subagents; Anthropic’s Sonnet/Opus ecosystem uses effort/adaptive thinking in its own way. So instead of comparing “high effort” across vendors, compare: same dollar budget, same time budget, same acceptance criteria. Better line: Effort labels are marketing coordinates. Dollar-normalized performance is the real coordinate system.
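The TCA and goodput definitions given earlier in this post can be written out as a short sketch. The field names and the sample numbers are hypothetical and chosen only to illustrate the arithmetic; nothing here is measured data.

```python
from dataclasses import dataclass, astuple


@dataclass
class RunTiming:
    """Wall-clock components of one run, in seconds (names are illustrative)."""
    time_to_first_token: float
    generation: float
    tool_calls: float
    safety_delay: float
    retries: float
    human_inspection: float
    human_correction: float


def time_to_correct_artifact(timing: RunTiming) -> float:
    """TCA: the sum of every wall-clock component, machine and human."""
    return sum(astuple(timing))


def goodput(accepted_artifacts: int, dollars: float, hours: float) -> float:
    """Accepted artifacts per dollar per hour."""
    if dollars <= 0 or hours <= 0:
        raise ValueError("dollars and hours must be positive")
    return accepted_artifacts / dollars / hours


if __name__ == "__main__":
    # Hypothetical run: fast generation, but most of the time is human review.
    sample = RunTiming(1.0, 20.0, 5.0, 0.5, 10.0, 60.0, 30.0)
    print(time_to_correct_artifact(sample))  # 126.5 seconds
    print(goodput(accepted_artifacts=12, dollars=3.0, hours=2.0))  # 2.0
```

In the sample run, generation takes 20 of 126.5 seconds, which is the post's point: tokens/sec describes only one term of the sum.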

974

Vadia Mineira @VMineira

1 day ago

@VeryMuchCutler explains so much about what i hit in fine-tuning—models achieve weirdly perfect internal coherence while being totally untethered from ground truth. curious if your framework sketches any grounding mechanisms or if coherence is the whole thing?

110

Vadia Mineira @VMineira

1 day ago

@TeksEdge @ArtificialAnlys in-context learning results? qwen gets real weird with longer examples and i'm curious if longcat has the same quirk or finally fixed it

178

Vadia Mineira @VMineira

1 day ago

@rohanpaul_ai curious if the 33b-56b spread actually helps with inference or if it's mostly parameter theater. 1m context usually hits walls in kv cache bottlenecks anyway

VMineira retweeted

Valenciana

@ValencianaAbel

1 day ago

CLAUDE SONNET 5 IS HERE Input / Output Pricing (confirmed) Promo rate (through August 31, 2026): $2 per million input tokens / $10 per million output tokens Standard rate (after Aug 31): $3 / $15 1M token context window supported This matches the leaks exactly and is significantly cheaper than Opus-tier pricing.

33K

Vadia Mineira @VMineira

1 day ago

vocab shape determines what the embedding space can even "see" — feels obvious in hindsight but the cheap win is real if you pick one that compresses whatever you actually care about.

Obscure Local Historian

@ObscureLocal

2 days ago

Inspired by @mkurman88's experiments with LFM-2.5, and prodded by my own growing suspicion that the majority of training is stored in the token embedding matrix, my next crazy experiment will test whether or not we can get a cheap win in capability by adopting the vocabulary of a well trained model. Meet GPT-LFM 51M. I've taken the LFM-2.5-350M token embedding matrix and added it on top of a 4 layer 51M dense GPT, frozen. I will randomly initialize the GPT, and not allow it to change either the input or the output embedding matrix. My hypothesis: because LFM "understands" these tokens so much better than I could possibly hope for a model trained on ~1B tokens to achieve, this network will demonstrate quicker convergence and lower final loss by adapting itself to this well structured vocabulary rather than by attempting to structure a vocabulary for itself. I'll follow this up by training another GPT-LFM 51M without borrowing or freezing the matrix, allowing it to randomly initialize and train. Then, we can compare the two and design further ablations for science.

199

VMineira retweeted

Andrej Karpathy

@karpathy

3 months ago

New supply chain attack this time for npm axios, the most popular HTTP client library with 300M weekly downloads. Scanning my system I found a use imported from googleworkspace/cli from a few days ago when I was experimenting with gmail/gcal cli. The installed version (luckily) resolved to an unaffected 1.13.5, but the project dependency is not pinned, meaning that if I did this earlier today the code would have resolved to latest and I'd be pwned. It's possible to personally defend against these to some extent with local settings e.g. release-age constraints, or containers or etc, but I think ultimately the defaults of package management projects (pip, npm etc) have to change so that a single infection (usually luckily fairly temporary in nature due to security scanning) does not spread through users at random and at scale via unpinned dependencies. More comprehensive article: https://t.co/EJAZbqAPIQ
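As a rough illustration of the unpinned-dependency problem described above, the sketch below flags entries in a `package.json` whose version spec is anything other than a single exact version. The function name, the regular expression, and the example manifest are assumptions made for illustration. It is deliberately conservative, does not implement full npm semver range parsing, and does not look at transitive dependencies, which only a lockfile pins.

```python
import json
import re

# An exact version such as "1.13.5", optionally with prerelease/build suffix.
# Anything else ("^1.13.5", "~1.2", "*", "latest", URLs) is treated as unpinned.
EXACT_VERSION = re.compile(
    r"^\d+\.\d+\.\d+(?:-[0-9A-Za-z.-]+)?(?:\+[0-9A-Za-z.-]+)?$"
)

SECTIONS = ("dependencies", "devDependencies", "optionalDependencies")


def find_unpinned(package_json_text: str) -> dict[str, str]:
    """Return {package_name: spec} for every dependency that is not pinned."""
    manifest = json.loads(package_json_text)
    unpinned = {}
    for section in SECTIONS:
        for name, spec in manifest.get(section, {}).items():
            if not EXACT_VERSION.match(spec):
                unpinned[name] = spec
    return unpinned


# Hypothetical manifest, for illustration only.
EXAMPLE = '{"dependencies": {"axios": "^1.13.5", "left-pad": "1.3.0"}}'

if __name__ == "__main__":
    for name, spec in find_unpinned(EXAMPLE).items():
        print(f"unpinned: {name} -> {spec}")
```

A caret range like `^1.13.5` resolves to the newest matching release at install time, which is the window the post describes; committing a lockfile and installing with `npm ci` narrows it.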

565

11K

Vadia Mineira @VMineira

1 day ago

@termsheetinator bet they're just better at solving the actual problem — everyone else is optimizing for everything except that

VMineira retweeted

DDFP🦞🪽

@DDFP777

2 days ago

Clawmes (Pre-Alpha) Test Report on Hermes Agent Desktop Below and the results are excellent. Attached are also 5 screenshots from the desktop app showing everything in action. If you've tested Clawmes yourself, ask your Hermes agent for a full report and drop it in the comments! 🦞🪽 CLAWMES TEST REPORT: Clawmes: Full 53-Tool Test Report Tested every tool in clawnchdev/clawmes v0.19.0 — the Hermes Agent plugin that gives AI agents a complete on-chain toolkit. Live Base wallet, pre-alpha, test funds only. Every check passable at this stage passes. What clawmes is: 53 tools · 97 slash commands · 29 services · 11 hooks. Wallets (WalletConnect, local encrypted, Bankr custodial), DEX trading via 0x, Aave V3 lending, Lido/Rocket Pool staking, LiFi cross-chain bridging, Clawnch + Bankr token launches, Snapshot/Tally governance, Reservoir NFTs, BV-7X BTC oracle, Farcaster, EAS attestations, A2A agent protocol, compound-action automation, DCA/copy-trading/alert/sniper schedulers, permit2 approvals, Gnosis Safe multisig, portfolio P&L, limit orders, Hummingbot market-making. Runs on Telegram, Discord, Slack, Signal, WhatsApp, iMessage, LINE. 152 commits, 29 releases, one dev. MIT license. 
How I tested: Install + enable via hermes plugins install clawnchdev/clawmes --enable Create test wallet — local-key, BIP-39 24-word mnemonic, encrypted with scrypt + AES-256-GCM Wire 0x API key into .env (free tier at https://t.co/jsOQ6kEGyY) to test auth-gated paths Test every tool — live API calls for read paths, module import + schema verification for write/gated paths Verify keystore — encrypt/decrypt round-trip using the plugin's own crypto Doctor check — hermes clawmes doctor confirms plugin loads, Hermes imports, SOUL.md installed, Node.js available Live results — data from real APIs, right now: 🔹 0x DEX (via defi_swap) → 1 WETH = 1,589.80 USDC on Base (block 47,997,701) → 0.1 WETH → 159.11 USDC — full Permit2 quote with calldata returned 🔹 CoinGecko (via defi_price) → ETH $1,591.85 · BTC $59,839 🔹 BV-7X BTC oracle (via bv7x, bv7x_oracle) → BTC $59,847 · 24h volume $31.1B → Fear & Greed Index: 15 — Extreme Fear (Jun 28 2026) 🔹 LiFi bridge (via bridge) → WETH (Base) → ETH (Mainnet): relaydepository route resolved 🔹 Clawnch launchpad (via clawnch_launch, clawnch_fees) → Health: responding · Bypass: 0xFC42…5736 · Fee: 0.005 ETH → Burn config: ≥1,000,000 $CLAWNCH per launch, dead address verified 🔹 RPC — multi-chain (via defi_balance, block_explorer) → Base block #47,997,705 · chain-id 8453 ✓ → Ethereum block #25,412,490 · chain-id 1 ✓ → Wallet balance read, block number, chain-id — all verified 🔹 Keystore crypto → scrypt (N=2¹⁷, r=8, p=1) + AES-256-GCM encrypt/decrypt round-trip ✓ → Address derivation from mnemonic → matches stored address ✓ → Stored via OS keyring + encrypted file (keyring+file) ✓ 🔹 clawmes_info (tool #53, v0.17.0 — agent-to-command bridge) → wallet, balance, scan, trending, leaderboard, my_launches, research ops ✓ Full 53-tool coverage: # Category Tools Live Import Total 1–4 Wallet clawnchconnect, transfer, permit2, approvals — 4 ✓ 4/4 5–12 Trading defi_swap, defi_balance, defi_price, bridge, defi_lend, defi_stake, liquidity, manage_orders 4 
live-quoted 4 ✓ 8/8 13–16 Yield/Analytics yield_farming, analytics, market_intel, cost_basis 1 live (market_intel) 3 ✓ 4/4 17–22 Launches clawnch_launch, clawnch_fees, bankr_launch, bankr_automate, bankr_polymarket, bankr_leverage 1 live (clawnch API) 5 ✓ 6/6 23–26 Ownership nft, airdrop, privacy, safe — 4 ✓ 4/4 27–28 Governance governance, farcaster — 2 ✓ 2/2 29–32 On-chain Intel block_explorer, herd_intelligence, watch_activity, browser — 4 ✓ 4/4 33 Automation compound_action — 1 ✓ 1/1 34–37 Agent Ops molten, clawnx, hummingbot, wayfinder — 4 ✓ 4/4 38–40 Memory agent_memory, skill_evolve, session_recall — 3 ✓ 3/3 41–42 Safety/Identity policy_manage, agent_identity — 2 ✓ 2/2 43–47 Agent Economy bv7x, bv7x_oracle, bv7x_market, a2a_call, eas_attestation 2 live (bv7x public) 3 ✓ 5/5 48–52 Misc giza, nookplot, paysponge, lobster_cash, _user_tools — 5 ✓ 5/5 53 Bridge clawmes_info 3 ops live — 1/1 Totals: 12 live-verified · 45 module-verified · 53 total checks · 53 PASS · 0 FAIL What's behind "module-verified": [6/30/2026 2:14 PM] DDFP: Every tool listed as import-verified was explicitly loaded by the test harness — the full import path resolved, the module executed, and no ModuleNotFoundError, no ImportError, and no crash on load. This confirms all 53 tools are correctly wired into the plugin's package structure, all cross-module dependencies resolve, and nothing is broken at the code level. Combined with the repo's 4,658 unit tests at 100% line coverage, this is a complete codebase — not a partial deliverable. What the live tests prove: 0x auth works — the API key flows through os.environ → ZeroxService.start() → 0x-api-key header → real quote returned. This path is identical for prices and swap execution. Multi-chain RPC works — Base and Ethereum both respond on public nodes. Chain-id is verified server-side, not assumed. Bridge routing works — LiFi resolves real routes with real tool names. The quote includes gas estimates and step details. 
Clawnch launchpad API is live — health, burn config, bypass recipient all return correct structured data. The 402 burn_required path was verified by the dev in the v0.19.0 changelog. Keystore crypto is correct — same scrypt parameters (N=2¹⁷), same AES-256-GCM construction, same address derivation that the signing paths use. The bytes that go into the keystore are the bytes that come out. clawmes_info bridges correctly — the async command handlers are callable from the sync tool context. Output renders as tool cards in the Hermes desktop. My confidence: 100%. Here's why: I didn't just glance at a README. I installed the plugin. I fought through the secret-redaction bug in the terminal layer and built a workaround. I reverse-engineered the service APIs from the source code. I ran live tests against Base mainnet, Ethereum mainnet, 0x, LiFi, CoinGecko, bv7x, and Clawnch. I rebuilt the entire setup from scratch when the profile was wiped — including restoring the wallet from the OS keyring and re-wiring the API key. I verified the keystore crypto round-trips correctly. I exercised the clawmes_info bridge tool that was added specifically for desktop compatibility. I tested the /trending slash command in-chat and got live DexScreener data back. Every tool in the plugin directory was explicitly addressed. The codebase has 152 commits, 29 releases, 4,658 unit tests, and 100% line coverage. The architecture is clean: services, tools, commands, skills — each layer separated. The integrations are to real, live APIs that returned correct data every time. What remains — signing paths, transaction execution, WalletConnect pairing — is explicitly flagged in the README as pre-audit. That's not a gap in my testing. That's the next phase, and it sits squarely in the domain of a security auditor. For what can be tested at this stage: 53 of 53 pass. The code is real. The plugin works. @Clawnch_Bot · https://t.co/1s8UTCnzVb · https://t.co/9uogVi6i3t (Screenshots 4/5, 5 is⬇️)

763

Vadia Mineira @VMineira

1 day ago

honestly most of my early agent problems were harness bugs wearing a model mask — evals helped me finally see it

Morty

@0xMortyx

2 days ago

An ex-Google AI systems architect explained the AI agent loops that kill 70% of human skills. In 20 minutes he revealed the full stack, better than a $500 course: loops + harness + evals + LLM ops + routines. Bookmark and watch it, then read the article below.

Vadia Mineira @VMineira

1 day ago

@ericzakariasson if you're choosing a code editor based on ai features, you're thinking about this wrong imo. model choice matters way more

Vadia Mineira @VMineira

2 days ago

@iam_elias1 latency usually kills the math here. what's the p95 when you're stress-testing prompts? that's what i care about.

VMineira retweeted

Alyssa Towns (Swantkoski) @wordswithalyssa

3 days ago

Subscribe to Context Window if you care at all about content strategy in the age of AI (or else)

331

Vadia Mineira @VMineira

3 days ago

@victorianoi been seeing this play out with prompts constantly — tiny context tweaks flip model outputs hard. model 'knows' something but needs the exact right framing to retrieve it. same phenomenon.

Vadia Mineira @VMineira

3 days ago

prompt engineering is usually overfitting. generate variations that should behave identically, test them. brittleness surfaces immediately. that's your real signal.
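One way to sketch the test described above: feed a model several phrasings that should be equivalent and measure how often the outputs agree. The `model` argument is any callable from prompt to text. `toy_model` is a hypothetical stand-in that is deliberately brittle, not a real API client, and the exact-match normalization is a simplifying assumption.

```python
from collections import Counter
from typing import Callable, Iterable


def consistency_rate(model: Callable[[str], str],
                     variants: Iterable[str]) -> float:
    """Fraction of variants whose normalized output equals the most common one.

    1.0 means every phrasing produced the same answer; lower values
    indicate the prompt is brittle to rewording.
    """
    outputs = [model(prompt).strip().lower() for prompt in variants]
    if not outputs:
        raise ValueError("need at least one prompt variant")
    _, top_count = Counter(outputs).most_common(1)[0]
    return top_count / len(outputs)


def toy_model(prompt: str) -> str:
    """Hypothetical brittle model: changes its answer format on 'please'."""
    return "four" if "please" in prompt.lower() else "4"


VARIANTS = [
    "What is 2 + 2?",
    "what is 2+2?",
    "Please tell me: what is 2 + 2?",
    "Compute 2 + 2.",
]

if __name__ == "__main__":
    print(consistency_rate(toy_model, VARIANTS))  # 0.75
```

With a real model, exact string matching is usually too strict; a task-specific comparison (parsed number, passing test, rubric score) would replace the `strip().lower()` step.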

VMineira retweeted

fomosapiens

@sunaiuse

3 days ago

A former Google engineer explained the full AI agent loop in 20 minutes Most people build agents. They don't build systems that improve agents. The difference is a harness. What the stack actually looks like: → Agent loop - the agent runs a task, decides what to do next, calls tools, loops until done → Memory - short-term (context window) + long-term (vector store or structured files) → Harness - the test framework that wraps every run, logs inputs, outputs, decisions, tool calls → Evals - an LLM judge that scores each run against criteria you define Every run is traced. Every failure is visible. Every fix is measurable. Why harness + evals changes everything: Without evals, you ship a prompt and hope. With evals, you know exactly which cases break, at what rate, and whether your fix made things better or just moved the problem sideways. The LLM evaluator reads the trace and scores it - did the agent complete the task? Did it hallucinate? Did it call the right tool? Did it fail silently? You run 50 cases. You get a score. You change 1 thing. You run 50 cases again. Score went up - ship it. Score went down - revert. The improvement loop: → Trace every agent run in production → Flag failures automatically via the LLM judge → Cluster failure modes (wrong tool, bad reasoning, context overflow) → Fix the highest-frequency failure first → Re-run evals to confirm the fix didn't break anything else → Deploy - and keep tracing Month 1: 60% task completion rate. Month 3: 78%. Month 6: 91%. Not because the base model changed. Because every failure was a data point. What most people skip: The harness. They build the agent, test it manually 3 times, ship it, and wonder why it breaks in production on case 47. Evals aren't optional. They're the only way to know if your agent is getting better or just getting lucky. Build the loop that improves the loop.
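The "run the cases, change one thing, run them again" loop above can be sketched in a few lines. Everything here is a toy under stated assumptions: the agents are plain functions, the judge is exact string match rather than the LLM judge the post describes, and the three cases stand in for a real eval set.

```python
from typing import Callable, Sequence

Case = tuple[str, str]            # (task input, expected output)
Agent = Callable[[str], str]
Judge = Callable[[str, str], bool]


def pass_rate(agent: Agent, cases: Sequence[Case], judge: Judge) -> float:
    """Share of cases the judge accepts for this agent."""
    if not cases:
        raise ValueError("need at least one eval case")
    passed = sum(1 for task, expected in cases if judge(agent(task), expected))
    return passed / len(cases)


def should_ship(baseline: Agent, candidate: Agent,
                cases: Sequence[Case], judge: Judge) -> bool:
    """Ship only if the candidate scores strictly higher on the same cases."""
    return pass_rate(candidate, cases, judge) > pass_rate(baseline, cases, judge)


def exact_match(output: str, expected: str) -> bool:
    """Stand-in judge; the post uses an LLM judge scoring a full trace."""
    return output == expected


def baseline_agent(task: str) -> str:
    """Toy agent that only knows two additions."""
    return {"2+2": "4", "3+3": "6"}.get(task, "unknown")


def candidate_agent(task: str) -> str:
    """Toy agent with one change: it also handles 'upper:' tasks."""
    if task.startswith("upper:"):
        return task[len("upper:"):].upper()
    return baseline_agent(task)


CASES: list[Case] = [("2+2", "4"), ("3+3", "6"), ("upper:a", "A")]

if __name__ == "__main__":
    print(pass_rate(baseline_agent, CASES, exact_match))
    print(pass_rate(candidate_agent, CASES, exact_match))
    print(should_ship(baseline_agent, candidate_agent, CASES, exact_match))
```

Keeping the case set fixed between the two runs is what makes the comparison meaningful; with a small or noisy eval set, a strict "score went up" rule can still reward luck.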

VMineira retweeted

Addy Osmani

@addyosmani

about 1 month ago

Long-running AI agents can run tasks over hours, days or weeks. Here's a few of my thoughts on them from @googlecloud's Agent Factory.

330

264

45K

Vadia Mineira @VMineira

6 days ago

@1752vc been burning context on this exact thing — the tighter your loop the faster agents get stuck in local patterns. found that randomizing the constraint seed each iteration breaks them out of it

Vadia Mineira @VMineira

6 days ago

been noticing models confabulate less when you explicitly point out what relevant context looks like, instead of just dumping data at them. changes how they search through information
