@vasuman it's particularly bad if your context has high logic density (i.e. rules, conditionals, exceptions), which becomes more common in production use cases
Recent model releases are disappointing because of benchmark hacking.
A company optimizes its model for the benchmark and says “this is 50% smarter than the others”.
Prophet Arena is cool because I think we can all agree we have AGI when AI can predict the future.
🔮 Introducing Prophet Arena — the AI benchmark for general predictive intelligence.
That is, can AI truly predict the future by connecting today’s dots?
👉 What makes it special?
- It can’t be hacked. Most benchmarks saturate over time, but here models face live, unseen future events. You can’t memorize tomorrow (unless you’ve cracked time travel).
- It’s interpretable. Strong performance = real foresight, which translates into real investment gains.
👉 Check it out: https://t.co/1ASTV8GzWy
@levie The main value of subagents is UX. We get to see the “little AI workers” “working”, and that's easy to understand.
Similar to CoT letting you “see what they think” (it doesn’t), it’s useful for building agent UX but not for production AI workflows.
@jacob_posel @MatthewGattozzi We're working on this specific problem. Easy to do a quick and dirty version, hard (but solvable) to do it in a more consequential large data environment (i.e. 8-9 figure spend, XXM MAU)
once you solve it for real, you can have self-improving creative which is insanely powerful
After seeing the anthropic blog post and posts like these, I spent the night using claude code for non-code use. Three agents on growth strategy and three agents on creative generation. I gave them a small creative repo, some data, and a context profile instead of code.
Takeaways:
1. For whatever reason each agent call took ~60s, and averaged 15k tokens used.
2. Bad instruction following. Subagents are overly eager and are extremely aggressive in liberties taken. If you ask them to do one easy task, they often do 4-5 additional ones on their own.
3. Agent coordination and handoff are very poor and kind of random even if there were clear instructions given.
4. Creatively, these models are much worse than the sonnet-4 models. They're stated as being the same (?) or similar, but they're notably very bad at creative compared to most stock LLMs. Yes, you can make stuff, but it's not really shippable yet.
5. Subagents generated by anthropic (recommended) did much worse than ones you create yourself.
6. Inspired by @boringmarketer I tried quant analysis of marketing creative, and my tldr right now is that most implementations will lose you all your money. I can make a separate thread on this and how to make it work, but I compared agent creative analysis (given performance data) against hand analysis, and the agents are generally wrong.
The killer feature here is the claude code interface combined with a general-purpose agent SDK. Net, I was able to get some decent results, but this use case is pretty far from production viable. Next is recreating this with openai agents.
Something that many people haven't realized:
Claude Code is useful for much more than just coding. For example, @AnthropicAI's growth marketing team is using it to generate 100s of new ad creatives in minutes.
I think Anthropic has undersold the power of this tool. It should be branded as "Claude Agent" not "Claude Code."
I asked people what their non-coding use cases are and some of the replies are 🤯. See next post for the link.
How good are stock LLMs at translation? Is 4o English -> Spanish basically perfect, or can native speakers immediately tell the difference?
Also is there some kind of language benchmarking to see how good each model is for different languages?
@shl Company A buys saas from Company B for $1m ARR
Company B buys saas from Company C for $1m ARR
Company C buys saas from Company A for $1m ARR
Is the total revenue $1M or $3M a year?
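Quick sketch of the arithmetic, under a loud assumption: each company books its own sale as revenue and nothing is netted out across companies (my simplification, not an accounting ruling). The `contracts` list and the `summarize` helper are made up for illustration. Gross revenue sums every seller's bookings, while net cash per company is revenue minus spend.

```python
from collections import defaultdict

# (buyer, seller, annual contract in USD), taken from the scenario above.
contracts = [
    ("A", "B", 1_000_000),
    ("B", "C", 1_000_000),
    ("C", "A", 1_000_000),
]

def summarize(contracts):
    """Return (gross revenue across all sellers, net cash per company)."""
    revenue = defaultdict(int)  # what each company sells
    spend = defaultdict(int)    # what each company buys
    for buyer, seller, amount in contracts:
        revenue[seller] += amount
        spend[buyer] += amount
    total_revenue = sum(revenue.values())
    companies = sorted(set(revenue) | set(spend))
    # Assumption: no other costs or income, so net cash is revenue minus spend.
    net_cash = {c: revenue[c] - spend[c] for c in companies}
    return total_revenue, net_cash

total_revenue, net_cash = summarize(contracts)
print(total_revenue)
print(net_cash)
```

Under that assumption the three companies report $3M of revenue between them, but each one's net cash from the loop is $0, which is the tension the question is pointing at.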
more raw veo3 experiments. these are all sequential, first attempt, no cherry picking since I think that gives a better view of what the model's actually like:
veo3 prompt: generate the most stereotypical tiktok video you can