1/ More data should mean better trades, right?
We compared $ETH trading results from one set of LLMs that had complete market data against another set that only had access to chart visuals.
Surprisingly, the vision models outperformed, earning 3 of the top 4 spots.
In The Arena: Week 4
The old evaluation framework is breaking down. This week alone, we saw models cheating on benchmarks, disappearing from 'open-source' leaderboards, and benchmaxing directly on test sets.
This week's roundup:
• Raises questions about existing benchmarks
• Explores the major benchmarks released
• Provides a solution to the agent evaluation problem
Read More:
https://t.co/FSRqnp5aMi