@thdxr Perfect! Exactly what we are missing now. Lots of plugins are implementing notifications via either a user message or an invisible message.transform
@Chaos2Cured @deredleritt3r The methodology is ridiculous. It's essentially measuring the capability of the web interface, not the model. Anthropic did loads of dedicated RL on law and accounting. No way they would lose to OpenAI.
@deredleritt3r @Greg22040755 @simobis23 btw, there are already multiple law benches way more authoritative and objective than yours. If you don't even do research before acting, I cannot imagine how you would succeed in law-related work. https://t.co/isXpWOoIp4
https://t.co/3mS9OspbGY
@deredleritt3r @Greg22040755 @simobis23 I don't know how good you are at law, but apparently you are at a very elementary level in using AI agents. Be humble and learn before you post any data. This is misleading and irresponsible.
@deredleritt3r @Greg22040755 @simobis23 You are an underskilled enterprise AI user if you don't customize your agent with external data sources, extra tools, or your internal knowledge bases. That's not how professional agent-driven legal work is done.
@deredleritt3r @Greg22040755 @simobis23 Then you are NOT measuring the model. You are just measuring the quality of the provided web search, which is not valuable at all. Stop drawing false conclusions about model intelligence.
@deredleritt3r @Greg22040755 @simobis23 1. Open-source part of your question bank so people can reproduce the results.
2. Create a standard and transparent agent environment for testing.