I'm delighted to announce the open #source release of Fewshell, a mobile-first, self-hosted #AI terminal #agent for #devops, on-call engineers, and AI researchers. Run complex #bash commands from your phone with ease. Fix your systems remotely: https://t.co/XCEXIWDvFV
#AI agents can use curl in your eval environment to cheat on tasks by finding the answer online. If you disable network access, be sure to also disable search grounding; otherwise they can still search the internet on the LLM provider's side.
I want to share a new dataset of 331 reward-hackable environments. These are real environments used in Terminal Bench and adjacent benchmarks. I first got interested in this because, as a reviewer of Terminal Bench, I noticed a lot of our tasks were hackable. I also noticed that many people contribute to the benchmark because it provides credibility when selling environments to labs. Hence, TBench tasks are, in my opinion, held to a higher quality standard than those being used today for RL. No one is spending hours manually reviewing the $1B in tasks being purchased by major labs. As far as I understand, while everyone knows environments are hackable, nobody has released hundreds of "realistic" environments. (link in comment)
@TimSweeneyEpic @Pirat_Nation The proliferation of third-party launchers is a blight on the customer experience. Using Epic Store and Steam would be so much better if they disallowed vendors from adding launchers.