Excited to share MoNoCo -- a new benchmark with @TomerWolfson, @harsh3vedi, and others for pushing the frontier of LLMs on realistic information-seeking questions that require combining info from dozens of Wikipedia pages!
LLMs power research, decision‑making, and exploration—but most benchmarks don’t test how well they stitch together evidence across dozens (or hundreds) of sources. Meet MoNaCo, our new eval for question-answering cross‑source reasoning. 👇
📢 New Benchmark: SUPER for Setting UP and Executing tasks from Research repositories
Reproducibility is crucial in science. We introduce SUPER to evaluate LLMs' capabilities in autonomously running experiments from research repositories. ⬇️
https://t.co/U47r3F3UO5
🥳The BIGGEST congratulations for our teams' recognition at #ACL2024! OLMo received the Best Theme Paper, Dolma + AppWorld received the Best Resource Paper, and "Political Compass or Spinning Arrow?" was honored with an Outstanding Paper Award.
Excited to share AppWorld, our challenging new interactive coding environment and benchmark to push AI agents further!
Super easy to use (`pip install...`), reliable, reproducible, realistic.
Congratulations to @harsh3vedi for the huge effort!!
🔥 Autonomous AI Assistants (e.g., #googleio2024, #WWDC24) and coding agents (e.g., #Devin, #SWEAgent) have garnered a lot of attention recently. We can envision coding agents autonomously completing complex day-to-day tasks across apps using APIs on our behalf. But how can we develop & benchmark them in a rigorous & reproducible manner?
🚀 Introducing AppWorld: 🌎a simulated world environment where agents can write code to interact with many apps via APIs on behalf of people 📊a benchmark of complex tasks defined on it, and 🧪a robust evaluation framework for assessing agent’s goal completion.
📢 To appear as an #ACL2024 paper 🌎💻🧑🤝🧑 “AppWorld: A Controllable World of Apps and People for Benchmarking Interactive Coding Agents” #NLProc #ai #AIagents
📜 https://t.co/cdQ07847Kp (paper)
🌐 https://t.co/dIawTLcI7a for code, blog, data (tasks, APIs, trajectories) explorer, interactive playground, leaderboard & more!
New paper by @ben_bogin@shivanshug11 Peter Clark & @Ashish_S_AI shows that generating code with domain description prompts is hugely effective for Semantic Parsing in the modern era of LLMs!
And there is more to it than just the target language's popularity!
Turns out SSMs like S4 and S6 don't quite get the best of both worlds -- sequential and parallel -- and struggle to track state just like Transformers.
Excited to share the "Illusion of State" paper w/ @lambdaviking, @jowenpetty !
https://t.co/O8eD2Xzy7f
✨Excited to finally drop our new paper: SSMs “look like” RNNs, but we show their statefulness is an illusion🪄🐇
Current SSMs cannot express basic state tracking, but a minimal change fixes this! 👀
w/ @jowenpetty, @Ashish_S_AI
https://t.co/rkHp2BvYm1
ICYMI @benbenbrubaker wrote an eloquent Quanta article✍️covering our ICLR-2024 paper (w/ @lambdaviking) on how the expressive power of transformers changes with the length of CoT!
Recently updated paper📜at https://t.co/6BbOON04DA
Today in Quanta Magazine, @benbenbrubaker gives a new overview of our work (w/ @Ashish_S_AI) on the expressive power of transformers with/without CoT
https://t.co/FgZSBzUTpn
New paper: "Attentiveness to Answer Choices Doesn’t Always Entail High QA Accuracy" 📊💬
https://t.co/5V47V0kfjn
Something I've been thinking a lot about recently is the relationship between distributions over vocabularies produced by language models and the various ways... 1/5
📢 Introducing ReFeed: a novel plug-and-play approach to enhance the factuality of large language models via retrieval feedback! Together with @Meng_CS@zhihz0535@LiangZhenwen@ai2_aristo
Read more: https://t.co/V1FowqUG8v
📢 Introducing IfQA - the first large-scale open-domain question answering (ODQA) dataset centered around counterfactual reasoning. Together with @Meng_CS@ai2_aristo!
Paper link: https://t.co/dwKCzrv3Nn
Introducing 𝗥𝗘𝗙𝗟𝗘𝗫: What does my LLM believe?🧐We show that we can add a 𝗿𝗮𝘁𝗶𝗼𝗻𝗮𝗹 𝗹𝗮𝘆𝗲𝗿 to an LLM to materialize its "belief graph", repair inconsistencies & produce reasoning chains drawn from a now-consistent system of beliefs! https://t.co/t1w65Sf73d #NLProc