(1/4) Gemini has been trending a lot on Twitter 🔥 We wanted to bring the conversation back to actual LLM eval results. Through a lot of testing, we have found Gemini to be a very solid model.
We recently made 2️⃣ updates to the Gemini Needle in a Haystack test 🪡 based on some notes from the Google team. The final results show a perfect haystack result, similar to @JeffDean's results 💯
✅ Tokenizer: The tokenizer used in the first test was incorrect and threw off the results. Fixing it improved the results but did not resolve everything. This is our miss.
✅ Prompting: Matching the prompt to the one @AnthropicAI uses gave Gemini its best results yet: a perfect execution, simply by using the Anthropic prompt addition.
All evals run using @ArizePhoenix. Tagging relevant Evals folks! @rown @universeinanegg @ybisk @YejinChoinka @allen_ai @haileysch__ @lintangsutawika @hendrycks @markchen90 @MillionInt @HenriquePonde @Shahules786 @karlcobbe @mobav0 @lukaszkaiser @gdb
(1/7) The G in RAG Trips Up GPT-4 ... Or Does It?
The retrieval stage of RAG gets an incredible amount of attention. The generation stage ... not so much 🤔
We set out to test how models handle the generation stage of RAG ... and @AnthropicAI beat @OpenAI #GPT4 ‼️ In most evals we've run, GPT-4 has outperformed Claude, so these test results were a bit of a surprise.
Spoiler: Anthropic’s verbose responses might actually have given it an edge in generation 🗣️
Spoiler Spoiler: A small trick to make GPT-4 more verbose gave it perfect responses 🪄
The Test We Ran
1⃣ Retrieve Random Number 1 = Ex: 4827143
2⃣ Retrieve Random Number 2 = Ex: 4
3⃣ Generate a month for our date string from Random Number 1
4⃣ Generate the day for our date string from Random Number 2
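For anyone who wants to try something similar, here is a minimal Python sketch of the four steps above. The thread does not say how a number maps to a month or a day, so the rule used here (month = Random Number 1 modulo 12, day = Random Number 2 taken as-is) is purely an assumption for illustration, and `build_case`, `expected_date`, and `is_correct` are hypothetical names, not part of our eval code or of Phoenix.

```python
import random
import re

MONTHS = (
    "January", "February", "March", "April", "May", "June", "July",
    "August", "September", "October", "November", "December",
)


def expected_date(n1: int, n2: int) -> str:
    """Build the reference date string.

    ASSUMPTION: month = n1 modulo 12 (0 -> January), day = n2 as-is.
    The original thread does not specify the actual mapping.
    """
    return f"{MONTHS[n1 % 12]} {n2}"


def build_case(rng: random.Random) -> dict:
    """Create one test case: two numbers to retrieve plus the expected answer."""
    n1 = rng.randint(1_000_000, 9_999_999)  # Random Number 1, e.g. 4827143
    n2 = rng.randint(1, 28)                 # Random Number 2, kept valid for every month
    context = f"Random Number 1 is {n1}. Random Number 2 is {n2}."
    return {"context": context, "n1": n1, "n2": n2, "expected": expected_date(n1, n2)}


def is_correct(response: str, expected: str) -> bool:
    """Case-insensitive whole-phrase match, so 'May 1' does not match 'May 12'."""
    return re.search(rf"\b{re.escape(expected)}\b", response, re.IGNORECASE) is not None


if __name__ == "__main__":
    case = build_case(random.Random(0))
    print(case["context"])
    print("expected:", case["expected"])
```

The `context` string would be placed in the retrieved documents, and the model's answer passed to `is_correct`. The whole-phrase match also lets a verbose answer pass as long as the right date appears in it.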
Here you can see the results of GPT-4 and Anthropic side by side. All evals run using @ArizePhoenix
If you're interested in research results, we're reviewing them this week at @anyscalecompute with @robertnishihara and @GregKamradt Tuesday evening in SF: https://t.co/z82GVHKobP
Tagging relevant Evals folks! @rown @universeinanegg @ybisk @YejinChoinka @allen_ai @haileysch__ @lintangsutawika @hendrycks @markchen90 @MillionInt @HenriquePonde @Shahules786 @karlcobbe @mobav0 @lukaszkaiser @gdb
🚨 Announcing the Anthropic Claude 2 Hackathon 🚨
24 hours
July 29-30 in SF
$10K prizes in cash + credits
The @AnthropicAI team is giving hackers Claude 2 API access
Sign up here 👉 https://t.co/Aoz6dAPs16
Today, we announced a $38 million Series B – a record investment in the #MLobservability space! 🎉
If you are passionate about understanding ML models, love building tools for engineers and like working with a diverse group, come join our team! https://t.co/NmJUlnHIcC
Machine learning observability is crucial to building the right guardrails for better and more responsible AI. Today we announced a $19M Series A led by Battery Ventures w/ participation from @FoundationCap @trinityventures @thehousefund & Swift Ventures https://t.co/3OgnH0ECTV
At Ushur, we have a team-first culture, which was built through a careful, strategic early hiring process. There are 5 key components that are crucial to building a strong culture, as discussed in the latest blog post from Simha Sadasiva, CEO of Ushur.
https://t.co/bgGKLtqvp6
By far the best #SaaStrAnnual session so far. Do yourself a favor and watch it. Managing inwards, sideways, up & down. @SlackHQ https://t.co/l0xAbrJOJw