Chen Xing @LynetteSohn - Twitter Profile

Chen Xing @LynetteSohn

over 1 year ago

@willccbb So the performance of the non-grayed-out models shouldn't be directly compared to that of the grayed-out models.


Chen Xing @LynetteSohn

over 1 year ago

@willccbb Good observation! If you hover over the grayed-out model names, we try to explain the caveat: we produced the data examples according to the failures of the 6 grayed-out frontier models (including Llama 405B and GPT-4o). So the benchmark is biased against these grayed-out models.


LynetteSohn retweeted

Summer Yue

@summeryue0

over 1 year ago

Introducing MultiChallenge by @scale_AI - a new multi-turn conversation benchmark. Current frontier LLMs score under 50% accuracy (top: 44.93%). The new Gemini 2.0 Flash model launched today has also been added to our SEAL leaderboard. 📄 Paper: https://t.co/9ok0yYNEO0 🏆Leaderboard: https://t.co/ubvpY6cPeb



LynetteSohn retweeted

Summer Yue

@summeryue0

about 2 years ago

🚀 Introducing the SEAL Leaderboards! We rank LLMs using private datasets that can’t be gamed. Vetted experts handle the ratings, and we share our methods in detail openly! Check out our leaderboards at https://t.co/bRdTbIMd20! Which evals should we build next?




LynetteSohn retweeted

Caiming Xiong

@CaimingXiong

over 2 years ago

Excited to share our brand new LLM evaluation benchmark 🐠FoFo🐠 on format-following! 🐠FOFO🐠 is a pioneering benchmark for evaluating large language models’ (LLMs) ability to follow complex, domain-specific formats, a crucial yet under-examined capability for their application as AI agents. Link: https://t.co/qBETnrar8r

Our evaluation across both open-source (e.g., Llama 2, WizardLM) and closed-source (e.g., GPT-4, PALM2, Gemini) LLMs highlights three key findings:
1. open-source models significantly lag behind closed-source ones in format adherence;
2. LLMs’ format-following performance is independent of their content generation quality;
3. LLMs’ format proficiency varies across different domains.

These observations suggest two key points:
i) The format-following capacity of LLMs appears independent of their content-following capacity shown in AlpacaEval and MT-Bench, and may necessitate specialized alignment fine-tuning beyond the conventional instruction-tuning of open source LLMs.
ii) Format-following capacity is not universally transferable across domains, highlighting the potential utility of our benchmark as a guiding and probing tool for selecting domain-specific AI agent foundation models.


LynetteSohn retweeted

Andreas Köpf

@neurosp1ke

almost 3 years ago

Interesting model: https://t.co/dQd6PZV6T1


LynetteSohn retweeted

Caiming Xiong

@CaimingXiong

almost 3 years ago

Code LLaMA is finally here. Congrats to @MetaAI @ylecun @syhw, etc. And we'd like to introduce our fine-tuned Llama-2-70B model, 🐒Lemur-70B🐒, again😉. It complements Code LLaMA (7B, 13B, 34B) and maintains strong performance on text tasks. Code: https://t.co/MburXL8pC4


LynetteSohn retweeted

Tao Yu @taoyds

almost 3 years ago

🧵Lemur-70B-chat stands out as the top-performing open-source LLM, rivaling ChatGPT across a broader spectrum of tasks when compared to other available open-source LLMs.


Chen Xing @LynetteSohn

almost 3 years ago

Glad to be one of the lemurs! @SFResearch

XLANG NLP Lab @XLangNLP

almost 3 years ago

1/6 Open LLMs have traditionally been tailored for either 📚text or 💻code, with limited ability to effectively balance both. 🚀 Introducing #Lemur70B! 🚀: the SOTA open LLM balancing 📚text & 💻code capabilities 🤗Model: https://t.co/BPp7Tn2WsV 📖Blog: https://t.co/LAPYhd7IcZ




Chen Xing @LynetteSohn

over 3 years ago

@adad8m So I don't know what ChatGPT's advantage is compared to a search engine.


Chen Xing @LynetteSohn

over 3 years ago

@adad8m I played with ChatGPT on some basic physics questions, such as: "A person standing on a train heading west at 100 km/h throws a ball east at 100 km/h. What is the speed of the ball relative to a stationary observer on the ground?" ChatGPT fails a lot on these questions.


LynetteSohn retweeted

Zachary Lipton

@zacharylipton

about 5 years ago

I can't imagine being a woman in CS seeing this week's news. I feel shocked and devastated. Foremost for the victims. But also for all the other women who have to put their guard up higher, doubting the authenticity of the professional attention they receive. You deserve better.


LynetteSohn retweeted

Jörg Tiedemann @TiedemannJoerg

almost 6 years ago

The worst ever feature introduced online is the automatic redirection to localized websites. How can I convince the world that I don’t want them to turn into Finnish?
