birbbirbbirbbirb @parrotsftw - Twitter Profile

Pinned Tweet

birbbirbbirbbirb @parrotsftw

almost 6 years ago

i've come to realise that 80% of Masters is just stress management

2

10

0

birbbirbbirbbirb @parrotsftw

4 months ago

@ICICILombard if you do reject on the basis of medical history, is it ethical to do so?

0

7

birbbirbbirbbirb @parrotsftw

4 months ago

@ICICILombard hey, is it your policy to reject health insurance for young people with a medical history? I’m ready to pay an extra premium and wait longer for the lock period. Mine has been rejected, with no reason given. I’m 29 yo, with no chronic illness, and yet my claim is rejected

3

0

15

parrotsftw retweeted

Amy Lu

@amyxlu

7 months ago

For women / under-represented minorities applying to AI PhD programs this year: if you’d like feedback on your application, feel free to consider me a resource, particularly if you’re a non-traditional applicant. DMs open! (Paying it forward for the @WiMLworkshop community 💛)

7

294

35

129

24K

birbbirbbirbbirb @parrotsftw

7 months ago

@ariahalwong Any tips for the selection process? :)

1

3

0

1

423

parrotsftw retweeted

Deedy

@deedydas

11 months ago

One of the most important papers in AI: a tiny brain-inspired 27M param model trained on 1000 samples outperforms o3-mini-high on reasoning tasks!

Still can't believe this tiny lab of Tsinghua grads gets 40% on ARC-AGI, solves hard sudoku and mazes.

We're still so early. https://t.co/Wed9X68CrX

242

8K

956

6K

2M

parrotsftw retweeted

Jack Lindsey @Jack_W_Lindsey

11 months ago

Attention is all you need - but how does it work? In our new paper, we take a big step towards understanding it. We developed a way to integrate attention into our previous circuit-tracing framework (attribution graphs), and it's already turning up fascinating stuff! 🧵

18

1K

186

1K

193K

parrotsftw retweeted

vik

@vikhyatk

11 months ago

40

4K

231

302

246K

parrotsftw retweeted

Leshem (Legend) Choshen 🤖🤗 @LChoshen

about 1 year ago

Did you remember all the things to do before sharing a figure?
Remove "_" from the labels
save as pdf
...

A lot of deeper tips to writing are in the shared document, but somehow this practical list was always the most popular.
Good luck in #neurips https://t.co/EXrfyOiCkc

2

34

5

21

3K

parrotsftw retweeted

Curiosity

@CuriosityonX

11 months ago

NEWS🚨: James Webb confirms there's something seriously wrong with our understanding of the universe — and reveals unknown physics exists. https://t.co/213xU3n5xk

3K

119K

12K

25K

14M

parrotsftw retweeted

Kyle Corbitt

@corbtt

11 months ago

Big news: we've figured out how to make a *universal* reward function that lets you apply RL to any agent with: - no labeled data - no hand-crafted reward functions - no human feedback! A 🧵 on RULER

45

1K

122

2K

180K

parrotsftw retweeted

Shashwat Goel

@ShashwatGoel7

11 months ago

There's been a hole at the heart of #LLM evals, and we can now fix it.

📜New paper: Answer Matching Outperforms Multiple Choice for Language Model Evaluations.

❗️We found MCQs can be solved without even knowing the question. Looking at just the choices helps guess the answer and get high accuracies. This affects popular benchmarks like MMLU-Pro, SuperGPQA etc. and even "multimodal" benchmarks like MMMU-Pro, which can be solved without even looking at the image ⁉️.

Such choice-only shortcuts are hard to fix. We find prior attempts at fixing them-- GoldenSwag (for HellaSwag) and TruthfulQA v2 ended up worsening the problem. MCQs are inherently a discriminative task, only requiring picking the correct choice among a few given options. Instead we should evaluate language models for the generative capabilities they are used for. We show discrimination is easier than even verification, let alone generation.

🤔 But how do we grade generative responses outside "verifiable domains" like code and math? So many paraphrases are valid answers... We show a scalable alternative--Answer Matching--works surprisingly well. It's simple--get generative responses to existing benchmark questions that are specific enough to have a semantically unique answer without showing choices. Then, use an LM to match the response against the ground-truth answer.

👨‍🔬We conduct a meta-evaluation by comparing to ground-truth verification on MATH, and human grading on MMLU-Pro and GPQA-Diamond questions. Answer Matching outcomes give near-perfect alignment, with even small (recent) models like Qwen3-4B. In contrast, LLM-as-a-judge, even with frontier reasoning models like o4-mini, fares much worse. This is because without the reference-answer, the model is tasked with verification, which is harder than what answer matching requires--paraphrase detection--a skill modern language models have aced💡

Let's shift the benchmarking ecosystem from MCQs to Answer Matching. Impacts:
Leaderboards: We show model rankings can change and accuracies go down, making benchmarks seem less saturated.
Benchmark Creation: Instead of creating harder MCQs, we should focus our efforts on creating questions for answer matching, much like SimpleQA, GAIA etc.
🤑 Cost: Finally, to our great surprise, answer matching evals are cheaper to run than MCQs!

See our paper for more, it's packed with insights. 🧵 has paper and more result figures.

11

227

37

218

36K

parrotsftw retweeted

Chess.com

@chesscom

12 months ago

An important update.

2K

306K

14K

15K

12M

birbbirbbirbbirb @parrotsftw

12 months ago

I was just thinking about FAIR’s research papers the other day—they really are all that thorough…..two papers that come to mind-COCONUT paper and an old paper LRP for transformers implementation pape

Zeyuan Allen-Zhu, Sc.D.

@ZeyuanAllenZhu

12 months ago

Facebook AI Research (FAIR) is a small, prestigious lab in Meta. We don't train large models like GenAI or MSL, so it's natural that we have limited GPUs. GenAI or MSL's success or failure, past or future, doesn't reflect the work of FAIR. It is important to make this distinction https://t.co/2aN9ZEou7u

15

823

56

357

124K

0

54

parrotsftw retweeted

Ekdeep Singh Lubana @EkdeepL

about 1 year ago

🚨 New paper alert! Linear representation hypothesis (LRH) argues concepts are encoded as **sparse sum of orthogonal directions**, motivating interpretability tools like SAEs. But what if some concepts don’t fit that mold? Would SAEs capture them? 🤔 1/11

5

379

60

406

39K

parrotsftw retweeted

Jack D. Carson

@mtlushan

about 1 year ago

After more than half a year of work, it's finally done! In my new paper I demonstrate a new technique for mesoscopic understanding of language model behavior over time. We show that LM hidden states can be approximated by the same mathematics as govern the statistical properties of microscopic particles. And, more importantly, that this approximation is sufficient to very cheaply predict LLM misalignment and failure modes before they occur during inference.

Check it out below!

19

589

64

696

99K

parrotsftw retweeted

Mehrdad Farajtabar @MFarajtabar

about 1 year ago

🧵 4/8 Result #1: Three distinct performance regimes 📈📉

Comparing thinking vs non-thinking models under the same inference token compute revealed:
🟡 LOW complexity: Standard LLMs actually outperform reasoning models (and are more efficient!)
🔵 MEDIUM complexity: Reasoning models gain advantage.
🔴 HIGH complexity: Both models completely collapse to 0% accuracy.

7

216

29

61

59K

birbbirbbirbbirb @parrotsftw

over 1 year ago

@bookingcom They are not helpful, I will be booking an e-dakhil form tomorrow. Is there any support you can offer-the experience has been really appalling and the manager is refusing any compensation and not even solving the problem?

2

0

28

birbbirbbirbbirb @parrotsftw

over 1 year ago

@bookingcom i booked a hotel through your portal in Darjeeling-called Orsino Spa resort. There was no hot water for 2 days and the manager provided us with no support in this regard. We have more detailed complaints re the hotel, how can we solve this?

1

0

23

birbbirbbirbbirb @parrotsftw

over 1 year ago

We expect prompt resolution and appropriate compensation for these issues. How do you intend to address this situation?”

0

16

birbbirbbirbbirb @parrotsftw

over 1 year ago

“@pinetreeresorts Horrible experience: https://t.co/uruIZwM73D hot water for 2 days. Manager’s solution? Buckets of water—1 hot, 2 cold. Is this how you expect guests to maintain basic hygiene? 2.WiFi is terrible and practically unusable.

1

0

22

birbbirbbirbbirb @parrotsftw

over 1 year ago

https://t.co/YQEndFqSuE 11:45 PM, the manager suggested we wait for maintenance. Is this your idea of hospitality? This is unacceptable for a resort.”

1

0

19
