Shlok Kumar @sk2740 - Twitter Profile

Shlok Kumar @sk2740

3 months ago

#x #ai #interview

0

5

Shlok Kumar @sk2740

3 months ago

1/6: Q: "How do you build an evaluation pipeline for a frontier model? What makes an eval trustworthy?" Building a strong eval pipeline starts with curated, held-out test data + contamination checks (e.g., detecting whether test examples leaked into the training data).

1

0

9
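A minimal sketch of the contamination check mentioned in the tweet above, assuming plain-text test examples and training documents and using exact n-gram overlap as the leak signal. The function names and the default n-gram length are illustrative choices, not taken from the thread.

```python
def ngrams(text, n=8):
    """Return the set of lowercase whitespace-token n-grams in text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def contamination_rate(test_examples, train_docs, n=8):
    """Flag test examples that share at least one n-gram with the training docs.

    Returns (fraction_flagged, flagged_examples). Exact n-gram overlap is an
    assumed heuristic; it misses paraphrased leaks.
    """
    if not test_examples:
        return 0.0, []
    train_grams = set()
    for doc in train_docs:
        train_grams |= ngrams(doc, n)
    flagged = [ex for ex in test_examples if ngrams(ex, n) & train_grams]
    return len(flagged) / len(test_examples), flagged
```

Shorter n-grams flag more examples but raise false positives; longer ones only catch near-verbatim copies.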

Shlok Kumar @sk2740

3 months ago

6/6: How do you evaluate open-ended generation? There is no single ground truth, so use rubrics for fluency/helpfulness/truthfulness. Combine LLM judges (with strong prompts), human raters, & rule-based metrics (n-gram overlap). Test on varied prompts; measure consistency via inter-rater agreement.

1

0

5
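One common way to quantify the inter-rater agreement mentioned above is Cohen's kappa for two raters. The sketch below assumes each rater assigns one categorical label per item; the thread does not say which agreement statistic is intended, so treat the choice as an assumption.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)
    # Fraction of items on which the two raters agree.
    observed = sum(x == y for x, y in zip(rater_a, rater_b)) / n
    # Agreement expected by chance, from each rater's label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)
```

A kappa near 1 indicates agreement well above chance; a value near 0 means the raters agree no more often than chance would predict.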

Shlok Kumar @sk2740

3 months ago

#x #ai #interview

0

3

Shlok Kumar @sk2740

3 months ago

1/6: Q: "Compare chain-of-thought prompting vs training a reward model / verifier for selecting among multiple completions." CoT prompting: Generate step-by-step reasoning in one pass. Simple, no extra training. Higher latency/tokens per query due to longer output. Low data requirements.

1

0

8

Shlok Kumar @sk2740

3 months ago

6/6: Follow-up: How to evaluate if extra reasoning improved the answer? Use benchmarks (accuracy on held-out tests), compare final answer correctness, human eval of reasoning quality, or verifier scores on consistency/validity. Self-consistency across samples also helps.

1

0

5
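The self-consistency idea mentioned above can be sketched as a majority vote over final answers from several sampled completions. This assumes the final answers have already been extracted as strings; the normalization step (strip and lowercase) is an illustrative choice.

```python
from collections import Counter


def self_consistency(answers):
    """Return (majority_answer, agreement_fraction) over sampled final answers."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    # Normalize lightly so trivially different strings count as the same vote.
    counts = Counter(a.strip().lower() for a in answers)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(answers)
```

The agreement fraction doubles as a rough confidence signal: low agreement across samples suggests the extra reasoning is not converging on one answer.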

Shlok Kumar @sk2740

3 months ago

#x #ai #interview

0

10

Shlok Kumar @sk2740

3 months ago

4/6 When it wastes resources: Simple queries (fact lookup, creative writing, fast chat). Diminishing returns kick in fast — after a point, extra compute gives tiny or zero gain but multiplies cost & latency.

1

0

16

Shlok Kumar @sk2740

3 months ago

6/6 "How would you decide at serving time whether to allocate extra compute for a given query?" Use a lightweight router/classifier (or confidence score) that looks at query difficulty, user tier, or early model uncertainty — then dynamically choose CoT depth, N sample

1

0

10
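A rough sketch of the serving-time router described above. The thresholds, the `ComputePlan` type, and the sample counts are all hypothetical, and the difficulty and uncertainty scores are assumed to arrive in [0, 1] from some upstream classifier.

```python
from dataclasses import dataclass


@dataclass
class ComputePlan:
    use_cot: bool   # whether to prompt for step-by-step reasoning
    n_samples: int  # how many completions to sample


def choose_plan(difficulty: float, uncertainty: float, premium: bool) -> ComputePlan:
    """Pick an inference budget from hypothetical difficulty/uncertainty scores."""
    score = max(difficulty, uncertainty)
    if score < 0.3:
        return ComputePlan(use_cot=False, n_samples=1)  # easy: answer directly
    if score < 0.7:
        return ComputePlan(use_cot=True, n_samples=1)   # medium: one CoT pass
    # Hard: CoT plus multiple samples, with a larger budget for the premium tier.
    return ComputePlan(use_cot=True, n_samples=8 if premium else 4)
```

In practice the thresholds would be tuned against measured accuracy gains and cost per query rather than fixed by hand.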

Shlok Kumar @sk2740

3 months ago

3/6 When it helps: Hard reasoning tasks (math, coding, science, complex planning). Accuracy often scales reliably with more compute (especially with good verifiers). Gains are biggest on problems where base model is uncertain.

1

0

9

Shlok Kumar @sk2740

3 months ago

2/6 Examples: • Chain-of-Thought (CoT) prompting • Best-of-N sampling (generate N answers, pick best) • Verifier-guided search / process reward models / tree search All trade more tokens/compute for better reasoning.

1

0

13
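Best-of-N sampling from the list above can be sketched in a few lines. Here `generate` and `score` are hypothetical callables standing in for a model's sampler and a verifier or reward model; neither is specified in the thread.

```python
def best_of_n(prompt, generate, score, n=4):
    """Sample n candidates and return the one the scorer rates highest.

    generate(prompt) -> str and score(prompt, candidate) -> float are
    assumed interfaces for a sampler and a verifier.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda c: score(prompt, c))
```

Cost grows linearly with n, and the result is only as good as the scorer: a weak verifier can pick a confident but wrong candidate.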

Shlok Kumar @sk2740

3 months ago

1/6 Q: What is test-time compute scaling (e.g., chain-of-thought, best-of-N, verifier-guided search)? When does it help and when does it waste resources? Test-time compute scaling = spending extra inference-time FLOPs at serving to boost accuracy, rather than relying only on training-time compute.

1

0

13
