1/6: Q: "How do you build an evaluation pipeline for a frontier model? What makes an eval trustworthy?"
Building a strong eval pipeline starts with curated, held-out test data + contamination checks (e.g., detecting if test examples leaked into training).
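One rough way to run a contamination check is exact n-gram overlap between test examples and training text. Minimal sketch only: the function names, whitespace tokenization, and the 8-token window are illustrative assumptions, not a standard.

```python
def ngrams(text, n=8):
    """Return the set of lowercase whitespace-token n-grams in text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def contamination_rate(test_examples, train_docs, n=8):
    """Flag test examples that share at least one n-gram with training docs.

    Returns (fraction_flagged, flagged_examples). Exact-match overlap is an
    assumption here; paraphrased leaks would not be caught.
    """
    if not test_examples:
        return 0.0, []
    train_grams = set()
    for doc in train_docs:
        train_grams |= ngrams(doc, n)
    flagged = [ex for ex in test_examples if ngrams(ex, n) & train_grams]
    return len(flagged) / len(test_examples), flagged


if __name__ == "__main__":
    train = ["the quick brown fox jumps over the lazy dog near the river bank today"]
    test = [
        "quick brown fox jumps over the lazy dog near",
        "a completely different sentence with no overlap at all here",
    ]
    rate, leaked = contamination_rate(test, train)
    print(rate, leaked)
```

Test examples shorter than n tokens produce no n-grams and are never flagged, so a real pipeline would likely need a smaller window or fuzzy matching for short items.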
6/6: How do you evaluate open-ended generation?
No single ground truth, so use rubrics for fluency/helpfulness/truthfulness. Combine LLM judges (with carefully written judging prompts), human raters, & rule-based metrics (e.g., n-gram overlap). Test on varied prompts; measure consistency via inter-rater agreement.
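Inter-rater agreement can be made concrete with Cohen's kappa, which corrects raw agreement for chance. A minimal sketch for two raters (say, an LLM judge and a human) labeling the same items; the label values are made up for illustration.

```python
from collections import Counter


def cohens_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two raters over the same items.

    kappa = (observed - expected) / (1 - expected), where expected is the
    chance agreement implied by each rater's label frequencies.
    """
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("need two non-empty rating lists of equal length")
    n = len(ratings_a)
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    count_a, count_b = Counter(ratings_a), Counter(ratings_b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:
        # Both raters used a single identical label; agreement is trivially perfect.
        return 1.0
    return (observed - expected) / (1 - expected)


if __name__ == "__main__":
    judge = ["good", "good", "bad", "good", "bad", "bad"]  # hypothetical labels
    human = ["good", "bad", "bad", "good", "bad", "good"]
    print(cohens_kappa(judge, human))
```

Kappa near 1 means the raters agree far more than chance would predict; near 0 means the agreement is about what chance alone would give.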
1/6: Q: "Compare chain-of-thought prompting vs training a reward model / verifier for selecting among multiple completions."
CoT prompting: Generate step-by-step reasoning in one pass. Simple, no extra training. Higher latency/tokens per query due to longer output. Low data req.
6/6: Follow-up: How to evaluate if extra reasoning improved the answer? Compare final-answer accuracy on held-out benchmarks with and without the extra reasoning, add human eval of reasoning quality, or use verifier scores on consistency/validity. Self-consistency across samples also helps.
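A rough sketch of that comparison: majority-vote over several reasoning samples per question (self-consistency), then measure the accuracy difference against direct answers on the same held-out items. All names and the toy data are hypothetical.

```python
from collections import Counter


def majority_answer(samples):
    """Most common final answer among sampled completions for one question."""
    return Counter(samples).most_common(1)[0][0]


def accuracy(preds, gold):
    """Fraction of predictions that exactly match the gold answers."""
    return sum(p == g for p, g in zip(preds, gold)) / len(gold)


def reasoning_gain(direct_preds, cot_samples, gold):
    """Accuracy of majority-voted reasoning answers minus direct-answer accuracy."""
    voted = [majority_answer(samples) for samples in cot_samples]
    return accuracy(voted, gold) - accuracy(direct_preds, gold)


if __name__ == "__main__":
    gold = ["4", "9", "7"]
    direct = ["4", "8", "7"]                      # answers without extra reasoning
    cot = [["4", "4", "5"], ["9", "9", "8"], ["7", "1", "7"]]  # 3 samples each
    print(reasoning_gain(direct, cot, gold))
```

A positive gain on a small set can still be noise, so in practice this would be paired with a large enough held-out set or a significance test.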
4/6 When it wastes resources: Simple queries (fact lookup, creative writing, fast chat). Diminishing returns kick in fast — after a point, extra compute gives tiny or zero gain but multiplies cost & latency.
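To see the diminishing returns in numbers, here is an idealized model (an assumption, not a measured result): each sample is independently correct with probability p, and a perfect verifier picks a correct sample whenever one exists. Then success with N samples is 1 - (1 - p)^N.

```python
def best_of_n_success(p, n):
    """P(at least one of n independent samples is correct), given per-sample p."""
    return 1 - (1 - p) ** n


def marginal_gain(p, n):
    """Extra success probability from drawing sample n + 1 after n samples."""
    return best_of_n_success(p, n + 1) - best_of_n_success(p, n)


if __name__ == "__main__":
    p = 0.3  # hypothetical per-sample accuracy
    for n in (1, 2, 4, 8, 16, 32):
        # Cost grows linearly with n; the gain per extra sample shrinks geometrically.
        print(n, round(best_of_n_success(p, n), 4), round(marginal_gain(p, n), 4))
```

Under this model the gain from each extra sample is p(1 - p)^N, so it shrinks geometrically while cost keeps growing linearly. Real samples are correlated and real verifiers are imperfect, so actual curves usually flatten sooner.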
6/6 "How would you decide at serving time whether to allocate extra compute for a given query?"
Use a lightweight router/classifier (or confidence score) that looks at query difficulty, user tier, or early model uncertainty — then dynamically choose CoT depth, N sample
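The routing idea can be sketched as a tiny rule-based policy. Everything here is hypothetical: the `ComputePlan` fields, the 0.3/0.7 thresholds, and the sample counts are placeholders, and the difficulty/uncertainty scores are assumed to come from some upstream classifier scaled to [0, 1].

```python
from dataclasses import dataclass


@dataclass
class ComputePlan:
    use_cot: bool      # whether to ask for step-by-step reasoning
    num_samples: int   # how many completions to draw


def choose_plan(difficulty, uncertainty, premium_user=False):
    """Pick a compute budget from difficulty and uncertainty scores in [0, 1].

    Thresholds are illustrative; a real system would tune them on logged
    traffic against accuracy, cost, and latency targets.
    """
    score = max(difficulty, uncertainty)
    if score < 0.3:
        return ComputePlan(use_cot=False, num_samples=1)   # cheap single pass
    if score < 0.7:
        return ComputePlan(use_cot=True, num_samples=1)    # reasoning, one sample
    # Hardest queries: reasoning plus multiple samples, more for higher tiers.
    return ComputePlan(use_cot=True, num_samples=8 if premium_user else 4)


if __name__ == "__main__":
    print(choose_plan(0.1, 0.2))
    print(choose_plan(0.5, 0.1))
    print(choose_plan(0.9, 0.1, premium_user=True))
```

Taking the max of the two scores means either signal alone can trigger more compute; a learned router could replace the hand-set rules once there is enough data.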
3/6 When it helps: Hard reasoning tasks (math, coding, science, complex planning). Accuracy often scales reliably with more compute (especially with good verifiers).
Gains are biggest on problems where the base model is uncertain.
1/6 Q: What is test-time compute scaling (e.g., chain-of-thought, best-of-N, verifier-guided search)? When does it help and when does it waste resources?
Test-time compute scaling = spending extra FLOPs at inference time to boost accuracy, instead of relying only on more training compute.
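Best-of-N is probably the simplest instance: draw N completions and keep the one a verifier scores highest. In this minimal sketch, `generate` and `score` are hypothetical callables standing in for a model and a verifier/reward model.

```python
def best_of_n(prompt, generate, score, n=4):
    """Sample n completions and return the one with the highest verifier score.

    generate(prompt) -> str and score(prompt, completion) -> float are
    placeholders supplied by the caller, not a real library API.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda completion: score(prompt, completion))


if __name__ == "__main__":
    # Toy stand-ins: a "model" that cycles through canned answers and a
    # "verifier" that only rewards the correct one.
    canned = iter(["3", "4", "5"])
    answer = best_of_n(
        "2+2?",
        generate=lambda prompt: next(canned),
        score=lambda prompt, completion: 1.0 if completion == "4" else 0.0,
        n=3,
    )
    print(answer)
```

Inference cost grows roughly N-fold here, and the result is only as good as the scorer, which is why verifier quality matters so much for this family of methods.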