What is AdaBoost? (in ML interviews)
Let's learn together.
AdaBoost is an ensemble method that combines weak learners into a strong classifier.
It trains models sequentially. Each new model focuses on the mistakes of the previous ones by giving misclassified samples higher weight.
Think: a team where each member specializes in fixing what others got wrong.
The final model:
H(x) = sign(Σ αt ht(x)) for t=1 to T
Where:
T → number of weak learners
ht(x) → prediction from learner t
αt → learner weight (higher for better performers)
How it trains:
① Start with equal sample weights
② Train weak learner on weighted data
③ Calculate learner weight: αt = ½ ln((1-εt)/εt) where εt is the weighted error rate
④ Update sample weights: multiply each weight by e^(-αt·yi·ht(xi)), so misclassified points (yi·ht(xi) = -1) gain weight and correctly classified points lose it
⑤ Normalize weights and repeat
Each round makes hard examples harder to ignore.
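The training steps above can be sketched in a few lines. This is a minimal illustration, assuming 1-D inputs, labels in {-1, +1}, and threshold stumps as the weak learner; `best_stump`, `adaboost`, and `predict` are hypothetical helper names, not a library API.

```python
import math

def best_stump(xs, ys, w):
    """Return (weighted error, threshold, polarity) of the best 1-D stump."""
    best = None
    for thr in sorted(set(xs)):
        for pol in (1, -1):
            # The stump predicts `pol` for x >= thr and `-pol` otherwise.
            err = sum(wi for x, y, wi in zip(xs, ys, w)
                      if (pol if x >= thr else -pol) != y)
            if best is None or err < best[0]:
                best = (err, thr, pol)
    return best

def adaboost(xs, ys, rounds):
    n = len(xs)
    w = [1.0 / n] * n                              # step 1: equal sample weights
    ensemble = []
    for _ in range(rounds):
        err, thr, pol = best_stump(xs, ys, w)      # step 2: fit on weighted data
        err = min(max(err, 1e-10), 1 - 1e-10)      # guard against log(0)
        alpha = 0.5 * math.log((1 - err) / err)    # step 3: learner weight
        ensemble.append((alpha, thr, pol))
        # step 4: w_i *= exp(-alpha * y_i * h(x_i))
        w = [wi * math.exp(-alpha * y * (pol if x >= thr else -pol))
             for x, y, wi in zip(xs, ys, w)]
        total = sum(w)
        w = [wi / total for wi in w]               # step 5: normalize
    return ensemble

def predict(ensemble, x):
    score = sum(a * (p if x >= t else -p) for a, t, p in ensemble)
    return 1 if score >= 0 else -1

# Toy data that no single stump can classify perfectly.
xs = [1, 2, 3, 4, 5, 6]
ys = [1, 1, -1, -1, 1, 1]
model = adaboost(xs, ys, rounds=3)
train_acc = sum(predict(model, x) == y for x, y in zip(xs, ys)) / len(xs)
```

On this toy set, three weighted stumps are enough to fit all six points, even though the best single stump gets two of them wrong.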
How is it different from Gradient Boosting?
AdaBoost adjusts sample weights and uses exponential loss. Each learner sees a reweighted dataset.
Gradient Boosting fits pseudo-residuals (the negative gradient of the loss) and works with any differentiable loss. Each learner corrects the previous ensemble's errors.
AdaBoost is simpler but less flexible. Gradient Boosting handles regression and custom losses better.
When to use AdaBoost:
when you have weak learners (like shallow trees), need a simple boosting baseline, or want interpretable learner weights showing which models matter most.
Land Data & AI jobs on https://t.co/B83Otkqc2r
What is Precision, Recall & F1 Score? (in ML interviews)
Precision, Recall, and F1 Score are metrics that measure classifier performance beyond accuracy.
Accuracy can be misleading when classes are imbalanced. If 95% of emails are not spam, a model that always predicts "not spam" gets 95% accuracy but catches zero spam.
Precision and Recall expose this. They measure different types of errors and force you to think about the trade-off between false positives and false negatives.
The formulas:
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1 Score = 2 × (Precision × Recall) / (Precision + Recall)
Where:
TP → True Positives (correctly predicted positive)
FP → False Positives (predicted positive, actually negative)
FN → False Negatives (predicted negative, actually positive)
TN → True Negatives (correctly predicted negative)
What they measure:
Precision: Of all predicted positives, how many are actually positive?
High precision means few false alarms. Use when false positives are expensive (spam filter blocking real emails).
Recall: Of all actual positives, how many did we catch?
High recall means few missed cases. Use when false negatives are expensive (cancer screening missing a tumor).
F1 Score: Harmonic mean of precision and recall.
Balances both metrics. Good when you care about both types of errors equally.
The trade-off:
You can't maximize both at once.
Lower your classification threshold and recall goes up (catch more positives) but precision drops (more false alarms).
Raise the threshold and precision improves but recall falls (miss real cases).
The precision-recall curve shows this relationship. Pick the threshold that matches your business cost.
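A small sketch of the threshold trade-off described above. The labels and scores are made up for illustration, and `precision_recall_f1` is a hypothetical helper, not a library function.

```python
def precision_recall_f1(y_true, scores, threshold):
    """Compute precision, recall, and F1 when predicting positive for score >= threshold."""
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(p == 1 and y == 1 for p, y in zip(preds, y_true))
    fp = sum(p == 1 and y == 0 for p, y in zip(preds, y_true))
    fn = sum(p == 0 and y == 1 for p, y in zip(preds, y_true))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Made-up model scores for ten examples (four of them truly positive).
y_true = [0, 0, 0, 0, 1, 0, 1, 0, 1, 1]
scores = [0.05, 0.1, 0.2, 0.3, 0.35, 0.4, 0.6, 0.7, 0.8, 0.9]

low = precision_recall_f1(y_true, scores, threshold=0.3)    # catches every positive
high = precision_recall_f1(y_true, scores, threshold=0.75)  # no false alarms
```

With the low threshold, recall is 1.0 but precision is only 4/7. With the high threshold, precision is 1.0 but recall falls to 0.5.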
How is F1 different from Accuracy?
Accuracy treats all errors the same and breaks on imbalanced data.
F1 focuses only on the positive class and penalizes extreme imbalance between precision and recall.
F1 uses harmonic mean (not arithmetic), so if either precision or recall is low, F1 is low.
Accuracy can be high even when your model is useless for the minority class.
When to use these metrics:
when classes are imbalanced, when different errors have different costs, or when you need to justify your threshold choice in an interview.
What is a Confusion Matrix? (in ML interviews)
A confusion matrix is a table that shows how your classifier performed by comparing predictions to actual outcomes.
It's the foundation for understanding where your model succeeds and where it fails. Every classification metric (precision, recall, F1) comes from this table.
Think: for a binary classifier, a 2x2 grid that breaks down correct and incorrect predictions by class.
The matrix structure:
Rows = Actual labels
Columns = Predicted labels
Four cells:
TP (True Positive) → predicted positive, actually positive
FP (False Positive) → predicted positive, actually negative
FN (False Negative) → predicted negative, actually positive
TN (True Negative) → predicted negative, actually negative
Key metrics derived from it:
Precision = TP / (TP + FP)
→ of all positive predictions, how many were right?
Recall = TP / (TP + FN)
→ of all actual positives, how many did we catch?
F1 Score = 2 × (Precision × Recall) / (Precision + Recall)
→ harmonic mean balancing both
Accuracy = (TP + TN) / (TP + TN + FP + FN)
→ overall correctness (misleading on imbalanced data)
How to read it:
① Look at the diagonal (TP and TN). These are correct predictions.
② Check FP. These are false alarms (Type I error).
③ Check FN. These are missed cases (Type II error).
④ Calculate metrics based on what matters for your problem.
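A minimal sketch of building the table and reading metrics off it, assuming binary labels coded as 0 and 1. The data is invented and `confusion_matrix` here is a hypothetical local helper, laid out with rows as actual labels and columns as predicted labels.

```python
def confusion_matrix(y_true, y_pred):
    """Return [[TN, FP], [FN, TP]]: rows are actual labels, columns are predicted."""
    m = [[0, 0], [0, 0]]
    for actual, predicted in zip(y_true, y_pred):
        m[actual][predicted] += 1
    return m

# Invented labels and predictions for ten examples.
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 0, 0]

(tn, fp), (fn, tp) = confusion_matrix(y_true, y_pred)
accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
```

Here the table holds TP = 3, TN = 5, FP = 1, FN = 1, so accuracy is 0.8 while precision and recall are both 0.75.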
How is it different from accuracy alone?
Accuracy gives you one number. Looks great on balanced data but hides problems on imbalanced sets.
A confusion matrix shows you exactly which errors you're making. You can see if you're missing positives (low recall) or getting too many false alarms (low precision).
It tells the full story. Accuracy just gives the summary.
When to use a Confusion Matrix:
every time you evaluate a classifier. It's the first thing you should check to understand model behavior and pick the right metric for your business problem.
What is Selection Bias? (in A/B test interviews)
Selection bias occurs when your sample systematically differs from the population.
Your estimates become wrong even if your model is perfect. The data you observe doesn't represent the reality you care about.
Example: if only high-performing users complete your survey, their average satisfaction will be higher than the true population mean. That gap is selection bias.
The math:
Bias = E[θ̂ₛ] - θ = E[θ | S = 1] - E[θ]
Where:
θ̂ₛ → estimate from selected sample
θ → true population parameter
S = 1 → indicator that unit was selected
E[θ | S = 1] → expected value in selected sample
How it happens:
① Sample selection depends on outcome or related variables
② Observed distribution shifts away from population
③ Estimates calculated on biased sample
④ Results don't generalize to target population
Common causes: non-random sampling, survivorship (only seeing successes), missing data that's not random, conditioning on a collider variable.
How is it different from Sampling Error?
Sampling error is random variation from taking a sample. It decreases with larger samples and averages out to zero.
Selection bias is systematic. It doesn't disappear with more data. Your sample is fundamentally unrepresentative, so adding more biased observations just gives you more confident wrong answers.
Correction methods:
Heckman correction models the selection mechanism first, then adjusts outcome estimates.
Inverse probability weighting reweights observations by 1/P(selected) to recover population distribution.
Randomized designs with intent-to-treat analysis prevent selection by design.
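To make inverse probability weighting concrete, here is a small simulation. It assumes a made-up population of satisfaction scores and a selection probability that is known exactly (`p_selected` is a hypothetical function; in practice it would have to be estimated).

```python
import random

random.seed(0)

# Hypothetical population: satisfaction scores in [0, 1].
population = [random.random() for _ in range(50_000)]
true_mean = sum(population) / len(population)

def p_selected(score):
    """Assumed selection mechanism: happier users answer the survey more often."""
    return 0.1 + 0.8 * score

sample = [s for s in population if random.random() < p_selected(s)]
naive_mean = sum(sample) / len(sample)          # biased upward

# Each respondent stands in for 1 / P(selected) users in the population.
weights = [1.0 / p_selected(s) for s in sample]
ipw_mean = sum(w * s for w, s in zip(weights, sample)) / sum(weights)
```

The naive sample mean lands well above the population mean (roughly 0.63 versus 0.50 under these assumptions), while the reweighted mean comes back close to the truth.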
When to watch for Selection Bias:
whenever participation is voluntary, data is missing non-randomly, or you're analyzing survivors (customers who didn't churn, products still in market, experiments that finished).
What is DBSCAN? (in ML interviews)
DBSCAN is a density-based clustering algorithm that finds clusters by linking points in dense neighborhoods.
No need to specify k upfront. Outliers fall out automatically as noise. It handles rings, moons, spirals. Things K-Means simply can't do.
The core idea: if enough points live within radius ε of a point, that point seeds a cluster and grows it outward by chaining dense neighborhoods together.
The core rule:
Nε(p) = { q ∈ D : d(p, q) ≤ ε }
Where:
Nε(p) → all points within distance ε of point p
ε → neighborhood radius (you set this)
minPts → minimum neighbors needed to be a core point
Every point gets one of three roles:
Core → has at least minPts neighbors within ε. Seeds and grows a cluster.
Border → inside a core's ε-ball but not dense enough itself. Joins the cluster, never expands it.
Noise → unreachable from any core. Labeled -1. Free outlier detection.
How it works:
① Pick an unvisited point
② Check if it has ≥ minPts neighbors within ε
③ If yes, start a new cluster and add all neighbors to a queue
④ Expand by checking each queued point the same way
⑤ Repeat until all points are visited. Noise points get label -1.
Density-reachability is asymmetric. A border point can be reached from a core, but can't reach back. Density-connectivity is symmetric. Two points are in the same cluster if some core point density-reaches both.
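The procedure can be sketched as a brute-force implementation. This is an illustration only, assuming Euclidean distance and a neighbor count that includes the point itself; `dbscan` is a hypothetical local function, not the scikit-learn API.

```python
import math

NOISE = -1

def dbscan(points, eps, min_pts):
    labels = [None] * len(points)            # None means "not visited yet"

    def neighbors(i):
        # N_eps(p): every point within distance eps of p (p itself included).
        return [j for j, q in enumerate(points)
                if math.dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE                # may be relabeled as border later
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(seeds)
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster          # border point: joins, never expands
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nbrs = neighbors(j)
            if len(nbrs) >= min_pts:         # core point: keep growing
                queue.extend(nbrs)
    return labels

# Two tight squares and one far-away point.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (5, 30)]
labels = dbscan(points, eps=1.5, min_pts=3)
```

The two squares become clusters 0 and 1, and the isolated point keeps the noise label -1 without any k being specified.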
How is it different from K-Means?
K-Means needs k upfront, makes hard spherical assignments, and has no concept of outliers.
DBSCAN discovers cluster count from density, handles any shape, and labels noise points automatically.
K-Means breaks on rings and spirals. DBSCAN doesn't.
The tradeoff: DBSCAN struggles when clusters have very different densities. A single ε can't fit all scales at once. That's where HDBSCAN and OPTICS come in. They remove the single-ε limit entirely.
Tuning tips:
minPts ≈ 2 × number of dimensions (or ln n for large datasets)
Pick ε at the knee of the k-distance plot. Scale your features first or distance metrics become meaningless.
When to use DBSCAN:
when clusters are non-convex, you don't know k, or outlier detection matters. Switch to HDBSCAN if density varies a lot across clusters.
What is Adam Optimizer? (in ML interviews)
Adam is an adaptive learning rate optimizer that combines momentum with per-parameter learning rates.
It tracks two moving averages: one for gradients (momentum) and one for squared gradients (adaptive rates). This lets it move fast in consistent directions while taking smaller steps in noisy dimensions.
Think of it as SGD with momentum plus RMSprop's adaptive scaling.
The update rule:
θ(t+1) = θ(t) - η × m̂(t) / (√v̂(t) + ε)
Where:
m(t) → first moment (momentum)
v(t) → second moment (squared gradients)
m̂(t), v̂(t) → bias-corrected estimates
η → learning rate (often 0.001)
β₁ → momentum decay (typically 0.9)
β₂ → second moment decay (typically 0.999)
ε → small constant for stability (1e-8)
How it works:
① Compute gradient g(t)
② Update first moment: m(t) = β₁m(t-1) + (1-β₁)g(t)
③ Update second moment: v(t) = β₂v(t-1) + (1-β₂)g(t)²
④ Correct bias: m̂(t) = m(t)/(1-β₁ᵗ), v̂(t) = v(t)/(1-β₂ᵗ)
⑤ Update parameters using corrected moments
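The five steps map almost line for line onto code. Below is a minimal sketch on a toy quadratic, assuming the common form of the update with ε added outside the square root; `adam_minimize` and `grad` are hypothetical names, and the learning rate of 0.01 is picked only to make the toy problem converge quickly.

```python
import math

def adam_minimize(grad, theta, steps, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    m = [0.0] * len(theta)   # first moment per parameter
    v = [0.0] * len(theta)   # second moment per parameter
    for t in range(1, steps + 1):
        g = grad(theta)                                    # step 1
        for i in range(len(theta)):
            m[i] = beta1 * m[i] + (1 - beta1) * g[i]       # step 2
            v[i] = beta2 * v[i] + (1 - beta2) * g[i] ** 2  # step 3
            m_hat = m[i] / (1 - beta1 ** t)                # step 4
            v_hat = v[i] / (1 - beta2 ** t)
            theta[i] -= lr * m_hat / (math.sqrt(v_hat) + eps)  # step 5
    return theta

# Toy objective f(x, y) = (x - 3)^2 + 10 * (y + 1)^2 with minimum at (3, -1).
def grad(theta):
    x, y = theta
    return [2 * (x - 3), 20 * (y + 1)]

theta = adam_minimize(grad, [0.0, 0.0], steps=5000, lr=0.01)
```

Both coordinates move at a similar pace even though the gradient in y is ten times steeper, which is the per-parameter scaling at work.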
How is it different from SGD with Momentum?
SGD+Momentum uses a fixed learning rate for all parameters and only tracks gradient momentum.
Adam adjusts learning rates per parameter based on gradient history and tracks both first and second moments. It needs less manual tuning but uses more memory.
SGD often needs careful learning rate schedules. Adam often works well out of the box with default hyperparameters, though it can still benefit from tuning.
When to use Adam:
when you want fast prototyping with minimal tuning, training transformers or RNNs, or working with sparse gradients and non-stationary objectives.
What is A/B Testing? (in data science interviews)
A/B testing is a statistical method to compare two variants and determine which performs better on a target metric.
You split users randomly into control (A) and treatment (B) groups, expose them to different versions, then measure if the difference in outcomes is real or just noise.
It's how companies decide whether a new feature actually improves conversion, retention, or revenue.
The test statistic:
Z = (p̂ᵦ - p̂ₐ) / sqrt(p̂(1-p̂)(1/nₐ + 1/nᵦ))
Where:
p̂ₐ, p̂ᵦ → observed conversion rates in each group
p̂ → pooled proportion under null hypothesis
nₐ, nᵦ → sample sizes per group
How it works:
① Define your metric and minimum detectable effect
② Calculate required sample size based on α (Type I error) and power
③ Randomly assign users to control or treatment
④ Run the test until you hit sample size
⑤ Compute Z-score and compare to critical value
⑥ Reject null if |Z| exceeds threshold, otherwise fail to reject
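The last two steps can be sketched with the standard library. The conversion counts are invented, and `two_proportion_z_test` is a hypothetical helper implementing the pooled Z statistic given above with a two-sided test at α = 0.05.

```python
import math
from statistics import NormalDist

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Pooled two-proportion Z statistic and two-sided p-value."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Invented example: 10.0% vs 11.0% conversion with 10,000 users per group.
z, p_value = two_proportion_z_test(conv_a=1000, n_a=10_000, conv_b=1100, n_b=10_000)
critical = NormalDist().inv_cdf(0.975)     # about 1.96
reject_null = abs(z) > critical
```

For these counts Z is about 2.31, which exceeds 1.96, so the null of equal conversion rates is rejected at the 5% level.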
How is it different from Bayesian A/B?
Classical A/B gives you a binary decision (reject or not) based on p-values and requires fixed sample sizes.
Bayesian A/B gives you a probability distribution over the treatment effect, lets you stop early when confident, and directly answers "what's the chance B is better?"
Classical is easier to explain to stakeholders. Bayesian is more flexible but needs prior specification.
When to use A/B Testing:
when you need to validate product changes with statistical confidence before rolling them out to everyone.
What is Class Imbalance Handling? (in ML interviews)
Class imbalance is when one class dominates the data and the model learns to ignore the rare class entirely.
Say you have 190 majority points and 10 minority points. An unweighted model hits recall of 0.2 on the rare class. It's basically guessing majority every time. Reweight the loss, and recall jumps to 0.9. Same data, very different boundary.
The fix isn't one thing. It's a toolkit: reweight, resample, or shift the threshold.
The formulas:
Weighted Loss:
L = -(1/N) × sum of w_yi × [yi × log(p̂i) + (1 - yi) × log(1 - p̂i)]
Where:
w_yi → per-sample class weight, inflates rare-class gradients
yi → true label
p̂i → predicted probability
Inverse-Frequency Weight:
w_c = N / (K × N_c)
Where:
N → total samples
K → number of classes
N_c → samples in class c
Focal Loss (for dense/vision tasks):
FL(pt) = -αt × (1 - pt)^γ × log(pt)
Where:
(1 - pt)^γ → down-weights easy examples, focuses gradient on hard ones
αt → class balancing factor
γ → tunable focus parameter (RetinaNet uses γ=2)
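A quick numeric check of two of these formulas, using the 190 vs 10 split from the example above. The helper names are hypothetical, and the default `alpha_t=0.25` is just an assumed value for illustration.

```python
import math
from collections import Counter

def inverse_frequency_weights(labels):
    """w_c = N / (K * N_c) for every class c."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * n_c) for c, n_c in counts.items()}

def focal_loss(p_t, alpha_t=0.25, gamma=2.0):
    """FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t) for a single example."""
    return -alpha_t * (1 - p_t) ** gamma * math.log(p_t)

labels = [0] * 190 + [1] * 10
weights = inverse_frequency_weights(labels)

easy = focal_loss(0.95)   # confidently correct example
hard = focal_loss(0.30)   # poorly predicted example
```

The minority class gets weight 10 and the majority class about 0.53, and the focal term shrinks the loss of the easy example far more than that of the hard one.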
How to apply it (in order of effort):
① Start with inverse-frequency weights. Pass class_weight to your loss. No resampling needed. Works well for tabular data with mild skew.
② Try SMOTE if weights aren't enough. It interpolates between a minority sample and its k-NN neighbors to synthesize new minority samples. Only resample the training fold, never validation or test.
③ Tune the decision threshold. Train on raw data, then move the cutoff from 0.5 to maximize F1 (or recall at a minimum acceptable precision) on validation. Cost-sensitive: minimize C_FN × FN + C_FP × FP.
④ Use focal loss for dense prediction tasks like object detection. It down-weights easy majority examples automatically.
⑤ Pick the right metrics. Drop accuracy entirely. A model that always predicts a 99% majority class scores 99% accuracy and is useless. Use PR-AUC, F1, or recall@k instead. Stratify your CV folds so each fold preserves the base rate.
Resampling vs Reweighting:
Resampling (SMOTE) changes the data itself. It can overfit in high dimensions and adds preprocessing complexity. Best for low-dimensional structured data.
Reweighting changes the loss function. It's cheap, simple, and doesn't touch the data distribution. Works with any model that accepts sample weights.
Threshold tuning doesn't change training at all. It just shifts where you draw the line at inference time. Useful when you have a cost asymmetry between false positives and false negatives.
When to use Class Imbalance Handling:
any time your rare class is the one that actually matters. Fraud detection, medical diagnosis, churn prediction. Start with reweighting, validate with PR-AUC, and tune the threshold last.
What is Ridge Regression? (in ML interviews)
Ridge Regression is linear regression with L2 penalty.
It shrinks coefficients toward zero to fight overfitting and handle multicollinearity. Unlike Lasso, it never eliminates features completely. All coefficients stay in the model, just smaller.
Think: gentle pressure on all weights instead of aggressive feature selection.
The objective:
β̂ridge = argmin { Σ(yi - Xiβ)² + λ Σβj² }
Where:
Σ(yi - Xiβ)² → RSS (residual sum of squares)
λ Σβj² → L2 penalty on coefficient magnitudes
λ → regularization strength (controls shrinkage)
How it works:
① Start with ordinary least squares setup
② Add penalty term that grows with coefficient size
③ Solve closed-form: β̂ = (XᵀX + λI)⁻¹Xᵀy
④ Tune λ via cross-validation to balance bias and variance
The λI term guarantees invertibility for any λ > 0, even when features are collinear. This stabilizes estimates.
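The invertibility point is easy to see on a tiny example. This sketch assumes exactly two features and no intercept so the closed form can be solved by hand with Cramer's rule; `ridge_2d` is a hypothetical helper, and the data is made up with the second column duplicating the first.

```python
def ridge_2d(X, y, lam):
    """Solve (X^T X + lam * I) beta = X^T y for two features via Cramer's rule."""
    a = sum(r[0] * r[0] for r in X) + lam
    b = sum(r[0] * r[1] for r in X)
    d = sum(r[1] * r[1] for r in X) + lam
    u = sum(r[0] * t for r, t in zip(X, y))
    w = sum(r[1] * t for r, t in zip(X, y))
    det = a * d - b * b      # zero when lam == 0 and the columns are collinear
    return [(d * u - b * w) / det, (a * w - b * u) / det]

X = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]   # perfectly collinear features
y = [2.0, 4.0, 6.0]
beta = ridge_2d(X, y, lam=1.0)
```

With λ = 1 the determinant is 29 and both coefficients come out as 28/29, splitting the signal evenly. With λ = 0 the determinant is exactly zero and the same call fails with a division by zero, which is the OLS instability the penalty removes.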
How is it different from Lasso?
Lasso uses L1 penalty (absolute values) and drives some coefficients exactly to zero. It does feature selection.
Ridge uses L2 penalty (squared values) and shrinks all coefficients but keeps them nonzero. It keeps all features.
Lasso is sparse. Ridge is smooth.
When to use Ridge:
when you have correlated features (r > 0.7), more features than samples, or unstable OLS estimates. Also when you want all features to contribute rather than selecting a subset.
What is Network Interference? (in A/B test interviews)
Network interference happens when treating one user affects another user's outcome.
Standard A/B tests assume independence. But in social networks, marketplaces, or shared resources, users interact. Your treatment group influences your control group through connections.
This breaks SUTVA (Stable Unit Treatment Value Assumption) and makes naive estimates biased.
The problem:
Yi(z) = Yi(zi, z-i) ≠ Yi(zi)
Where:
Yi(z) → outcome for unit i under assignment vector z
zi → treatment assigned to unit i
z-i → treatment assigned to all other units
The inequality shows unit i's outcome depends on others' assignments, violating independence.
Types of interference:
① Direct effect: treating unit i changes their own outcome
② Spillover effect: treating unit i changes connected units' outcomes
③ Contamination: control users get exposed through treated neighbors
Real example: you test a referral feature. Treated users invite control users. Control group gets the benefit without the treatment flag.
How to measure it:
Direct treatment effect = E[Yi(1,0) - Yi(0,0)]
Spillover effect = E[Yi(0,z_N) - Yi(0,0)]
First isolates individual impact. Second captures peer influence from having treated neighbors z_N.
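A toy simulation of the two effects. Everything here is assumed for illustration: each user has exactly one friend, outcomes are noise-free, and the direct and spillover effect sizes are made-up constants.

```python
import random

random.seed(1)
DIRECT, SPILL, BASE = 2.0, 1.0, 10.0   # assumed effect sizes for the toy model
n_pairs = 20_000                       # every user has exactly one friend

def outcome(z_self, z_friend):
    return BASE + DIRECT * z_self + SPILL * z_friend

def diff_in_means(assignments):
    """Treated-minus-control mean outcome over all users in all pairs."""
    treated, control = [], []
    for z_a, z_b in assignments:
        for z_self, z_friend in ((z_a, z_b), (z_b, z_a)):
            (treated if z_self else control).append(outcome(z_self, z_friend))
    return sum(treated) / len(treated) - sum(control) / len(control)

# User-level randomization: friends are assigned independently.
user_level = [(random.randint(0, 1), random.randint(0, 1)) for _ in range(n_pairs)]
# Cluster randomization: both friends share one coin flip.
cluster_level = [(z, z) for z in (random.randint(0, 1) for _ in range(n_pairs))]

naive_estimate = diff_in_means(user_level)
cluster_estimate = diff_in_means(cluster_level)
```

Under these assumptions the effect of treating everyone is DIRECT + SPILL = 3. User-level randomization recovers only about 2 because control users also benefit from treated friends, while cluster randomization recovers the full 3.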
How is it different from regular A/B tests?
Regular A/B tests assume treating one user doesn't affect others and randomize at the user level.
Network interference means users influence each other, requires cluster randomization (groups of connected users), and needs special estimators like Horvitz-Thompson to get unbiased effects.
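Ignoring spillover biases standard test estimates, and the direction depends on the mechanism: when control users benefit through treated neighbors, as in the referral example, the naive difference understates the true effect.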
When to worry about network interference:
social features, marketplaces with supply/demand dynamics, shared inventory, pricing changes, or anything where users interact directly or compete for resources.
What is Dropout Regularization? (in ML interviews)
Dropout is a regularization technique that randomly turns off neurons during training.
By forcing the network to learn without relying on any single neuron, it prevents co-adaptation (where neurons become too dependent on each other) and reduces overfitting.
Think of it like training a team where random members are absent each day. Everyone learns to be self-sufficient.
The math:
ŷ⁽ˡ⁾ = r⁽ˡ⁾ * y⁽ˡ⁾
r⁽ˡ⁾ ~ Bernoulli(p)
Where:
y⁽ˡ⁾ → neuron activations at layer l
r⁽ˡ⁾ → binary mask (0 or 1 for each neuron)
p → keep probability (typically 0.5 for hidden layers)
* → element-wise multiplication
How it works:
① During training: randomly set each neuron to zero with probability (1-p)
② Active neurons pass their values forward normally
③ Backprop only updates weights connected to active neurons
④ At test time: use all neurons but scale outputs by p (or use inverted dropout to avoid this)
Each training batch sees a different "thinned" network.
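A minimal sketch of the inverted dropout variant mentioned in the last step, assuming a flat list of activations. `dropout` is a hypothetical local function, not a framework API.

```python
import random

def dropout(activations, keep_prob, training, rng=random):
    """Inverted dropout: rescale at training time so test time needs no scaling."""
    if not training:
        return list(activations)
    mask = [1 if rng.random() < keep_prob else 0 for _ in activations]
    # Scale kept units by 1 / keep_prob so the expected activation is unchanged.
    return [a * r / keep_prob for a, r in zip(activations, mask)]

rng = random.Random(0)
acts = [1.0] * 10_000
train_out = dropout(acts, keep_prob=0.5, training=True, rng=rng)
test_out = dropout(acts, keep_prob=0.5, training=False)
train_mean = sum(train_out) / len(train_out)
```

During training every unit is either 0 or doubled, so the mean activation stays near 1. At test time the activations pass through untouched.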
How is it different from L2 Regularization?
L2 adds a penalty term to the loss function and shrinks all weights proportionally. It's deterministic.
Dropout randomly removes neurons during training and approximates training an ensemble of networks. It's stochastic and often works better for deep networks.
You can use both together.
When to use Dropout:
when training deep networks prone to overfitting, especially with limited data. Higher drop rates (0.5) for large layers, lower (0.2) for smaller ones.
What is SHAP? (in ML interviews)
SHAP is a game-theoretic framework for explaining model predictions.
It answers: "how much did each feature contribute to this specific prediction?" It borrows from cooperative game theory, treating features as players sharing credit for the model's output.
The key insight: a feature's SHAP value is its average marginal contribution across every possible ordering of features. Fair, consistent, and model-agnostic.
The Shapley formula:
φᵢ = Σ [|S|!(|F|-|S|-1)! / |F|!] × [f(S∪{i}) - f(S)], summed over all S ⊆ F \ {i}
Where:
φᵢ → SHAP value for feature i
S → a coalition (subset) of other features
f(S∪{i}) - f(S) → marginal contribution of adding feature i to coalition S
|F| → total number of features
The sum of all SHAP values equals f(x) - E[f(X)]. That's the efficiency axiom. Every bit of the prediction gap is explained.
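For a handful of features the formula can be evaluated exactly by brute force. The sketch below assumes a toy model and replaces the expectation E[f(X)] with a single all-zeros baseline, which is a simplification; `shapley_values` and `value` are hypothetical names.

```python
from itertools import combinations
from math import factorial

def shapley_values(value, n_features):
    """Exact Shapley values; `value(S)` maps a frozenset of feature indices to a payoff."""
    features = range(n_features)
    phi = [0.0] * n_features
    for i in features:
        others = [j for j in features if j != i]
        for size in range(len(others) + 1):
            weight = (factorial(size) * factorial(n_features - size - 1)
                      / factorial(n_features))
            for subset in combinations(others, size):
                s = frozenset(subset)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi

# Toy model f(x) = 2*x0 + x1*x2, explained at x = (1, 1, 1) against a zero baseline.
x = (1.0, 1.0, 1.0)

def value(subset):
    # Features outside the coalition are set to the baseline value 0.
    z = [x[j] if j in subset else 0.0 for j in range(3)]
    return 2 * z[0] + z[1] * z[2]

phi = shapley_values(value, 3)
```

The result is roughly [2.0, 0.5, 0.5]: the interaction term is split evenly between features 1 and 2, and the values sum to f(x) minus the baseline output, as efficiency requires.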
The four axioms:
① Efficiency: all SHAP values sum to f(x) - E[f(X)]
② Symmetry: two features with equal contributions get equal credit
③ Dummy: a feature that changes nothing gets φ = 0
④ Additivity: SHAP values from ensemble models can be summed across trees
These axioms make SHAP the only attribution method that satisfies all four simultaneously.
Three main implementations:
① TreeSHAP: exact and fast for tree-based models (XGBoost, LightGBM). Use this by default for tabular data.
② DeepSHAP: approximation for neural networks using backprop. Fast but less exact.
③ KernelSHAP: model-agnostic. Fits a weighted linear surrogate around perturbed samples. Works on anything, but cost grows as 2^M so it's slow for many features.
Interventional vs Observational SHAP:
Interventional marginalizes features independently. Causal-style. Can produce out-of-distribution inputs.
Observational respects feature correlations via the conditional distribution. Stays in-distribution, but splits credit strangely when features are correlated.
Correlated features are the hard part. Two correlated features can each get low SHAP values even if together they matter a lot. Prefer interventional SHAP when you want cleaner attribution and have correlated inputs.
When to use SHAP:
when you need to explain individual predictions, debug a model, satisfy stakeholders, or compare global feature importance across a dataset by aggregating |φᵢ| values.
What is Metric Selection for A/B Tests? (in A/B test interviews)
Metric selection is the process of choosing primary, guardrail, and diagnostic metrics for an experiment.
You need to balance statistical sensitivity (can we detect a change?) with business alignment (does this metric actually matter?).
Pick wrong and you either run tests forever or ship changes that hurt the product.
The tradeoff:
Sensitivity vs. Business Alignment
High sensitivity → low variance, easy to detect changes, but might not reflect real value
High alignment → captures long-term impact, but noisy and needs huge samples
Sample size formula (per group):
n = (Zα/2 + Zβ)² × 2σ² / δ²
Where:
Zα/2 → critical value for the significance level (1.96 for two-sided α = 0.05)
Zβ → critical value for the power (0.84 for 80% power)
σ² → metric variance
δ → minimum detectable effect
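Plugging numbers into the sample size formula shows why sensitivity matters. This sketch assumes a conversion metric with a 10% base rate, so the variance is p(1-p); `sample_size_per_group` is a hypothetical helper.

```python
import math
from statistics import NormalDist

def sample_size_per_group(sigma, mde, alpha=0.05, power=0.80):
    """n = (Z_alpha/2 + Z_beta)^2 * 2 * sigma^2 / delta^2, rounded up."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # about 1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)            # about 0.84 for 80% power
    return math.ceil((z_alpha + z_beta) ** 2 * 2 * sigma ** 2 / mde ** 2)

# Assumed conversion metric with a 10% base rate: variance = p * (1 - p).
sigma = math.sqrt(0.10 * 0.90)
n_small_effect = sample_size_per_group(sigma, mde=0.01)   # detect a 1-point lift
n_large_effect = sample_size_per_group(sigma, mde=0.02)   # detect a 2-point lift
```

Detecting a 1-point lift needs roughly 14,000 users per group under these assumptions. Halving the detectable effect quadruples the required sample, because δ enters the formula squared.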
The four types:
① Primary (1 metric)
The single decision metric tied to your hypothesis.
Example: conversion rate for a checkout redesign.
② Guardrail (2-4 metrics)
Must-not-harm constraints that protect the business.
Example: latency (p99) when testing a new recommendation model.
③ Secondary (3-5 metrics)
Explain the mechanism. No launch gate but help you understand why.
Example: CTR to explain why conversion changed.
④ Diagnostic (10+ metrics)
Debug and segment drill-down. Page views, error rates, cohort splits.
How to choose:
① Start with business goal and map to metrics
② Calculate MDE for each candidate using variance and sample size
③ Pick primary with best sensitivity/alignment balance
④ Add guardrails for anything that could break
⑤ Pre-register everything to avoid p-hacking
Common mistakes:
Using revenue-per-user as primary when you should use conversion rate (sensitivity matters).
Picking too many primaries without multiplicity correction (inflates false positives).
Choosing vanity metrics that move easily but don't correlate with long-term value.
Forgetting to check if your metric is actually movable by the treatment.
When to use Metric Selection:
before you run any A/B test. Pick metrics first, then design the experiment around them.
What is the Bias-Variance Tradeoff? (in ML interviews)
The bias-variance tradeoff explains why models fail to generalize and how complexity affects prediction error.
Every model's error comes from three sources: bias (underfitting), variance (overfitting), and noise you can't avoid. You can't minimize both bias and variance at once. Reducing one increases the other.
This is the fundamental tension in machine learning.
The decomposition:
Expected Error = Bias² + Variance + Irreducible Noise
Where:
Bias² → error from wrong assumptions (too simple)
Variance → error from sensitivity to training noise (too complex)
Irreducible Noise → randomness in data itself
How it works (as complexity increases):
① Simple models: high bias, low variance (can't fit training data)
② Moderate models: bias and variance balanced (sweet spot)
③ Complex models: low bias, high variance (memorizes training noise)
④ Training error keeps dropping, but test error rises after the sweet spot
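The two extremes can be measured directly by refitting on many resampled training sets. This is a simulation sketch with assumed ingredients: a sine target, Gaussian noise, a constant predictor as the "too simple" model, and 1-nearest-neighbor as the "too complex" one. All function names are hypothetical.

```python
import math
import random

random.seed(0)

def true_f(x):
    return math.sin(2 * math.pi * x)

def make_training_set(n=20, noise=0.3):
    xs = [random.random() for _ in range(n)]
    return [(x, true_f(x) + random.gauss(0, noise)) for x in xs]

def fit_constant(train):                  # too simple: always predicts the mean
    mean_y = sum(y for _, y in train) / len(train)
    return lambda x: mean_y

def fit_1nn(train):                       # too flexible: copies the nearest label
    return lambda x: min(train, key=lambda p: abs(p[0] - x))[1]

def bias2_and_variance(fit, test_xs, n_repeats=300):
    """Average squared bias and variance of predictions across training sets."""
    preds = [[] for _ in test_xs]
    for _ in range(n_repeats):
        model = fit(make_training_set())
        for k, x in enumerate(test_xs):
            preds[k].append(model(x))
    bias2 = variance = 0.0
    for x, p in zip(test_xs, preds):
        mean_p = sum(p) / len(p)
        bias2 += (mean_p - true_f(x)) ** 2
        variance += sum((v - mean_p) ** 2 for v in p) / len(p)
    return bias2 / len(test_xs), variance / len(test_xs)

test_xs = [i / 20 + 0.025 for i in range(20)]
const_bias2, const_var = bias2_and_variance(fit_constant, test_xs)
nn_bias2, nn_var = bias2_and_variance(fit_1nn, test_xs)
```

The constant model shows large squared bias and small variance, and 1-nearest-neighbor shows the reverse, matching the first and third rows of the list above.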
How is overfitting different from underfitting?
Overfitting means training error is way lower than test error. The model memorized noise instead of patterns. High variance problem.
Underfitting means both errors stay high. The model is too simple to capture real patterns. High bias problem.
Fix overfitting by adding regularization, dropout, or more data. Fix underfitting by adding features, increasing model capacity, or training longer.
When to use this framework:
when debugging why your model performs poorly, choosing model complexity, or explaining why cross-validation matters in interviews.
What is Gradient Boosting? (in ML interviews)
Gradient Boosting is an ensemble method that builds a strong predictor by sequentially adding weak learners.
Each new model fits the negative gradient of the loss function. This means it corrects the errors of all previous models combined.
The result? A powerful additive model that improves step by step.
The core update:
Fm(x) = Fm-1(x) + ν · hm(x)
Where:
Fm(x) → ensemble prediction after m trees
hm(x) → new weak learner (usually a shallow tree)
ν → learning rate (controls step size)
Each tree fits pseudo-residuals, which are the negative gradient of the loss with respect to current predictions.
How it trains:
① Start with an initial prediction (often the mean)
② Compute pseudo-residuals (negative gradient of loss)
③ Fit a weak learner to those residuals
④ Update the ensemble by adding the new tree (scaled by learning rate)
⑤ Repeat for M iterations
Loss decreases as the ensemble grows. Each tree corrects what came before.
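A compact sketch of the training loop, assuming squared-error loss (so the pseudo-residuals are plain residuals), 1-D inputs, and single-split stumps as the weak learner. `fit_stump` and `gradient_boost` are hypothetical names and the data is invented.

```python
def fit_stump(xs, residuals):
    """Best single-split regressor on 1-D inputs under squared error."""
    best = None
    for thr in sorted(set(xs))[1:]:          # skip the minimum so both sides are non-empty
        left = [r for x, r in zip(xs, residuals) if x < thr]
        right = [r for x, r in zip(xs, residuals) if x >= thr]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((r - lm) ** 2 for r in left)
               + sum((r - rm) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, thr, lm, rm)
    _, thr, lm, rm = best
    return lambda x: lm if x < thr else rm

def gradient_boost(xs, ys, n_trees=50, lr=0.1):
    f0 = sum(ys) / len(ys)                               # step 1: start from the mean
    trees, preds, history = [], [f0] * len(xs), []
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, preds)]   # step 2: negative gradient
        tree = fit_stump(xs, residuals)                  # step 3: fit weak learner
        trees.append(tree)
        preds = [p + lr * tree(x) for p, x in zip(preds, xs)]  # step 4: shrunken update
        history.append(sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys))
    return (lambda x: f0 + lr * sum(t(x) for t in trees)), history

xs = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
ys = [1.0, 1.2, 0.9, 3.1, 3.0, 2.8, 5.2, 5.0, 4.9, 5.1]
model, mse_history = gradient_boost(xs, ys)
```

`mse_history` records the training error after each tree, so it shows the step-by-step improvement directly.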
How is it different from Random Forest?
Random Forest trains trees in parallel on bootstrapped samples and averages predictions. Trees are independent.
Gradient Boosting trains trees sequentially. Each tree depends on the errors of previous ones. It optimizes a loss function directly.
Random Forest reduces variance. Gradient Boosting reduces bias.
When to use Gradient Boosting:
when you need high accuracy on structured or tabular data and can afford careful tuning of depth and learning rate.
What is the Synthetic Control Method? (in A/B test interviews)
Synthetic Control is a causal inference method that builds a weighted combination of untreated units to mimic what would have happened to the treated unit without intervention.
You create a fake control group from real data. The weights are chosen so the synthetic control matches the treated unit's pre-treatment behavior as closely as possible.
Think: if you can't randomize, build the counterfactual yourself.
The model:
Ŷ₁ₜᴺ = Σ wⱼ Yⱼₜ for j=2 to J+1
Where:
Ŷ₁ₜᴺ → synthetic control outcome (what treated unit would've been)
wⱼ → donor weights (non-negative, sum to 1)
Yⱼₜ → observed outcomes from donor units
J → number of donor units
How it works:
① Collect pre-treatment data for treated unit and donor pool
② Solve optimization: minimize distance between treated unit's pre-treatment characteristics and weighted donors
③ Apply those same weights to post-treatment period
④ Treatment effect = actual outcome minus synthetic control outcome
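With only two donors the optimization collapses to a single weight, which makes the four steps easy to follow. This sketch uses invented outcome series and a simple grid search in place of a proper constrained solver; `fit_weight` is a hypothetical helper.

```python
def fit_weight(treated_pre, donor_a_pre, donor_b_pre, grid=1000):
    """Grid-search w in [0, 1] so w*A + (1-w)*B tracks the treated unit pre-treatment."""
    def loss(w):
        return sum((t - (w * a + (1 - w) * b)) ** 2
                   for t, a, b in zip(treated_pre, donor_a_pre, donor_b_pre))
    return min((k / grid for k in range(grid + 1)), key=loss)

# Invented outcomes: 5 pre-treatment periods, then 3 post-treatment periods.
donor_a = [10, 11, 12, 13, 14, 15, 16, 17]
donor_b = [20, 20, 21, 21, 22, 22, 23, 23]
treated = [13, 13.7, 14.7, 15.4, 16.4, 19.1, 20.1, 20.8]

w = fit_weight(treated[:5], donor_a[:5], donor_b[:5])                  # steps 1-2
synthetic_post = [w * a + (1 - w) * b
                  for a, b in zip(donor_a[5:], donor_b[5:])]           # step 3
effects = [t - s for t, s in zip(treated[5:], synthetic_post)]         # step 4
```

The fitted weights are 0.7 on donor A and 0.3 on donor B, and the estimated effect is about +2 in each post-treatment period, which is the lift built into the invented data.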
How is it different from Diff-in-Diff?
Diff-in-Diff uses parallel trends assumption and averages all control units equally.
Synthetic Control builds a custom weighted control, doesn't require parallel trends before treatment, and works with just one treated unit.
Diff-in-Diff needs multiple treated units for statistical power. Synthetic Control shines when you have one intervention (like a policy change in one state).
When to use Synthetic Control:
when you have one treated unit, a long pre-treatment period, and a pool of similar untreated units to build your counterfactual from.
What is Bayesian Hyperparameter Tuning? (in ML interviews)
Bayesian optimization is a smart search strategy that finds optimal hyperparameters using far fewer evaluations than grid or random search.
Instead of blindly testing combinations, it builds a probabilistic model of your objective function and intelligently picks the next point to try.
Think: learning from each experiment to guide the next one, not just guessing.
🧮 The model (Gaussian Process):
f(x) ~ GP(m(x), k(x, x'))
Where:
f(x) → your objective (validation accuracy, loss, etc.)
m(x) → mean prediction at point x
k(x, x') → covariance function (captures smoothness)
GP → gives both prediction and uncertainty at every point
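A rough sketch of what "prediction and uncertainty at every point" means, assuming a zero prior mean m(x) = 0, an RBF kernel with length scale 1, and three hypothetical observations of a one-dimensional hyperparameter. The tiny linear solver is only there to keep the example dependency-free; a real implementation would use a numerical library.

```python
import math


def rbf(a, b, length=1.0):
    """RBF covariance: nearby inputs get highly correlated outputs."""
    return math.exp(-0.5 * ((a - b) / length) ** 2)


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def gp_posterior(xs, ys, x_new, noise=1e-8):
    """Posterior mean and variance at x_new under a zero-mean GP prior."""
    K = [[rbf(a, b) + (noise if i == j else 0.0)
          for j, b in enumerate(xs)] for i, a in enumerate(xs)]
    k_star = [rbf(a, x_new) for a in xs]
    alpha = solve(K, ys)        # K^-1 y
    v = solve(K, k_star)        # K^-1 k*
    mean = sum(k * a for k, a in zip(k_star, alpha))
    var = rbf(x_new, x_new) - sum(k * vi for k, vi in zip(k_star, v))
    return mean, max(var, 0.0)


# Hypothetical observations: hyperparameter value -> validation score.
xs = [0.0, 1.0, 3.0]
ys = [0.2, 0.8, 0.3]

mean_seen, var_seen = gp_posterior(xs, ys, 1.0)    # at an observed point
mean_far, var_far = gp_posterior(xs, ys, 10.0)     # far from all data
print(mean_seen, var_seen)
print(mean_far, var_far)
```

At an observed point the posterior reproduces the observation with almost no variance. Far from the data it falls back to the prior: mean near 0 and variance near 1.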
⚡ How it works:
① Start with 5-10 random evaluations to initialize
② Fit a Gaussian Process to observed points
③ Use acquisition function to pick next point (balances exploration vs exploitation)
④ Evaluate objective at that point
⑤ Update GP with new observation
⑥ Repeat until budget exhausted or convergence
The acquisition function is key. Expected Improvement (EI) is most common:
EI(x) = E[max(0, f(x) - f(x*))]
Where:
f(x*) → current best observed value
EI → balances high mean (exploitation) with high variance (exploration)
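When the GP posterior at x is Gaussian with mean μ and standard deviation σ, the expectation above has a standard closed form for maximization: EI = (μ - f(x*))·Φ(z) + σ·φ(z), with z = (μ - f(x*))/σ. Below is a small sketch of that formula. The two candidate points and their posterior values are made up for illustration.

```python
import math


def expected_improvement(mu, sigma, best):
    """Closed-form EI for maximization under a Gaussian posterior N(mu, sigma^2)."""
    if sigma <= 0.0:
        # No uncertainty: improvement is deterministic.
        return max(0.0, mu - best)
    z = (mu - best) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return (mu - best) * cdf + sigma * pdf


best_so_far = 0.80  # hypothetical best validation accuracy observed

# Candidate A: slightly better mean, almost no uncertainty (exploitation).
ei_a = expected_improvement(mu=0.82, sigma=0.01, best=best_so_far)
# Candidate B: slightly worse mean, large uncertainty (exploration).
ei_b = expected_improvement(mu=0.78, sigma=0.10, best=best_so_far)

print(ei_a, ei_b)
```

With these numbers the uncertain candidate B gets the higher EI even though its mean is below the current best, which is the exploration side of the trade-off.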
🎯 Explore vs Exploit:
The GP gives uncertainty estimates everywhere. High uncertainty means unexplored regions.
EI picks points that either look promising (high mean) or uncertain (high variance). This balance is automatic.
Grid search wastes time on bad regions. Bayesian opt learns to avoid them.
How is it different from Random Search?
Random search samples uniformly without learning. Every trial is independent.
Bayesian optimization builds a model and uses past results to inform future choices. It concentrates trials in promising areas.
Random search often needs hundreds of evaluations to reach the same quality. Bayesian opt often finds good configs in 20-50 trials.
Random search works for any function. Bayesian opt assumes smoothness (nearby hyperparameters give similar performance).
When to use Bayesian Hyperparameter Tuning:
when each evaluation is expensive (training deep nets, large datasets) and you need the best config with a limited budget. Tools like Optuna and HyperOpt make this easy, though they typically use a Tree-structured Parzen Estimator (TPE) surrogate rather than a Gaussian Process.
What are Ensemble Methods? (in ML interviews)
Ensemble methods combine multiple weak learners into one strong predictor.
Instead of trusting one model, you train many and aggregate their predictions. This reduces variance, bias, or both depending on the method.
Think: asking 100 people instead of 1 expert. The crowd's average is often better than any individual guess.
The formula:
f̂(x) = Σ αm hm(x) for m=1 to M
Where:
M → number of base learners
hm(x) → prediction from model m
αm → weight for model m
f̂(x) → final ensemble prediction
⚡ Two main approaches:
Bagging (parallel training):
① Train M models on bootstrap samples (random subsets with replacement)
② Average predictions (regression) or vote (classification)
③ Reduces variance while keeping bias constant
④ Example: Random Forest
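A small sketch of the bagging steps, assuming a toy regression problem (noisy samples of y = x²) and a 1-nearest-neighbour regressor as a deliberately high-variance base learner. Both choices are illustrative and not tied to any particular library.

```python
import random


def make_data(rng, n=40, noise=0.3):
    """Noisy samples of the assumed true function y = x^2 on [-1, 1]."""
    data = []
    for _ in range(n):
        x = rng.uniform(-1.0, 1.0)
        data.append((x, x * x + rng.gauss(0.0, noise)))
    return data


def fit_1nn(sample):
    """Base learner: predict the label of the closest training point."""
    def predict(x):
        nearest = min(sample, key=lambda point: abs(point[0] - x))
        return nearest[1]
    return predict


def bagging(data, n_models, rng):
    """Train each model on a bootstrap sample (drawn with replacement)."""
    models = []
    for _ in range(n_models):
        bootstrap = [rng.choice(data) for _ in range(len(data))]
        models.append(fit_1nn(bootstrap))
    return models


rng = random.Random(0)
data = make_data(rng)
models = bagging(data, n_models=25, rng=rng)

# Average the member predictions at one query point (true value 0.25).
x0, y0 = 0.5, 0.25
member_preds = [model(x0) for model in models]
bagged_pred = sum(member_preds) / len(member_preds)

bagged_sq_err = (bagged_pred - y0) ** 2
avg_member_sq_err = sum((p - y0) ** 2 for p in member_preds) / len(member_preds)
print(bagged_sq_err, avg_member_sq_err)
```

The averaged prediction can never have a larger squared error than the members have on average, and the gap is exactly the spread of the member predictions. That spread is the variance bagging removes.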
Boosting (sequential training):
① Train model on full data
② Identify misclassified examples
③ Train next model focusing on those errors
④ Weight models by performance
⑤ Combine with weighted sum
⑥ Reduces bias primarily
⑦ Example: XGBoost, AdaBoost
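A compact sketch of the boosting steps using AdaBoost with decision stumps on a hypothetical six-point dataset. The labels are chosen so that no single stump classifies every point correctly. It uses the usual AdaBoost learner weight α = ½ ln((1-ε)/ε) and multiplies each sample weight by exp(-α·y·h(x)), so misclassified points gain weight.

```python
import math

# Hypothetical 1-D data: positives on both ends, negatives in the middle.
X = [1, 2, 3, 4, 5, 6]
y = [1, 1, -1, -1, 1, 1]


def stump_predict(x, threshold, sign):
    """Decision stump: predict `sign` left of the threshold, `-sign` right."""
    return sign if x < threshold else -sign


def best_stump(X, y, weights):
    """Pick the stump with the lowest weighted error."""
    best = None
    thresholds = [x - 0.5 for x in X] + [max(X) + 0.5]
    for threshold in thresholds:
        for sign in (1, -1):
            err = sum(w for x, label, w in zip(X, y, weights)
                      if stump_predict(x, threshold, sign) != label)
            if best is None or err < best[0]:
                best = (err, threshold, sign)
    return best


def adaboost(X, y, rounds):
    n = len(X)
    weights = [1.0 / n] * n                      # start with equal weights
    ensemble = []
    for _ in range(rounds):
        err, threshold, sign = best_stump(X, y, weights)
        err = min(max(err, 1e-10), 1 - 1e-10)    # guard against log(0)
        alpha = 0.5 * math.log((1 - err) / err)  # better stump -> larger weight
        ensemble.append((alpha, threshold, sign))
        # Up-weight misclassified points, down-weight correct ones.
        weights = [w * math.exp(-alpha * label * stump_predict(x, threshold, sign))
                   for x, label, w in zip(X, y, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble


def predict(ensemble, x):
    """Sign of the weighted sum of stump predictions."""
    score = sum(alpha * stump_predict(x, t, s) for alpha, t, s in ensemble)
    return 1 if score >= 0 else -1


single_error = best_stump(X, y, [1.0 / len(X)] * len(X))[0]
ensemble = adaboost(X, y, rounds=3)
predictions = [predict(ensemble, x) for x in X]
print(single_error, predictions)
```

The best single stump still gets a third of the points wrong, while three reweighted stumps combined fit this toy data exactly.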
🧠 How is Bagging different from Boosting?
Bagging trains models independently in parallel on random subsets. It reduces variance. Works best with high-variance models like deep trees.
Boosting trains models sequentially where each corrects the previous one's mistakes. It reduces bias. Works best when you need to squeeze out every bit of accuracy.
When to use Ensembles:
when a single model isn't accurate enough, you have diverse base learners, or you're competing on Kaggle and need that extra 2% performance boost.
What is a Loss Function? (in ML interviews)
A loss function is a mathematical measure of how wrong your model's predictions are.
It takes the difference between predicted and actual values, then converts that into a single number. The model's job during training? Make that number as small as possible.
Every gradient descent step moves in the direction that reduces this loss.
The general form:
ℒ(θ) = (1/N) × Σ L(yi, ŷi(θ)) for i=1 to N
Where:
yi → true label
ŷi → predicted value
θ → model parameters
L → per-sample loss function
⚡ How it works (training loop):
① Forward pass: compute predictions ŷ
② Calculate loss: measure error between ŷ and y
③ Backward pass: compute gradients ∂ℒ/∂θ
④ Update parameters: θ = θ - α × gradient (α is the learning rate)
⑤ Repeat until loss stops decreasing
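The loop above can be sketched for the simplest case: a one-feature linear model trained with MSE. The data (generated from y = 2x + 1), the learning rate, and the fixed step count are illustrative assumptions. A real loop would normally stop on a validation criterion rather than after a fixed number of steps.

```python
# Hypothetical data generated from y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]


def train(xs, ys, lr=0.05, steps=2000):
    w, b = 0.0, 0.0                  # model parameters (theta)
    n = len(xs)
    history = []
    for _ in range(steps):
        # Forward pass: compute predictions.
        preds = [w * x + b for x in xs]
        # Loss: mean squared error between predictions and targets.
        loss = sum((y - p) ** 2 for y, p in zip(ys, preds)) / n
        history.append(loss)
        # Backward pass: gradients of the MSE with respect to w and b.
        grad_w = sum(-2.0 * (y - p) * x for x, y, p in zip(xs, ys, preds)) / n
        grad_b = sum(-2.0 * (y - p) for y, p in zip(ys, preds)) / n
        # Update: step against the gradient, scaled by the learning rate.
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b, history


w, b, history = train(xs, ys)
print(round(w, 3), round(b, 3), history[0], history[-1])
```

The parameters approach the slope 2 and intercept 1 used to generate the data, and the loss ends far below its starting value.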
🎯 Common loss functions:
MSE (Mean Squared Error):
ℒ = (1/N) × Σ(yi - ŷi)²
→ Penalizes large errors quadratically
→ Use for regression problems
→ Sensitive to outliers
Cross-Entropy:
ℒ = -(1/N) × Σ[yi ln ŷi + (1-yi) ln(1-ŷi)]
→ Measures probability divergence
→ Use for classification
→ This binary form pairs with sigmoid outputs; the multi-class form pairs with softmax
MAE (Mean Absolute Error):
ℒ = (1/N) × Σ|yi - ŷi|
→ Linear penalty, more resistant to outliers
→ Gradient doesn't grow with error size
Huber Loss:
Combines MSE for small errors, MAE for large ones
→ Differentiable everywhere with bounded gradients
→ Good middle ground for noisy data
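To make the outlier behaviour concrete, here is a short sketch of the four losses in plain Python. The Huber threshold `delta=1.0`, the clipping constant `eps`, and the sample values are assumptions chosen for illustration.

```python
import math


def mse(y_true, y_pred):
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true)


def mae(y_true, y_pred):
    return sum(abs(y - p) for y, p in zip(y_true, y_pred)) / len(y_true)


def huber(y_true, y_pred, delta=1.0):
    """Quadratic for |error| <= delta, linear beyond it."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        e = abs(y - p)
        total += 0.5 * e * e if e <= delta else delta * (e - 0.5 * delta)
    return total / len(y_true)


def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    """Binary cross-entropy; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)


# Three perfect predictions and one prediction that is off by 10.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.0, 2.0, 3.0, 14.0]

mse_value = mse(y_true, y_pred)      # 25.0: the single outlier dominates
mae_value = mae(y_true, y_pred)      # 2.5: grows only linearly
huber_value = huber(y_true, y_pred)  # 2.375: linear regime for the outlier

# Confident, correct probabilities for one positive and one negative label.
bce_value = binary_cross_entropy([1, 0], [0.9, 0.1])
print(mse_value, mae_value, huber_value, bce_value)
```

One bad prediction pushes MSE to ten times the MAE, while Huber stays close to MAE because the large error falls in its linear regime.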
🧠 How is it different from a metric?
Loss functions are differentiable (at least almost everywhere) and used during training to update weights. They need informative gradients.
Metrics (like accuracy or F1) evaluate final performance but often aren't differentiable. You optimize the loss, then report the metric.
When to choose which loss:
match the loss to your problem type (regression vs classification), your data distribution (outliers?), and what errors matter most in your application.
What is a Switchback Experiment? (in A/B test interviews)
A switchback experiment is a time-based randomization design for marketplace experiments.
Instead of assigning users to treatment or control, you alternate entire regions or markets between conditions over time. This solves interference problems when treating one user affects others.
Think: turning a feature on and off across different cities at different hours, not splitting individual riders.
The estimator:
τ̂ = (1/|T|) × Σ(Ȳt⁽¹⁾ - Ȳt⁽⁰⁾)
Where:
τ̂ → treatment effect estimate
|T| → number of time slots
Ȳt⁽¹⁾ → average outcome of treated cells in time slot t
Ȳt⁽⁰⁾ → average outcome of control cells in time slot t
You difference treated and control cells within each time slot, then average across slots.
⚡ How it works:
① Divide geography into regions (cities, zones, markets)
② Split time into slots (hours, days, weeks)
③ Randomly assign treatment/control to each region-time cell
④ Alternate assignments so each region sees both conditions
⑤ Measure outcomes and difference across matched time periods
The grid pattern balances time-of-day and day-of-week effects by design.
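A simplified sketch of the design and estimator, assuming four hypothetical regions, six time slots, and a checkerboard assignment with a random starting phase. Real designs usually randomize more heavily and account for noise and carryover. Outcomes are simulated as a slot-specific baseline plus a true effect of 2.0 in treated cells.

```python
import random


def checkerboard_assignment(n_regions, n_slots, rng):
    """1 = treatment, 0 = control; each region alternates across slots."""
    phase = rng.randint(0, 1)  # random starting condition
    return [[(r + t + phase) % 2 for t in range(n_slots)]
            for r in range(n_regions)]


def switchback_estimate(outcomes, assignment):
    """Average over slots of (mean treated outcome - mean control outcome)."""
    n_regions, n_slots = len(outcomes), len(outcomes[0])
    diffs = []
    for t in range(n_slots):
        treated = [outcomes[r][t] for r in range(n_regions) if assignment[r][t] == 1]
        control = [outcomes[r][t] for r in range(n_regions) if assignment[r][t] == 0]
        diffs.append(sum(treated) / len(treated) - sum(control) / len(control))
    return sum(diffs) / len(diffs)


rng = random.Random(42)
n_regions, n_slots = 4, 6
assignment = checkerboard_assignment(n_regions, n_slots, rng)

# Simulated outcomes: strong time-of-day pattern plus the treatment effect.
slot_baseline = [10.0, 12.0, 20.0, 15.0, 11.0, 9.0]
true_effect = 2.0
outcomes = [[slot_baseline[t] + true_effect * assignment[r][t]
             for t in range(n_slots)] for r in range(n_regions)]

tau_hat = switchback_estimate(outcomes, assignment)
print(tau_hat)
```

Because treated and control cells are compared within the same slot, the large swings in `slot_baseline` cancel out and the estimate recovers the simulated effect.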
🧠 How is it different from user-level A/B tests?
User-level A/B tests randomize individuals and assume no spillover between users.
Switchback randomizes time-region blocks and handles network effects. It reduces bias when one user's treatment affects others (like surge pricing or driver supply). But it increases variance because you have fewer independent units.
Slot lengths of 15-60 minutes are common, but the right length depends on how long effects persist. Too short and carryover effects bleed across periods. Too long and you have fewer slots, so you lose statistical power.
When to use switchbacks:
when testing marketplace features like pricing, dispatch algorithms, or supply incentives where user-level randomization creates interference and biased estimates.