Microservices in prod: what patterns keep you from living in incident chat?
1. Bounded context first. If you can’t name the owner + data, it’s not a service.
2. DB per service. Share events, not tables. Accept duplication.
3. Sync for reads, async for writes. Use outbox + idempotency keys.
4. Timeouts everywhere. Start with p95 + 2x, cap retries, add jitter.
5. Circuit breakers + bulkheads. One slow dependency should not drain all threads.
6. Backpressure. Queue limits, 429s, and load shedding beat cascading failure.
7. Contract tests. Version your APIs; never deploy a breaking change without a migration path.
8. Observability as a feature: trace IDs, RED/USE metrics, SLOs, structured logs.
9. Safe deploys: canary, feature flags, and rollbacks that don’t need a ticket.
10. Kill switch per dependency. You’ll use it more than you think.
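Point 4 (cap retries, add jitter) is the easiest one to get subtly wrong, so here is a minimal sketch of capped exponential backoff with full jitter. The names `backoff_delays` and `call_with_retries` and the default values are illustrative assumptions, not a recommendation for any specific service. It retries on any exception for brevity; real code would retry only errors known to be safe to retry, and would also set a per-call timeout.

```python
import random
import time


def backoff_delays(base=0.1, cap=2.0, max_retries=3, rng=random.random):
    """Yield one sleep duration per retry: full jitter over a capped exponential."""
    for attempt in range(max_retries):
        ceiling = min(cap, base * (2 ** attempt))  # exponential growth, capped
        yield rng() * ceiling                      # full jitter: uniform in [0, ceiling)


def call_with_retries(fn, max_retries=3, base=0.1, cap=2.0, sleep=time.sleep):
    """Call fn(); on failure retry at most max_retries times, then re-raise."""
    delays = backoff_delays(base, cap, max_retries)
    while True:
        try:
            return fn()
        except Exception:
            delay = next(delays, None)
            if delay is None:
                raise  # retry budget exhausted: surface the last error
            sleep(delay)
```

Passing `sleep` and `rng` as parameters keeps the policy testable without real waiting. The jitter matters because clients that fail together otherwise retry together.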
I was asked this system design problem in 3 out of 11 Big Tech companies I interviewed at last year, including Amazon, Google, Atlassian, Salesforce, Walmart, and others.
For context, I landed 6 offers last year during my 3-month job switch journey:
1. Amazon (Senior Eng. L6)
2. Walmart (Staff Eng.)
3. Atlassian (Principal Eng.)
4. Salesforce (LMTS)
5. Confluent (Sr. SWE 2)
6. Deliveroo (Staff SWE)
What was the problem? It was: Design a distributed job scheduler. I was given different requirements and constraints each time.
Netflix is proof that Kubernetes is not enough.
Most companies think Kubernetes is the platform. Netflix treated it as the starting point.
Netflix streams content to more than 300 million people worldwide, but Kubernetes is not delivering those videos.
Video delivery runs on Open Connect because streaming video needs a level of predictability that general-purpose infrastructure cannot provide.
Kubernetes runs everything else.
But running containers is only part of the problem.
Every workload needs access to AWS services, and giving thousands of containers the same permissions is a security risk waiting to happen.
So Netflix built Titus.
Workloads get fine-grained permissions, reducing the blast radius when something goes wrong.
Then came machine learning.
Data scientists should be training models, not learning Kubernetes internals or writing YAML.
So Netflix built Metaflow.
Engineers write Python, while Kubernetes, GPUs, experiment tracking, and infrastructure complexity stay hidden behind the platform.
Deployments created another challenge.
When hundreds of services are released every day, no engineer can watch every rollout.
So Netflix built Spinnaker.
Traffic shifts gradually, canaries are analyzed automatically, and bad releases roll back before they become incidents.
But reliable systems are not built by avoiding failure.
They are built by expecting it.
That is why Netflix embraced chaos engineering.
Pods fail.
Nodes disappear.
Entire zones can become unavailable.
Netflix tests these scenarios before production discovers them.
The lesson is simple.
Netflix did not scale because it adopted Kubernetes.
Netflix scaled because it built identity, security, deployment safety, developer experience, machine learning platforms, and resilience on top of Kubernetes.
Kubernetes was never the platform.
It was the foundation.
90% of Java interviews in 2026 come down to these 7 points:
1) You can trace a request through the stack: controller, service, repo, DB, queue, cache, logs
2) You know modern Java basics: streams when they help, records, sealed, virtual threads, but not as trivia
3) You understand concurrency: thread safety, executors, backpressure, timeouts, and why blocking hurts throughput
4) You can design APIs: idempotency keys, pagination, retries, versioning, and clear error models
5) You can debug production: heap vs CPU, GC pauses, JFR, thread dumps, slow SQL, p99 latency
6) You can do data + messaging: indexes, isolation, N+1, exactly-once vs at-least-once, dead letters
7) You can ship safely: tests that matter, feature flags, migrations, canaries, rollbacks, observability (RED/USE)
Most Uber-style matching interviews aren’t about maps, they’re about controlling fanout. Components I expect: driver location ingest (stream + geospatial index), rider request service, matcher/dispatcher, pricing/ETA, notification/assignment, and an events log for audits. Bottlenecks show up fast: hot cells in dense areas, candidate search fanout, thundering herds on surge, and push notification latency.
Data model is basically Driver(id, status, geo, last_seen, capacity), Trip(id, rider_id, pickup/dropoff, state, version), and an Assignment(trip_id, driver_id, expires_at) with idempotency keys. APIs: POST /trips, POST /drivers/{id}/location, POST /trips/{id}/accept, POST /trips/{id}/cancel. Scaling tradeoffs: shard by geohash/cell, keep matching state in memory with TTL, use at-least-once streams + dedupe, and prefer eventual consistency over global locks. Failure cases: stale GPS, driver accepts after timeout, double-assign on retries, partition between matcher and notification, and
Linux monitoring tools you SHOULD have
- btop - sleek UI, CPU + GPU stats, lots of themes
- glances - all‑in‑one overview (CPU, memory, disk, network)
- nvtop / nvitop - GPU graphs, PCIe metrics, power & temp (way better than nvidia‑smi)
- duf - high taste disk usage utility
Many engineers default to using UUIDs as primary keys in PostgreSQL without considering the trade-offs.
The problem?
Traditional UUIDs are random.
And random values don't play nicely with B-Tree indexes.
Every new insert can land in a completely different part of the index, causing page splits, fragmentation, and more work for PostgreSQL as your table grows.
At a few thousand rows, you'll never notice.
At 10 million+ rows, you probably will.
That's why many teams are moving towards ULIDs and UUIDv7.
You still get globally unique identifiers.
But you also get time-based ordering.
New records are inserted closer together in the index, which means less fragmentation and more predictable write performance.
Small change.
Big impact.
Especially when you're operating at scale.
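To make the time-ordering concrete, here is a rough sketch of the UUIDv7 bit layout: a 48-bit Unix millisecond timestamp up front, then the version and variant bits, with the rest random. `uuid7_sketch` is a hypothetical helper written for illustration; in production you would likely use a maintained library or a database-side generator instead.

```python
import os
import time
import uuid


def uuid7_sketch(unix_ms=None):
    """Build a UUIDv7-style value: 48-bit ms timestamp, then random bits."""
    if unix_ms is None:
        unix_ms = time.time_ns() // 1_000_000
    rand = int.from_bytes(os.urandom(10), "big")   # 80 random bits, 74 used
    rand_a = (rand >> 62) & 0xFFF                  # 12 bits after the version
    rand_b = rand & ((1 << 62) - 1)                # 62 bits after the variant
    value = (unix_ms & ((1 << 48) - 1)) << 80      # timestamp in the top 48 bits
    value |= 0x7 << 76                             # version field = 7
    value |= rand_a << 64
    value |= 0b10 << 62                            # RFC variant bits
    value |= rand_b
    return uuid.UUID(int=value)
```

Because the timestamp occupies the most significant bits, IDs created in later milliseconds sort after earlier ones, which is what keeps new B-Tree entries clustered near the right edge of the index. In this sketch, IDs created within the same millisecond are still ordered randomly relative to each other.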
Backend interview traps that look easy but break prod fast: idempotency, pagination, rate limiting.
1) Idempotency: retries happen. Show an idempotency key + dedupe store (Redis/DB) + safe semantics for POST/charge/send-email. Talk about TTLs and what happens on key collision.
2) Pagination: offset is fine until deletes/inserts cause skips/dupes. Prefer cursor pagination (created_at,id) with stable ordering. Mention index needs and how to handle backfills.
3) Rate limiting: don’t just say 429. Pick an algorithm (token bucket/leaky bucket), scope it (per user, per IP, per API key), and define headers (Retry-After, X-RateLimit-Remaining). Consider burst vs steady state.
4) Operational detail: what do you log/measure? retries, dedupe hits, page drift, 429 rate, latency impact of limiter storage (local vs Redis).
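For point 3, a token bucket is small enough to sketch on a whiteboard. This is a single-process, in-memory version under stated assumptions: the `TokenBucket` class and its `try_acquire` method are illustrative names, and a real deployment would keep one bucket per scope (user, IP, API key), often in shared storage such as Redis.

```python
import time


class TokenBucket:
    """Allow bursts up to `capacity`, refilled at `rate` tokens per second."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = float(capacity)   # start full so an initial burst is allowed
        self.updated = clock()

    def _refill(self):
        now = self.clock()
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now

    def try_acquire(self, cost=1):
        """Return (allowed, retry_after_seconds)."""
        self._refill()
        if self.tokens >= cost:
            self.tokens -= cost
            return True, 0.0
        # Time until enough tokens accumulate; usable for a Retry-After header.
        return False, (cost - self.tokens) / self.rate
```

`capacity` controls burst size and `rate` controls steady state, which maps directly onto the burst vs steady-state question. The second return value is what you would round up and send as `Retry-After` with the 429.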
Cohesity interview experience - SDE 3 for 54 LPA
Round 1 - Coding
Scheduled with SE-3 from Bangalore
https://t.co/em7WSAeE2r
Discussed DFS-based and disjoint-set-based component counting and their time complexity.
The feedback was positive for this round.
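The two approaches from Round 1 can be sketched as below. The exact problem statement is behind the link, so this assumes the common form: an undirected graph with nodes labeled 0..n-1 and an edge list. Function names are illustrative.

```python
def count_components_dsu(n, edges):
    """Disjoint-set version: start with n components, merge on each new union."""
    parent = list(range(n))
    size = [1] * n

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    components = n
    for a, b in edges:
        ra, rb = find(a), find(b)
        if ra == rb:
            continue                       # already connected
        if size[ra] < size[rb]:
            ra, rb = rb, ra                # union by size
        parent[rb] = ra
        size[ra] += size[rb]
        components -= 1
    return components


def count_components_dfs(n, edges):
    """Iterative DFS version: one traversal per unvisited node."""
    adj = [[] for _ in range(n)]
    for a, b in edges:
        adj[a].append(b)
        adj[b].append(a)
    seen = [False] * n
    components = 0
    for start in range(n):
        if seen[start]:
            continue
        components += 1
        seen[start] = True
        stack = [start]
        while stack:
            node = stack.pop()
            for nxt in adj[node]:
                if not seen[nxt]:
                    seen[nxt] = True
                    stack.append(nxt)
    return components
```

DFS runs in O(V + E) time. The disjoint-set version is near-linear (inverse Ackermann amortized per operation) and has the advantage that it also works when edges arrive as a stream.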
Round 2 - Design
Scheduled with SE 3 from Bangalore
LLD - (Can't recall exact constraints/requirements, still added)
An implementation where an ordered sequence of values arrives and the 10 most recent values are displayed when a read is performed. Writes must be low latency.
Discussed how it would work in a distributed system. I suggested a queue-based approach for writes, using idempotency and an ACID DB to avoid duplicates if the queue/ack fails.
The feedback was positive for this round.
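A single-node sketch of the Round 2 idea, since the exact requirements weren't recalled: a bounded buffer with O(1) writes, plus a sequence-number check so a redelivered write is ignored. The `RecentValues` class, the per-write sequence number, and the newest-first read order are all assumptions made for illustration.

```python
from collections import deque
from threading import Lock


class RecentValues:
    """Keep only the last `limit` values; ignore redelivered writes."""

    def __init__(self, limit=10):
        self._items = deque(maxlen=limit)  # oldest entry drops out automatically
        self._last_seq = -1
        self._lock = Lock()

    def write(self, seq, value):
        """Append in O(1). Returns False if seq was already applied."""
        with self._lock:
            if seq <= self._last_seq:
                return False               # duplicate or stale delivery
            self._items.append(value)
            self._last_seq = seq
            return True

    def read(self):
        """Return up to `limit` values, newest first."""
        with self._lock:
            return list(reversed(self._items))
```

The sequence check is the in-memory analogue of the idempotency point from the discussion: if the queue redelivers after a lost ack, the second write is a no-op.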
Round 3 - Design
Scheduled with a Senior Staff from a US team
It was supposed to be a mix of LLD and HLD, but we mostly discussed my current project, an event-driven architecture, using a high-level diagram. The discussion gradually moved towards Cohesity's backup and recovery work.
We discussed:
uploading large files using synchronous, async, and chunked approaches
data corruption during upload
storing chunked data and reusability across tenants(if any)
replication and erasure coding to handle data loss
identifying data corruption and recovering from it on download, and more
Was a really great discussion with a hands-on professional.
The feedback was positive for this round.
Round 4 - HM
Scheduled with DoE from US.
General Behavioral questions around how my manager, team, family and friends perceive me.
One small technical question about appending one file's content to another. Both files are 100 GB in size. Discussion on what could go wrong.
I described reading a fixed number of characters at a time (sized to available RAM) and appending them to the other file. We discussed memory corruption, running out of disk space, and a few other points.
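A minimal sketch of that chunked append, with a few of the "what could go wrong" cases handled explicitly. `append_file` and its 8 MiB default chunk size are assumptions, not what was written in the interview. It reads bytes rather than characters so a multi-byte character is never split across chunks.

```python
import os
import shutil


def append_file(src_path, dst_path, chunk_size=8 * 1024 * 1024):
    """Append src to dst in fixed-size binary chunks; roll back dst on failure."""
    if os.path.samefile(src_path, dst_path):
        raise ValueError("source and destination are the same file")
    src_size = os.path.getsize(src_path)
    dst_dir = os.path.dirname(os.path.abspath(dst_path))
    if shutil.disk_usage(dst_dir).free < src_size:
        raise OSError("not enough free space to append")
    original_size = os.path.getsize(dst_path)
    try:
        with open(src_path, "rb") as src, open(dst_path, "ab") as dst:
            while True:
                chunk = src.read(chunk_size)   # memory use bounded by chunk_size
                if not chunk:
                    break
                dst.write(chunk)
            dst.flush()
            os.fsync(dst.fileno())             # push data to disk before returning
    except BaseException:
        os.truncate(dst_path, original_size)   # undo a partial append
        raise
    return os.path.getsize(dst_path)
```

The free-space check is only a best-effort guard since another process can fill the disk mid-copy, which is why the truncate-on-failure path exists. A fuller answer would also cover checksums and concurrent writers.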
As per the recruiter, the technical signal was mixed from this round. So, they scheduled another Coding round.
Additional Round - Coding
Scheduled with a Senior Staff from a US team
Another hands-on professional from the backup and recovery team. He didn't ask any LeetCode-style questions. Instead, he asked me to review a simplified codebase around a streaming architecture. It was in C++. I don't have C++ exposure apart from college, so he helped me with any unfamiliar syntax.
I pointed out issues around SOLID principles, retry logic without backoff, API documentation for abstract methods, and more.
The feedback was positive for this round.
Netflix-style system design: design a video streaming service. What do interviewers actually want?
1) Components: upload + ingest, transcoding farm (HLS/DASH, multiple bitrates), origin storage, CDN, playback API, auth/DRM, telemetry (QoE), watch-history service
2) Bottlenecks: encoding CPU (queue + autoscale), origin egress on cache misses, CDN invalidation, metadata DB hot keys (popular titles), cold-start latency on first segment
3) Data model: Title(id, assets), Asset(profile, codec, segment_urls), User(id), WatchProgress(user_id, title_id, position_ms, updated_at, device_id) with idempotency key + write coalescing
4) APIs: POST /upload, POST /encode/{asset}, GET /playback/{title} -> manifest URL + signed token, PUT /progress (position, duration, event_time), GET /history?cursor=
5) Tradeoffs + failures: per-segment signed URLs vs tokenized manifests; strong vs eventual progress (resume accuracy vs cost); retries cause progress rewind unless monotonic updates; CDN POP outage f
One engineer interviewing for a Senior Engineer role at Airbnb was asked to design a notification delivery system.
Another candidate at LinkedIn got a simple messaging attachment upload problem.
Both sounded easy initially.
Until the interviewer kept adding constraints:
- Notifications should arrive within seconds
- Duplicate notifications must never be sent
- Users can go offline and reconnect later
- Attachments should support previews instantly
- System should handle celebrity-level fanout spikes
- Failed deliveries should retry automatically
And suddenly…
you’re no longer building a notification service.
You’re designing a large-scale distributed delivery pipeline.
Here are 11 things you should automatically think about now whenever a system involves messaging, notifications, or attachments:
- Never send notifications synchronously
-> Push delivery should always happen through async workers and queues.
- Separate notification creation from delivery
-> One service decides what to send, another handles how to send it.
- Design idempotent consumers
-> Queue retries will happen. Duplicate sends should not.
- Prioritize notifications differently
-> OTPs and payment alerts should not wait behind marketing pushes.
- Fanout strategy matters a lot
-> Sending to 10 followers vs 10 million followers are completely different problems.
- Store delivery state carefully
-> Sent, delivered, failed, opened, clicked all become important later.
- Retry with exponential backoff
-> Immediate retries during outages usually make incidents worse.
- Support offline users gracefully
-> Mobile clients should sync missed events after reconnecting.
- Generate previews asynchronously
-> Image thumbnails, PDFs, video previews should happen in background jobs.
- Deduplicate aggressively
-> Network retries and client refreshes create duplicate requests constantly.
- Add per-user and per-device rate limits
-> Prevent abuse and accidental notification storms.
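The "design idempotent consumers" and "deduplicate aggressively" points can be sketched like this. `DedupeStore` is an in-memory stand-in for a shared store such as Redis, and the message fields `user_id` and `notification_id` are hypothetical.

```python
import time


class DedupeStore:
    """In-memory stand-in for a shared dedupe store with per-key TTL."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._seen = {}

    def claim(self, key):
        """Return True only the first time a key is seen within the TTL."""
        now = self.clock()
        expires = self._seen.get(key)
        if expires is not None and expires > now:
            return False
        self._seen[key] = now + self.ttl
        return True

    def release(self, key):
        self._seen.pop(key, None)


def handle(message, store, send):
    """Queue consumer: send at most once per (user, notification) per TTL."""
    key = f"{message['user_id']}:{message['notification_id']}"
    if not store.claim(key):
        return "duplicate"       # redelivery: acknowledge without sending
    try:
        send(message)
    except Exception:
        store.release(key)       # let the queue's retry attempt delivery again
        raise
    return "sent"
```

This only narrows the duplicate window. A crash after `send` succeeds but before the message is acknowledged is safe here, but a crash between a provider accepting the push and `send` returning is not, so provider-side idempotency keys are still worth having.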
One subtle thing most interview discussions miss:
At scale, notification systems are usually optimized more for fanout control than raw delivery speed.
Because it gets difficult when:
one celebrity posts
one payment gateway retries everything
one buggy cron triggers millions of sends
The hardest part often isn’t sending notifications fast.
It’s preventing the entire system from melting during spikes.
The best free resources to study production-grade engineering research papers:
1. Amazon Builders’ Library
2. Google SRE Books
3. Google Research Publications
4. Microsoft Research Systems
5. ACM Queue
6. USENIX
7. VLDB / PVLDB
8. arXiv CS
9. Papers We Love
10. The Morning Paper
11. High Scalability
12. Martin Fowler