One of my friends got rejected in a Spring Boot microservices interview, not because he didn’t know the concepts,
but because he couldn’t answer questions about real production issues.
That’s when we realized that interviews are no longer about theory; they’re about real-world problem solving.
Here are 15 real production scenario-based questions:
1. Your Spring Boot service CPU suddenly spikes to 90% in production. How will you investigate and fix it?
2. After deployment, your service starts throwing intermittent 500 errors. How will you debug this issue?
3. One microservice goes down and causes a chain failure in other services. How will you prevent this in future?
4. Your API response time increased from 200ms to 3 seconds after a new release. How will you identify the root cause?
5. Database connections are getting exhausted under load. What steps will you take to fix this?
6. A third-party service you depend on is timing out frequently. How will you handle this in your system?
7. You observe duplicate transactions happening in your system. How will you prevent this?
8. Logs are too large and distributed, making debugging difficult. How will you improve observability?
9. Memory usage keeps increasing and your service crashes after some time. How will you detect and fix memory leaks?
10. Your microservice works fine locally but fails in production. How will you approach debugging?
11. A new deployment breaks one feature but works for others. How will you safely roll back?
12. Traffic suddenly spikes 5x during peak hours and your service becomes slow. How will you scale?
13. Inter-service communication is failing due to network latency. How will you optimize it?
14. You need to trace a single request across multiple services during a failure. How will you implement tracing?
15. A bug in one service causes inconsistent data across multiple services. How will you handle data consistency?
If you can answer these with real solutions, you are already at production-level understanding.
Want to understand system design better?
Study Google File System properly.
It teaches how real systems are designed when machines fail, disks fail, networks fail, and scale becomes too big for textbook thinking.
1. GFS was built for very large files, not tiny files.
Think GBs to TBs. This matters because the design changes completely when your workload is huge logs, crawled web pages, analytics data, backups, and append-heavy data.
A normal file system mindset is: “how do I store files neatly?”
A distributed system mindset is: “how do I store massive data across hundreds or thousands of unreliable machines and still keep it usable?”
2. GFS splits each file into large chunks.
Each chunk is typically 64 MB.
That number is important.
Why large chunks?
Because: fewer metadata entries are needed, fewer client-to-master requests happen, and sequential reads become much faster.
If you make chunks too small, metadata explodes. If you make them large, management becomes simpler at scale.
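To make the "metadata explodes" point concrete, here is a rough back-of-the-envelope sketch. The 64 bytes of master metadata per chunk and the 1 PiB of stored data are illustrative assumptions, and the function names are made up for this example.

```python
# Rough estimate of how much chunk metadata a master must hold.
# Assumptions (illustrative only): 64 bytes of metadata per chunk, 1 PiB of data.
KIB = 1024
MIB = 1024 ** 2
TIB = 1024 ** 4


def chunk_count(total_bytes: int, chunk_bytes: int) -> int:
    """Number of chunks needed to hold total_bytes (ceiling division)."""
    return -(-total_bytes // chunk_bytes)


def metadata_bytes(total_bytes: int, chunk_bytes: int, per_chunk: int = 64) -> int:
    """Total metadata size if every chunk costs per_chunk bytes."""
    return chunk_count(total_bytes, chunk_bytes) * per_chunk


data = 1024 * TIB  # 1 PiB of file data
large = metadata_bytes(data, 64 * MIB)  # 64 MiB chunks
small = metadata_bytes(data, 4 * KIB)   # 4 KiB blocks, like a local file system

print("64 MiB chunks:", chunk_count(data, 64 * MIB), "chunks,", large // MIB, "MiB of metadata")
print("4 KiB chunks: ", chunk_count(data, 4 * KIB), "chunks,", small // TIB, "TiB of metadata")
```

Under these assumptions the large-chunk layout fits its metadata in about 1 GiB, which a single machine can keep in memory, while the small-chunk layout needs terabytes.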
3. GFS has 1 master and many chunkservers.
This is one of the most important ideas.
The master does not store the actual file data. It stores metadata: file names, chunk mapping, chunk locations, version numbers, lease info.
The chunkservers store the real data.
This separation is a huge lesson in system design: keep the control plane and data plane separate.
Control decides. Workers store and serve.
4. Every chunk is replicated, usually 3 times.
This is how GFS survives machine failure.
If 1 machine dies, 2 more still have the chunk. If a disk gets corrupted, replicas help recover it.
Replication gives durability and availability, but it also introduces cost: more storage, more network copying, more consistency work.
That is the real tradeoff.
System design is always about deciding what cost you are willing to pay.
5. Reads are simple.
The client first asks the master: “where is chunk number X?”
The master replies with chunkserver locations.
After that, the client talks directly to the chunkserver. Not through the master.
This is a very valuable pattern: use central coordination only when needed, but push actual heavy traffic away from the coordinator.
Otherwise your master becomes the bottleneck very fast.
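The read flow above can be sketched in a few lines. This is a toy model, not the real GFS API: the class names, the `locate` method, and the client-side location cache are assumptions made for illustration, and reads that cross a chunk boundary are ignored.

```python
from dataclasses import dataclass, field

CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB chunks, as described above


@dataclass
class ChunkServer:
    """Stores the real bytes, keyed by chunk handle."""
    chunks: dict = field(default_factory=dict)

    def read(self, handle, offset, length):
        return self.chunks[handle][offset:offset + length]


@dataclass
class Master:
    """Stores only metadata: (file name, chunk index) -> (handle, replica names)."""
    table: dict
    lookups: int = 0

    def locate(self, filename, chunk_index):
        self.lookups += 1
        return self.table[(filename, chunk_index)]


class Client:
    def __init__(self, master, servers):
        self.master = master
        self.servers = servers
        self.cache = {}  # assumption: the client remembers chunk locations

    def read(self, filename, offset, length):
        index = offset // CHUNK_SIZE
        key = (filename, index)
        if key not in self.cache:  # only a cache miss touches the master
            self.cache[key] = self.master.locate(filename, index)
        handle, replicas = self.cache[key]
        server = self.servers[replicas[0]]  # data comes straight from a chunkserver
        return server.read(handle, offset % CHUNK_SIZE, length)


servers = {
    "cs1": ChunkServer({"h1": b"hello world"}),
    "cs2": ChunkServer({"h1": b"hello world"}),
}
master = Master({("/logs/a", 0): ("h1", ["cs1", "cs2"])})
client = Client(master, servers)

first = client.read("/logs/a", 0, 5)
second = client.read("/logs/a", 6, 5)
print(first, second, "master lookups:", master.lookups)
```

Two reads cost one master lookup here; the bytes themselves never pass through the master.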
6. Writes are harder than reads.
This is where real distributed systems begin.
When a client writes, the data is first pushed to all replicas. The master grants one replica a short-lived lease, which makes it the primary for that chunk. The primary decides the write order, and the secondaries then apply mutations in that same order.
Why is this needed?
Because once multiple replicas exist, somebody must decide the exact mutation order. Without that, replicas diverge and your data becomes inconsistent.
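Here is a minimal sketch of that idea, assuming a single primary that hands out serial numbers. The `Replica` and `Primary` classes and their methods are hypothetical, and lease expiry, failures, and retries are left out.

```python
class Replica:
    """Holds pushed-but-unapplied data plus the ordered log of applied mutations."""

    def __init__(self):
        self.buffered = {}  # data_id -> bytes pushed by a client
        self.log = []       # applied mutations, in serial order

    def push(self, data_id, data):
        self.buffered[data_id] = data

    def apply(self, serial, data_id):
        if serial != len(self.log):
            raise ValueError("mutation applied out of order")
        self.log.append(self.buffered.pop(data_id))


class Primary(Replica):
    """The lease holder: the only replica allowed to assign serial numbers."""

    def __init__(self, secondaries):
        super().__init__()
        self.secondaries = secondaries

    def commit(self, data_id):
        serial = len(self.log)  # next slot in the mutation order
        self.apply(serial, data_id)
        for replica in self.secondaries:
            replica.apply(serial, data_id)  # same serial, same order everywhere
        return serial


secondaries = [Replica(), Replica()]
primary = Primary(secondaries)

# Data from two clients reaches the replicas in different orders.
primary.push("x", b"from client 1")
primary.push("y", b"from client 2")
secondaries[0].push("y", b"from client 2")
secondaries[0].push("x", b"from client 1")
secondaries[1].push("x", b"from client 1")
secondaries[1].push("y", b"from client 2")

# The primary alone decides the order in which the mutations are applied.
primary.commit("y")
primary.commit("x")
print(primary.log == secondaries[0].log == secondaries[1].log)
```

Arrival order differs per replica, but the applied order is identical because only one node assigns it.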
7. GFS was optimized for append-heavy workloads.
This is a very underrated point.
A lot of large distributed systems do not behave like your laptop file system.
They are not constantly editing random bytes in the middle of files. They are often appending logs, events, records, crawler outputs, analytics streams.
So GFS introduced record append.
That means multiple clients can append to the same file concurrently, while the system guarantees that each record is written atomically at least once, at an offset the system chooses.
Very useful for logs. Very practical for real infra.
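One consequence of "at least once" is that a retried append can land in the file more than once, so readers need a way to skip duplicates. A minimal sketch, assuming each record carries a unique ID chosen by the writer (the function names and the `ack_outcomes` parameter are made up for this example):

```python
def record_append(log, record, ack_outcomes):
    """Append `record`, retrying until an ack arrives.

    `ack_outcomes` simulates the network: each attempt writes the record,
    but the writer only stops when the matching outcome is True.
    """
    attempts = 0
    for acked in ack_outcomes:
        log.append(record)  # the record reaches the file even if the ack is lost
        attempts += 1
        if acked:
            return attempts
    raise RuntimeError("no ack received")


def read_unique(log):
    """Yield each record once, using its ID to drop retry duplicates."""
    seen = set()
    for record_id, payload in log:
        if record_id not in seen:
            seen.add(record_id)
            yield record_id, payload


log = []
record_append(log, ("r1", "login"), [False, True])          # first ack lost
record_append(log, ("r2", "click"), [True])                 # clean append
record_append(log, ("r3", "logout"), [False, False, True])  # two acks lost

unique = list(read_unique(log))
print(len(log), "records on disk,", len(unique), "after de-duplication")
```

The file ends up with six physical records but only three logical ones, which is the price of keeping the append path simple.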
8. GFS assumes failures are normal.
This is maybe the biggest lesson of all.
Most beginners design systems as if hardware is trustworthy. Real infra engineers design as if failure is guaranteed.
In GFS: machines die, disks corrupt data, network links break, servers restart, replicas go stale.
So the system continuously monitors chunkservers, re-replicates missing chunks, and checks consistency.
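The re-replication step can be sketched as a simple scan over chunk locations. This is a simplified illustration, not how GFS actually prioritizes or places replicas; `plan_re_replication` and the server names are hypothetical.

```python
def plan_re_replication(chunk_locations, live_servers, target=3):
    """Return {chunk: [servers to copy it to]} for under-replicated chunks.

    chunk_locations: {chunk handle: [server names that held a replica]}
    live_servers:    set of servers that are still responding
    """
    plan = {}
    for chunk, holders in chunk_locations.items():
        alive = [s for s in holders if s in live_servers]
        missing = target - len(alive)
        if missing <= 0 or not alive:
            # Healthy, or no surviving replica left to copy from.
            continue
        candidates = sorted(live_servers - set(alive))
        plan[chunk] = candidates[:missing]
    return plan


chunk_locations = {
    "c1": ["s1", "s2", "s3"],
    "c2": ["s2", "s4", "s5"],
    "c3": ["s1", "s3", "s4"],
}
live_servers = {"s1", "s3", "s4", "s5", "s6"}  # s2 has died

plan = plan_re_replication(chunk_locations, live_servers)
print(plan)
```

Only the chunks that lost a replica on `s2` show up in the plan; `c3` is untouched because it still has three live copies.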
Good system design is about how the system keeps working when 10 things go wrong together.
It comes down to making 3 things very clear:
1. what workload you serve
2. what failures you expect
3. what tradeoffs you accept
If you understand those 3 things, you are already thinking like a real backend engineer.
Distributed systems do not have a 'shared clock', so it becomes very difficult for two nodes to agree on what "now" means. To be honest, this is what makes distributed systems so interesting.
Also, this is where the "happened-before" relationship comes in handy and brings order to this chaos. Let's dig slightly deeper...
If event A happened before event B, it means A could have causally influenced B. Two events with no such relationship are simply concurrent, and the system treats them as such. Here is a concrete example - I will use the classic names Alice and Bob and imagine them editing a shared document.
1. Alice reads the document (event A), makes a change (event B), and saves it.
2. Bob also reads the document (event C) before Alice's save arrives.
3. Bob then saves his version (event D).
Now, without tracking the happened-before relationship, the system has no way of knowing that Bob's read (C) missed Alice's write (B). When Bob's save (D) arrives, it will silently overwrite Alice's changes.
With causal tracking, the system knows that B did not happen before D: Bob's write is based on a version that never saw B, so B and D are concurrent. It can detect this conflict and ask for a resolution instead of silently losing data.
By the way, this is how systems in the Dynamo family, such as Amazon's original Dynamo design and Riak, detect concurrent writes instead of silently dropping one of them. There are a few practical ways to implement this:
1. Lamport Clocks
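Each node increments a counter on every event, attaches it to outgoing messages, and on receipt sets its counter to max(local, received) + 1. Pretty simple to implement, but the timestamps alone cannot tell you whether two events were concurrent.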
2. Vector Clocks
Each node tracks a counter per node. This captures true causal relationships and lets you detect concurrent writes precisely. Git's merge detection works on a similar principle (kind of).
3. Hybrid Logical Clocks (HLC)
Combines physical time with logical counters. Used in CockroachDB. Lets you reason about causality while staying close to wall-clock time, so you can efficiently serve time-based queries and maintain a consistent ordering without relying purely on physical clocks.
Of course, the choice depends on your consistency needs and the overhead you can afford. Vector clocks grow with the number of nodes, and HLCs are a good middle ground for most production systems.
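To tie this back to Alice and Bob, here is a minimal vector clock sketch. Clocks are plain dicts of node name to counter, and the function names are made up for this example; real systems also need clock pruning and a merge step, which are left out.

```python
def increment(clock, node):
    """Return a copy of `clock` with `node`'s counter advanced by one."""
    new = dict(clock)
    new[node] = new.get(node, 0) + 1
    return new


def happened_before(a, b):
    """True if a -> b: b is ahead somewhere and a is ahead nowhere."""
    nodes = set(a) | set(b)
    a_ahead = any(a.get(n, 0) > b.get(n, 0) for n in nodes)
    b_ahead = any(b.get(n, 0) > a.get(n, 0) for n in nodes)
    return b_ahead and not a_ahead


def concurrent(a, b):
    """True if each clock is ahead of the other in some component."""
    nodes = set(a) | set(b)
    a_ahead = any(a.get(n, 0) > b.get(n, 0) for n in nodes)
    b_ahead = any(b.get(n, 0) > a.get(n, 0) for n in nodes)
    return a_ahead and b_ahead


base = {}  # the document version that both Alice and Bob read

alice_save = increment(base, "alice")      # event B
bob_stale_save = increment(base, "bob")    # event D, based on the old version
bob_fresh_save = increment(alice_save, "bob")  # what D would be if Bob had seen B

print(concurrent(alice_save, bob_stale_save))       # conflict: neither saw the other
print(happened_before(alice_save, bob_fresh_save))  # safe: Bob built on Alice's save
```

The stale save is flagged as concurrent with Alice's, so the system can ask for a merge; the fresh save is a clean successor and can simply replace it.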
By the way, causality does not solve all consistency problems, but it tells you what you cannot ignore: if A happened before B, your system must respect that order. Everything else is negotiable.
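For the hybrid logical clock option, here is a small sketch of the usual update rules. Physical time is passed in as a plain integer so the example stays deterministic; the class and method names are assumptions for illustration and are not taken from any particular database.

```python
class HLC:
    """Hybrid logical clock: (l, c) = (largest physical time seen, tie-break counter)."""

    def __init__(self):
        self.l = 0
        self.c = 0

    def now(self, physical_time):
        """Timestamp a local or send event."""
        prev = self.l
        self.l = max(prev, physical_time)
        self.c = self.c + 1 if self.l == prev else 0
        return (self.l, self.c)

    def update(self, physical_time, remote):
        """Timestamp a receive event, given the sender's (l, c)."""
        remote_l, remote_c = remote
        prev = self.l
        self.l = max(prev, remote_l, physical_time)
        if self.l == prev and self.l == remote_l:
            self.c = max(self.c, remote_c) + 1
        elif self.l == prev:
            self.c += 1
        elif self.l == remote_l:
            self.c = remote_c + 1
        else:
            self.c = 0
        return (self.l, self.c)


a, b = HLC(), HLC()

t1 = a.now(100)               # first event on node A
t2 = a.now(100)               # same millisecond: the counter breaks the tie
t3 = b.update(90, t2)         # node B's wall clock is behind, yet t3 sorts after t2
t4 = b.now(101)               # physical time moves on, the counter resets

print(t1, t2, t3, t4)
```

Timestamps are plain tuples, so they sort in an order that respects causality even though node B's wall clock was lagging.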
This is a rabbit hole in itself, but I hope this gives you enough of a kick to explore further :)
My dear front-end developers (and anyone who’s interested in the future of interfaces):
I have crawled through the depths of hell to bring you what will be, for years to come, one of the more important foundational pieces of UI engineering (if not in implementation then certainly at least in concept):
Fast, accurate and comprehensive userland text measurement algorithm in pure TypeScript, usable for laying out entire web pages without CSS, bypassing DOM measurements and reflow
And if you’re targeting specific companies, add these:
Amazon:
146 LRU Cache
692 Top K Frequent Words
994 Rotting Oranges
863 All Nodes Distance K in Binary Tree
1152 Analyze User Website Visit Pattern
Google:
23 Merge k Sorted Lists
224 Basic Calculator
772 Basic Calculator III
129 Sum Root to Leaf Numbers
358 Rearrange String k Distance Apart
Meta:
560 Subarray Sum Equals K
1762 Buildings With an Ocean View
670 Maximum Swap
987 Vertical Order Traversal of a Binary Tree
1249 Minimum Remove to Make Valid Parentheses
Microsoft:
139 Word Break
56 Merge Intervals
236 Lowest Common Ancestor of a Binary Tree
240 Search a 2D Matrix II
460 LFU Cache
Uber:
127 Word Ladder
973 K Closest Points to Origin
297 Serialize and Deserialize Binary Tree
253 Meeting Rooms II
341 Flatten Nested List Iterator
Netflix:
981 Time Based Key-Value Store
636 Exclusive Time of Functions
721 Accounts Merge
295 Find Median from Data Stream
79 Word Search
---
If you do:
10 patterns x 4-6 good problems each
20-25 company-focused questions
You’ll already be more prepared than most candidates who just say “I’ve done 500+ LeetCode.”
Solve by pattern. Revise by template. Practice by company. That’s how you can crack DSA rounds fast.
One way to avoid getting overwhelmed when you are designing a system is to approach it with a structure/mental model. Here's a simple one I follow...
Think of any system as having 2 separate paths: a read path and a write path. Your database or storage layer sits at the center, and each path has its own set of problems and its own toolkit to solve them.
The write path is about durability and throughput. You are almost always choosing from the same set of levers: a message queue to absorb bursts, write-ahead logging for durability, batching to reduce I/O, and async processing to keep the critical path thin.
The read path is about latency and scale. Again, a fixed toolkit - caching at various layers (CDN, app-level, query-level), read replicas to offload the primary, denormalizing or pre-computing expensive queries, and pagination or cursor-based traversal to avoid full scans.
Once you start seeing these as two separate concerns, you stop designing a system and start answering two smaller questions: what does the write path need? What does the read path need? Then you wire them together through storage.
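As a toy illustration of the two paths wired together through storage, here is a sketch with a queued, batched write path and a cache-aside read path. Every class here is hypothetical, the store is an in-memory dict, and durability of queued writes and cache invalidation are deliberately left out.

```python
from collections import deque


class Store:
    """Stand-in for the database at the center; counts how often it is hit."""

    def __init__(self):
        self.rows = {}
        self.reads = 0
        self.batches = 0

    def write_batch(self, items):
        self.batches += 1
        self.rows.update(items)

    def get(self, key):
        self.reads += 1
        return self.rows.get(key)


class WritePath:
    """Queue absorbs bursts; batching reduces the number of storage writes."""

    def __init__(self, store, batch_size=3):
        self.store = store
        self.queue = deque()
        self.batch_size = batch_size

    def submit(self, key, value):
        self.queue.append((key, value))  # critical path: just enqueue

    def flush(self):
        while self.queue:
            size = min(self.batch_size, len(self.queue))
            batch = [self.queue.popleft() for _ in range(size)]
            self.store.write_batch(dict(batch))


class ReadPath:
    """Cache-aside: check the cache first, fall back to storage on a miss."""

    def __init__(self, store):
        self.store = store
        self.cache = {}

    def get(self, key):
        if key not in self.cache:
            self.cache[key] = self.store.get(key)
        return self.cache[key]


store = Store()
writes = WritePath(store)
reads = ReadPath(store)

for i in range(5):
    writes.submit(f"user:{i}", {"id": i})
writes.flush()

first = reads.get("user:2")
second = reads.get("user:2")
print("batches:", store.batches, "storage reads:", store.reads)
```

Five writes reach storage as two batches, and two reads of the same key cost one storage read. Each path was tuned with its own lever, which is the point of thinking about them separately.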
This 'framework' (if I can call it that) is not just useful during discussions, but also helps you reason through production systems, especially when you are building an understanding, trying to optimize an existing flow, debugging latency spikes, or planning for scale.
Hope this helps.
LeetCode is HARD until you learn these 20 patterns:
1. Two Pointers
2. Sliding Window
3. Dynamic Programming
4. Prefix Sum
5. Depth-First Search (DFS)
6. Breadth-First Search (BFS)
7. Binary Search
8. Backtracking
9. Monotonic Stack
10. Matrix Traversal
11. Fast & Slow Pointers
12. Top ‘K’ Elements (Heap)
13. Overlapping Intervals
14. Binary Tree Traversal
15. Union Find (Disjoint Set)
16. Greedy Algorithms
17. Linked List In-place Reversal
18. Modified Binary Search
19. Bit Manipulation
20. Trie (Prefix Tree)
Bookmark or repost it for future use.
The reading list that taught me how to think about agentic architecture.
Bookmark this.
1. Brewer's CAP Theorem (2000) — trade-off thinking
2. Netflix Hystrix docs — circuit breaker pattern
3. Martin Fowler: Saga Pattern — distributed rollback
4. The Twelve-Factor App — stateless service design
5. AWS Well-Architected Framework — blast radius thinking
6. "Thinking in Systems" — Donella Meadows
7. Designing Data-Intensive Applications — Kleppmann
8. Google SRE Book Ch.22 — cascading failures
9. OWASP LLM Top 10 (2025) — agent attack surfaces
10. Anthropic: Building Effective Agents (2024)
11. LangGraph docs — stateful agent patterns
12. Microsoft AutoGen paper — multi-agent orchestration
13. Gartner: Agentic AI Hype Cycle (2025)
14. EU AI Act Article 14 — human oversight requirements
Classic distributed systems stuff.
Applied to the next layer of the stack.
Follow for annotated breakdowns → @asmah2107
select name from students where id = 1008
This is the SQL I picked to research what exactly happens when a database query is executed. I follow it through parsing, the buffer pool memory cache, the file system page cache, the SSD disk controller, the flash translation layer, and the SSD page, then all the way back up.
In the process I learned that page and block are perhaps the most overloaded terms in software engineering. There is a database page, an operating system virtual memory page, a file system block, an SSD page, and two types of SSD block: the logical block, which maps to the file system, and the larger erase unit, which contains multiple pages.
All of these units can have different sizes; some match, some don’t.
In this post I walk through this statement and how the different levels of I/O are being performed all the way to physical disk. Like all my posts, I include the fundamentals at the beginning, you may skip if you are familiar.
I titled it: Following a database read to the metal
Hope you enjoy it
https://t.co/STNaifreGR
12 System design concepts engineers should know:
1. Load balancing algorithms explained
↳ https://t.co/VCLCKOZzni
2. gRPC clearly explained
↳ https://t.co/QwgTXr1N9z
3. How HTTPS actually works
↳ https://t.co/wc3CQOsmPS
4. Database caching strategies
↳ https://t.co/23QdZATj2o
5. System design quality attributes
↳ https://t.co/v9WJoUPevt
6. Health checks vs heartbeats
↳ https://t.co/r5SalP6CCh
7. CI/CD pipelines
↳ https://t.co/SM2YvhioIX
8. API gateway vs load balancer vs reverse proxy
↳ https://t.co/Tg3EhT60tU
9. Microservices clearly explained
↳ https://t.co/1CpY04nNxb
10. How JWT works
↳ https://t.co/Kuv7DAj6B9
11. Idempotency in API design
↳ https://t.co/2sItwlz1oe
12. API protocols made simple
↳ https://t.co/2CEu4Wnhsv
What else should make the list?
What concepts would you like me to cover?
👋 PS: Get our System Design Handbook FREE when you join our newsletter. Join 30,001+ engineers: https://t.co/8uVCeyVa1w
--
📌 Save for later.
♻️ Repost to help other engineers learn system design.
➕ Follow Nikki Siapno + turn on notifications.
No bullshit system design guide for backend engineers who want to reach Staff level.
Devs stay stuck at ₹10 to 25 LPA because they know frameworks, but not systems.
Meanwhile Staff Engineers are often paid:
India: ₹40L to ₹1Cr+
Remote: $120k to $250k+
So if you cannot explain these clearly, you are not ready for senior backend roles, let alone Staff.
1. Load Balancing
2. SQL vs NoSQL
3. Idempotency
4. Message Queues
5. CAP Theorem
6. APIs
7. Batch vs Stream Processing
8. Caching Strategies
9. Webhooks
10. Availability
11. Data Sharding and Partitioning
12. Bloom Filters
13. Stateful vs Stateless Architecture
14. Algorithms in Distributed Systems
15. API Gateways
16. Proxy vs Reverse Proxy
17. Sharding
18. Long Polling vs WebSockets
19. Consistent Hashing
20. gRPC, tRPC, GraphQL, or REST
21. Caching
22. Scaling
23. Cache Eviction Policies
24. Databases in System Design
25. JWTs
26. Services in System Design
27. Concurrency vs Parallelism
28. CDC
29. ACID Transactions
30. CDN
31. Sync vs Async
32. Rate Limiting Algorithms
33. REST
34. gRPC vs REST tradeoffs
35. Fault Tolerance
Truth is, Staff level is not about writing more code.
It is about knowing where systems break, why they break, and how to design so they keep making money even when traffic, failures, and complexity go up.
Frontend System Design is the biggest hurdle for Senior/Staff interview candidates. It’s no longer about syntax; it’s about architecture at scale. 🧵
I found a goldmine of interview logs and guides that break down the "how" and "why" of complex UI systems.
Link: https://t.co/ASQzmEb5FC
Please escalate! Tata Nexon stuck 1+ month in Mandi, HP (TATA AIG garage) for minor work. Parts order placed 9 Jan but Sahni Autos Solan still hasn’t supplied. Worst experience ever. Need car urgently for family function. Extremely frustrated. @TATAAIGIndia @TataMotors_Cars
@TataMotors My Tata Nexon is stuck at a TATA AIG certified garage in Mandi, HP for over a month for minor repair. Parts ordered via Sahni Autos (Solan) on Jan 9, but still not delivered. Worst service experience. Please escalate. Happy to DM full details: Reg no., VIN, claim no., contact etc.