Most people misunderstand the CAP theorem.
It does not mean you choose between consistency and availability for every system all the time.
It means that when a network partition happens, a distributed system can either:
stay consistent and reject/delay some requests,
or stay available and allow stale/conflicting data.
You do not “beat” CAP with better code.
You make a product decision about which failure mode is less harmful.
Bank balance? Lean toward consistency.
Social feed likes? Availability is often fine.
CAP is useful because it forces the real question:
When the network breaks, what are you willing to sacrifice?
That is system design. Not in normal conditions. In failure.
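Here is a toy sketch of that choice, assuming a single replica that knows whether it can reach its peers. `Replica`, `Mode`, `Unavailable` and the `partitioned` flag are made-up names for illustration, not any real database's API.

```python
from enum import Enum


class Mode(Enum):
    CP = "prefer consistency"
    AP = "prefer availability"


class Unavailable(Exception):
    """Raised when a CP replica refuses to answer during a partition."""


class Replica:
    def __init__(self, mode: Mode) -> None:
        self.mode = mode
        self.value = 0            # last value this replica has seen
        self.partitioned = False  # True when peers are unreachable (assumed known)

    def read(self) -> int:
        if self.partitioned and self.mode is Mode.CP:
            # Consistency first: refuse rather than risk a stale answer.
            raise Unavailable("cannot confirm latest value during partition")
        # Availability first (or healthy network): answer from local state,
        # which may be stale while partitioned.
        return self.value
```

Same code path, two failure modes. Which one you pick is the product decision.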
#SystemDesign #DistributedSystems #CAPTheorem
Retries look harmless.
They are not.
A retry is not just “trying again.”
It adds more load to a system that may already be struggling.
That is why badly designed retries can turn a small slowdown into a full outage.
Good retries need discipline:
timeouts,
bounded attempts,
exponential backoff,
jitter,
and idempotent operations.
Without those, retries become a traffic amplifier.
The goal of retries is not to force success at any cost.
It is to recover from temporary failure without making the system collapse harder.
In distributed systems, resilience is not just about retrying.
It is about knowing when not to.
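The discipline listed above could look roughly like this. It is a minimal sketch, not a library: `retry`, `TransientError` and the default values are assumptions for illustration, and the wrapped operation is expected to be idempotent and to enforce its own timeout.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for failures that are worth retrying."""


def retry(operation, max_attempts=4, base_delay=0.1, max_delay=2.0, sleep=time.sleep):
    """Call operation() with bounded attempts, exponential backoff and full jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise  # bounded attempts: stop instead of piling on more load
            # Exponential backoff: the cap doubles each attempt, up to max_delay.
            cap = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Full jitter: sleep a random time in [0, cap] so that many
            # clients do not retry in lockstep.
            sleep(random.uniform(0, cap))
```

Anything that is not a `TransientError` propagates immediately, which is the "knowing when not to" part.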
#DistributedSystems #SystemDesign #Reliability
A good API is not just easy to use.
It is hard to misuse.
That means clear naming, predictable responses, strong error messages, stable contracts, sane defaults, and versioning only when necessary.
Most API pain does not come from missing features.
It comes from ambiguity:
unclear field meaning,
inconsistent behavior,
surprising side effects,
and breaking changes disguised as improvements.
The best APIs reduce decision fatigue for the consumer.
They make the common path obvious, edge cases explicit, and failures understandable.
Good API design is not about showing how smart the backend is.
It is about making integration feel boring, reliable, and fast.
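One way "hard to misuse" can show up in code, as a small sketch. `create_export`, `ExportFormat` and the row limits are hypothetical, chosen only to illustrate keyword-only arguments, a sane default, and errors that say what to fix.

```python
from enum import Enum


class ExportFormat(Enum):
    CSV = "csv"
    JSON = "json"


def create_export(*, user_id: str, fmt: ExportFormat = ExportFormat.CSV,
                  max_rows: int = 1000) -> dict:
    """Keyword-only arguments, an enum instead of a free-form string,
    a safe default, and explicit validation errors."""
    if not user_id:
        raise ValueError("user_id must be a non-empty string")
    if not isinstance(fmt, ExportFormat):
        raise TypeError("fmt must be an ExportFormat, e.g. ExportFormat.JSON")
    if not 1 <= max_rows <= 10_000:
        raise ValueError(f"max_rows must be between 1 and 10000, got {max_rows}")
    # Predictable response shape: the same keys on every successful call.
    return {"user_id": user_id, "format": fmt.value, "max_rows": max_rows}
```

The common path is `create_export(user_id="u1")`. Everything else has to be spelled out by name.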
#API #SystemDesign #SoftwareEngineering
A lot of teams think SRE is about uptime.
It’s not.
SRE is about making reliability a deliberate engineering decision, not a vague hope.
That means:
defining what “good enough” looks like,
measuring it with SLIs/SLOs,
and using error budgets to balance reliability vs speed.
Without that, every incident feels urgent,
every feature feels risky,
and reliability priorities get set by whoever shouts the loudest.
Good SRE gives teams a shared language for tradeoffs.
Not “make it perfect.”
Not “move fast and pray.”
Just: be clear about what must be reliable, how reliable it needs to be, and what you’re willing to spend to get there.
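The error budget idea is plain arithmetic. A rough sketch, assuming an availability SLO measured in minutes over a rolling window (function names and the 30-day default are illustrative):

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of allowed unavailability for an availability SLO over a window."""
    total_minutes = window_days * 24 * 60
    return total_minutes * (1 - slo)


def budget_remaining(slo: float, bad_minutes: float, window_days: int = 30) -> float:
    """Fraction of the error budget still unspent (negative means overspent)."""
    budget = error_budget_minutes(slo, window_days)
    return (budget - bad_minutes) / budget
```

A 99.9% SLO over 30 days allows about 43.2 minutes of downtime. Once that is spent, the budget says reliability work comes before new features.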
#SRE #Reliability #Engineering
Most system design discussions start with scale.
They should start with failure.
Before talking about caching, sharding, queues, or databases, ask:
What happens when a dependency is slow?
What happens when traffic spikes?
What happens when a service is partially down?
What happens when data is duplicated, delayed, or lost?
A system isn’t well-designed because it works when everything is healthy.
It’s well-designed because it fails predictably, degrades gracefully, and recovers quickly.
Good system design is not just about handling growth.
It’s about handling reality.
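For the "dependency is slow" question, one possible answer is a timeout with a fallback. This is a minimal sketch using the standard library; `call_with_fallback`, the pool size and the 0.2 s default are assumptions, and the fallback could be a cached or default value.

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

_pool = ThreadPoolExecutor(max_workers=4)


def call_with_fallback(dependency, fallback, timeout_s=0.2):
    """Return dependency() if it answers within timeout_s, else the fallback value."""
    future = _pool.submit(dependency)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        # Slow dependency: degrade gracefully instead of hanging the caller.
        # The worker thread keeps running, so a real client would also need
        # a timeout on the underlying call itself.
        return fallback
    except Exception:
        # Failed dependency: fail predictably with the same fallback.
        return fallback
```

The caller gets an answer in bounded time either way, which is the predictable part.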
#SystemDesign #SoftwareEngineering #Architecture
One of the best production incident lessons came from Cloudflare’s 2019 outage:
A single WAF rule update containing a CPU-expensive regex caused massive 502 errors across their network. Not a cyberattack. Not hardware failure. Not a datacenter outage. Just a bad pattern deployed globally.
That’s the uncomfortable truth about production:
sometimes the biggest incidents don’t come from “big” changes. They come from tiny, high-leverage changes with huge blast radius.
The real lesson isn’t “be careful with regex.”
It’s this:
config is production code
small changes need safe rollout
rollback speed matters more than rollback elegance
systems fail where we assume they’re harmless
The teams that survive incidents best aren’t the ones that never make mistakes.
They’re the ones that detect fast, reduce blast radius, roll back fast, and publish what they learned.
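"Safe rollout" plus "fast rollback" can be sketched in a few lines. This is not how Cloudflare deploys; `staged_rollout` and its `apply`, `healthy` and `rollback` callbacks are hypothetical hooks you would wire to your own deploy and monitoring tools.

```python
def staged_rollout(apply, healthy, rollback, stages=(0.01, 0.1, 0.5, 1.0)):
    """Apply a change to growing fractions of traffic.

    Rolls everything back on the first failed health check.
    Returns the fraction reached, or 0.0 if the change was rolled back.
    """
    for fraction in stages:
        apply(fraction)   # expose the change to this share of traffic
        if not healthy():
            rollback()    # fast and blunt: revert everything
            return 0.0
    return stages[-1]
```

A bad change that fails at the first stage touches 1% of traffic instead of all of it.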
#SRE #IncidentResponse #Engineering #learninpublic
What are you building?
I’ll start with mine. I’m building https://t.co/EpKVhLz3s0, a platform where AI researches each lead, analyzes their LinkedIn profile, reads company news, understands what’s happening in their world, and then writes highly personalized emails for them.
Most companies replace {first_name} and call it personalization. We’re changing that.
We will be fully live soon.
Any feedback would be appreciated :)
#buildinpublic
I'm looking to connect with people interested in:
→ Frontend
→ Backend
→ Full Stack
→ DevOps
→ LeetCode
→ AI/ML
→ Data Science
→ UI/UX
→ Freelancing
→ Startups
Say hi & let's grow together
#Connect
Background Jobs are one of the most underrated concepts in System Design 🧵
When a user hits "Submit", they expect a fast response. But some tasks — sending emails, resizing images, processing payments — take time.
The fix? Don't do it inline. Queue it.
User Request → API → Push to Queue → Return response ✅
                          ↓
          Worker picks job & processes it async
Key things to get right:
→ Idempotency — jobs will retry, results must stay consistent
→ Dead Letter Queue — catch & inspect failed jobs
→ Priority Queues — don't let bulk tasks block urgent ones
→ Monitoring — a silent failing worker is a production nightmare
Popular tools: Celery (Python), BullMQ (Node), Sidekiq (Ruby), Hangfire (.NET)
Rule of thumb: If a task does not have to finish before the user gets a response and can be safely retried later, it belongs in a background job.
Speed + Resilience + Scale. That's the combo.
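A toy worker showing idempotency, retries and a dead letter list together. It is a sketch with the standard library `queue` module rather than Celery, BullMQ, Sidekiq or Hangfire; `run_worker`, the job dict shape and the in-memory `processed` set are assumptions (a real system would keep idempotency keys in durable storage).

```python
import queue


def run_worker(jobs: queue.Queue, handler, max_attempts: int = 3):
    """Drain the queue once. Returns (processed_ids, dead_letters)."""
    processed = set()   # idempotency keys already handled
    dead_letters = []   # jobs that exhausted their attempts
    while True:
        try:
            job = jobs.get_nowait()
        except queue.Empty:
            break
        if job["id"] in processed:
            continue  # duplicate delivery: skip so the result stays consistent
        try:
            handler(job)
            processed.add(job["id"])
        except Exception as exc:
            job["attempts"] = job.get("attempts", 0) + 1
            if job["attempts"] >= max_attempts:
                dead_letters.append((job, repr(exc)))  # keep for inspection
            else:
                jobs.put(job)  # retry later
    return processed, dead_letters
```

Anything that lands in `dead_letters` should trigger an alert, so a failing job is never silent.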
#SystemDesign #BackendEngineering #SoftwareArchitecture
Companies where engineers quietly make 40+ LPA in India
NVIDIA → 40–80 LPA
Rubrik → 35–70 LPA
Rippling → 40–75 LPA
LinkedIn → 35–65 LPA
VMware → 30–55 LPA
Google → 35–80 LPA
Microsoft → 30–70 LPA
Adobe → 30–65 LPA
Salesforce → 30–60 LPA
Jane Street → 1–3 Cr
Graviton → 80 LPA – 2 Cr
Quadeye → 80 LPA – 2 Cr
NK Securities → 70 LPA – 1.5 Cr
Most people chase FAANG.
Meanwhile, some of the highest-paying jobs are hiding in plain sight. 👀
Which company would you join without thinking twice?