P.S Singh

@pd_singh7

Software Engineering, Rebel, Compassionate, Dreamer and Believer :)

Joined March 2018

941 Following

43 Followers

2.8K Posts

P.S Singh @pd_singh7

about 10 hours ago

@LogicBaaaz Can i do this in tv ??

pd_singh7 retweeted

Ritesh Roushan

@devXritesh

1 day ago

System Design Interview Question: A celebrity with 50M followers posts a tweet. Within seconds: • Millions open the app • Millions refresh feeds • Millions request the same content What fails first? A) Database B) Cache C) API Servers D) Message Queue And how would you prevent it? Explain your architecture 👇

121

60K

pd_singh7 retweeted

Jaydeep

@_jaydeepkarale

1 day ago

A web-based System Design Simulator, where you drag & drop architecture components and actually simulate traffic, failures, latency, and scaling in real time Best way to learn concepts. Link in next post post

383

410

29K

pd_singh7 retweeted

Mashood tried Ops

@fromcodetocloud

1 day ago

After working with Kubernetes in production, these are the 7 errors our team see again and again : Save for later✅ Here is the reason why you'll get this and how to fix ➡️

fromcodetocloud's tweet photo. After working with Kubernetes in production, these are the 7 errors our team see again and again :

Save for later✅

Here is the reason why you'll get this and how to fix ➡️ https://t.co/nx96cOVXcp

Who to follow

Mohammad Rajabloo 🦀

@mrsoftwareDev

Husband, Senior Software Engineer @ Snapp! Js/Ts Developer, Gopher، Rust ... Opinions are my own, not where I'm working for. skill issues: 🟩⬛️⬛️⬛️⬛️⬛️(1/∞)

Software engineer 👨‍💻

P.S Singh @pd_singh7

about 20 hours ago

@be_mewadi Link pls

pd_singh7 retweeted

Rahul

@sairahul1

1 day ago

How to build your first AI agent (Full guide)

634

499K

pd_singh7 retweeted

Jaydeep

@_jaydeepkarale

2 days ago

System Design was hard until I watched these videos

136

129K

pd_singh7 retweeted

Abhishek Singh

@0xlelouch_

1 day ago

Distributed systems interviews: they rarely want trivia. They want to see if you’ve been paged. 1. Consistency tradeoffs: strong vs eventual, read-your-writes, quorum math (R+W>N), and what breaks during a partition 2. Idempotency: retries without double-charging, idempotency keys, dedupe windows, at-least-once vs exactly-once (and why “exactly once” is usually scoped) 3. Time: clocks drift, timeouts vs retries, exponential backoff + jitter, and why deadlines beat fixed timeouts in request chains 4. Failure modes: retry storms, thundering herd, backpressure, circuit breakers, bulkheads, and load shedding 5. Data correctness: distributed transactions vs sagas, outbox pattern, ordering guarantees, and how you reconcile after partial failure

pd_singh7 retweeted

Jaydeep

@_jaydeepkarale

5 days ago

Day 2 in Observability Zero to Hero Metrics, Logs & Traces; TL;DR Metrics → What's happening? Logs → What happened? Traces → Where did it happen?

187

42K

pd_singh7 retweeted

Nandkishor

@devops_nk

6 days ago

One Kubernetes interview question I keep getting asked: "Walk me through what happens when a request travels from the Internet to a Pod." Most candidates answer: User → ALB → Ingress → Service → Pod But that's just the beginning. The follow-up questions are where things get interesting: • Who actually decides which Pod gets the request? • What does kube-proxy do? • How do iptables/IPVS work? • What role does the CNI plugin play? • If kube-proxy crashes, does traffic stop? My biggest learning: - A Kubernetes Service is not a load balancer by itself. - kube-proxy programs iptables/IPVS rules on the nodes, and those rules are what actually route traffic to healthy Pods. Even more interesting: - If kube-proxy crashes, existing traffic usually keeps working because the networking rules already exist in the kernel. - What breaks is the ability to update those rules when Pods or Services change. Sometimes the simplest interview questions expose the deepest gaps in our Kubernetes knowledge.

devops_nk's tweet photo. One Kubernetes interview question I keep getting asked:

"Walk me through what happens when a request travels from the Internet to a Pod."

Most candidates answer:

User → ALB → Ingress → Service → Pod

But that's just the beginning.

The follow-up questions are where things get interesting:

• Who actually decides which Pod gets the request?
• What does kube-proxy do?
• How do iptables/IPVS work?
• What role does the CNI plugin play?
• If kube-proxy crashes, does traffic stop?

My biggest learning:
- A Kubernetes Service is not a load balancer by itself.
- kube-proxy programs iptables/IPVS rules on the nodes, and those rules are what actually route traffic to healthy Pods.

Even more interesting:
- If kube-proxy crashes, existing traffic usually keeps working because the networking rules already exist in the kernel.
- What breaks is the ability to update those rules when Pods or Services change.

Sometimes the simplest interview questions expose the deepest gaps in our Kubernetes knowledge.

143

139

P.S Singh @pd_singh7

6 days ago

@arpit_bhayani @KatiMichel @arpit_bhayani I've spent most of my career working with Java, but I'm now looking to learn and move into Go development. Do you have any advice on what I should focus on, common pitfalls to avoid, or how to gain practical experience ??

pd_singh7 retweeted

Saurav Chaudhary

@sauravstwt

about 1 month ago

Senior DevOps interview question: Your application works perfectly with: 1 pod 5 pods even 20 pods But completely breaks after HPA scales to 50... CPU is normal. No crashes. What’s the first thing you suspect?

716

pd_singh7 retweeted

Tech Fusionist

@techyoutbe

about 1 month ago

Learn Kubernetes from Scratch - Visually 🔥🔥

230

329

27K

pd_singh7 retweeted

Saurav Chaudhary

@sauravstwt

about 1 month ago

You call yourself a Senior DevOps engineer? Then explain this without touching CPU or scaling. 30% of requests jumped from 200 ms → 6 s. Zero 5xx errors. Dashboards looked clean. Only the external API path was slow. This is where most engineers get exposed. Pods on newly launched nodes were slow, while existing ones were fine. Istio showed bimodal latency — some requests normal, others stuck at 4–6 seconds. ss -s revealed hundreds of TCP connections sitting in RTO, waiting to retransmit. That’s your first signal: this is not application latency. This is transport-level pain. Then came the real clue. Cilium eBPF metrics started screaming — CT_EVICTIONS and DROP_FRAG_NEEDED spikes on new nodes. Translation: your packets are getting fragmented or dropped. Root cause? MTU mismatch. Old nodes: ~8900 (jumbo frames) New nodes: 1500 Now add VXLAN overhead (~50 bytes), and suddenly your packets don’t fit. They fragment. They retransmit. Latency explodes — silently. No errors. No CPU spike. No autoscaling trigger. Because your system wasn’t failing. It was degrading in the network layer — the one most engineers ignore. This is the difference: Juniors debug pods. Seniors debug packet flow. They trace: ALB → network → CNI → service mesh → application Because real outages don’t scream. They whisper in layers you don’t monitor.

pd_singh7 retweeted

Abhishek Singh

@0xlelouch_

about 1 month ago

Top Quant Interviewer: Given a 100GB log file of URLs, find all duplicate URLs using only 1GB of RAM.

139

127

23K

pd_singh7 retweeted

Swapna Kumar Panda

@swapnakpanda

about 1 month ago

You will be a WORLD-CLASS SOFTWARE ENGINEER once you read these 20 books:

218

178K

pd_singh7 retweeted

Aditya Tandon

@adityatandon02

about 2 months ago

What happens If API Gateway goes down ? If API gateway is your single entry point and it goes down, your entire system can appear offline even when all downstream services are perfectly healthy. This is why treating the gateway as just a routing layer is a mistake; in reality, it becomes one of the most critical components in your architecture. The moment it fails, authentication, rate limiting, routing, logging, and even basic request forwarding stop working, effectively cutting users off from your platform. API Gateway failure is more dangerous than a single microservice failing because microservices are usually isolated, but the gateway is a centralized control point. It sits in front of everything. So even a small misconfiguration, deployment bug, or scaling issue can cause a full system outage. This is why production systems never rely on a single gateway instance. So how do real systems handle this? The first layer of protection is redundancy. Instead of one API Gateway, you deploy multiple instances behind a load balancer. If one instance crashes, traffic is automatically routed to healthy ones. But here is the catch, what if the load balancer itself fails? That is why systems often use DNS-based failover or multiple regional load balancers to avoid a single point of failure at the entry layer. Next comes geographic distribution. Large-scale systems deploy API Gateways across multiple regions. Traffic is routed using geo-DNS or Anycast routing, so even if an entire region goes down, users are automatically redirected to the nearest healthy region. This is how systems achieve high availability across continents. Another important concept is graceful degradation. If the API Gateway is partially failing or overloaded, instead of completely rejecting traffic, it can fall back to limited functionality. For example, serving cached responses, disabling non-critical APIs, or routing only high-priority requests. This ensures the system does not completely collapse under stress. Caching also plays a surprisingly powerful role here. Many API Gateways cache frequent responses. If downstream services become unreachable, cached responses can still be served temporarily, buying time for recovery. This is especially useful for read-heavy systems like content platforms. Now consider rate limiting and circuit breaking at the gateway level. If downstream services are failing, the gateway can stop forwarding excessive traffic to prevent cascading failures. This isolates problems and protects the rest of the system. So If you design it properly with multiple gateway instances, load balancing, regional failover, and fallback mechanisms, users might not even notice the failure. The system continues to operate because there is no single point of failure. It’s a common sense to ask Why not bypass the API Gateway and let clients directly call services? Because that removes a central control layer for authentication, monitoring, rate limiting, and routing. It increases complexity on the client side and weakens system security. Instead of removing the gateway, we make it highly available and fault tolerant. Real-world systems like Netflix, Amazon, and Google treat API Gateways as distributed systems themselves. They are scaled horizontally, replicated across regions, monitored aggressively, and designed to fail gracefully. The key takeaway is simple. The API Gateway is not just a router, it is a critical infrastructure component. If it goes down and you did not design for failure, your entire system goes down with it. If you design it correctly, failure becomes just another recoverable event. Happy designing ❤️

adityatandon02's tweet photo. What happens If API Gateway goes down ?

If API gateway is your single entry point and it goes down, your entire system can appear offline even when all downstream services are perfectly healthy. This is why treating the gateway as just a routing layer is a mistake; in reality, it becomes one of the most critical components in your architecture. The moment it fails, authentication, rate limiting, routing, logging, and even basic request forwarding stop working, effectively cutting users off from your platform.

API Gateway failure is more dangerous than a single microservice failing because microservices are usually isolated, but the gateway is a centralized control point. It sits in front of everything. So even a small misconfiguration, deployment bug, or scaling issue can cause a full system outage. This is why production systems never rely on a single gateway instance.

So how do real systems handle this?
The first layer of protection is redundancy. Instead of one API Gateway, you deploy multiple instances behind a load balancer. If one instance crashes, traffic is automatically routed to healthy ones. But here is the catch, what if the load balancer itself fails? That is why systems often use DNS-based failover or multiple regional load balancers to avoid a single point of failure at the entry layer.

Next comes geographic distribution. Large-scale systems deploy API Gateways across multiple regions. Traffic is routed using geo-DNS or Anycast routing, so even if an entire region goes down, users are automatically redirected to the nearest healthy region. This is how systems achieve high availability across continents.

Another important concept is graceful degradation. If the API Gateway is partially failing or overloaded, instead of completely rejecting traffic, it can fall back to limited functionality. For example, serving cached responses, disabling non-critical APIs, or routing only high-priority requests. This ensures the system does not completely collapse under stress.

Caching also plays a surprisingly powerful role here. Many API Gateways cache frequent responses. If downstream services become unreachable, cached responses can still be served temporarily, buying time for recovery. This is especially useful for read-heavy systems like content platforms.

Now consider rate limiting and circuit breaking at the gateway level. If downstream services are failing, the gateway can stop forwarding excessive traffic to prevent cascading failures. This isolates problems and protects the rest of the system.

So If you design it properly with multiple gateway instances, load balancing, regional failover, and fallback mechanisms, users might not even notice the failure. The system continues to operate because there is no single point of failure.

It’s a common sense to ask Why not bypass the API Gateway and let clients directly call services?

Because that removes a central control layer for authentication, monitoring, rate limiting, and routing. It increases complexity on the client side and weakens system security. Instead of removing the gateway, we make it highly available and fault tolerant.

Real-world systems like Netflix, Amazon, and Google treat API Gateways as distributed systems themselves. They are scaled horizontally, replicated across regions, monitored aggressively, and designed to fail gracefully.

The key takeaway is simple. The API Gateway is not just a router, it is a critical infrastructure component. If it goes down and you did not design for failure, your entire system goes down with it. If you design it correctly, failure becomes just another recoverable event.

Happy designing ❤️

403

388

22K

pd_singh7 retweeted

Puneet Patwari

@system_monarch

about 2 months ago

Two candidates were interviewing for a Sr. Engineer role @ Uber and were given these requirements...for their system design problem... Requirements: – The system must have low latency – It must handle sudden traffic spikes during peak events – It must enforce strict per-user rate limits Same question. Very different interviews. Hence, Very Different Results. Candidate 1: – For low latency, I will add Redis as a cache – For spikes, I will put Kafka in front of the system – For rate limiting, I will use a Redis-based token bucket On paper, it sounds fine. But this is exactly where many Senior and Staff engineers fail. They treat requirements like keywords. They hear “low latency” and respond “Redis”. They hear “spikes” and respond “Kafka”. They hear “rate limiting” and respond “token bucket”. There is no thinking in between. Candidate 2: “Before I walk through a design, I want to clarify a few things.” – When you say low latency, what are we targeting for p95 and p99? – What is the read-to-write ratio and approximate QPS at peak? – Are we rate-limiting per user, per IP, per API key, or a combination? – How strict is the rate limit? Do we drop requests or degrade gracefully? Only after that, they start. First, they sketch a simple, single region baseline. Then they check it against the numbers. Next, they layer in complexity only where it is needed. When they reach rate limiting, they do not just say “Redis token bucket”. They think out loud. – Local in process rate limiting is simple, but not safe for distributed systems. – A central Redis based limiter works, but what about race conditions and clock skew? – If we add per user, per IP, and per API key limits, how does that affect storage and cost? – What happens when Redis is down, do we fail open or fail closed? Candidate 1 threw tools at the problem. Candidate 2 modeled the problem, quantified it, then selected tools. A few weeks later, Candidate 2 got the offer. Candidate 1 kept wondering what went wrong. In 2025, with AI becoming mainstream, you cannot stand out by naming components. Any decent AI can say “use Redis, Kafka, and a token bucket”. Senior and Staff level signal is not tool recall. It is structured reasoning. If your system design answers sound like a list of technologies, you are competing with a search engine. If your system design answers sound like careful thinking under constraints, you are competing at a Senior and Staff level.

302

308

32K

pd_singh7 retweeted

clovis

@clovisdsdo

2 months ago

This is what you must do at the Cluster level to secure your Kubernetes cluster RBAC (Access Control) - Avoid cluster-admin usage (limit it strictly) - Create roles per team/service (least privilege) - Use RoleBindings instead of ClusterRoleBindings when possible - Regularly audit permissions (kubectl auth can-i) - Remove unused roles and bindings Authentication (Who can access the cluster) - Use OIDC integration (SSO) - Enforce MFA for all users - Avoid static kubeconfig sharing - Use short-lived tokens (no long-term creds) API Server Security - Restrict API access (private endpoint preferred) - Enable audit logs - Disable anonymous access - Use strong authentication/authorization modes etcd Security (Cluster data) - Enable encryption at rest - Restrict access to etcd (control plane only) - Backup etcd regularly - Secure backups (encrypted + access control) Network Policies (Pod-to-Pod traffic) - Deny all traffic by default - Allow only required communication paths - Isolate namespaces - Control east-west traffic inside the cluster Admission Control (Policy Enforcement) - Use OPA / Kyverno for policies - Enforce security standards (no privileged pods, etc.) - Validate configs before deployment - Block risky workloads at admission time Secrets Management - Use Kubernetes Secrets (not ConfigMaps) - Encrypt secrets at rest (KMS integration) - Restrict access via RBAC - Avoid mounting secrets unnecessarily Workload Isolation - Use namespaces per environment/team - Apply ResourceQuotas - Apply LimitRanges - Prevent noisy neighbor issues Logging & Monitoring - Enable audit logs - Collect control plane logs - Monitor API usage patterns - Alert on abnormal behavior Cluster Upgrades & Patching - Keep Kubernetes version up to date - Upgrade control plane and nodes regularly - Verify addon compatibility (CNI, CoreDNS, etc.) - Test upgrades before production rollout

clovisdsdo's tweet photo. This is what you must do at the Cluster level to secure your Kubernetes cluster

RBAC (Access Control)
- Avoid cluster-admin usage (limit it strictly)
- Create roles per team/service (least privilege)
- Use RoleBindings instead of ClusterRoleBindings when possible
- Regularly audit permissions (kubectl auth can-i)
- Remove unused roles and bindings

Authentication (Who can access the cluster)
- Use OIDC integration (SSO)
- Enforce MFA for all users
- Avoid static kubeconfig sharing
- Use short-lived tokens (no long-term creds)

API Server Security
- Restrict API access (private endpoint preferred)
- Enable audit logs
- Disable anonymous access
- Use strong authentication/authorization modes

etcd Security (Cluster data)
- Enable encryption at rest
- Restrict access to etcd (control plane only)
- Backup etcd regularly
- Secure backups (encrypted + access control)

Network Policies (Pod-to-Pod traffic)
- Deny all traffic by default
- Allow only required communication paths
- Isolate namespaces
- Control east-west traffic inside the cluster

Admission Control (Policy Enforcement)
- Use OPA / Kyverno for policies
- Enforce security standards (no privileged pods, etc.)
- Validate configs before deployment
- Block risky workloads at admission time

Secrets Management
- Use Kubernetes Secrets (not ConfigMaps)
- Encrypt secrets at rest (KMS integration)
- Restrict access via RBAC
- Avoid mounting secrets unnecessarily

Workload Isolation
- Use namespaces per environment/team
- Apply ResourceQuotas
- Apply LimitRanges
- Prevent noisy neighbor issues

Logging & Monitoring
- Enable audit logs
- Collect control plane logs
- Monitor API usage patterns
- Alert on abnormal behavior

Cluster Upgrades & Patching
- Keep Kubernetes version up to date
- Upgrade control plane and nodes regularly
- Verify addon compatibility (CNI, CoreDNS, etc.)
- Test upgrades before production rollout

pd_singh7 retweeted

Umesh Kumar Yadav

@Umesh__digital

about 2 months ago

10 caching concepts you must learn for backend: 0. Cache Aside (lazy loading) 1. Write Through 2. Write Behind (async) 3. Cache Invalidation 4. TTL & Explicit Eviction (LRU/LFU) 5. Cache Stampede (meltdown) 6. Distributed Cache Consistency 7. Hot Keys (skew handling) 8. Cache Warming (preload strategy) 9. Multi-layer Caching (L1/L2/CDN)

148

125

P.S Singh

@pd_singh7

Who to follow

Last Seen Users on Sotwe

Trends for you

Most Popular Users