kubesimplify

1 day ago

Helm chart anti-pattern: templating your model name + version into values.yaml.

Why it's bad:
Every model promotion becomes a git commit → PR review → deploy pipeline.
That's minutes or hours when you need seconds.

Model metadata should live in a model registry or be injected as an annotation at deploy time.
Your serving layer pulls the artifact.
Helm never needs to know.

Use Helm for infrastructure. Don't use it as a model registry.
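One way to read "injected as an annotation at deploy time" is a small merge patch applied outside of Helm. The sketch below is only an illustration: the annotation key `example.com/model-uri` and the model URI are hypothetical, and it assumes your serving layer reads that annotation and pulls the artifact itself.

```python
import json

# Hypothetical annotation key; use whatever your serving layer actually reads.
MODEL_ANNOTATION = "example.com/model-uri"


def model_promotion_patch(model_uri: str) -> dict:
    """Build a JSON merge patch that sets the model annotation on a Deployment's pod template."""
    return {
        "spec": {
            "template": {
                "metadata": {"annotations": {MODEL_ANNOTATION: model_uri}}
            }
        }
    }


if __name__ == "__main__":
    # The URI is a made-up example of a registry reference.
    print(json.dumps(model_promotion_patch("models:/fraud-detector/42")))
```

The printed JSON could be passed to `kubectl patch deployment <name> --type merge -p '<json>'`. Because the patch changes the pod template, it triggers a rollout without touching values.yaml.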

1 day ago

If your platform team is constantly fighting fires, you don't have a platform.
You have a service desk that runs Kubernetes.

A real platform offers self-service, golden paths and paved roads. It makes the right thing easy.

If your team spends more time on tickets than on developer experience, that's not platform engineering.

That's operations with extra steps.
Firefighting is a symptom, not a strategy.

2 days ago

Confidential computing on H100 is no longer a research demo.
NVIDIA Confidential Compute Mode is GA.
Encrypted GPU memory.
Attestation.
Works on Kubernetes with the right operator.

Use cases that need it:
→ Healthcare AI inference
→ Financial models with regulated data
→ Multi-party ML training

If your customers ask "but where does the data go?" this is the answer.

kubesimplify retweeted

CNCF

@CloudNativeFdn

2 days ago

KubeCon + CloudNativeCon India is just over two weeks away. Join @SaiyamPathak and Saloni in Mumbai this June 18-19 to talk real-world cloud native, production setups, and connect with the community building the future of open source. Follow Saiyam to get a special 25% discount, and register here: https://t.co/QHujZZUYaF

3 days ago

Day 9: Node Affinity vs NodeSelector vs Taints.

They are NOT interchangeable.

NodeSelector = exact match, zero fallback
Node Affinity = hard constraints OR soft preferences
Taints = node says "keep out" unless you have the key (a matching toleration)

Pick wrong → outage. Pick right → peace.

Quick explainer breaking down the exact decision tree ↓

#kubernetes #devops #cloudnative #k8s #engineering
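To make the three mechanisms concrete, here is a rough sketch of the pod spec fragments as Python dicts, plus a simplified version of the toleration match rule. The label keys, values and weights are hypothetical, and `tolerates` is a teaching aid, not the scheduler's actual code.

```python
# 1) nodeSelector: exact label match; the pod stays Pending if no node matches.
node_selector = {"nodeSelector": {"accelerator": "nvidia-h100"}}

# 2) Node affinity: one hard requirement plus one soft preference.
node_affinity = {
    "affinity": {
        "nodeAffinity": {
            "requiredDuringSchedulingIgnoredDuringExecution": {
                "nodeSelectorTerms": [
                    {"matchExpressions": [
                        {"key": "accelerator", "operator": "In",
                         "values": ["nvidia-h100", "nvidia-a100"]}
                    ]}
                ]
            },
            "preferredDuringSchedulingIgnoredDuringExecution": [
                {"weight": 50, "preference": {"matchExpressions": [
                    {"key": "zone", "operator": "In", "values": ["zone-a"]}
                ]}}
            ],
        }
    }
}

# 3) Taint on the node, matching toleration on the pod.
taint = {"key": "dedicated", "value": "gpu", "effect": "NoSchedule"}
toleration = {"key": "dedicated", "operator": "Equal",
              "value": "gpu", "effect": "NoSchedule"}


def tolerates(toleration: dict, taint: dict) -> bool:
    """Simplified form of the Kubernetes toleration match rules."""
    # An empty effect on the toleration matches every effect.
    if toleration.get("effect") and toleration["effect"] != taint["effect"]:
        return False
    if toleration.get("operator", "Equal") == "Exists":
        # An empty key with Exists tolerates every taint.
        return not toleration.get("key") or toleration["key"] == taint["key"]
    return (toleration.get("key") == taint["key"]
            and toleration.get("value") == taint.get("value"))
```

Note the asymmetry: a toleration only lets a pod onto a tainted node, it does not attract the pod there. Pairing a taint with a nodeSelector or affinity rule is the usual way to get dedicated nodes.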

kubesimplify retweeted

4 days ago

Got 20-30 minutes today? give this a read -> https://t.co/Y5E7FOOx6s

3 days ago

@SaiyamPathak lets go!

kubesimplify retweeted

3 days ago

Running AI on Kubernetes and so much more - curated paid live workshop. Who would be in? If there is enough interest - I might just do it!

3 days ago

Quick one:
Image pull throttling on AI worker nodes is the silent killer of cluster reliability.
Mirror your registry.
Set imagePullPolicy: IfNotPresent.
Pin by digest.
Or pay the egress bill and explain to finance later. https://t.co/Nv7idqAKea
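The "pin by digest" and `IfNotPresent` advice can be enforced with a small check before manifests are rendered. This is a minimal sketch: the helper names and the registry host in the usage note are hypothetical, and the check only accepts `sha256` digests.

```python
import re

# An image reference is digest-pinned when it ends in @sha256:<64 hex chars>.
DIGEST_RE = re.compile(r"@sha256:[0-9a-f]{64}$")


def is_pinned_by_digest(image: str) -> bool:
    """Return True if the image reference is pinned to a sha256 digest."""
    return DIGEST_RE.search(image) is not None


def container_spec(name: str, image: str) -> dict:
    """Build a container entry that refuses tag-only images and avoids re-pulls."""
    if not is_pinned_by_digest(image):
        raise ValueError(f"image for {name!r} is not pinned by digest: {image}")
    return {"name": name, "image": image, "imagePullPolicy": "IfNotPresent"}
```

For example, `container_spec("worker", "mirror.example.com/vllm@sha256:" + "a" * 64)` returns a container entry, while a reference such as `vllm:latest` raises `ValueError`. With a digest, `IfNotPresent` is safe because the cached layers can never silently differ from what the reference names.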

Garvit Kulshrestha @techwithgarvit

3 days ago

Why your DRA setup is slower than vanilla https://t.co/QvDtzFfAjX

→ Driver plugin not co-located → cold start on every claim
→ DeviceClass selectors too broad → scans everything
→ ResourceClaim lifetime too short → constant recreate churn
→ Webhook timeouts during structured param validation

DRA isn't slow.
Misconfigured DRA is slow.
The fix list 👇

kubesimplify retweeted

5 days ago

AI Infrastructure Meetup by @vclusterlabs and @cloudera.

Informative sessions by a good set of speakers. Saturday well spent. 💯

@SaiyamPathak @hrittikhere https://t.co/Pjr97fuNaM

kubesimplify retweeted

5 days ago

Saturday morning housefull at AI Infrastructure meetup. lets go!!

5 days ago

@SaiyamPathak @NVIDIAAI A Must read indeed.

kubesimplify retweeted

5 days ago

This is the best local LLM series I am writing, using @NVIDIAAI DGX Spark. Two parts are already out and it will only get better with every post. This is a 7-part series! You should definitely read this if you want to master the local LLM game. https://t.co/OZNXJbSQoZ

5 days ago

@YouTubeIndia This one !

6 days ago

vNode is quietly becoming the default for tenant isolation in 2026.

Linux user namespaces + seccomp. No VM overhead. Real isolation.

Use cases that demand it:
→ Adversarial AI agents executing tools
→ CI runners on shared clusters
→ Customer code in your platform
→ Multi-tenant labs with shared credentials

Kubernetes namespaces alone are not enough.
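The two kernel primitives named above can also be requested through standard pod fields. The sketch below is not vNode's configuration; it only illustrates user namespaces (`hostUsers: false`) and the runtime's default seccomp filter, and assumes a cluster and container runtime that support user namespaces. The helper name is hypothetical.

```python
def harden_pod_spec(pod_spec: dict) -> dict:
    """Return a copy of pod_spec with user namespaces and default seccomp enabled."""
    hardened = dict(pod_spec)
    # hostUsers: false asks the kubelet to run the pod in its own Linux user namespace.
    hardened["hostUsers"] = False
    security_context = dict(hardened.get("securityContext", {}))
    # RuntimeDefault applies the container runtime's default seccomp filter.
    security_context["seccompProfile"] = {"type": "RuntimeDefault"}
    hardened["securityContext"] = security_context
    return hardened
```

The function copies the input rather than mutating it, so the same base spec can be rendered both with and without the hardening for comparison.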

6 days ago

Bengaluru, see you tomorrow! 🚀 Saiyam will be at the AI Infrastructure Meetup hosted by vCluster, connecting with the community and talking all things AI infrastructure, Kubernetes and platform engineering. Come say hi if you’re attending #AIInfrastructure #Kubernetes #vCluster

6 days ago

Hot take:
Most "AI inference platforms" should just be KServe + a thin internal API.

Stop buying $200K/yr SaaS.

KServe handles 90% of what your ML team needs. The other 10% is a 50-line wrapper.

The hard part isn't serving.
It's everything around it. https://t.co/wsYHI1W0JO
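As a rough idea of what the thin wrapper might contain, the sketch below renders a KServe `InferenceService` manifest from a handful of internal parameters. The function name, defaults and example values are assumptions; only the `serving.kserve.io/v1beta1` resource shape comes from KServe itself.

```python
def inference_service(name: str, namespace: str, model_format: str,
                      storage_uri: str, min_replicas: int = 1) -> dict:
    """Render a minimal KServe v1beta1 InferenceService manifest as a dict."""
    return {
        "apiVersion": "serving.kserve.io/v1beta1",
        "kind": "InferenceService",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "predictor": {
                "minReplicas": min_replicas,
                "model": {
                    # modelFormat selects the serving runtime (for example sklearn).
                    "modelFormat": {"name": model_format},
                    # storageUri tells KServe where to pull the model artifact from.
                    "storageUri": storage_uri,
                },
            }
        },
    }
```

A call such as `inference_service("fraud", "ml-team", "sklearn", "s3://models/fraud/v42")` (made-up values) yields a dict that can be serialized to YAML or JSON and applied to the cluster. Authentication, quotas and naming rules are the "everything around it" part that the wrapper would add.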

7 days ago

If you're checkpointing distributed training to a ReadWriteMany PVC and wondering why epoch saves take 12 minutes:

It’s the PVC.

RWX network filesystems become a bottleneck fast under large distributed checkpoint writes.

Use:
• async uploads to object storage + local scratch disks
• or a data orchestration layer like Fluid

Don’t force network filesystems to behave like high-throughput local storage for massive AI workloads.
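The first option above (local scratch plus async upload) can be sketched with the standard library alone. Everything here is illustrative: `AsyncCheckpointer` is a hypothetical helper, and the `upload` callable stands in for whatever object storage client you use (the demo simply copies into a second temporary directory).

```python
import os
import shutil
import tempfile
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Callable, List


class AsyncCheckpointer:
    """Write checkpoints to local scratch, then upload them in the background."""

    def __init__(self, scratch_dir: str, upload: Callable[[str], object],
                 max_workers: int = 2) -> None:
        self.scratch_dir = scratch_dir
        self.upload = upload
        self.pool = ThreadPoolExecutor(max_workers=max_workers)
        self.pending: List[Future] = []

    def save(self, step: int, payload: bytes) -> Future:
        # The fast local write is the only part the training loop waits for.
        path = os.path.join(self.scratch_dir, f"ckpt-{step}.bin")
        with open(path, "wb") as f:
            f.write(payload)
        # The slow upload runs on a worker thread.
        future = self.pool.submit(self.upload, path)
        self.pending.append(future)
        return future

    def close(self) -> None:
        # Surface any upload errors, then stop the worker threads.
        for future in self.pending:
            future.result()
        self.pending.clear()
        self.pool.shutdown()


def demo() -> bytes:
    """Save one checkpoint and read it back from the stand-in 'bucket'."""
    with tempfile.TemporaryDirectory() as scratch, tempfile.TemporaryDirectory() as bucket:
        ckpt = AsyncCheckpointer(scratch, lambda path: shutil.copy(path, bucket))
        ckpt.save(step=1, payload=b"weights")
        ckpt.close()
        with open(os.path.join(bucket, "ckpt-1.bin"), "rb") as f:
            return f.read()
```

The design choice is that `save` returns as soon as the local write finishes, so the training step is bounded by scratch disk speed rather than network throughput. The scratch file has to stay in place until its upload completes, which is why `close` waits on every pending future.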
