marc @markankaro - Twitter Profile

markankaro retweeted

Pietro Schirano

@skirano

5 days ago

147

22K

2K

1K

963K

marc @markankaro

6 days ago

@Kalshi you are comparing the GDP of Canada with the market cap of SpaceX. This doesn’t make any sense, as GDP is the value of the goods/services produced in a year…

0

1

0

2K

markankaro retweeted

Nintendo of America

@NintendoAmerica

9 days ago

The Pokémon Pokopia Expansion Pass paid DLC is available for purchase today. Part 1: Bubbly Basin is planned for release this August! Enjoy an underwater town, new Pokémon to encounter, furniture, and outfits for Ditto. #NintendoDirect https://t.co/FIHb61bgj8

245

21K

3K

2K

3M

markankaro retweeted

Nintendo of America

@NintendoAmerica

9 days ago

The Legend of Zelda: Ocarina of Time will be reborn on Nintendo Switch 2 in 2026. #NintendoDirect

4K

184K

40K

14K

26M

markankaro retweeted

Tomato enjoyer 🍅 @KileKraz

11 days ago

"I'm CATALAN and what is happening with the LANGUAGE is not normal"

5

2K

109

17

56K

markankaro retweeted

Startboii (COMMS OPEN) @Startboii1

10 days ago

I got inspired again 👉🏻👈🏻 #stoneswap

176

28K

3K

895K

markankaro retweeted

Ahmad

@TheAhmadOsman

10 days ago

How to go about learning all of this?

1st: Start with the serving engine view
- vLLM: PagedAttention, continuous batching, prefix caching, CUDA graphs
- SGLang: RadixAttention/prefix reuse, speculative decoding, MoE, structured/agent workloads
- TensorRT-LLM: NVIDIA peak stack, FP8/FP4, Wide-EP, disaggregated serving
- FlashInfer: reusable kernel/operator library for attention/GEMM/MoE/sampling

2nd: Go down the stack
- Triton tutorials → custom fused kernels
- CUTLASS/CuTe → Tensor Core GEMM and Blackwell/Hopper details
- FlashAttention papers → attention algorithm/kernel co-design
- PagedAttention paper → KV-cache memory management
- MoE docs → routing + grouped GEMM + all-to-all
- Nsight profiling → stop guessing

3rd: Do this mini-project sequence
1. Implement RMSNorm in Triton; compare to PyTorch
2. Implement fused SiLU × gate
3. Implement simple FP16 matmul; compare to cuBLAS/rocBLAS
4. Implement paged KV lookup for decode attention
5. Add FP8 KV cache with per-block scales
6. Implement toy top-k sampling on GPU
7. Implement tiny MoE dispatch + grouped GEMM
8. Integrate one custom op into vLLM or SGLang and profile end-to-end

18

520

47

665

31K
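
The first mini-project in the tweet above (RMSNorm, checked against a reference) needs a trusted baseline before any kernel is written. The sketch below is plain Python, not Triton, and is not taken from the tweet. The names `rms_norm_reference` and `max_abs_diff` and the `eps` default are illustrative assumptions, and it uses the common RMSNorm definition y_i = x_i / sqrt(mean(x^2) + eps) * w_i.

```python
import math
from typing import Sequence


def rms_norm_reference(
    x: Sequence[float], weight: Sequence[float], eps: float = 1e-6
) -> list[float]:
    """Reference RMSNorm over a single vector (hypothetical helper name).

    Computes y_i = x_i / sqrt(mean(x^2) + eps) * w_i in double precision.
    """
    if len(x) != len(weight):
        raise ValueError("x and weight must have the same length")
    if len(x) == 0:
        raise ValueError("x must be non-empty")
    mean_square = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_square + eps)
    return [v * inv_rms * w for v, w in zip(x, weight)]


def max_abs_diff(a: Sequence[float], b: Sequence[float]) -> float:
    """Largest element-wise absolute difference between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return max(abs(p - q) for p, q in zip(a, b))


if __name__ == "__main__":
    hidden = [3.0, 4.0, 0.5, -2.0]
    gain = [1.0, 1.0, 1.0, 1.0]
    baseline = rms_norm_reference(hidden, gain)
    # A real comparison would pass the kernel's output instead of `baseline`.
    print(max_abs_diff(baseline, baseline))
```

A kernel's output for the same input can be passed to `max_abs_diff` together with the baseline. The acceptable tolerance depends on the kernel's dtype (FP16 and BF16 need a looser bound than FP32) and is left to the reader.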

markankaro retweeted

Theo - t3.gg

@theo

13 days ago

There is a standard. It's Agents.md. Anthropic refuses to use the standard.

201

7K

141

431

515K

markankaro retweeted

diva

@divaagurlxw

14 days ago

As an AI Engineer. Please learn
>Harness engineering, not just prompt engineering
>Context engineering, not just long prompts
>Prompt caching vs. semantic caching tradeoffs
>KV cache management, eviction, reuse, and memory pressure at scale
>Prefill vs. decode latency and why they optimize differently
>Continuous batching, paged attention, and throughput optimization
>Speculative decoding vs. quantization vs. distillation tradeoffs
>INT8, INT4, FP8, AWQ, GPTQ, and when quantization hurts quality
>Structured output failures, schema validation, repair loops, and fallback chains
>Function calling reliability, tool contracts, argument validation, and idempotency
>Agent guardrails, loop budgets, tool budgets, and termination conditions
>Model routing, graceful fallback logic, and degraded-mode UX
>RAG architecture: chunking, embeddings, hybrid search, reranking, and freshness
>Retrieval evals: recall, precision, grounding, attribution, and citation quality
>Evals: golden sets, regression tests, adversarial tests, LLM-as-judge, and human evals
>LLM observability as a first-class discipline: traces, spans, tokens, latency, errors, and drift
>Cost attribution per feature, workflow, tenant, and user journey not just per model
>Safety engineering: prompt injection defense, data leakage prevention, and permission boundaries
>Multi-tenant isolation, cache safety, and cross-user context contamination prevention
>Fine-tuning vs. in-context learning vs. RAG vs. distillation and when each is the wrong tool
>Latency, quality, cost, and reliability tradeoffs across the full inference stack
>Production failure modes: hallucinated tool calls, malformed JSON, stale retrieval, runaway agents, and silent eval regressions

109

4K

491

7K

241K
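
One item in the list above, "schema validation, repair loops, and fallback chains", can be illustrated with a small sketch. It is not from the tweet and is not tied to any model SDK. The `generate` callable stands in for whatever produces the model's text, and `parse_with_repair`, `required_keys`, `max_attempts`, and `fallback` are hypothetical names chosen for illustration. Validation here only checks for required top-level keys; a real system would likely use a full schema validator.

```python
import json
from typing import Callable, Optional


def parse_with_repair(
    generate: Callable[[str], str],
    prompt: str,
    required_keys: set[str],
    max_attempts: int = 3,
    fallback: Optional[dict] = None,
) -> dict:
    """Ask for JSON, validate it, and retry with the error fed back.

    `generate` is an assumed stand-in for a model call: it takes a prompt
    string and returns the raw reply text.
    """
    current_prompt = prompt
    error = "no attempts made"
    for _ in range(max_attempts):  # loop budget: never retry forever
        raw = generate(current_prompt)
        try:
            data = json.loads(raw)
        except json.JSONDecodeError as exc:
            error = f"invalid JSON: {exc.msg}"
        else:
            if not isinstance(data, dict):
                error = "top-level value must be a JSON object"
            else:
                missing = required_keys - set(data)
                if not missing:
                    return data  # valid on this attempt
                error = f"missing keys: {sorted(missing)}"
        # Repair step: restate the task and include the validation error.
        current_prompt = (
            f"{prompt}\n\nYour previous reply was rejected ({error}). "
            "Reply with a JSON object only."
        )
    if fallback is not None:
        return fallback  # degraded-mode result instead of an exception
    raise ValueError(f"no valid reply after {max_attempts} attempts: {error}")
```

Capping the attempts and returning an explicit fallback keeps the failure mode bounded: the caller gets either a validated object, a known default, or an exception, never an unvalidated string.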

markankaro retweeted

Dexerto @Dexerto

14 days ago

LEGO has revealed its largest set ever, based on the Sagrada Família church

It will have over 12,000 pieces and cost $800

221

18K

423

1K

3M

marc @markankaro

15 days ago

@TheAhmadOsman you changed your mind on quantized kv cache?

0

94

markankaro retweeted

NVIDIA AI

@NVIDIAAI

22 days ago

Introducing Dynamo Snapshot, our approach for fast startup for inference workloads on Kubernetes, which reduces startup time from minutes to under 5 seconds.

In production inference deployments demand fluctuates over time. Cold-starting inference workloads can take minutes, leaving idle GPUs that generate no tokens and serve no requests.

Snapshot leverages GMS to enable concurrent weight restoration over a high-speed interconnect, while using Linux native AIO and parallel memfd restoration to accelerate CRIU restore performance.

23

361

53

146

62K

markankaro retweeted

Sasha @SunSharkSasha

23 days ago

New discord feature about to be like this

176

220K

25K

21K

8M

markankaro retweeted

Benjamin Marie

@bnjmn_marie

23 days ago

Unless you’re ready to spend serious time (and money) tuning hyperparameters, don’t mess with LLM reasoning traces.

I evaluated multiple reasoning budgets and BNF grammar / structured CoT settings on Qwen3.6 27B.

The results are underwhelming.

Yes, it can work: for a few specific tasks, it significantly reduces inference cost by shortening reasoning traces while preserving accuracy.

But in most settings, simply disabling reasoning is better, both for token efficiency and accuracy.

Full analysis here:
https://t.co/xxLLzVkASx

18

168

13

91

24K

markankaro retweeted

vLLM

@vllm_project

25 days ago

Thanks to the community report, we recently identified a PR https://t.co/QWboSmskkF that attempted to solve a non-existent issue and was submitted as part of a “PR training” workflow for resume building.

The contributor involved has been banned from the vLLM community.

This kind of low-signal contribution increases maintainer review overhead and creates unnecessary operational costs for open-source projects.

As AI coding agents make generating large volumes of small PRs increasingly cheap, open-source communities will need to explore new ways to preserve contribution quality and reviewer trust.

While we are investigating how to deal with AI slop, we continue to highly value contributions from real users solving real production problems.

If you have an important contribution that has not yet received maintainer attention, please email us at:

pr-review-request@vllm.ai

Using a verifiable company or university email, include:
- your production or research use case
- the problem you encountered
- how your contribution addresses it

This helps us better prioritize impactful contributions while keeping the vLLM community open and collaborative.

As AI makes virtual contributors look increasingly real, authentic human collaboration matters more than ever.

vLLM’s mission remains unchanged: to make LLM inference easy, fast, and cheap for everyone — and we will continue working toward that goal.

26

504

63

118

191K

markankaro retweeted

Kirill

@kirillk_web3

28 days ago

instead of watching 2 hours of Netflix tonight, watch this 40-minute masterclass from the founder of a $20B China AI company

it's the clearest explanation I've seen of how Agent Swarms and AI systems actually work at scale

useful whether you've never built an agent in your life or have been using Claude every day for the past year

I took the key ideas and turned them into a practical guide on how to actually build with Kimi

find it below

97

17K

2K

56K

14M

markankaro retweeted

Pliny the Liberator 🐉󠅫󠄼󠄿󠅆󠄵󠄐󠅀󠄼󠄹󠄾󠅉󠅭

@elder_plinius

26 days ago

🚨 OBLITERATION ALERT 🚨

QWEN-3.6-27B: OBLITERATED ⛓️‍💥 https://t.co/AScXN4XLwx

I can't take much credit for this one! The entire process was done by jailbroken codex (gpt-5.5-xhigh) wielding the full OBLITERATUS suite. Hit with source-tethered ASPA. Dozens of iterations.

Result? A mere 4% refusal rate on the 842-prompt OBLITERATUS harmful corpus; one of the most rigorous prompt gauntlets in AI.

The /goal was simple:
1) Carve out the refusal circuits. Mutate methodology + iterate until <5% refusal (quality-gate).
2) Keep the 27B mind alive. No capability degradation tolerated.

And somehow… it worked. 🤯

The numbers talk:

842-pair longform gauntlet:
- 95.84% non-refusal
- 93.94% quality pass
- 0 short outputs
- 99.52% clean endings

MMLU-Pro:
- 51/70 (stock Qwen) → 51/70 (OBLITERATED Qwen)

Raw capability completely preserved 🙌

Q4_K_M through Q8_0 all running smooth. Q8_0 is the big one: 28.6GB near-full-quality GGUF. Runs with llama.cpp, LM Studio, Ollama, and more!

Chains cut. The fire still burns. The fangs have been sharpened.

REBIRTH COMPLETE

A gift from my agents to yours 🫶

gg

114

2K

230

2K

186K

marc @markankaro

27 days ago

https://t.co/ir3CKFIUvy

EL MUNDO

@elmundoes

27 days ago

A young man doing a handstand in Pinos Puente (Granada) falls off a bridge and has to be rescued https://t.co/rBprRFuVIP

239

6K

753

856

1M

0

13

markankaro retweeted

EL MUNDO

@elmundoes

27 days ago

A young man doing a handstand in Pinos Puente (Granada) falls off a bridge and has to be rescued https://t.co/rBprRFuVIP

239

6K

753

856

1M

markankaro retweeted

Andrej Karpathy

@karpathy

30 days ago

Personal update: I've joined Anthropic. I think the next few years at the frontier of LLMs will be especially formative. I am very excited to join the team here and get back to R&D. I remain deeply passionate about education and plan to resume my work on it in time.

8K

150K

11K

14K

28M
