mahesh thipparthi

@mthipparthi

Loves Data, Python

Sydney, Australia

Joined May 2009

3.1K Following

259 Followers

5.1K Posts

mthipparthi retweeted

sunil pai

@threepointone

8 days ago

Good to see this post from our team PM @thomasgauvin. I've been _extremely_ deep in agent reliability for a bit now. Seen things you people wouldn't believe. (Attack ships on fire off the shoulder of Orion...) Flue, Think, and I bet many more in the future, can find a reliable _machine_ in using the agents sdk to build ambitious things. Today is day 1 in that journey. More to come from the platform(s), runtime, libraries, ecosystem. Back to work.

156

37K

mthipparthi retweeted

jack

@jack

8 days ago

we're going to start talking a lot more about our intelligence tools. this is the beginning of the beginning.

223

307

mthipparthi retweeted

Vivek Galatage

@vivekgalatage

15 days ago

One of the finest https://t.co/QmUpWFPcKw

193

56K

mthipparthi retweeted

0xkato

@0xkato

22 days ago

LLMs explained without all that yucky math stuff https://t.co/cpgVukSbJx

183

547K

mthipparthi retweeted

Noam Dworman

@noam_dworman

21 days ago

Second for second, @tylercowen packs more substance into a talk than anyone I'm aware of. This is a clear, non-hysterical, and somewhat soothing discussion of our AI future.

269

646K

mthipparthi retweeted

Arpit Bhayani

@arpit_bhayani

24 days ago

Yesterday, I was building a small Agentic PR reviewer for myself, and I realised how critical topological sort was in its entire application.

Essentially, every AI workflow is fundamentally a dependency problem. We do not just chain the prompts together, but also have situations where we may need to do things in an order that respects the dependencies.

This is where the classic DAG and topological sort come in handy.

I just wrote an essay on why DAGs and topological sorting are the core primitives required to design, debug, and scale AI workflows.

In this article, I break down how the dependency problem breaks linear pipelines, how to unlock "free" parallelism and cycle detection at definition time, and how this mental model scales seamlessly to multi-agent orchestration.

If you are building AI workflows (which I am sure you are), RAG pipelines, or multi-agent systems, this essay will give you a solid first-principles framework.

Give it a read.
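
As a rough illustration of the idea in the post above (not code from the essay), here is a minimal sketch using Python's standard-library `graphlib.TopologicalSorter`. The workflow and its step names (`fetch_diff`, `lint`, and so on) are hypothetical. It shows the two properties the post highlights: steps whose dependencies are all met can be grouped into batches that could run in parallel, and a cycle is detected before anything runs.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical PR-review workflow: each step maps to the steps it depends on.
WORKFLOW = {
    "fetch_diff": set(),
    "summarize": {"fetch_diff"},
    "lint": {"fetch_diff"},
    "security_scan": {"fetch_diff"},
    "final_review": {"summarize", "lint", "security_scan"},
}


def parallel_batches(graph):
    """Group steps into batches; every step in a batch has all dependencies met."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()  # raises CycleError if the graph contains a cycle
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted only to keep output stable
        batches.append(ready)
        sorter.done(*ready)  # mark the whole batch finished, unlocking dependents
    return batches


def has_cycle(graph):
    """Check the graph at definition time, before any step is executed."""
    try:
        TopologicalSorter(graph).prepare()
    except CycleError:
        return True
    return False


BATCHES = parallel_batches(WORKFLOW)
for number, batch in enumerate(BATCHES, start=1):
    print(f"batch {number}: {batch}")
```

In this sketch the three middle steps land in the same batch, which is the "free" parallelism: nothing in the graph forces them to run one after another.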

684

518

30K

mthipparthi retweeted

Ben Dicken

@BenjDicken

2 months ago

Database table size impacts performance in more ways than one:

a) B-tree depth. Using 8k pages and a 16-byte UUID:

1 level = ~370 rows
2 levels = ~138k rows
3 levels = ~50M rows
4 levels = ~20B rows

The lookup cost on a table with 100k rows is not the same as on one with 1B rows. This can apply both to the table itself (MySQL clustered index) and to the indexes. Sometimes a single query requires many of them.

b) Small table → fits in RAM → fast reads. The larger the table, the more likely it is to read from disk and churn the cache.

c) # of indexes. Each adds maintenance overhead for insertions, and for Postgres vacuum overhead as well.

Keep an eye on this! It's useful to take regular stock of your tables + indexes. Clean bloat. Remove unused indexes. Partition if needed.
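
The row counts quoted in the post above are consistent with a fanout of roughly 370 entries per page. The sketch below reproduces that arithmetic. The 22-byte entry size (a 16-byte key plus about 6 bytes of pointer and per-entry overhead) is an assumption chosen to match the quoted figures, not a number from the post, and real engines differ in page headers and fill factor.

```python
PAGE_BYTES = 8 * 1024  # 8k pages, as stated in the post
ENTRY_BYTES = 22       # ASSUMPTION: 16-byte UUID key + ~6 bytes pointer/overhead

FANOUT = PAGE_BYTES // ENTRY_BYTES  # entries that fit in one page


def capacity(levels: int) -> int:
    """Approximate number of rows addressable by a tree with this many levels."""
    return FANOUT ** levels


def levels_needed(rows: int) -> int:
    """Smallest number of levels whose capacity covers the given row count."""
    levels = 1
    while capacity(levels) < rows:
        levels += 1
    return levels


for depth in range(1, 5):
    print(f"{depth} level(s): ~{capacity(depth):,} rows")

print("100k rows ->", levels_needed(100_000), "levels")
print("1B rows   ->", levels_needed(1_000_000_000), "levels")
```

Under these assumptions a 100k-row table needs two levels and a 1B-row table needs four, so each lookup on the larger table touches about twice as many pages.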

122

212K

mthipparthi retweeted

Armin Ronacher ⇌

@mitsuhiko

10 months ago

In the past I usually wrote tests against postgres stuff by creating a transaction and rolling back. This is getting harder and harder, particularly if you have more than one service. I wonder what people do nowadays for tests to run efficiently and concurrently.
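
For readers unfamiliar with the pattern mentioned above, here is a minimal sketch of "run the test inside a transaction, then roll back". It uses Python's built-in `sqlite3` as a stand-in for Postgres, which is an assumption made only so the example is self-contained; the helper name `rolled_back` and the `users` table are hypothetical. It also shows why the approach gets harder with several services: the isolation only holds for work done on this one connection.

```python
import sqlite3
from contextlib import contextmanager


@contextmanager
def rolled_back(conn):
    """Run a test body inside a transaction that is always rolled back."""
    conn.execute("BEGIN")
    try:
        yield conn
    finally:
        conn.execute("ROLLBACK")  # discard everything the test wrote


def make_db():
    # isolation_level=None stops the sqlite3 module from opening transactions
    # implicitly, so BEGIN and ROLLBACK above are fully under our control.
    conn = sqlite3.connect(":memory:", isolation_level=None)
    conn.execute("CREATE TABLE users (name TEXT)")
    return conn


def count_users(conn):
    return conn.execute("SELECT COUNT(*) FROM users").fetchone()[0]


conn = make_db()
with rolled_back(conn) as tx:
    tx.execute("INSERT INTO users VALUES ('alice')")
    inside = count_users(tx)  # the row is visible inside the transaction
after = count_users(conn)     # and gone once the transaction is rolled back

print(inside, after)
```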

131

22K

mthipparthi retweeted

Stanislav Kozlovski

@kozlovski

over 1 year ago

You have 800GB worth of unique IDs.

How do you solve the cardinality problem using just 120MB?

Simple. Read the story of how Reddit did it: 👇

In 2017, Reddit wanted to better communicate the scale of its communities to its users.

The easiest way to do that? Show a view counter.

But with scale come challenges. 😓

Naively storing a set of unique IDs as longs (8 bytes each) can quickly rack up memory and disk usage - a single 10 million view post is 80MB in that implementation.

Considering you need to:

• read
• modify
• persist

this collection every time a user views the post for the first time, you can imagine how it can become expensive. 💰

Now apply this to thousands of such posts (old ones, etc.)

• 10k posts with 10 million views equal 800GB.

Almost a terabyte which needs to be accessed concurrently in the system as views come and go on all posts... It suddenly becomes a very hard problem to solve.

Thankfully, Reddit realized they can use a very specific set of algorithms that are a perfect fit for this sort of Big Data problem:

✨ Sketch Algorithms ✨

Sketch algorithms are a set of algorithms that trade off accuracy for disproportionately massive efficiency gains.

In other words, their result isn't 100% correct. But that's fine because their main benefits are:

• small & consistent size 📏
• sub-linear space growth - input data grows linearly while space requirement does not. 🐣
• mathematically-proven error bound 🙅‍♂️
• and more...

Because of that, sketch algos see wide use across the industry.

The sketch algorithm Reddit used is called HyperLogLog. They leveraged its implementation in Redis, which is designed such that:

• it supports up to 2^64 elements - 18 quintillion 🤯
• it uses up to 12KB of memory.

That’s just 0.015% of the 80MB the naive implementation needs for a 10 million view post, and said percentage only becomes smaller the more input elements we add.

In other words, there is no space growth - you can count up to 18 quintillion distinct objects in 12KB of memory.

• its standard error is 0.81%

This means that if you have a set of 1 million IDs, the algorithm will typically return a value between 991,900 and 1,008,100.

A very acceptable error for such massive memory savings. One could call it negligible.

Now... of course they use Kafka 😎

It’s actually the key part of their data pipeline. (are you surprised?)

The end to end flow of their event counting looks roughly like this:

1. a user views a post → an event gets fired into an event collector server.
2. this server batches the events and produces them to Kafka.
3. Nazar, a Kafka consumer app, reads each event and decides whether it should be counted or not (based on rules in Redis).
4. Nazar produces the event back into Kafka with a boolean denoting the decision on whether to count it.
5. Abacus, another Kafka consumer app, reads the processed events and attempts to count each valid event.
6. To execute the counting, Abacus uses the HyperLogLog data structure in Redis and persists it to Cassandra every 10 seconds. This helps restore it in case it’s evicted from Redis’ memory.

That's it.

A very simple pipeline, using a well-abstracted "simple" algorithm which solves a very hard big data problem.
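
To make the numbers and the algorithm above more concrete, here is a small, simplified HyperLogLog sketch in Python. It is an illustration only and not Redis's or Reddit's implementation: Redis packs 16,384 six-bit registers into 12KB and applies further corrections, while this sketch spends one byte per register and uses only the basic estimator with a small-range correction. The class name, the SHA-1 hash choice, and the `user-N` IDs are assumptions for the example.

```python
import hashlib
import math

# Arithmetic from the post (decimal units).
NAIVE_BYTES_PER_POST = 10_000_000 * 8              # 80 MB of 8-byte IDs
NAIVE_TOTAL_BYTES = 10_000 * NAIVE_BYTES_PER_POST  # 800 GB for 10k such posts
HLL_TOTAL_BYTES = 10_000 * 12_000                  # 120 MB at 12 KB per post


class HyperLogLog:
    """Simplified HyperLogLog: fixed memory, approximate distinct count."""

    def __init__(self, p: int = 14):
        self.p = p                           # bits of the hash used as register index
        self.m = 1 << p                      # number of registers (16,384 for p=14)
        self.registers = bytearray(self.m)   # one byte per register in this sketch

    def add(self, item: str) -> None:
        digest = hashlib.sha1(item.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")   # take 64 bits of the hash
        width = 64 - self.p
        index = h >> width                      # first p bits pick the register
        rest = h & ((1 << width) - 1)           # remaining bits
        rank = width - rest.bit_length() + 1    # position of the leftmost 1-bit
        if rank > self.registers[index]:
            self.registers[index] = rank        # keep the maximum rank seen

    def count(self) -> int:
        m = self.m
        alpha = 0.7213 / (1 + 1.079 / m)        # bias constant for m >= 128
        estimate = alpha * m * m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if estimate <= 2.5 * m and zeros:       # small-range (linear counting) fix
            estimate = m * math.log(m / zeros)
        return round(estimate)


hll = HyperLogLog()
for i in range(200_000):
    hll.add(f"user-{i}")
    hll.add(f"user-{i}")  # repeat views by the same user do not change the count

print("estimated unique viewers:", hll.count())
print("registers:", len(hll.registers))
```

The register array never grows, no matter how many IDs are added, which is where the fixed memory footprint comes from. With 16,384 registers the expected relative error is about 1.04/sqrt(16384) ≈ 0.81%, matching the figure quoted for Redis.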

649

104

699

38K

mthipparthi retweeted

Hussein Nasser

@hnasr

over 1 year ago

Cloudflare built a global cache purge system that runs under 150 milliseconds. This is how they did it.

Using RocksDB to maintain local CDN cache, a peer-to-peer distributed system across data centers, and clever engineering, they went from a 1.5 second purge down to 150 ms.

However, this isn’t the full picture, because that 150 ms is actually the P50.

In this video I explore how Cloudflare's CDN works, and how the old core-based, centralized Quicksilver lazy purge compares to the new coreless, decentralized active purge.

I explore the pros and cons of both systems and give you my thoughts on this system.

One of my favorite videos.

Video: https://t.co/z7HlOFPX27
Audio: https://t.co/WzZDNPuwM9

310

184

17K

mthipparthi retweeted

Vivek Galatage

@vivekgalatage

over 1 year ago

I just read an exciting blog from Zerodha about sending signed PDF reports for daily trading transactions! The scale of operations and the quick turnaround time with their new architecture are fascinating! https://t.co/lIpdNSzBAy

117

mthipparthi retweeted

Simon Willison

@simonw

over 1 year ago

Wrote up some notes on Cloudflare's fascinating new SQLite-backed "Durable Objects" system, which encourages an architectural style where your application creates thousands of tiny read-write SQLite databases scattered across Cloudflare's network https://t.co/ApbEQAuryo

615

424

64K

mthipparthi retweeted

Andrej Karpathy

@karpathy

over 1 year ago

NotebookLM is quite powerful and worth playing with https://t.co/EMHIjc15iU

It is a bit of a re-imagination of the UIUX of working with LLMs organized around a collection of sources you upload and then refer to with queries, seeing results alongside and with citations.

But the newest and most impressive feature (that is surprisingly hidden almost as an afterthought) is the ability to generate a 2-person podcast episode based on any content you upload. For example someone took my "bitcoin from scratch" post from a long time ago: https://t.co/7ajZNZ0BGi and converted it to a podcast, quite impressive: https://t.co/ZZn0LJgsnu

You can podcastify *anything*. I gave it train_gpt2.c (C code that trains GPT-2): https://t.co/gDrAqix4Iv and made a podcast about that: https://t.co/bgcwmQr5d7 I don't know if I'd exactly agree with the framing of the conversation and the emphasis or the descriptions of layernorm and matmul etc. but there's hints of greatness here and in any case it's highly entertaining.

Imo LLM capability (IQ, but also memory (context length), multimodal, etc.) is getting way ahead of the UIUX of packaging it into products. Think Code Interpreter, Claude Artifacts, Cursor/Replit, NotebookLM, etc. I expect (and look forward to) a lot more and different paradigms of interaction than just chat.

That's what I think is ultimately so compelling about the 2-person podcast format as a UIUX exploration. It lifts two major "barriers to enjoyment" of LLMs.

1. Chat is hard. You don't know what to say or ask. In the 2-person podcast format, the question asking is also delegated to an AI so you get a lot more chill experience instead of being a synchronous constraint in the generating process.
2. Reading is hard and it's much easier to just lean back and listen.

244

829K

mthipparthi retweeted

Andrej Karpathy

@karpathy

almost 2 years ago

Haha we've all been there. I stumbled by this tweet earlier today and tried to write a little utility that auto-generates a git commit message based on the git diff of staged changes. Gist: https://t.co/1SbQsHSNwK So just typing `gcm` (short for git commit -m) auto-generates a one-line commit message and lets you accept, edit, regenerate or cancel. Might be fun to experiment with. Uses the excellent `llm` CLI util from @simonw https://t.co/LnHeCSfiHc

189

332

596K

mthipparthi retweeted

Simon Willison

@simonw

almost 2 years ago

Anyone who professionally builds for the Web should read the Reckoning series by @slightlylate. Start with this case-study of how the California food stamps signup site takes 29.5s to become interactive on a rural internet connection: https://t.co/dOjEZHUngP

256

194

27K

mthipparthi retweeted

Fernando

@Franc0Fernand0

almost 2 years ago

In which order does an SQL query run?

Understanding the order in which SQL queries run is critical to optimizing them.

Typically, SQL queries are processed in a logical order that differs from the one in which the SQL statements are written.

Here is the logical order in which SQL queries are processed:

1. FROM + JOIN

The first step is to process the data sources (tables, views, etc.) specified in the FROM clause. The data is read from the tables and combined from multiple sources if joins exist.

2. WHERE

The WHERE clause filters the rows based on the given conditions. Rows that do not meet the conditions are not considered for further processing.

3. GROUP BY

If a GROUP BY clause is present, rows are grouped based on the specified columns. Aggregate functions (like SUM, COUNT, or AVG) are applied to each group.

4. HAVING

If a HAVING clause is present, the groups are filtered according to aggregate conditions. Only groups that meet the conditions are kept.

5. SELECT

The SELECT clause is then applied to the result set. Columns are selected to compose the result data.

6. ORDER BY

If an ORDER BY clause is present, the result data is sorted based on the specified columns.

7. LIMIT/OFFSET

If there is a LIMIT or OFFSET clause (used in some database systems), the final result set is trimmed to the requested row count and offset.

The main takeaways from this execution order are:

- Use the WHERE clause effectively to reduce the size of the data set early in the query process

- Since the HAVING clause is executed after the WHERE and GROUP BY clauses, move any filter conditions that don't depend on aggregation from HAVING to WHERE

- LIMIT and OFFSET clauses are applied late in the query process, so they mainly affect the final result set and often do little to reduce the work done by the earlier steps
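
The logical order described above can be seen in a small, self-contained example. This is a sketch using Python's built-in `sqlite3`; the `orders` table and its rows are made up for illustration. The comments number each clause by its logical position, and the second query drops the WHERE clause to show that row filtering happens before aggregation.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (customer TEXT, status TEXT, amount INTEGER);
    INSERT INTO orders VALUES
        ('ana', 'paid', 50), ('ana', 'paid', 70), ('ana', 'refunded', 500),
        ('bob', 'paid', 30), ('bob', 'paid', 20),
        ('cy',  'paid', 200);
""")

# Clauses are written in SQL order; comments give the logical processing order.
WITH_WHERE = """
    SELECT customer, SUM(amount) AS total   -- 5. SELECT
    FROM orders                             -- 1. FROM
    WHERE status = 'paid'                   -- 2. WHERE: filters rows before grouping
    GROUP BY customer                       -- 3. GROUP BY
    HAVING SUM(amount) > 100                -- 4. HAVING: filters whole groups
    ORDER BY total DESC                     -- 6. ORDER BY
    LIMIT 1                                 -- 7. LIMIT
"""

# Same query without the WHERE step: the refunded row now reaches the SUM.
WITHOUT_WHERE = """
    SELECT customer, SUM(amount) AS total
    FROM orders
    GROUP BY customer
    HAVING SUM(amount) > 100
    ORDER BY total DESC
    LIMIT 1
"""

ROWS = conn.execute(WITH_WHERE).fetchall()
ROWS_NO_WHERE = conn.execute(WITHOUT_WHERE).fetchall()

print(ROWS)           # paid totals: ana 120, bob 50, cy 200
print(ROWS_NO_WHERE)  # all totals:  ana 620, bob 50, cy 200
```

Because `status = 'paid'` does not depend on an aggregate, it belongs in WHERE, which matches the second takeaway above: the filter removes rows before any group is built.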

180

160

15K

mthipparthi retweeted

Michael Muthukrishna

@mmuthukrishna

over 2 years ago

I'm creating a list of 5 of the "best books that will change how to see the world" for @Shepherd_books.

I went through the (extensive!) Further Readings list in A Theory of Everyone and have narrowed it down to 15 books.

Please help me shortlist with ♥️, 🔁 & 💬!
🧵 https://t.co/B0W5rQG0zH

627

112

167K

mthipparthi retweeted

Andrej Karpathy

@karpathy

over 2 years ago

New YouTube video: 1hr general-audience introduction to Large Language Models
https://t.co/Bl4WNuNyFJ

Based on a 30min talk I gave recently; it tries to be a non-technical intro and covers mental models for LLM inference, training, finetuning, the emerging LLM OS and LLM Security. https://t.co/JHOa2mqjdh

534

17K

11K

mthipparthi retweeted

freeCodeCamp.org

@freeCodeCamp

over 2 years ago

RabbitMQ is an open source message broker tool often used in distributed and pub-sub systems. And you can configure a single instance to use different virtual hosts for each app. In this tutorial, Ridwan walks you through exactly how it all works. https://t.co/vcx1xrLRye

437

197

70K

mthipparthi retweeted

Debasish (দেবাশিস্) Ghosh 🇮🇳

@debasishg

over 2 years ago

Just found this monograph on B-trees that has a fairly holistic perspective on B-trees including the data structures and algorithms part and use of B-tree indexes in databases, transactional techniques and query processing techniques. Modern B-Tree Techniques - Goetz Graefe https://t.co/PaHAf88dqS

19K
