Jorge Bay

@jorgebg

Señor Software Engineer @mezmodata. Author of @PolarStreamsIO. Prev: @DataStax. @apachetinkerpop PMC member. Father of 2 kids w/ @verorbe

Joined February 2009

242 Following

341 Followers

4K Posts

jorgebg retweeted

tobi lutke

@tobi

about 2 years ago

Sunday rant. For software engineering, my sense is that the phrase “premature optimization is the root of all evil” has massively backfired. Its from a book on data structures and mainly tried to dissuade people from prematurely write things in assembler. But the point was to free you up to think harder about the data structures to use, not leave things comically inefficient. This context is always skipped when it’s uttered. Not all fast software is world-class, but all world-class software is fast. Performance is _the_ killer feature. If you are in engineering, here is a fantastic anecdote. I refer to this account often. It’s a bit subtile, but the implications are massive- It’s an account of how SQLite became 50% faster, not by doing one specific thing but hundreds of small ones. SQLite is everywhere today because of this work. https://t.co/krLFFps2up We need the engineers in all companies fight for this more. Product leads are not the right owners of the end performance of the software. This needs to be encoded in the professional pride of the software engineering discipline. Leaders in companies need to encourage it and hold engineering accountable. It’s simply not ok to fritter away the performance of the products for random reasons. Every user of your products cares exactly as much about latency as engineers do when typing in their terminal. They just don’t have the words to describe what they don’t like about the experience and neither should they.

208

817

Jorge Bay @jorgebg

about 2 years ago

A really interesting analysis on the state of the book-media industry: https://t.co/uqRM6c6oeA

135

Jorge Bay @jorgebg

about 2 years ago

https://t.co/jT5bI7wWnM

137

jorgebg retweeted

Daniel Lemire

@lemire

about 2 years ago

Fast and concise probabilistic filters in Python Sometimes you need to filter out or filter in data quickly. Suppose that your employer maintains a list of forbidden passwords or URLs or words. You may store them in a relational database and query them as needed. Unfortunately, this process can be slow and inefficient. A better approach might be to use a probabilistic filter. A probabilistic filter is a sort of 'approximate set'. You can ask it whether a key is present in the set, and if it is present, then you will always get 'true' (the correct answer). However, when the key is not present, you may still get 'true', although with a low probability. So the probabilistic filter is sometimes wrong. Why would you accept a data structure that is sometimes wrong? Because it can be several times smaller and faster than querying directly the actual set. The best known probabilistic filter is the Bloom filter, but there are many others. For example, we recently presented the binary fuse filters which are smaller and faster than Bloom filters, for large and immutable sets. The pyxorfilter module in Python is an implementation of the binary fuse filters. It provides support for several filter types but both Fuse8 and Fuse16 are interesting if you have fairly large sets. They provide respectively a 0.39% and 0.0015% false probability rate. So a Fuse16 filter is almost nearly correct. Why would you prefer Fuse8? Because it uses half the memory. We can construct a probabilistic filter in Python like so with the pyxorfilter filter: from pyxorfilter import Fuse8 data = [uuid.uuid4() for i in range(2000000)] filter = Xor16(len(data)) filter.populate(data) Once it is done, you can be certain that all the content of your initial set is in the filter: for d in data: assert filter.contains(d) You can save the filter on disk and check how much memory is used... f = open(filename, 'wb') f.write(filter.serialize()) f.close() print(os.path.getsize(filename)*8.0/len(data)) If your set is large enough (say 1000,000 elements), you will find that the memory usage is about 9 bits per entry. It grows a bit larger per entry as the set gets smaller. For smaller sets, the pyxorfilter module offers an alternative (Xor8) that can be slightly more efficient in these cases. How do you know if you can trust the filter? Just query random inputs (highly likely not to be present) and see how many falsely appear to be in the set: # estimate false positive rate N = 1000000 count = 0 for i in range(N): count += filter.contains(uuid.uuid4()) fpp = count/N*100.0 As I already implied, if you replace Fuse8 by Fuse16, then the memory usage per element goes up to about 18 bits, but the false positive rate is far lower: 0.00200%. I produced a small benchmark. On my laptop, I get that you get over 1 million queries per second (each time checking the presence of a string). On an Intel-based server, I get a lower number, so about half a million per second. For binary fuse filters, it does not matter whether the element is in the set or not as far as performance goes, so I use random inputs. When using a Bloom filter (say), you would typically get worse performance when the elements are in the set. The pyxorfilter module was created Amey Narkhede. It still early days. I expect you can install pyxorfilter the usual way (pip install pyxorfilter) under x64 Linux. Unfortunately, under macOS and Windows, there are issues getting the module installed (see issue 19 and issue 10). However, if you are a Python hacker, you can build it from source relatively easily: git clone --recurse-submodules https://t.co/3gdO2oCyj8 cd pyxorfilter python https://t.co/wL32imr9wG build_ext python https://t.co/wL32imr9wG install I am sure Amey would appreciate it if experienced Python hackers could help resolve the few remaining issues. There are functionalities that pyxorfilter misses. Currently, you can save the filter to disk and recover it later. Sadly, to use it, you need to load the whole filter in memory. That is not needed. It might be more suitable to use memory file mappingor even other lower-level input-output operations. What if you do not care about Python? You can use the xor and binary fuse filters in Go, C or C++, Zig, Rust, etc. I just love working in Python when I can.

Who to follow

David Jones-Gilardi @Langflow

@SonicDMG

A Gen-AI/ Agentic nerd with decades of coding experience who loves learning and helping others do the same! @Langflow

Jake Luciani

@tjake

Dad / Dreamer / Coder

Ekaterina Dimitrova

@EkaterinaDimit9

🇧🇬Software Engineer @DataStax, everything data enthusiast; curious life-long learner who does winter sports, painting, reading,and traveling in between coding

Jorge Bay @jorgebg

over 2 years ago

https://t.co/ZR8nxVRzHD

140

jorgebg retweeted

John Carmack

@ID_AA_Carmack

over 2 years ago

“Coding” was never the source of value, and people shouldn’t get overly attached to it. Problem solving is the core skill. The discipline and precision demanded by traditional programming will remain valuable transferable attributes, but they won’t be a barrier to entry. Many times over the years I have thought about a great programmer I knew that loved assembly language to the point of not wanting to move to C. I have to fight some similar feelings of my own around using existing massive codebases and inefficient languages, but I push through. I had somewhat resigned myself to the fact that I might be missing out on the “final abstraction”, where you realize that managing people is more powerful than any personal tool. I just don’t like it, and I can live with the limitations that puts on me. I suspect that I will enjoy managing AIs more, even if they wind up being better programmers than I am.

526

17K

jorgebg retweeted

Chris @criccomini

over 2 years ago

TIL about Apache DafaFusion Comet. @apple has replaced @ApacheSpark's guts with @ApacheArrow DataFusion. And they're donating it. 🤯 https://t.co/yIyrFpuYmc This is an alternative to @MetaOpenSource's Velox Spark implementation. https://t.co/VIGpFYDnyt /ht @philippemnoel

criccomini's tweet photo. TIL about Apache DafaFusion Comet. @apple has replaced @ApacheSpark's guts with @ApacheArrow DataFusion. And they're donating it. 🤯

https://t.co/yIyrFpuYmc

This is an alternative to @MetaOpenSource's Velox Spark implementation.

https://t.co/VIGpFYDnyt

/ht @philippemnoel https://t.co/leAVBE2g5b

249

118

43K

jorgebg retweeted

Kubernetes Community Days Spain @KCDSpain

over 2 years ago

We are happy to announce our first in-person KCD in Barcelona, hosted by our good friends at @DevBcn 🙌 📅 June 13th-14th 2024 📍La Farga de l'Hospitalet, Barcelona 📎 @DevBcn will host a KCD dedicated track, CFP at https://t.co/zqssbXfDVK 🚀 #cloudnative #cncf #kcd #devops

Jorge Bay @jorgebg

over 2 years ago

Energy generation from CO2-emission free sources goes up to 70%+, including nuclear

Jorge Bay @jorgebg

over 2 years ago

Over 50% of all the energy generated in Spain was from renewable sources in the past year!

280

Jorge Bay @jorgebg

over 2 years ago

This is great news for everybody https://t.co/7QmeaBnINB

jorgebg retweeted

Massimo

@Rainmaker1973

over 2 years ago

30 years ago #Today, iD Software released the game DOOM, now considered one of the most influential titles in video game history, popularizing the first-person shooter genre with its “deathmatch” multi-player mode. https://t.co/OZIJjjBMut

229

639

187

jorgebg retweeted

Werner Vogels

@Werner

over 2 years ago

Amazon S3 Express One Zone sets a new bar for cloud object storage performance. By co-locating storage and compute, it removes network latency to achieve up to 10x faster data access compared to S3 Standard, at half the cost. #aws #reinvent2023 https://t.co/qFcfYTdb1i

366

81K

Jorge Bay @jorgebg

over 2 years ago

Live Stripe dashboard’s stats are impressive https://t.co/aZqC0ocuua

128

jorgebg retweeted

Peter

@peter

over 2 years ago

adulthood is saying “after this week things will slow down a bit” over and over until you die.

11K

560

980K

jorgebg retweeted

Matt Griswold

@griswold

over 2 years ago

The OpenAI story.

249

jorgebg retweeted

Daniel Lemire

@lemire

over 2 years ago

Really nice introduction to Roaring bitmaps by @voberoi

20K

Jorge Bay @jorgebg

over 2 years ago

Evergreen tweet

120

jorgebg retweeted

non aesthetic things

@PicturesFoIder

over 2 years ago

347

127K

10K

jorgebg retweeted

Athan Koutsiouroumbas

@Athan_K

over 2 years ago

🚨🚨"Our thesis is that a primary cause of the rise in mental disorders is a decline over decades in opportunities for children and teens to play, roam, and engage in other activities independent of direct oversight and control by adults." 🚨🚨(1/2)

Athan_K's tweet photo. 🚨🚨"Our thesis is that a primary cause of the rise in mental disorders is a decline over decades in opportunities for children and teens to play, roam, and engage in other activities independent of direct oversight and control by adults." 🚨🚨(1/2) https://t.co/6C2bRVHxzk

365

14K

Jorge Bay

@jorgebg

Who to follow

Last Seen Users on Sotwe

Trends for you

Most Popular Users