Sunday rant.
For software engineering, my sense is that the phrase “premature optimization is the root of all evil” has massively backfired. Its from a book on data structures and mainly tried to dissuade people from prematurely write things in assembler. But the point was to free you up to think harder about the data structures to use, not leave things comically inefficient. This context is always skipped when it’s uttered.
Not all fast software is world-class, but all world-class software is fast. Performance is _the_ killer feature.
If you are in engineering, here is a fantastic anecdote. I refer to this account often. It’s a bit subtile, but the implications are massive-
It’s an account of how SQLite became 50% faster, not by doing one specific thing but hundreds of small ones.
SQLite is everywhere today because of this work.
https://t.co/krLFFps2up
We need the engineers in all companies fight for this more. Product leads are not the right owners of the end performance of the software. This needs to be encoded in the professional pride of the software engineering discipline. Leaders in companies need to encourage it and hold engineering accountable. It’s simply not ok to fritter away the performance of the products for random reasons.
Every user of your products cares exactly as much about latency as engineers do when typing in their terminal. They just don’t have the words to describe what they don’t like about the experience and neither should they.
Fast and concise probabilistic filters in Python
Sometimes you need to filter out or filter in data quickly. Suppose that your employer maintains a list of forbidden passwords or URLs or words. You may store them in a relational database and query them as needed. Unfortunately, this process can be slow and inefficient.
A better approach might be to use a probabilistic filter. A probabilistic filter is a sort of 'approximate set'. You can ask it whether a key is present in the set, and if it is present, then you will always get 'true' (the correct answer). However, when the key is not present, you may still get 'true', although with a low probability. So the probabilistic filter is sometimes wrong. Why would you accept a data structure that is sometimes wrong? Because it can be several times smaller and faster than querying directly the actual set.
The best known probabilistic filter is the Bloom filter, but there are many others. For example, we recently presented the binary fuse filters which are smaller and faster than Bloom filters, for large and immutable sets.
The pyxorfilter module in Python is an implementation of the binary fuse filters. It provides support for several filter types but both Fuse8 and Fuse16 are interesting if you have fairly large sets. They provide respectively a 0.39% and 0.0015% false probability rate. So a Fuse16 filter is almost nearly correct. Why would you prefer Fuse8? Because it uses half the memory.
We can construct a probabilistic filter in Python like so with the pyxorfilter filter:
from pyxorfilter import Fuse8
data = [uuid.uuid4() for i in range(2000000)]
filter = Xor16(len(data))
filter.populate(data)
Once it is done, you can be certain that all the content of your initial set is in the filter:
for d in data:
assert filter.contains(d)
You can save the filter on disk and check how much memory is used...
f = open(filename, 'wb')
f.write(filter.serialize())
f.close()
print(os.path.getsize(filename)*8.0/len(data))
If your set is large enough (say 1000,000 elements), you will find that the memory usage is about 9 bits per entry. It grows a bit larger per entry as the set gets smaller. For smaller sets, the pyxorfilter module offers an alternative (Xor8) that can be slightly more efficient in these cases.
How do you know if you can trust the filter? Just query random inputs (highly likely not to be present) and see how many falsely appear to be in the set:
# estimate false positive rate
N = 1000000
count = 0
for i in range(N):
count += filter.contains(uuid.uuid4())
fpp = count/N*100.0
As I already implied, if you replace Fuse8 by Fuse16, then the memory usage per element goes up to about 18 bits, but the false positive rate is far lower: 0.00200%.
I produced a small benchmark. On my laptop, I get that you get over 1 million queries per second (each time checking the presence of a string). On an Intel-based server, I get a lower number, so about half a million per second.
For binary fuse filters, it does not matter whether the element is in the set or not as far as performance goes, so I use random inputs. When using a Bloom filter (say), you would typically get worse performance when the elements are in the set.
The pyxorfilter module was created Amey Narkhede. It still early days. I expect you can install pyxorfilter the usual way (pip install pyxorfilter) under x64 Linux. Unfortunately, under macOS and Windows, there are issues getting the module installed (see issue 19 and issue 10). However, if you are a Python hacker, you can build it from source relatively easily:
git clone --recurse-submodules https://t.co/3gdO2oCyj8
cd pyxorfilter
python https://t.co/wL32imr9wG build_ext
python https://t.co/wL32imr9wG install
I am sure Amey would appreciate it if experienced Python hackers could help resolve the few remaining issues.
There are functionalities that pyxorfilter misses. Currently, you can save the filter to disk and recover it later. Sadly, to use it, you need to load the whole filter in memory. That is not needed. It might be more suitable to use memory file mappingor even other lower-level input-output operations.
What if you do not care about Python? You can use the xor and binary fuse filters in Go, C or C++, Zig, Rust, etc. I just love working in Python when I can.
“Coding” was never the source of value, and people shouldn’t get overly attached to it. Problem solving is the core skill. The discipline and precision demanded by traditional programming will remain valuable transferable attributes, but they won’t be a barrier to entry.
Many times over the years I have thought about a great programmer I knew that loved assembly language to the point of not wanting to move to C. I have to fight some similar feelings of my own around using existing massive codebases and inefficient languages, but I push through.
I had somewhat resigned myself to the fact that I might be missing out on the “final abstraction”, where you realize that managing people is more powerful than any personal tool. I just don’t like it, and I can live with the limitations that puts on me.
I suspect that I will enjoy managing AIs more, even if they wind up being better programmers than I am.
TIL about Apache DafaFusion Comet. @apple has replaced @ApacheSpark's guts with @ApacheArrow DataFusion. And they're donating it. 🤯
https://t.co/yIyrFpuYmc
This is an alternative to @MetaOpenSource's Velox Spark implementation.
https://t.co/VIGpFYDnyt
/ht @philippemnoel
We are happy to announce our first in-person KCD in Barcelona, hosted by our good friends at @DevBcn 🙌
📅 June 13th-14th 2024
📍La Farga de l'Hospitalet, Barcelona
📎 @DevBcn will host a KCD dedicated track, CFP at https://t.co/zqssbXfDVK 🚀 #cloudnative#cncf#kcd#devops
30 years ago #Today, iD Software released the game DOOM, now considered one of the most influential titles in video game history, popularizing the first-person shooter genre with its “deathmatch” multi-player mode.
https://t.co/OZIJjjBMut
Amazon S3 Express One Zone sets a new bar for cloud object storage performance. By co-locating storage and compute, it removes network latency to achieve up to 10x faster data access compared to S3 Standard, at half the cost. #aws#reinvent2023 https://t.co/qFcfYTdb1i
🚨🚨"Our thesis is that a primary cause of the rise in mental disorders is a decline over decades in opportunities for children and teens to play, roam, and engage in other activities independent of direct oversight and control by adults." 🚨🚨(1/2)