@arpit_bhayani Truly one of the best experiences learning from you. Considering how everything is about chasing titles and finding ways around interviews these days, it is so refreshing to see someone’s eyes lighting up when they hear about interesting problems.
Decided to re-read some old classic papers this year, and I thought I might as well do a brain dump along the way. For those interested in digging deeper, here is the original FlumeJava research paper:
https://t.co/7AuEdB0C3Q
At Google, one of the most fascinating pieces of tech I've worked with is FlumeJava.
Hadoop is quite popular and most of us know the MapReduce model, but FlumeJava makes the whole thing far more elegant. The core abstraction in FlumeJava is the PCollection, which is essentially a data structure you can run parallel operations on, and it can hold elements of any data type. These might look like mere abstractions, but they make it dramatically easier to test entire pipelines with in-memory test data that maps cleanly onto production code.
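To make the testing point concrete, here is a minimal sketch in Python. `ToyCollection`, `parallel_do`, and `extract_words` are hypothetical names made up for illustration, not FlumeJava's actual API (`parallel_do` just borrows the name of the paper's `parallelDo` primitive). The idea is only that pipeline logic written against the collection abstraction can be fed a small in-memory input in a test.

```python
from typing import Callable, Generic, Iterable, List, TypeVar

T = TypeVar("T")
U = TypeVar("U")


class ToyCollection(Generic[T]):
    """Hypothetical in-memory stand-in for a PCollection (not FlumeJava's API)."""

    def __init__(self, items: Iterable[T]) -> None:
        self._items: List[T] = list(items)

    def parallel_do(self, fn: Callable[[T], Iterable[U]]) -> "ToyCollection[U]":
        # Apply fn to every element independently; fn may emit zero or more outputs.
        # A real runner could shard this work; this toy version runs a single loop.
        return ToyCollection(out for item in self._items for out in fn(item))

    def to_list(self) -> List[T]:
        return list(self._items)


def extract_words(lines: "ToyCollection[str]") -> "ToyCollection[str]":
    """Pipeline logic written once, against the collection abstraction only."""
    return lines.parallel_do(lambda line: line.lower().split())


# In a test, the same pipeline function is fed a tiny in-memory collection.
words = extract_words(ToyCollection(["Hello Flume", "hello again"]))
print(words.to_list())  # ['hello', 'flume', 'hello', 'again']
```

Because `extract_words` never touches files or clusters directly, the same function could in principle be pointed at a production-backed collection without changing its body.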
The most beautiful part of all this is deferred evaluation. Say you run some operations on a collection A, transform it into B, then feed B into C. Flume doesn't just blindly run each step the moment you call it. Before FlumeJava processes anything, it builds a DAG of the dataflow and optimizes away whatever's unnecessary. If you want to look at the objects inside C, you have to materialize it first. This sounds trivial, but in large pipelines there are often code paths that are dead in a given run, and FlumeJava just skips processing them entirely. Moreover, this leads to significantly fewer MapReduce stages, while being as fast as a regular MapReduce pipeline!
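A toy sketch of the A, B, C idea in Python, under heavy assumptions: `Deferred`, `map`, and the `executed` log are hypothetical names for illustration and not how FlumeJava is implemented. It only shows the two behaviours described above: calling an operation records a node in the graph, and materializing C runs just the steps C depends on, so a dead branch never executes. The real optimizer goes further (for example, fusing steps into fewer MapReduce stages), which this sketch does not attempt.

```python
from typing import Any, Callable, Iterable, List, Optional

executed: List[str] = []  # records which steps actually ran, in order


class Deferred:
    """Hypothetical dataflow node: holds a recipe for data, not the data itself."""

    def __init__(
        self,
        name: str,
        parent: Optional["Deferred"] = None,
        fn: Optional[Callable[[Any], Any]] = None,
        source: Optional[Iterable[Any]] = None,
    ) -> None:
        self.name = name
        self.parent = parent
        self.fn = fn
        self.source = source
        self._cache: Optional[List[Any]] = None

    def map(self, name: str, fn: Callable[[Any], Any]) -> "Deferred":
        # Only adds a node to the graph; nothing is computed here.
        return Deferred(name, parent=self, fn=fn)

    def materialize(self) -> List[Any]:
        # Walk back through this node's ancestors and run only those steps.
        if self._cache is None:
            if self.parent is None:
                self._cache = list(self.source or [])
            else:
                inputs = self.parent.materialize()
                executed.append(self.name)
                self._cache = [self.fn(x) for x in inputs]
        return self._cache


a = Deferred("A", source=[1, 2, 3])
b = a.map("B", lambda x: x * 2)
c = b.map("C", lambda x: x + 1)
dead = a.map("unused", lambda x: x ** 10)  # defined, but never materialized

steps_before = list(executed)  # still empty: building the graph ran nothing
result = c.materialize()
print(result)    # [3, 5, 7]
print(executed)  # ['B', 'C']
```

The "unused" branch hangs off A in the graph but never shows up in `executed`, because nothing downstream of it was materialized.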
Btw, all of this was designed back in 2010 :)