Md Armughanuddin

@armughanuddin

SDE Storage Infra + ML Systems @google | PhD CS @tamu

Sunnyvale, CA

Joined September 2018

166 Following

930 Followers

19 Posts

Md Armughanuddin

@armughanuddin

about 1 month ago

@arpit_bhayani Truly one of the best experiences, learning from you. Considering how everything is about chasing titles and finding ways around interviews these days, it is so refreshing to see someone’s eyes light up when they hear about interesting problems.

Md Armughanuddin

@armughanuddin

about 1 month ago

Decided to re-read some old classic papers this year, and I thought I might as well do a brain dump along the way. For those interested in looking into this further, here is the original Flume research paper https://t.co/7AuEdB0C3Q

Md Armughanuddin

@armughanuddin

about 1 month ago

At Google, one of the most fascinating pieces of tech I've worked with is FlumeJava. Hadoop is quite popular and most of us know the MapReduce model, but FlumeJava makes the whole thing far more elegant. Everything in FlumeJava is a PCollection, which is essentially a data structure you can run parallel operations on, and it can hold elements of any data type. These might look like mere abstractions, but they make it dramatically easier to test entire pipelines with in-memory test data that maps cleanly onto production code. The most beautiful part of all this is deferred evaluation. Say you run some operations on a collection A, transform it into B, then feed B into C. Flume doesn't just blindly run all of this. Before FlumeJava processes anything, it builds a DAG of the dataflow and optimizes away whatever's unnecessary. If you want to look at the objects inside C, you have to materialize it first. This sounds trivial, but in large pipelines there are often code paths that are dead in a given run, and FlumeJava just skips processing them entirely. Moreover, this leads to significantly fewer MapReduce stages, while being as fast as a regular MapReduce pipeline! Btw, all of this was designed back in 2010 :)
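
To make the deferred evaluation idea concrete, here is a toy sketch in Python. It is not the FlumeJava API: `LazyCollection`, `of`, `map`, `materialize` and `traced` are hypothetical names invented for illustration, and the "optimizer" only does two things, assuming a simple linear chain of operations: it runs just the ancestors of the collection you materialize, and it fuses that chain into a single pass over the data.

```python
from typing import Any, Callable, Iterable, List, Optional


class LazyCollection:
    """Toy deferred collection (hypothetical; not the real FlumeJava API)."""

    def __init__(
        self,
        source: Optional[List[Any]] = None,
        parent: Optional["LazyCollection"] = None,
        fn: Optional[Callable[[Any], Any]] = None,
    ) -> None:
        self._source = source  # set only on the root node
        self._parent = parent  # edge to the upstream node in the DAG
        self._fn = fn          # operation recorded for this node

    @classmethod
    def of(cls, items: Iterable[Any]) -> "LazyCollection":
        # In-memory input, which is also what makes pipelines easy to test.
        return cls(source=list(items))

    def map(self, fn: Callable[[Any], Any]) -> "LazyCollection":
        # Only records the operation as a new DAG node; nothing runs here.
        return LazyCollection(parent=self, fn=fn)

    def materialize(self) -> List[Any]:
        # Walk back to the root, collecting only this node's ancestors.
        fns: List[Callable[[Any], Any]] = []
        node = self
        while node._parent is not None:
            fns.append(node._fn)
            node = node._parent
        fns.reverse()
        # Run the whole chain in one pass instead of one pass per stage.
        out = []
        for item in node._source:
            for fn in fns:
                item = fn(item)
            out.append(item)
        return out


calls: List[str] = []


def traced(label: str, fn: Callable[[Any], Any]) -> Callable[[Any], Any]:
    """Wrap fn so each invocation is recorded in `calls`."""
    def wrapper(x: Any) -> Any:
        calls.append(label)
        return fn(x)
    return wrapper


a = LazyCollection.of([1, 2, 3])
b = a.map(traced("double", lambda x: x * 2))
c = b.map(traced("inc", lambda x: x + 1))
dead = b.map(traced("unused", lambda x: x * 100))  # never materialized

calls_before = list(calls)  # still empty: building the graph ran nothing
result = c.materialize()    # only now do "double" and "inc" execute
```

Building `a`, `b`, `c` and `dead` executes no user code at all. Materializing `c` runs `double` and `inc` once per element, and the `unused` branch is never touched because nothing downstream asked for it.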