Wednesday, May 6, 2015

Sparks May Fly....

Spark and Databricks!  If you're in the data science field, "big data" (more on that term in a later post), data engineering, or just happen to work on computers for a living in the Bay Area, then you've probably heard of Spark before.

You've probably heard that it's the be-all and end-all of big data processing. It's going to kill Hadoop (MapReduce)!  I've heard all of this shit before and then some. I've actually gotten into arguments with "very knowledgeable" people on the subject, and of course I came out on top :).  FYI, I've been working with the Spark ecosystem for over a year now, so I know a little about it.

So what is it? Why are people so gaga over it? And why in the hell do I think a lot of it is smoke and mirrors? And some of it is actually kind of cool?  

Spark, simply put, is a new paradigm for processing large distributed sets of data across a cluster of JVMs.  Yes, it's a JVM-based solution... just like Hadoop and MR.   So what makes it so special compared to MR? Well, for one thing, it treats all data as what is called an RDD, or resilient distributed dataset.  Think of it as a data frame from R or Python (pandas) that is actually sitting as many smaller chunks across a cluster.  You work with the data like you would any data frame: filter, aggregate, run ML over the set, etc., but it's all distributed and "in memory".  Why do I use quotes? Well, you see, the data sits on disk and isn't pulled into memory until something actually needs to happen.  You are basically working on the "metadata" until you actually need to do real work.  The data will then go away (fall out of memory) unless you tell it not to with ".cache()".  So the same IO and read issues still exist!
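
To make the laziness concrete, here's a rough sketch in Scala (assuming a SparkContext named sc already exists, and the HDFS path is just a placeholder):

    // Nothing is read yet -- these lines just build a plan of work ("metadata").
    val lines  = sc.textFile("hdfs:///logs/events.txt")
    val errors = lines.filter(_.contains("ERROR"))
    errors.cache()                // ask Spark to keep this RDD in memory once it's built

    val n = errors.count()        // first action: NOW the data actually gets read off disk
    val m = errors.filter(_.contains("timeout")).count()  // second pass hits the cache instead of disk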

So why is this a bad thing? It isn't! If you can cache the data in memory across your cluster and you need to iterate over that data, great... Why do I say iterate? Because if you are only running one aggregation step over the data, then you are essentially reading the data off disk, running the aggregation, and spitting out a result.... Sound familiar? That's essentially map reduce.
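
Here's a sketch of that single-pass case (same assumed sc, placeholder paths): one read, one shuffle, one write, which is functionally just a word-count MR job.

    val counts = sc.textFile("hdfs:///logs/events.txt")
      .flatMap(_.split("\\s+"))          // the "map" phase
      .map(word => (word, 1))
      .reduceByKey(_ + _)                // the "reduce" phase
    counts.saveAsTextFile("hdfs:///out/wordcounts")   // read once, aggregate once, done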

So what is Spark good for? Running ML over large sets of data... Cache the RDD, then hit the shit out of it! That's all. Iterate, iterate, iterate until your boss-ass algo converges!
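
Here's roughly what that pattern looks like with MLlib's RDD API (circa Spark 1.x): cache once, then let the iterative optimizer pound on the same in-memory data. The path and iteration count below are made up for illustration.

    import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
    import org.apache.spark.mllib.util.MLUtils

    // Load labeled points, pin them in memory, then iterate over them 100 times.
    val data  = MLUtils.loadLibSVMFile(sc, "hdfs:///data/training.libsvm").cache()
    val model = LogisticRegressionWithSGD.train(data, numIterations = 100)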

I'm being a little catty here, though; there are other upsides, such as not having to write MR jobs in Java. I much prefer Scala, and I prefer manipulating the data outside of the MR paradigm. I also get to treat the data as if it were one thing, an RDD. It abstracts away the distributed nature of the data, that is, until you run into an issue with having to broadcast data to the nodes, or coalesce it for output, or parallelize it, etc.
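
A quick sketch of where that abstraction leaks, i.e. the plumbing you end up touching anyway (the values and sizes here are made up):

    val lookup = sc.broadcast(Map("a" -> 1, "b" -> 2))   // ship a small lookup table to every node
    val nums   = sc.parallelize(1 to 1000000, 100)       // distribute a local collection into 100 partitions
    val tagged = nums.map(n => (n, lookup.value.getOrElse("a", 0)))
    tagged.coalesce(1).saveAsTextFile("hdfs:///out/tagged")   // squash partitions before writing out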

The stack also has Spark Streaming, a lambda-architecture solution for Spark.  :::cough cough::: "STORM" anyone?
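
For the curious, a minimal Spark Streaming sketch: micro-batches over a DStream, not true record-at-a-time processing like Storm. The host, port, and batch interval are placeholders.

    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val ssc   = new StreamingContext(sc, Seconds(10))     // 10-second micro-batches
    val lines = ssc.socketTextStream("localhost", 9999)
    lines.flatMap(_.split(" "))
         .map(w => (w, 1))
         .reduceByKey(_ + _)
         .print()                                          // dump the counts for each batch
    ssc.start()
    ssc.awaitTermination()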

Another great feature is Spark SQL, which in all fairness is written on top of the Hive context library..... :) Do you see where I'm going with this?
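
A sketch of what that looks like in practice (assuming a Hive metastore is configured, and "events" is a placeholder table name). Notice that the query itself is just... HiveQL:

    import org.apache.spark.sql.hive.HiveContext

    val hc  = new HiveContext(sc)
    val top = hc.sql("SELECT user_id, COUNT(*) AS n FROM events GROUP BY user_id ORDER BY n DESC LIMIT 10")
    top.collect().foreach(println)    // small result set, safe to pull back to the driver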

Fatherly advice time: the correct tool for the right problem.  As data scientists we are expected to be polyglots of sorts: script in Python or bash, algos in R/Python/C, data modeling in HDFS or MySQL or Postgres, code in C when you need speed, understand algorithm runtimes and complicated metric spaces, stats, "A/B testing" (more on that later)... etc.  Everything has a correct tool or set thereof.  Is Spark the correct tool for single-read aggregation? No. Is it the right tool for doing (mostly linear) ML on large sets of data that can fit into memory? Yes.  How often do you do that, though? There is no implementation of an ANN in MLlib, and why? Because it's hard as hell to do it over distributed data! And that's from the mouth of the man who created MLlib; I know because I talked to him about it in person.

Is there a magic bullet in what we do? Fuck no... That's why we do this job, though. Not because it's easy, but because it's hard.... And anything worth doing is hard! (Again, more fatherly advice!)





