Wednesday, May 6, 2015

Sparks May Fly....

Spark and Databricks!  If you work in data science, "big data" (more on that term in a later post) or data engineering, or you happen to work on computers for a living in the Bay, then you've probably heard of Spark before.

You've probably heard that it's the be-all and end-all of big data processing. It's going to kill Hadoop (MapReduce)!  I've heard all of this shit before and then some. I've actually gotten into arguments with "very knowledgeable" people on the subject and of course I came out on top :).  FYI, I've been working with the Spark ecosystem for over a year now so I know a little about it.

So what is it? Why are people so gaga over it? And why in the hell do I think a lot of it is smoke and mirrors? And some of it is actually kind of cool?  

Spark, simply put, is a new paradigm for processing large distributed sets of data across a cluster of JVMs.  Yes, it's a JVM-based solution... just like Hadoop and MR.  So what makes it so special compared to MR? Well, for one thing it treats all data as what is called an RDD, or resilient distributed dataset.  Think of it as a data frame from R or Python (pandas) that is actually split into many smaller sets across a cluster.  You work with the data like you would any data frame: filter, aggregate, run ML over the set, etc., but it's all distributed and "in memory".  Why do I use quotes? Well, you see, the data sits on disk and is not pulled into memory until something actually needs to happen.  You are basically working on the "metadata" until you actually need to do real work.  The data will then go away (out of memory) unless you tell it not to with ".cache()".  So the same IO and read issues still exist!
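To make the "metadata until you do real work" point concrete, here is a toy sketch in plain Python. It is not Spark and not Spark's API: LazyDataset, read_from_disk and the read counter are all made-up names for illustration. Transformations only get recorded, an action (collect) triggers the read, and cache() keeps the result in memory.

```python
class LazyDataset:
    """Toy stand-in for an RDD (hypothetical, not Spark's API)."""

    def __init__(self, read_source, steps=()):
        self._read_source = read_source  # callable that "reads from disk"
        self._steps = tuple(steps)       # recorded transformations (the "metadata")
        self._keep = False               # set by cache()
        self._cached = None              # materialized rows, once cached

    def map(self, fn):
        # Nothing runs here; we only record the step.
        return LazyDataset(self._read_source, self._steps + (("map", fn),))

    def filter(self, pred):
        return LazyDataset(self._read_source, self._steps + (("filter", pred),))

    def cache(self):
        self._keep = True
        return self

    def collect(self):
        # The action: only now is the source read and the steps applied.
        if self._cached is not None:
            return list(self._cached)
        rows = self._read_source()
        for kind, fn in self._steps:
            if kind == "map":
                rows = [fn(r) for r in rows]
            else:
                rows = [r for r in rows if fn(r)]
        if self._keep:
            self._cached = rows
        return list(rows)


reads = {"count": 0}  # how many times we went back to "disk"

def read_from_disk():
    reads["count"] += 1
    return list(range(10))

evens_squared = (LazyDataset(read_from_disk)
                 .filter(lambda x: x % 2 == 0)
                 .map(lambda x: x * x))
reads_after_transformations = reads["count"]  # still zero, nothing has run

evens_squared.collect()
evens_squared.collect()
reads_without_cache = reads["count"]  # every action re-read the source

evens_squared.cache()
evens_squared.collect()
evens_squared.collect()
reads_with_cache = reads["count"] - reads_without_cache  # one read, then memory
```

Without cache() every action goes back to the source, which is exactly the "same IO and read issues" complaint.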

So why is this a bad thing? It isn't! Not if you can cache the data in memory across your cluster and you need to iterate over that data. Why do I say iterate? Because if you are only running one aggregation step over the data, then you are essentially reading the data off disk, running the aggregation and spitting out a result.... Sound familiar? That's essentially MapReduce.
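For anyone who hasn't stared at it in a while, a single-pass aggregation has roughly this shape. This is a minimal sketch in plain Python with made-up records; the function names map_phase, shuffle and reduce_phase are just labels for the three stages, not any framework's API.

```python
from collections import defaultdict

def map_phase(records):
    # Emit one (key, value) pair per input record.
    for user, amount in records:
        yield user, amount

def shuffle(pairs):
    # Group all values by key, as the framework would between map and reduce.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Collapse each key's values into a single result.
    return {key: sum(values) for key, values in groups.items()}

# Hypothetical input: (user, amount) rows read once from "disk".
records = [("ann", 3), ("bob", 5), ("ann", 4), ("cat", 1), ("bob", 2)]
totals = reduce_phase(shuffle(map_phase(records)))
```

One read, one aggregation, one result. Nothing gets reused, so there is nothing for a cache to speed up.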

So what is Spark good for? Running ML over large sets of data... Cache the RDD then hit the shit out of it! That's all. Iterate iterate iterate until your boss ass algo converges!
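Here is roughly what "iterate until it converges" looks like, as a sketch in plain Python rather than Spark: gradient descent fitting a line to a tiny made-up dataset. The function fit_line, the learning rate and the tolerance are all assumptions picked for illustration. The thing to notice is that every pass touches every point again, which is why having the data sitting in memory matters.

```python
def fit_line(points, learning_rate=0.05, tolerance=1e-9, max_passes=10_000):
    """Fit y = w*x + b by gradient descent on mean squared error."""
    w, b = 0.0, 0.0
    n = len(points)
    passes = 0
    for passes in range(1, max_passes + 1):
        # Each pass is a full scan over the (cached) data.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in points) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in points) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
        # Stop once the gradient is effectively zero.
        if grad_w * grad_w + grad_b * grad_b < tolerance:
            break
    return w, b, passes

# Made-up data lying exactly on y = 2x + 1.
points = [(float(x), 2.0 * x + 1.0) for x in range(5)]
w, b, passes = fit_line(points)
```

Even on five points this takes many full scans to settle. Now picture each of those scans going back to disk.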

I'm being a little catty here though. There are other upsides: I don't have to write MR jobs in Java (I much prefer Scala), and I can manipulate the data outside of the MR paradigm. I also get to treat the data as if it were one thing, an RDD. It abstracts away the distributed nature of the data, that is until you run into an issue with having to broadcast data to the nodes, or coalesce it for output, or parallelize it, etc.
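If those three words are new to you, this toy sketch in plain Python shows what each one means. The function names borrow Spark's vocabulary, but these are hypothetical stand-ins written for illustration, not Spark's API, and the data is made up.

```python
def parallelize(rows, num_partitions):
    """Split a local list into partitions (round-robin)."""
    return [rows[i::num_partitions] for i in range(num_partitions)]

def map_partitions(partitions, fn, broadcast_value):
    """Run fn on each partition; every partition sees the same read-only value."""
    return [fn(part, broadcast_value) for part in partitions]

def coalesce(partitions, num_partitions):
    """Merge partitions down to fewer, e.g. to write one output file."""
    merged = [[] for _ in range(num_partitions)]
    for i, part in enumerate(partitions):
        merged[i % num_partitions].extend(part)
    return merged

# Small lookup table that every "node" needs: the thing you would broadcast.
country_names = {"us": "United States", "fr": "France"}
codes = ["us", "fr", "us", "fr", "us", "us"]

partitions = parallelize(codes, 3)
named = map_partitions(
    partitions,
    lambda part, lookup: [lookup[code] for code in part],
    country_names,
)
single_output = coalesce(named, 1)[0]
```

The moment you have to think about which piece of data lives where, the "it's just one thing" abstraction has leaked.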

The stack also has Spark Streaming, a lambda solution for Spark.  :::cough cough::: "STORM" anyone?

Another great feature you can use is Spark SQL, which in all fairness is written on top of the Hive context library..... :) Do you see where I'm going with this?

Fatherly advice time: the right tool for the right problem.  As data scientists we are expected to be polyglots of sorts: script in Python or bash, algos in R/Python/C, data modelling in HDFS or MySQL or Postgres, code in C when you need speed, understand algo runtimes and complicated metric spaces, stats, "AB testing" (more on that later)... etc.  Everything has a correct tool or set thereof.  Is Spark the correct tool for single-read aggregation? No. Is it the right tool for doing (mostly linear) ML on large sets of data that can fit into memory? Yes.  How often do you do that though? There is no implementation of an ANN in MLlib, and why? Because it's hard as hell to do it over distributed data! And that's from the mouth of the man who created MLlib; I know because I talked to him about it in person.

Is there a magic bullet in what we do? Fuck no... That's why we do this job though. Not because it's easy, but because it's hard.... And anything worth doing is hard! (Again, more fatherly advice!)

Sunday, May 3, 2015

Introductions and First Steps...

What is this blog? This blog will serve as an outlet and hopefully a repo of links and resources and fatherly wisdom from myself (maybe some others) around the art and craft of data science. 

What this blog is not. A place where I go into detail. Details are for you to explore. You must learn by doing and reading and coding and working problems on your own. The best way to learn is to sit down with a problem, read about tools, find a solution and then work through it, then fail, then fail again and again and again… until you get it.  Then you will never forget how to solve that problem.  Then you should go back and solve it again a different way.  If there is one key to success in life it is that. (FYI: there is my first piece of fatherly advice).

So the other day I began writing up a list of techniques and tools that I feel all data scientists should know.  Not a list of algos or mathematical concepts but of basic tools for getting around, what we call the backend.  It is supposed to serve as a primer for analysts and backend devs and mostly data scientists.  The first series of posts that I will write will follow that list of tools, perhaps why I think Spark is kind of cool but also kind of a religion (I’m an atheist so that's not a good thing), why there will always be a place for MySQL and Postgres, why the terms growth hacking and Big Data are just plain stupid and why I think that a lot of sales guys and sales engineers are the devil.

So who am I?  I am the senior technical lead of data science at Humin… formerly the senior data scientist at Sellpoints and before that the marketing data scientist at Revolve clothing.  I have a master's degree in pure math with a focus on algebraic geometry and category theory, and an undergraduate degree in neuro-behavioral sciences from UMDCP with a focus on pure mathematics (logic and abstract algebra).  I was also a class “away” from a degree in philosophy but was promptly talked out of going “all the way” by every philosophy grad student I was friends with.  My dreams of studying at the Sorbonne, smoking too many hand-rolled "cigs" and climbing at Fontainebleau on the weekends were quickly put to rest, though I still think about it here and there, and perhaps I'll get to it in my retirement.

I decided though to move to the next best place for climbing that didn’t require a passport, California.  And specifically the Bay.  In my previous life here in the Bay I worked in the non-profit sector and taught “under-served” youth math and science.  My first introduction to data science actually came when I had the idea of using our database of donors to predict who we should hit up for money and when… I will write a full post here shortly on how I think that non-profits and start-ups are actually the same beast under a different name, but first things first.

I have two beautiful and amazing children, Connor, 4, and Emma, 10 months, and an amazing and VERY VERY understanding wife, Jen.  She is amazing not only because she put up with me in grad school and while I was poor and working for non-profits, and has moved twice now for my career, but also because she gave me two hilarious and amazing children who are both a little too much like me and basically (though she won't admit it all the time) make her life a very stressful existence.

So that's it.  That's who I am.  And hold on tight… I plan to take you through the life and process of what it is to be a data scientist in the Bay, FYI the sexiest job of the 21st century according to Harvard Business Review… but what in the fuck do business people know about data and science in the first place ;) .