At Base, we are all about productivity. We’re building Base to make sales professionals and whole organizations 10 times more productive. But there are other important aspects of productivity and one of them is creating an engineering environment optimized for development speed, where engineers can move fast and build great features quickly while maintaining high quality and minimizing bugs. It’s reflected both in the process and the tools we use.
As the Data Team, we want to embrace this culture as part of working with data. Make it as easy and effective as possible to build and ship data products.
It brings some interesting challenges:
- How to work with data effectively, regardless of its size?
- How to quickly move from prototyping machine learning models to shipping scalable and functional data products?
In this post, I’ll focus on those two challenges and share with you how we use Python’s core data science libraries along with more recent advancements in Python integrations with Big Data to create more consistent and more productive environment for data scientists.
Basic environment for working with data
We’ll start with covering the basics. In order to work with data effectively, you need at least a few tools:
- R&D environment, which allows you to combine code, notes, plots and enables easy collaboration with team members
- a tool for data wrangling & analytics: load data in different formats, filter it, group, compute statistics, fill missing values
- easy and nice-looking plotting library
- powerful machine learning library: lots of algorithms, good abstractions, easy evaluation and fast implementation
Python supports all of the above, with its core data science libraries.
- IPython Notebook: Interactive, web-based environment. You can write a piece of code, execute it on your data, immediately see the results and decide on your next step. You can write notes in Markdown and make plots, all within a single document, allowing you to follow reproducible research principles.
- Pandas: data analysis tool for working with tabular data, with the concept of DataFrames borrowed from R. DataFrames are a great abstraction, perfect mix between fully-fledged programming language and declarative, SQL-like operations: grouping, aggregating and joining.
- GGPlot / Seaborn / Matplotlib: Combo of plotting libraries, with ggplot supporting your basic plots with simple API, and more powerful libraries for more complex plots
- Scikit-learn: machine learning library with the most elegant API ever. All algorithms share the same interface, allowing you to easily switch them, and powerful Pipeline abstraction, which allows you to create complex modeling pipelines.
This is a well-known and established data science stack in Python community, covering most of your needs for working with data on a single machine.
Working with Big Data
As long as we can fit on a single machine — good for us! But what if the amount of data or time needed to process it vastly exceeds what we can do on a single machine? What about Big Data and Hadoop?
Until recently, Python was a second-class citizen in Big Data world. There were a few options like: writing UDFs for Hive, Pig or simple Hadoop Streaming. But there weren’t particularly comfortable to use and performant.
The Hadoop world is changing recently with the emergence of Spark – a fast and general distributed engine for large scale data processing. Spark uses lazy evaluation, keeps data in memory and has a high-level API – it’s simply both faster and easier to use than Hadoop MapReduce. Python is one of the languages officially supported by Spark and it gives us a lot of new possibilities for working with large, distributed datasets. And guess what – it has DataFrames!
Take a look at this example:
>>> # load user interactions data interactions = sqlc.load("s3://.../interactions/", "com.databricks.spark.avro") Out: DataFrame[interaction_id: string, timestamp: bigint, device_name: string, account_id: bigint, user_id: bigint, type: string, metadata: string] >>> from datetime import datetime from pyspark.sql import functions, types # extract hour from timestamp hour = functions.udf(lambda timestamp: datetime.utcfromtimestamp(timestamp).hour, types.IntegerType()) # append column to DataFrame interactions_with_hour = interactions.withColumn('hour', hour(interactions.timestamp)) # count unique activity hours for each user, only Android devices users_activity = interactions_with_hour \ [interactions_with_hour.device_name == 'android'] \ .groupBy('user_id') \ .agg(functions.countDistinct('hour'))
It’s concise, uses high-level abstraction we already know – data frames – but this time they’re distributed and processed on a Hadoop cluster.
Spark DataFrames are also fast! All of those operations on DataFrames are lazy-evaluated – they create a logical plan of execution, which is then translated to an optimized physical plan, utilizing code generation and explicit memory management. Of course, we can also use arbitrary Python code, even reuse some of the code we already developed for local data analysis.
With Spark, you don’t need to declare your job upfront. You can launch SparkContext, and submit multiple jobs interactively. In fact, we can easily launch Spark application within IPython Notebook!
All of that gives us an interactive environment for working with Big Data, which is very similar to a local environment. It also allows us to reuse code and abstractions – and that is a powerful combination.
Deploying data products
Often the hardest thing in building machine learning products is to take the model prototype produced by data scientists and productionize it. Sometimes it requires rewriting the model to fit production infrastructure. It’s very suboptimal, but necessary in some environments.
Python is a general purpose language, performant enough and it can significantly reduce the problem.
One option is to serialize your scikit-learn model and embed it in one of Python’s web frameworks like Django or Flask, creating a request-response classifier API.
Another option, which we use at Base is to make use of distributed streaming framework like Storm, to apply the model on stream of incoming messages from other services. We use Streamparse framework to deploy our models as thin Python topologies, consuming messages from other services through Kafka, and saving the model output to the database. It’s fast, scalable and fault-tolerant.
If you keep a good balance between code quality during the modelling phase and being creative / testing new ideas, hardening the code so that it can be safely deployed to production is a fairly easy step.
There is more!
There are tons of other things you can do with Python and the ecosystem is growing constantly:
- Spark also supports Streaming, and has a distributed machine learning library: MLlib
- Python is concise and expressive, thus a great way to build data-pipelines, and there are several workflow engines built in Python: We use Luigi from Spotify, but there are interesting Python-based alternatives open-sourced recently: Airflow from Airbnb and Pinball from Pinterest
- Libraries and tools for more advanced things like deep learning, e.g. Theano
Python’s ecosystem for data science closes the gap between small data and big data, and between prototyping and production, allowing us to focus on being creative, testing new ideas quickly and shipping data products easier.
Cover photo by Gordon Ednie licensed with Creative Commons Attribution-Noncommercial-No Derivative Works 2.0 Generic License