Ryan M. Maloney

Random Thoughts on Analytics, Programming, and Tech

Some data analysis projects are fairly simple – you fetch some data that is already clean and in good shape, run some exploratory analysis on it, and maybe run a model on it. Other projects are not so simple. They can involve a number of stages: reading in raw data, cleaning it, transforming it, plotting it, or running a model on it. Depending on how complex the project is, you can end up spending more time managing the data pipeline than actually deriving some sort of business value from its output.

If you have a fairly simple, one-off project made up of one or two short scripts, then you probably don’t need something like a Makefile or to break the logic up into smaller pieces. For larger projects that you will be re-running or updating in the future, it can be helpful to break things up into smaller utilities and use a build tool like Make to provide some structure around building (and re-building) your data pipeline.

First off, what exactly is Make? It’s a build automation tool that comes standard on most versions of Unix, originally built to automate the build process for large, complex software. To use Make, all you need is a plain-text file called Makefile in your project’s working directory. This file describes all of the data transformation steps in your project. At its simplest, it is a list of rules that explain how your project should be built. Each rule has the following format:

target: prerequisites
    recipe

The “target” is what you are creating or building in each rule. This could be a dataset, a CSV file – anything, really.

The “prerequisites” are the files you need to build that target: a SAS/Python/R script, upstream datasets you need as input, etc.

The “recipe” is the command (or set of commands) to run to build your target. You can execute a rule by running make followed by the target name from the command line.

Below is a simple example using some targets from a Makefile that pulls data on repurchases and revisits from new customers at a retail site:

newcust_orders.sas7bdat: newcust_orders.sas
    sas newcust_orders.sas -noterminal

newcust_cookies.sas7bdat: newcust_orders.sas7bdat newcust_cookies.sas
    sas newcust_cookies.sas -noterminal

newcust_sessions.sas7bdat: newcust_cookies.sas7bdat newcust_sessions.sas
    sas newcust_sessions.sas -noterminal

So in the example above there are three targets – datasets called newcust_orders.sas7bdat, newcust_cookies.sas7bdat, and newcust_sessions.sas7bdat. The only prerequisite for the newcust_orders data is the script we run to pull it, so on the line below the target we just call sas newcust_orders.sas to build the dataset. The “recipe” in this case is simply the sas command that pulls the new customer orders.

In the step after that, to build newcust_cookies.sas7bdat there are two prerequisites: the newcust_orders.sas7bdat data (built in the previous step) and the newcust_cookies.sas script, which pulls the cookies for the new customers. On the line below the target we simply run sas newcust_cookies.sas to build it. The target after that is the session data, which uses the cookies data as an input, and so on.

It’s important to note that Make checks the modification times of all of a rule’s inputs when you execute it. For example, if you ran make newcust_cookies.sas7bdat and the newcust_orders.sas7bdat dataset had never been created, or the timestamp on the cookies table was older than that of the input orders table, Make would re-run whatever upstream steps it needs to bring everything up to date. In other words, when a target is built, Make compares the file system timestamp on the target to those of all of its prerequisites, and if any prerequisite has been modified more recently than the target, the recipe is executed to re-build the target.
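As a quick illustration (a hypothetical run, assuming newcust_orders.sas7bdat is missing and everything downstream is therefore out of date), a single make call rebuilds the whole chain, and Make echoes each recipe as it runs:

$ make newcust_sessions.sas7bdat
sas newcust_orders.sas -noterminal
sas newcust_cookies.sas -noterminal
sas newcust_sessions.sas -noterminal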

You can also use Make to automate a lot of tedious tasks like cleaning up files, moving data around, or output steps like exporting to CSV. Frequently, these sorts of tasks are among the most time-intensive (or annoying) parts of a project. Make provides a structured, reproducible way to map them out and automate them. For example, this rule will save you some time cleaning up old .log files, .lst files, or stale datasets:

# clean is a task, not a file, so declare it .PHONY
.PHONY: clean
clean:
    rm -f *.lst *.log *.sas7bdat

So if you run make clean from the command line, it will delete old logs, datasets, and .lst files so you can rebuild your data without clutter or old logs sitting in your directory. You can create a lot of these “phony” targets in a Makefile to automate tasks like setting up an environment, cleaning out folders, copying or backing up data, or exporting data to a CSV and building a plot with something like R or Python. So instead of slogging around in Enterprise Guide and manually copying or exporting data, you could script this workflow in a Makefile using something like the below:
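Here is a minimal sketch – the export script (export_summary.sas), plotting script (plot_summary.R), and CSV name are hypothetical stand-ins for whatever your project actually uses:

# Export the final sessions dataset to CSV via a small SAS script
newcust_summary.csv: newcust_sessions.sas7bdat export_summary.sas
    sas export_summary.sas -noterminal

# Build a plot from the exported CSV with an R script
.PHONY: plot
plot: newcust_summary.csv plot_summary.R
    Rscript plot_summary.R

Running make plot would then rebuild the CSV if the sessions data had changed, and regenerate the plot from it, all in one command.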

In addition to providing a useful sort of “graph” for directing the workflow of your project, Make also gives you variables. Ordinary variables let you define a configuration or option once, so if you need to change it in the future you only need to update it in one place. On top of that, Make sets several automatic variables for each rule; since you frequently end up re-typing things like the name of your target or a dependency, having automatic variables stand in for them will save you a lot of time and frustration.

Some of the most useful automatic variables are:

  • $@ the file name of the target
  • $< the name of the first prerequisite (i.e., dependency)
  • $^ the names of all prerequisites (i.e., dependencies)
  • $(@D) the directory part of the target
  • $(@F) the file part of the target
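For example, with GNU Make the three SAS rules above could collapse into a single pattern rule, with $< standing in for the matching script. This is a minimal sketch, assuming each dataset is built by the .sas script of the same name; the upstream dataset dependencies are then declared on separate prerequisite-only lines:

# One pattern rule builds any dataset from its matching script
# ($< expands to the .sas file for whichever target is being built)
%.sas7bdat: %.sas
    sas $< -noterminal

# Extra upstream inputs, declared without recipes
newcust_cookies.sas7bdat: newcust_orders.sas7bdat
newcust_sessions.sas7bdat: newcust_cookies.sas7bdat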

Obviously, a full, comprehensive guide to Make is outside the scope of this post. The key takeaway is that Makefiles aren’t just useful for complex software development; they can be very useful in an analytics and data science environment as well. Makefiles provide a reproducible graph of your project’s workflow, keep your dependencies in sync and up to date, and help automate or streamline some of the most tedious or time-consuming steps of a project, like cleaning data or removing junk files. If you have a project with several moving parts, it’s well worth spending a bit of time getting familiar with a build tool like Make to streamline your workflow. Not only will this pay dividends for your own productivity, it also allows someone on your team to quickly reproduce your analysis using the Makefile as a guide.