Customer Segmentation With SAS and R

If you’re examining the digital purchase journey for your newest customers, you’ll often find yourself asking some questions about them:

How long (in days) did it take my newest customers to convert?
How many visits/sessions did it take my newest customers to convert?
How many products were viewed prior to conversion?

These are the types of questions best explained by grouping the data for each of these metrics into bins, and focusing on the distribution, as opposed to your average. The main reason for this is that frequently a small number your best customers will inflate the averages for your customer file. It’s the classic 80/20 rule in action; where 80% of an e-commerce business is driven by the top 20% of customers.So, instead of stating that the average New Customer took 91.7 days to convert, focus on the fact that 40% of your new customers converted within 7 days of the first visit. What could you do or test to try to get that over 50%? As an added bonus, plotting and visualizing binned data is usually pretty straightforward, which also makes it very easy to present and share with other teams.

In data analysis, you frequently find yourself counting things. In e-commerce, you count sessions, conversions, cookies, % to Product Detail, etc. For an example of binning, if you wanted to get a sense of how frequently your customers are visiting your site, you could start by defining a format in SAS, which breaks out bins for the number of visits to your site. Then, you can use the format= option in Proc SQL to tell apply the bins to your visitors.

proc format;
     value visitfmt;
     low - 0 = '0 Visits'
     1       = '1 Visit'
     2-4     = '2-4 Visits'
     5-9     = '5-9 Visits'
     10-high = '10+ Visits'
     ;
run;

proc sql;
    create table work.visitors as select
        cookie_id,
        count(distinct session_id) as visits format=visitfmt.,
    from mylib.session_data
group by cookie_id;
quit;

Then, once you have your visitors table created, if someone wants to know what % of your visitors visited more than 10 times, just use Proc Freq:

proc freq data=work.visitors order=freq;
   tables visits;
run;

With this one statement we can very easily understand how frequently, or infrequently, a group of customers are visiting your site. You could then tweak your query to look at features such as number of visits before or after first purchase, and apply the same format. Metrics like this will allow you to see how many sessions the majority of your customers take to convert, as well as how frequently they are returning after first purchase. This is useful (and actionable) information. If it’s taking most of your customers 10 or more sessions to convert, there are certainly testing opportunities in your checkout funnel to try to improve conversion. If your customers are making their first purchase and never returning to the site, you now know you can reengage these segments via marketing channels to get them to return.

One of the most useful files you can put together for your analytics team is a master file of commonly used formats for binning continuous variables such as sales, visits, or orders. This greatly simplifies reporting, and saves time. Instead of manually binning data or re-defining how the bins should look, simply have your team include the formats file in their projects and they immediately have access to the most commonly defined formats for your digital analytics projects. The formats file on my own teams library has commonly defined format definitions for sales, number of orders, visits, product views, cart actions, app sessions, and more. A really helpful exercise is to meet with your own analytics team and potentially stakeholders in marketing or e-commerce, and whiteboard out common definitions and metrics for sales, digital behavior, and other continuous variables that are frequently binned together. With those definitions in mind, you can then set about defining formats in a SAS file and make them available to your team.

If your organization doesn’t have SAS or leans more on open source, then the equivalent of proc format is easily produced with R. Factors in R are a data type used for categorical variables. If you have an existing data frame in R and would like to bin continuous data into categorical groups, you can do this by using factor() and cut(). It’s often helpful to define your labels for binning continuous variables in a series of vectors, and then use cut() and your labels to produce bins for continuous data such as spend, orders, visits, etc. The cut() function can be a little counter intuitive at first, so as usual its typically easiest to start with some examples.

Let’s say we have a file of customer data from an e-commerce site. For now we’re just exploring the data and want to get a sense of how many customers spent over $200, or how many have made multiple purchases.

So, lets read the file in and take a look at it:

#read in our customer data file
customers<-read.csv("cust_info.csv", header=TRUE)
head(customers)

The data looks pretty straightforward. We have our customer_id, and then for each customer some measures like visits, spend, and total orders. Let's assume that we want to group our customers into bins based on the amount of money they have spent as well as their number of orders:

You can use the cut() function right out of the box to "cut" your data into intervals you define. If you don't define labels to give to the intervals, they will be automatically created for you. The basic prototype for cut() is:

    
    cut(x, breaks, labels=NULL)

Where x is the numeric field you want to cut, breaks are a vector of points where you want cut to "break" the field up (these are your bins), and labels are an optional parameter you can pass in that allows you to define your own labels for the breaks.

Further documentation and examples can be found at the Official Documentation Page

Lets start with a basic example. We'll bin our customers based on the amount of money they spent, and allow R to define the labels for us:

# cut() will create the labels/ranges for us if we don't pass in some labels to use
customers$spend_cat<-cut(customers$DOLLARS, c(0,100,200,300,400,500, Inf))

head(customers)

Setting the vector of breakpoints correctly can be tricky. Start with the lower and upper limits (0 and infinity in our case), then work inward and specify each point that we should "cut" that column based off of. We now have a new column called spend_cat that lists which range of spend each customer falls into. You'll also notice that this new column is a factor, whereas the DOLLARS column remains numeric. We're off to a good start, but I'm not too crazy about the default labels created, so the next step is to create some bins and use our own labels:

#predefine some label groups
orderlabels<-c("1","2-4","5-9", "10+")
spendlabels<-c("0-100","101-300","301-500","501+")


#use our predefined labels and cut to group customers into labeled bins
customers$order_bin<-cut(customers$ORDERS, c(0, 1, 4, 9, Inf), labels=orderlabels)
customers$spend_bin<-cut(customers$DOLLARS, c(0, 100, 300, 500, Inf), labels=spendlabels)

head(customers)

We'll create two variables, one called orderlabels and one called spendlabels. These are just vectors of strings which will be used as labels for our bins for orders and spend. Then we cut our orders and dollars columns same as before, only this time we pass in our own vectors of labels as the third parameter. You'll notice that their are only four labels defined, but 5 breakpoints assigned in the second parameter. This is because you need to specify the lower and upper boundaries of your breaks, as well as where in between you want to cut your data. Essentially, you'll want to have one fewer label than you have breakpoints created.

This can be tricky, so I'd encourage you to play around with different breaks and labels and ensure that they are being applied correctly. Now we have two new useful columns created, order_bin and spend_bin, which have our own user defined labels and effectively group our customer data according to our own business rules and definitions. With just a handful of lines of code in R, we've greatly simplified our work going forward for exploratory analysis, segmentation, or data visualization.

Assets like format dictionaries and common labels for binning your most critical metrics can really help standardize your teams reporting and reduce the time and resources required for ad-hoc asks and analysis. Fortunately, this is dead simple to implement in SAS with proc format as well as in R with functions like cut() and factor(). A good starting point is to draw up a list of some of your team's most critical or watched metrics, and agree on some bins for grouping customer and product data together. Encourage your team to use the same formats and bins. This helps standardize workflows, makes reproducing work much simpler, and allows you to easily communicate metrics to stakeholders outside of your data science or analytics teams.

Ryan M. Maloney

Random Thoughts on Analytics, Programming, and Tech

Leave a Reply Cancel reply