Non-linear relationships: The importance of examining distributions

Recently I was analyzing some data to help answer the question “what are the demographic differences between program graduates and program drop outs?” I did some modelling and found a few predictors, one of which was age.

I compared the average age between the groups and saw that the drop outs had a lower average age (42 years) than graduates (44 years). Simple enough. But this simplistic explanation didn’t jibe with anecdotal information the program staff had given me. I wondered if the relationship between age and program completion was linear (i.e., does a given change in age always produce the same change in the likelihood of graduating?).

As I mentioned in my last post, I’ve been playing around with R. I recently came across something called a violin plot and I wanted to try it out. A violin plot is kind of like a box plot, except that instead of a plain old box it shows you the distribution of your data.

Here is an example of a box plot:

boxplot

The main thing that I immediately see from this chart is that on average, the drop outs were younger than the graduates.

Here is an example of a violin plot:

violin

I get a different takeaway from this plot. You can see from the violin plot that the distribution of age for the drop outs looks a lot different than the distribution of age for the graduates. The bottom of the drop out violin is wider, indicating that the drop outs skew a lot younger than the graduates. This indicates that we should be exploring the relationship between age and graduation more closely.

But what if you don’t use R and can’t create a violin plot? Histograms are standard tools to show distributions and are much more common. A histogram is essentially a column chart that shows the frequency of values in your distribution (so for this example, it would show how many participants were 20 years old, 21 years old, 22 years old, you get the idea). Excel actually has a built-in feature to create histograms (click here for instructions). The tool bugs me a lot and it isn’t super intuitive to use, but it gets the job done.
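If you would rather script it than fight with Excel, here is a minimal sketch of the counting behind a histogram. It is written in Python (not R) using only the standard library. The function names, the 5-year bin width, and the ages are all made up for illustration; none of it is the program's data.

```python
from collections import Counter


def age_histogram(ages, bin_width=5):
    """Count how many ages fall into each bin that is `bin_width` years wide."""
    counts = Counter((age // bin_width) * bin_width for age in ages)
    start, stop = min(counts), max(counts)
    # Walk every bin from the lowest to the highest so empty bins show up as 0.
    return [(lo, counts.get(lo, 0)) for lo in range(start, stop + bin_width, bin_width)]


def print_histogram(ages, bin_width=5):
    """Print one row per bin, with one '#' per participant in that bin."""
    for lo, n in age_histogram(ages, bin_width):
        print(f"{lo:>3}-{lo + bin_width - 1:<3} {'#' * n}")


# Hypothetical ages, invented for this example only.
example_ages = [21, 22, 22, 24, 27, 28, 31, 33, 38, 45, 52, 60]
print_histogram(example_ages)
```

Running this once for the drop outs and once for the graduates gives two text histograms whose shapes can be compared side by side.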

Here is the distribution for age for both the drop outs and graduates. Yes, yes, I know that my x-axes aren’t labelled and that my y-axes use different scales but these choices were intentional because I want you to focus on the shape of the distributions, not the content.

histograms

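Again, you can see that the ages of the drop outs are bunched up on the left side of the chart (meaning that there is a higher proportion of younger participants than older). Statisticians would call this a right-skewed distribution, because the long tail points to the right. The histogram for the graduated group looks quite different.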

All of this evidence points to a non-linear relationship, meaning that the effect of age on whether or not a participant graduates differs across age groups.

To take a closer look at this relationship, I calculated the drop out rate for different age groupings and put them on a line chart. Aha! If the relationship between age and program completion were linear, we would expect this line to be straight. But it’s not. You can see that the drop out rate declines with age until we hit age 40 or so. After that it’s more or less flat until age 70, and then goes down again.
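The calculation behind that line chart can be sketched in a few lines of Python (standard library only). This is one possible approach, not necessarily how the numbers above were produced: the 10-year bands, the function name, and the records are all assumptions made up for illustration.

```python
from collections import defaultdict


def dropout_rate_by_age_band(records, band_width=10):
    """Return {band_start: drop out rate} for (age, dropped_out) records."""
    totals = defaultdict(int)
    dropouts = defaultdict(int)
    for age, dropped_out in records:
        band = (age // band_width) * band_width  # e.g. ages 40-49 land in band 40
        totals[band] += 1
        if dropped_out:
            dropouts[band] += 1
    # Rate = drop outs in the band divided by everyone in the band.
    return {band: dropouts[band] / totals[band] for band in sorted(totals)}


# Hypothetical (age, dropped_out) records, invented for this example only.
example_records = [
    (22, True), (25, True), (28, False),
    (34, True), (37, False),
    (41, False), (45, True), (48, False), (49, False),
]

for band, rate in dropout_rate_by_age_band(example_records).items():
    print(f"{band}-{band + 9}: {rate:.0%}")
```

Plotting these rates against the age bands gives the kind of line chart described above. With real data, keep an eye on how many participants are in each band, since a rate based on a handful of people can swing wildly.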

dropouts.PNG

This is an important piece of knowledge for program staff to target retention efforts and something that we wouldn’t have uncovered if we had simply stopped at comparing the average age between the drop outs and the graduates.

The Importance of Context

Recently I was looking at some data and I noticed a trend in a neighbourhood (Neighbourhood A) surrounding a community centre that was evaluating the effectiveness of their poverty reduction work. The number of families classified as having a low income had decreased recently. Several nearby neighbourhoods (Neighbourhoods B and C) had definitely not seen this decrease.

neighbourhoods

(Shout out to Stephanie Evergreen for forever changing my life with small multiples)

At first glance this looked promising – had the poverty reduction campaign contributed to this? People were excited but I had my reservations about claiming success so quickly.

If you’ve recently visited Toronto you know that there are building cranes everywhere. Neighbourhoods are changing (read: gentrifying) very, very quickly as luxury condos go up and lower income families are driven further and further out of the core. It was possible that the income level of residents hadn’t changed – perhaps the low income residents had moved out and more affluent residents had moved in. First piece of evidence: Neighbourhood A had four condominium projects completed in that time frame whereas Neighbourhood B had one and Neighbourhood C had zero.

Next we looked at demographics. Canada completes a census every five years. We could compare 2006 and 2011 data, as the 2016 data is not yet available. Second piece of evidence: Neighbourhood A had decreases in children, youth, and seniors (and families overall) but an increase in working-age adults. The change wasn’t nearly as drastic in Neighbourhoods B and C.
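A comparison like this comes down to a percent change per demographic group between the two census years. Here is a minimal Python sketch of that arithmetic. The function name and the counts are placeholders invented for illustration; they are not census figures for any of the neighbourhoods.

```python
def percent_change(before, after):
    """Percent change for each group present in `before`, from `before` to `after`."""
    return {
        group: 100 * (after[group] - before[group]) / before[group]
        for group in before
    }


# Placeholder counts for a single neighbourhood, NOT real census data.
counts_2006 = {"children": 1000, "youth": 800, "working-age adults": 4000, "seniors": 500}
counts_2011 = {"children": 900, "youth": 760, "working-age adults": 4600, "seniors": 450}

for group, change in percent_change(counts_2006, counts_2011).items():
    print(f"{group}: {change:+.1f}%")
```

Running the same calculation for each neighbourhood makes it easy to see whether one of them is changing in a different direction from its neighbours.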

Fortunately we had a lot of other data to look at in order to evaluate the program but I thought that this was a nice illustration of why it’s really important to look at the context behind the data and examine other possible explanations before claiming success.