Getting started with mapping and GIS for free (Tips from a non-expert)

gis title

Sometimes it is useful to see results on a map. Maybe you want to see where your participants are coming from or show survey results geographically.

Mapping and GIS (geographic information system) are skills that I had been interested in learning for awhile. I never seemed to have the time to really delve into it and so these interests took the back shelf while other priorities popped up.

This past year I have been working on a project where geography and location are key and so I finally had the push I needed to get up to speed. I had a minimal budget (read: $0) for software. Although there are pretty fancy GIS programs out there (that require minimal know-how) those weren’t in the cards.

Although there are other resources out there, these are the two that I used. They require no coding, making them very beginner-friendly.

1. Google Maps

If you are looking for very basic mapping, Google Maps can actually do quite a bit. You can draw polygons/boundaries, add points, add in directions, and import data (although I think data imports are limited at 50 rows).

Here is a fictitious example of a program location (the purple star) and where the program participants live (the green dots).


The nice thing about Google is that everything is saved on the cloud and you can access your maps from anywhere (and easily share them with others).



I needed to do more complex mapping than Google Maps allows and so I turned to QGIS, an open source GIS tool. I will warn you that it has a steep learning curve but there are many tutorials online (I found QGIS Tutorials and Tips extremely helpful!) and a community over at StackOverflow if you get stuck.

Here is another fictitious example of program locations (the grey circles) mapped in relation to income (red being the lowest income and the darker green being the highest income):


I’d love to hear more from others about this subject. Do you know of a great mapping/GIS tool? Have you used mapping in evaluation? Let me know in the comments!

Using icons to add visual interest to a table


When I write reports I typically only include the most useful/pertinent information in the main body. Detailed information and data tables live in appendices. That way if a reader DOES want the nitty gritty, they can flip to the back and get what they need but all of that extraneous info doesn’t bog down the average reader.

Recently I was including a table of demographic data in an appendix and was looking for ways to add visual interest. Sure, you could just put a boring old table like so:


You could tweak the borders to make it look less ugly:


Nope, not punchy enough.

I had put so much effort into making the main body of the report visually interesting that I wanted to carry this into the appendices. I headed over to and grabbed some icons, added some color, and boom:


I think this much improved and it fits with the main body of the report.

How to make dynamic waffle charts in Excel

Waffle charts can add visual appeal to a report. They seem to be popping up everywhere lately! I recently used them in a report and I wanted to share my process in order to show you how easy they are.


Once you break these charts down you can see that they are just 10×10 grids. And we all know that Excel, er, excels at working with grids.

First off, type out your data in two columns. You can see that I have values for Group A, Group B, and Group C. Then you are going to make some 10×10 grids (since I have three groups, I made three grids). These grids will contain the values 1-100%.


Next we are going to resize the grids. Highlight all of the columns in your grid and drag your mouse to resize. The default line height in Excel is 20 pixels so I like to make my cells 20 pixels wide so that things are perfectly square.


At this point I make my number font really tiny (as in size 6) so that it fits into the cells.

Change your background and text color on the grids to whatever you want your default color to be. I chose a light grey.


Now we are going to change the color of the borders. I find white too harsh of a contrast so I like using a very light grey. Changing border color is a bit quirky. First highlight the columns with your grids. Go to the border button and go down to line color and select the color that you would like.


Your cursor will turn into a little pencil and you will see little black dots in your grids. Higlight the columns that contain the grids again and press the border button. At this point the border colors should be changed.


Next we will use the magic of conditional formatting to fill in our grids.  Highlight the first grid and go to Conditional Formatting and click New Rule.


Next select “Format only cells that contain” and select “less than or equal to” and then select the cell that contains the actual value for Group A. This is telling Excel that you want to change the color of every cell in the grid that is less than or equal to the actual score.


Go to “Format” and change the background fill color AND the font color to whatever color you would like for your group.

Repeat these steps for all of your grids. When you are finished you will have something like this:


Alright, now we’re getting somewhere.

Next we’re going to do some extra Excel kung fu to make pictures from these grids that we can paste anywhere in our workbook (such as a front sheet that you are using to summarize your results).  Not only this, the picture will automatically update if you change your data.

Highlight all of the cells in your first grid and copy. Right click wherever you want your waffle chart to be. Right click, go to paste special, and then go to the little picture with a link on it. This creates something called a linked picture.


You can easily move this picture around on the worksheet and resize it. If you change your data in the previous worksheet, where we set up our grid, the picture will automatically update. Neat, right?

Once you have created linked pictures for all of your waffle charts let’s add some labels so that we can easily tell what scores the charts are representing.

Go to the first waffle chart and insert a text box. Change the fill and outline of the text box to ‘none’. Make sure the (empty) text box is selected. Go up to the formula bar, type = and then navigate to where the score for that group is stored, click that specific cell, and hit enter. Your text box should now contain that group’s score.


Like the linked picture, if you change the group’s score, the text box will automatically update to reflect this.

Repeat the steps above to label your other waffle charts. Tweak the formatting to make the text boxes easy to read. Add a headline that tells the reader the main insight from the charts and there you go, you’re done!


Before you get too excited there is an important downside to waffle charts: They take up a lot of space. Each waffle grid is essentially showing one data point. That is a lot of real estate for one data point!

Let me know your thoughts on waffle charts – Love them? Sick of them? Other thoughts?

Non-linear relationships: The importance of examining distributions

Recently I was analyzing some data to help answer the question “what are the demographic differences between program graduates and program drop outs?” I did some modelling and found a few predictors, one of which was age.

I compared the average age between the groups and saw that the drop outs had a lower average age (42 years) than graduates (44 years). Simple enough. But this simplistic explanation didn’t jive with anecdotal information the program staff had given me. I wondered if the relationship between age and program completion was linear (i.e., does a change in age always produce a chance in the likelihood of graduating).

As I mentioned in my last post, I’ve been playing around with R. I recently came across something called a violin plot and I wanted to try it out. A violin plot is kind of like a box plot, except that instead of a plain old box it shows you the distribution of your data.

Here is an example of a box plot:


The main thing that I immediately see from this chart is that on average, the drop outs were younger than the graduates.

Here is an example of a violin plot:


I get a different takeaway from this plot. You can see from the violin plot that the distribution of age for the drop outs looks a lot different than the distribution of age for the graduates. The bottom of the drop out violin is wider, indicating that the drop outs skew a lot younger than the graduates. This indicates that we should be exploring the relationship between age and graduation more closely.

But what if you don’t use R and can’t create a violin plot? Histograms are standard tools to show distributions and are much more common. A histogram is essentially a column chart that show the frequency of values in your distribution (so for this example, it would show how many participants were 20 years old, 21 years old, 22 years old, you get the idea). Excel actually has a built in feature to create histograms (click here for instructions). The tool bugs me a lot and it isn’t super intuitive to use, but it gets the job done.

Here is the distribution for age for both the drop outs and graduates. Yes, yes, I know that my x-axes aren’t labelled and that my y-axes use different scales but these choices were intentional because I want you to focus on the shape of the distributions, not the content.


Again, you can see that the age of the drop outs skews to the left (meaning that there is a higher proportion of younger participants than older). The histogram for the graduated group looks quite different.

All of this evidence points to a non-linear relationship, meaning that age has an effect on whether or not a participant graduates for participants in different age groups.

To take a closer look at this relationship, I calculated the drop out rate for different age groupings and put them on a line chart. Aha! If the relationship between age and program completion was linear, we would expect this line to be straight. But it’s not. You can see that the drop-out rate declines with age until we hit age 40 or so. After that it’s more or less flat until age 70, and then goes down again.


This is an important piece of knowledge for program staff to target retention efforts and something that we wouldn’t have uncovered if we simply had stopped at comparing the average age between the drop-outs and the graduates.

Chronicling my adventures in R: Why switch from SPSS and favourite packages so far

I’ve been saying that I’m going to start using R for a long time (this post assumes that you know what R is, if you want a brief explanation click here). I’ve officially declared 2017 to be the year that I switch from using SPSS to R for all of my data analysis. I’m going to make a series of posts sharing what I learn along the way.

Some background: I have been using SPSS since I first learned data analysis in 2002. As is (was?) common in the social sciences, all of my undergraduate training, and a lot of my graduate training, was using the GUI (graphical user interface). Over the past 5 years or so I have switched to syntax for reproducibility reasons. I have no prior experience with coding.

First off, why would I bother switching to R if SPSS has served me well for the past 15 years (yikes)? Good question! Here are the reasons that prompted me to make the switch:

  1. R is open source (read: free). SPSS is very expensive. The standard version is now more than $2500 US per year! I also like to support the open source movement which is about collaboration and community.
  2. It’s the leading tool in statistics. R is the most used tool in statistics, data science, and machine learning. Because it is open source, other users are constantly creating packages (there are thousands that anyone can download and use). There is a large, active, and growing community of users and this community is a great resource.
  3. The data visualization capabilities blow SPSS out of the water. Have you tried making a nice chart in SPSS? It’s an awful process and the end result isn’t great. My current workflow is to copy and paste SPSS output into Excel and do my visualization there. It works but wouldn’t it be grand if I could just make nice charts by adding a few lines of code while analyzing the data?
  4. It’s a lot more flexible than SPSS. R is not just a piece of software, it is a programming language. With SPSS you are often ‘locked in’ to the options for analyses that the software gives you. With R, if you can write the code you can do just about anything.

That list sounds great. Why have I waited so long to make the switch? R has a steep learning curve, especially if are you like me and you do not have a coding background. There are various online courses that introduce R. I have taken a few in the past and have found them to be quite helpful. The major thing that I learned from courses is how to “think like a programmer”…this was a large hurdle.

Now that I have the 101 material out of the way, I want to learn by doing and so I have been using R for all of my data analysis so far this year, mainly by following along with various tutorials. For example, recently I was doing a logistic regression. Because R has such an active user community, I was able to Google “R logistic regression tutorial” and bam, I could follow along with my own data.

My first major piece of advice in using R is to use RStudio, which is free for personal use. It has many advantages over base R, including a graphical workspace and a full-featured text editor.

Finally I want to share some packages that I have been using a lot as I get started with R:

  1. knitr – With SPSS I had so many files to go along with my analysis. There were syntax files and then there were output files. These files are difficult to share if the person you are sharing with doesn’t have SPSS, not to mention that it can be difficult to follow along when reading someone else’s output file. Knitr generates a document that has your code (syntax), results (output), and allows you to easily have formatted text with an intro, commentary, and conclusion to your analysis. So at the end of the day you have one file for everything and it can easily be shared as an .html, .doc, or .pdf file. Knitr is seamlessly integrated into RStudio (see above).
  2. ggplot2 – As I said before, the data visualization capabilities were a major draw for me to adopt R. ggplot2 can make gorgeous charts where you can customize almost all of the features.
  3. corrplot – I’ve struggled with presentation of correlation matrices before and I usually use a heatmap/table that I make in Excel. I just stumbled across the corrplot package a few days ago and I immediately fell in love. It is still a type of heatmap, but it makes the correlation matrix a lot more user-friendly to share with non-stats folks.

That sums up everything I wanted to say about my R journey so far. I’m aiming to write one of these posts every month or so and share my learnings. In the meantime I’d love to hear from other evaluators using R. Have you recently made the switch? How has it helped you?



Showing two main points on one chart

It’s (usually) fairly straightforward to choose a chart type when you know what the main point you are trying to get across is. Is your message that there has been a change over time? Do you want to show a difference between groups? There are all kinds of online chart choosers to help you do this (here is one of my favourites). But what about when you have two main points to make?

I was recently working on a chart where I wanted to make the following two points:

  1. 2016 was the only year that participants had a statistically signifcant increase in health ratings; and
  2. participants had lower health ratings pre-program in 2016 vs. other years

I started with the chart below. Here the different color used in 2016 really highlights that something different happend that year (half of point #2), but it is difficult to see the change over time (point #1, half of point #2):


Alright then, let’s change to a line graph. It is much easier to see the change over time. However, the statistical change in pre- and post-test scores was important to the program and they wanted to highlight that. That piece of information isn’t easy to see here.


I added a transparent rectangle to highlight the difference between pre- and post-test scores and this is the result:


I think that this chart nicely conveys the two main points that I wanted to make and is a vast improvement over the first chart. It also goes to show that it’s worthwhile to play around with different chart types while working on reporting!

Note: I have changed the results to fictional data to keep things anonymous

Thinking geospatially

Lately I’ve been having a lot of fun making maps in Tableau (What? Everybody doesn’t make data visualizations for fun?). Tableau is a pricy piece of software but you can use Tableau Public for free (and if you are a charity in Canada you can get the desktop version at a very, very discounted rate through TechSoup).

Mapping data is a skill that I’ve been wanting to build for awhile. Lately I’ve been working with community health data and bar charts can only tell me so much. Seeing the data on a map has made a world of difference.

In order to try out working with maps I downloaded data from Toronto’s open data catalogue. The first map I made was a schematic of Toronto’s subway (the TTC). I adjusted the size of the circles representing stations to show the number of daily riders and added a filter so that the viewer could drill down on a specific subway line. Much more interesting than simply looking at a bar chart that lists each stop, right?

Click to go to data viz

Next up I took a look at service calls place to 311 (the customer service department of Toronto). You can see that some areas of the city have quite a high call volume whereas other areas are relatively low. If you click on a specific area of the map the bar chart below will automatically filter to show you the top 10 reasons for service calls originating in that area. What strikes me the most is that throughout the city the most common reason for calling 311 by far is issues surrounding garbage, recycling, and compost bins.

Click to go to data viz

The third map that I want to share is neighbourhood safety. I took a look at major crimes and other safety-related incidents by Toronto neighbourhood. The data is a little old (2011) but you can instantly see that the majority of incidents are concentrated in a few neighbourhoods. You can filter the map by incident type on the right. Changing incident map changes the map pretty drastically. For example, filter on murder and you can see that the red areas change. Like the 311 map, clicking on a neighbourhood will filter the bar chart below.

Click to go to data viz

So far I’ve been having a lot of success with these interactive maps. It is much easier for people to instantly see and understand the data vs. having to look at a chart and then convert the words into geography in their head.

What about you – what has been your experience with presenting geospatial data?

The Importance of Context

Recently I was looking at some data and I noticed a trend in a neighbourhood surrounding a community centre that was evaluating the effectiveness of their poverty reduction work. The number of families classified as having a low income had decreased over recently (Neighbourhood A). Several nearby neighbourhoods (Neighbourhoods B and C) had definitely not seen this decrease.


(Shout out to Stephanie Evergreen for forever changing my life with small multiples)

At first glance this looked promising – had the poverty reduction campaign contributed to this? People were excited but I had my reservations about claiming success so quickly.

If you’ve recently visited Toronto you know that there are building cranes everywhere. Neighbourhoods are changing (read: gentrifying) very, very quickly as luxury condos go up and lower income families are driven further and further out of the core. It was possible that the income level of residents hadn’t changed – perhaps the low income residents had moved out and more affluent residents had moved in. First piece of evidence: Neighbourhood A had four condominium projects completed in that time frame whereas Neighbourhood B had one and Neighbourhood C had zero.

Next we looked at demographics. Canada completes a census every five years. We had could compare 2006 and 2011 data as the 2016 is not yet available. Second piece of evidence: Neighbourhood A had decreases in children, youth, and seniors (and families overall) but an increase in working age adults). The change wasn’t near as drastic in Neighbourhoods B and C.

Fortunately we had a lot of other data to look at in order to evaluate the program but I thought that this was a nice illustration of why it’s really important to look at the context behind the data and examine other possible explanations before claiming success.


data viz tools

Awhile ago I posted about the data viz catalogue. It’s a neat resource that helps you choose a visualization that best tells the story of your data. The creator has recently posted a roundup of the 20 best tools for data visualization. It includes tools that have no coding required as well as tools for developers. There were definitely a couple that were new to me and I look forward to checking them out.

On my 2016 to-do list: learn enough coding that I can play around with the dev tools.