Switching from SPSS to R: Save scripts, not workspaces!

I’m back with a quick lesson that I learned while switching to R for data analysis (if you are curious about why I am doing so, I have a list of reasons here). This was a bit of a painful lesson that cost me a lot of time: SAVE SCRIPTS, NOT WORKSPACES.

What does that mean?

I am going to explain this in non-technical terms (sorry R experts), mainly because I don’t know the technical lingo.

In SPSS, your data file is a tangible thing. You can make changes to it and save it and then go back to the actual file and boom, there is the data just as you left it.

In R things work a bit differently. All changes to data (and analyses, and charts, and everything else) are executed through scripts. You write a block of code that does something, you save this script, and each time you open R you re-run it. Objects and dataframes aren’t “real” the way they are in SPSS.
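For example, a typical analysis script might look something like the sketch below (the file and variable names are made up for illustration): every object is rebuilt from the raw data each time the script is run, so the script itself is the single source of truth.

```r
# analysis.R -- everything the analysis needs, rebuilt on every run
library(readr)
library(dplyr)

# Read the raw data (the file name here is just an example)
survey <- read_csv("survey_data.csv")

# All transformations live in the script, not in a saved workspace
survey_clean <- survey %>%
  filter(!is.na(score)) %>%
  mutate(score_z = as.numeric(scale(score)))

# Descriptives can be reproduced on demand
summary(survey_clean$score_z)
```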

Like most R users, I use RStudio. RStudio is amazing and awesome and I love it. But it has a default setting that was enabling a bad habit I carried over from SPSS (i.e., treating objects as “real” and not re-running my script each time to make sure it included everything my analysis needed): by default, RStudio automatically saves your workspace and re-loads it the next time you start the program. Amazing! Or so I thought.

I have been working through the book R for Data Science (a great book which is FREE, by the way), and in the workflow section the authors make this point very clearly: save scripts, not workspaces. I didn’t really get why this was so important. It was so much easier just to open RStudio and have my previous workspace waiting for me.

Unfortunately I learned first-hand why this is so important.

I manually cleared my workspace because I thought I was done with my analysis (and I was sure my script had everything the analysis needed). Turns out my script was missing something pretty important: when I had to go back to my analysis to change something, lo and behold, a few objects were missing from my script and I had to re-create them from memory.

It wasn’t the end of the world since I was able to do that, but it cost me a lot of time. And what if I had to go back to that analysis a year later? My memory would certainly have faded. If I had been working solely from scripts the entire time, this error would have been caught right away (or would not have occurred in the first place).
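One simple habit would have caught this right away: start from an empty workspace and re-run the script top to bottom before declaring the analysis done. A minimal sketch, assuming the script is saved as analysis.R (a made-up name):

```r
# Clear the workspace so nothing carries over from earlier sessions
rm(list = ls())

# Re-run the entire script; any object the script forgot to create
# now triggers an "object not found" error instead of being silently
# picked up from the old workspace
source("analysis.R")
```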

Thankfully you can change the default setting in RStudio so that it doesn’t save your workspace and enable this bad habit. Instructions are here.
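If you prefer to do it with code, my understanding is that the usethis package includes a helper that flips these same RStudio options; treat this sketch as a pointer to verify rather than gospel:

```r
# Assuming the usethis package is installed, this should set RStudio
# to never save the workspace on exit or restore it on startup
usethis::use_blank_slate()

# Outside RStudio, starting R from the command line as
#   R --no-save --no-restore-data
# has the same effect
```

Don’t repeat my mistake!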

Chronicling my adventures in R: Why switch from SPSS and favourite packages so far

For a long time now I’ve been saying that I’m going to start using R (this post assumes that you know what R is; if you want a brief explanation, click here). I’ve officially declared 2017 to be the year that I switch from SPSS to R for all of my data analysis. I’m going to make a series of posts sharing what I learn along the way.

Some background: I have been using SPSS since I first learned data analysis in 2002. As is (was?) common in the social sciences, all of my undergraduate training, and a lot of my graduate training, was done through the GUI (graphical user interface). Over the past 5 years or so I have switched to syntax for reproducibility reasons. I have no prior experience with coding.

First off, why would I bother switching to R if SPSS has served me well for the past 15 years (yikes)? Good question! Here are the reasons that prompted me to make the switch:

  1. R is open source (read: free). SPSS is very expensive. The standard version is now more than $2500 US per year! I also like to support the open source movement, which is about collaboration and community.
  2. It’s a leading tool in statistics. R is one of the most widely used tools in statistics, data science, and machine learning. Because it is open source, other users are constantly creating packages (there are thousands that anyone can download and use). There is a large, active, and growing community of users, and this community is a great resource.
  3. The data visualization capabilities blow SPSS out of the water. Have you tried making a nice chart in SPSS? It’s an awful process and the end result isn’t great. My current workflow is to copy and paste SPSS output into Excel and do my visualization there. It works but wouldn’t it be grand if I could just make nice charts by adding a few lines of code while analyzing the data?
  4. It’s a lot more flexible than SPSS. R is not just a piece of software; it is a programming language. With SPSS you are often ‘locked in’ to the analysis options the software gives you. With R, if you can write the code, you can do just about anything.

That list sounds great, so why have I waited so long to make the switch? R has a steep learning curve, especially if you are like me and do not have a coding background. There are various online courses that introduce R; I have taken a few in the past and found them to be quite helpful. The major thing I learned from the courses was how to “think like a programmer”, and that was a large hurdle.

Now that I have the 101 material out of the way, I want to learn by doing and so I have been using R for all of my data analysis so far this year, mainly by following along with various tutorials. For example, recently I was doing a logistic regression. Because R has such an active user community, I was able to Google “R logistic regression tutorial” and bam, I could follow along with my own data.
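For the curious, a basic logistic regression in R really is only a few lines. Here’s a generic sketch with placeholder data and variable names (not my actual analysis):

```r
# Fit a logistic regression with base R's glm();
# 'outcome' is assumed to be a 0/1 variable in the data frame my_data
model <- glm(outcome ~ age + group, data = my_data, family = binomial)

summary(model)    # coefficients on the log-odds scale
exp(coef(model))  # exponentiate to get odds ratios
```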

My first major piece of advice in using R is to use RStudio, which is free for personal use. It has many advantages over base R, including a graphical workspace and a full-featured text editor.

Finally, I want to share some packages that I have been using a lot as I get started with R:

  1. knitr – With SPSS I had so many files to go along with my analysis: there were syntax files and then there were output files. These files are difficult to share if the person you are sharing with doesn’t have SPSS, not to mention that it can be difficult to follow along when reading someone else’s output file. knitr generates a document that contains your code (syntax) and results (output), and lets you easily add formatted text with an intro, commentary, and conclusion to your analysis. So at the end of the day you have one file for everything, and it can easily be shared as an .html, Word, or .pdf file. knitr is seamlessly integrated into RStudio (see above).
  2. ggplot2 – As I said before, the data visualization capabilities were a major draw for me to adopt R. ggplot2 can make gorgeous charts where you can customize almost all of the features.
  3. corrplot – I’ve struggled with the presentation of correlation matrices before and I usually use a heatmap/table that I make in Excel. I just stumbled across the corrplot package a few days ago and immediately fell in love. It is still a type of heatmap, but it makes the correlation matrix a lot more user-friendly to share with non-stats folks. (A quick sketch of both ggplot2 and corrplot follows this list.)
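To give a flavour of those last two packages, here’s a minimal sketch using R’s built-in mtcars data (purely illustrative):

```r
library(ggplot2)
library(corrplot)

# A simple ggplot2 scatterplot; nearly every element can be customized
ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  labs(x = "Weight (1000 lbs)", y = "Miles per gallon")

# corrplot turns a correlation matrix into a readable chart
m <- cor(mtcars)
corrplot(m, method = "circle", type = "upper")
```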

That sums up everything I wanted to say about my R journey so far. I’m aiming to write one of these posts every month or so and share my learnings. In the meantime I’d love to hear from other evaluators using R. Have you recently made the switch? How has it helped you?