The rise of R for statistical analysis

Written on March 10, 2017

Note: This is an opinion piece written for the Department of Nutritional Sciences (University of Toronto) newsletter NutriNews. A link to the piece will be posted after the newsletter is published.

As in all scientific research, the foundation of nutritional sciences is the use of statistics to determine whether a finding is likely due to chance. In the past, studies were simple, software was rudimentary or non-existent, and statistics could generally be done by hand. Now a single study can generate massive amounts of data, and rapid technological advancement has produced powerful software and hardware. Scientists are taking advantage of these changes to run more complicated studies and to do their statistics in software.

Not long ago, it was mainly private corporations that created and distributed proprietary (closed) statistical software. One such example is SAS, which is the name of both the company and the software it sells. Scientists relied on these companies to ensure that the statistics they ran were accurate and reliable, but at the cost of hefty price tags and restrictive licensing.

However, over the last decade and a half, open source software has risen dramatically in popularity and use. Open source describes any software whose code is publicly available to inspect and improve, and which is usually free to use. You may not know this, but our economy, the web, and almost all the electronic gadgets and devices you use in your everyday life run on open source software. It is the engine that keeps much of modern human activity running.

One such piece of open source statistical software is R. It has rapidly gained massive traction within the scientific community because of its powerful statistical capabilities and programming features. I use R for my own research, and I teach it as well. And while I may be a bit biased, I believe that more scientific research should be done with open source software such as R rather than closed software (e.g., SAS or SPSS), for several strong reasons. To keep things short, I'll list only three.
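To give a sense of what an analysis looks like in practice, here is a minimal sketch of a common task, a two-sample t-test, using one of R's built-in datasets:

```r
# Compare mean chick weight between two feed groups using the
# chickwts dataset that ships with every R installation.
two_feeds <- droplevels(subset(chickwts, feed %in% c("soybean", "sunflower")))
t.test(weight ~ feed, data = two_feeds)
```

Two lines of code subset the data and run the test; the output reports the group means, the t statistic, and the p-value.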

The first is that, because R is open source, anyone is free to look over the code and verify that the results it produces are accurate. With proprietary software, there is no way to verify that the code is correct; instead, you have to trust the corporations when they say it works. This goes against the scientific principles of reproducibility and replication.

Second, anyone can contribute to or develop extensions for R. Unlike proprietary software, which is bound by corporate bureaucracy and policies, R allows cutting-edge techniques to be distributed as soon as they are developed, and some of these techniques may never make it into proprietary software at all. Because so many statisticians around the world use and contribute to R, advanced statistical techniques are often available in R long before companies can incorporate them into their products. Science is about using the latest tools to understand and analyze research results, and this is a major advantage of R over other software.
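Getting one of these community-contributed extensions is itself a one-line affair. As a minimal sketch (the package chosen here is just one example from CRAN, R's central package repository):

```r
# Install and load a community-contributed package from CRAN;
# any of the thousands of CRAN packages works the same way.
install.packages("ggplot2")
library(ggplot2)
```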

Lastly, R is free and will always be free. Its license legally guarantees that it can never be brought under closed, proprietary control. With scientific funding steadily declining, any way of reducing expenses should appeal to scientists, since it makes their funding last longer. One argument made against R in this regard is that we can't trust its results because it is free and has no company behind it. The problem with this argument is that the warranty on the accuracy of results is the same for any software, closed or open: there is no warranty that the results will be correct.

These three reasons alone make a strong case for using R, but there is also a value-based reason: the values underlying open source match the values of science itself. Modern science is about sharing (though we could get better at it) and collaborating, with as few barriers as possible. Using R allows us as researchers to live these values. Plus, R is just more fun to use!