Loops and Forests: Running and presenting multiple tests of linear regression
If you do any type of data-heavy work, you likely have had to run many
tests of a regression. As the number of response and explanatory
variables increases, the number of potential combinations of course
also increases. There is no way you are going to type out dozens of
different regressions… You also have the challenge of presenting
this much information. In this post, I’m going to go over a way to
loop through each of the possible combinations. I’m also going to
argue that whenever many results of the same test are shown, a table
is probably the worst way to show your data, and that plots, in
particular a modified forest plot, are the best way to present it.
For both the loop and the forest plot, I’ve created several functions
that carry out this task for a generalized estimating equations
analysis, with an example in the plotForest function example section.
Setup: Load the necessary packages and set up data
Load up the incredibly useful packages we need.
dplyr, via the magrittr package, allows us to use the
%>% pipe command, which is so amazing that I don’t
know why this type of command wasn’t made sooner!
Let’s create a fake dataset to play with and convert it with the
tbl_df() command. A tbl_df class makes the printing nicer, so that
not all the data is printed.
I’ve made some of the response variables purposefully related to the explanatory variables so that there are at least some statistically significant associations. You’ll also notice that I’ve scaled (mean centered and standardized) all the variables as this will allow the regression results to be comparable across tests. This is especially important when showing them on a forest plot.
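Here is a minimal sketch of what this setup could look like. The seed, the sample size, the column names (`Dep1` to `Dep20`, `Indep1` to `Indep6`), the split into 20 response and 6 explanatory variables, and the strength of the built-in associations are all my assumptions for illustration. I also use `as_tibble()`, which newer versions of dplyr provide in place of `tbl_df()`.

```r
# Packages used throughout (assumed to be installed).
library(dplyr)    # data manipulation; also provides the %>% pipe
library(tidyr)    # reshaping data
library(broom)    # tidy() for regression output
library(ggplot2)  # plotting

set.seed(12345)   # arbitrary seed so the fake data are reproducible
n <- 200          # hypothetical sample size

# 20 response and 6 explanatory variables of pure noise.
dep <- as.data.frame(matrix(rnorm(n * 20), ncol = 20))
names(dep) <- paste0("Dep", 1:20)
indep <- as.data.frame(matrix(rnorm(n * 6), ncol = 6))
names(indep) <- paste0("Indep", 1:6)

# Make a few responses depend on an explanatory variable on purpose.
dep$Dep1 <- dep$Dep1 + 0.5 * indep$Indep1
dep$Dep2 <- dep$Dep2 - 0.4 * indep$Indep2
dep$Dep3 <- dep$Dep3 + 0.3 * indep$Indep3

# Mean-centre and standardize every column, then convert to a tibble
# so that printing only shows the first few rows.
ds <- as.data.frame(scale(cbind(dep, indep))) %>%
    as_tibble()

ds
```

Because every column now has mean 0 and standard deviation 1, each regression coefficient is in standard deviation units, which is what makes the estimates comparable on one plot.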
TODO list for running and plotting many regressions:
Before we begin, it’s good to list out what exactly needs to be done in order to get the end result. So:
- Create some way to run through each combination of the 20 by 6 variables
- Apply the regression to each combination
- Extract only the relevant values from the regressions
- Send the output data to ggplot as a forest plot
1. Create either a formula list or wrangle the data
There are two (probably more) ways that we could run a regression on each combination of response and explanatory variable. The first is to create a list of formulas for each combination.
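As a sketch of the formula list approach, one option is to build every pairing with `expand.grid()` and turn each pair into a formula with `reformulate()`. The variable names below are placeholders that I am assuming, not necessarily the ones in your data.

```r
# Hypothetical variable names: 20 responses and 6 explanatory variables.
dep_names <- paste0("Dep", 1:20)
indep_names <- paste0("Indep", 1:6)

# Every pairing of one response with one explanatory variable.
combos <- expand.grid(dep = dep_names, indep = indep_names,
                      stringsAsFactors = FALSE)

# Build one formula object per pairing, e.g. Dep1 ~ Indep1.
formula_list <- Map(function(d, i) reformulate(i, response = d),
                    combos$dep, combos$indep)
names(formula_list) <- paste(combos$dep, combos$indep, sep = " ~ ")

length(formula_list)
head(formula_list, 3)
```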
You can see that this creates a list of formulas, which can then be
passed into a regression. However, another way to run a regression on
all combinations is to use the
gather function from the
tidyr package. This converts the
dataframe into a format that allows a regression to run on groups of
the response and explanatory variables. I really like this method
(having only recently discovered it) as you can operate on the data
directly, rather than through the formula list.
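A sketch of the wrangling approach, assuming the same placeholder column names as before (`Dep1` to `Dep20`, `Indep1` to `Indep6`). The new column names `Dependent`, `Dep`, `Independent` and `Indep` are my own choice. Note that newer versions of tidyr recommend `pivot_longer()` over `gather()`, though `gather()` still works.

```r
library(dplyr)
library(tidyr)

# Stand-in data with hypothetical column names.
set.seed(12345)
n <- 200
ds <- as.data.frame(matrix(rnorm(n * 26), ncol = 26))
names(ds) <- c(paste0("Dep", 1:20), paste0("Indep", 1:6))
ds <- as_tibble(ds)

long_ds <- ds %>%
    # Stack the response columns: the column name goes into Dependent,
    # the value into Dep.
    gather(Dependent, Dep, starts_with("Dep")) %>%
    # Stack the explanatory columns the same way.
    gather(Independent, Indep, starts_with("Indep"))

long_ds
```

Each row now holds one response value and one explanatory value from the same observation, labelled by which pair of variables they came from, so grouping by `Dependent` and `Independent` isolates the data for one regression.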
2. Apply a regression to each combination
So now we have the formula list or the data in the format needed for
further processing. First, let’s do the regression on the formula list.
Here I use the
tidy function from the
broom package, which basically
tidies up the regression output to make it cleaner and easier to work with.
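A sketch of this step on a deliberately small stand-in dataset (two responses, two explanatory variables, all names hypothetical), assuming an ordinary `lm()` fit:

```r
library(broom)

# Small stand-in dataset, standardized as in the post.
set.seed(12345)
n <- 200
ds <- as.data.frame(scale(matrix(rnorm(n * 4), ncol = 4)))
names(ds) <- c("Dep1", "Dep2", "Indep1", "Indep2")

# Formula list covering every response/explanatory pairing.
combos <- expand.grid(dep = c("Dep1", "Dep2"),
                      indep = c("Indep1", "Indep2"),
                      stringsAsFactors = FALSE)
formula_list <- Map(function(d, i) reformulate(i, response = d),
                    combos$dep, combos$indep)
names(formula_list) <- paste(combos$dep, combos$indep, sep = " ~ ")

# Fit one regression per formula and tidy each result into a data frame
# that includes the confidence interval.
fits_list <- lapply(formula_list, function(f) {
    tidy(lm(f, data = ds), conf.int = TRUE)
})

fits_list[[1]]
```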
This output is still in a list format, so it will eventually need to be unlisted. Alright, let’s try it with the wrangled data.
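A sketch of the grouped version, again on a small stand-in dataset with hypothetical names and an assumed `lm()` fit:

```r
library(dplyr)
library(tidyr)
library(broom)

# Small stand-in dataset: three responses, two explanatory variables.
set.seed(12345)
n <- 200
ds <- as.data.frame(scale(matrix(rnorm(n * 5), ncol = 5)))
names(ds) <- c("Dep1", "Dep2", "Dep3", "Indep1", "Indep2")

fits <- ds %>%
    gather(Dependent, Dep, starts_with("Dep")) %>%
    gather(Independent, Indep, starts_with("Indep")) %>%
    # One group per response/explanatory pairing.
    group_by(Dependent, Independent) %>%
    # Run the regression within each group and tidy the output.
    do(tidy(lm(Dep ~ Indep, data = .), conf.int = TRUE)) %>%
    ungroup()

fits
```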
In this case, the command above uses the
do function from the
dplyr package, which lets me run
the regression on each ‘group’. Compared to using the formula list,
this approach is much nicer in that you don’t have to unlist the
output, it’s cleaner and easier to read, and it also includes the
Indep variable names. For these reasons (and because it
is a bit frustrating to try to wrangle the formula list output into
a usable format), I’m going to continue with the wrangled data approach.
3. Extract the relevant information
At this point, it is fairly trivial to subset the data and add any
relevant variables to this output dataset. Given I only want the
Indep variable regression information, let’s filter the dataset down.
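A sketch of the filtering step, assuming the tidied output has a `term` column in which the slope rows are labelled `Indep` (which is what `tidy()` produces when the model is `Dep ~ Indep`). The data and names are stand-ins.

```r
library(dplyr)
library(tidyr)
library(broom)

# Rebuild a small tidied regression table.
set.seed(12345)
n <- 200
ds <- as.data.frame(scale(matrix(rnorm(n * 5), ncol = 5)))
names(ds) <- c("Dep1", "Dep2", "Dep3", "Indep1", "Indep2")

fits <- ds %>%
    gather(Dependent, Dep, starts_with("Dep")) %>%
    gather(Independent, Indep, starts_with("Indep")) %>%
    group_by(Dependent, Independent) %>%
    do(tidy(lm(Dep ~ Indep, data = .), conf.int = TRUE)) %>%
    ungroup()

# Drop the intercept rows; keep one slope row per regression.
indep_fits <- fits %>%
    filter(term == "Indep")

indep_fits
```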
Ok, so it’s in a good format to be plotted. However, there is something I want to include in the plot. To help make the visualization easier and quicker to interpret, I want to make the regression results on the plot bigger as the significance becomes stronger. So, I need to create a new variable that represents the levels of significance for each regression.
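One way to sketch this is to bin the p-values with `cut()`. The thresholds (0.001, 0.01, 0.05), the labels, and the column name `p.level` are my assumptions. The data are a small stand-in with one built-in association.

```r
library(dplyr)
library(tidyr)
library(broom)

# Small stand-in dataset with one built-in association.
set.seed(12345)
n <- 200
raw <- as.data.frame(matrix(rnorm(n * 5), ncol = 5))
names(raw) <- c("Dep1", "Dep2", "Dep3", "Indep1", "Indep2")
raw$Dep1 <- raw$Dep1 + 0.5 * raw$Indep1
ds <- as.data.frame(scale(raw))

indep_fits <- ds %>%
    gather(Dependent, Dep, starts_with("Dep")) %>%
    gather(Independent, Indep, starts_with("Indep")) %>%
    group_by(Dependent, Independent) %>%
    do(tidy(lm(Dep ~ Indep, data = .), conf.int = TRUE)) %>%
    ungroup() %>%
    filter(term == "Indep") %>%
    mutate(
        # Bin the p-values into significance levels.
        p.level = cut(p.value,
                      breaks = c(0, 0.001, 0.01, 0.05, 1),
                      labels = c("p<0.001", "p<0.01", "p<0.05", "p>0.05"),
                      include.lowest = TRUE),
        # Reorder so the levels run from least to most significant.
        p.level = factor(p.level, levels = rev(levels(p.level)))
    )

# Count how many regressions fall into each level.
summary(indep_fits$p.level)
```

Ordering the levels from least to most significant means that any size or opacity scale mapped to this variable grows as the evidence gets stronger.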
The summary of the levels of significance shows mostly non-significance (which is expected given the dataset is random), but since I made some of the variables purposefully related, there are some significant associations. We can now pass this dataset into the forest plot.
4. Make a forest plot of the regression output
I chose using a forest plot because it is a perfect plot to represent
results with a confidence band, especially when used to compare across
multiple tests. I use the incredible
ggplot2 to make the forest
plot. While this part takes the most code, it is simple
code for specifying the elements of the plot.
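A sketch of a forest plot along these lines, on a small stand-in dataset. The sizes, opacities, labels and theme are my assumptions. To keep the sketch short, the confidence interval lines vary only in opacity while the dots vary in both size and opacity.

```r
library(dplyr)
library(tidyr)
library(broom)
library(ggplot2)

# Small stand-in dataset with one built-in association.
set.seed(12345)
n <- 200
raw <- as.data.frame(matrix(rnorm(n * 5), ncol = 5))
names(raw) <- c("Dep1", "Dep2", "Dep3", "Indep1", "Indep2")
raw$Dep1 <- raw$Dep1 + 0.5 * raw$Indep1
ds <- as.data.frame(scale(raw))

indep_fits <- ds %>%
    gather(Dependent, Dep, starts_with("Dep")) %>%
    gather(Independent, Indep, starts_with("Indep")) %>%
    group_by(Dependent, Independent) %>%
    do(tidy(lm(Dep ~ Indep, data = .), conf.int = TRUE)) %>%
    ungroup() %>%
    filter(term == "Indep") %>%
    mutate(
        p.level = cut(p.value, breaks = c(0, 0.001, 0.01, 0.05, 1),
                      labels = c("p<0.001", "p<0.01", "p<0.05", "p>0.05"),
                      include.lowest = TRUE),
        p.level = factor(p.level, levels = rev(levels(p.level)))
    )

# Dot size and opacity for each significance level (arbitrary choices).
sizes <- c("p>0.05" = 1.5, "p<0.05" = 2.5, "p<0.01" = 3.5, "p<0.001" = 4.5)
alphas <- c("p>0.05" = 0.3, "p<0.05" = 0.5, "p<0.01" = 0.75, "p<0.001" = 1)

forest_plot <- ggplot(indep_fits, aes(x = estimate, y = Dependent)) +
    # Reference line at zero, i.e. no association.
    geom_vline(xintercept = 0, linetype = "dashed") +
    # Confidence interval as a horizontal line.
    geom_segment(aes(x = conf.low, xend = conf.high, yend = Dependent,
                     alpha = p.level)) +
    # Point estimate.
    geom_point(aes(size = p.level, alpha = p.level)) +
    # One panel per explanatory variable.
    facet_grid(. ~ Independent) +
    scale_size_manual(name = "P-value", values = sizes, drop = FALSE) +
    scale_alpha_manual(name = "P-value", values = alphas, drop = FALSE) +
    labs(x = "Standardized beta coefficient (95% CI)",
         y = "Response variable") +
    theme_bw()

forest_plot
```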
And there you go! A forest plot, with dots that increase in size and opacity as the statistical significance increases.
Visualizations should be used more often to represent results from scientific research. I think that, in general, results should almost always be shown visually, especially when many regressions (or other tests) are run. Forest plots in particular should be used for presenting regression results rather than the table format. The main reasons are:
- A forest plot is a perfect plot device to be able to compare regression results across variables.
- Given that the beta coefficient is presented with its confidence interval, both the magnitude of an association and the uncertainty around how well the beta coefficient represents the population estimate can be quickly inferred and compared across regressions. Text is a very poor tool for evaluating the magnitude of an association in comparison to other test results.
- When many statistical tests are run, there is the concern for false positives. Because the confidence intervals are visualized and the size of the lines and dots depicts higher statistical significance, you can judge whether an association is a false positive more easily than when the results are shown in a table. For instance, when a confidence interval is very wide and close to the line of no association, the result is more likely to be a false positive.
- Humans are visual by nature. Text in tabular format is difficult and cumbersome to work through and understand. Making a forest plot is being considerate to the reader and reviewer of the research. Scientists are busy people and the general public doesn’t always know how to interpret scientific results.
The more work you as the researcher put into making your results visually easy to interpret and understand, the better it will be for you and for your audience!
I hope this post helps others better present their results!