Understanding variance and linear regression

Written on March 17, 2015

I’ve written this post because I want to get a better understanding of how the foundation of most statistics, the variance, works and how it applies within the context of a simple linear regression. I hope this post will also be useful for others learning about statistics.

Load up the R packages and set up data

If you don’t know who Hadley Wickham is (check out his GitHub repos), you likely will as time goes on, because he is the author of several incredible R packages, including ggplot2, dplyr, and tidyr, among many others! He also works at RStudio, which is a great environment for using R. Anyway, I’m loading the packages below because I’ll be using their functions in this post.

library(ggplot2) ## For the plots
library(dplyr) ## For the %>% pipe operator (and other functions)

Next, I’ll assign some variables to x and y to make the later code easier to read. I’m also using the built-in R dataset airquality.

aq <- airquality %>%
    tbl_df()

x <- aq$Wind
y <- aq$Temp
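
Before moving on, a quick peek at the data never hurts (this is just a sanity check on my part; the rest of the post doesn’t depend on it):

## Quick look at the dataset and the two variables
head(aq)
summary(x) ## Wind, in mph
summary(y) ## Temp, in degrees Fahrenheit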

The goal: Show how variance works in linear regression

The main goal here is to show how variance works in linear regression and how easy a simple linear regression is to calculate. To get started, the common formula used to denote a linear regression is:

\[ y = \alpha + \beta x + \varepsilon \]

Another way to write it is:

\[ y = m + bx \]

Where \(m\) is the intercept and \(b\) is the slope. The slope is calculated by least squares, which minimizes the sum of the squared residuals (or error terms). Bit by bit I’ll show how this formula is truly calculated, starting with the variance.
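
To make the least squares idea a bit more concrete, here is a small sketch of my own (not part of the derivation that follows) that computes the sum of squared residuals at the lm() estimates and at a couple of nearby slopes; the least squares slope should always give the smallest value:

## Sum of squared residuals for a given intercept (a) and slope (b)
rss <- function(a, b) sum((y - (a + b * x)) ^ 2)

## Least squares estimates from lm()
fit <- lm(y ~ x)
a.hat <- coef(fit)[1]
b.hat <- coef(fit)[2]

## The RSS at the least squares slope is smaller than at nearby slopes
rss(a.hat, b.hat)
rss(a.hat, b.hat - 0.5)
rss(a.hat, b.hat + 0.5)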

Spread in the data: The variance statistic

The basic foundation of analyzing data, whether for making comparisons or making predictions, is essentially how spread out the data are from one another. So the formula for the variance of \(x\) is:

\[ \sigma_{x}^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} \]

If you take a look at the formula, you can see that as values of \(x\) get further and further from the mean \(\bar{x}\), squaring makes those deviations larger and always positive. So, the more spread out the data are from the mean, the higher the value of \(\sigma_{x}^2\), the variance. If we code this in R, it would be:

## Raw formula
sum(((x - mean(x)) ^ 2)) / (length(x) - 1)
## [1] 12.41154
## ... or as a function in base R
var(x)
## [1] 12.41154

Unlike the standard deviation, which I will talk about next, the variance is difficult to visualize because it doesn’t directly represent the spread of the data in the data’s own units; it merely indicates the degree to which the data are spread. Nonetheless, we can plot the distribution of the data to get a sense of that spread.

qplot(x, geom = 'density')

Visualising the variance in x.

Since the variance is directionless and not in the original units, it would be nice to have a statistic that indicates how far the data spread on either side of the mean, in the variable’s own units. This is where the standard deviation comes in.

Standard deviation: Gives the direction and magnitude of spread

The standard deviation derives from the variance statistic, and its formula is very simple:

\[ \sigma_x = \sqrt{\sigma_{x}^2} \]

Or, written out more directly:

\[ \sigma_x = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}} \]

Take some time to look at this formula and to understand it. Because of the square root, \(\sigma_x\) is back on the original scale of the data and is interpreted as a spread about the mean, which implicitly assumes that the data are spread roughly equally on either side of the mean. This is why the standard deviation is most useful when the data follow an approximately normal (or Gaussian) distribution; if the spread is very unequal about the mean, the standard deviation is not a useful statistic to use. This concept will come back into play later on.

So, the R code to calculate the standard deviation is:

## Raw formula
sqrt(sum(((x - mean(x)) ^ 2)) / (length(x) - 1))
## [1] 3.523001
## Standard function in R
sd(x)
## [1] 3.523001

And if we multiply the standard deviation by itself, we get the variance back! In R: sd(x) * sd(x) = var(x) = 12.4115385

Unlike the variance, the standard deviation can be visualized.

ggplot() +
  geom_density(aes(x), fill = 'orange', alpha = 0.2) +
  geom_vline(aes(xintercept = mean(x)), linetype = 'dashed',
             colour = 'blue', size = 1.25) +
  geom_vline(aes(xintercept = c((mean(x) - sd(x)),
                                (mean(x) + sd(x)))),
             linetype = 'dotted', colour = 'blue', size = 1)

Standardized spread: Standard deviation and mean of x.
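
As a rough check of the ‘spread equally about the mean’ idea, we can count what proportion of observations falls within one standard deviation of the mean; for roughly normal data this should be somewhere around 68% (this is just my own quick check, not a formal test):

## Proportion of observations within one standard deviation of the mean
mean(x > (mean(x) - sd(x)) & x < (mean(x) + sd(x)))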

The variance and standard deviation are useful for univariate statistics, but since the simple linear regression involves two variables, we need statistics that take into consideration the spread between variables…

The spread between two variables: Covariance

Just like the variance, the covariance is a value that indicates the degree to which the spread of two variables is related. The formula is similar to the variance, except for the addition of another term:

\[ \mathrm{cov}(x, y) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n - 1} \]

While this formula appears similar to the variance, the addition of the \(y\) term changes the interpretation quite substantially. Take a look at the formula. The value of the covariance depends upon how related the spread of \(x\) and \(y\) are to each other, and unlike the variance, the covariance can be negative. So:

  • If \(x\) tends to be above its mean while, at the same time, \(y\) tends to be below its mean, this gives the covariance a negative value (a positive times a negative equals a negative). Likewise, when the deviations of \(x\) and \(y\) are both positive, the covariance will be positive.
  • If \(x\) and \(y\) tend to spread far from their means together, this gives a covariance that is larger in magnitude. As they spread less from their means together, the covariance will be closer to zero.

Thus, the covariance is a measure of how related two variables are to each other… how much a change in one variable is accompanied by a change in the other. So, a covariance of zero means that there is no linear relationship between the two variables.
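
A quick way to see this sign behaviour is to simulate data where the relationship is known (the simulated variables below are just my own illustration, not part of the airquality data):

set.seed(42)
x.sim <- rnorm(100)
cov(x.sim, x.sim + rnorm(100, sd = 0.5)) ## Move together: positive covariance
cov(x.sim, -x.sim + rnorm(100, sd = 0.5)) ## Move in opposite directions: negative covariance
cov(x.sim, rnorm(100)) ## Unrelated: covariance near zero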

Again, look at the formula. If, in any given observation (or row), either \(x\) or \(y\) is missing, the formula doesn’t work. So an assumption of the covariance is that the data are complete cases (no missingness). Also, because the covariance depends upon the scale of the values in \(x\) and \(y\), there is no ‘standardized’ way of comparing it across different variables.
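
The airquality data actually shows the missingness issue: the Ozone column has missing values, so cov() returns NA unless we explicitly restrict it to complete cases (the use argument is part of base R’s cov(); whether dropping incomplete rows is appropriate is, of course, a separate question):

## Covariance with a variable that contains missing values
cov(aq$Ozone, aq$Temp) ## NA, because Ozone has missing values
cov(aq$Ozone, aq$Temp, use = 'complete.obs') ## Drops the incomplete rows first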

The R code for calculating the covariance is:

## Raw formula
sum((x - mean(x)) * (y - mean(y))) / (length(x) - 1)
## [1] -15.27214
## The function in R
cov(x, y)
## [1] -15.27214

In fact, if we substitute \(x\) for \(y\) in the formula above (that is, take the covariance of \(x\) with itself), we get the variance! Computed in R:

cov(x, x)
## [1] 12.41154
var(x)
## [1] 12.41154

Just like the variance, showing the covariance on a plot is difficult. However, a standard way to present bivariate information is through a scatter plot.

qplot(x, y, geom = 'point')

Visualizing covariance: Scatter plot of x and y

However, just as with the variance, there needs to be a way to standardize the covariance so that it is interpretable across different variables and gives a sense of both direction and magnitude. Just as the standard deviation does this for the variance, the correlation statistic does it for the covariance.

Standardized way of comparing two variables: The correlation

In this case, the correlation statistic is known as the Pearson correlation. There are other types of correlations you can use, like Spearman, but I won’t get into that. The formula for the Pearson correlation is:

\[ r_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) / (n - 1)}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 / (n - 1)} \times \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2 / (n - 1)}} \]

Take a good look at the formula. Does something look familiar? If you notice, the top part is the same as the covariance and the bottom part has two formulas for the variance (of both \(x\) and \(y\)). So, if I re-write this formula to simplify it:

\[ r_{xy} = \frac{\mathrm{cov}(x, y)}{\sqrt{\sigma_{x}^2} \times \sqrt{\sigma_{y}^2}} \]

We could simplify it even more. Do you remember what the square root of the variance gives us? The standard deviation! So:

\[ r_{xy} = \frac{\mathrm{cov}(x, y)}{\sigma_x \sigma_y} \]

Calculating the correlation in R is easy!

## Raw formula without the sd()
cov(x, y) / (sqrt(var(x)) * sqrt(var(y)))
## [1] -0.4579879
## Raw formula with the sd() 
cov(x, y) / (sd(x) * sd(y))
## [1] -0.4579879
## Using the R function
cor(x, y)
## [1] -0.4579879

Contrary to what may often be shown or thought, the correlation value does not represent the slope between two variables. The correlation is simply a standardized way of representing how related two variables are to each other, or rather how changes in one variable are related to changes in the other (without implying which changed which). You can see this from the formula.
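
One quick way to convince yourself of this (a check of my own, using a simple rescaling) is to multiply \(y\) by a constant: the correlation stays exactly the same, while the regression slope changes:

## Rescaling y changes the slope but not the correlation
y10 <- y * 10
cor(x, y)
cor(x, y10) ## Identical to cor(x, y)
coef(lm(y ~ x))[2]
coef(lm(y10 ~ x))[2] ## Ten times the original slope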

So, how does this fit in with linear regression??

Linear regression: Incorporating correlation and variance

Bringing the formula for the linear regression back down from above, the simplest equation is:

\[ y = \alpha + \beta x \]

Based on least squares estimation, the slope \(\beta\) of the line of best fit in the simple linear regression case can be calculated as:

\[ \beta = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) / (n - 1)}{\sum_{i=1}^{n} (x_i - \bar{x})^2 / (n - 1)} \]

If you’ll notice, the formula on the top is the covariance and on the bottom is the variance of \(x\). So, we can simplify:

\[ \beta = \frac{\mathrm{cov}(x, y)}{\sigma_{x}^2} \]

And if we multiply the top and bottom by the standard deviation of \(y\), we get:

\[ \beta = \frac{\mathrm{cov}(x, y)}{\sigma_x \sigma_y} \times \frac{\sigma_y}{\sigma_x} \]

What does the first part of this equation look like? It is the formula for the correlation! So we can simplify once more to show the role of the correlation in linear regression:

\[ \beta = r_{xy} \times \frac{\sigma_y}{\sigma_x} \]

Then, \(\alpha\) can be calculated by simply solving the equation \(\alpha = \bar{y} - \beta \bar{x}\).

Knowing the formulas for how these statistics are calculated can give you some insight into why they have the assumptions that they do. For instance, since the least squares approach minimizes the sum of the squared residuals (i.e. the variance of the residuals), it is the residuals that should be approximately normally distributed, and not the univariate distributions of \(x\) and \(y\) (a common misconception)!

Likewise, since it fits a straight line of best fit, linear regression is for linear relationships, not non-linear ones.
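
A simple way to eyeball these assumptions (my own quick sketch, using the same plotting tools as above) is to look at the residuals themselves: their distribution should look roughly normal, and plotting them against the fitted values should show no obvious curve or pattern:

fit <- lm(y ~ x)
qplot(resid(fit), geom = 'density') ## Roughly normal?
qplot(fitted(fit), resid(fit)) +
    geom_hline(yintercept = 0, linetype = 'dashed') ## Any non-linear pattern?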

So, calculating the slope (and then the intercept) with R code:

## Raw formula
cov(x, y) / var(x)
## [1] -1.230479
## If we want to generalize it a bit more, expanding it gets:
(cov(x, y) / (sd(x) * sd(y))) * (sd(y) / sd(x))
## [1] -1.230479
## Which allows us to use the correlation
cor(x, y) * (sd(y) / sd(x))
## [1] -1.230479
## Compare to the linear regression function:
coef(lm(y ~ x))[2]
##         x 
## -1.230479
## Calculating the intercept:
b <- cor(x, y) * (sd(y) / sd(x))
mean(y) - (b * mean(x))
## [1] 90.13487
## Compared to the lm:
coef(lm(y ~ x))[1]
## (Intercept) 
##    90.13487

And in this case, it is easy to plot the linear regression line.

qplot(x, y, geom = c('point', 'smooth'), method = 'lm')

Linear regression line

A nifty thing about linear regression and its use of the correlation is that, based on the formula, when \(x\) and \(y\) are scaled (centered and standardized) to have a mean of 0 and a standard deviation of 1, the correlation coefficient is the linear regression estimate! That’s because, looking at the formula above, sd(x) = 1 and sd(y) = 1, so they cancel out, leaving just the correlation coefficient. Check it in R:

## Scale the data
s.x <- scale(x)
s.y <- scale(y)

## Check it!
cor(s.x, s.y)
##            [,1]
## [1,] -0.4579879
cor(s.x, s.y) * (sd(s.y) / sd(s.x))
##            [,1]
## [1,] -0.4579879
coef(lm(s.y ~ s.x))[2] %>% signif(7)
##        s.x 
## -0.4579879
qplot(s.x, s.y, geom = c('point', 'smooth'), method = 'lm')

Scaled variables and linear regression line

To conclude:

In conclusion, what I’ve gained from digging into the basics is a newfound appreciation of statistics and a better sense of what the numbers mean. I think it’s important to at least understand what the formulas mean. I don’t think it’s important to memorize them, but it is definitely important to appreciate what the formulas can, and can’t, do.

Anyway, I hope to do more of these blogs in the future. Stay tuned!