# Understanding variance and linear regression

I’ve written this post to get a better understanding of how the variance, the foundation of most statistics, works and how it applies in the context of a simple linear regression. I hope this post will also be useful for others learning about statistics.

# Load up the R packages and set up data

If you don’t know who Hadley Wickham is (check out
his GitHub repos), you likely will as time goes on, because he is the
author of several *incredible* R packages, including ggplot2,
dplyr, and tidyr, among many others! He also contributes to
RStudio, which is a great platform for using R.
Anyway, I’m loading the packages below as I’ll be using their
functions in this post.
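For reference, loading the three packages named above looks like this (whether the original post loaded additional packages is unknown, so this list is an assumption):

```r
# Load the packages named above: ggplot2 for plotting,
# dplyr and tidyr for data wrangling
library(ggplot2)
library(dplyr)
library(tidyr)
```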

Next, I’ll assign some variables to `x` and `y` to make later code
easier. I’m also using the built-in R dataset `airquality`.
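The value `var(x)` = 12.4115385 reported later in the post is consistent with the `Wind` column of `airquality`, so the setup was presumably something like the following (the choice of `Temp` for `y` is my assumption; any complete numeric column would work):

```r
# airquality ships with base R; Wind and Temp contain no missing values
x <- airquality$Wind  # consistent with var(x) = 12.4115385 reported below
y <- airquality$Temp  # assumed; not recoverable from the text
```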

# The goal: Show how variance works in linear regression

The main goal here is to show how variance works in linear regression and how easy a simple linear regression is to calculate. To get started, the common formula used to denote a linear regression is:

\[ y = \alpha + \beta x + \varepsilon \]
Another way to write it is:

\[ y = mx + b \]
Where \(m\) is the slope and \(b\) is the intercept. The slope is
calculated by
*least squares*, which
minimizes the sum of the squared residuals (or error terms). Bit by
bit, I’ll show how this formula is actually derived, starting with the
variance.

# Spread in the data: The variance statistic

The basic foundation of analyzing data, whether to make comparisons or predictions, is how spread out the data are from one another. So the formula for the variance of \(x\) is:

\[ \sigma_x^2 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n - 1} \]
If you take a look at the formula, you can see that as values of \(x\) get further and further from the mean \(\bar{x}\), the square will make them larger and always positive. So, the more spread out the data is from the mean, the higher the value of \(\sigma_{x}^2\), or the variance, is. If we code this in R, this would be:
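A minimal sketch of that calculation, assuming `x` is `airquality$Wind` as set up earlier:

```r
x <- airquality$Wind
n <- length(x)
# variance by hand: squared deviations from the mean, divided by n - 1
sum((x - mean(x))^2) / (n - 1)
# built-in equivalent
var(x)
```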

Unlike the standard deviation, which I will talk about next, the variance is difficult to visualize because it doesn’t directly represent the spread of the data; it only indicates the degree of spread. Nonetheless, we can plot the distribution of the data to get a sense of the spread.

Since the variance is directionless, it would be nice to have some way to indicate the degree to which a variable spreads in each univariate direction. This is where the standard deviation comes in.

# Standard deviation: Gives the direction and magnitude of spread

The standard deviation is derived from the variance statistic. The formula is very simple:

\[ \sigma_x = \sqrt{\sigma_x^2} \]
Or more directly:

\[ \sigma_x = \sqrt{\frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n - 1}} \]
Take some time to look at this formula and to understand it. Because of the square root, \(\sigma_x\) is on the same scale as the data and is interpreted about the mean, so there is an implicit assumption that the data are spread roughly equally on either side of the mean. This is why the standard deviation is most informative for a normal (or Gaussian) distribution. If the spread is very unequal about the mean, the standard deviation is a less useful statistic to use. This concept will come back into play later on.

So, the R code to calculate the standard deviation is:
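Again assuming `x` is `airquality$Wind`, the manual and built-in versions agree:

```r
x <- airquality$Wind
# standard deviation by hand: square root of the variance
sqrt(sum((x - mean(x))^2) / (length(x) - 1))
# built-in equivalent
sd(x)
```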

And if we multiply the standard deviation by itself, we get the
variance! In R, `sd(x) * sd(x)` = `var(x)` = 12.4115385.

Unlike the variance, the standard deviation can be visualized.
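One way to sketch this with ggplot2 (again assuming `x` is `airquality$Wind`) is a histogram with the mean marked as a dashed line and the mean plus or minus one standard deviation marked in blue:

```r
library(ggplot2)
x <- airquality$Wind
# histogram of x, with mean (dashed) and mean +/- 1 sd (blue) marked
p <- ggplot(data.frame(x = x), aes(x = x)) +
  geom_histogram(bins = 20) +
  geom_vline(xintercept = mean(x), linetype = "dashed") +
  geom_vline(xintercept = mean(x) + c(-1, 1) * sd(x), colour = "blue")
p
```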

The variance and standard deviation are useful for univariate
statistics, but since a simple linear regression involves two
variables, we need statistics that take into consideration the
spread *between* variables…

# The spread between two variables: Covariance

Just like the variance, the covariance is a value that indicates the degree to which the spread of two variables is related. The formula is similar to the variance, except for the addition of another term:

\[ \mathrm{cov}(x, y) = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{n - 1} \]
While this formula appears similar to the variance, the addition of the \(y\) term changes the interpretation quite substantially. Take a look at the formula. The value of the covariance depends upon how related the spread of \(x\) and \(y\) are to each other, and unlike the variance, there can be negative covariance. So:

- If \(x\) tends to be more *positively* spread from the mean while, at the same time, \(y\) tends to be more *negatively* spread from the mean, this gives the covariance a negative value (a positive times a negative equals a negative). Likewise, a *positive* \(x\) and a *positive* \(y\) will give a *positive* covariance.
- If \(x\) and \(y\) tend to spread far from the mean *together*, this gives a larger covariance. As they spread less from the mean *together*, the covariance will be lower.

Thus, the covariance is a measure of how related two variables are to each other: for a given change in one variable, how much change do we also see in the other? A covariance of zero means that there is no *linear* relationship between the two variables.

Again, look at the formula. If, in any given observation (or row), either \(x\) or \(y\) is missing, the formula doesn’t work. So an assumption of the covariance is that the data are complete cases (no missingness). Also, because covariance values depend on the scale of \(x\) and \(y\), there is no ‘standardized’ way of comparing covariances across different variables.

The R code for calculating the covariance is:
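A sketch of the manual calculation next to R’s built-in `cov()`, using the `x` and `y` set up earlier (`Temp` for `y` is an assumption):

```r
x <- airquality$Wind
y <- airquality$Temp  # assumed earlier
# covariance by hand: paired deviations from each mean, divided by n - 1
sum((x - mean(x)) * (y - mean(y))) / (length(x) - 1)
# built-in equivalent
cov(x, y)
```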

In fact, if we replace \(y\) with \(x\) in the formula above, we get the variance! And computed in R:
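A sketch, with the same `x` as before:

```r
x <- airquality$Wind
# the covariance of a variable with itself is its variance
cov(x, x)
var(x)
```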

Just like the variance, showing the covariance on a plot is difficult. However, a standard way to present bivariate information is through a scatter plot.

However, just as with the variance, we need a way to standardize the covariance so that it is interpretable across different variables while still giving a sense of direction and magnitude. Just as the standard deviation does this for the variance, the correlation statistic does it for the covariance.

# Standardized way of comparing two variables: The correlation

In this case, the correlation statistic is known as the Pearson correlation. There are other types of correlation you can use, like Spearman’s, but I won’t get into that here. The formula for the Pearson correlation is:

\[ r = \frac{\dfrac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{n - 1}}{\sqrt{\dfrac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n - 1}}\;\sqrt{\dfrac{\sum_{i=1}^{n}(y_i - \bar{y})^2}{n - 1}}} \]
Take a good look at the formula. Does something look familiar? If you notice, the top part is the same as the covariance and the bottom part contains two variance formulas (one for \(x\) and one for \(y\)). So, if I re-write this formula to simplify it:

\[ r = \frac{\mathrm{cov}(x, y)}{\sqrt{\sigma_x^2}\,\sqrt{\sigma_y^2}} \]
We could simplify it even more. Do you remember what the square root of the variance gives us? The standard deviation! So:

\[ r = \frac{\mathrm{cov}(x, y)}{\sigma_x \sigma_y} \]
Calculating the correlation in R is easy!
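A sketch showing that the covariance divided by the product of the two standard deviations matches the built-in `cor()` (same assumed `x` and `y`):

```r
x <- airquality$Wind
y <- airquality$Temp  # assumed earlier
# correlation by hand: covariance standardized by both sds
cov(x, y) / (sd(x) * sd(y))
# built-in equivalent
cor(x, y)
```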

Contrary to what is often shown or thought, the correlation value does not represent the slope between two variables. The correlation is simply a standardized way of representing how related two variables are to each other, or rather how changes in one variable are related to changes in the other (without implying which change caused which). You can see this from the formula.

So, how does this fit in with linear regression??

# Linear regression: Incorporating correlation and variance

Bringing the formula for the linear regression back down from above, the simplest equation is:

\[ y = \alpha + \beta x + \varepsilon \]
Based on least squares estimation, the slope of the line of best fit in the simple linear regression case can be calculated as:

\[ \beta = \frac{\dfrac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{n - 1}}{\dfrac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n - 1}} \]
If you’ll notice, the formula on the top is the covariance and on the bottom is the variance of \(x\). So, we can simplify:

\[ \beta = \frac{\mathrm{cov}(x, y)}{\sigma_x^2} \]
And if we multiply the top and bottom by the standard deviation of \(y\) and regroup, we get:

\[ \beta = \frac{\mathrm{cov}(x, y)}{\sigma_x^2} \cdot \frac{\sigma_y}{\sigma_y} = \frac{\mathrm{cov}(x, y)}{\sigma_x \sigma_y} \cdot \frac{\sigma_y}{\sigma_x} \]
What does the first factor look like? It is the formula for the correlation! So we can simplify it to show the role of the correlation in linear regression:

\[ \beta = r\,\frac{\sigma_y}{\sigma_x} \]
Then, \(\alpha\) can be calculated by simply solving the equation \(\alpha = \bar{y} - \beta \bar{x}\), since the line of best fit always passes through the point \((\bar{x}, \bar{y})\).

Knowing the formulas for how these statistics are calculated can give
you some insight into why they have the assumptions that they do. For
instance, since the least squares approach minimizes the sum of the
squared *residuals*, the usual normality assumption in linear
regression applies to the residuals, and not to the univariate
distributions of \(x\) or \(y\) (a common misconception)!

Likewise, since it is a *line* of best fit, linear regression is for
linear relationships. Not *non*-linear ones.

So, with the R code:
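A sketch of fitting the model with `lm()` and checking it against the formulas above (same assumed `x` and `y`):

```r
x <- airquality$Wind
y <- airquality$Temp  # assumed earlier
fit <- lm(y ~ x)
coef(fit)
# the same estimates from the covariance/variance formulas
beta <- cov(x, y) / var(x)
alpha <- mean(y) - beta * mean(x)
c(intercept = alpha, slope = beta)
```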

And in this case, it is easy to plot the linear regression line.
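With ggplot2, a sketch of that plot (assuming the `Wind` and `Temp` columns as before) is a scatter plot with a least squares line overlaid:

```r
library(ggplot2)
# scatter plot with the least squares line added by geom_smooth
p <- ggplot(airquality, aes(x = Wind, y = Temp)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)
p
```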

A nifty thing about linear regression and its use of the correlation
is that, based on the formula, when \(x\) and \(y\) are scaled (centered
and standardized) to have a mean of 0 and a standard deviation of 1, the
correlation coefficient **is** the linear regression estimate! That’s
because, looking at the formula above, `sd(x)` = 1 and `sd(y)` = 1, so
they cancel out, leaving just the correlation coefficient. Check it in
R:
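A sketch of the check, standardizing both variables with `scale()` (same assumed `x` and `y`):

```r
x <- airquality$Wind
y <- airquality$Temp  # assumed earlier
# slope of the regression on standardized variables...
unname(coef(lm(scale(y) ~ scale(x)))[2])
# ...equals the correlation coefficient
cor(x, y)
```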

# To conclude:

In conclusion, what I’ve learned from digging into the basics is a
newfound appreciation of statistics and a better sense of what the
numbers mean. I think it’s important to at least understand what the
formulas mean. I don’t think it’s important to memorize them, but
definitely to appreciate what the formulas can, and *can’t*, do.

Anyway, I hope to do more of these blogs in the future. Stay tuned!