\[ % Defines an openable reference to the given tag \newcommand\fref[2][]{ \hspace{1em} \cssId{ref#1:#2}{ \class{start-socket}{ \phantom{\Box} } } } % Defines an inline reference to the given tag \newcommand\ifref[2][]{ \cssId{ref#1:#2}{ \class{start-socket}{ \phantom{} } } } % Defines a tag to reference later \newcommand\ftag[1]{ \cssId{tag:#1}{ \class{end-socket}{ \phantom{\large #1\Box} } } } \]

Welcome!

This page is an in-the-works solution manual for The Elements of Statistical Learning. I’m writing the page using RMarkdown as I work through the book. This setup allows for intermixing R code and LaTeX mathematics, both vital components when working through the textbook exercises.

I’ll be adding more References and Problems as time goes on.

Before starting, load the necessary R libraries:

library(ggplot2)      # For visualization
library(RColorBrewer) # For colors
library(dplyr)        # Piping and data.frame transformation
library(latex2exp)    # Use LaTeX in `ggplot` axes and titles

References

Introduction

This is the reference section.
We use this section as a notebook for topics not directly related to solving the exercises, but that are useful nonetheless.

Matrix Calculus

This section introduces and proves a few basic facts about matrix calculus. Specifically, we look at what happens when we take the derivative of:

  • A scalar with respect to a vector
  • A vector with respect to a scalar
  • A vector with respect to another vector

For all of the material in this section, I’ll use \(x\) and \(y\) as scalar values. \(X\) will be an \(n\)-dimensional column vector and \(Y\) will be an \(m\)-dimensional column vector.

\[ X = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix} \text{ and } Y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_m \end{bmatrix} \]

All of this material (except the proofs) can be found on the Matrix Calculus Wikipedia page.
For even more fun on multi-dimensional calculus, consider looking through Ricci Calculus, a simplified version of calculus on tensors.

Definitions

Let’s start by taking the derivative of a scalar \(y\) with respect to a column vector \(X\). The result is pretty intuitive; we just take the derivative of \(y\) with respect to each component in \(X\): \[ \dfrac{\partial y}{\partial X} = \partial y / \partial \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix} = \begin{bmatrix} \frac{\partial y}{\partial x_1} \\ \frac{\partial y}{\partial x_2} \\ \vdots \\ \frac{\partial y}{\partial x_n} \end{bmatrix} \ftag{MC.1} \] If \(X\) had been a row vector, then the result would have been a row vector as well. It’s interesting to notice that this quantity is actually just the gradient of \(y\): \[ \nabla y \leftrightarrow \dfrac{\partial y}{\partial X} \]
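
As a quick numerical sanity check of \(\ifref{MC.1}\), we can approximate this derivative with central differences and compare it against the vector of partial derivatives. The helper `num.grad`, the test function `f`, and the point `X0` below are my own illustrative choices, not anything from the text:

# Central-difference approximation to the gradient in MC.1
# (`num.grad`, `f`, and `X0` are illustrative choices)
num.grad <- function(f, X, h = 1e-6) {
  sapply(seq_along(X), function(i) {
    e <- rep(0, length(X)); e[i] <- h;
    (f(X + e) - f(X - e)) / (2 * h);
  });
}

f  <- function(X) sum(X^2) + prod(X);  # arbitrary scalar function of a vector
X0 <- c(1, 2, 3);

num.grad(f, X0);         # numerical gradient
2 * X0 + prod(X0) / X0;  # analytic gradient: 2 x_i + (x_1 x_2 x_3) / x_i

Both lines should print approximately the same vector.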

Next, we take the derivative of a column vector \(Y\) with respect to a scalar \(x\). The result turns out similar to \(\ifref{MC.1}\) above, but transposed. \(Y\) is a column vector, but \(\partial Y / \partial x\) is a row vector: \[ \dfrac{\partial Y}{\partial x} = \dfrac{\partial}{\partial x} \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_m \end{bmatrix} = \begin{bmatrix} \frac{\partial y_1}{\partial x} & \frac{\partial y_2}{\partial x} & \cdots & \frac{\partial y_m}{\partial x} \end{bmatrix} \ftag{MC.2} \] If \(Y\) had been a row vector, then the result would have been a column vector instead.

Why do we transpose the output in this case? The transposition stems from a choice of notation. Currently, there exist two standards for doing matrix calculus called the numerator notation and the denominator notation.

  • In numerator notation, the output is always laid out according to the numerator and/or the transpose of the denominator.
  • In denominator notation, the output is always laid out according to the denominator and/or the transpose of the numerator. This is the notation that the authors of Elements of Statistical Learning seem to use and the notation I used in \(\ifref{MC.1}\) and \(\ifref{MC.2}\).

The following table summarizes both notations:

\[ \begin{array}{c|c|c} & \text{Numerator Notation} & \text{Denominator Notation} \\ \hline \dfrac{\partial y}{\partial X} & \begin{bmatrix} \Box & \Box & \cdots & \Box \end{bmatrix} & \begin{bmatrix} \Box \\ \Box \\ \vdots \\ \Box \end{bmatrix} \\ \\ \dfrac{\partial Y}{\partial x} & \begin{bmatrix} \Box \\ \Box \\ \vdots \\ \Box \end{bmatrix} & \begin{bmatrix} \Box & \Box & \cdots & \Box \end{bmatrix} \end{array} \]

Unless specified otherwise, I’ll be using denominator notation throughout the rest of these solutions.

At this point, you might be wondering why we need either of these notations at all. Why can we not simply make the output a column vector in either case? As you’ll see below, this is not always possible.

We close this section by defining the derivative of an \(m\)-dimensional column vector \(Y\) with respect to an \(n\)-dimensional column vector \(X\). If both \(X\) and \(Y\) are column vectors, which will dictate the columns of the matrix and which will dictate the rows? Well, since we’re using denominator notation, the output must be an \(n \times m\) matrix. This guarantees that \(X\) (an \(n \times 1\) matrix) stays in column form and \(Y\) (an \(m \times 1\) matrix) changes into row form. If we were using numerator notation, the output would have been an \(m \times n\) matrix instead. \[ \begin{array}{rccl} \dfrac{\partial Y}{\partial X} & = & \begin{bmatrix} \frac{\partial y_1}{\partial x_1} & \frac{\partial y_2}{\partial x_1} & \cdots & \frac{\partial y_m}{\partial x_1} \\ \frac{\partial y_1}{\partial x_2} & \frac{\partial y_2}{\partial x_2} & \cdots & \frac{\partial y_m}{\partial x_2} \\ \vdots & & & \vdots \\ \frac{\partial y_1}{\partial x_n} & \frac{\partial y_2}{\partial x_n} & \cdots & \frac{\partial y_m}{\partial x_n} \\ \end{bmatrix} & \ftag{MC.3} \\ & & n \times m & \end{array} \]

Below is an easy way to remember how this notation works. In denominator notation, the denominator stays the same while the numerator gets transposed: \[ \dfrac{\partial Y}{\partial X} \leftrightarrow \dfrac{[m \times 1]}{[n \times 1]} = [n \times m] \] One immediate success of this notation comes from considering the derivative of \(X\) with respect to itself. Using the above definition, we arrive at: \[ \begin{array}{rccl} \dfrac{\partial X}{\partial X} & = & \begin{bmatrix} \frac{\partial x_1}{\partial x_1} & \frac{\partial x_2}{\partial x_1} & \cdots & \frac{\partial x_n}{\partial x_1} \\ \frac{\partial x_1}{\partial x_2} & \frac{\partial x_2}{\partial x_2} & \cdots & \frac{\partial x_n}{\partial x_2} \\ \vdots & & & \vdots \\ \frac{\partial x_1}{\partial x_n} & \frac{\partial x_2}{\partial x_n} & \cdots & \frac{\partial x_n}{\partial x_n} \end{bmatrix} & \fref{MC.3} \\ & = & \begin{bmatrix} 1 & 0 & \cdots & 0 \\ 0 & 1 & \cdots & 0 \\ \vdots & & & \vdots \\ 0 & 0 & \cdots & 1 \end{bmatrix} \\ \dfrac{\partial X}{\partial X} & = & I_n \ftag{MC.4} \end{array} \] This should line up with expectations. We knew that \(\partial X / \partial X\) must be an \(n \times n\) matrix, and the only matrix that makes sense here is the \(n \times n\) identity matrix \(I_n\).

Basic Properties

When you first started working with single variable calculus, you most likely learned rules like: \[ \dfrac{d}{dx} a x = a \text{, } \dfrac{d}{dx} x^2 = 2x \]

Not to mention more general rules like the product rule: \[ \dfrac{d}{dx} uv =\dfrac{dv}{dx} u + \dfrac{du}{dx} v \]

In this section, we’ll learn some elementary rules regarding matrix calculus. These rules will look similar to their univariate cousins above.

First up, consider a constant \(n\)-dimensional column vector \(A\) and its inner product with \(X\), \(A^{T}X\). Since this inner product is scalar, we can try to take its derivative with respect to \(X\), like we did in \(\ifref{MC.1}\). We get: \[ \begin{array}{rccl} \dfrac{\partial}{\partial X} A^{T}X & = & \dfrac{\partial}{\partial X} (a_1 x_1 + a_2 x_2 + \cdots + a_n x_n) & \\ & = & \begin{bmatrix} \frac{\partial}{\partial x_1} (a_1 x_1 + a_2 x_2 + \cdots + a_n x_n) \\ \frac{\partial}{\partial x_2} (a_1 x_1 + a_2 x_2 + \cdots + a_n x_n) \\ \vdots \\ \frac{\partial}{\partial x_n} (a_1 x_1 + a_2 x_2 + \cdots + a_n x_n) \end{bmatrix} & \fref{MC.1} \\ & & n \times 1 & \\ \\ & = & \begin{bmatrix} a_1 \\ a_2 \\ \vdots \\ a_n \end{bmatrix} & \\ & & n \times 1 & \\ \dfrac{\partial}{\partial X} A^{T}X & = & A \ftag{MC.5} \end{array} \] This should remind you of the classic result of taking a derivative of a linear function: \[ \dfrac{\partial}{\partial X} A^{T}X = A \leftrightarrow \dfrac{d}{dx} ax = a \]

Next, we take a look at the matrix calculus version of the product rule. Let \(U\) and \(V\) be \(m\)-dimensional vectors that depend on \(X\). We will attempt to take the derivative of their inner product \(U^{T}V\). Just like in \(\ifref{MC.5}\), we know \(U^{T}V\) is a scalar value so we can take its derivative with respect to \(X\). The derivation below gets tedious because of all the indices, but there’s no new information here: \[ \begin{array}{rccl} \dfrac{\partial}{\partial X} U^{T} V & = & \dfrac{\partial}{\partial X} (u_1 v_1 + u_2 v_2 + \cdots + u_m v_m) & \\ & = & \begin{bmatrix} \frac{\partial}{\partial x_1} (u_1 v_1 + u_2 v_2 + \cdots + u_m v_m) \\ \frac{\partial}{\partial x_2} (u_1 v_1 + u_2 v_2 + \cdots + u_m v_m) \\ \vdots \\ \frac{\partial}{\partial x_n} (u_1 v_1 + u_2 v_2 + \cdots + u_m v_m) \end{bmatrix} & \fref{MC.1} \\ & & n \times 1 & \\ \\ & = & \begin{bmatrix} (u_1 \frac{\partial v_1}{\partial x_1} + \frac{\partial u_1}{\partial x_1} v_1) + (u_2 \frac{\partial v_2}{\partial x_1} + \frac{\partial u_2}{\partial x_1} v_2) + \cdots + (u_m \frac{\partial v_m}{\partial x_1} + \frac{\partial u_m}{\partial x_1} v_m) \\ (u_1 \frac{\partial v_1}{\partial x_2} + \frac{\partial u_1}{\partial x_2} v_1) + (u_2 \frac{\partial v_2}{\partial x_2} + \frac{\partial u_2}{\partial x_2} v_2) + \cdots + (u_m \frac{\partial v_m}{\partial x_2} + \frac{\partial u_m}{\partial x_2} v_m) \\ \vdots \\ (u_1 \frac{\partial v_1}{\partial x_n} + \frac{\partial u_1}{\partial x_n} v_1) + (u_2 \frac{\partial v_2}{\partial x_n} + \frac{\partial u_2}{\partial x_n} v_2) + \cdots + (u_m \frac{\partial v_m}{\partial x_n} + \frac{\partial u_m}{\partial x_n} v_m) \end{bmatrix} & \fref{Product.Rule} \\ & & n \times 1 & \\ \\ & = & \begin{bmatrix} u_1 \frac{\partial v_1}{\partial x_1} + u_2 \frac{\partial v_2}{\partial x_1} + \cdots + u_m \frac{\partial v_m}{\partial x_1} \\ u_1 \frac{\partial v_1}{\partial x_2} + u_2 \frac{\partial v_2}{\partial x_2} + \cdots + u_m \frac{\partial v_m}{\partial x_2} \\ \vdots \\ u_1 \frac{\partial v_1}{\partial x_n} + u_2 \frac{\partial v_2}{\partial x_n} + \cdots + u_m \frac{\partial v_m}{\partial x_n} \\ \end{bmatrix} + \begin{bmatrix} \frac{\partial u_1}{\partial x_1} v_1 + \frac{\partial u_2}{\partial x_1} v_2 + \cdots + \frac{\partial u_m}{\partial x_1} v_m \\ \frac{\partial u_1}{\partial x_2} v_1 + \frac{\partial u_2}{\partial x_2} v_2 + \cdots + \frac{\partial u_m}{\partial x_2} v_m \\ \vdots \\ \frac{\partial u_1}{\partial x_n} v_1 + \frac{\partial u_2}{\partial x_n} v_2 + \cdots + \frac{\partial u_m}{\partial x_n} v_m \\ \end{bmatrix} \\ & & n \times 1 \hspace{13em} n \times 1 & \\ \\ & = & \begin{bmatrix} \frac{\partial v_1}{\partial x_1} & \frac{\partial v_2}{\partial x_1} & \cdots & \frac{\partial v_m}{\partial x_1} \\ \frac{\partial v_1}{\partial x_2} & \frac{\partial v_2}{\partial x_2} & \cdots & \frac{\partial v_m}{\partial x_2} \\ \vdots & & & \vdots \\ \frac{\partial v_1}{\partial x_n} & \frac{\partial v_2}{\partial x_n} & \cdots & \frac{\partial v_m}{\partial x_n} \end{bmatrix} \cdot \begin{bmatrix} u_1 \\ u_2 \\ \vdots \\ u_m \end{bmatrix} + \begin{bmatrix} \frac{\partial u_1}{\partial x_1} & \frac{\partial u_2}{\partial x_1} & \cdots & \frac{\partial u_m}{\partial x_1} \\ \frac{\partial u_1}{\partial x_2} & \frac{\partial u_2}{\partial x_2} & \cdots & \frac{\partial u_m}{\partial x_2} \\ \vdots & & & \vdots \\ \frac{\partial u_1}{\partial x_n} & \frac{\partial u_2}{\partial x_n} & \cdots & \frac{\partial u_m}{\partial x_n} \end{bmatrix} \cdot 
\begin{bmatrix} v_1 \\ v_2 \\ \vdots \\ v_m \end{bmatrix} \\ & & \hspace{4em} n \times m \hspace{5em} m \times 1 \hspace{5em} n \times m \hspace{5em} m \times 1 & \\ \\ \dfrac{\partial}{\partial X} U^{T} V & = & \dfrac{\partial V}{\partial X} U + \dfrac{\partial U}{\partial X} V \ftag{MC.6} & \fref{MC.3} \end{array} \] Phew! The result in \(\ifref{MC.6}\) shows exactly what you would expect. It demonstrates that the product rule generalizes from univariate calculus to matrix calculus. \[ \dfrac{\partial}{\partial X} U^{T}V = \dfrac{\partial V}{\partial X} U + \dfrac{\partial U}{\partial X} V \leftrightarrow \dfrac{d}{dx} uv =\dfrac{dv}{dx} u + \dfrac{du}{dx} v \]

Just like with the univariate product rule, \(\ifref{MC.6}\) allows us to derive other, more specific results. For example, consider the case when \(U = V = Y\). Then, by the product rule, we have: \[ \begin{array}{rccl} \dfrac{\partial}{\partial X} Y^{T} Y & = & \dfrac{\partial Y}{\partial X} Y + \dfrac{\partial Y}{\partial X} Y & \fref{MC.6} \\ & = & 2 \dfrac{\partial Y}{\partial X} Y \ftag{MC.7} & \end{array} \] Furthermore, if we let \(Y = X\), then we arrive at: \[ \begin{array}{rccl} \dfrac{\partial}{\partial X} X^{T} X & = & 2 \dfrac{\partial X}{\partial X} X & \fref{MC.7} \\ & = & 2 I_n X & \fref{MC.4} \\ & = & 2 X \end{array} \] This result should be very reminiscent of the classical derivative of a quadratic function: \[ \dfrac{\partial}{\partial X} X^{T}X = 2 X \leftrightarrow \dfrac{d}{dx} x^2 = 2x \]
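
Reusing the illustrative `num.grad` helper from the Definitions subsection, we can confirm \(\ifref{MC.5}\) and the quadratic result above numerically; the vectors `A` and `X0` are again arbitrary:

# Check MC.5 (gradient of A^T X is A) and d/dX (X^T X) = 2X numerically
# (`A` and `X0` are arbitrary illustrative vectors; `num.grad` is defined above)
A  <- c(2, -1, 4);
X0 <- c(1, 3, -2);

all.equal(num.grad(function(X) sum(A * X), X0), A);       # should be TRUE
all.equal(num.grad(function(X) sum(X * X), X0), 2 * X0);  # should be TRUE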

Matrix Operations

Trace

We define the trace of a matrix as the sum of all its diagonal elements. In particular, if we have an \(n \times n\) matrix \(A\) with elements \(a_{i j}\), then its trace is: \[ tr(A) = \sum\limits_{i = 1}^{n} a_{i i} = a_{1 1} + a_{2 2} + \cdots + a_{n n} \ftag{MO.1} \]

Visually, this looks like: \[ \require{enclose} tr(A) = tr\left( \begin{bmatrix} \enclose{circle}{a_{1 1}} & a_{1 2} & \cdots & a_{1 n} \\ a_{2 1} & \enclose{circle}{a_{2 2}} & \cdots & a_{2 n} \\ \vdots & \vdots & & \vdots \\ a_{n 1} & a_{n 2} & \cdots & \enclose{circle}{a_{n n}} \end{bmatrix} \right) = \sum\limits_{i = 1}^{n} a_{i i} \]

Note that, since rectangular matrices do not have proper diagonals, we can only define the \(tr(\Box)\) function on square matrices. This point will prove important when we consider the trace of products below.

Trace of a Sum:
If we have a sequence of \(K\) \(n \times n\) matrices \(A_k\), then the trace of the sum of this sequence is just the sum of the traces. This result comes from the fact that matrix addition happens element-wise. Thus: \[ tr\left( \sum\limits_{k = 1}^{K} A_k \right) = \sum\limits_{k = 1}^{K} tr(A_k) \ftag{MO.2} \]
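
In R, the trace is simply `sum(diag(A))`. Here’s a minimal check of \(\ifref{MO.1}\) and \(\ifref{MO.2}\); the helper `tr` and the \(4 \times 4\) random matrices are arbitrary illustrative choices:

# Trace as the sum of diagonal elements (MO.1) and trace of a sum (MO.2)
# (`tr`, `A1`, and `A2` are illustrative; any equal-sized square matrices work)
tr <- function(M) sum(diag(M));

A1 <- matrix(rnorm(16), nrow = 4);
A2 <- matrix(rnorm(16), nrow = 4);

all.equal(tr(A1 + A2), tr(A1) + tr(A2));  # should be TRUE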

Trace of a Product:
The trace of a product has a much more complicated structure than the trace of a sum (\(\ifref{MO.2}\)). Consider two matrices \(X\) and \(Y\), where \(X\) has dimensions \(n \times m\) and \(Y\) has dimensions \(m \times n\), and their product \(X Y = Z\): \[ \begin{array}{cccc} X & Y & = & Z \\ n \times m & m \times n & & n \times n \end{array} \]

Since \(Z\) is an \(n \times n\) matrix, we can ask about its trace \(tr(Z) = tr(X Y)\). How do we calculate this quantity? Note that we can’t take the traces of \(X\) and \(Y\) separately because neither of them are square. Instead, we have to go straight to the definition of matrix multiplication to arrive at an answer. Recall that, when we multiply two matrices \(X\) and \(Y\), the output is another matrix: \[ \begin{array}{cccc} X & Y & & Z \\ \begin{bmatrix} x_{1 1} & x_{1 2} & \cdots & x_{1 m} \\ x_{2 1} & x_{2 2} & \cdots & x_{2 m} \\ \vdots & \vdots & & \vdots \\ x_{n 1} & x_{n 2} & \cdots & x_{n m} \end{bmatrix} & \begin{bmatrix} y_{1 1} & y_{1 2} & \cdots & y_{1 n} \\ y_{2 1} & y_{2 2} & \cdots & y_{2 n} \\ \vdots & \vdots & & \vdots \\ y_{m 1} & y_{m 2} & \cdots & y_{m n} \end{bmatrix} & = & \begin{bmatrix} z_{1 1} & z_{1 2} & \cdots & z_{1 n} \\ z_{2 1} & z_{2 2} & \cdots & z_{2 n} \\ \vdots & \vdots & & \vdots \\ z_{n 1} & z_{n 2} & \cdots & z_{n n} \end{bmatrix} \\ n \times m & m \times n & & n \times n \end{array} \]

Here, each \(z_{i j}\) is defined as: \[ z_{i j} = \sum\limits_{k = 1}^{m} x_{i k} \cdot y_{k j} \]

In effect, to get \(Z\), we sum over the middle dimension of size \(m\) and, as a result, \(Z\) ends up with \(n \cdot n\) elements. If we take the above result and combine it with \(\ifref{MO.1}\), we arrive at: \[ \begin{array}{rcll} tr(X Y) & = & \displaystyle\sum\limits_{i = 1}^{n} z_{i i} & \fref{MO.1} \\ tr(X Y) & = & \displaystyle\sum\limits_{i = 1}^{n} \sum\limits_{k = 1}^{m} x_{i k} \cdot y_{k i} \ftag{MO.3} & \end{array} \]

This expression allows us to calculate the trace of a product directly. However, the real power of \(\ifref{MO.3}\) comes from noticing what occurs when we exchange summations: \[ \begin{array}{rcll} tr(X Y) & = & \displaystyle\sum\limits_{i = 1}^{n} \sum\limits_{k = 1}^{m} x_{i k} \cdot y_{k i} & \fref{MO.3} \\ & = & \displaystyle\sum\limits_{k = 1}^{m} \sum\limits_{i = 1}^{n} x_{i k} \cdot y_{k i} & \\ & = & \displaystyle\sum\limits_{k = 1}^{m} \sum\limits_{i = 1}^{n} y_{k i} \cdot x_{i k} & \\ tr(X Y) & = & tr(Y X) \ftag{MO.4} \end{array} \]

This shows that, when calculating the trace of a product, we are allowed to interchange the two terms. Notice that this is not at all obvious since \(X Y\) is an \(n \times n\) matrix while \(Y X\) is an \(m \times m\) matrix.
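
A quick numerical check of \(\ifref{MO.4}\) using the `tr` helper from above: even though \(X Y\) is \(n \times n\) and \(Y X\) is \(m \times m\), their traces agree. The dimensions and entries below are arbitrary:

# Check MO.4: tr(XY) = tr(YX) for rectangular X (n x m) and Y (m x n)
# (dimensions and entries are arbitrary; `tr` is the helper defined above)
n <- 3; m <- 5;
X <- matrix(rnorm(n * m), nrow = n, ncol = m);
Y <- matrix(rnorm(m * n), nrow = m, ncol = n);

all.equal(tr(X %*% Y), tr(Y %*% X));  # should be TRUE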

Does \(\ifref{MO.4}\) generalize to a product of more than two matrices? Consider, for example, the following three matrices and their product:

  • \(A\), an \(n \times p\) matrix with elements \(a_{i j}\)
  • \(B\), a \(p \times q\) matrix with elements \(b_{i j}\)
  • \(C\), a \(q \times n\) matrix with elements \(c_{i j}\)
  • \(D = A B C\) with matrix elements \(d_{i j}\)

As before, we defined the three matrices above so that their product \(A B C = D\) will be a square \(n \times n\) matrix: \[ \begin{array}{ccccc} A & B & C & = & D \\ n \times p & p \times q & q \times n & & n \times n \end{array} \]

Notice that we immediately arrive at the conclusion that \(\ifref{MO.4}\) does not generalize in a pairwise sense. For example, if we interchange \(A\) and \(B\) in our product then it is not true that \(tr(A B C) = tr(B A C)\) for the simple reason that \(B A C\) is undefined: \[ \begin{array}{ccccc} B & A & C & = & ??? \\ p \times q & n \times p & q \times n & & ??? \end{array} \]

Once again, we go back to the definition of matrix multiplication to get the elements of \(A B C = D\): \[ d_{i j} = \sum\limits_{k = 1}^{q} \left( \sum\limits_{l = 1}^{p} a_{i l} \cdot b_{l k} \right) \cdot c_{k j} = \sum\limits_{k = 1}^{q} \sum\limits_{l = 1}^{p} a_{i l} \cdot b_{l k} \cdot c_{k j} \]

Here, notice how we sum over the inside dimensions \(q \leftrightarrow k\) and \(p \leftrightarrow l\) to leave just the final \(n \cdot n\) terms. To get \(tr(D)\) we pile on another summation, this time over \(n\): \[ \begin{array}{rcll} tr(D) & = & \displaystyle\sum\limits_{i = 1}^{n} d_{i i} & \fref{MO.1} \\ tr(D) & = & \displaystyle\sum\limits_{i = 1}^{n} \sum\limits_{k = 1}^{q} \sum\limits_{l = 1}^{p} a_{i l} \cdot b_{l k} \cdot c_{k i} \end{array} \]

Can we start interchanging summations here as well? Yes, but we have to be careful. The value of the triple sum doesn’t depend on the order of summation, but only some rearrangements can be read back as the trace of a valid matrix product. Rather than swapping factors arbitrarily, we must cycle them so that adjacent factors always share a matching inner dimension: \[ \begin{array}{rcccl} tr(D) & = & \displaystyle\sum\limits_{i = 1}^{n} \sum\limits_{k = 1}^{q} \sum\limits_{l = 1}^{p} a_{i l} \cdot b_{l k} \cdot c_{k i} & = & tr(A B C) \\ tr(D) & = & \displaystyle\sum\limits_{k = 1}^{q} \sum\limits_{l = 1}^{p} \sum\limits_{i = 1}^{n} b_{l k} \cdot c_{k i} \cdot a_{i l} & = & tr(B C A) \\ tr(D) & = & \displaystyle\sum\limits_{l = 1}^{p} \sum\limits_{i = 1}^{n} \sum\limits_{k = 1}^{q} c_{k i} \cdot a_{i l} \cdot b_{l k} & = & tr(C A B) \end{array} \]

The above example with matrices \(A\), \(B\), and \(C\) shows that the trace operator is invariant only under cyclic permutations. In general, if we have a sequence of \(K\) matrices \(A_k\) such that \(A_1 A_2 \cdots A_K\) is defined, we can write: \[ tr(A_1 A_2 \cdots A_K) = tr(A_2 A_3 \cdots A_K A_1) = tr(A_3 A_4 \cdots A_K A_1 A_2) = \dots = tr(A_K A_1 \cdots A_{K-1}) \ftag{MO.5} \]
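
The same kind of check works for \(\ifref{MO.5}\). With three rectangular matrices whose product is square (dimensions arbitrary), every cyclic rearrangement gives the same trace, while a non-cyclic swap such as \(B A C\) isn’t even a valid product:

# Check MO.5: the trace is invariant under cyclic permutations of the product
# (dimensions n, p, q and the entries are arbitrary; `tr` is defined above)
n <- 3; p <- 4; q <- 2;
A <- matrix(rnorm(n * p), nrow = n, ncol = p);
B <- matrix(rnorm(p * q), nrow = p, ncol = q);
C <- matrix(rnorm(q * n), nrow = q, ncol = n);

c(tr(A %*% B %*% C), tr(B %*% C %*% A), tr(C %*% A %*% B));  # all three agree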

Variance

This section covers the basics of variance and covariance. I generally assume that the reader already knows how to calculate variance/covariance for random variables and mostly focus on the variance/covariance of random vectors. In particular, we cover some basic properties of the variance-covariance matrix and go into properties of cross-covariance as well.

Throughout this section, we’ll use \(x\) and \(y\) to represent scalar random variables. \(X\) will represent an \(n\)-dimensional random column vector and \(Y\) will represent an \(m\)-dimensional random column vector: \[ X = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix} \text{ and } Y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_m \end{bmatrix} \]

We’ll use \(a\) to represent a constant scalar and \(A\) to represent a constant \(n\)-dimensional vector.

Definitions

Variance/Covariance for Random Variables:
Elementary statistics tells us that, if we have two random variables \(x\) and \(y\), we can calculate their covariance as follows: \[ Cov(x,y) = \mathbb{E}\left( (x - \mathbb{E}(x)) (y - \mathbb{E}(y)) \right) \ftag{V.1} \]

Intuitively, this quantity measures the dependence between \(x\) and \(y\) by looking at how far each random variable is from its mean and comparing those deviations.

  • If \(x\) lies above its mean at the same time that \(y\) lies above its mean, and if \(x\) lies below its mean at the same time \(y\) lies below its mean, then the two random variables have a positive covariance.
  • If \(x\) lies above its mean at the same time that \(y\) lies below its mean (and vice-versa), \(\ifref{V.1}\) will come out negative.
  • Uncorrelated random variables will have an approximately even mixture of the two bullet points above.
# Define number of datapoints
N <- 1000;

# Define positive, negative, and zero covariance trends
x <- rnorm(N, mean = 0, sd = 10);
e <- rnorm(N, mean = 0, sd = 1);

y.p <- x + 5 + e;
y.n <- -x + 5 + e;
y.z <- 5 + 5*e;

# Combine data into one object
cov.data <- rbind(
  data.frame('x' = x, 'y' = y.p, 'type' = 'Positive Covariance'),
  data.frame('x' = x, 'y' = y.n, 'type' = 'Negative Covariance'),
  data.frame('x' = x, 'y' = y.z, 'type' = 'No Covariance')
);

# Graph the results
ggplot(data = cov.data) + 
  geom_point(mapping = aes(x, y), 
             alpha = 0.2, 
             color = brewer.pal(3, 'Set2')[2]) + 
  facet_grid(. ~ type);

Often, we are not interested in the relationship between two random variables, but simply want to know more about a single one. In particular, we might be interested in how far a random variable spreads away from its mean. We call this quantity the variance and calculate it as the covariance between a random variable and itself: \[ \begin{array}{rcll} Var(x) & = & Cov(x, x) & \fref{V.1} \\ & = & \mathbb{E} \left( (x - \mathbb{E}(x)) (x - \mathbb{E}(x)) \right) & \\ Var(x) & = & \mathbb{E} \left( (x - \mathbb{E}(x))^2 \right) \ftag{V.2} \end{array} \]

For calculation purposes, we can derive a more convenient formula for covariance by multiplying out the inner terms: \[ \begin{array}{rcll} Cov(x, y) & = & \mathbb{E} \left( (x - \mathbb{E}(x))(y - \mathbb{E}(y)) \right) & \fref{V.1} \\ & = & \mathbb{E} \left( xy - x\mathbb{E}(y) - \mathbb{E}(x)y + \mathbb{E}(x)\mathbb{E}(y) \right) & \\ & = & \mathbb{E}(xy) - \mathbb{E}(x\mathbb{E}(y)) - \mathbb{E}(\mathbb{E}(x)y) + \mathbb{E}(\mathbb{E}(x)\mathbb{E}(y)) & \\ & = & \mathbb{E}(xy) - \mathbb{E}(x)\mathbb{E}(y) - \mathbb{E}(x)\mathbb{E}(y) + \mathbb{E}(x)\mathbb{E}(y) & \\ Cov(x, y) & = & \mathbb{E}(xy) - \mathbb{E}(x)\mathbb{E}(y) \ftag{V.3} & \end{array} \]
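
We can check \(\ifref{V.3}\) empirically on a large simulated sample. The simulated `x.sim` and `y.sim` below are arbitrary; note that R’s `cov()` uses the \(N - 1\) sample denominator, so the agreement is only approximate:

# Empirical check of V.3: Cov(x, y) = E(xy) - E(x) E(y)
# (simulated data; cov() uses the N - 1 denominator, so agreement is approximate)
x.sim <- rnorm(1e5);
y.sim <- 2 * x.sim + rnorm(1e5);

mean(x.sim * y.sim) - mean(x.sim) * mean(y.sim);  # plug-in form of V.3
cov(x.sim, y.sim);                                # sample covariance, roughly the same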

Variance/Covariance for Random Vectors:
In reality, we work more often with random vectors as opposed to just one or two random variables. This creates complexity because, even for a single random vector, we now have to calculate \(Var(\Box)\) and \(Cov(\Box,\Box)\) terms. For example, if \(X\) is a random \(n\)-dimensional column vector, then we can potentially end up calculating \(n^2\) terms: \[ X = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix} \implies \text{ There are } n^2 \text{ } Cov(x_i, x_j) \text{ combinations.} \]

To help us cope with this increasing complexity, we generalize the idea of variance and covariance through the variance-covariance matrix. Once again, if \(X\) is a random \(n\)-dimensional column vector then we define: \[ \begin{array}{rccl} Var(X) & = & \begin{bmatrix} Var(x_1) & Cov(x_1, x_2) & \cdots & Cov(x_1, x_n) \\ Cov(x_2, x_1) & Var(x_2) & \cdots & Cov(x_2, x_n) \\ \vdots & & & \vdots \\ Cov(x_n, x_1) & Cov(x_n, x_2) & \cdots & Var(x_n) \end{bmatrix} & \ftag{V.4} \\ & & n \times n & \end{array} \]

This definition simply orders all \(n^2\) variance and covariance terms mentioned above into a neat format. Fortunately, we actually don’t need to use \(\ifref{V.4}\) very often. The variance-covariance matrix has a familiar and more compact form that utilizes matrix multiplication: \[ Var(X) = \mathbb{E}((X-\mathbb{E}(X))\cdot(X - \mathbb{E}(X))^{T}) \ftag{V.5} \]

The proof that \(\ifref{V.4}\) and \(\ifref{V.5}\) are identical follows straight from definitions; we just expand \(\ifref{V.5}\) into vector form and multiply everything out. To keep things from blowing up and going off the page, let’s define some shorthands first: \[ r_i = x_i - \mathbb{E}(x_i) \]

This allows us to rewrite covariance \(\ifref{V.1}\) and variance \(\ifref{V.2}\) for a single random variable as: \[ \begin{array}{rcl} Cov(x_i, x_j) & = & \mathbb{E}(r_i \cdot r_j) \\ Var(x_i) & = & \mathbb{E}(r_i^2) \end{array} \]

Armed with this, we can now expand \(\ifref{V.5}\) and see what we get: \[ \begin{array}{rcll} Var(X) & = & \mathbb{E}((X-\mathbb{E}(X))\cdot(X - \mathbb{E}(X))^{T}) & \fref{V.5} \\ & = & \mathbb{E} \left( \left( \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix} - \mathbb{E}\begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix} \right) \cdot \left( \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix} - \mathbb{E}\begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix} \right)^{T} \right) \\ & = & \mathbb{E} \left( \begin{bmatrix} x_1 - \mathbb{E}(x_1) \\ x_2 - \mathbb{E}(x_2) \\ \vdots \\ x_n - \mathbb{E}(x_n) \end{bmatrix} \cdot \begin{bmatrix} x_1 - \mathbb{E}(x_1) & x_2 - \mathbb{E}(x_2) & \cdots & x_n - \mathbb{E}(x_n) \end{bmatrix} \right) \\ & = & \mathbb{E}\left( \begin{bmatrix} r_1 \\ r_2 \\ \vdots \\ r_n \end{bmatrix} \cdot \begin{bmatrix} r_1 & r_2 & \cdots & r_n \end{bmatrix} \right) \\ & = & \mathbb{E}\left( \begin{bmatrix} r_1^2 & r_1 \cdot r_2 & \cdots & r_1 \cdot r_n \\ r_2 \cdot r_1 & r_2^2 & \cdots & r_2 \cdot r_n \\ \vdots & & & \vdots \\ r_n \cdot r_1 & r_2 \cdot r_n & \cdots & r_n^2 \end{bmatrix} \right) \\ & = & \begin{bmatrix} \mathbb{E}(r_1^2) & \mathbb{E}(r_1 \cdot r_2) & \cdots & \mathbb{E}(r_1 \cdot r_n) \\ \mathbb{E}(r_2 \cdot r_1) & \mathbb{E}(r_2^2) & \cdots & \mathbb{E}(r_2 \cdot r_n) \\ \vdots & & & \vdots \\ \mathbb{E}(r_n \cdot r_1) & \mathbb{E}(r_2 \cdot r_n) & \cdots & \mathbb{E}(r_n^2) \end{bmatrix} \\ Var(X) & = & \begin{bmatrix} Var(x_1) & Cov(x_1, x_2) & \cdots & Cov(x_1, x_n) \\ Cov(x_2, x_1) & Var(x_2) & \cdots & Cov(x_2, x_n) \\ \vdots & & & \vdots \\ Cov(x_n, x_1) & Cov(x_n, x_2) & \cdots & Var(x_n) \end{bmatrix} \end{array} \]
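
Numerically, \(\ifref{V.5}\) says that the variance-covariance matrix is the expected outer product of the centered vector with itself. A small simulation compares that outer-product form against R’s `cov()`; the 3-dimensional distribution below is an arbitrary choice:

# Empirical check of V.5: Var(X) = E((X - E(X)) (X - E(X))^T)
# (each row of `samples` is one draw of an arbitrary 3-dimensional random vector)
N.sim <- 1e4;
z <- rnorm(N.sim);
samples <- cbind(z + rnorm(N.sim), -z + rnorm(N.sim), rnorm(N.sim));

centered <- scale(samples, center = TRUE, scale = FALSE);
t(centered) %*% centered / N.sim;  # outer-product form of V.5
cov(samples);                      # sample version (N - 1 denominator), roughly the same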

The idea of covariance for random vectors generalizes similarly to that of variance. If \(X\) is an \(n\)-dimensional random column vector and \(Y\) is an \(m\)-dimensional random column vector, then we define the cross-covariance matrix as: \[ \begin{array}{rcc} Cov(X, Y) & = & \begin{bmatrix} Cov(x_1, y_1) & Cov(x_1, y_2) & \cdots & Cov(x_1, y_m) \\ Cov(x_2, y_1) & Cov(x_2, y_2) & \cdots & Cov(x_2, y_m) \\ \vdots & & & \vdots \\ Cov(x_n, y_1) & Cov(x_n, y_2) & \cdots & Cov(x_n, y_m) \end{bmatrix} \\ & & n \times m \\ \\ Cov(X, Y) & = & \mathbb{E}((X - \mathbb{E}(X)) \cdot (Y - \mathbb{E}(Y))^{T}) \ftag{V.6} \end{array} \]

The proof that the compressed and expanded forms of cross-covariance are equivalent follows the exact same structure as the proof for variance-covariance above, so we won’t repeat it.

Note that, just as in \(\ifref{V.2}\), we can rewrite the variance-covariance matrix of a random column vector \(X\) as the cross-covariance of \(X\) with itself: \[ \begin{array}{rcll} Var(X) & = & \mathbb{E}((X-\mathbb{E}(X))\cdot(X - \mathbb{E}(X))^{T}) & \fref{V.5} \\ Var(X) & = & Cov(X, X) \ftag{V.7} & \fref{V.6} \end{array} \]

Next, we prove some basic properties of the variance-covariance matrix that will be useful in the problems below.

Basic Properties

Overview:
In this subsection, we’ll prove the following results:

  • \(Var(A) = Cov(A, A) = Cov(X, A) = Cov(A, X) = 0\)
  • \(Var(X)^{T} = Var(X)\)
  • \(Var(A^{T}X) = A^{T} Var(X) A\)
  • \(Var(X + A) = Var(X)\)

The first bullet point states that, if a constant vector is involved in calculating the cross-covariance matrix or the variance-covariance matrix, then the result will be the zero matrix.
The second bullet point states that the variance-covariance matrix is symmetric.
The final two bullet points come from the bilinearity of cross-covariance.

Variation in a Constant:
Intuitively, constants like \(A\) don’t vary so their variance should be \(0\). This becomes clear when you realize that the expected value of any constant is itself so that \(\mathbb{E}(A) = A\). This allows us to write: \[ \begin{array}{rcll} Var(A) & = & Cov(A, A) & \fref{V.7} \\ & = & \mathbb{E}(\;(A - \mathbb{E}(A)) \;\cdot\; (A - \mathbb{E}(A))^{T}\;) & \fref{V.6} \\ & = & \mathbb{E}(\;(A - A) \;\cdot\; (A - A)^{T}\;) & \\ & = & \mathbb{E}(\;0 \;\cdot\; 0^{T}\;) & \\ Var(A) & = & 0 \ftag{V.8} \end{array} \]

The other variants like \(Cov(X, A)\) and \(Cov(A, X)\) have almost identical proofs, so we’ll omit them.

Symmetry:
The symmetry property is easy to prove. We know that \(Var(X) = Var(X)^{T}\) will be true if we can show that \(Cov(x, y) = Cov(y, x)\), i.e., that the covariance of two random variables is symmetric. But, according to our definition \(\ifref{V.1}\): \[ \begin{array}{rcll} Cov(x, y) & = & \mathbb{E}\left( (x - \mathbb{E}(x)) (y - \mathbb{E}(y)) \right) & \fref{V.1} \\ & = & \mathbb{E}\left( (y - \mathbb{E}(y)) (x - \mathbb{E}(x)) \right) \\ Cov(x, y) & = & Cov(y, x) & \fref{V.1} \end{array} \]

Thus, if we transpose the matrix \(Var(X)\), we still get the same matrix. \[ Var(X)^{T} = Var(X) \ftag{V.9} \]

Bilinearity:
Next, we’ll look at bilinearity of \(Cov(X, Y)\). To start, let’s quickly review the difference between linearity and bilinearity. We say a function \(f\) that takes a single vector parameter is linear if: \[ \begin{array}{rcl} f(A^{T}X) & = & A^{T} f(X) \\ & \text{ and } & \\ f(X + Y) & = & f(X) + f(Y) \end{array} \]

Matrix multiplication is a common example of a linear function. As the name implies, bilinearity applies to functions \(f\) that take two parameters instead of one, like \(Cov(\Box, \Box)\). We say that a function \(f\) that takes two vector parameters is bilinear if: \[ \begin{array}{rcl} f(A^{T}X, Y) & = & A^{T} f(X, Y) \\ f(X, A^{T}Y) & = & f(X, Y) A \\ & \text{ and } & \\ f(X_1 + X_2, Y) & = & f(X_1, Y) + f(X_2, Y) \\ f(X, Y_1 + Y_2) & = & f(X, Y_1) + f(X, Y_2) \end{array} \]

In essence, bilinearity simply means linear in both inputs independently.

We now prove that \(Cov(X, Y)\) is a bilinear function. To do this, we prove that:

  • \(Cov(A^{T}X, A^{T}Y) = A^{T}Cov(X, Y)A\). This is basically a combination of the first two lines above. Note that \(A\) is an \(n\)-dimensional column vector of constants, but our proof could also work with any compatible matrix.
  • \(Cov(X_1 + X_2, Y_1 + Y_2) = Cov(X_1, Y_1) + Cov(X_1, Y_2) + Cov(X_2, Y_1) + Cov(X_2, Y_2)\). This is a combination of the second two bilinearity conditions.

First bilinearity condition: \[ \begin{array}{rcll} Cov(A^{T}X, A^{T}Y) & = & \mathbb{E}((A^{T}X - \mathbb{E}(A^{T}X)) \cdot (A^{T}Y - \mathbb{E}(A^{T}Y))^{T}) & \fref{V.6} \\ & = & \mathbb{E}((\;A^{T}(X - \mathbb{E}(X))\;) \cdot (\; A^{T}(Y - \mathbb{E}(Y)) \;)^{T}) & \\ & = & \mathbb{E}(A^{T}(X - \mathbb{E}(X)) \cdot (Y - \mathbb{E}(Y))^{T}A) & \\ & = & A^{T} \mathbb{E}((X - \mathbb{E}(X)) \cdot (Y - \mathbb{E}(Y))^{T})A \\ Cov(A^{T}X, A^{T}Y) & = & A^{T} Cov(X, Y) A \ftag{V.10} & \fref{V.6} \end{array} \]

Using \(\ifref{V.7}\) and \(\ifref{V.10}\), we can derive a useful corollary involving the variance-covariance matrix. We simply substitute: \[ \begin{array}{rcl} Var(A^{T}X) & = & Cov(A^{T}X, A^{T}X) & \fref{V.7} \\ & = & A^{T} Cov(X, X) A & \fref{V.10} \\ Var(A^{T}X) & = & A^T{} Var(X) A \ftag{V.11} & \fref{V.7} \end{array} \]
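
Here’s a small Monte Carlo check of \(\ifref{V.11}\), reusing the simulated `samples` matrix from the sketch above together with an arbitrary constant vector `A`. The sample variance of the projection \(A^{T}X\) should come out close to \(A^{T} Var(X) A\):

# Monte Carlo check of V.11: Var(A^T X) = A^T Var(X) A
# (reuses the `samples` matrix from the V.5 sketch; `A` is an arbitrary constant vector)
A <- c(1, 2, -1);

var(samples %*% A);           # sample variance of the scalar A^T X
t(A) %*% cov(samples) %*% A;  # A^T Var(X) A, approximately the same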

Next, we prove the second set of bilinearity conditions. To accomplish this, we first define a few shorthands: \[ R_i = X_i - \mathbb{E}(X_i) \\ S_i = Y_i - \mathbb{E}(Y_i) \]

These definitions allow us to rewrite the definition of cross-covariance \(\ifref{V.6}\) as: \[ \begin{array}{rcll} Cov(X, Y) & = & \mathbb{E}(\;(X - \mathbb{E}(X)) \;\cdot\; (Y - \mathbb{E}(Y))^{T}\;) & \fref{V.6} \\ Cov(X, Y) & = & \mathbb{E}(\;R \;\cdot\; S^{T}\;) \end{array} \]

Armed with this, we now go into the proof: \[ \begin{array}{rcll} Cov(X_1 + X_2, Y_1 + Y_2) & = & \mathbb{E}(\;(X_1 + X_2 - \mathbb{E}(X_1 + X_2)) \;\cdot\; (Y_1 + Y_2 - \mathbb{E}(Y_1 + Y_2))^{T}\;) & \fref{V.6} \\ & = & \mathbb{E}(\;(X_1 - \mathbb{E}(X_1) + X_2 - \mathbb{E}(X_2)) \;\cdot\; (Y_1 - \mathbb{E}(Y_1) + Y_2 - \mathbb{E}(Y_2))^{T}\;) & \\ & = & \mathbb{E}(\;(R_1 + R_2) \;\cdot\; (S_1 + S_2)^{T}\;) & \\ & = & \mathbb{E}(\;(R_1 + R_2) \;\cdot\; (S_1^{T} + S_2^{T})\;) & \\ & = & \mathbb{E}(\; R_1 S_1^{T} + R_1 S_2^{T} + R_2 S_1^{T} + R_2 S_2^{T} \;) & \\ & = & \mathbb{E}(\; R_1 S_1^{T} \;) + \mathbb{E}(\; R_1 S_2^{T} \;) + \mathbb{E}(\; R_2 S_1^{T} \;) + \mathbb{E}(\; R_2 S_2^{T} \;) & \\ Cov(X_1 + X_2, Y_1 + Y_2) & = & Cov(X_1, Y_1) + Cov(X_1, Y_2) + Cov(X_2, Y_1) + Cov(X_2, Y_2) \ftag{V.12} & \end{array} \]

Just as with the first bilinearity condition, we can use \(\ifref{V.7}\) and \(\ifref{V.12}\) to find out more about the variance-covariance matrix. Note that we can’t, in general, look at \(Var(X + Y)\) because \(X\) is an \(n\)-dimensional column vector while \(Y\) is an \(m\)-dimensional column vector, but we can look at \(Var(X + A)\): \[ \begin{array}{rcll} Var(X + A) & = & Cov(X + A, X + A) & \fref{V.7} \\ & = & Cov(X, X) + Cov(X, A) + Cov(A, X) + Cov(A, A) & \fref{V.12} \\ & = & Cov(X, X) + 0 + 0 + 0 & \fref{V.8} \\ Var(X + A) & = & Var(X) \ftag{V.13} & \end{array} \]

Ordinary Least Squares

This tab includes proofs for Ordinary Least Squares (OLS) regression that the authors of Elements of Statistical Learning didn’t elaborate on. Since most of this material (except for the proofs) can be found by reading the text, I don’t go as in-depth as with other sections.

Besides the main text, the following links should be useful in helping you understand OLS:

Linear Model

The linear regression model assumes that our response variable \(y\) is a linear function of our independent variable \(X\), so that \(y = X\beta\) for some unknown constant \(\beta\). Of course, since we want a statistical model, we have to acknowledge that \(X\) cannot perfectly determine \(y\). To account for this we add a noise term, a random variable called \(\epsilon\), to the model. This term accounts for variation in \(y\) outside of the predictions made by \(X\). Thus, the full model is: \[ y = X \beta + \epsilon \ftag{OLS.1} \]

Let’s get into the specifics of each term in \(\ifref{OLS.1}\). We’ll briefly consider two cases, when we have only one datapoint, and when we have \(N\) datapoints.

Single Datapoint:
In the case that we only have a single datapoint, \(\ifref{OLS.1}\) is a scalar equation. \(X\) ends up as a \(p + 1\) dimensional row vector that corresponds to a scalar output \(y\) and, to make the dimensions work out, our constant \(\beta\) ends up as a \(p + 1\) dimensional column vector: \[ \beta = \begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_p \end{bmatrix} \ftag{OLS.2} \] Our noise term \(\epsilon\) is a single random variable with zero mean and constant variance. Intuitively, a mean of zero implies that \(X\) doesn’t consistently mispredict \(y\). A constant variance, known as homoscedasticity, implies that the noise doesn’t grow or shrink in response to \(X\). \[ \mathbb{E}(\epsilon) = 0 \text{ and } Var(\epsilon) = \sigma^2 \ftag{OLS.3} \] Overall, for a single datapoint, our model looks as follows: \[ \begin{array}{rccccc} & & X & & \beta & \\ y & = & \begin{bmatrix} 1 & x_1 & x_2 & \cdots & x_p \end{bmatrix} & \cdot & \begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_p \end{bmatrix} & + & \epsilon \\ & & 1 \times (p + 1) & & (p + 1) \times 1 & \end{array} \]

Multiple Datapoints:
We only need a few small changes when moving our model from a single datapoint to multiple datapoints. In particular, we must change our independent variable \(X\) into an \(N \times (p + 1)\) matrix where \(N\) represents the number of datapoints in our model and \(p\) represents the number of features we are using to predict \(y\). Just as before, \(X\) has \(p + 1\) columns (instead of \(p\)) to account for the single \(y\)-intercept term. The variable \(y\) changes to an \(N\) dimensional vector that holds the response of the model for every one of the \(N\) datapoints.

Similarly, the noise term \(\epsilon\) must also transform to an \(N\) dimensional vector. In this case it consists of identically distributed and independent random variables with zero mean and constant variance: \[ \begin{array}{cl} \epsilon = \begin{bmatrix} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_N \end{bmatrix} & \\ \mathbb{E}(\epsilon) = 0 \text{ and } Var(\epsilon) = \sigma^2 I & \ftag{OLS.4} \end{array} \] Since all \(\epsilon_i\) terms are identically distributed and independent, we often refer to \(\epsilon\) as a single variable instead of a vector. For example, if all \(\epsilon_i \sim N(0,1)\), then we’ll usually just say \(\epsilon \sim N(0,1)\), like in \(\ifref{OLS.3}\) and omit the indexes.

We can put everything together to get: \[ \begin{array}{cccccccl} y & & X & & \beta & & \epsilon & \\ \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{bmatrix} & = & \begin{bmatrix} 1 & x_{1,1} & \cdots & x_{1,p} \\ 1 & x_{2,1} & \cdots & x_{2,p} \\ \vdots & & & \vdots \\ 1 & x_{N,1} & \cdots & x_{N,p} \end{bmatrix} & \cdot & \begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_p \end{bmatrix} & + & \begin{bmatrix} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_N \end{bmatrix} & \ftag{OLS.5} \\ N \times 1 & & N \times (p + 1) & & (p + 1) \times 1 & & N \times 1 & \end{array} \]

The matrix equation above shows how each datapoint \(1 \leq i \leq N\) is represented as a row in the output: \[ y_i = 1\cdot\beta_0 + x_{i,1}\beta_1 + \cdots + x_{i,p}\beta_p + \epsilon_i \]

# Number of sample points and features to use
N <- 5000;
p <- 1;

# Define slope and intercept
beta <- matrix(
  c(6, 2),
  nrow = 2,
  ncol = 1
);

# Define noise
sigma <- 2;

# Function for generating 2D linear model plots
plot.2D.lin <- function(N, p, beta, sigma, opacity = 1) {
  epsilon <- matrix(
    rnorm(N, mean = 0, sd = sigma),
    nrow = N,
    ncol = 1
  );
  
  x.sim <- rnorm(N, mean = 5, sd = sqrt(10));
  X <- matrix(
    c(rep.int(1, N), x.sim),
    nrow = N,
    ncol = p + 1
  );
  y <- X %*% beta + epsilon;
  
  data <- data.frame('x' = x.sim, 'y' = y);
  
  # Plot
  plot <- ggplot() + 
    geom_point(data = data, 
               mapping = aes(x, y), 
               alpha = opacity, 
               color = brewer.pal(3, 'Set2')[2]) +
    geom_hline(yintercept = 0) +
    geom_vline(xintercept = 0) +
    ggtitle('Linear Model');
  
  
  list(y = y, X = X, plot = plot);
}

ex.plot <- plot.2D.lin(N, p, beta, sigma, opacity = 0.2)$plot;

Say we want to predict the value of \(y\) using a single scalar value \(x\), so that we have \(p = 1\) features to work with. To do this, we gather \(N = 5000\) datapoints and arrive at the following distribution between them:

The graph above approximates the joint distribution between \(y\) and \(x\). Here, it’s clear that \(y\) tends to linearly follow \(x\) (that’s how I generated the data in the first place), so it’s a good fit for a linear model. Notice, however, that \(x\) does not completely determine \(y\); given a particular \(x\), there’s still some variation in \(y\). This is why we need an \(\epsilon\) term in our model.

Since this example involves only a single feature \(p = 1\), our unknown \(\beta\) vector has \(p + 1 = 2\) components, one for the slope of the line and one for the \(y\)-intercept. \[ \beta = \begin{bmatrix} \beta_0 \\ \beta_1 \end{bmatrix} \]

Our \(y\) and \(X\) matrices look as expected: \[ y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_{5000} \end{bmatrix} \text{ and } X = \begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_{5000} \end{bmatrix} \]

In general, regression attempts to estimate the unknown linear behavior between \(y\) and \(X\) by finding the “best” \(\beta\) it can. OLS in particular uses the method of least squares to find this “best” estimate. We discuss how in the next section.

Estimating \(\beta\)

Once we have data in the form of our \(X\) matrix, we can attempt to estimate \(\beta\) with an estimator \(\hat{\beta}\). To do this, we minimize our loss function, the sum of squared residuals, abbreviated as \(RSS\), as a function of \(\beta\): \[ RSS(\beta) = \sum_{i = 1}^{N} \left( y_i - \sum_{j = 0}^{p} x_{i,j}\beta_j \right)^2 \] It’s useful to note that we can write \(RSS\) in vector form. This will allow us to minimize the sum of squared residuals using matrix calculus, making derivation of \(\hat{\beta}\) much simpler. \[ RSS(\beta) = (y - X\beta)^{T}(y - X\beta) \ftag{OLS.6} \] To minimize \(RSS\) we take its derivative with respect to the vector \(\beta\) (find the gradient) and set this value equal to \(0\): \[ \begin{array}{rcll} RSS & = & (y - X\beta)^{T}(y - X\beta) & \\ \dfrac{\partial RSS}{\partial \beta} & = & 2 \dfrac{\partial}{\partial \beta} (y - X\beta) \cdot (y - X\beta) & \fref{MC.7} \\ 0 & = & -2X^{T}(y - X\hat{\beta}) \ftag{OLS.7} & \fref{MC.5} \end{array} \] Note that, once we set the derivative equal to zero in \(\ifref{OLS.7}\), we changed \(\beta\) into \(\hat{\beta}\). We solve to get: \[ \begin{array}{rcll} 0 & = & X^{T}(y - X\hat{\beta}) & \fref{OLS.7} \\ 0 & = & X^{T}y - X^{T}X\hat{\beta} & \\ X^{T}X\hat{\beta} & = & X^{T}y \\ \hat{\beta} & = & (X^{T}X)^{-1}X^{T}y \ftag{OLS.8} & \end{array} \] This calculation says that, to find the \(\beta\) that minimizes \(RSS\), we calculate the inner product \(X^{T}y\) and multiply it by the inverse of the square matrix \(X^{T}X\).

# Number of datapoints for this experiment
N <- 100;

# Recreate plot with fewer datapoints
ex.out <- plot.2D.lin(N, p, beta, sigma);
ex.plot <- ex.out$plot;

X <- ex.out$X;
y <- ex.out$y;

# Estimate of beta
beta.hat <- solve(t(X) %*% X) %*% t(X) %*% y;

We continue our example from the previous section, a simple regression with \(p = 1\) features. This time, because we will be rerunning our experiment many times, we’ll restrict ourselves to only \(N = 100\) datapoints. Just as before, to generate the data we use a value of \(\beta = \begin{bmatrix} \beta_0 & \beta_1 \end{bmatrix}^{T} = \begin{bmatrix} 6 & 2 \end{bmatrix}^{T}\) and a noise term \(\epsilon \sim N(0, \sigma)\) where \(\sigma = 2\). For a single run, we arrive at the following distribution:

Now, we want to try and get an estimate for our data using \(\ifref{OLS.8}\). We calculate \(\hat{\beta}\) and get: \[ \hat{\beta} = \begin{bmatrix} 5.440155 \\ 2.1307021 \end{bmatrix} \]

# Number of times we rerun the simulation
M <- 5000;

# Define a function that will generate a single beta estimate
beta.experiment <- function(N, beta, sigma) {
  # Simulate `N` datapoints
  epsilon <- matrix(
    rnorm(N, mean = 0, sd = sigma),
    nrow = N,
    ncol = 1
  );

  x.sim <- rnorm(N, mean = 5, sd = sqrt(10));
  X <- matrix(
    c(rep.int(1, N), x.sim),
    nrow = N,
    ncol = p + 1
  );
  y <- X %*% beta + epsilon;
  
  # Estimate beta and return
  beta.hat <- solve(t(X) %*% X) %*% t(X) %*% y;
  as.data.frame(x = t(beta.hat));
}

# Run the experiment `M` times
beta.data <- data.frame();
for(i in 1:M) {
  beta.data <- rbind(beta.data, 
                     beta.experiment(N, beta, sigma));
}
colnames(beta.data) <- 
  sprintf('beta%d', seq(length(beta)) - 1);

# Graph the result
beta.plot <- ggplot() + 
  geom_point(data = beta.data,
             mapping = aes(beta0, beta1),
             alpha = 0.2, 
             color = brewer.pal(3, 'Set2')[2]) +
  xlab(TeX('$\\hat{\\beta_0}$')) + ylab(TeX('$\\hat{\\beta_1}$')) +
  ggtitle( TeX('Estimates of $\\hat{\\beta}$') );

Note that \(\hat{\beta}\) is a random variable, so that the above values only work in this one instance. If we reran the simulation, we would get a different value \(\hat{\beta}\) each time. This is our next step. We choose to rerun our simulation \(M = 5000\) times, each time recalculating our \(\hat{\beta}\) estimate using \(N = 100\) datapoints. Once we do this, we can begin approximating the distribution of our estimator.

The above figure shows that the density of \(\hat{\beta}\) is centered at \(\beta = \begin{bmatrix} 6 & 2 \end{bmatrix}^{T}\), which strongly suggests that the OLS estimator is unbiased.

Next, we derive some properties of \(\hat{\beta}\) and related quantities, affirming what the example above suggests.

Important Matrices

Prediction and Error:
When we make predictions of future values of \(y\), we use the estimator \(\hat{y}\). This value depends on our OLS estimator \(\hat{\beta}\) and makes predictions in the expected way: \[ \begin{array}{rclll} \hat{y} & = & X\hat{\beta} & \ftag{OLS.9} & \\ & = & X(X^{T}X)^{-1}X^{T}y & & \fref{OLS.8} \\ \hat{y} & = & H y & \ftag{OLS.10} & \end{array} \] Here, we defined the projection matrix: \[ H = X(X^{T}X)^{-1}X^{T} \ftag{OLS.11} \]

This matrix is also called the “hat” matrix because it takes \(y\) and outputs \(\hat{y}\).

The predictions made by \(\hat{y}\) will contain errors because we assumed initially that our data had noise (\(\ifref{OLS.3:OLS.4}\)). We can indirectly measure this noise through residuals, the difference between predicted values \(\hat{y}\) and the observed values \(y\). Thus, we define residuals as an \(N\)-dimensional column vector: \[ \begin{array}{rclll} r & = & y - \hat{y} & \ftag{OLS.12} & \\ & = & y - X\hat{\beta} & & \fref{OLS.9} \\ & = & y - X(X^{T}X)^{-1}X^{T}y & & \fref{OLS.8} \\ & = & (I_N - X(X^{T}X)^{-1}X^{T}) \; y & & \\ r & = & R y & \ftag{OLS.13} & \end{array} \]

Here, we defined the residual matrix, so called because it takes \(y\) and outputs the residual vector \(r\): \[ R = I_N - X(X^{T}X)^{-1}X^{T} = I_N - H \ftag{OLS.14} \]

Overview of the Matrices:
Unsurprisingly, proofs involving OLS tend to also involve the terms \(H = X(X^{T}X)^{-1}X^{T}\) and \(R = I_N - X(X^{T}X)^{-1}X^{T}\). Additionally, the matrix \(V = (X^{T}X)^{-1}\) comes up as well (it’s in the definition of both \(H\) and \(R\)). These matrices have useful properties that we outline below:

\[ \begin{array}{l|c|c|c} \text{Definition} & \text{Size} & \text{Symmetric} & \text{Idempotent} \\ \hline V = (X^{T}X)^{-1} & (p + 1) \times (p + 1) & V^{T} = V & V^{2} \neq V \\ H = XVX^{T} = X(X^{T}X)^{-1}X^{T} & N \times N & H^{T} = H & H^{2} = H \\ R = I_N - XVX^{T} = I_N - X(X^{T}X)^{-1}X^{T} & N \times N & R^{T} = R & R^{2} = R \end{array} \]

We’ll now go through each of the above terms and discuss them in detail.

The \(V\) Matrix:
This matrix serves as a building block of both the hat/projection matrix \(H\) and the residual matrix \(R\). The matrix also comes up when we want to estimate the variance of our OLS estimator \(\hat{\beta}\) (see \(\ifref{OLS.24}\)).

  • Size
    Both the matrices \(X^{T}X\) and \(V = (X^{T}X)^{-1}\) have dimensions \((p + 1) \times (p + 1)\): \[ \begin{array}{cccc} X^{T} & X & & X^{T} X \\ \begin{bmatrix} x_{1,0} & x_{2,0} & \cdots & x_{N,0} \\ x_{1,1} & x_{2,1} & \cdots & x_{N,1} \\ \vdots & & & \vdots \\ x_{1,p} & x_{2,p} & \cdots & x_{N,p} \end{bmatrix} & \begin{bmatrix} x_{1,0} & x_{1,1} & \cdots & x_{1,p} \\ x_{2,0} & x_{2,1} & \cdots & x_{2,p} \\ \vdots & & & \vdots \\ x_{N,0} & x_{N,1} & \cdots & x_{N,p} \end{bmatrix} & = & \begin{bmatrix} X_0^{T}X_0 & X_0^{T}X_1 & \cdots & X_0^{T}X_p \\ X_1^{T}X_0 & X_1^{T}X_1 & \cdots & X_1^{T}X_p \\ \vdots & & & \vdots \\ X_p^{T}X_0 & X_p^{T}X_1 & \cdots & X_p^{T}X_p \end{bmatrix} \\ (p + 1) \times N & N \times (p + 1) & & (p + 1) \times (p + 1) \end{array} \] Here, we used \(X_i\) as shorthand for the column vectors of \(X\) so that \(X_i = \begin{bmatrix} x_{1,i} & x_{2,i} & \cdots & x_{N,i} \end{bmatrix}^{T}\). Since \(V = (X^{T}X)^{-1}\) is just the inverse of \(X^{T}X\), it must have the same dimensions.
  • Symmetry
    Next, we show that both \(X^{T}X\) and \(V = (X^{T}X)^{-1}\) are symmetric. This follows from basic properties of matrix transposes and inverses: \[ \begin{array}{rcccl} (X^{T}X)^{T} & = & X^{T} \cdot (X^{T})^{T} & = & X^{T}X \\ \left( (X^{T}X)^{-1} \right)^{T} & = & \left( (X^{T}X)^{T} \right)^{-1} & = & (X^{T}X)^{-1} \ftag{OLS.15} \end{array} \]
  • Idempotence
    Note that, unlike \(H\) and \(R\), \(V\) is not idempotent in the general case.

The Hat Matrix \(H\):
As discussed above, the hat matrix \(H\) is involved in turning our data \(y\) into least squares predictions \(\hat{y}\).

  • Size
    \(H\) has dimensions \(N \times N\). This is immediately visible from its definition: \[ \begin{array}{ccccc} X & V & X^{T} & = & H \\ N \times (p + 1) & (p + 1) \times (p + 1) & (p + 1) \times N & & N \times N \end{array} \]
  • Symmetry
    The hat matrix \(H = X(X^{T}X)^{-1}X^{T}\) is symmetric: \[ \begin{array}{rcll} H^{T} & = & \left( X(X^{T}X)^{-1}X^{T} \right)^{T} & \fref{OLS.11} \\ & = & (X^{T})^{T} \left( (X^{T}X)^{-1} \right)^{T} X^{T} & \\ & = & X (X^{T}X)^{-1} X^{T} & \fref{OLS.15} \\ H^{T} & = & H \ftag{OLS.16} & \end{array} \]
  • Idempotence
    The hat matrix \(H = X(X^{T}X)^{-1}X^{T}\) is idempotent, meaning that \(H^{2} = H\). This is easy to prove directly: \[ \begin{array}{rcll} H^{2} & = & \left( X(X^{T}X)^{-1}X^{T} \right)^{2} & \fref{OLS.11} \\ & = & X(X^{T}X)^{-1}X^{T} \cdot X(X^{T}X)^{-1}X^{T} \\ & = & X(X^{T}X)^{-1}(X^{T}X)(X^{T}X)^{-1}X^{T} \\ & = & X(X^{T}X)^{-1} I_{(p + 1)}X^{T} \\ & = & X(X^T{}X)^{-1}X^{T} \\ H^{2} & = & H \ftag{OLS.17} & \fref{OLS.11} \end{array} \] Note that, if we combine \(\ifref{OLS.16}\) and \(\ifref{OLS.17}\) we arrive at \(H^{T}H = H \; H = H^2 = H\).

The Residual Matrix \(R\):
As discussed above, the residual matrix \(R\) is involved with turning our data \(y\) into residual errors \(r\).

  • Size
    \(R\) has dimensions \(N \times N\). This is immediately apparent if you consider that \(R = I_N - H\) (\(\ifref{OLS.14}\)), so that it must have the same size as \(H\).
  • Symmetry
    Again, we turn towards the fact that \(R = I_N - H\) (\(\ifref{OLS.14}\)) to show that \(R\) is symmetric: \[ \begin{array}{rcll} R^{T} & = & (I_N - H)^{T} & \fref{OLS.14} \\ & = & I_N^{T} - H^{T} & \\ & = & I_N - H & \fref{OLS.16} \\ R^{T} & = & R \ftag{OLS.18} \end{array} \]
  • Idempotence
    As with \(H\), the residual matrix \(R\) is idempotent so that \(R^2 = R\). We prove this using the usual definition: \[ \begin{array}{rcll} R^2 & = & (I_N - H)^2 & \fref{OLS.14} \\ & = & (I_N - H) \cdot (I_N - H) & \\ & = & I_N I_N - I_N H - H I_N + H^2 & \\ & = & I_N - H - H + H & \fref{OLS.17} \\ & = & I_N - H & \\ R^2 & = & R \ftag{OLS.19} & \end{array} \]

Connection Between \(H\) and \(R\):
The projection/hat matrix and the residual matrix have several useful properties when combined with other components. In particular, when we look at different combinations of \(H\) and \(R\) with \(X\) we can see that:

  • \(H\) and \(X\): \[ \begin{array}{rcll} H X & = & X(X^{T}X)^{-1}X^{T}\cdot X & \fref{OLS.11} \\ & = & X I_{(p + 1)} \\ H X & = & X \ftag{OLS.20} \end{array} \]
  • \(R\) and \(X\): \[ \begin{array}{rcll} R X & = & (I_N - H)X & \fref{OLS.14} \\ & = & I_N X - H X & \\ & = & X - X & \fref{OLS.20} \\ R X & = & 0 \ftag{OLS.21} & \end{array} \]

The following table summarizes the interactions we’ve learned about so far:

\[ \begin{array}{c|c|c} & y & X \\ \hline H & H y = \hat{y} \;\, (\ifref{OLS.10}) & H X = X \;\, (\ifref{OLS.20}) \\ R & R y = r \;\, (\ifref{OLS.13}) & R X = 0 \;\, (\ifref{OLS.21}) \end{array} \]

These properties show how \(H\) and \(R\) behave with the data matrix \(X\). Combining the above with the fact that \(y = \hat{y} + r\) (\(\ifref{OLS.12}\)), we can decompose \(y\) into a predicted part and a residual part: \[ y = H y \;+\; R y \ftag{OLS.22} \]
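
We can verify all of these facts numerically with the `X` and `y` matrices from the \(N = 100\) estimation example, assuming they are still in scope from that chunk:

# Numerical check of the H and R properties using the earlier X and y objects
H <- X %*% solve(t(X) %*% X) %*% t(X);  # hat/projection matrix (OLS.11)
R <- diag(nrow(X)) - H;                 # residual matrix (OLS.14)

all.equal(H %*% H, H);            # idempotence, OLS.17
all.equal(R, t(R));               # symmetry, OLS.18
all.equal(H %*% X, X);            # H X = X, OLS.20
max(abs(R %*% X));                # R X = 0 up to rounding, OLS.21
all.equal(y, H %*% y + R %*% y);  # y = H y + R y, OLS.22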

Properties of \(\small \hat{\beta}\)

Overview:
In this section, we prove some basic facts about the OLS estimator \(\hat{\beta}\).

  • The OLS estimator \(\hat{\beta}\) is unbiased so that \(\mathbb{E}(\hat{\beta}) = \beta\). We saw an example of this in the previous section.
  • The OLS estimator \(\hat{\beta}\) has a variance-covariance matrix of the form \(Var(\hat{\beta}) = \sigma^2 \left( X^{T}X\right)^{-1}\).

Expected Value:
First up, we want to prove that our estimator is unbiased. This is a straightforward exercise if we just expand \(\hat{\beta}\) using our definitions: \[ \begin{array}{rcll} \mathbb{E}(\hat{\beta}) & = & \mathbb{E}((X^{T}X)^{-1}X^{T}y) & \fref{OLS.8} \\ & = & \mathbb{E}((X^{T}X)^{-1}X^{T} \cdot (X\beta + \epsilon)) & \fref{OLS.1} \\ & = & \mathbb{E}((X^{T}X)^{-1}X^{T}X\beta + (X^{T}X)^{-1}X^{T}\epsilon)) \\ & = & \mathbb{E}(\beta) + (X^{T}X)^{-1}X^{T}\mathbb{E}(\epsilon) \\ & = & \beta + (X^{T}X)^{-1}X^{T} \cdot 0 & \fref{OLS.4} \\ \mathbb{E}(\hat{\beta}) & = & \beta \ftag{OLS.23} & \end{array} \]

Variance:
We want to show that the variance-covariance matrix of \(\hat{\beta}\) has dimensions \((p + 1) \times (p + 1)\) and looks like: \[ Var(\hat{\beta}) = \begin{bmatrix} Var(\hat{\beta}_0) & Cov(\hat{\beta}_0, \hat{\beta}_1) & \cdots & Cov(\hat{\beta}_0, \hat{\beta}_p) \\ Cov(\hat{\beta}_1, \hat{\beta}_0) & Var(\hat{\beta}_1) & \cdots & Cov(\hat{\beta}_1, \hat{\beta}_p) \\ \vdots & & & \vdots \\ Cov(\hat{\beta}_p, \hat{\beta}_0) & Cov(\hat{\beta}_p, \hat{\beta}_1) & \cdots & Var(\hat{\beta}_p) \end{bmatrix} = \sigma^2 (X^{T}X)^{-1} \ftag{OLS.24} \]

To do this, we note that we treat \(X\) (and any combinations involving \(X\)) as a constant. We once again expand \(\hat{\beta}\) and use properties of \(Var(\Box)\) to get: \[ \begin{array}{rcll} Var(\hat{\beta}) & = & Var((X^{T}X)^{-1}X^{T}y) & \fref{OLS.8} \\ & = & (X^{T}X)^{-1}X^{T} \;Var(y)\; ((X^{T}X)^{-1}X^{T})^{T} & \fref{V.11} \\ & = & (X^{T}X)^{-1}X^{T} \;Var(y)\; X(X^{T}X)^{-1} & \fref{OLS.15} \\ & = & (X^{T}X)^{-1}X^{T} \;Var(X\beta + \epsilon)\; X(X^{T}X)^{-1} & \fref{OLS.1} \\ & = & (X^{T}X)^{-1}X^{T} \;Var(\epsilon)\; X(X^{T}X)^{-1} & \fref{V.13} \\ & = & (X^{T}X)^{-1}X^{T} \;\sigma^2 I_{N}\; X(X^{T}X)^{-1} & \fref{OLS.4} \\ & = & \sigma^2 (X^{T}X)^{-1}\cdot(X^{T}X)(X^{T}X)^{-1} & \\ Var(\hat{\beta}) & = & \sigma^2 (X^{T}X)^{-1} & \end{array} \]

We will frequently refer to the diagonal elements of \((X^{T}X)^{-1}\) as \(v_j\) for convenience. This implies that \(Var(\hat{\beta}_j)\) can be expressed as \(\sigma^2 \cdot v_j\).
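
As a sanity check, we can compare these results against the simulation from the previous section, assuming the `beta.data` data frame and the `X`, `beta`, and `sigma` objects are still in scope. Keep in mind that each simulated run drew a fresh \(X\), so the match with \(\sigma^2 (X^{T}X)^{-1}\) for our one particular \(X\) is only approximate:

# Compare the simulation with OLS.23 and OLS.24
# (assumes beta.data, X, beta, and sigma from the earlier chunks are in scope)
colMeans(beta.data);          # approximately beta = (6, 2), i.e. unbiasedness (OLS.23)
cov(beta.data);               # empirical variance-covariance matrix of beta-hat
sigma^2 * solve(t(X) %*% X);  # sigma^2 (X^T X)^{-1} for this particular X (OLS.24)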

Estimating \(\sigma^2\)

In our model, we assumed that our dependent variable \(y\) and our independent variable \(X\) are linearly related through the term \(\beta\). We’ve already seen how we can estimate \(\beta\) by minimizing squared error \(RSS\) (\(\ifref{OLS.6}\)), but we’ve yet to estimate the second unknown parameter in our model, namely \(\sigma\). Thus, in this section, we turn our attention to estimating \(\sigma^2\).

Recall that our full linear model is \(y = X\beta + \epsilon\) (\(\ifref{OLS.1}\)) where we defined \(\epsilon\) as an \(N\)-dimensional random vector with \(\mathbb{E}(\epsilon) = 0\) and \(Var(\epsilon) = \sigma^2 I_N\) (\(\ifref{OLS.4}\)). Can we somehow estimate the variance of our noise \(\epsilon\)? The following observation about squared error will motivate the answer. We ask, what is the expected value of our squared error given our model? In other words, what is \(\mathbb{E}(RSS(\beta))\)? We already have all the tools we need to calculate this: \[ \begin{array}{rcll} \mathbb{E}(RSS(\beta)) & = & \mathbb{E}(\;(y - X\beta)^{T}(y - X\beta)\;) & \fref{OLS.6} \\ & = & \mathbb{E}(\;(X\beta + \epsilon - X\beta)^{T}(X\beta + \epsilon - X\beta)\;) & \fref{OLS.1} \\ & = & \mathbb{E}(\;\epsilon^{T} \epsilon\;) & \\ & = & \mathbb{E}\left(\; \sum\limits_{i = 1}^{N} \epsilon_i^2 \;\right) & \\ & = & \sum\limits_{i = 1}^{N} \mathbb{E}(\;\epsilon_i^2\;) & \\ & = & \sum\limits_{i = 1}^{N} \mathbb{E}(\;\epsilon_i\;)^2 + Var(\;\epsilon_i\;) & \\ & = & N \cdot (0^2 + \sigma^2) & \fref{OLS.3} \\ \mathbb{E}(RSS(\beta)) & = & N \sigma^2 \end{array} \]

The above result demonstrates two points. First, we can rewrite \(RSS(\beta)\) using our noise variable \(\epsilon\) \[ RSS(\beta) = \epsilon^{T}\epsilon \ftag{OLS.25} \]

Second, if we know the true value of \(\beta\), then we can figure out the variance of the noise variable \(\epsilon\) by calculating the quantity: \[ \dfrac{\mathbb{E}(RSS(\beta))}{N} = \dfrac{\mathbb{E}(\epsilon^{T}\epsilon)}{N} = \sigma^2 \ftag{OLS.26} \]
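On simulated data we do happen to know the true \(\beta\), so \(\ifref{OLS.26}\) can be checked directly. A minimal sketch, assuming `y`, `X`, `beta`, and `N` from the earlier simulation are still in scope:

# Check OLS.26 on the simulated data, where we know the true beta
# (assumes `y`, `X`, `beta`, and `N` from the earlier simulation are in scope)
r.true <- y - X %*% beta;                 # With the true beta, this is exactly epsilon
as.numeric( t(r.true) %*% r.true ) / N;   # Should land near sigma^2 = 4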

Of course, we do not know the true value of \(\beta\), nor do we know anything about \(\epsilon\). We do, however, have an OLS estimate \(\hat{\beta}\). Furthermore, we know of the following two suggestive relations: \[ \require{enclose} \begin{array}{rrcll} \enclose{circle}{1} & y & = & X\beta + \epsilon & \fref{OLS.1} \\ \enclose{circle}{2} & y & = & X\hat{\beta} + r & \fref{OLS.9:OLS.12} \\ & & = & H y + R y & \fref{OLS.22} \end{array} \]

Here, we see how our model and our predictors deconstruct the vector \(y\). In our model \(\enclose{circle}{1}\), we deconstruct \(y\) as the sum of a definite linear term \(X\beta\) and a random noise term with no bias \(\epsilon\). Once we start using estimators, both terms become random. In our second deconstruction \(\enclose{circle}{2}\), the first term \(X\hat{\beta} = H y\) attempts to estimate the definite linear term \(X\beta\), while the second term, our residuals \(r = R y\), attempts to estimate the noise \(\epsilon\).

Looking at this, it would make sense to try to estimate \(\sigma^2\) using our residuals \(r\). Since \(\beta \leftrightarrow \hat{\beta}\) and \(\epsilon \leftrightarrow r\), maybe we can estimate \(\sigma^2\) as: \[ \dfrac{\mathbb{E}(RSS(\hat{\beta}))}{N} \stackrel{?}{=} \dfrac{\mathbb{E}(r^{T}r)}{N} \stackrel{?}{=} \sigma^2 \]

This turns out to be almost the right answer. The main complication arises from the fact that we never observe our residuals directly, but through our estimate \(\hat{\beta}\). In other words, we don’t define our residuals as \(r_{fake} = y - X\beta\) (which we cannot do because we don’t know \(\beta\)), but as \(r_{real} = y - \hat{y} = y - X\hat{\beta}\). This matters because: \[ \require{enclose} \begin{array}{rrcll} \enclose{circle}{1} & r_{fake} & = & y - X\beta & \\ & & = & X\beta + \epsilon - X\beta & \fref{OLS.1} \\ & r_{fake} & = & \epsilon & \\ \enclose{circle}{2} & r_{real} & = & y - X\hat{\beta} & \fref{OLS.9:OLS.12} \\ & & = & X\beta + \epsilon - X\hat{\beta} & \fref{OLS.1} \\ & r_{real} & = & \epsilon + X(\beta - \hat{\beta}) \end{array} \]

This shows that trying to estimate \(\sigma^2\) through our residuals will lead to an imperfect estimate because \(\beta - \hat{\beta} \neq 0\) unless \(N \to \infty\). Fortunately, we can actually measure this effect directly and correct for it. Consider, once again, the expected value of our squared error term. This time, however, we want to calculate this error using \(\hat{\beta}\), instead of \(\beta\): \[ \begin{array}{rcll} \mathbb{E}(RSS(\hat{\beta})) & = & \mathbb{E}(\;(y - X\hat{\beta})^{T}(y - X\hat{\beta})\;) & \fref{OLS.6} \\ & = & \mathbb{E}(\;(X\hat{\beta} + r - X\hat{\beta})^{T}(X\hat{\beta} + r - X\hat{\beta})\;) & \fref{OLS.9:OLS.12} \\ & = & \mathbb{E}(\;r^{T}r\;) & \\ & = & \mathbb{E}(\;(R y)^{T}(R y)\;) & \fref{OLS.13} \\ & = & \mathbb{E}(\;(RX\beta + R\epsilon)^{T}(RX\beta + R\epsilon)\;) & \fref{OLS.1} \\ & = & \mathbb{E}(\;(0 + R\epsilon)^{T}(0 + R\epsilon)\;) & \fref{OLS.21} \\ & = & \mathbb{E}(\;\epsilon^{T}R^{T}R\epsilon\;) & \\ \mathbb{E}(RSS(\hat{\beta})) & = & \mathbb{E}(\;\epsilon^{T}R\epsilon\;) & \fref{OLS.18:OLS.19} \end{array} \]

To move further, we note that \(\epsilon^{T}R\epsilon\) is actually a scalar: \[ \begin{array}{ccc} \epsilon^{T} & R & \epsilon \\ 1 \times N & N \times N & N \times 1 \end{array} \implies 1 \times 1 \]

Thus, \(\epsilon^{T}R\epsilon\) equals its own trace, \(tr(\epsilon^{T}R\epsilon) = \epsilon^{T}R\epsilon\). We can now use this knowledge, as well as the fact that \(\mathbb{E}(tr(X)) = tr(\mathbb{E}(X))\) and the cyclic permutation property of traces (\(\ifref{MO.5}\)) to write: \[ \begin{array}{rcll} \mathbb{E}(RSS(\hat{\beta})) & = & \mathbb{E}(\;\epsilon^{T}R\epsilon\;) & \\ & = & \mathbb{E}(\; tr(\epsilon^{T}R\epsilon)\;) & \\ & = & \mathbb{E}(\; tr(R\epsilon\epsilon^{T})\;) & \fref{MO.5} \\ & = & tr(\;\mathbb{E}(R\epsilon\epsilon^{T})\;) & \\ & = & tr(\;R\;\mathbb{E}(\epsilon\epsilon^{T})\;) & \\ & = & tr(\;R\;Var(\epsilon)\;) & \fref{V.5:OLS.4} \\ & = & tr(\;R\;\sigma^2 I_N\;) & \fref{OLS.4} \\ & = & \sigma^2\;tr(R) & \\ & = & \sigma^2\;tr(I_N - H) & \fref{OLS.14} \\ & = & \sigma^2\;(tr(I_N) - tr(X(X^{T}X)^{-1}X^{T})) & \fref{MO.2:OLS.11} \\ & = & \sigma^2\;(N - tr(X^{T}X(X^{T}X)^{-1})) & \fref{MO.5} \\ & = & \sigma^2\;(N - tr(I_{(p + 1)})) & \\ \mathbb{E}(RSS(\hat{\beta})) & = & \sigma^2\;(N - p - 1) \end{array} \]
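The key step above is that \(tr(H) = p + 1\), so that \(tr(R) = N - p - 1\). A quick numerical check of these trace identities (assuming `X`, `N`, and `p` from the earlier simulation are in scope):

# Check the trace identities used above: tr(H) = p + 1 and tr(R) = N - p - 1
# (assumes `X`, `N`, and `p` from the earlier simulation are in scope)
H <- X %*% solve( t(X) %*% X ) %*% t(X);   # Hat matrix
R <- diag(N) - H;                          # Residual-maker matrix

sum(diag(H));   # Should equal p + 1 = 2
sum(diag(R));   # Should equal N - p - 1 = 98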

The above discussion demonstrates two points. First, we can rewrite \(RSS(\hat{\beta})\) using our residuals \(r\): \[ RSS(\hat{\beta}) = r^{T}r \ftag{OLS.27} \]

Second, if we calculate the expected value of our residual squared errors and normalize by \(N - p - 1\), then we can recover the variance of our noise: \[ \dfrac{\mathbb{E}(RSS(\hat{\beta}))}{N - p - 1} = \dfrac{\mathbb{E}(r^{T}r)}{N - p - 1} = \sigma^2 \ftag{OLS.28} \]

\(\ifref{OLS.28}\) finally provides us with a means of estimating our model noise. We define a new estimator: \[ \hat{\sigma}^2 = \dfrac{1}{N - p - 1} RSS(\hat{\beta}) = \dfrac{1}{N - p - 1} r^{T}r \ftag{OLS.29} \]

Note that this estimator is unbiased: by the calculation above, \(\mathbb{E}(\hat{\sigma}^2) = \sigma^2\).

# Define a function to compute the residual sum of squares RSS(beta)
squared.error <- function(y, X, beta) {
  r <- y - X %*% beta;
  as.numeric( t(r) %*% r );
}

# Calculate estimate
sigma2.hat <- squared.error(y, X, beta.hat) / (N - p - 1);

In a previous section, we explored the distribution of our OLS estimator \(\hat{\beta}\) for a simple regression with \(p = 1\) feature. Recall that we defined

  • \(N = 100\)
  • \(\beta = \begin{bmatrix} \beta_0 & \beta_1 \end{bmatrix}^{T} = \begin{bmatrix} 6 & 2 \end{bmatrix}^{T}\)
  • \(\epsilon \sim N(0, \sigma^2)\) where \(\sigma = 2\)

If we use \(\ifref{OLS.29}\), we can get an idea of the noise present in the data, in addition to the linear relationship between \(y\) and \(X\) that we estimated with \(\ifref{OLS.8}\). In this case we calculate: \[ \hat{\sigma}^2 = \dfrac{1}{N - p - 1} RSS(\hat{\beta}) = 4.2487874 \]

We overlay this value, along with our regression line below. This plot displays both the effects of \(\hat{\beta}\) (the regression line) as well as \(\hat{\sigma}^2\) (the error region). The error region has a width of \(2 \cdot \hat{\sigma} = 2 \cdot 2.0612587\).

# Calculate regression line
x.lin <- seq(from = min(X[,2]),
             to = max(X[,2]),
             length.out = 2);
X.lin <- matrix(
  data = c(rep(1, 2), x.lin),
  nrow = 2,
  ncol = 2
);

y.hat <- X.lin %*% beta.hat;

# Combine data with ribbons (errors)
data.lin <- data.frame(
  'y.hat' = y.hat,
  'y.lo'  = y.hat - sqrt(sigma2.hat),
  'y.hi'  = y.hat + sqrt(sigma2.hat),
  'x.lin' = x.lin
);

# Graph regression line over data with errors
ex.plot + 
  geom_line(data = data.lin,
            mapping = aes(x.lin, y.hat),
            color = brewer.pal(3, 'Set2')[3],
            size = 0.75) +
  geom_ribbon(data = data.lin,
              mapping = aes(x = x.lin, 
                            ymin = y.lo, 
                            ymax = y.hi),
              fill = brewer.pal(3, 'Set2')[3],
              alpha = 0.2);

When estimating \(\beta\), we noted that \(\hat{\beta}\) is a random variable, so that we will not get the same value if we resample our data. To show this, we simulated the experiment of sampling \(N = 100\) datapoints \(M = 5000\) times and, for each iteration, we calculated and graphed \(\hat{\beta}\). This gave us an approximate distribution for our estimator.

Of course, \(\hat{\sigma}^2\) is a random variable as well. This means we can recalculate \(\hat{\sigma}^2\) after simulating/gathering new data to get a different value. Below, we do this \(M = 5000\) times and graph the resulting distribution:

# Define a function that will generate a single sigma2 estimate
sigma.experiment <- function(N, beta, sigma) {
  # Simulate `N` datapoints
  epsilon <- matrix(
    rnorm(N, mean = 0, sd = sigma),
    nrow = N,
    ncol = 1
  );

  x.sim <- rnorm(N, mean = 5, sd = sqrt(10));
  X <- matrix(
    c(rep.int(1, N), x.sim),
    nrow = N,
    ncol = p + 1
  );
  y <- X %*% beta + epsilon;
  
  # Estimate beta
  beta.hat <- solve(t(X) %*% X) %*% t(X) %*% y;
  
  # Estimate sigma2
  sigma2.hat <- squared.error(y, X, beta.hat) / (N - p - 1);
  as.numeric(sigma2.hat);
}

# Run the experiment `M` times
sigma2.data <- data.frame();
for(i in 1:M) {
  sigma2.data <- rbind(sigma2.data, 
                     sigma.experiment(N, beta, sigma));
}

colnames(sigma2.data) <- 'sigma2.hat';

# Graph the result
ggplot() + 
  geom_histogram(data = sigma2.data,
                 mapping = aes(sigma2.hat),
                 color = brewer.pal(3, 'Set2')[2],
                 binwidth = 0.1) +
  xlab(TeX('$\\hat{\\sigma}^2$')) + ylab('Count') +
  ggtitle( TeX('Estimates of $\\hat{\\sigma}^2$') );

Notice how the distribution seems centered at \(\sigma^2 = 4\). This affirms our result that \(\ifref{OLS.29}\) is unbiased.
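As a further check, the average of the \(M = 5000\) simulated estimates should land near \(\sigma^2 = 4\):

# Average of the simulated estimates; should be close to sigma^2 = 4
mean(sigma2.data$sigma2.hat);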

Gaussian Noise

Up to this point, we’ve not assumed anything about the distribution of \(\epsilon\) besides its mean \(\mathbb{E}(\epsilon) = 0\) and its variance \(Var(\epsilon) = \sigma^2 I_N\) (\(\ifref{OLS.4}\)). All of the results we’ve obtained so far rely only on these facts. To calculate confidence intervals and confidence sets, we add one more assumption: normality. In other words, we assume that: \[ \epsilon \sim \mathcal{N}(0, \sigma^2 I_N) \ftag{OLS.30} \]

Assuming Gaussian noise allows us to state additional properties of our estimators \(\hat{\beta}\) and \(\hat{\sigma}^2\). In particular, it allows us to calculate their exact distributions.
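Since our earlier simulation already draws its noise with `rnorm`, the Gaussian assumption holds there, and we can illustrate what this exactness means. The sketch below (my own check; it assumes `X`, `beta`, `sigma`, and `N` from the earlier simulation are still in scope) re-estimates \(\hat{\beta}_1\) many times and compares its empirical quantiles with the exact \(N(\beta_1, \sigma^2 v_1)\) quantiles implied by \(\ifref{OLS.30}\):

# Under OLS.30, beta.hat_1 ~ N(beta_1, sigma^2 * v_1) exactly, not just approximately
# (assumes `X`, `beta`, `sigma`, and `N` from the earlier simulation are in scope)
v1 <- diag( solve(t(X) %*% X) )[2];

beta1.hats <- replicate(5000, {
  y.new <- X %*% beta + rnorm(N, mean = 0, sd = sigma);
  as.numeric( (solve(t(X) %*% X) %*% t(X) %*% y.new)[2] );
});

# Compare empirical quantiles of the simulated estimates to the exact normal quantiles
probs <- c(0.05, 0.25, 0.50, 0.75, 0.95);
rbind(
  'empirical' = quantile(beta1.hats, probs),
  'exact'     = qnorm(probs, mean = beta[2], sd = sigma * sqrt(v1))
);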

Problems

Introduction

This is the problems section.
Click on the tabs above to view the solution to a problem.

Ex 3.1

Show that the F statistic (3.13) for dropping a single coefficient from a model is equal to the square of the corresponding z-score (3.12).

Preliminaries:
To show the above, we’ll need to review a few results regarding Ordinary Least Squares (OLS) as well as remember the basic definitions of the t distribution and the F distribution.

  • For linear regression, we assume that our response variable \(y\) follows \(y = X\beta + \epsilon\).
  • For this problem, we will assume that \(\epsilon \sim N(0, \sigma^2)\). Thanks to the Central Limit Theorem this assumption is not strictly necessary, but it will help us in the calculations below.
  • \(\beta\) and its OLS estimator \(\hat{\beta}\) are both \(p + 1\) dimensional column vectors, with \(p\) representing the number of features: \[ \beta = \begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_p \end{bmatrix} \text{ and } \hat{\beta} = \begin{bmatrix} \hat{\beta}_0 \\ \hat{\beta}_1 \\ \vdots \\ \hat{\beta}_p \end{bmatrix} \]
  • The Ordinary Least Squares (OLS) estimator for \(\beta\) is \(\hat{\beta} = (X^{T}X)^{-1}X^{T}y\). The authors derived this by using vector calculus to minimize the squared residuals.
  • The OLS estimator \(\hat{\beta}\) is unbiased so that \(\mathbb{E}(\hat{\beta}) = \beta\). The authors simply stated this earlier, but it is easy to prove: \[ \begin{array}{rcl} \mathbb{E}(\hat{\beta}) & = & \mathbb{E}((X^{T}X)^{-1}X^{T}y) \\ & = & \mathbb{E}((X^{T}X)^{-1}X^{T} \cdot (X\beta + \epsilon)) \\ & = & \mathbb{E}((X^{T}X)^{-1}X^{T}X\beta + (X^{T}X)^{-1}X^{T}\epsilon) \\ & = & \mathbb{E}(\beta) + (X^{T}X)^{-1}X^{T}\mathbb{E}(\epsilon) \\ & = & \beta \end{array} \]
  • The variance-covariance matrix of \(\hat{\beta}\) has dimensions \((p + 1) \times (p + 1)\) and looks like: \[ Var(\hat{\beta}) = \begin{bmatrix} Var(\hat{\beta}_0) & Cov(\hat{\beta}_0, \hat{\beta}_1) & \cdots & Cov(\hat{\beta}_0, \hat{\beta}_p) \\ Cov(\hat{\beta}_1, \hat{\beta}_0) & Var(\hat{\beta}_1) & \cdots & Cov(\hat{\beta}_1, \hat{\beta}_p) \\ \vdots & & & \vdots \\ Cov(\hat{\beta}_p, \hat{\beta}_0) & Cov(\hat{\beta}_p, \hat{\beta}_1) & \cdots & Var(\hat{\beta}_p) \end{bmatrix} = (X^{T}X)^{-1}\sigma^2 \] The authors glossed over this earlier, but a full proof can be found on the Wikipedia page for Ordinary Least Squares (OLS) Proofs. We will frequently refer to the diagonal elements of \((X^{T}X)^{-1}\) as \(v_j\) for convenience. This implies that \(Var(\hat{\beta}_j)\) can be expressed as \(\sigma^2 \cdot v_j\).
  • The above two bullet points hold regardless of the distribution of our noise \(\epsilon\). However, if we assume that \(\epsilon \sim N(0, \sigma^2)\) then we can write: \[ \hat{\beta} \sim N(\beta, (X^{T}X)^{-1}\sigma^2) \]
  • We can estimate \(\sigma^2\) (the variance of our noise) by using the \(\hat{\sigma}^2\) estimator: \[ \begin{array}{rcl} \hat{\sigma}^2 & = & \dfrac{1}{N - p - 1} \sum_{i = 1}^{N} (y_i - \hat{y}_i)^2 \\ & = & \dfrac{1}{N - p - 1} (y - X\hat{\beta})^{T}(y - X\hat{\beta}) \\ & = & \dfrac{1}{N - p - 1} RSS(\hat{\beta}) \end{array} \] Note that our \(\sigma^2\) estimator \(\hat{\sigma}^2\) and the residual sum of squares \(RSS\) are proportional to one another.
  • Assuming that \(\epsilon \sim N(0, \sigma^2)\), then our variance estimator \(\hat{\sigma}^2\) has a chi-squared distribution with \(N - p - 1\) degrees of freedom: \[ \hat{\sigma}^2 \sim \dfrac{\sigma^2}{N - p - 1} \chi^2(N - p - 1) \] Also, since \(RSS\) is proportional to \(\hat{\sigma}^2\), we can also write: \[ RSS(\hat{\beta}) \sim \sigma^2 \chi^2(N - p - 1) \]
  • If \(Z \sim N(0, 1)\) is a standard normal distribution and \(Y \sim \chi^2(l)\) is a chi-squared distribution with degrees of freedom \(l\), then: \[ T = \dfrac{Z}{\sqrt{Y / l}} \sim t(l) \] In other words, \(T\) has a t distribution with degrees of freedom \(l\). Note that this is true only if \(Z\) and \(Y\) are independent.
  • If \(X \sim \chi^2(k)\) is a chi-squared distribution with degrees of freedom \(k\) and \(Y \sim \chi^2(l)\) is a chi-squared distribution with degrees of freedom \(l\) then: \[ F = \dfrac{X / k}{Y / l} \sim F(k, l) \] In other words, \(F\) has an F distribution with parameters \(k\) and \(l\). Note that this is only true if \(X\) and \(Y\) are independent. (A short simulation check of both of these constructions follows this list.)
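Before starting the proof, here is a quick simulation sanity check of the t and F constructions above (the degrees-of-freedom choices below are arbitrary illustrative values):

# Quick simulation check of the t and F constructions above
# (k.df and l.df are arbitrary illustrative choices)
k.df <- 1;
l.df <- 97;

Z.sim <- rnorm(10000);
X.chi <- rchisq(10000, df = k.df);
Y.chi <- rchisq(10000, df = l.df);

T.stat <- Z.sim / sqrt(Y.chi / l.df);
F.stat <- (X.chi / k.df) / (Y.chi / l.df);

probs <- c(0.50, 0.90, 0.95, 0.99);
rbind('simulated T' = quantile(T.stat, probs),
      'theoretical' = qt(probs, df = l.df));
rbind('simulated F' = quantile(F.stat, probs),
      'theoretical' = qf(probs, df1 = k.df, df2 = l.df));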

Proof:
Our proof will follow three overall steps:

  1. First, we will prove that the z-score given by (3.12) has a t distribution with \(l = N - p - 1\) degrees of freedom.
  2. Next, we will prove that the F statistic given by (3.13) has an F distribution with parameters \(k = p_1 - p_0\) and \(l = N - p_1 - 1\).
  3. Finally, we will show that, if \(p_1 = p_0 + 1\) (we’re only getting rid of one parameter), then the F statistic is just the square of the z-score.

Part 1
The z-score given by (3.12) has a t distribution with \(l = N - p - 1\) degrees of freedom. To see this, note that the definition of the z-score for the \(j_{th}\) element of \(\hat{\beta}\) looks like: \[ z_j = \dfrac{\hat{\beta}_j}{\hat{\sigma} \sqrt{v_j}} \] From this, we can follow a line of reasoning to express \(z_j\) in terms of a standard normal distribution \(Z \sim N(0, 1)\) and a chi-squared distribution \(Y \sim \chi^2(l)\) with degrees of freedom \(l\):

  • Since \(\hat{\beta} \sim N(\beta, (X^{T}X)^{-1}\sigma^2)\) is a multi-dimensional normal distribution, we look at a single component (the \(j_{th}\) component) and get \(\hat{\beta}_j \sim N(\beta_j, \sigma^2 v_j)\).
  • Under the null hypothesis that \(\beta_j = 0\), this further simplifies to \(\hat{\beta}_j \sim N(0, \sigma^2 v_j)\).
  • Using well known properties of normal distributions, we finally rewrite \(\hat{\beta}_j = \sigma \sqrt{v_j} \cdot Z\).
  • Now, the z-score for the \(j_{th}\) component looks like: \[ \begin{array}{rcl} z_j & = & \dfrac{\hat{\beta}_j}{\hat{\sigma} \sqrt{v_j}} \\ & = & \dfrac{\sigma \sqrt{v_j} \cdot Z}{\hat{\sigma} \sqrt{v_j}} \\ & = & \dfrac{\sigma Z}{\hat{\sigma}} \end{array} \]
  • Now, we turn our attention to \(\hat{\sigma}^2\): \[ \hat{\sigma}^2 \sim \dfrac{\sigma^2}{N - p - 1} \chi^2(N - p - 1) \]
  • We rewrite \(\hat{\sigma}\) using \(Y \sim \chi^2(l)\) where \(l = N - p - 1\): \[ \begin{array}{rcl} \hat{\sigma}^2 & = & \dfrac{\sigma^2}{l} Y \\ \hat{\sigma} & = & \sigma \sqrt{\dfrac{Y}{l}} \end{array} \]
  • Finally, the z-score for the \(j_{th}\) component becomes: \[ \begin{array}{rcl} z_j & = & \dfrac{\sigma Z}{\hat{\sigma}} \\ & = & \dfrac{\sigma Z}{\sigma \sqrt{Y / l}} \\ & = & \dfrac{Z}{\sqrt{Y / l}} \sim t(l) \end{array} \]

Thus, the z-score for the \(j_{th}\) component has a t distribution, as promised. (Here we also rely on the fact, stated without proof, that \(\hat{\beta}\) and \(\hat{\sigma}^2\) are independent under Gaussian noise, so \(Z\) and \(Y\) satisfy the independence requirement above.)
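As an optional sanity check (a sketch of my own; it assumes the `X`, `sigma`, `N`, `p`, and `squared.error` objects from the References section are still in scope, and `beta.null` is a choice I make so that the null hypothesis \(\beta_1 = 0\) actually holds), we can simulate the z-score many times and compare its quantiles to those of \(t(N - p - 1)\):

# Simulation check for Part 1: under the null beta_1 = 0, the z-score for
# beta_1 should follow a t distribution with N - p - 1 degrees of freedom
# (assumes `X`, `sigma`, `N`, `p`, and `squared.error` are still in scope;
#  `beta.null` is my own choice so the null hypothesis holds)
beta.null <- matrix(c(6, 0), nrow = p + 1, ncol = 1);
v1        <- diag( solve(t(X) %*% X) )[2];

z.sim <- replicate(5000, {
  y.new  <- X %*% beta.null + rnorm(N, mean = 0, sd = sigma);
  b.hat  <- solve(t(X) %*% X) %*% t(X) %*% y.new;
  s2.hat <- squared.error(y.new, X, b.hat) / (N - p - 1);
  as.numeric( b.hat[2] / sqrt(s2.hat * v1) );
});

probs <- c(0.50, 0.90, 0.95, 0.99);
rbind('simulated z'   = quantile(z.sim, probs),
      'theoretical t' = qt(probs, df = N - p - 1));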

Part 2
The F statistic (3.13) has an F distribution with \(k = p_1 - p_0\) and \(l = N - p_1 - 1\). To see this, note that the definition for the F statistic looks like: \[ F = \dfrac{(RSS_0 - RSS_1) / (p_1 - p_0)}{RSS_1 / (N - p_1 - 1)} \] From this, we can follow a straightforward line of reasoning and express \(F\) in terms of two independent chi-squared distributions \(X \sim \chi^2(k)\) and \(Y \sim \chi^2(l)\) where \(k = p_1 - p_0\) and \(l = N - p_1 - 1\):

  • We know that \(RSS\) terms generally have chi-squared distributions that depend on the number of features \(p\): \[ RSS(\hat{\beta}) \sim \sigma^2 \chi^2(N - p - 1) \]
  • Since \(RSS_0\) has \(p_0\) features and \(RSS_1\) has \(p_1\) features, we can rewrite: \[ \begin{array}{rcl} RSS_0 & \sim & \sigma^2 \chi^2(N - p_0 - 1) \\ RSS_1 & \sim & \sigma^2 \chi^2(N - p_1 - 1) \end{array} \]
  • Furthermore, the sum of two independent chi-squared random variables is again chi-squared, with the degrees of freedom added. The difference behaves analogously provided the smaller variable and the difference are themselves independent, which holds here because the two models are nested (a consequence of Cochran’s theorem): \[ A \sim \chi^2(a) \text{ and } B \sim \chi^2(b) \\ \Downarrow \\ A \pm B \sim \chi^2(a \pm b) \]
  • Thus, we know that \(RSS_0 - RSS_1\) also has a chi-squared distribution: \[ \begin{array}{rcl} RSS_0 - RSS_1 & \sim & \sigma^2 \chi^2( (N - p_0 - 1) - (N - p_1 - 1)) \\ & \sim & \sigma^2 \chi^2(p_1 - p_0) \end{array} \]
  • Now we substitute our \(X\) and \(Y\) chi-squared variables defined above: \[ \begin{array}{rcl} RSS_0 - RSS_1 & = & \sigma^2 X \\ RSS_1 & = & \sigma^2 Y \end{array} \]
  • This yields a new F statistic: \[ \begin{array}{rcl} F & = & \dfrac{(RSS_0 - RSS_1) / (p_1 - p_0)}{RSS_1 / (N - p_1 - 1)} \\ & = & \dfrac{\sigma^2 X / (p_1 - p_0)}{\sigma^2 Y / (N - p_1 - 1)} \\ & = & \dfrac{X / k}{Y / l} \sim F(k, l) \end{array} \]

Thus, our F statistic has an F distribution as promised.

Part 3
If \(p_1 = p_0 + 1\) (we’re only getting rid of one parameter), then the F statistic is just the square of the z-score. We can prove this by comparing these two quantities directly.
First up, the square of the z-score: \[ \begin{array}{rcl} z_j & = & \dfrac{Z}{\sqrt{Y / l}} \\ z_j^2 & = & \dfrac{Z^2}{Y / l} \end{array} \] Here, we used the above results where \(Z \sim N(0, 1)\) and \(Y \sim \chi^2(l)\) with \(l = N - p_1 - 1\).
Next up, the F statistic for just one feature. Here, we note that: \[ \begin{array}{rcl} k & = & p_1 - p_0 \\ & = & (p_0 + 1) - p_0 \\ & = & 1 \end{array} \] This means the F statistic now looks like: \[ F = \dfrac{X / k}{Y / l} = \dfrac{X}{Y / l} \] Where, once again, we have \(X \sim \chi^2(k) = \chi^2(1)\) and \(Y \sim \chi^2(N - p_1 - 1)\).
But note that, in this case, \(Z^2 = X\) because squaring a standard normal variable yields a chi-squared variable with degrees of freedom \(1\). Thus: \[ F = \dfrac{X}{Y / l} = \dfrac{Z^2}{Y / l} = z_j^2 \] So the F statistic equals the square of the z-score (and both follow an \(F(1, l)\) distribution), which is exactly what we set out to show.
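We can also confirm this numerically with R’s built-in fitting tools on a small simulated data set of my own (the variable names below are arbitrary): the F statistic that `anova` reports for dropping a single coefficient matches the square of the t value that `summary` reports for that coefficient.

# Empirical check: the F statistic for dropping a single coefficient equals
# the square of that coefficient's t value
# (this is my own small simulated example; names are arbitrary)
set.seed(1);
n.ex  <- 100;
x1.ex <- rnorm(n.ex);
x2.ex <- rnorm(n.ex);
y.ex  <- 1 + 2 * x1.ex + 0.5 * x2.ex + rnorm(n.ex);

fit.full    <- lm(y.ex ~ x1.ex + x2.ex);   # Bigger model
fit.reduced <- lm(y.ex ~ x1.ex);           # Model with x2.ex dropped

t.x2 <- summary(fit.full)$coefficients['x2.ex', 't value'];
F.x2 <- anova(fit.reduced, fit.full)$F[2];

c('t squared' = t.x2^2, 'F statistic' = F.x2);   # The two values should agree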

Ex 3.2

Given data on two variables \(X\) and \(Y\), consider fitting a cubic polynomial regression model \(f(X) = \sum_{j = 0}^3 \beta_j X^j\). In addition to plotting the fitted curve, you would like a \(95\%\) confidence band around the curve. Consider the following two approaches:

  1. At each point \(x_0\), form a \(95\%\) confidence interval for the linear function \(a^{T}\beta = \sum_{j = 0}^3 \beta_j x_0^j\).
  2. Form a \(95\%\) confidence set for \(\beta\) as in (3.15), which in turn generates confidence intervals for \(f(x_0)\).

How do these approaches differ? Which band is likely to be wider? Conduct a small simulation experiment to compare the two methods.

Solution:
In general, this question asks about a cubic polynomial of the form: \[ \begin{array}{rcl} y & = & \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3 \\ & = & \begin{bmatrix} 1 & x & x^2 & x^3 \end{bmatrix} \cdot \begin{bmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \\ \beta_3 \end{bmatrix} \end{array} \]

Our goal is to describe and compare two approaches to building confidence intervals for our predicted function \(\hat{f}(x) = \hat{y} = X\hat{\beta} \ifref{OLS.9}\). The first method follows a pointwise approach to building confidence intervals; the second uses a global method.

Throughout the rest of this problem, we’ll work with the following cubic polynomial, so that we can get a concrete sense of how each method differs: \[ \begin{array}{rcl} y & = & 0.1x^3 -0.1x^2-3x+15 \\ & = & \begin{bmatrix} 1 & x & x^2 & x^3 \end{bmatrix} \cdot \begin{bmatrix} 15 \\ -3 \\ -0.1 \\ 0.1 \end{bmatrix} \end{array} \]

# In the following code, I use a few conventions to help keep things organized:
# - The ending `.lin` refers to variables derived from an equally spaced
#   sequence of x-values. There are always `L` of these.
# - The ending `.sim` refers to variables derived from randomly chosen
#   x-values. There are always `N` of these and they are simulated.
# - The ending `.hat` refers to estimates, such as `beta.hat` or `y.hat`.
# - The endings `.p` and `.g` refer to variables for finding the pointwise and
#   global confidence intervals, respectively.
    

# Number of linear points, sample points, and features
L <- 100;
N <- 20;
p <- 3;

# Define the beta matrix
beta <- matrix(
  c(15, -3, -0.1, 0.1),
  nrow = p + 1,
  ncol = 1
);

# Define the linear data matrix `X.lin`
x.lin <- seq(-10, 10, length.out = L);
X.lin <- matrix(
  c(rep.int(1, L), x.lin, x.lin*x.lin, x.lin*x.lin*x.lin),
  nrow = L,
  ncol = p + 1
);

# Get resulting `y.lin` from the true `beta`
y.lin <- X.lin %*% beta;

# Combine into a `data.frame`
data.lin <- data.frame('x' = x.lin, 'y' = y.lin);

# Graph the results
ggplot() + 
  geom_line(data = data.lin, 
             mapping = aes(x, y),
             color = brewer.pal(3, 'Set2')[2]) +
  ggtitle('Cubic Polynomial')

For both parts, we’ll take the following steps:

  • Set the true value of \(\beta = \begin{bmatrix} 15 & -3 & -0.1 & 0.1 \end{bmatrix}^{T}\).
  • Generate \(N = 20\) sample points with noise \(\epsilon \sim N(0, \sigma^2)\) where \(\sigma^2 = 100\). This creates a data matrix \(X\): \[ \begin{array}{rccl} X & = & \begin{bmatrix} 1 & x_1 & x_1^2 & x_1^3 \\ 1 & x_2 & x_2^2 & x_2^3 \\ \vdots & \vdots & \vdots & \vdots \\ 1 & x_{20} & x_{20}^2 & x_{20}^3 \end{bmatrix} & \fref{OLS.5} \\ & & 20 \times 4 \end{array} \]
  • Calculate \(\hat{\beta}\) using \(\ifref{OLS.8}\).
  • Afterwards we will determine the \(95\%\) confidence intervals/sets.

Part (1)
First up, we’ll take a pointwise approach to constructing confidence intervals. To do this, we’ll fix \(x\) at a constant but arbitrary value \(x_0\) and calculate our predicted \(\hat{y}_0\) at that point. Note that \(\hat{y}_0\) is actually a random variable because it is a scaled version of \(\hat{\beta}\): \[ \begin{array}{rcl} \hat{y}_0 & = & X_0 \hat{\beta} \\ & = & \begin{bmatrix} 1 & x_0 & x_0^2 & x_0^3 \end{bmatrix} \hat{\beta} \end{array} \] Since \(\hat{y}_0\) is a random variable, we can build a standard confidence interval for it by finding its variance: \[ \begin{array}{rcll} Var(\hat{y}_0) & = & Var(X_0 \hat{\beta}) & \\ & = & X_0 Var(\hat{\beta}) X_0^{T} & \fref{V.11} \\ & = & X_0 (X^{T}X)^{-1} \sigma^2 X_0^{T} & \fref{OLS.24} \\ Var(\hat{y}_0) & = & \sigma^2 \cdot X_0(X^{T}X)^{-1}X_0^{T} \end{array} \]

Next, we use the standard z-score to calculate our confidence intervals: \[ C = [ \; \hat{y}_0 - z^{(1-\alpha)} \sigma \sqrt{X_0(X^{T}X)^{-1}X_0^{T}}, \; \hat{y}_0 + z^{(1-\alpha)} \sigma \sqrt{X_0(X^{T}X)^{-1}X_0^{T}} \; ] \] The following graph shows this for our cubic polynomial. Notice that the pointwise confidence intervals depend on the local density of data: where the data are dense, the intervals are narrow; where the data are sparse, the intervals grow wider.

# Define the data matrix `X.sim`
x.sim <- runif(N, -10, 10);
X.sim <- matrix(
  c(rep.int(1, N), x.sim, x.sim*x.sim, x.sim*x.sim*x.sim),
  nrow = N,
  ncol = p + 1
);

# Define the noise
sigma <- 10;
epsilon <- rnorm(N, 0, sigma);

# Get resulting `y.sim` from true `beta`
y.sim <- X.sim %*% beta + epsilon;

# Calculate beta estimate `beta.hat` using simulated data
inv.sim <- solve( t(X.sim) %*% X.sim ); # Inverse of X'X
beta.hat <- inv.sim %*% t(X.sim) %*% y.sim;

# Calculate y estimate `y.hat.lin` using simulated data
y.hat.lin <- X.lin %*% beta.hat;

# Calculate upper 95% and lower 95% pointwise intervals
alpha     <- 0.05;
z_alpha   <- as.numeric( qnorm(1 - alpha/2) );
var.lin.p <- sqrt( diag(X.lin %*% inv.sim %*% t(X.lin)) ); # sqrt(X_0 (X'X)^{-1} X_0') at each x

y_lower.hat.lin.p <- y.hat.lin - z_alpha * sigma * var.lin.p;
y_upper.hat.lin.p <- y.hat.lin + z_alpha * sigma * var.lin.p;

# Combine into a `data.frame`
data.sim   <- data.frame('x' = x.sim, 'y' = y.sim);
data.lin.p <- rbind(
  data.frame('x' = x.lin, 'y' = y.hat.lin, 'Type' = 'PREDICT'),
  data.frame('x' = x.lin, 'y' = y_lower.hat.lin.p, 'Type' = 'LOWER'),
  data.frame('x' = x.lin, 'y' = y_upper.hat.lin.p, 'Type' = 'UPPER')
);

# Graph the results
ggplot() + 
  geom_point(data = data.sim, 
             mapping = aes(x, y),
             color = brewer.pal(3, 'Set2')[2]) +
  geom_line(data = data.lin.p,
            mapping = aes(x, y, color = Type)) +
  ggtitle('Polynomial Regression - Pointwise');

Part (2):
In the second part, we’ll take a global approach to estimating the confidence intervals for our cubic polynomial.

Ex 3.3

Gauss-Markov theorem:

Part (a)

Prove the Gauss-Markov theorem: the least squares estimate of a parameter \(a^{T}\beta\) has variance no bigger than that of any other linear unbiased estimate of \(a^{T}\beta\) (Section 3.2.2).

Preliminaries:
To prove the Gauss-Markov theorem, we’ll need a few preliminary tools. Make sure you understand each of the ideas below before continuing.

  • We can decompose \(y\) as \(y = X\beta + \epsilon\). This is by definition of the model.
  • The noise terms for our response variable have mean \(0\), so that \(\mathbb{E}(\epsilon) = 0\).
  • The noise terms for our response variable have constant variance \(\sigma^2\), so that \(Var(\epsilon) = \sigma^2\).
  • The matrix \((X^{T}X)\) and its inverse \((X^{T}X)^{-1}\) are symmetric, so that: \[ \begin{array}{rcl} (X^{T}X)^{T} & = & X^{T}X \\ \left( (X^{T}X)^{-1} \right)^{T} & = & (X^{T}X)^{-1} \end{array} \]
  • The Ordinary Least Squares (OLS) estimator for \(\beta\) is \(\hat{\beta} = (X^{T}X)^{-1}X^{T}y\). The authors derived this by using vector calculus to minimize the squared residuals.
  • We need to know that \(\mathbb{E}(\hat{\beta}) = \beta\), so that \(\hat{\beta}\) is unbiased. The authors simply stated this earlier, but it is easy to prove: \[ \begin{array}{rcl} \mathbb{E}(\hat{\beta}) & = & \mathbb{E}((X^{T}X)^{-1}X^{T}y) \\ & = & \mathbb{E}((X^{T}X)^{-1}X^{T} \cdot (X\beta + \epsilon)) \\ & = & \mathbb{E}((X^{T}X)^{-1}X^{T}X\beta + (X^{T}X)^{-1}X^{T}\epsilon) \\ & = & \mathbb{E}(\beta) + (X^{T}X)^{-1}X^{T}\mathbb{E}(\epsilon) \\ & = & \beta \\ \end{array} \]
  • We need to know that \(Var(\hat{\beta}) = (X^{T}X)^{-1} \cdot \sigma^{2}\). The authors glossed over this earlier, but a full proof can be found on the Wikipedia page for Ordinary Least Squares (OLS) Proofs.
  • If \(X\) is a random column vector and \(a\) is a constant column vector of the same size, then \(Var(a^{T}X) = a^{T} Var(X) a\). We can prove this by using the definition of a variance-covariance matrix: \[ Var(X) = \mathbb{E}((X-\mathbb{E}(X))\cdot(X - \mathbb{E}(X))^{T}) \] Using this definition on \(a^{T}X\), we can write: \[ \begin{array}{rcl} Var(a^{T}X) & = & \mathbb{E}((a^{T}X - \mathbb{E}(a^{T}X))\cdot(a^{T}X - \mathbb{E}(a^{T}X))^{T}) \\ & = & \mathbb{E}(a^{T}(X - \mathbb{E}(X))\cdot(X^{T}a - \mathbb{E}(X)^{T}a)) \\ & = & \mathbb{E}(a^{T}(X - \mathbb{E}(X))\cdot(X^{T} - \mathbb{E}(X)^{T})a) \\ & = & \mathbb{E}(a^{T}(X - \mathbb{E}(X))\cdot(X - \mathbb{E}(X))^{T}a) \\ & = & a^{T}\cdot\mathbb{E}((X - \mathbb{E}(X))\cdot(X - \mathbb{E}(X))^{T})\cdot a \\ & = & a^{T} Var(X) a \end{array} \] Here, I made liberal use of the facts that:
    • For any matrices \(A\) and \(B\), \((A + B)^{T} = A^{T} + B^{T}\)
    • Since \(a\) is a constant column vector, \(\mathbb{E}(a^{T}X) = a^{T}\mathbb{E}(X)\)

Proof:
To demonstrate the veracity of the Gauss-Markov theorem, we need to prove that: \[ Var(\hat{\theta}) \leq Var(\tilde{\theta}) \] Or, equivalently: \[ Var(a^{T}\hat{\beta}) \leq Var(c^{T}y) \] Let’s make a few definitions to clear things up.
First, note that our OLS estimator \(\hat{\theta}\) can be written fully as: \[ \begin{array}{rcl} \hat{\theta} & = & a^{T} \hat{\beta} \\ & = & a^{T} \cdot (X^{T}X)^{-1}X^{T}y \end{array} \] Second, we write our other arbitrary unbiased estimator \(c^{T}y\) as: \[ \begin{array}{rcl} \tilde{\theta} & = & c^{T}y \\ & = & (a^{T}(X^{T}X)^{-1}X^{T} + d^{T}) y \\ & = & a^{T}(X^{T}X)^{-1}X^{T}y + d^{T}y \\ & = & a^{T}\hat{\beta} + d^{T}y \\ & = & \hat{\theta} + d^{T}y \end{array} \] Here, we defined \(c^{T}\) as: \[ c^{T} = a^{T}(X^{T}X)^{-1}X^{T} + d^{T} \] This decomposition is always possible since, if we know both \(\hat{\theta}\) and \(\tilde{\theta}\), we can take their difference to find \(d^{T}y\). We can think of \(d^{T}y\) as a measure of how much our arbitrary estimator \(\tilde{\theta}\) deviates from our OLS estimator \(\hat{\theta}\).
We can find out more about the deviation term by using the fact that \(\tilde{\theta}\) is an unbiased estimator so that \(\mathbb{E}(\tilde{\theta}) = a^{T}\beta\). Thus, we write: \[ \begin{array}{rcl} \mathbb{E}(\tilde{\theta}) & = & \mathbb{E}(c^{T}y) \\ & = & \mathbb{E}(a^{T}\hat{\beta} + d^{T}y) \\ & = & a^{T}\beta + \mathbb{E}(d^{T}y) \end{array} \] We can decompose \(y\) using our model (\(y = X\beta + \epsilon\)) to simplify this expression further: \[ \begin{array}{rcl} \mathbb{E}(\tilde{\theta}) & = & a^{T}\beta + \mathbb{E}(d^{T} \cdot (X\beta + \epsilon)) \\ & = & a^{T}\beta + \mathbb{E}(d^{T}X\beta) + d^{T}\mathbb{E}(\epsilon) \\ & = & a^{T}\beta + \mathbb{E}(d^{T}X\beta) + 0 \\ & = & a^{T}\beta + d^{T}X\beta \\ \end{array} \] Since we know that \(\mathbb{E}(\tilde{\theta}) = a^{T}\beta\) no matter what the true \(\beta\) is (unbiasedness must hold for every \(\beta\)), this implies that: \[ d^{T}X\beta = 0 \text{ for all } \beta \implies d^{T}X = 0 \] While this result does not prove the theorem, we will make use of it soon.

Now, we look at the quantity \(Var(\tilde{\theta})\): \[ \begin{array}{rcl} Var(\tilde{\theta}) & = & Var(c^{T}y) \\ & = & c^{T} Var(y) c \\ & = & c^{T} Var(X\beta + \epsilon) c \\ & = & c^{T} Var(\epsilon) c \\ & = & c^{T} \sigma^2 c \\ & = & \sigma^2 \cdot c^{T} c \end{array} \] We can now use our earlier definition of \(c^{T}\) to expand this further. Since this derivation gets long, let’s simplify \(c^{T}c\) directly first: \[ \require{cancel} \begin{array}{rcl} c^{T}c & = & (a^{T}(X^{T}X)^{-1}X^{T} + d^{T}) (a^{T}(X^{T}X)^{-1}X^{T} + d^{T})^{T} \\ & = & (a^{T}(X^{T}X)^{-1}X^{T} + d^{T}) (X(X^{T}X)^{-1}a + d) \\ & = & a^{T}(X^{T}X)^{-1}X^{T}X(X^{T}X)^{-1}a + a^{T}(X^{T}X)^{-1}\cancelto{0}{X^{T}d} + \cancelto{0}{d^{T}X}(X^{T}X)^{-1}a + d^{T}d \\ & = & a^{T}(X^{T}X)^{-1}a + d^{T}d \end{array} \] Now we continue our earlier derivation: \[ \begin{array}{rcl} Var(\tilde{\theta}) & = & \sigma^2 \cdot c^{T}c \\ & = & \sigma^2 \cdot (a^{T} (X^{T}X)^{-1} a + d^{T}d) \\ & = & a^{T} \sigma^2 (X^{T}X)^{-1} a + \sigma^2 d^{T}d \\ & = & a^{T} Var(\hat{\beta}) a + \sigma^2 d^{T}d \\ & = & Var(a^{T}\hat{\beta}) + \sigma^2 d^{T}d \\ & = & Var(\hat{\theta}) + \sigma^2 d^{T}d \end{array} \] Since \(\sigma^2 \geq 0\) and \(d^{T}d \geq 0\) (it’s a dot product), we conclude that: \[ Var(\hat{\theta}) \leq Var(\tilde{\theta}) \] This proves the Gauss-Markov theorem.
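To see the theorem in action, here is a small numerical sketch of my own (it assumes the `X`, `sigma`, and `N` from the References section are in scope; the vector \(a\) and the random direction behind \(d\) are arbitrary choices). We build a competing linear unbiased estimator of \(a^{T}\beta\) by adding a deviation \(d\) with \(d^{T}X = 0\), and check that its variance is never smaller than that of the OLS estimator:

# Numerical illustration of the Gauss-Markov theorem
# (assumes `X`, `sigma`, and `N` from the References section are in scope;
#  `a` and the random direction behind `d` are arbitrary choices)
a <- c(1, 1);                               # The linear combination a' beta we estimate
H <- X %*% solve( t(X) %*% X ) %*% t(X);    # Hat matrix
d <- (diag(N) - H) %*% rnorm(N);            # Any d of this form satisfies d'X = 0

c.ols <- t(a) %*% solve( t(X) %*% X ) %*% t(X);   # OLS weights:       theta.hat   = c.ols %*% y
c.alt <- c.ols + t(d);                            # Competing weights: theta.tilde = c.alt %*% y

# Both estimators are unbiased for a' beta, but the competitor's variance is larger
c('Var OLS'         = sigma^2 * sum(c.ols^2),
  'Var alternative' = sigma^2 * sum(c.alt^2));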

Part (b)

The matrix inequality \(B \preceq A\) holds if \(A - B\) is positive semidefinite. Show that if \(\hat{V}\) is the variance-covariance matrix of least squares estimate of \(\beta\) and \(\tilde{V}\) is the variance-covariance matrix of any other linear unbiased estimate, then \(\hat{V} \preceq \tilde{V}\).