From 1d0823656021031df7b728c241293f899bc1c556 Mon Sep 17 00:00:00 2001 From: Jonathan Taylor <341611+jonathan-taylor@users.noreply.github.com> Date: Wed, 4 Feb 2026 17:48:37 -0800 Subject: [PATCH] Remove rmd (#77) * Ch2->Ch02 * removed Rmd; made a new Rmd target for make; described in README.md --- Ch02-statlearn-lab.Rmd | 1395 -------------------------- Ch02-statlearn-lab.ipynb | 5 + Ch03-linreg-lab.Rmd | 610 ------------ Ch04-classification-lab.Rmd | 1166 ---------------------- Ch05-resample-lab.Rmd | 556 ----------- Ch06-varselect-lab.Rmd | 928 ----------------- Ch07-nonlin-lab.Rmd | 900 ----------------- Ch08-baggboost-lab.Rmd | 558 ----------- Ch09-svm-lab.Rmd | 550 ----------- Ch10-deeplearning-lab.Rmd | 1858 ----------------------------------- Ch11-surv-lab.Rmd | 550 ----------- Ch12-unsup-lab.Rmd | 904 ----------------- Ch13-multiple-lab.Rmd | 574 ----------- README.md | 12 +- 14 files changed, 15 insertions(+), 10551 deletions(-) delete mode 100644 Ch02-statlearn-lab.Rmd delete mode 100644 Ch03-linreg-lab.Rmd delete mode 100644 Ch04-classification-lab.Rmd delete mode 100644 Ch05-resample-lab.Rmd delete mode 100644 Ch06-varselect-lab.Rmd delete mode 100644 Ch07-nonlin-lab.Rmd delete mode 100644 Ch08-baggboost-lab.Rmd delete mode 100644 Ch09-svm-lab.Rmd delete mode 100644 Ch10-deeplearning-lab.Rmd delete mode 100644 Ch11-surv-lab.Rmd delete mode 100644 Ch12-unsup-lab.Rmd delete mode 100644 Ch13-multiple-lab.Rmd diff --git a/Ch02-statlearn-lab.Rmd b/Ch02-statlearn-lab.Rmd deleted file mode 100644 index 261aa59..0000000 --- a/Ch02-statlearn-lab.Rmd +++ /dev/null @@ -1,1395 +0,0 @@ - - -# Introduction to Python - - -Open In Colab - - -[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/intro-stat-learning/ISLP_labs/v2.2.1?labpath=Ch02-statlearn-lab.ipynb) - - - - - - -## Getting Started - - -To run the labs in this book, you will need two things: - -* An installation of `Python3`, which is the specific version of `Python` used in the labs. -* Access to `Jupyter`, a very popular `Python` interface that runs code through a file called a *notebook*. - - -You can download and install `Python3` by following the instructions available at [anaconda.com](http://anaconda.com). - - - There are a number of ways to get access to `Jupyter`. Here are just a few: - - * Using Google's `Colaboratory` service: [colab.research.google.com/](https://colab.research.google.com/). - * Using `JupyterHub`, available at [jupyter.org/hub](https://jupyter.org/hub). - * Using your own `jupyter` installation. Installation instructions are available at [jupyter.org/install](https://jupyter.org/install). - -Please see the `Python` resources page on the book website [statlearning.com](https://www.statlearning.com) for up-to-date information about getting `Python` and `Jupyter` working on your computer. - -You will need to install the `ISLP` package, which provides access to the datasets and custom-built functions that we provide. -Inside a macOS or Linux terminal type `pip install ISLP`; this also installs most other packages needed in the labs. The `Python` resources page has a link to the `ISLP` documentation website. - -To run this lab, download the file `Ch02-statlearn-lab.ipynb` from the `Python` resources page. -Now run the following code at the command line: `jupyter lab Ch02-statlearn-lab.ipynb`. - -If you're using Windows, you can use the `start menu` to access `anaconda`, and follow the links. 
For example, to install `ISLP` and run this lab, you can run the same code above in an `anaconda` shell. - - - -## Basic Commands - - - -In this lab, we will introduce some simple `Python` commands. - For more resources about `Python` in general, readers may want to consult the tutorial at [docs.python.org/3/tutorial/](https://docs.python.org/3/tutorial/). - - - - - - -Like most programming languages, `Python` uses *functions* -to perform operations. To run a -function called `fun`, we type -`fun(input1,input2)`, where the inputs (or *arguments*) -`input1` and `input2` tell -`Python` how to run the function. A function can have any number of -inputs. For example, the -`print()` function outputs a text representation of all of its arguments to the console. - -```{python} -print('fit a model with', 11, 'variables') - -``` - - The following command will provide information about the `print()` function. - -```{python} -# print? - -``` - -Adding two integers in `Python` is pretty intuitive. - -```{python} -3 + 5 - -``` - -In `Python`, textual data is handled using -*strings*. For instance, `"hello"` and -`'hello'` -are strings. -We can concatenate them using the addition `+` symbol. - -```{python} -"hello" + " " + "world" - -``` - - A string is actually a type of *sequence*: this is a generic term for an ordered list. - The three most important types of sequences are lists, tuples, and strings. -We introduce lists now. - - -The following command instructs `Python` to join together -the numbers 3, 4, and 5, and to save them as a -*list* named `x`. When we -type `x`, it gives us back the list. - -```{python} -x = [3, 4, 5] -x - -``` - -Note that we used the brackets -`[]` to construct this list. - -We will often want to add two sets of numbers together. It is reasonable to try the following code, -though it will not produce the desired results. - -```{python} -y = [4, 9, 7] -x + y - -``` - -The result may appear slightly counterintuitive: why did `Python` not add the entries of the lists -element-by-element? - In `Python`, lists hold *arbitrary* objects, and are added using *concatenation*. - In fact, concatenation is the behavior that we saw earlier when we entered `"hello" + " " + "world"`. - - - -This example reflects the fact that - `Python` is a general-purpose programming language. Much of `Python`'s data-specific -functionality comes from other packages, notably `numpy` -and `pandas`. -In the next section, we will introduce the `numpy` package. -See [docs.scipy.org/doc/numpy/user/quickstart.html](https://docs.scipy.org/doc/numpy/user/quickstart.html) for more information about `numpy`. - - - -## Introduction to Numerical Python - -As mentioned earlier, this book makes use of functionality that is contained in the `numpy` - *library*, or *package*. A package is a collection of modules that are not necessarily included in - the base `Python` distribution. The name `numpy` is an abbreviation for *numerical Python*. - - - To access `numpy`, we must first `import` it. - -```{python} -import numpy as np -``` -In the previous line, we named the `numpy` *module* `np`; an abbreviation for easier referencing. - - -In `numpy`, an *array* is a generic term for a multidimensional -set of numbers. -We use the `np.array()` function to define `x` and `y`, which are one-dimensional arrays, i.e. vectors. 
- -```{python} -x = np.array([3, 4, 5]) -y = np.array([4, 9, 7]) -``` -Note that if you forgot to run the `import numpy as np` command earlier, then -you will encounter an error in calling the `np.array()` function in the previous line. - The syntax `np.array()` indicates that the function being called -is part of the `numpy` package, which we have abbreviated as `np`. - - -Since `x` and `y` have been defined using `np.array()`, we get a sensible result when we add them together. Compare this to our results in the previous section, - when we tried to add two lists without using `numpy`. - -```{python} -x + y -``` - - - - - -In `numpy`, matrices are typically represented as two-dimensional arrays, and vectors as one-dimensional arrays. {While it is also possible to create matrices using `np.matrix()`, we will use `np.array()` throughout the labs in this book.} -We can create a two-dimensional array as follows. - -```{python} -x = np.array([[1, 2], [3, 4]]) -x -``` - - - - - -The object `x` has several -*attributes*, or associated objects. To access an attribute of `x`, we type `x.attribute`, where we replace `attribute` -with the name of the attribute. -For instance, we can access the `ndim` attribute of `x` as follows. - -```{python} -x.ndim -``` - -The output indicates that `x` is a two-dimensional array. -Similarly, `x.dtype` is the *data type* attribute of the object `x`. This indicates that `x` is -comprised of 64-bit integers: - -```{python} -x.dtype -``` -Why is `x` comprised of integers? This is because we created `x` by passing in exclusively integers to the `np.array()` function. - If -we had passed in any decimals, then we would have obtained an array of -*floating point numbers* (i.e. real-valued numbers). - -```{python} -np.array([[1, 2], [3.0, 4]]).dtype - -``` - - -Typing `fun?` will cause `Python` to display -documentation associated with the function `fun`, if it exists. -We can try this for `np.array()`. - -```{python} -# np.array? - -``` -This documentation indicates that we could create a floating point array by passing a `dtype` argument into `np.array()`. - -```{python} -np.array([[1, 2], [3, 4]], float).dtype - -``` - - -The array `x` is two-dimensional. We can find out the number of rows and columns by looking -at its `shape` attribute. - -```{python} -x.shape - -``` - - -A *method* is a function that is associated with an -object. -For instance, given an array `x`, the expression -`x.sum()` sums all of its elements, using the `sum()` -method for arrays. -The call `x.sum()` automatically provides `x` as the -first argument to its `sum()` method. - -```{python} -x = np.array([1, 2, 3, 4]) -x.sum() -``` -We could also sum the elements of `x` by passing in `x` as an argument to the `np.sum()` function. - -```{python} -x = np.array([1, 2, 3, 4]) -np.sum(x) -``` - As another example, the -`reshape()` method returns a new array with the same elements as -`x`, but a different shape. - We do this by passing in a `tuple` in our call to - `reshape()`, in this case `(2, 3)`. This tuple specifies that we would like to create a two-dimensional array with -$2$ rows and $3$ columns. {Like lists, tuples represent a sequence of objects. Why do we need more than one way to create a sequence? There are a few differences between tuples and lists, but perhaps the most important is that elements of a tuple cannot be modified, whereas elements of a list can be.} - -In what follows, the -`\n` character creates a *new line*. 
- -```{python} -x = np.array([1, 2, 3, 4, 5, 6]) -print('beginning x:\n', x) -x_reshape = x.reshape((2, 3)) -print('reshaped x:\n', x_reshape) - -``` - -The previous output reveals that `numpy` arrays are specified as a sequence -of *rows*. This is called *row-major ordering*, as opposed to *column-major ordering*. - - -`Python` (and hence `numpy`) uses 0-based -indexing. This means that to access the top left element of `x_reshape`, -we type in `x_reshape[0,0]`. - -```{python} -x_reshape[0, 0] -``` -Similarly, `x_reshape[1,2]` yields the element in the second row and the third column -of `x_reshape`. - -```{python} -x_reshape[1, 2] -``` -Similarly, `x[2]` yields the -third entry of `x`. - -Now, let's modify the top left element of `x_reshape`. To our surprise, we discover that the first element of `x` has been modified as well! - - - -```{python} -print('x before we modify x_reshape:\n', x) -print('x_reshape before we modify x_reshape:\n', x_reshape) -x_reshape[0, 0] = 5 -print('x_reshape after we modify its top left element:\n', x_reshape) -print('x after we modify top left element of x_reshape:\n', x) - -``` - -Modifying `x_reshape` also modified `x` because the two objects occupy the same space in memory. - - - - - -We just saw that we can modify an element of an array. Can we also modify a tuple? It turns out that we cannot --- and trying to do so introduces -an *exception*, or error. - -```{python} -my_tuple = (3, 4, 5) -my_tuple[0] = 2 - -``` - - -We now briefly mention some attributes of arrays that will come in handy. An array's `shape` attribute contains its dimension; this is always a tuple. -The `ndim` attribute yields the number of dimensions, and `T` provides its transpose. - -```{python} -x_reshape.shape, x_reshape.ndim, x_reshape.T - -``` - -Notice that the three individual outputs `(2,3)`, `2`, and `array([[5, 4],[2, 5], [3,6]])` are themselves output as a tuple. - -We will often want to apply functions to arrays. -For instance, we can compute the -square root of the entries using the `np.sqrt()` function: - -```{python} -np.sqrt(x) - -``` - -We can also square the elements: - -```{python} -x**2 - -``` - -We can compute the square roots using the same notation, raising to the power of $1/2$ instead of 2. - -```{python} -x**0.5 - -``` - - -Throughout this book, we will often want to generate random data. -The `np.random.normal()` function generates a vector of random -normal variables. We can learn more about this function by looking at the help page, via a call to `np.random.normal?`. -The first line of the help page reads `normal(loc=0.0, scale=1.0, size=None)`. - This *signature* line tells us that the function's arguments are `loc`, `scale`, and `size`. These are *keyword* arguments, which means that when they are passed into - the function, they can be referred to by name (in any order). {`Python` also uses *positional* arguments. Positional arguments do not need to use a keyword. To see an example, type in `np.sum?`. We see that `a` is a positional argument, i.e. this function assumes that the first unnamed argument that it receives is the array to be summed. By contrast, `axis` and `dtype` are keyword arguments: the position in which these arguments are entered into `np.sum()` does not matter.} - By default, this function will generate random normal variable(s) with mean (`loc`) $0$ and standard deviation (`scale`) $1$; furthermore, - a single random variable will be generated unless the argument to `size` is changed. 
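
To see keyword arguments in action, the two calls below are a small illustrative sketch (our addition, not part of the original lab): both request three draws from a normal distribution with mean $10$ and standard deviation $2$, differing only in the order of the arguments.

```{python}
# keyword arguments may be supplied in any order; these calls are equivalent
# (the two printed draws differ because each call generates fresh random numbers)
print(np.random.normal(loc=10, scale=2, size=3))
print(np.random.normal(size=3, scale=2, loc=10))
```
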
- -We now generate 50 independent random variables from a $N(0,1)$ distribution. - -```{python} -x = np.random.normal(size=50) -x - -``` - -We create an array `y` by adding an independent $N(50,1)$ random variable to each element of `x`. - -```{python} -y = x + np.random.normal(loc=50, scale=1, size=50) -``` -The `np.corrcoef()` function computes the correlation matrix between `x` and `y`. The off-diagonal elements give the -correlation between `x` and `y`. - -```{python} -np.corrcoef(x, y) -``` - -If you're following along in your own `Jupyter` notebook, then you probably noticed that you got a different set of results when you ran the past few -commands. In particular, - each -time we call `np.random.normal()`, we will get a different answer, as shown in the following example. - -```{python} -print(np.random.normal(scale=5, size=2)) -print(np.random.normal(scale=5, size=2)) - -``` - - - -In order to ensure that our code provides exactly the same results -each time it is run, we can set a *random seed* -using the -`np.random.default_rng()` function. -This function takes an arbitrary, user-specified integer argument. If we set a random seed before -generating random data, then re-running our code will yield the same results. The -object `rng` has essentially all the random number generating methods found in `np.random`. Hence, to -generate normal data we use `rng.normal()`. - -```{python} -rng = np.random.default_rng(1303) -print(rng.normal(scale=5, size=2)) -rng2 = np.random.default_rng(1303) -print(rng2.normal(scale=5, size=2)) -``` - -Throughout the labs in this book, we use `np.random.default_rng()` whenever we -perform calculations involving random quantities within `numpy`. In principle, this -should enable the reader to exactly reproduce the stated results. However, as new versions of `numpy` become available, it is possible -that some small discrepancies may occur between the output -in the labs and the output -from `numpy`. - -The `np.mean()`, `np.var()`, and `np.std()` functions can be used -to compute the mean, variance, and standard deviation of arrays. These functions are also -available as methods on the arrays. - -```{python} -rng = np.random.default_rng(3) -y = rng.standard_normal(10) -np.mean(y), y.mean() -``` - - - -```{python} -np.var(y), y.var(), np.mean((y - y.mean())**2) -``` - - -Notice that by default `np.var()` divides by the sample size $n$ rather -than $n-1$; see the `ddof` argument in `np.var?`. - - -```{python} -np.sqrt(np.var(y)), np.std(y) -``` - -The `np.mean()`, `np.var()`, and `np.std()` functions can also be applied to the rows and columns of a matrix. -To see this, we construct a $10 \times 3$ matrix of $N(0,1)$ random variables, and consider computing its row sums. - -```{python} -X = rng.standard_normal((10, 3)) -X -``` - -Since arrays are row-major ordered, the first axis, i.e. `axis=0`, refers to its rows. We pass this argument into the `mean()` method for the object `X`. - -```{python} -X.mean(axis=0) -``` - -The following yields the same result. - -```{python} -X.mean(0) -``` - - - -## Graphics -In `Python`, common practice is to use the library -`matplotlib` for graphics. -However, since `Python` was not written with data analysis in mind, - the notion of plotting is not intrinsic to the language. -We will use the `subplots()` function -from `matplotlib.pyplot` to create a figure and the -axes onto which we plot our data. 
-For many more examples of how to make plots in `Python`, -readers are encouraged to visit [matplotlib.org/stable/gallery/](https://matplotlib.org/stable/gallery/index.html). - -In `matplotlib`, a plot consists of a *figure* and one or more *axes*. You can think of the figure as the blank canvas upon which -one or more plots will be displayed: it is the entire plotting window. -The *axes* contain important information about each plot, such as its $x$- and $y$-axis labels, -title, and more. (Note that in `matplotlib`, the word *axes* is not the plural of *axis*: a plot's *axes* contains much more information -than just the $x$-axis and the $y$-axis.) - -We begin by importing the `subplots()` function -from `matplotlib`. We use this function -throughout when creating figures. -The function returns a tuple of length two: a figure -object as well as the relevant axes object. We will typically -pass `figsize` as a keyword argument. -Having created our axes, we attempt our first plot using its `plot()` method. -To learn more about it, -type `ax.plot?`. - -```{python} -from matplotlib.pyplot import subplots -fig, ax = subplots(figsize=(8, 8)) -x = rng.standard_normal(100) -y = rng.standard_normal(100) -ax.plot(x, y); - -``` - -We pause here to note that we have *unpacked* the tuple of length two returned by `subplots()` into the two distinct -variables `fig` and `ax`. Unpacking -is typically preferred to the following equivalent but slightly more verbose code: - -```{python} -output = subplots(figsize=(8, 8)) -fig = output[0] -ax = output[1] -``` - -We see that our earlier cell produced a line plot, which is the default. To create a scatterplot, we provide an additional argument to `ax.plot()`, indicating that circles should be displayed. - -```{python} -fig, ax = subplots(figsize=(8, 8)) -ax.plot(x, y, 'o'); -``` -Different values -of this additional argument can be used to produce different colored lines -as well as different linestyles. - - - -As an alternative, we could use the `ax.scatter()` function to create a scatterplot. - -```{python} -fig, ax = subplots(figsize=(8, 8)) -ax.scatter(x, y, marker='o'); -``` - -Notice that in the code blocks above, we have ended -the last line with a semicolon. This prevents `ax.plot(x, y)` from printing -text to the notebook. However, it does not prevent a plot from being produced. - If we omit the trailing semi-colon, then we obtain the following output: - -```{python} -fig, ax = subplots(figsize=(8, 8)) -ax.scatter(x, y, marker='o') - -``` -In what follows, we will use - trailing semicolons whenever the text that would be output is not -germane to the discussion at hand. - - - - - - -To label our plot, we make use of the `set_xlabel()`, `set_ylabel()`, and `set_title()` methods -of `ax`. - - -```{python} -fig, ax = subplots(figsize=(8, 8)) -ax.scatter(x, y, marker='o') -ax.set_xlabel("this is the x-axis") -ax.set_ylabel("this is the y-axis") -ax.set_title("Plot of X vs Y"); -``` - - Having access to the figure object `fig` itself means that we can go in and change some aspects and then redisplay it. Here, we change - the size from `(8, 8)` to `(12, 3)`. - - -```{python} -fig.set_size_inches(12,3) -fig -``` - - - -Occasionally we will want to create several plots within a figure. This can be -achieved by passing additional arguments to `subplots()`. -Below, we create a $2 \times 3$ grid of plots -in a figure of size determined by the `figsize` argument. In such -situations, there is often a relationship between the axes in the plots. 
For example, -all plots may have a common $x$-axis. The `subplots()` function can automatically handle -this situation when passed the keyword argument `sharex=True`. -The `axes` object below is an array pointing to different plots in the figure. - -```{python} -fig, axes = subplots(nrows=2, - ncols=3, - figsize=(15, 5)) -``` -We now produce a scatter plot with `'o'` in the second column of the first row and -a scatter plot with `'+'` in the third column of the second row. - -```{python} -axes[0,1].plot(x, y, 'o') -axes[1,2].scatter(x, y, marker='+') -fig -``` -Type `subplots?` to learn more about -`subplots()`. - - - - - -To save the output of `fig`, we call its `savefig()` -method. The argument `dpi` is the dots per inch, used -to determine how large the figure will be in pixels. - -```{python} -fig.savefig("Figure.png", dpi=400) -fig.savefig("Figure.pdf", dpi=200); - -``` - - -We can continue to modify `fig` using step-by-step updates; for example, we can modify the range of the $x$-axis, re-save the figure, and even re-display it. - -```{python} -axes[0,1].set_xlim([-1,1]) -fig.savefig("Figure_updated.jpg") -fig -``` - -We now create some more sophisticated plots. The -`ax.contour()` method produces a *contour plot* -in order to represent three-dimensional data, similar to a -topographical map. It takes three arguments: - -* A vector of `x` values (the first dimension), -* A vector of `y` values (the second dimension), and -* A matrix whose elements correspond to the `z` value (the third -dimension) for each pair of `(x,y)` coordinates. - -To create `x` and `y`, we’ll use the command `np.linspace(a, b, n)`, -which returns a vector of `n` numbers starting at `a` and ending at `b`. - -```{python} -fig, ax = subplots(figsize=(8, 8)) -x = np.linspace(-np.pi, np.pi, 50) -y = x -f = np.multiply.outer(np.cos(y), 1 / (1 + x**2)) -ax.contour(x, y, f); - -``` -We can increase the resolution by adding more levels to the image. - -```{python} -fig, ax = subplots(figsize=(8, 8)) -ax.contour(x, y, f, levels=45); -``` -To fine-tune the output of the -`ax.contour()` function, take a -look at the help file by typing `ax.contour?`. - -The `ax.imshow()` method is similar to -`ax.contour()`, except that it produces a color-coded plot -whose colors depend on the `z` value. This is known as a -*heatmap*, and is sometimes used to plot temperature in -weather forecasts. - -```{python} -fig, ax = subplots(figsize=(8, 8)) -ax.imshow(f); - -``` - - -## Sequences and Slice Notation - - -As seen above, the -function `np.linspace()` can be used to create a sequence -of numbers. - -```{python} -seq1 = np.linspace(0, 10, 11) -seq1 - -``` - - -The function `np.arange()` - returns a sequence of numbers spaced out by `step`. If `step` is not specified, then a default value of $1$ is used. Let's create a sequence - that starts at $0$ and ends at $10$. - -```{python} -seq2 = np.arange(0, 10) -seq2 - -``` - -Why isn't $10$ output above? This has to do with *slice* notation in `Python`. -Slice notation -is used to index sequences such as lists, tuples and arrays. -Suppose we want to retrieve the fourth through sixth (inclusive) entries -of a string. We obtain a slice of the string using the indexing notation `[3:6]`. - -```{python} -"hello world"[3:6] -``` -In the code block above, the notation `3:6` is shorthand for `slice(3,6)` when used inside -`[]`. 
- -```{python} -"hello world"[slice(3,6)] - -``` - -You might have expected `slice(3,6)` to output the fourth through seventh characters in the text string (recalling that `Python` begins its indexing at zero), but instead it output the fourth through sixth. - This also explains why the earlier `np.arange(0, 10)` command output only the integers from $0$ to $9$. -See the documentation `slice?` for useful options in creating slices. - - - - - - - - - - - - - - - - - - - - - - - -## Indexing Data -To begin, we create a two-dimensional `numpy` array. - -```{python} -A = np.array(np.arange(16)).reshape((4, 4)) -A - -``` - -Typing `A[1,2]` retrieves the element corresponding to the second row and third -column. (As usual, `Python` indexes from $0.$) - -```{python} -A[1,2] - -``` - -The first number after the open-bracket symbol `[` - refers to the row, and the second number refers to the column. - -### Indexing Rows, Columns, and Submatrices - To select multiple rows at a time, we can pass in a list - specifying our selection. For instance, `[1,3]` will retrieve the second and fourth rows: - -```{python} -A[[1,3]] - -``` - -To select the first and third columns, we pass in `[0,2]` as the second argument in the square brackets. -In this case we need to supply the first argument `:` -which selects all rows. - -```{python} -A[:,[0,2]] - -``` - -Now, suppose that we want to select the submatrix made up of the second and fourth -rows as well as the first and third columns. This is where -indexing gets slightly tricky. It is natural to try to use lists to retrieve the rows and columns: - -```{python} -A[[1,3],[0,2]] - -``` - - Oops --- what happened? We got a one-dimensional array of length two identical to - -```{python} -np.array([A[1,0],A[3,2]]) - -``` - - Similarly, the following code fails to extract the submatrix comprised of the second and fourth rows and the first, third, and fourth columns: - -```{python} -A[[1,3],[0,2,3]] - -``` - -We can see what has gone wrong here. When supplied with two indexing lists, the `numpy` interpretation is that these provide pairs of $i,j$ indices for a series of entries. That is why the pair of lists must have the same length. However, that was not our intent, since we are looking for a submatrix. - -One easy way to do this is as follows. We first create a submatrix by subsetting the rows of `A`, and then on the fly we make a further submatrix by subsetting its columns. - - -```{python} -A[[1,3]][:,[0,2]] - -``` - - - -There are more efficient ways of achieving the same result. - -The *convenience function* `np.ix_()` allows us to extract a submatrix -using lists, by creating an intermediate *mesh* object. - -```{python} -idx = np.ix_([1,3],[0,2,3]) -A[idx] - -``` - - -Alternatively, we can subset matrices efficiently using slices. - -The slice -`1:4:2` captures the second and fourth items of a sequence, while the slice `0:3:2` captures -the first and third items (the third element in a slice sequence is the step size). - -```{python} -A[1:4:2,0:3:2] - -``` - - - -Why are we able to retrieve a submatrix directly using slices but not using lists? -It's because they are different `Python` types, and -are treated differently by `numpy`. -Slices can be used to extract objects from arbitrary sequences, such as strings, lists, and tuples, while the use of lists for indexing is more limited. - - - - - - - - - - - - - -### Boolean Indexing -In `numpy`, a *Boolean* is a type that equals either `True` or `False` (also represented as $1$ and $0$, respectively). 
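
A common way to produce Booleans is elementwise comparison: applying a comparison operator to an array returns a Boolean array of the same shape. As a quick sketch (our addition), reusing the matrix `A` defined above:

```{python}
# each entry records whether the corresponding entry of A exceeds 5
A > 5
```
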
-The next line creates a vector of $0$'s, represented as Booleans, of length equal to the first dimension of `A`. - -```{python} -keep_rows = np.zeros(A.shape[0], bool) -keep_rows -``` -We now set two of the elements to `True`. - -```{python} -keep_rows[[1,3]] = True -keep_rows - -``` - -Note that the elements of `keep_rows`, when viewed as integers, are the same as the -values of `np.array([0,1,0,1])`. Below, we use `==` to verify their equality. When -applied to two arrays, the `==` operation is applied elementwise. - -```{python} -np.all(keep_rows == np.array([0,1,0,1])) - -``` - -(Here, the function `np.all()` has checked whether -all entries of an array are `True`. A similar function, `np.any()`, can be used to check whether any entries of an array are `True`.) - - - However, even though `np.array([0,1,0,1])` and `keep_rows` are equal according to `==`, they index different sets of rows! -The former retrieves the first, second, first, and second rows of `A`. - -```{python} -A[np.array([0,1,0,1])] - -``` - - By contrast, `keep_rows` retrieves only the second and fourth rows of `A` --- i.e. the rows for which the Boolean equals `True`. - -```{python} -A[keep_rows] - -``` - -This example shows that Booleans and integers are treated differently by `numpy`. - - -We again make use of the `np.ix_()` function - to create a mesh containing the second and fourth rows, and the first, third, and fourth columns. This time, we apply the function to Booleans, - rather than lists. - -```{python} -keep_cols = np.zeros(A.shape[1], bool) -keep_cols[[0, 2, 3]] = True -idx_bool = np.ix_(keep_rows, keep_cols) -A[idx_bool] - -``` - -We can also mix a list with an array of Booleans in the arguments to `np.ix_()`: - -```{python} -idx_mixed = np.ix_([1,3], keep_cols) -A[idx_mixed] - -``` - - - -For more details on indexing in `numpy`, readers are referred -to the `numpy` tutorial mentioned earlier. - - - -## Loading Data - -Data sets often contain different types of data, and may have names associated with the rows or columns. -For these reasons, they typically are best accommodated using a - *data frame*. - We can think of a data frame as a sequence -of arrays of identical length; these are the columns. Entries in the -different arrays can be combined to form a row. - The `pandas` -library can be used to create and work with data frame objects. - - -### Reading in a Data Set - -The first step of most analyses involves importing a data set into -`Python`. - Before attempting to load -a data set, we must make sure that `Python` knows where to find the file containing it. -If the -file is in the same location -as this notebook file, then we are all set. -Otherwise, -the command -`os.chdir()` can be used to *change directory*. (You will need to call `import os` before calling `os.chdir()`.) - - -We will begin by reading in `Auto.csv`, available on the book website. This is a comma-separated file, and can be read in using `pd.read_csv()`: - -```{python} -import pandas as pd -Auto = pd.read_csv('Auto.csv') -Auto - -``` - -The book website also has a whitespace-delimited version of this data, called `Auto.data`. This can be read in as follows: - -```{python} -Auto = pd.read_csv("Auto.data", sep="\\s+") - -``` - - Both `Auto.csv` and `Auto.data` are simply text -files. Before loading data into `Python`, it is a good idea to view it using -a text editor or other software, such as Microsoft Excel. 
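
If you prefer to stay inside the notebook, a minimal sketch such as the following prints the first few lines of the raw file (this assumes, as do the `pd.read_csv()` calls above, that `Auto.csv` sits in the current working directory):

```{python}
# peek at the first three lines of the file to check its format
with open('Auto.csv') as f:
    for _ in range(3):
        print(f.readline(), end='')
```
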
- - - - -We now take a look at the column of `Auto` corresponding to the variable `horsepower`: - -```{python} -Auto['horsepower'] - -``` -We see that the `dtype` of this column is `object`. -It turns out that all values of the `horsepower` column were interpreted as strings when reading -in the data. -We can find out why by looking at the unique values. - -```{python} -np.unique(Auto['horsepower']) - -``` -We see the culprit is the value `?`, which is being used to encode missing values. - - - - -To fix the problem, we must provide `pd.read_csv()` with an argument called `na_values`. -Now, each instance of `?` in the file is replaced with the -value `np.nan`, which means *not a number*: - -```{python} -Auto = pd.read_csv('Auto.data', - na_values=['?'], - sep="\\s+") -Auto['horsepower'].sum() - -``` - -The `Auto.shape` attribute tells us that the data has 397 -observations, or rows, and nine variables, or columns. - -```{python} -Auto.shape - -``` - -There are -various ways to deal with missing data. -In this case, since only five of the rows contain missing -observations, we choose to use the `Auto.dropna()` method to simply remove these rows. - -```{python} -Auto_new = Auto.dropna() -Auto_new.shape - -``` - - -### Basics of Selecting Rows and Columns - -We can use `Auto.columns` to check the variable names. - -```{python} -Auto = Auto_new # overwrite the previous value -Auto.columns - -``` - - -Accessing the rows and columns of a data frame is similar, but not identical, to accessing the rows and columns of an array. -Recall that the first argument to the `[]` method -is always applied to the rows of the array. -Similarly, -passing in a slice to the `[]` method creates a data frame whose *rows* are determined by the slice: - -```{python} -Auto[:3] - -``` -Similarly, an array of Booleans can be used to subset the rows: - -```{python} -idx_80 = Auto['year'] > 80 -Auto[idx_80] - -``` -However, if we pass in a list of strings to the `[]` method, then we obtain a data frame containing the corresponding set of *columns*. - -```{python} -Auto[['mpg', 'horsepower']] - -``` -Since we did not specify an *index* column when we loaded our data frame, the rows are labeled using integers -0 to 396. - -```{python} -Auto.index - -``` -We can use the -`set_index()` method to re-name the rows using the contents of `Auto['name']`. - -```{python} -Auto_re = Auto.set_index('name') -Auto_re - -``` - -```{python} -Auto_re.columns - -``` -We see that the column `'name'` is no longer there. - -Now that the index has been set to `name`, we can access rows of the data -frame by `name` using the `{loc[]`} method of -`Auto`: - -```{python} -rows = ['amc rebel sst', 'ford torino'] -Auto_re.loc[rows] - -``` -As an alternative to using the index name, we could retrieve the 4th and 5th rows of `Auto` using the `{iloc[]`} method: - -```{python} -Auto_re.iloc[[3,4]] - -``` -We can also use it to retrieve the 1st, 3rd and and 4th columns of `Auto_re`: - -```{python} -Auto_re.iloc[:,[0,2,3]] - -``` -We can extract the 4th and 5th rows, as well as the 1st, 3rd and 4th columns, using -a single call to `iloc[]`: - -```{python} -Auto_re.iloc[[3,4],[0,2,3]] - -``` -Index entries need not be unique: there are several cars in the data frame named `ford galaxie 500`. - -```{python} -Auto_re.loc['ford galaxie 500', ['mpg', 'origin']] - -``` -### More on Selecting Rows and Columns -Suppose now that we want to create a data frame consisting of the `weight` and `origin` of the subset of cars with -`year` greater than 80 --- i.e. 
those built after 1980. -To do this, we first create a Boolean array that indexes the rows. -The `loc[]` method allows for Boolean entries as well as strings: - -```{python} -idx_80 = Auto_re['year'] > 80 -Auto_re.loc[idx_80, ['weight', 'origin']] - -``` - - -To do this more concisely, we can use an anonymous function called a `lambda`: - -```{python} -Auto_re.loc[lambda df: df['year'] > 80, ['weight', 'origin']] - -``` -The `lambda` call creates a function that takes a single -argument, here `df`, and returns `df['year']>80`. -Since it is created inside the `loc[]` method for the -dataframe `Auto_re`, that dataframe will be the argument supplied. -As another example of using a `lambda`, suppose that -we want all cars built after 1980 that achieve greater than 30 miles per gallon: - -```{python} -Auto_re.loc[lambda df: (df['year'] > 80) & (df['mpg'] > 30), - ['weight', 'origin'] - ] - -``` -The symbol `&` computes an element-wise *and* operation. -As another example, suppose that we want to retrieve all `Ford` and `Datsun` -cars with `displacement` less than 300. We check whether each `name` entry contains either the string `ford` or `datsun` using the `str.contains()` method of the `index` attribute of -the dataframe: - -```{python} -Auto_re.loc[lambda df: (df['displacement'] < 300) - & (df.index.str.contains('ford') - | df.index.str.contains('datsun')), - ['weight', 'origin'] - ] - -``` -Here, the symbol `|` computes an element-wise *or* operation. - -In summary, a powerful set of operations is available to index the rows and columns of data frames. For integer based queries, use the `iloc[]` method. For string and Boolean -selections, use the `loc[]` method. For functional queries that filter rows, use the `loc[]` method -with a function (typically a `lambda`) in the rows argument. - -## For Loops -A `for` loop is a standard tool in many languages that -repeatedly evaluates some chunk of code while -varying different values inside the code. -For example, suppose we loop over elements of a list and compute their sum. - -```{python} -total = 0 -for value in [3,2,19]: - total += value -print('Total is: {0}'.format(total)) - -``` -The indented code beneath the line with the `for` statement is run -for each value in the sequence -specified in the `for` statement. The loop ends either -when the cell ends or when code is indented at the same level -as the original `for` statement. -We see that the final line above which prints the total is executed -only once after the for loop has terminated. Loops -can be nested by additional indentation. - -```{python} -total = 0 -for value in [2,3,19]: - for weight in [3, 2, 1]: - total += value * weight -print('Total is: {0}'.format(total)) -``` -Above, we summed over each combination of `value` and `weight`. -We also took advantage of the *increment* notation -in `Python`: the expression `a += b` is equivalent -to `a = a + b`. Besides -being a convenient notation, this can save time in computationally -heavy tasks in which the intermediate value of `a+b` need not -be explicitly created. - -Perhaps a more -common task would be to sum over `(value, weight)` pairs. For instance, -to compute the average value of a random variable that takes on -possible values 2, 3 or 19 with probability 0.2, 0.3, 0.5 respectively -we would compute the weighted sum. Tasks such as this -can often be accomplished using the `zip()` function that -loops over a sequence of tuples. 
- -```{python} -total = 0 -for value, weight in zip([2,3,19], - [0.2,0.3,0.5]): - total += weight * value -print('Weighted average is: {0}'.format(total)) - -``` - -### String Formatting -In the code chunk above we also printed a string -displaying the total. However, the object `total` -is an integer and not a string. -Inserting the value of something into -a string is a common task, made -simple using -some of the powerful string formatting -tools in `Python`. -Many data cleaning tasks involve -manipulating and programmatically -producing strings. - -For example we may want to loop over the columns of a data frame and -print the percent missing in each column. -Let’s create a data frame `D` with columns in which 20% of the entries are missing i.e. set -to `np.nan`. We’ll create the -values in `D` from a normal distribution with mean 0 and variance 1 using `rng.standard_normal()` -and then overwrite some random entries using `rng.choice()`. - -```{python} -rng = np.random.default_rng(1) -A = rng.standard_normal((127, 5)) -M = rng.choice([0, np.nan], p=[0.8,0.2], size=A.shape) -A += M -D = pd.DataFrame(A, columns=['food', - 'bar', - 'pickle', - 'snack', - 'popcorn']) -D[:3] - -``` - - -```{python} -for col in D.columns: - template = 'Column "{0}" has {1:.2%} missing values' - print(template.format(col, - np.isnan(D[col]).mean())) - -``` -We see that the `template.format()` method expects two arguments `{0}` -and `{1:.2%}`, and the latter includes some formatting -information. In particular, it specifies that the second argument should be expressed as a percent with two decimal digits. - -The reference -[docs.python.org/3/library/string.html](https://docs.python.org/3/library/string.html) -includes many helpful and more complex examples. - - -## Additional Graphical and Numerical Summaries -We can use the `ax.plot()` or `ax.scatter()` functions to display the quantitative variables. However, simply typing the variable names will produce an error message, -because `Python` does not know to look in the `Auto` data set for those variables. - -```{python} -fig, ax = subplots(figsize=(8, 8)) -ax.plot(horsepower, mpg, 'o') -``` -We can address this by accessing the columns directly: - -```{python} -fig, ax = subplots(figsize=(8, 8)) -ax.plot(Auto['horsepower'], Auto['mpg'], 'o'); -``` -Alternatively, we can use the `plot()` method with the call `Auto.plot()`. -Using this method, -the variables can be accessed by name. -The plot methods of a data frame return a familiar object: -an axes. We can use it to update the plot as we did previously: - -```{python} -ax = Auto.plot.scatter('horsepower', 'mpg') -ax.set_title('Horsepower vs. MPG'); -``` -If we want to save -the figure that contains a given axes, we can find the relevant figure -by accessing the `figure` attribute: - -```{python} -fig = ax.figure -fig.savefig('horsepower_mpg.png'); -``` - -We can further instruct the data frame to plot to a particular axes object. In this -case the corresponding `plot()` method will return the -modified axes we passed in as an argument. Note that -when we request a one-dimensional grid of plots, the object `axes` is similarly -one-dimensional. We place our scatter plot in the middle plot of a row of three plots -within a figure. - -```{python} -fig, axes = subplots(ncols=3, figsize=(15, 5)) -Auto.plot.scatter('horsepower', 'mpg', ax=axes[1]); - -``` - -Note also that the columns of a data frame can be accessed as attributes: try typing in `Auto.horsepower`. - - -We now consider the `cylinders` variable. 
Typing in `Auto.cylinders.dtype` reveals that it is being treated as a quantitative variable. -However, since there is only a small number of possible values for this variable, we may wish to treat it as - qualitative. Below, we replace -the `cylinders` column with a categorical version of `Auto.cylinders`. The function `pd.Series()` owes its name to the fact that `pandas` is often used in time series applications. - -```{python} -Auto.cylinders = pd.Series(Auto.cylinders, dtype='category') -Auto.cylinders.dtype - -``` - Now that `cylinders` is qualitative, we can display it using - the `boxplot()` method. - -```{python} -fig, ax = subplots(figsize=(8, 8)) -Auto.boxplot('mpg', by='cylinders', ax=ax); - -``` - -The `hist()` method can be used to plot a *histogram*. - -```{python} -fig, ax = subplots(figsize=(8, 8)) -Auto.hist('mpg', ax=ax); - -``` -The color of the bars and the number of bins can be changed: - -```{python} -fig, ax = subplots(figsize=(8, 8)) -Auto.hist('mpg', color='red', bins=12, ax=ax); - -``` - See `Auto.hist?` for more plotting -options. - -We can use the `pd.plotting.scatter_matrix()` function to create a *scatterplot matrix* to visualize all of the pairwise relationships between the columns in -a data frame. - -```{python} -pd.plotting.scatter_matrix(Auto); - -``` - We can also produce scatterplots -for a subset of the variables. - -```{python} -pd.plotting.scatter_matrix(Auto[['mpg', - 'displacement', - 'weight']]); - -``` -The `describe()` method produces a numerical summary of each column in a data frame. - -```{python} -Auto[['mpg', 'weight']].describe() - -``` -We can also produce a summary of just a single column. - -```{python} -Auto['cylinders'].describe() -Auto['mpg'].describe() - -``` - -```{python} - -``` diff --git a/Ch02-statlearn-lab.ipynb b/Ch02-statlearn-lab.ipynb index fef989c..686975a 100644 --- a/Ch02-statlearn-lab.ipynb +++ b/Ch02-statlearn-lab.ipynb @@ -8767,6 +8767,11 @@ "language": "python", "name": "python3" }, + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, "language_info": { "codemirror_mode": { "name": "ipython", diff --git a/Ch03-linreg-lab.Rmd b/Ch03-linreg-lab.Rmd deleted file mode 100644 index 03c2c36..0000000 --- a/Ch03-linreg-lab.Rmd +++ /dev/null @@ -1,610 +0,0 @@ - - -# Linear Regression - - -Open In Colab - - -[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/intro-stat-learning/ISLP_labs/v2.2.1?labpath=Ch03-linreg-lab.ipynb) - - - -## Importing packages -We import our standard libraries at this top -level. - -```{python} -import numpy as np -import pandas as pd -from matplotlib.pyplot import subplots - -``` - - -### New imports -Throughout this lab we will introduce new functions and libraries. However, -we will import them here to emphasize these are the new -code objects in this lab. Keeping imports near the top -of a notebook makes the code more readable, since scanning the first few -lines tells us what libraries are used. - -```{python} -import statsmodels.api as sm - -``` - We will provide relevant details about the -functions below as they are needed. - -Besides importing whole modules, it is also possible -to import only a few items from a given module. This -will help keep the *namespace* clean. -We will use a few specific objects from the `statsmodels` package -which we import here. 
- -```{python} -from statsmodels.stats.outliers_influence \ - import variance_inflation_factor as VIF -from statsmodels.stats.anova import anova_lm - -``` - -As one of the import statements above is quite a long line, we inserted a line break `\` to -ease readability. - -We will also use some functions written for the labs in this book in the `ISLP` -package. - -```{python} -from ISLP import load_data -from ISLP.models import (ModelSpec as MS, - summarize, - poly) - -``` - -### Inspecting Objects and Namespaces -The -function `dir()` -provides a list of -objects in a namespace. - -```{python} -dir() - -``` - This shows you everything that `Python` can find at the top level. -There are certain objects like `__builtins__` that contain references to built-in -functions like `print()`. - -Every python object has its own notion of -namespace, also accessible with `dir()`. This will include -both the attributes of the object -as well as any methods associated with it. For instance, we see `'sum'` in the listing for an -array. - -```{python} -A = np.array([3,5,11]) -dir(A) - -``` - This indicates that the object `A.sum` exists. In this case it is a method -that can be used to compute the sum of the array `A` as can be seen by typing `A.sum?`. - -```{python} -A.sum() - -``` - - - -## Simple Linear Regression -In this section we will construct model -matrices (also called design matrices) using the `ModelSpec()` transform from `ISLP.models`. - -We will use the `Boston` housing data set, which is contained in the `ISLP` package. The `Boston` dataset records `medv` (median house value) for $506$ neighborhoods -around Boston. We will build a regression model to predict `medv` using $13$ -predictors such as `rm` (average number of rooms per house), - `age` (proportion of owner-occupied units built prior to 1940), and `lstat` (percent of -households with low socioeconomic status). We will use `statsmodels` for this -task, a `Python` package that implements several commonly used -regression methods. - -We have included a simple loading function `load_data()` in the -`ISLP` package: - -```{python} -Boston = load_data("Boston") -Boston.columns - -``` - -Type `Boston?` to find out more about these data. - -We start by using the `sm.OLS()` function to fit a -simple linear regression model. Our response will be - `medv` and `lstat` will be the single predictor. -For this model, we can create the model matrix by hand. - - -```{python} -X = pd.DataFrame({'intercept': np.ones(Boston.shape[0]), - 'lstat': Boston['lstat']}) -X[:4] - -``` - -We extract the response, and fit the model. - -```{python} -y = Boston['medv'] -model = sm.OLS(y, X) -results = model.fit() - -``` -Note that `sm.OLS()` does -not fit the model; it specifies the model, and then `model.fit()` does the actual fitting. - -Our `ISLP` function `summarize()` produces a simple table of the parameter estimates, -their standard errors, t-statistics and p-values. -The function takes a single argument, such as the object `results` -returned here by the `fit` -method, and returns such a summary. - -```{python} -summarize(results) - -``` - - -Before we describe other methods for working with fitted models, we outline a more useful and general framework for constructing a model matrix~`X`. -### Using Transformations: Fit and Transform -Our model above has a single predictor, and constructing `X` was straightforward. -In practice we often fit models with more than one predictor, typically selected from an array or data frame. 
-We may wish to introduce transformations to the variables before fitting the model, specify interactions between variables, and expand some particular variables into sets of variables (e.g. polynomials). -The `sklearn` package has a particular notion -for this type of task: a *transform*. A transform is an object -that is created with some parameters as arguments. The -object has two main methods: `fit()` and `transform()`. - -We provide a general approach for specifying models and constructing -the model matrix through the transform `ModelSpec()` in the `ISLP` library. -`ModelSpec()` -(renamed `MS()` in the preamble) creates a -transform object, and then a pair of methods -`transform()` and `fit()` are used to construct a -corresponding model matrix. - -We first describe this process for our simple regression model using a single predictor `lstat` in -the `Boston` data frame, but will use it repeatedly in more -complex tasks in this and other labs in this book. -In our case the transform is created by the expression -`design = MS(['lstat'])`. - -The `fit()` method takes the original array and may do some -initial computations on it, as specified in the transform object. -For example, it may compute means and standard deviations for centering and scaling. -The `transform()` -method applies the fitted transformation to the array of data, and produces the model matrix. - - -```{python} -design = MS(['lstat']) -design = design.fit(Boston) -X = design.transform(Boston) -X[:4] -``` -In this simple case, the `fit()` method does very little; it simply checks that the variable `'lstat'` specified in `design` exists in `Boston`. Then `transform()` constructs the model matrix with two columns: an `intercept` and the variable `lstat`. - -These two operations can be combined with the -`fit_transform()` method. - -```{python} -design = MS(['lstat']) -X = design.fit_transform(Boston) -X[:4] -``` -Note that, as in the previous code chunk when the two steps were done separately, the `design` object is changed as a result of the `fit()` operation. The power of this pipeline will become clearer when we fit more complex models that involve interactions and transformations. - - -Let's return to our fitted regression model. -The object -`results` has several methods that can be used for inference. -We already presented a function `summarize()` for showing the essentials of the fit. -For a full and somewhat exhaustive summary of the fit, we can use the `summary()` -method. - -```{python} -results.summary() - -``` - -The fitted coefficients can also be retrieved as the -`params` attribute of `results`. - -```{python} -results.params - -``` - - -The `get_prediction()` method can be used to obtain predictions, and produce confidence intervals and -prediction intervals for the prediction of `medv` for given values of `lstat`. - -We first create a new data frame, in this case containing only the variable `lstat`, with the values for this variable at which we wish to make predictions. -We then use the `transform()` method of `design` to create the corresponding model matrix. - -```{python} -new_df = pd.DataFrame({'lstat':[5, 10, 15]}) -newX = design.transform(new_df) -newX - -``` - -Next we compute the predictions at `newX`, and view them by extracting the `predicted_mean` attribute. - -```{python} -new_predictions = results.get_prediction(newX); -new_predictions.predicted_mean - -``` -We can produce confidence intervals for the predicted values. 
- -```{python} -new_predictions.conf_int(alpha=0.05) - -``` -Prediction intervals are computed by setting `obs=True`: - -```{python} -new_predictions.conf_int(obs=True, alpha=0.05) - -``` - For instance, the 95% confidence interval associated with an - `lstat` value of 10 is (24.47, 25.63), and the 95% prediction -interval is (12.82, 37.28). As expected, the confidence and -prediction intervals are centered around the same point (a predicted -value of 25.05 for `medv` when `lstat` equals -10), but the latter are substantially wider. - -Next we will plot `medv` and `lstat` -using `DataFrame.plot.scatter()`, \definelongblankMR{plot.scatter()}{plot.slashslashscatter()} -and wish to -add the regression line to the resulting plot. - - -### Defining Functions -While there is a function -within the `ISLP` package that adds a line to an existing plot, we take this opportunity -to define our first function to do so. - -```{python} -def abline(ax, b, m): - "Add a line with slope m and intercept b to ax" - xlim = ax.get_xlim() - ylim = [m * xlim[0] + b, m * xlim[1] + b] - ax.plot(xlim, ylim) - -``` - A few things are illustrated above. First we see the syntax for defining a function: -`def funcname(...)`. The function has arguments `ax, b, m` -where `ax` is an axis object for an existing plot, `b` is the intercept and -`m` is the slope of the desired line. Other plotting options can be passed on to -`ax.plot` by including additional optional arguments as follows: - -```{python} -def abline(ax, b, m, *args, **kwargs): - "Add a line with slope m and intercept b to ax" - xlim = ax.get_xlim() - ylim = [m * xlim[0] + b, m * xlim[1] + b] - ax.plot(xlim, ylim, *args, **kwargs) - -``` -The addition of `*args` allows any number of -non-named arguments to `abline`, while `**kwargs` allows any -number of named arguments (such as `linewidth=3`) to `abline`. -In our function, we pass -these arguments verbatim to `ax.plot` above. Readers -interested in learning more about -functions are referred to the section on -defining functions in [docs.python.org/tutorial](https://docs.python.org/3/tutorial/controlflow.html#defining-functions). - -Let’s use our new function to add this regression line to a plot of -`medv` vs. `lstat`. - -```{python} -ax = Boston.plot.scatter('lstat', 'medv') -abline(ax, - results.params['intercept'], - results.params['lstat'], - 'r--', - linewidth=3) - -``` -Thus, the final call to `ax.plot()` is `ax.plot(xlim, ylim, 'r--', linewidth=3)`. -We have used the argument `'r--'` to produce a red dashed line, and added -an argument to make it of width 3. -There is some evidence for non-linearity in the relationship between `lstat` and `medv`. We will explore this issue later in this lab. - -As mentioned above, there is an existing function to add a line to a plot --- `ax.axline()` --- but knowing how to write such functions empowers us to create more expressive displays. - - - - -Next we examine some diagnostic plots, several of which were discussed -in Section 3.3.3. -We can find the fitted values and residuals -of the fit as attributes of the `results` object. -Various influence measures describing the regression model -are computed with the `get_influence()` method. -As we will not use the `fig` component returned -as the first value from `subplots()`, we simply -capture the second returned value in `ax` below. 
- -```{python} -ax = subplots(figsize=(8,8))[1] -ax.scatter(results.fittedvalues, results.resid) -ax.set_xlabel('Fitted value') -ax.set_ylabel('Residual') -ax.axhline(0, c='k', ls='--'); - -``` - We add a horizontal line at 0 for reference using the - `ax.axhline()` method, indicating -it should be black (`c='k'`) and have a dashed linestyle (`ls='--'`). - -On the basis of the residual plot, there is some evidence of non-linearity. -Leverage statistics can be computed for any number of predictors using the -`hat_matrix_diag` attribute of the value returned by the -`get_influence()` method. - -```{python} -infl = results.get_influence() -ax = subplots(figsize=(8,8))[1] -ax.scatter(np.arange(X.shape[0]), infl.hat_matrix_diag) -ax.set_xlabel('Index') -ax.set_ylabel('Leverage') -np.argmax(infl.hat_matrix_diag) - -``` - The `np.argmax()` function identifies the index of the largest element of an array, optionally computed over an axis of the array. -In this case, we maximized over the entire array -to determine which observation has the largest leverage statistic. - - -## Multiple Linear Regression -In order to fit a multiple linear regression model using least squares, we again use -the `ModelSpec()` transform to construct the required -model matrix and response. The arguments -to `ModelSpec()` can be quite general, but in this case -a list of column names suffice. We consider a fit here with -the two variables `lstat` and `age`. - -```{python} -X = MS(['lstat', 'age']).fit_transform(Boston) -model1 = sm.OLS(y, X) -results1 = model1.fit() -summarize(results1) -``` -Notice how we have compacted the first line into a succinct expression describing the construction of `X`. - -The `Boston` data set contains 12 variables, and so it would be cumbersome -to have to type all of these in order to perform a regression using all of the predictors. -Instead, we can use the following short-hand:\definelongblankMR{columns.drop()}{columns.slashslashdrop()} - -```{python} -terms = Boston.columns.drop('medv') -terms - -``` - -We can now fit the model with all the variables in `terms` using -the same model matrix builder. - -```{python} -X = MS(terms).fit_transform(Boston) -model = sm.OLS(y, X) -results = model.fit() -summarize(results) - -``` - -What if we would like to perform a regression using all of the variables but one? For -example, in the above regression output, `age` has a high $p$-value. -So we may wish to run a regression excluding this predictor. -The following syntax results in a regression using all predictors except `age`. - -```{python} -minus_age = Boston.columns.drop(['medv', 'age']) -Xma = MS(minus_age).fit_transform(Boston) -model1 = sm.OLS(y, Xma) -summarize(model1.fit()) - -``` - -## Multivariate Goodness of Fit -We can access the individual components of `results` by name -(`dir(results)` shows us what is available). Hence -`results.rsquared` gives us the $R^2$, -and -`np.sqrt(results.scale)` gives us the RSE. - -Variance inflation factors (section 3.3.3) are sometimes useful -to assess the effect of collinearity in the model matrix of a regression model. -We will compute the VIFs in our multiple regression fit, and use the opportunity to introduce the idea of *list comprehension*. - -### List Comprehension -Often we encounter a sequence of objects which we would like to transform -for some other task. Below, we compute the VIF for each -feature in our `X` matrix and produce a data frame -whose index agrees with the columns of `X`. 
The notion of list comprehension can often make such
a task easier.

List comprehensions are simple and powerful ways to form
lists of `Python` objects. The language also supports
dictionary and *generator* comprehension, though these are
beyond our scope here. Let's look at an example. We compute the VIF for each of the variables
in the model matrix `X`, using the function `variance_inflation_factor()`.


```{python}
vals = [VIF(X, i)
        for i in range(1, X.shape[1])]
vif = pd.DataFrame({'vif':vals},
                   index=X.columns[1:])
vif

```
The function `VIF()` takes two arguments: a dataframe or array,
and a variable column index. In the code above we call `VIF()` on the fly for all columns in `X`.
We have excluded column 0 above (the intercept), which is not of interest. In this case the VIFs are not that exciting.

The object `vals` above could have been constructed with the following for loop:

```{python}
vals = []
for i in range(1, X.values.shape[1]):
    vals.append(VIF(X.values, i))

```
List comprehension allows us to perform such repetitive operations in a more straightforward way.

## Interaction Terms
It is easy to include interaction terms in a linear model using `ModelSpec()`.
Including a tuple `("lstat","age")` tells the model
matrix builder to include an interaction term between
`lstat` and `age`.

```{python}
X = MS(['lstat',
        'age',
        ('lstat', 'age')]).fit_transform(Boston)
model2 = sm.OLS(y, X)
summarize(model2.fit())

```


## Non-linear Transformations of the Predictors
The model matrix builder can include terms beyond
just column names and interactions. For instance,
the `poly()` function supplied in `ISLP` specifies that
columns representing polynomial functions
of its first argument are added to the model matrix.

```{python}
X = MS([poly('lstat', degree=2), 'age']).fit_transform(Boston)
model3 = sm.OLS(y, X)
results3 = model3.fit()
summarize(results3)

```
The effectively zero *p*-value associated with the quadratic term
(i.e. the third row above) suggests that it leads to an improved model.

By default, `poly()` creates a basis matrix for inclusion in the
model matrix whose
columns are *orthogonal polynomials*, which are designed for stable
least squares computations. {Actually, `poly()` is a wrapper for the workhorse and standalone function `Poly()` that does the work in building the model matrix.}
Alternatively, had we included an argument
`raw=True` in the above call to `poly()`, the basis matrix would consist simply of
`lstat` and `lstat**2`. Since either of these bases
represents quadratic polynomials, the fitted values would not
change in this case, just the polynomial coefficients. Also by default, the columns
created by `poly()` do not include an intercept column as
that is automatically added by `MS()`.

We use the `anova_lm()` function to further quantify the extent to which the quadratic fit is
superior to the linear fit.

```{python}
anova_lm(results1, results3)

```
Here `results1` represents the linear submodel containing
predictors `lstat` and `age`,
while `results3` corresponds to the larger model above with a quadratic
term in `lstat`.
The `anova_lm()` function performs a hypothesis test
comparing the two models. The null hypothesis is that the quadratic
term in the bigger model is not needed, and the alternative hypothesis is that the
bigger model is superior. Here the *F*-statistic is 177.28 and
the associated *p*-value is zero.
-In this case the *F*-statistic is the square of the -*t*-statistic for the quadratic term in the linear model summary -for `results3` --- a consequence of the fact that these nested -models differ by one degree of freedom. -This provides very clear evidence that the quadratic polynomial in -`lstat` improves the linear model. -This is not surprising, since earlier we saw evidence for non-linearity in the relationship between `medv` -and `lstat`. - -The function `anova_lm()` can take more than two nested models -as input, in which case it compares every successive pair of models. -That also explains why there are `NaN`s in the first row above, since -there is no previous model with which to compare the first. - - -```{python} -ax = subplots(figsize=(8,8))[1] -ax.scatter(results3.fittedvalues, results3.resid) -ax.set_xlabel('Fitted value') -ax.set_ylabel('Residual') -ax.axhline(0, c='k', ls='--'); - -``` -We see that when the quadratic term is included in the model, -there is little discernible pattern in the residuals. -In order to create a cubic or higher-degree polynomial fit, we can simply change the degree argument -to `poly()`. - - - -## Qualitative Predictors -Here we use the `Carseats` data, which is included in the -`ISLP` package. We will attempt to predict `Sales` -(child car seat sales) in 400 locations based on a number of -predictors. - -```{python} -Carseats = load_data('Carseats') -Carseats.columns - -``` -The `Carseats` - data includes qualitative predictors such as - `ShelveLoc`, an indicator of the quality of the shelving - location --- that is, -the space within a store in which the car seat is displayed. The predictor - `ShelveLoc` takes on three possible values, `Bad`, `Medium`, and `Good`. -Given a qualitative variable such as `ShelveLoc`, `ModelSpec()` generates dummy -variables automatically. -These variables are often referred to as a *one-hot encoding* of the categorical -feature. Their columns sum to one, so to avoid collinearity with an intercept, the first column is dropped. Below we see -the column `ShelveLoc[Bad]` has been dropped, since `Bad` is the first level of `ShelveLoc`. -Below we fit a multiple regression model that includes some interaction terms. - -```{python} -allvars = list(Carseats.columns.drop('Sales')) -y = Carseats['Sales'] -final = allvars + [('Income', 'Advertising'), - ('Price', 'Age')] -X = MS(final).fit_transform(Carseats) -model = sm.OLS(y, X) -summarize(model.fit()) - -``` -In the first line above, we made `allvars` a list, so that we -could add the interaction terms two lines down. -Our model-matrix builder has created a `ShelveLoc[Good]` -dummy variable that takes on a value of 1 if the -shelving location is good, and 0 otherwise. It has also created a `ShelveLoc[Medium]` -dummy variable that equals 1 if the shelving location is medium, and 0 otherwise. -A bad shelving location corresponds to a zero for each of the two dummy variables. -The fact that the coefficient for `ShelveLoc[Good]` in the regression output is -positive indicates that a good shelving location is associated with high sales (relative to a bad location). -And `ShelveLoc[Medium]` has a smaller positive coefficient, -indicating that a medium shelving location leads to higher sales than a bad -shelving location, but lower sales than a good shelving location. 
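To see exactly which dummy columns are created, we can build a small model matrix containing only `ShelveLoc` and inspect its columns. This is a quick optional check, assuming (as for the fits above) that `fit_transform()` returns a data frame whose dummy columns follow the `ShelveLoc[level]` naming seen in the regression output:

```{python}
# Inspect the one-hot encoding of ShelveLoc: we expect an intercept plus
# ShelveLoc[Good] and ShelveLoc[Medium], with the first level Bad dropped.
MS(['ShelveLoc']).fit_transform(Carseats).head()

```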
- - diff --git a/Ch04-classification-lab.Rmd b/Ch04-classification-lab.Rmd deleted file mode 100644 index b1375d6..0000000 --- a/Ch04-classification-lab.Rmd +++ /dev/null @@ -1,1166 +0,0 @@ - - -# Logistic Regression, LDA, QDA, and KNN - - -Open In Colab - - -[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/intro-stat-learning/ISLP_labs/v2.2.1?labpath=Ch04-classification-lab.ipynb) - - - - - -## The Stock Market Data - -In this lab we will examine the `Smarket` -data, which is part of the `ISLP` -library. This data set consists of percentage returns for the S&P 500 -stock index over 1,250 days, from the beginning of 2001 until the end -of 2005. For each date, we have recorded the percentage returns for -each of the five previous trading days, `Lag1` through - `Lag5`. We have also recorded `Volume` (the number of -shares traded on the previous day, in billions), `Today` (the -percentage return on the date in question) and `Direction` -(whether the market was `Up` or `Down` on this date). - -We start by importing our libraries at this top level; these are all imports we have seen in previous labs. - -```{python} -import numpy as np -import pandas as pd -from matplotlib.pyplot import subplots -import statsmodels.api as sm -from ISLP import load_data -from ISLP.models import (ModelSpec as MS, - summarize) -``` - -We also collect together the new imports needed for this lab. - -```{python} -from ISLP import confusion_table -from ISLP.models import contrast -from sklearn.discriminant_analysis import \ - (LinearDiscriminantAnalysis as LDA, - QuadraticDiscriminantAnalysis as QDA) -from sklearn.naive_bayes import GaussianNB -from sklearn.neighbors import KNeighborsClassifier -from sklearn.preprocessing import StandardScaler -from sklearn.model_selection import train_test_split -from sklearn.linear_model import LogisticRegression - -``` - - -Now we are ready to load the `Smarket` data. - -```{python} -Smarket = load_data('Smarket') -Smarket -``` -This gives a truncated listing of the data. -We can see what the variable names are. - -```{python} -Smarket.columns -``` -We compute the correlation matrix using the `corr()` method -for data frames, which produces a matrix that contains all of -the pairwise correlations among the variables. - -By instructing `pandas` to use only numeric variables, the `corr()` method does not report a correlation for the `Direction` variable because it is - qualitative. - -```{python} -Smarket.corr(numeric_only=True) - -``` -As one would expect, the correlations between the lagged return variables and -today’s return are close to zero. The only substantial correlation is between `Year` and - `Volume`. By plotting the data we see that `Volume` -is increasing over time. In other words, the average number of shares traded -daily increased from 2001 to 2005. - - -```{python} -Smarket.plot(y='Volume'); - -``` - - -## Logistic Regression -Next, we will fit a logistic regression model in order to predict - `Direction` using `Lag1` through `Lag5` and - `Volume`. The `sm.GLM()` function fits *generalized linear models*, a class of -models that includes logistic regression. Alternatively, -the function `sm.Logit()` fits a logistic regression -model directly. The syntax of -`sm.GLM()` is similar to that of `sm.OLS()`, except -that we must pass in the argument `family=sm.families.Binomial()` -in order to tell `statsmodels` to run a logistic regression rather than some other -type of generalized linear model. 
```{python}
allvars = Smarket.columns.drop(['Today', 'Direction', 'Year'])
design = MS(allvars)
X = design.fit_transform(Smarket)
y = Smarket.Direction == 'Up'
glm = sm.GLM(y,
             X,
             family=sm.families.Binomial())
results = glm.fit()
summarize(results)

```
The smallest *p*-value here is associated with `Lag1`. The
negative coefficient for this predictor suggests that if the market
had a positive return yesterday, then it is less likely to go up
today. However, at a value of 0.15, the *p*-value is still
relatively large, and so there is no clear evidence of a real
association between `Lag1` and `Direction`.

We use the `params` attribute of `results`
in order to access just the
coefficients for this fitted model.

```{python}
results.params

```
Likewise we can use the
`pvalues` attribute to access the *p*-values for the coefficients.

```{python}
results.pvalues

```

The `predict()` method of `results` can be used to predict the
probability that the market will go up, given values of the
predictors. This method returns predictions
on the probability scale. If no data set is supplied to the `predict()`
function, then the probabilities are computed for the training data
that was used to fit the logistic regression model.
As with linear regression, one can pass an optional `exog` argument consistent
with a design matrix if desired. Here we have
printed only the first ten probabilities.

```{python}
probs = results.predict()
probs[:10]

```
In order to make a prediction as to whether the market will go up or
down on a particular day, we must convert these predicted
probabilities into class labels, `Up` or `Down`. The
following two commands create a vector of class predictions based on
whether the predicted probability of a market increase is greater than
or less than 0.5.

```{python}
labels = np.array(['Down']*1250)
labels[probs>0.5] = "Up"

```
The `confusion_table()`
function from the `ISLP` package summarizes these predictions, showing how
many observations were correctly or incorrectly classified. Our function, which is adapted from a similar function
in the module `sklearn.metrics`, transposes the resulting
matrix and includes row and column labels.
The `confusion_table()` function takes as first argument the
predicted labels, and second argument the true labels.

```{python}
confusion_table(labels, Smarket.Direction)

```

The diagonal elements of the confusion matrix indicate correct
predictions, while the off-diagonals represent incorrect
predictions. Hence our model correctly predicted that the market would
go up on 507 days and that it would go down on 145 days, for a
total of 507 + 145 = 652 correct predictions. The `np.mean()`
function can be used to compute the fraction of days for which the
prediction was correct. In this case, logistic regression correctly
predicted the movement of the market 52.2% of the time.


```{python}
(507+145)/1250, np.mean(labels == Smarket.Direction)

```

At first glance, it appears that the logistic regression model is
working a little better than random guessing. However, this result is
misleading because we trained and tested the model on the same set of
1,250 observations. In other words, $100 - 52.2 = 47.8\%$ is the
*training* error rate. As we have seen
previously, the training error rate is often overly optimistic --- it
tends to underestimate the test error rate.
In order to better assess the accuracy of the logistic regression model
in this setting, we can fit the model using part of the data, and
then examine how well it predicts the *held out* data. This
will yield a more realistic error rate, in the sense that in practice
we will be interested in our model’s performance not on the data that
we used to fit the model, but rather on days in the future for which
the market’s movements are unknown.

To implement this strategy, we first create a Boolean vector
corresponding to the observations from 2001 through 2004. We then
use this vector to create a held out data set of observations from
2005.

```{python}
train = (Smarket.Year < 2005)
Smarket_train = Smarket.loc[train]
Smarket_test = Smarket.loc[~train]
Smarket_test.shape

```

The object `train` is a vector of 1,250 elements, corresponding
to the observations in our data set. The elements of the vector that
correspond to observations that occurred before 2005 are set to
`True`, whereas those that correspond to observations in 2005 are
set to `False`. Hence `train` is a
*boolean* array, since its
elements are `True` and `False`. Boolean arrays can be used
to obtain a subset of the rows or columns of a data frame
using the `loc` method. For instance,
the command `Smarket.loc[train]` would pick out a submatrix of the
stock market data set, corresponding only to the dates before 2005,
since those are the ones for which the elements of `train` are
`True`. The `~` symbol can be used to negate all of the
elements of a Boolean vector. That is, `~train` is a vector
similar to `train`, except that the elements that are `True`
in `train` get swapped to `False` in `~train`, and vice versa.
Therefore, `Smarket.loc[~train]` yields a
subset of the rows of the data frame
of the stock market data containing only the observations for which
`train` is `False`.
The output above indicates that there are 252 such
observations.

We now fit a logistic regression model using only the subset of the
observations that correspond to dates before 2005. We then obtain predicted probabilities of the
stock market going up for each of the days in our test set --- that is,
for the days in 2005.

```{python}
X_train, X_test = X.loc[train], X.loc[~train]
y_train, y_test = y.loc[train], y.loc[~train]
glm_train = sm.GLM(y_train,
                   X_train,
                   family=sm.families.Binomial())
results = glm_train.fit()
probs = results.predict(exog=X_test)

```

Notice that we have trained and tested our model on two completely
separate data sets: training was performed using only the dates before
2005, and testing was performed using only the dates in 2005.

Finally, we compare the predictions for 2005 to the
actual movements of the market over that time period.
We will first store the test and training labels (recall `y_test` is binary).

```{python}
D = Smarket.Direction
L_train, L_test = D.loc[train], D.loc[~train]

```
Now we threshold the
fitted probability at 50% to form
our predicted labels.

```{python}
labels = np.array(['Down']*252)
labels[probs>0.5] = 'Up'
confusion_table(labels, L_test)

```
The test accuracy is about 48% while the error rate is about 52%.

```{python}
np.mean(labels == L_test), np.mean(labels != L_test)

```

The `!=` notation means *not equal to*, and so the last command
computes the test set error rate. The results are rather
disappointing: the test error rate is 52%, which is worse than
random guessing!
Of course this result is not all that surprising, -given that one would not generally expect to be able to use previous -days’ returns to predict future market performance. (After all, if it -were possible to do so, then the authors of this book would be out -striking it rich rather than writing a statistics textbook.) - -We recall that the logistic regression model had very underwhelming -*p*-values associated with all of the predictors, and that the -smallest *p*-value, though not very small, corresponded to - `Lag1`. Perhaps by removing the variables that appear not to be -helpful in predicting `Direction`, we can obtain a more -effective model. After all, using predictors that have no relationship -with the response tends to cause a deterioration in the test error -rate (since such predictors cause an increase in variance without a -corresponding decrease in bias), and so removing such predictors may -in turn yield an improvement. Below we refit the logistic -regression using just `Lag1` and `Lag2`, which seemed to -have the highest predictive power in the original logistic regression -model. - -```{python} -model = MS(['Lag1', 'Lag2']).fit(Smarket) -X = model.transform(Smarket) -X_train, X_test = X.loc[train], X.loc[~train] -glm_train = sm.GLM(y_train, - X_train, - family=sm.families.Binomial()) -results = glm_train.fit() -probs = results.predict(exog=X_test) -labels = np.array(['Down']*252) -labels[probs>0.5] = 'Up' -confusion_table(labels, L_test) - -``` - - -Let’s evaluate the overall accuracy as well as the accuracy within the days when -logistic regression predicts an increase. - -```{python} -(35+106)/252,106/(106+76) - -``` - -Now the results appear to be a little better: 56% of the daily -movements have been correctly predicted. It is worth noting that in -this case, a much simpler strategy of predicting that the market will -increase every day will also be correct 56% of the time! Hence, in -terms of overall error rate, the logistic regression method is no -better than the naive approach. However, the confusion matrix -shows that on days when logistic regression predicts an increase in -the market, it has a 58% accuracy rate. This suggests a possible -trading strategy of buying on days when the model predicts an -increasing market, and avoiding trades on days when a decrease is -predicted. Of course one would need to investigate more carefully -whether this small improvement was real or just due to random chance. - -Suppose that we want to predict the returns associated with particular -values of `Lag1` and `Lag2`. In particular, we want to -predict `Direction` on a day when `Lag1` and - `Lag2` equal $1.2$ and $1.1$, respectively, and on a day when they -equal $1.5$ and $-0.8$. We do this using the `predict()` -function. - -```{python} -newdata = pd.DataFrame({'Lag1':[1.2, 1.5], - 'Lag2':[1.1, -0.8]}); -newX = model.transform(newdata) -results.predict(newX) - -``` - - -## Linear Discriminant Analysis - - -We begin by performing LDA on the `Smarket` data, using the function -`LinearDiscriminantAnalysis()`, which we have abbreviated `LDA()`. We -fit the model using only the observations before 2005. - -```{python} -lda = LDA(store_covariance=True) - -``` - -Since the `LDA` estimator automatically -adds an intercept, we should remove the column corresponding to the -intercept in both `X_train` and `X_test`. We can also directly -use the labels rather than the Boolean vectors `y_train`. 
```{python}
X_train, X_test = [M.drop(columns=['intercept'])
                   for M in [X_train, X_test]]
lda.fit(X_train, L_train)

```
Here we have used the list comprehensions introduced
in Section 3.6.4. Looking at our first line above, we see that the right-hand side is a list
of length two. This is because the code `for M in [X_train, X_test]` iterates over a list
of length two. While here we loop over a list,
the list comprehension method works when looping over any iterable object.
We then apply the `drop()` method to each element in the iteration, collecting
the result in a list. The left-hand side tells `Python` to unpack this list
of length two, assigning its elements to the variables `X_train` and `X_test`. Of course,
this overwrites the previous values of `X_train` and `X_test`.

Having fit the model, we can extract the means in the two classes with the `means_` attribute. These are the average of each predictor within each class, and
are used by LDA as estimates of $\mu_k$. These suggest that there is
a tendency for the previous 2 days’ returns to be negative on days
when the market increases, and a tendency for the previous days’
returns to be positive on days when the market declines.

```{python}
lda.means_

```

The estimated prior probabilities are stored in the `priors_` attribute.
The package `sklearn` typically uses this trailing `_` to denote
a quantity estimated when using the `fit()` method. We can be sure of which
entry corresponds to which label by looking at the `classes_` attribute.

```{python}
lda.classes_

```

The LDA output indicates that $\hat\pi_{Down}=0.492$ and
$\hat\pi_{Up}=0.508$.


```{python}
lda.priors_

```

The linear discriminant vectors can be found in the `scalings_` attribute:

```{python}
lda.scalings_

```

These values provide the linear combination of `Lag1` and `Lag2` that are used to form the LDA decision rule. In other words, these are the multipliers of the elements of $X=x$ in (4.24).
If $-0.64\times `Lag1` - 0.51 \times `Lag2`$ is large, then the LDA classifier will predict a market increase, and if it is small, then the LDA classifier will predict a market decline.

```{python}
lda_pred = lda.predict(X_test)

```

As we observed in our comparison of classification methods
(Section 4.5), the LDA and logistic
regression predictions are almost identical.

```{python}
confusion_table(lda_pred, L_test)

```

We can also estimate the
probability of each class for
each point in a data set. Applying a 50% threshold to the posterior probabilities of
being in class one allows us to
recreate the predictions contained in `lda_pred`.

```{python}
lda_prob = lda.predict_proba(X_test)
np.all(
       np.where(lda_prob[:,1] >= 0.5, 'Up','Down') == lda_pred
       )

```

Above, we used the `np.where()` function that
creates an array with value `'Up'` for indices where
the second column of `lda_prob` (the estimated
posterior probability of `'Up'`) is at least 0.5.
For problems with more than two classes the labels are chosen as the class whose posterior probability is highest:

```{python}
np.all(
       [lda.classes_[i] for i in np.argmax(lda_prob, 1)] == lda_pred
       )

```

If we wanted to use a posterior probability threshold other than
50% in order to make predictions, then we could easily do so.
For -instance, suppose that we wish to predict a market decrease only if we -are very certain that the market will indeed decrease on that -day --- say, if the posterior probability is at least 90%. -We know that the first column of `lda_prob` corresponds to the -label `Down` after having checked the `classes_` attribute, hence we use -the column index 0 rather than 1 as we did above. - -```{python} -np.sum(lda_prob[:,0] > 0.9) - -``` - -No days in 2005 meet that threshold! In fact, the greatest posterior -probability of decrease in all of 2005 was 52.02%. - -The LDA classifier above is the first classifier from the -`sklearn` library. We will use several other objects -from this library. The objects -follow a common structure that simplifies tasks such as cross-validation, -which we will see in Chapter 5. Specifically, -the methods first create a generic classifier without -referring to any data. This classifier is then fit -to data with the `fit()` method and predictions are -always produced with the `predict()` method. This pattern -of first instantiating the classifier, followed by fitting it, and -then producing predictions is an explicit design choice of `sklearn`. This uniformity -makes it possible to cleanly copy the classifier so that it can be fit -on different data; e.g. different training sets arising in cross-validation. -This standard pattern also allows for a predictable formation of workflows. - - - -## Quadratic Discriminant Analysis -We will now fit a QDA model to the `Smarket` data. QDA is -implemented via -`QuadraticDiscriminantAnalysis()` -in the `sklearn` package, which we abbreviate to `QDA()`. -The syntax is very similar to `LDA()`. - -```{python} -qda = QDA(store_covariance=True) -qda.fit(X_train, L_train) - -``` - -The `QDA()` function will again compute `means_` and `priors_`. - -```{python} -qda.means_, qda.priors_ - -``` - -The `QDA()` classifier will estimate one covariance per class. Here is the -estimated covariance in the first class: - -```{python} -qda.covariance_[0] - -``` -The output contains the group means. But it does not contain the -coefficients of the linear discriminants, because the QDA classifier -involves a quadratic, rather than a linear, function of the -predictors. The `predict()` function works in exactly the -same fashion as for LDA. - -```{python} -qda_pred = qda.predict(X_test) -confusion_table(qda_pred, L_test) - -``` -Interestingly, the QDA predictions are accurate almost 60% of the -time, even though the 2005 data was not used to fit the model. - -```{python} -np.mean(qda_pred == L_test) - -``` - - -This level of accuracy is quite impressive for stock market data, which is -known to be quite hard to model accurately. This suggests that the -quadratic form assumed by QDA may capture the true relationship more -accurately than the linear forms assumed by LDA and logistic -regression. However, we recommend evaluating this method’s -performance on a larger test set before betting that this approach -will consistently beat the market! - - -## Naive Bayes -Next, we fit a naive Bayes model to the `Smarket` data. The syntax is -similar to that of `LDA()` and `QDA()`. By -default, this implementation `GaussianNB()` of the naive Bayes classifier models each -quantitative feature using a Gaussian distribution. However, a kernel -density method can also be used to estimate the distributions. - -```{python} -NB = GaussianNB() -NB.fit(X_train, L_train) - -``` - - -The classes are stored as `classes_`. 
```{python}
NB.classes_

```

The class prior probabilities are stored in the `class_prior_` attribute.

```{python}
NB.class_prior_

```

The parameters of the features can be found in the `theta_` and `var_` attributes. The number of rows
is equal to the number of classes, while the number of columns is equal to the number of features.
We see below that the mean for feature `Lag1` in the `Down` class is 0.043.

```{python}
NB.theta_

```

Its variance is 1.503.

```{python}
NB.var_

```
How do we know the names of these attributes? We use `NB?` (or `?NB`).

We can easily verify the mean computation:

```{python}
X_train[L_train == 'Down'].mean()

```

Similarly for the variance:

```{python}
X_train[L_train == 'Down'].var(ddof=0)

```
Since `NB()` is a classifier in the `sklearn` library, making predictions
uses the same syntax as for `LDA()` and `QDA()` above.

```{python}
nb_labels = NB.predict(X_test)
confusion_table(nb_labels, L_test)

```

Naive Bayes performs well on these data, with accurate predictions over 59% of the time. This is slightly worse than QDA, but much better than LDA.

As for `LDA`, the `predict_proba()` method estimates the probability that each observation belongs to a particular class.

```{python}
NB.predict_proba(X_test)[:5]

```

## K-Nearest Neighbors
We will now perform KNN using the `KNeighborsClassifier()` function. This function works similarly
to the other model-fitting functions that we have
encountered thus far.

As is the
case for LDA and QDA, we fit the classifier
using the `fit` method. New
predictions are formed using the `predict` method
of the object returned by `fit()`.

```{python}
knn1 = KNeighborsClassifier(n_neighbors=1)
X_train, X_test = [np.asarray(X) for X in [X_train, X_test]]
knn1.fit(X_train, L_train)
knn1_pred = knn1.predict(X_test)
confusion_table(knn1_pred, L_test)

```

The results using $K=1$ are not very good, since only $50\%$ of the
observations are correctly predicted. Of course, it may be that $K=1$
results in an overly-flexible fit to the data.

```{python}
(83+43)/252, np.mean(knn1_pred == L_test)

```

We repeat the
analysis below using $K=3$.

```{python}
knn3 = KNeighborsClassifier(n_neighbors=3)
knn3_pred = knn3.fit(X_train, L_train).predict(X_test)
np.mean(knn3_pred == L_test)
```
The results have improved slightly. But increasing *K* further
provides no further improvements. It appears that for these data, and this train/test split,
QDA gives the best results of the methods that we have examined so
far.


KNN does not perform well on the `Smarket` data, but it often does provide impressive results. As an example we will apply the KNN approach to the `Caravan` data set, which is part of the `ISLP` library. This data set includes 85
predictors that measure demographic characteristics for 5,822
individuals. The response variable is `Purchase`, which
indicates whether or not a given individual purchases a caravan
insurance policy. In this data set, only 6% of people purchased
caravan insurance.

```{python}
Caravan = load_data('Caravan')
Purchase = Caravan.Purchase
Purchase.value_counts()

```

The method `value_counts()` takes a `pd.Series` or `pd.DataFrame` and returns
a `pd.Series` with the corresponding counts
for each unique element. In this case `Purchase` has only `Yes` and `No` values
and the method returns how many values of each there are.
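Since `value_counts()` is a standard `pandas` method, it can also report proportions directly via its `normalize` argument; a small optional check:

```{python}
# normalize=True divides each count by the total number of observations,
# so this returns the proportion of 'No' and 'Yes' responses directly.
Purchase.value_counts(normalize=True)

```

The same fraction of buyers can of course be computed by hand: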
```{python}
348 / 5822

```

Our features will include all columns except `Purchase`.

```{python}
feature_df = Caravan.drop(columns=['Purchase'])

```

Because the KNN classifier predicts the class of a given test
observation by identifying the observations that are nearest to it,
the scale of the variables matters. Any variables that are on a large
scale will have a much larger effect on the *distance* between
the observations, and hence on the KNN classifier, than variables that
are on a small scale. For instance, imagine a data set that contains
two variables, `salary` and `age` (measured in dollars
and years, respectively). As far as KNN is concerned, a difference of
1,000 USD in salary is enormous compared to a difference of 50 years in
age. Consequently, `salary` will drive the KNN classification
results, and `age` will have almost no effect. This is contrary
to our intuition that a salary difference of 1,000 USD is quite small
compared to an age difference of 50 years. Furthermore, the
importance of scale to the KNN classifier leads to another issue: if
we measured `salary` in Japanese yen, or if we measured
`age` in minutes, then we’d get quite different classification
results from what we get if these two variables are measured in
dollars and years.

A good way to handle this problem is to *standardize* the data so that all variables are
given a mean of zero and a standard deviation of one. Then all
variables will be on a comparable scale. This is accomplished
using
the `StandardScaler()`
transformation.

```{python}
scaler = StandardScaler(with_mean=True,
                        with_std=True,
                        copy=True)

```
The argument `with_mean` indicates whether or not
we should subtract the mean, while `with_std` indicates
whether or not we should scale the columns to have standard
deviation of 1. Finally, the argument `copy=True`
indicates that we will always copy data, rather than
trying to do calculations in place where possible.

This transformation can be fit
and then applied to arbitrary data. In the first line
below, the parameters for the scaling are computed and
stored in `scaler`, while the second line actually
constructs the standardized set of features.

```{python}
scaler.fit(feature_df)
X_std = scaler.transform(feature_df)

```
Now every column of `feature_std` below has a standard deviation of
one and a mean of zero.

```{python}
feature_std = pd.DataFrame(
                 X_std,
                 columns=feature_df.columns);
feature_std.std()

```

Notice that the standard deviations are not quite $1$ here; this is again due to some procedures using the $1/n$ convention for variances (in this case `scaler()`), while others use $1/(n-1)$ (the `std()` method). See the footnote on page 183.
In this case it does not matter, as long as the variables are all on the same scale.

Using the function `train_test_split()` we now split the observations into a test set,
containing 1000 observations, and a training set containing the remaining
observations. The argument `random_state=0` ensures that we get
the same split each time we rerun the code.

```{python}
(X_train,
 X_test,
 y_train,
 y_test) = train_test_split(np.asarray(feature_std),
                            Purchase,
                            test_size=1000,
                            random_state=0)

```
`?train_test_split` reveals that the non-keyword arguments can be `lists`, `arrays`, `pandas dataframes` etc. that all have the same length (`shape[0]`) and hence are *indexable*.
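As a small illustration of this indexability requirement, here is a toy call on made-up data (the two objects below, `demo_X` and `demo_y`, are hypothetical and serve only to show that any same-length indexables can be split together):

```{python}
# A hypothetical 5-observation feature array and a matching list of labels;
# train_test_split() slices both with the same random row indices.
demo_X = np.arange(10).reshape(5, 2)
demo_y = ['a', 'b', 'c', 'd', 'e']
train_test_split(demo_X, demo_y, test_size=2, random_state=0)

```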
In this case they are the dataframe `feature_std` and the response variable `Purchase`.
{Note that we have converted `feature_std` to an `ndarray` to address a bug in `sklearn`.}
We fit a KNN model on the training data using $K=1$,
and evaluate its performance on the test data.

```{python}
knn1 = KNeighborsClassifier(n_neighbors=1)
knn1_pred = knn1.fit(X_train, y_train).predict(X_test)
np.mean(y_test != knn1_pred), np.mean(y_test != "No")

```

The KNN error rate on the 1,000 test observations is about $11\%$.
At first glance, this may appear to be fairly good. However, since
just over 6% of customers purchased insurance, we could get the error
rate down to almost 6% by always predicting `No` regardless of the
values of the predictors! {This is known as the *null rate*.}

Suppose that there is some non-trivial cost to trying to sell
insurance to a given individual. For instance, perhaps a salesperson
must visit each potential customer. If the company tries to sell
insurance to a random selection of customers, then the success rate
will be only 6%, which may be far too low given the costs
involved. Instead, the company would like to try to sell insurance
only to customers who are likely to buy it. So the overall error rate
is not of interest. Instead, the fraction of individuals that are
correctly predicted to buy insurance is of interest.

```{python}
confusion_table(knn1_pred, y_test)

```

It turns out that KNN with $K=1$ does far better than random guessing
among the customers that are predicted to buy insurance. Among 62
such customers, 9, or 14.5%, actually do purchase insurance.
This is double the rate that one would obtain from random guessing.

```{python}
9/(53+9)
```


### Tuning Parameters

The number of neighbors in KNN is referred to as a *tuning parameter*, also referred to as a *hyperparameter*.
We do not know *a priori* what value to use. It is therefore of interest
to see how the classifier performs on test data as we vary these
parameters. This can be achieved with a `for` loop, described in Section 2.3.8.
Here we use a for loop to look at the accuracy of our classifier in the group predicted to purchase
insurance as we vary the number of neighbors from 1 to 5:

```{python}
for K in range(1,6):
    knn = KNeighborsClassifier(n_neighbors=K)
    knn_pred = knn.fit(X_train, y_train).predict(X_test)
    C = confusion_table(knn_pred, y_test)
    templ = ('K={0:d}: # predicted to purchase: {1:>2},' +
             ' # who did purchase {2:d}, accuracy {3:.1%}')
    pred = C.loc['Yes'].sum()
    did_purchase = C.loc['Yes','Yes']
    print(templ.format(
          K,
          pred,
          did_purchase,
          did_purchase / pred))

```
We see some variability --- the numbers for `K=4` are very different from the rest.

### Comparison to Logistic Regression
As a comparison, we can also fit a logistic regression model to the
data. This can also be done
with `sklearn`, though by default it fits
something like the *ridge regression* version
of logistic regression, which we introduce in Chapter 6. This can
be modified by appropriately setting the argument `C` below. Its default
value is 1 but by setting it to a very large number, the algorithm converges to the same solution as the usual (unregularized)
logistic regression estimator discussed above.

Unlike the
`statsmodels` package, `sklearn` focuses less on
inference and more on classification.
Hence, -the `summary` methods seen in `statsmodels` -and our simplified version seen with `summarize` are not -generally available for the classifiers in `sklearn`. - -```{python} -logit = LogisticRegression(C=1e10, solver='liblinear') -logit.fit(X_train, y_train) -logit_pred = logit.predict_proba(X_test) -logit_labels = np.where(logit_pred[:,1] > .5, 'Yes', 'No') -confusion_table(logit_labels, y_test) - -``` - -We used the argument `solver='liblinear'` above to -avoid a warning with the default solver which would indicate that -the algorithm does not converge. - -If we use $0.5$ as the predicted probability cut-off for the -classifier, then we have a problem: only two of the test observations -are predicted to purchase insurance. However, we are not required to use a -cut-off of $0.5$. If we instead predict a purchase any time the -predicted probability of purchase exceeds $0.25$, we get much better -results: we predict that 29 people will purchase insurance, and we are -correct for about 31% of these people. This is almost five times -better than random guessing! - -```{python} -logit_labels = np.where(logit_pred[:,1]>0.25, 'Yes', 'No') -confusion_table(logit_labels, y_test) -``` - -```{python} -9/(20+9) - -``` -## Linear and Poisson Regression on the Bikeshare Data -Here we fit linear and Poisson regression models to the `Bikeshare` data, as described in Section 4.6. -The response `bikers` measures the number of bike rentals per hour -in Washington, DC in the period 2010--2012. - -```{python} -Bike = load_data('Bikeshare') - -``` -Let's have a peek at the dimensions and names of the variables in this dataframe. - -```{python} -Bike.shape, Bike.columns - -``` - -### Linear Regression - -We begin by fitting a linear regression model to the data. - -```{python} -X = MS(['mnth', - 'hr', - 'workingday', - 'temp', - 'weathersit']).fit_transform(Bike) -Y = Bike['bikers'] -M_lm = sm.OLS(Y, X).fit() -summarize(M_lm) - -``` - -There are 24 levels in `hr` and 40 rows in all. -In `M_lm`, the first levels `hr[0]` and `mnth[Jan]` are treated -as the baseline values, and so no coefficient estimates are provided -for them: implicitly, their coefficient estimates are zero, and all -other levels are measured relative to these baselines. For example, -the Feb coefficient of $6.845$ signifies that, holding all other -variables constant, there are on average about 7 more riders in -February than in January. Similarly there are about 16.5 more riders -in March than in January. - -The results seen in Section 4.6.1 -used a slightly different coding of the variables `hr` and `mnth`, as follows: - -```{python} -hr_encode = contrast('hr', 'sum') -mnth_encode = contrast('mnth', 'sum') - -``` -Refitting again: - -```{python} -X2 = MS([mnth_encode, - hr_encode, - 'workingday', - 'temp', - 'weathersit']).fit_transform(Bike) -M2_lm = sm.OLS(Y, X2).fit() -S2 = summarize(M2_lm) -S2 -``` - -What is the difference between the two codings? In `M2_lm`, a -coefficient estimate is reported for all but level `23` of `hr` -and level `Dec` of `mnth`. Importantly, in `M2_lm`, the (unreported) coefficient estimate -for the last level of `mnth` is not zero: instead, it equals the -negative of the sum of the coefficient estimates for all of the -other levels. Similarly, in `M2_lm`, the coefficient estimate -for the last level of `hr` is the negative of the sum of the -coefficient estimates for all of the other levels. 
This means that the -coefficients of `hr` and `mnth` in `M2_lm` will always sum -to zero, and can be interpreted as the difference from the mean -level. For example, the coefficient for January of $-46.087$ indicates -that, holding all other variables constant, there are typically 46 -fewer riders in January relative to the yearly average. - -It is important to realize that the choice of coding really does not -matter, provided that we interpret the model output correctly in light -of the coding used. For example, we see that the predictions from the -linear model are the same regardless of coding: - -```{python} -np.sum((M_lm.fittedvalues - M2_lm.fittedvalues)**2) - -``` - -The sum of squared differences is zero. We can also see this using the -`np.allclose()` function: - -```{python} -np.allclose(M_lm.fittedvalues, M2_lm.fittedvalues) - -``` - - -To reproduce the left-hand side of Figure 4.13 -we must first obtain the coefficient estimates associated with -`mnth`. The coefficients for January through November can be obtained -directly from the `M2_lm` object. The coefficient for December -must be explicitly computed as the negative sum of all the other -months. We first extract all the coefficients for month from -the coefficients of `M2_lm`. - -```{python} -coef_month = S2[S2.index.str.contains('mnth')]['coef'] -coef_month - -``` -Next, we append `Dec` as the negative of the sum of all other months. - -```{python} -months = Bike['mnth'].dtype.categories -coef_month = pd.concat([ - coef_month, - pd.Series([-coef_month.sum()], - index=['mnth[Dec]' - ]) - ]) -coef_month - -``` -Finally, to make the plot neater, we’ll just use the first letter of each month, which is the $6$th entry of each of -the labels in the index. - -```{python} -fig_month, ax_month = subplots(figsize=(8,8)) -x_month = np.arange(coef_month.shape[0]) -ax_month.plot(x_month, coef_month, marker='o', ms=10) -ax_month.set_xticks(x_month) -ax_month.set_xticklabels([l[5] for l in coef_month.index], fontsize=20) -ax_month.set_xlabel('Month', fontsize=20) -ax_month.set_ylabel('Coefficient', fontsize=20); - -``` - -Reproducing the right-hand plot in Figure 4.13 follows a similar process. - -```{python} -coef_hr = S2[S2.index.str.contains('hr')]['coef'] -coef_hr = coef_hr.reindex(['hr[{0}]'.format(h) for h in range(23)]) -coef_hr = pd.concat([coef_hr, - pd.Series([-coef_hr.sum()], index=['hr[23]']) - ]) - -``` - -We now make the hour plot. - -```{python} -fig_hr, ax_hr = subplots(figsize=(8,8)) -x_hr = np.arange(coef_hr.shape[0]) -ax_hr.plot(x_hr, coef_hr, marker='o', ms=10) -ax_hr.set_xticks(x_hr[::2]) -ax_hr.set_xticklabels(range(24)[::2], fontsize=20) -ax_hr.set_xlabel('Hour', fontsize=20) -ax_hr.set_ylabel('Coefficient', fontsize=20); - -``` - -### Poisson Regression - -Now we fit instead a Poisson regression model to the -`Bikeshare` data. Very little changes, except that we now use the -function `sm.GLM()` with the Poisson family specified: - -```{python} -M_pois = sm.GLM(Y, X2, family=sm.families.Poisson()).fit() - -``` - -We can plot the coefficients associated with `mnth` and `hr`, in order to reproduce Figure 4.15. We first complete these coefficients as before. 
```{python}
S_pois = summarize(M_pois)
coef_month = S_pois[S_pois.index.str.contains('mnth')]['coef']
coef_month = pd.concat([coef_month,
                        pd.Series([-coef_month.sum()],
                                  index=['mnth[Dec]'])])
coef_hr = S_pois[S_pois.index.str.contains('hr')]['coef']
coef_hr = pd.concat([coef_hr,
                     pd.Series([-coef_hr.sum()],
                               index=['hr[23]'])])
```
The plotting is as before.

```{python}
fig_pois, (ax_month, ax_hr) = subplots(1, 2, figsize=(16,8))
ax_month.plot(x_month, coef_month, marker='o', ms=10)
ax_month.set_xticks(x_month)
ax_month.set_xticklabels([l[5] for l in coef_month.index], fontsize=20)
ax_month.set_xlabel('Month', fontsize=20)
ax_month.set_ylabel('Coefficient', fontsize=20)
ax_hr.plot(x_hr, coef_hr, marker='o', ms=10)
ax_hr.set_xticks(x_hr[::2])  # set tick locations before labeling them
ax_hr.set_xticklabels(range(24)[::2], fontsize=20)
ax_hr.set_xlabel('Hour', fontsize=20)
ax_hr.set_ylabel('Coefficient', fontsize=20);

```
We compare the fitted values of the two models.
The fitted values are stored in the `fittedvalues` attribute
returned by the `fit()` method for both the linear regression and the Poisson
fits. The linear predictors are stored as the attribute `lin_pred`.

```{python}
fig, ax = subplots(figsize=(8, 8))
ax.scatter(M2_lm.fittedvalues,
           M_pois.fittedvalues,
           s=20)
ax.set_xlabel('Linear Regression Fit', fontsize=20)
ax.set_ylabel('Poisson Regression Fit', fontsize=20)
ax.axline([0,0], c='black', linewidth=3,
          linestyle='--', slope=1);

```

The predictions from the Poisson regression model are correlated with
those from the linear model; however, the former are non-negative. As
a result the Poisson regression predictions tend to be larger than
those from the linear model for either very low or very high levels of
ridership.

In this section, we fit Poisson regression models using the `sm.GLM()` function with the argument
`family=sm.families.Poisson()`. Earlier in this lab we used the `sm.GLM()` function
with `family=sm.families.Binomial()` to perform logistic regression. Other
choices for the `family` argument can be used to fit other types
of GLMs. For instance, `family=sm.families.Gamma()` fits a Gamma regression
model.


diff --git a/Ch05-resample-lab.Rmd b/Ch05-resample-lab.Rmd
deleted file mode 100644
index 0b4754a..0000000
--- a/Ch05-resample-lab.Rmd
+++ /dev/null
@@ -1,556 +0,0 @@

# Cross-Validation and the Bootstrap

Open In Colab

[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/intro-stat-learning/ISLP_labs/v2.2.1?labpath=Ch05-resample-lab.ipynb)

In this lab, we explore the resampling techniques covered in this
chapter. Some of the commands in this lab may take a while to run on
your computer.

We again begin by placing most of our imports at this top level.

```{python}
import numpy as np
import statsmodels.api as sm
from ISLP import load_data
from ISLP.models import (ModelSpec as MS,
                         summarize,
                         poly)
from sklearn.model_selection import train_test_split

```

There are several new imports needed for this lab.

```{python}
from functools import partial
from sklearn.model_selection import \
     (cross_validate,
      KFold,
      ShuffleSplit)
from sklearn.base import clone
from ISLP.models import sklearn_sm

```

## The Validation Set Approach
We explore the use of the validation set approach in order to estimate
the test error rates that result from fitting various linear models on
the `Auto` data set.

We use the function `train_test_split()` to split
the data into training and validation sets.
As there are 392 observations,
we split into two equal sets of size 196 using the
argument `test_size=196`. It is generally a good idea to set a random seed
when performing operations like this that contain an
element of randomness, so that the results obtained can be reproduced
precisely at a later time. We set the random seed of the splitter
with the argument `random_state=0`.

```{python}
Auto = load_data('Auto')
Auto_train, Auto_valid = train_test_split(Auto,
                                          test_size=196,
                                          random_state=0)

```

Now we can fit a linear regression using only the observations corresponding to the training set `Auto_train`.

```{python}
hp_mm = MS(['horsepower'])
X_train = hp_mm.fit_transform(Auto_train)
y_train = Auto_train['mpg']
model = sm.OLS(y_train, X_train)
results = model.fit()

```

We now use the `predict()` method of `results` evaluated on the model matrix for this model
created using the validation data set. We also calculate the validation MSE of our model.

```{python}
X_valid = hp_mm.transform(Auto_valid)
y_valid = Auto_valid['mpg']
valid_pred = results.predict(X_valid)
np.mean((y_valid - valid_pred)**2)

```

Hence our estimate for the validation MSE of the linear regression
fit is $23.62$.

We can also estimate the validation error for
higher-degree polynomial regressions. We first provide a function `evalMSE()` that takes a model string as well
as training and test sets and returns the MSE on the test set.

```{python}
def evalMSE(terms,
            response,
            train,
            test):

    mm = MS(terms)
    X_train = mm.fit_transform(train)
    y_train = train[response]

    X_test = mm.transform(test)
    y_test = test[response]

    results = sm.OLS(y_train, X_train).fit()
    test_pred = results.predict(X_test)

    return np.mean((y_test - test_pred)**2)

```

Let’s use this function to estimate the validation MSE
using linear, quadratic and cubic fits. We use the `enumerate()` function
here, which gives both the values and indices of objects as one iterates
over a for loop.

```{python}
MSE = np.zeros(3)
for idx, degree in enumerate(range(1, 4)):
    MSE[idx] = evalMSE([poly('horsepower', degree)],
                       'mpg',
                       Auto_train,
                       Auto_valid)
MSE

```

These error rates are $23.62, 18.76$, and $18.80$, respectively. If we
choose a different training/validation split instead, then we
can expect somewhat different errors on the validation set.

```{python}
Auto_train, Auto_valid = train_test_split(Auto,
                                          test_size=196,
                                          random_state=3)
MSE = np.zeros(3)
for idx, degree in enumerate(range(1, 4)):
    MSE[idx] = evalMSE([poly('horsepower', degree)],
                       'mpg',
                       Auto_train,
                       Auto_valid)
MSE
```

Using this split of the observations into a training set and a validation set,
we find that the validation set error rates for the models with linear, quadratic, and cubic terms are $20.76$, $16.95$, and $16.97$, respectively.

These results are consistent with our previous findings: a model that
predicts `mpg` using a quadratic function of `horsepower`
performs better than a model that involves only a linear function of
`horsepower`, and there is no evidence of an improvement in using a cubic function of `horsepower`.


## Cross-Validation
In theory, the cross-validation estimate can be computed for any generalized
linear model.
In practice, however, the simplest way to cross-validate in
Python is to use `sklearn`, which has a different interface or API
than `statsmodels`, the code we have been using to fit GLMs.
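For instance, the same simple regression of `mpg` on `horsepower` looks quite different in the two interfaces. Here is a brief sketch of the `sklearn` style, using `LinearRegression` (an estimator not otherwise needed in this lab):

```{python}
# sklearn bundles fitting and prediction into one estimator object:
# instantiate, then fit(X, y); estimates live in fitted attributes.
from sklearn.linear_model import LinearRegression
skl_lm = LinearRegression()
skl_lm.fit(Auto[['horsepower']], Auto['mpg'])
skl_lm.intercept_, skl_lm.coef_

```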
This is a problem which often confronts data scientists: "I have a function to do task $A$, and need to feed it into something that performs task $B$, so that I can compute $B(A(D))$, where $D$ is my data." When $A$ and $B$ don’t naturally speak to each other, this
requires the use of a *wrapper*.
In the `ISLP` package,
we provide
a wrapper, `sklearn_sm()`, that enables us to easily use the cross-validation tools of `sklearn` with
models fit by `statsmodels`.

The class `sklearn_sm()`
has as its first argument
a model from `statsmodels`. It can take two additional
optional arguments: `model_str` which can be
used to specify a formula, and `model_args` which should
be a dictionary of additional arguments used when fitting
the model. For example, to fit a logistic regression model
we have to specify a `family` argument. This
is passed as `model_args={'family':sm.families.Binomial()}`.

Here is our wrapper in action:

```{python}
hp_model = sklearn_sm(sm.OLS,
                      MS(['horsepower']))
X, Y = Auto.drop(columns=['mpg']), Auto['mpg']
cv_results = cross_validate(hp_model,
                            X,
                            Y,
                            cv=Auto.shape[0])
cv_err = np.mean(cv_results['test_score'])
cv_err

```
The arguments to `cross_validate()` are as follows: an
object with the appropriate `fit()`, `predict()`,
and `score()` methods, an
array of features `X` and a response `Y`.
We also included an additional argument `cv` to `cross_validate()`; specifying an integer
$k$ results in $k$-fold cross-validation. We have provided a value
corresponding to the total number of observations, which results in
leave-one-out cross-validation (LOOCV). The `cross_validate()` function produces a dictionary with several components;
we simply want the cross-validated test score here (MSE), which is estimated to be 24.23.


We can repeat this procedure for increasingly complex polynomial fits.
To automate the process, we again
use a for loop which iteratively fits polynomial
regressions of degree 1 to 5, computes the
associated cross-validation error, and stores it in the $i$th element
of the vector `cv_error`. The variable `d` in the for loop
corresponds to the degree of the polynomial. We begin by initializing the
vector. This command may take a couple of seconds to run.

```{python}
cv_error = np.zeros(5)
H = np.array(Auto['horsepower'])
M = sklearn_sm(sm.OLS)
for i, d in enumerate(range(1,6)):
    X = np.power.outer(H, np.arange(d+1))
    M_CV = cross_validate(M,
                          X,
                          Y,
                          cv=Auto.shape[0])
    cv_error[i] = np.mean(M_CV['test_score'])
cv_error

```
As in Figure 5.4, we see a sharp drop in the estimated test MSE between the linear and
quadratic fits, but then no clear improvement from using higher-degree polynomials.

Above we introduced the `outer()` method of the `np.power()`
function. The `outer()` method is applied to an operation
that has two arguments, such as `add()`, `min()`, or
`power()`.
It has two arrays as
arguments, and then forms a larger
array where the operation is applied to each pair of elements of the
two arrays.

```{python}
A = np.array([3, 5, 9])
B = np.array([2, 4])
np.add.outer(A, B)

```

In the CV example above, we used $k=n$, but of course we can also use $k<n$.

diff --git a/Ch06-varselect-lab.Rmd b/Ch06-varselect-lab.Rmd
deleted file mode 100644
--- a/Ch06-varselect-lab.Rmd
+++ /dev/null
@@ -1,928 +0,0 @@

# Linear Models and Regularization Methods

Open In Colab

[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/intro-stat-learning/ISLP_labs/v2.2.1?labpath=Ch06-varselect-lab.ipynb)

In this lab we implement many of the techniques discussed in this chapter.
We import some of our libraries at this top
level.
- -```{python} -import numpy as np -import pandas as pd -from matplotlib.pyplot import subplots -from statsmodels.api import OLS -import sklearn.model_selection as skm -import sklearn.linear_model as skl -from sklearn.preprocessing import StandardScaler -from ISLP import load_data -from ISLP.models import ModelSpec as MS -from functools import partial - -``` - -We again collect the new imports -needed for this lab. Readers will have installed `l0bnb` when installing the requirements. - -```{python} -from sklearn.pipeline import Pipeline -from sklearn.decomposition import PCA -from sklearn.cross_decomposition import PLSRegression -from ISLP.models import \ - (Stepwise, - sklearn_selected, - sklearn_selection_path) -from l0bnb import fit_path - -``` - -Using `skl.ElasticNet` to fit ridge regression -throws up many warnings. We will suppress them below by a call to `warnings.simplefilter()`. - -```{python} -import warnings -warnings.simplefilter("ignore") -``` - -## Subset Selection Methods -Here we implement methods that reduce the number of parameters in a -model by restricting the model to a subset of the input variables. - -### Forward Selection - -We will apply the forward-selection approach to the `Hitters` -data. We wish to predict a baseball player’s `Salary` on the -basis of various statistics associated with performance in the -previous year. - -First of all, we note that the `Salary` variable is missing for -some of the players. The `np.isnan()` function can be used to -identify the missing observations. It returns an array -of the same shape as the input vector, with a `True` for any elements that -are missing, and a `False` for non-missing elements. The -`sum()` method can then be used to count all of the -missing elements. - -```{python} -Hitters = load_data('Hitters') -np.isnan(Hitters['Salary']).sum() - -``` - - We see that `Salary` is missing for 59 players. The -`dropna()` method of data frames removes all of the rows that have missing -values in any variable (by default --- see `Hitters.dropna?`). - -```{python} -Hitters = Hitters.dropna(); -Hitters.shape - -``` - - -We first choose the best model using forward selection based on $C_p$ (6.2). This score -is not built in as a metric to `sklearn`. We therefore define a function to compute it ourselves, and use -it as a scorer. By default, `sklearn` tries to maximize a score, hence - our scoring function computes the negative $C_p$ statistic. - -```{python} -def nCp(sigma2, estimator, X, Y): - "Negative Cp statistic" - n, p = X.shape - Yhat = estimator.predict(X) - RSS = np.sum((Y - Yhat)**2) - return -(RSS + 2 * p * sigma2) / n - -``` -We need to estimate the residual variance $\sigma^2$, which is the first argument in our scoring function above. -We will fit the biggest model, using all the variables, and estimate $\sigma^2$ based on its MSE. - -```{python} -design = MS(Hitters.columns.drop('Salary')).fit(Hitters) -Y = np.array(Hitters['Salary']) -X = design.transform(Hitters) -sigma2 = OLS(Y,X).fit().scale - -``` - -The function `sklearn_selected()` expects a scorer with just three arguments --- the last three in the definition of `nCp()` above. We use the function `partial()` first seen in Section 5.3.3 to freeze the first argument with our estimate of $\sigma^2$. - -```{python} -neg_Cp = partial(nCp, sigma2) - -``` -We can now use `neg_Cp()` as a scorer for model selection. - - - -Along with a score we need to specify the search strategy. This is done through the object -`Stepwise()` in the `ISLP.models` package. 
The method `Stepwise.first_peak()`
runs forward stepwise until any further additions to the model do not result
in an improvement in the evaluation score. Similarly, the method `Stepwise.fixed_steps()`
runs a fixed number of steps of stepwise search.

```{python}
strategy = Stepwise.first_peak(design,
                               direction='forward',
                               max_terms=len(design.terms))

```

We now fit a linear regression model with `Salary` as outcome using forward
selection. To do so, we use the function `sklearn_selected()` from the `ISLP.models` package. This takes
a model from `statsmodels` along with a search strategy and selects a model with its
`fit` method. Without specifying a `scoring` argument, the score defaults to MSE, and so all 19 variables will be
selected.

```{python}
hitters_MSE = sklearn_selected(OLS,
                               strategy)
hitters_MSE.fit(Hitters, Y)
hitters_MSE.selected_state_

```

Using `neg_Cp` results in a smaller model, as expected, with just 10 variables selected.

```{python}
hitters_Cp = sklearn_selected(OLS,
                              strategy,
                              scoring=neg_Cp)
hitters_Cp.fit(Hitters, Y)
hitters_Cp.selected_state_

```

### Choosing Among Models Using the Validation Set Approach and Cross-Validation

As an alternative to using $C_p$, we might try cross-validation to select a model in forward selection. For this, we need a
method that stores the full path of models found in forward selection, and allows predictions for each of these. This can be done with the `sklearn_selection_path()`
estimator from `ISLP.models`. The function `cross_val_predict()` from `sklearn.model_selection`
computes the cross-validated predictions for each of the models
along the path, which we can use to evaluate the cross-validated MSE
along the path.


Here we define a strategy that fits the full forward selection path.
While there are various parameter choices for `sklearn_selection_path()`,
we use the defaults here, which select the model at each step based on the biggest reduction in RSS.

```{python}
strategy = Stepwise.fixed_steps(design,
                                len(design.terms),
                                direction='forward')
full_path = sklearn_selection_path(OLS, strategy)

```

We now fit the full forward-selection path on the `Hitters` data and compute the fitted values.

```{python}
full_path.fit(Hitters, Y)
Yhat_in = full_path.predict(Hitters)
Yhat_in.shape

```


This gives us an array of fitted values --- 20 steps in all, including the fitted mean for the null model --- which we can use to evaluate
in-sample MSE. As expected, the in-sample MSE improves with each step we take,
indicating we must use either the validation or cross-validation
approach to select the number of steps. We fix the y-axis to range from
50,000 to 250,000 to allow comparison with the cross-validation and validation
set MSE below, as well as with other methods such as ridge regression, the lasso and
principal components regression.

```{python}
mse_fig, ax = subplots(figsize=(8,8))
insample_mse = ((Yhat_in - Y[:,None])**2).mean(0)
n_steps = insample_mse.shape[0]
ax.plot(np.arange(n_steps),
        insample_mse,
        'k', # color black
        label='In-sample')
ax.set_ylabel('MSE',
              fontsize=20)
ax.set_xlabel('# steps of forward stepwise',
              fontsize=20)
ax.set_xticks(np.arange(n_steps)[::2])
ax.legend()
ax.set_ylim([50000,250000]);

```
Notice the expression `None` in `Y[:,None]` above.
This adds an axis (dimension) to the one-dimensional array `Y`,
which allows it to be recycled when subtracted from the two-dimensional `Yhat_in`.
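As a minimal illustration of this broadcasting rule (a standalone sketch with made-up arrays, not part of the lab's analysis), note how adding the axis lets a vector be compared against every column of a matrix:

```{python}
import numpy as np
v = np.array([1., 2., 3.])   # shape (3,)
M = np.zeros((3, 4))         # shape (3, 4)
# v[:,None] has shape (3, 1); numpy broadcasts it across the
# 4 columns, so each column of M is compared to the same vector
(M - v[:,None]).shape        # (3, 4)

```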


We are now ready to use cross-validation to estimate test error along
the model path. We must use *only the training observations* to perform all aspects of model-fitting --- including
variable selection. Therefore, the determination of which model of a
given size is best must be made using *only the training
observations* in each training fold. This point is subtle but important. If the full data
set is used to select the best subset at each step, then the validation
set errors and cross-validation errors that we obtain will not be
accurate estimates of the test error.

We now compute the cross-validated predicted values using 5-fold cross-validation.

```{python}
K = 5
kfold = skm.KFold(K,
                  random_state=0,
                  shuffle=True)
Yhat_cv = skm.cross_val_predict(full_path,
                                Hitters,
                                Y,
                                cv=kfold)
Yhat_cv.shape

```
The prediction matrix `Yhat_cv` is the same shape as `Yhat_in`; the difference is that the predictions in each row, corresponding to a particular sample index, were made from models fit on a training fold that did not include that row.

At each model along the path, we compute the MSE in each of the cross-validation folds.
We average these to get the mean MSE, and also use the individual values to compute a crude estimate of the standard error of the mean. {The estimate is crude because the five error estimates are based on overlapping training sets, and hence are not independent.}
Hence we must know the test indices for each cross-validation
split. These can be found by using the `split()` method of `kfold`. Because
we fixed the random state above, whenever we split any array with the same
number of rows as $Y$ we recover the same training and test indices, though we simply
ignore the training indices below.

```{python}
cv_mse = []
for train_idx, test_idx in kfold.split(Y):
    errors = (Yhat_cv[test_idx] - Y[test_idx,None])**2
    cv_mse.append(errors.mean(0)) # column means
cv_mse = np.array(cv_mse).T
cv_mse.shape

```

We now add the cross-validation error estimates to our MSE plot.
We include the mean error across the five folds, and the estimate of the standard error of the mean.

```{python}
ax.errorbar(np.arange(n_steps),
            cv_mse.mean(1),
            cv_mse.std(1) / np.sqrt(K),
            label='Cross-validated',
            c='r') # color red
ax.set_ylim([50000,250000])
ax.legend()
mse_fig

```

To repeat the above using the validation set approach, we simply change our
`cv` argument to a validation set: one random split of the data into a training set and a test set. We choose a test size
of 20%, similar to the size of each test set in 5-fold cross-validation.

```{python}
validation = skm.ShuffleSplit(n_splits=1,
                              test_size=0.2,
                              random_state=0)
for train_idx, test_idx in validation.split(Y):
    full_path.fit(Hitters.iloc[train_idx],
                  Y[train_idx])
    Yhat_val = full_path.predict(Hitters.iloc[test_idx])
    errors = (Yhat_val - Y[test_idx,None])**2
    validation_mse = errors.mean(0)

```
As for the in-sample MSE case, the validation set approach does not provide standard errors.

```{python}
ax.plot(np.arange(n_steps),
        validation_mse,
        'b--', # color blue, broken line
        label='Validation')
ax.set_xticks(np.arange(n_steps)[::2])
ax.set_ylim([50000,250000])
ax.legend()
mse_fig

```


### Best Subset Selection
Forward stepwise is a *greedy* selection procedure; at each step it augments the current set by including one additional variable.
We now apply best subset selection to the `Hitters`
data; for each subset size, it searches for the best set of predictors.

We will use a package called `l0bnb` to perform
best subset selection.
Instead of constraining the subset to be a given size,
this package produces a path of solutions using the subset size as a
penalty rather than a constraint. Although the distinction is subtle, the difference comes when we cross-validate.


```{python}
D = design.fit_transform(Hitters)
D = D.drop('intercept', axis=1)
X = np.asarray(D)

```
Here we excluded the first column corresponding to the intercept, as
`l0bnb` will fit the intercept separately. We can find a path using the `fit_path()` function.

```{python}
path = fit_path(X,
                Y,
                max_nonzeros=X.shape[1])

```

The function `fit_path()` returns a list whose values include the fitted coefficients as `B`, an intercept as `B0`, as well as a few other attributes related to the particular path algorithm used. Such details are beyond the scope of this book.

```{python}
path[3]

```
In the example above, we see that at the fourth step in the path, we have two nonzero coefficients in `'B'`, corresponding to the value $0.0114$ for the penalty parameter `lambda_0`.
We could make predictions using this sequence of fits on a validation set as a function of `lambda_0`, or with more work using cross-validation.

## Ridge Regression and the Lasso
We will use the `sklearn.linear_model` package (for which
we use `skl` as shorthand below)
to fit ridge and lasso regularized linear models on the `Hitters` data.
We start with the model matrix `X` (without an intercept) that we computed in the previous section on best subset regression.

### Ridge Regression
We will use the function `skl.ElasticNet()` to fit both ridge and the lasso.
To fit a *path* of ridge regression models, we use
`skl.ElasticNet.path()`, which can fit both ridge and lasso, as well as a hybrid mixture; ridge regression
corresponds to `l1_ratio=0`.
It is good practice to standardize the columns of `X` in these applications, if the variables are measured in different units. Since `skl.ElasticNet()` does no normalization, we have to take care of that ourselves.
Because we
standardize first, in order to find coefficient
estimates on the original scale we must *unstandardize*
the coefficient estimates. The parameter
$\lambda$ in (6.5) and (6.7) is called `alphas` in `sklearn`. In order to
be consistent with the rest of this chapter, we use `lambdas`
rather than `alphas` in what follows. {At the time of publication, ridge fits like the one in code chunk [23] issue unwarranted convergence warning messages; we suppressed these when we filtered the warnings above.}

```{python}
Xs = X - X.mean(0)[None,:]
X_scale = X.std(0)
Xs = Xs / X_scale[None,:]
lambdas = 10**np.linspace(8, -2, 100) / Y.std()
soln_array = skl.ElasticNet.path(Xs,
                                 Y,
                                 l1_ratio=0.,
                                 alphas=lambdas)[1]
soln_array.shape

```
Here we extract the array of coefficients corresponding to the solutions along the regularization path.
By default the `skl.ElasticNet.path` method fits a path along
an automatically selected range of $\lambda$ values, except for the case when
`l1_ratio=0`, which results in ridge regression (as is the case here). {The reason is rather technical; for all models except ridge, we can find the smallest value of $\lambda$ for which all coefficients are zero.
For ridge this value is $\infty$.} So here
we have chosen to implement the function over a grid of values ranging
from $\lambda=10^{8}$ to $\lambda=10^{-2}$ scaled by the standard
deviation of $y$, essentially covering the full range of scenarios
from the null model containing only the intercept, to the least
squares fit.

Associated with each value of $\lambda$ is a vector of ridge
regression coefficients, which can be accessed as
a column of `soln_array`. In this case, `soln_array` is a $19 \times 100$ matrix, with
19 rows (one for each predictor) and 100
columns (one for each value of $\lambda$).

We transpose this matrix and turn it into a data frame to facilitate viewing and plotting.

```{python}
soln_path = pd.DataFrame(soln_array.T,
                         columns=D.columns,
                         index=-np.log(lambdas))
soln_path.index.name = 'negative log(lambda)'
soln_path

```

We plot the paths to get a sense of how the coefficients vary with $\lambda$.
To control the location of the legend we first set `legend` to `False` in the
plot method, adding it afterward with the `legend()` method of `ax`.

```{python}
path_fig, ax = subplots(figsize=(8,8))
soln_path.plot(ax=ax, legend=False)
ax.set_xlabel('$-\log(\lambda)$', fontsize=20)
ax.set_ylabel('Standardized coefficients', fontsize=20)
ax.legend(loc='upper left');

```
(We have used `latex` formatting in the horizontal label, in order to format the Greek $\lambda$ appropriately.)
We expect the coefficient estimates to be much smaller, in terms of
$\ell_2$ norm, when a large value of $\lambda$ is used, as compared to
when a small value of $\lambda$ is used. (Recall that the $\ell_2$ norm is the square root of the sum of squared coefficient values.) We display the coefficients at the $40$th step,
where $\lambda$ is 25.535.

```{python}
beta_hat = soln_path.loc[soln_path.index[39]]
lambdas[39], beta_hat

```

Let’s compute the $\ell_2$ norm of the standardized coefficients.

```{python}
np.linalg.norm(beta_hat)

```

In contrast, here is the $\ell_2$ norm when $\lambda$ is 2.44e-01.
Note the much larger $\ell_2$ norm of the
coefficients associated with this smaller value of $\lambda$.

```{python}
beta_hat = soln_path.loc[soln_path.index[59]]
lambdas[59], np.linalg.norm(beta_hat)

```
Above we normalized `X` upfront, and fit the ridge model using `Xs`.
The `Pipeline()` object
in `sklearn` provides a clear way to separate feature
normalization from the fitting of the ridge model itself.

```{python}
ridge = skl.ElasticNet(alpha=lambdas[59], l1_ratio=0)
scaler = StandardScaler(with_mean=True, with_std=True)
pipe = Pipeline(steps=[('scaler', scaler), ('ridge', ridge)])
pipe.fit(X, Y)

```

We show that it gives the same $\ell_2$ norm as in our previous fit on the standardized data.

```{python}
np.linalg.norm(ridge.coef_)

```
Notice that the operation `pipe.fit(X, Y)` above has changed the `ridge` object, and in particular has added attributes such as `coef_` that were not there before.

### Estimating Test Error of Ridge Regression
Choosing an *a priori* value of $\lambda$ for ridge regression is
difficult, if not impossible. We will want to use the validation method
or cross-validation to select the tuning parameter. The reader may not
be surprised that the `Pipeline()` approach can be used in
`skm.cross_validate()` with either a validation method
(i.e. `validation`) or $k$-fold cross-validation.

We fix the random state of the splitter
so that the results obtained will be reproducible.

```{python}
validation = skm.ShuffleSplit(n_splits=1,
                              test_size=0.5,
                              random_state=0)
ridge.alpha = 0.01
results = skm.cross_validate(ridge,
                             X,
                             Y,
                             scoring='neg_mean_squared_error',
                             cv=validation)
-results['test_score']

```

The test MSE is 1.342e+05. Note
that if we had instead simply fit a model with just an intercept, we
would have predicted each test observation using the mean of the
training observations. We can get the same result by fitting a ridge regression model
with a *very* large value of $\lambda$. Note that `1e10`
means $10^{10}$.

```{python}
ridge.alpha = 1e10
results = skm.cross_validate(ridge,
                             X,
                             Y,
                             scoring='neg_mean_squared_error',
                             cv=validation)
-results['test_score']

```
Obviously choosing $\lambda=0.01$ is arbitrary, so we will use cross-validation or the validation-set
approach to choose the tuning parameter $\lambda$.
The object `GridSearchCV()` allows exhaustive
grid search to choose such a parameter.

We first use the validation set method
to choose $\lambda$.

```{python}
param_grid = {'ridge__alpha': lambdas}
grid = skm.GridSearchCV(pipe,
                        param_grid,
                        cv=validation,
                        scoring='neg_mean_squared_error')
grid.fit(X, Y)
grid.best_params_['ridge__alpha']
grid.best_estimator_

```

Alternatively, we can use 5-fold cross-validation.

```{python}
grid = skm.GridSearchCV(pipe,
                        param_grid,
                        cv=kfold,
                        scoring='neg_mean_squared_error')
grid.fit(X, Y)
grid.best_params_['ridge__alpha']
grid.best_estimator_

```
Recall we set up the `kfold` object for 5-fold cross-validation on page 271. We now plot the cross-validated MSE as a function of $-\log(\lambda)$, which has shrinkage decreasing from left
to right.

```{python}
ridge_fig, ax = subplots(figsize=(8,8))
ax.errorbar(-np.log(lambdas),
            -grid.cv_results_['mean_test_score'],
            yerr=grid.cv_results_['std_test_score'] / np.sqrt(K))
ax.set_ylim([50000,250000])
ax.set_xlabel('$-\log(\lambda)$', fontsize=20)
ax.set_ylabel('Cross-validated MSE', fontsize=20);

```

One can cross-validate different metrics to choose a parameter. The default
metric for `skl.ElasticNet()` is test $R^2$.
Let’s compare $R^2$ to MSE for cross-validation here.

```{python}
grid_r2 = skm.GridSearchCV(pipe,
                           param_grid,
                           cv=kfold)
grid_r2.fit(X, Y)

```

Finally, let’s plot the results for cross-validated $R^2$.

```{python}
r2_fig, ax = subplots(figsize=(8,8))
ax.errorbar(-np.log(lambdas),
            grid_r2.cv_results_['mean_test_score'],
            yerr=grid_r2.cv_results_['std_test_score'] / np.sqrt(K))
ax.set_xlabel('$-\log(\lambda)$', fontsize=20)
ax.set_ylabel('Cross-validated $R^2$', fontsize=20);

```


### Fast Cross-Validation for Solution Paths
The ridge, lasso, and elastic net can be efficiently fit along a sequence of $\lambda$ values, creating what is known as a *solution path* or *regularization path*. Hence there is specialized code to fit
such paths, and to choose a suitable value of $\lambda$ using cross-validation. Even with
identical splits the results will not agree *exactly* with our `grid`
above, because the standardization of each feature in `grid` is carried out within each fold,
while in `pipeCV` below it is carried out only once.
Nevertheless, the results are similar, as the normalization
is relatively stable across folds.

```{python}
ridgeCV = skl.ElasticNetCV(alphas=lambdas,
                           l1_ratio=0,
                           cv=kfold)
pipeCV = Pipeline(steps=[('scaler', scaler),
                         ('ridge', ridgeCV)])
pipeCV.fit(X, Y)

```

Let’s again produce a plot of the cross-validation error to see that
it is similar to the one from `skm.GridSearchCV`.

```{python}
tuned_ridge = pipeCV.named_steps['ridge']
ridgeCV_fig, ax = subplots(figsize=(8,8))
ax.errorbar(-np.log(lambdas),
            tuned_ridge.mse_path_.mean(1),
            yerr=tuned_ridge.mse_path_.std(1) / np.sqrt(K))
ax.axvline(-np.log(tuned_ridge.alpha_), c='k', ls='--')
ax.set_ylim([50000,250000])
ax.set_xlabel('$-\log(\lambda)$', fontsize=20)
ax.set_ylabel('Cross-validated MSE', fontsize=20);

```

We see that the value of $\lambda$ that results in the
smallest cross-validation error is 1.19e-02, available
as the value `tuned_ridge.alpha_`. What is the test MSE
associated with this value of $\lambda$?

```{python}
np.min(tuned_ridge.mse_path_.mean(1))

```

This represents a further improvement over the test MSE that we got
using $\lambda=0.01$ above. Finally, `tuned_ridge.coef_`
has the coefficients fit on the entire data set
at this value of $\lambda$.

```{python}
tuned_ridge.coef_

```
As expected, none of the coefficients are zero --- ridge regression does
not perform variable selection!


### Evaluating Test Error of Cross-Validated Ridge
Choosing $\lambda$ using cross-validation provides a single regression
estimator, similar to fitting a linear regression model as we saw in
Chapter 3. It is therefore reasonable to estimate its test error.
We run into a problem here in that cross-validation will have
*touched* all of its data in choosing $\lambda$; hence we have no
further data to estimate test error. A compromise is to do an initial
split of the data into two disjoint sets: a training set and a test set.
We then fit a cross-validation-tuned
ridge regression on the training set, and evaluate its performance on the test set.
We might call this cross-validation nested
within the validation set approach. A priori there is no reason to use
half of the data for each of the two sets in validation. Below, we use
75% for training and 25% for test, with the estimator being ridge
regression tuned using 5-fold cross-validation. This can be achieved
in code as follows:

```{python}
outer_valid = skm.ShuffleSplit(n_splits=1,
                               test_size=0.25,
                               random_state=1)
inner_cv = skm.KFold(n_splits=5,
                     shuffle=True,
                     random_state=2)
ridgeCV = skl.ElasticNetCV(alphas=lambdas,
                           l1_ratio=0,
                           cv=inner_cv)
pipeCV = Pipeline(steps=[('scaler', scaler),
                         ('ridge', ridgeCV)]);

```

```{python}
results = skm.cross_validate(pipeCV,
                             X,
                             Y,
                             cv=outer_valid,
                             scoring='neg_mean_squared_error')
-results['test_score']

```



### The Lasso
We saw that ridge regression with a wise choice of $\lambda$ can
outperform least squares as well as the null model on the
`Hitters` data set. We now ask whether the lasso can yield
either a more accurate or a more interpretable model than ridge
regression. In order to fit a lasso model, we once again use the
`ElasticNetCV()` function; however, this time we use the argument
`l1_ratio=1`. Other than that change, we proceed just as we did in
fitting a ridge model.

```{python}
lassoCV = skl.ElasticNetCV(n_alphas=100,
                           l1_ratio=1,
                           cv=kfold)
pipeCV = Pipeline(steps=[('scaler', scaler),
                         ('lasso', lassoCV)])
pipeCV.fit(X, Y)
tuned_lasso = pipeCV.named_steps['lasso']
tuned_lasso.alpha_

```


```{python}
lambdas, soln_array = skl.Lasso.path(Xs,
                                     Y,
                                     l1_ratio=1,
                                     n_alphas=100)[:2]
soln_path = pd.DataFrame(soln_array.T,
                         columns=D.columns,
                         index=-np.log(lambdas))

```
We can see from the coefficient plot of the standardized coefficients that, depending on the choice of
tuning parameter, some of the coefficients will be exactly equal to
zero.

```{python}
path_fig, ax = subplots(figsize=(8,8))
soln_path.plot(ax=ax, legend=False)
ax.legend(loc='upper left')
ax.set_xlabel('$-\log(\lambda)$', fontsize=20)
ax.set_ylabel('Standardized coefficients', fontsize=20);

```
The smallest cross-validated error is lower than the test set MSE of the null model
and of least squares, and very similar to the test MSE of 115526.71 of ridge
regression (page 278) with $\lambda$ chosen by cross-validation.

```{python}
np.min(tuned_lasso.mse_path_.mean(1))

```

Let’s again produce a plot of the cross-validation error.


```{python}
lassoCV_fig, ax = subplots(figsize=(8,8))
ax.errorbar(-np.log(tuned_lasso.alphas_),
            tuned_lasso.mse_path_.mean(1),
            yerr=tuned_lasso.mse_path_.std(1) / np.sqrt(K))
ax.axvline(-np.log(tuned_lasso.alpha_), c='k', ls='--')
ax.set_ylim([50000,250000])
ax.set_xlabel('$-\log(\lambda)$', fontsize=20)
ax.set_ylabel('Cross-validated MSE', fontsize=20);
```

However, the lasso has a substantial advantage over ridge regression
in that the resulting coefficient estimates are sparse. Here we see
that 6 of the 19 coefficient estimates are exactly zero. So the lasso
model with $\lambda$ chosen by cross-validation contains only 13
variables.

```{python}
tuned_lasso.coef_

```

As in ridge regression, we could evaluate the test error
of the cross-validated lasso by first splitting into
training and test sets and internally running
cross-validation on the training set. We leave
this as an exercise.


## PCR and PLS Regression

### Principal Components Regression


Principal components regression (PCR) can be performed using
`PCA()` from the `sklearn.decomposition`
module. We now apply PCR to the `Hitters` data, in order to
predict `Salary`. Again, ensure that the missing values have
been removed from the data, as described in Section 6.5.1.

We use `LinearRegression()` to fit the regression model
here. Note that it fits an intercept by default, unlike
the `OLS()` function seen earlier in Section 6.5.1.

```{python}
pca = PCA(n_components=2)
linreg = skl.LinearRegression()
pipe = Pipeline([('pca', pca),
                 ('linreg', linreg)])
pipe.fit(X, Y)
pipe.named_steps['linreg'].coef_

```

When performing PCA, the results vary depending
on whether the data has been *standardized* or not.
As in the earlier examples, this can be accomplished
by including an additional step in the pipeline.

```{python}
pipe = Pipeline([('scaler', scaler),
                 ('pca', pca),
                 ('linreg', linreg)])
pipe.fit(X, Y)
pipe.named_steps['linreg'].coef_

```

We can of course use CV to choose the number of components, by
using `skm.GridSearchCV`, in this
case setting up a parameter grid that varies
`n_components`.

```{python}
param_grid = {'pca__n_components': range(1, 20)}
grid = skm.GridSearchCV(pipe,
                        param_grid,
                        cv=kfold,
                        scoring='neg_mean_squared_error')
grid.fit(X, Y)

```

Let’s plot the results as we have for other methods.

```{python}
pcr_fig, ax = subplots(figsize=(8,8))
n_comp = param_grid['pca__n_components']
ax.errorbar(n_comp,
            -grid.cv_results_['mean_test_score'],
            grid.cv_results_['std_test_score'] / np.sqrt(K))
ax.set_ylabel('Cross-validated MSE', fontsize=20)
ax.set_xlabel('# principal components', fontsize=20)
ax.set_xticks(n_comp[::2])
ax.set_ylim([50000,250000]);

```

We see that the smallest cross-validation error occurs when
17
components are used. However, from the plot we also see that the
cross-validation error is roughly the same when only one component is
included in the model. This suggests that a model that uses just a
small number of components might suffice.

The CV score is provided for each possible number of components from
1 to 19 inclusive. The `PCA()` method complains
if we try to fit an intercept-only model with `n_components=0`,
so we also compute the MSE for just the null model with
these splits.

```{python}
Xn = np.zeros((X.shape[0], 1))
cv_null = skm.cross_validate(linreg,
                             Xn,
                             Y,
                             cv=kfold,
                             scoring='neg_mean_squared_error')
-cv_null['test_score'].mean()

```


The `explained_variance_ratio_`
attribute of our `PCA` object provides the *percentage of variance explained* in the predictors using
different numbers of components. This concept is discussed in greater
detail in Section 12.2.

```{python}
pipe.named_steps['pca'].explained_variance_ratio_

```

Briefly, we can think of
this as the amount of information about the predictors
that is captured using $M$ principal components. For example, setting
$M=1$ only captures 38.31% of the variance, while $M=2$ captures an additional 21.84%, for a total of 60.15% of the variance.
By $M=6$ it increases to
88.63%. Beyond this the increments continue to diminish, until we use all $M=p=19$ components, which capture 100% of the variance.




### Partial Least Squares
Partial least squares (PLS) is implemented in the
`PLSRegression()` function.



```{python}
pls = PLSRegression(n_components=2,
                    scale=True)
pls.fit(X, Y)

```

As was the case in PCR, we will want to
use CV to choose the number of components.

```{python}
param_grid = {'n_components':range(1, 20)}
grid = skm.GridSearchCV(pls,
                        param_grid,
                        cv=kfold,
                        scoring='neg_mean_squared_error')
grid.fit(X, Y)

```

As for our other methods, we plot the MSE.

```{python}
pls_fig, ax = subplots(figsize=(8,8))
n_comp = param_grid['n_components']
ax.errorbar(n_comp,
            -grid.cv_results_['mean_test_score'],
            grid.cv_results_['std_test_score'] / np.sqrt(K))
ax.set_ylabel('Cross-validated MSE', fontsize=20)
ax.set_xlabel('# PLS components', fontsize=20)
ax.set_xticks(n_comp[::2])
ax.set_ylim([50000,250000]);

```

CV error is minimized with 12 components,
though there is little noticeable difference between this point and a much lower number like 2 or 3 components.
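Since a two- or three-component model looks competitive, a natural follow-up (a sketch using the objects defined above, not part of the original lab) is to cross-validate a two-component PLS fit directly and compare its MSE to the minimum above.

```{python}
pls2 = PLSRegression(n_components=2, scale=True)
cv2 = skm.cross_validate(pls2,
                         X,
                         Y,
                         cv=kfold,
                         scoring='neg_mean_squared_error')
# mean cross-validated MSE for the two-component fit
-cv2['test_score'].mean()

```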
- - diff --git a/Ch07-nonlin-lab.Rmd b/Ch07-nonlin-lab.Rmd deleted file mode 100644 index 1af6882..0000000 --- a/Ch07-nonlin-lab.Rmd +++ /dev/null @@ -1,900 +0,0 @@ - - -# Non-Linear Modeling - - -Open In Colab - - -[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/intro-stat-learning/ISLP_labs/v2.2.1?labpath=Ch07-nonlin-lab.ipynb) - - -In this lab, we demonstrate some of the nonlinear models discussed in -this chapter. We use the `Wage` data as a running example, and show that many of the complex non-linear fitting procedures discussed can easily be implemented in `Python`. - -As usual, we start with some of our standard imports. - -```{python} -import numpy as np, pandas as pd -from matplotlib.pyplot import subplots -import statsmodels.api as sm -from ISLP import load_data -from ISLP.models import (summarize, - poly, - ModelSpec as MS) -from statsmodels.stats.anova import anova_lm -``` - -We again collect the new imports -needed for this lab. Many of these are developed specifically for the -`ISLP` package. - -```{python} -from pygam import (s as s_gam, - l as l_gam, - f as f_gam, - LinearGAM, - LogisticGAM) - -from ISLP.transforms import (BSpline, - NaturalSpline) -from ISLP.models import bs, ns -from ISLP.pygam import (approx_lam, - degrees_of_freedom, - plot as plot_gam, - anova as anova_gam) - -``` - -## Polynomial Regression and Step Functions -We start by demonstrating how Figure 7.1 can be reproduced. -Let's begin by loading the data. - -```{python} -Wage = load_data('Wage') -y = Wage['wage'] -age = Wage['age'] - -``` - -Throughout most of this lab, our response is `Wage['wage']`, which -we have stored as `y` above. -As in Section 3.6.6, we will use the `poly()` function to create a model matrix -that will fit a $4$th degree polynomial in `age`. - -```{python} -poly_age = MS([poly('age', degree=4)]).fit(Wage) -M = sm.OLS(y, poly_age.transform(Wage)).fit() -summarize(M) - -``` - - -This polynomial is constructed using the function `poly()`, -which creates -a special *transformer* `Poly()` (using `sklearn` terminology -for feature transformations such as `PCA()` seen in Section 6.5.3) which -allows for easy evaluation of the polynomial at new data points. Here `poly()` is referred to as a *helper* function, and sets up the transformation; `Poly()` is the actual workhorse that computes the transformation. See also -the -discussion of transformations on -page 118. - -In the code above, the first line executes the `fit()` method -using the dataframe -`Wage`. This recomputes and stores as attributes any parameters needed by `Poly()` -on the training data, and these will be used on all subsequent -evaluations of the `transform()` method. For example, it is used -on the second line, as well as in the plotting function developed below. - - - - - - -We now create a grid of values for `age` at which we want -predictions. - -```{python} -age_grid = np.linspace(age.min(), - age.max(), - 100) -age_df = pd.DataFrame({'age': age_grid}) - -``` - -Finally, we wish to plot the data and add the fit from the fourth-degree polynomial. As we will make -several similar plots below, we first write a function -to create all the ingredients and produce the plot. Our function -takes in a model specification (here a basis specified by a -transform), as well as a grid of `age` values. The function -produces a fitted curve as well as 95% confidence bands. 
By using
an argument for `basis` we can produce and plot the results with several different
transforms, such as the splines we will see shortly.

```{python}
def plot_wage_fit(age_df,
                  basis,
                  title):

    X = basis.transform(Wage)
    Xnew = basis.transform(age_df)
    M = sm.OLS(y, X).fit()
    preds = M.get_prediction(Xnew)
    bands = preds.conf_int(alpha=0.05)
    fig, ax = subplots(figsize=(8,8))
    ax.scatter(age,
               y,
               facecolor='gray',
               alpha=0.5)
    for val, ls in zip([preds.predicted_mean,
                        bands[:,0],
                        bands[:,1]],
                       ['b','r--','r--']):
        ax.plot(age_df.values, val, ls, linewidth=3)
    ax.set_title(title, fontsize=20)
    ax.set_xlabel('Age', fontsize=20)
    ax.set_ylabel('Wage', fontsize=20);
    return ax

```
We include an argument `alpha` to `ax.scatter()`
to add some transparency to the points. This provides a visual indication
of density. Notice the use of the `zip()` function in the
`for` loop above (see Section 2.3.8).
We have three lines to plot, each with different colors and line
types. Here `zip()` conveniently bundles these together as
iterators in the loop. {In `Python` speak, an "iterator" is an object with a finite number of values that can be iterated over, as in a loop.}

We now plot the fit of the fourth-degree polynomial using this
function.

```{python}
plot_wage_fit(age_df,
              poly_age,
              'Degree-4 Polynomial');

```


With polynomial regression we must decide on the degree of
the polynomial to use. Sometimes we just wing it, and decide to use
second- or third-degree polynomials, simply to obtain a nonlinear fit. But we can
make such a decision in a more systematic way. One way to do this is through hypothesis
tests, which we demonstrate here. We now fit a series of models ranging from
linear (degree-one) to degree-five polynomials,
and look to determine the simplest model that is sufficient to
explain the relationship between `wage` and `age`. We use the
`anova_lm()` function, which performs a series of ANOVA
tests.
An *analysis of variance* or ANOVA tests the null
hypothesis that a model $\mathcal{M}_1$ is sufficient to explain the
data against the alternative hypothesis that a more complex model
$\mathcal{M}_2$ is required. The determination is based on an F-test.
To perform the test, the models $\mathcal{M}_1$ and $\mathcal{M}_2$ must be *nested*:
the space spanned by the predictors in $\mathcal{M}_1$ must be a subspace of the
space spanned by the predictors in $\mathcal{M}_2$. In this case, we
fit five different polynomial
models and sequentially compare the simpler model to the more complex
model.

```{python}
models = [MS([poly('age', degree=d)])
          for d in range(1, 6)]
Xs = [model.fit_transform(Wage) for model in models]
anova_lm(*[sm.OLS(y, X_).fit()
           for X_ in Xs])

```

Notice the `*` in the `anova_lm()` line above. This
function takes a variable number of non-keyword arguments, in this case fitted models.
When these models are provided as a list (as is done here), it must be
prefixed by `*`.

The p-value comparing the linear `models[0]` to the quadratic
`models[1]` is essentially zero, indicating that a linear
fit is not sufficient. {Indexing starting at zero is confusing for the polynomial degree example, since `models[1]` is quadratic rather than linear!} Similarly, the p-value comparing the quadratic
`models[1]` to the cubic `models[2]` is very low (0.0017), so the
quadratic fit is also insufficient.
The p-value comparing the cubic
and degree-four polynomials, `models[2]` and `models[3]`, is
approximately 5%, while the degree-five polynomial `models[4]` seems
unnecessary because its p-value is 0.37. Hence, either a cubic or a
quartic polynomial appears to provide a reasonable fit to the data, but
lower- or higher-order models are not justified.

In this case, instead of using the `anova_lm()` function, we could
have obtained these p-values more succinctly by exploiting the fact
that `poly()` creates orthogonal polynomials.

```{python}
summarize(M)

```


Notice that the p-values are the same, and in fact the squares of
the t-statistics are equal to the F-statistics from the
`anova_lm()` function; for example:

```{python}
(-11.983)**2

```


However, the ANOVA method works whether or not we used orthogonal
polynomials, provided the models are nested. For example, we can use
`anova_lm()` to compare the following three
models, which all have a linear term in `education` and a
polynomial in `age` of different degrees:

```{python}
models = [MS(['education', poly('age', degree=d)])
          for d in range(1, 4)]
XEs = [model.fit_transform(Wage)
       for model in models]
anova_lm(*[sm.OLS(y, X_).fit() for X_ in XEs])

```


As an alternative to using hypothesis tests and ANOVA, we could choose
the polynomial degree using cross-validation, as discussed in Chapter 5.

Next we consider the task of predicting whether an individual earns
more than $250,000 per year. We proceed much as before, except
that first we create the appropriate response vector, and then apply
the `sm.GLM()` function using the binomial family in order
to fit a polynomial logistic regression model.

```{python}
X = poly_age.transform(Wage)
high_earn = Wage['high_earn'] = y > 250 # shorthand
glm = sm.GLM(y > 250,
             X,
             family=sm.families.Binomial())
B = glm.fit()
summarize(B)

```


Once again, we make predictions using the `get_prediction()` method.

```{python}
newX = poly_age.transform(age_df)
preds = B.get_prediction(newX)
bands = preds.conf_int(alpha=0.05)

```

We now plot the estimated relationship.

```{python}
fig, ax = subplots(figsize=(8,8))
rng = np.random.default_rng(0)
ax.scatter(age +
           0.2 * rng.uniform(size=y.shape[0]),
           np.where(high_earn, 0.198, 0.002),
           fc='gray',
           marker='|')
for val, ls in zip([preds.predicted_mean,
                    bands[:,0],
                    bands[:,1]],
                   ['b','r--','r--']):
    ax.plot(age_df.values, val, ls, linewidth=3)
ax.set_title('Degree-4 Polynomial', fontsize=20)
ax.set_xlabel('Age', fontsize=20)
ax.set_ylim([0,0.2])
ax.set_ylabel('P(Wage > 250)', fontsize=20);

```
We have drawn the `age` values corresponding to the observations with
`wage` values above 250 as gray marks on the top of the plot, and
those with `wage` values below 250 are shown as gray marks on the
bottom of the plot. We added a small amount of noise to jitter
the `age` values a bit so that observations with the same `age`
value do not cover each other up. This type of plot is often called a
*rug plot*.

In order to fit a step function, as discussed in
Section 7.2, we first use the `pd.qcut()`
function to discretize `age` based on quantiles. Then we use `pd.get_dummies()` to create the
columns of the model matrix for this categorical variable. Note that this function will
include *all* columns for a given categorical, rather than the usual approach, which drops one
of the levels.

```{python}
cut_age = pd.qcut(age, 4)
summarize(sm.OLS(y, pd.get_dummies(cut_age)).fit())

```


Here `pd.qcut()` automatically picked the cutpoints based on the quantiles 25%, 50% and 75%, which results in four regions. We could also have specified our own
quantiles directly instead of the argument `4`. For cuts not based
on quantiles we would use the `pd.cut()` function.
The function `pd.qcut()` (and `pd.cut()`) returns an ordered categorical variable, and
`pd.get_dummies()` then creates the set of
dummy variables used in the regression. Since `age` is the only variable in the model, the value $94,158.40 is the average salary for those under 33.75 years of
age, and the other coefficients are the average
salary for those in the other age groups. We can produce
predictions and plots just as we did in the case of the polynomial
fit.


## Splines
In order to fit regression splines, we use transforms
from the `ISLP` package. The actual spline
evaluation functions are in the `scipy.interpolate` package;
we have simply wrapped them as transforms
similar to `Poly()` and `PCA()`.

In Section 7.4, we saw
that regression splines can be fit by constructing an appropriate
matrix of basis functions. The `BSpline()` function generates the
entire matrix of basis functions for splines with the specified set of
knots. By default, the B-splines produced are cubic. To change the degree, use
the argument `degree`.

```{python}
bs_ = BSpline(internal_knots=[25,40,60], intercept=True).fit(age)
bs_age = bs_.transform(age)
bs_age.shape

```
This results in a seven-column matrix, which is what is expected for a cubic-spline basis with 3 interior knots.
We can form this same matrix using the `bs()` object,
which facilitates adding this to a model-matrix builder (as in `poly()` versus its workhorse `Poly()`) described in Section 7.8.1.

We now fit a cubic spline model to the `Wage` data.

```{python}
bs_age = MS([bs('age', internal_knots=[25,40,60])])
Xbs = bs_age.fit_transform(Wage)
M = sm.OLS(y, Xbs).fit()
summarize(M)

```
The column names are a little cumbersome, and have caused us to truncate the printed summary. They can be set on construction using the `name` argument as follows.

```{python}
bs_age = MS([bs('age',
                internal_knots=[25,40,60],
                name='bs(age)')])
Xbs = bs_age.fit_transform(Wage)
M = sm.OLS(y, Xbs).fit()
summarize(M)

```

Notice that there are 6 spline coefficients rather than 7. This is because, by default,
`bs()` assumes `intercept=False`, since we typically have an overall intercept in the model.
So it generates the spline basis with the given knots, and then discards one of the basis functions to account for the intercept.

We could also use the `df` (degrees of freedom) option to
specify the complexity of the spline. We see above that with 3 knots,
the spline basis has 6 columns or degrees of freedom. When we specify
`df=6` rather than the actual knots, `bs()` will produce a
spline with 3 knots chosen at uniform quantiles of the training data.
We can see these chosen knots most easily using `BSpline()` directly:

```{python}
BSpline(df=6).fit(age).internal_knots_

```
When asking for six degrees of freedom,
the transform chooses knots at ages 33.75, 42.0, and 51.0,
which correspond to the 25th, 50th, and 75th percentiles of
`age`.

When using B-splines we need not limit ourselves to cubic polynomials
(i.e. `degree=3`).
For instance, using `degree=0` results
in piecewise constant functions, as in our example with
`pd.qcut()` above.


```{python}
bs_age0 = MS([bs('age',
                 df=3,
                 degree=0)]).fit(Wage)
Xbs0 = bs_age0.transform(Wage)
summarize(sm.OLS(y, Xbs0).fit())

```
This fit should be compared with cell [15], where we used `pd.qcut()`
to create four bins by cutting at the 25%, 50% and 75% quantiles of
`age`. Since we specified `df=3` for degree-zero splines here, there will also be
knots at the same three quantiles. Although the coefficients appear different, we
see that this is a result of the different coding. For example, the
first coefficient is identical in both cases, and is the mean response
in the first bin. For the second coefficient, we have
$94.158 + 22.349 = 116.507 \approx 116.611$, the latter being the mean
in the second bin in cell [15]. Here the intercept is coded by a column
of ones, so the second, third and fourth coefficients are increments
for those bins. Why is the sum not exactly the same? It turns out that
`pd.qcut()` uses $\leq$, while `bs()` uses $<$ when
deciding bin membership.


In order to fit a natural spline, we use the `NaturalSpline()`
transform with the corresponding helper `ns()`. Here we fit a natural spline with five
degrees of freedom (excluding the intercept) and plot the results.

```{python}
ns_age = MS([ns('age', df=5)]).fit(Wage)
M_ns = sm.OLS(y, ns_age.transform(Wage)).fit()
summarize(M_ns)
```
We now plot the natural spline using our plotting function.

```{python}
plot_wage_fit(age_df,
              ns_age,
              'Natural spline, df=5');

```

## Smoothing Splines and GAMs
A smoothing spline is a special case of a GAM with squared-error loss
and a single feature. To fit GAMs in `Python` we will use the
`pygam` package, which can be installed via `pip install pygam`. The
estimator `LinearGAM()` uses squared-error loss.
The GAM is specified by associating each column
of a model matrix with a particular smoothing operation:
`s` for a smoothing spline, `l` for linear, and `f` for factor or categorical variables.
The argument `0` passed to `s` below indicates that this smoother will
apply to the first column of a feature matrix. Below, we pass it a
matrix with a single column: `X_age`. The argument `lam` is the penalty parameter $\lambda$ as discussed in Section 7.5.2.

```{python}
X_age = np.asarray(age).reshape((-1,1))
gam = LinearGAM(s_gam(0, lam=0.6))
gam.fit(X_age, y)

```

The `pygam` library generally expects a matrix of features, so we reshape `age` to be a matrix (a two-dimensional array) instead
of a vector (i.e. a one-dimensional array). The `-1` in the call to the `reshape()` method tells `numpy` to impute the
size of that dimension based on the remaining entries of the shape tuple.

Let’s investigate how the fit changes with the smoothing parameter `lam`.
The function `np.logspace()` is similar to `np.linspace()` but spaces points
evenly on the log-scale. Below we vary `lam` from $10^{-2}$ to $10^6$.

```{python}
fig, ax = subplots(figsize=(8,8))
ax.scatter(age, y, facecolor='gray', alpha=0.5)
for lam in np.logspace(-2, 6, 5):
    gam = LinearGAM(s_gam(0, lam=lam)).fit(X_age, y)
    ax.plot(age_grid,
            gam.predict(age_grid),
            label='{:.1e}'.format(lam),
            linewidth=3)
ax.set_xlabel('Age', fontsize=20)
ax.set_ylabel('Wage', fontsize=20);
ax.legend(title='$\lambda$');

```

The `pygam` package can perform a search for an optimal smoothing parameter.
- -```{python} -gam_opt = gam.gridsearch(X_age, y) -ax.plot(age_grid, - gam_opt.predict(age_grid), - label='Grid search', - linewidth=4) -ax.legend() -fig - -``` - -Alternatively, we can fix the degrees of freedom of the smoothing -spline using a function included in the `ISLP.pygam` package. Below we -find a value of $\lambda$ that gives us roughly four degrees of -freedom. We note here that these degrees of freedom include the -unpenalized intercept and linear term of the smoothing spline, hence there are at least two -degrees of freedom. - -```{python} -age_term = gam.terms[0] -lam_4 = approx_lam(X_age, age_term, 4) -age_term.lam = lam_4 -degrees_of_freedom(X_age, age_term) - -``` - - -Let’s vary the degrees of freedom in a similar plot to above. We choose the degrees of freedom -as the desired degrees of freedom plus one to account for the fact that these smoothing -splines always have an intercept term. Hence, a value of one for `df` is just a linear fit. - -```{python} -fig, ax = subplots(figsize=(8,8)) -ax.scatter(X_age, - y, - facecolor='gray', - alpha=0.3) -for df in [1,3,4,8,15]: - lam = approx_lam(X_age, age_term, df+1) - age_term.lam = lam - gam.fit(X_age, y) - ax.plot(age_grid, - gam.predict(age_grid), - label='{:d}'.format(df), - linewidth=4) -ax.set_xlabel('Age', fontsize=20) -ax.set_ylabel('Wage', fontsize=20); -ax.legend(title='Degrees of freedom'); - -``` - -### Additive Models with Several Terms -The strength of generalized additive models lies in their ability to fit multivariate regression models with more flexibility than linear models. We demonstrate two approaches: the first in a more manual fashion using natural splines and piecewise constant functions, and the second using the `pygam` package and smoothing splines. - -We now fit a GAM by hand to predict -`wage` using natural spline functions of `year` and `age`, -treating `education` as a qualitative predictor, as in (7.16). -Since this is just a big linear regression model -using an appropriate choice of basis functions, we can simply do this -using the `sm.OLS()` function. - -We will build the model matrix in a more manual fashion here, since we wish -to access the pieces separately when constructing partial dependence plots. - -```{python} -ns_age = NaturalSpline(df=4).fit(age) -ns_year = NaturalSpline(df=5).fit(Wage['year']) -Xs = [ns_age.transform(age), - ns_year.transform(Wage['year']), - pd.get_dummies(Wage['education']).values] -X_bh = np.hstack(Xs) -gam_bh = sm.OLS(y, X_bh).fit() - -``` -Here the function `NaturalSpline()` is the workhorse supporting -the `ns()` helper function. We chose to use all columns of the -indicator matrix for the categorical variable `education`, making an intercept redundant. -Finally, we stacked the three component matrices horizontally to form the model matrix `X_bh`. - -We now show how to construct partial dependence plots for each of the terms in our rudimentary GAM. We can do this by hand, -given grids for `age` and `year`. - We simply predict with new $X$ matrices, fixing all but one of the features at a time. 

```{python}
age_grid = np.linspace(age.min(),
                       age.max(),
                       100)
X_age_bh = X_bh.copy()[:100]       # any 100 rows will do here
X_age_bh[:] = X_bh.mean(0)[None,:] # every row set to the column means
X_age_bh[:,:4] = ns_age.transform(age_grid) # spline basis at age_grid
preds = gam_bh.get_prediction(X_age_bh)
bounds_age = preds.conf_int(alpha=0.05)
partial_age = preds.predicted_mean
center = partial_age.mean()
partial_age -= center
bounds_age -= center
fig, ax = subplots(figsize=(8,8))
ax.plot(age_grid, partial_age, 'b', linewidth=3)
ax.plot(age_grid, bounds_age[:,0], 'r--', linewidth=3)
ax.plot(age_grid, bounds_age[:,1], 'r--', linewidth=3)
ax.set_xlabel('Age')
ax.set_ylabel('Effect on wage')
ax.set_title('Partial dependence of age on wage', fontsize=20);

```
Let's explain in some detail what we did above. The idea is to create a new prediction matrix, where all but the columns belonging to `age` are constant (and set to their training-data means). The four columns for `age` are filled in with the natural spline basis evaluated at the 100 values in `age_grid`.

* We made a grid of length 100 in `age`, and created a matrix `X_age_bh` with 100 rows and the same number of columns as `X_bh`.
* We replaced every row of this matrix with the column means of the original.
* We then replaced just the first four columns, representing `age`, with the natural spline basis computed at the values in `age_grid`.

The remaining steps should by now be familiar.

We also look at the effect of `year` on `wage`; the process is the same.

```{python}
year_grid = np.linspace(Wage['year'].min(),
                        Wage['year'].max(),
                        100)
X_year_bh = X_bh.copy()[:100]
X_year_bh[:] = X_bh.mean(0)[None,:]
X_year_bh[:,4:9] = ns_year.transform(year_grid)
preds = gam_bh.get_prediction(X_year_bh)
bounds_year = preds.conf_int(alpha=0.05)
partial_year = preds.predicted_mean
center = partial_year.mean()
partial_year -= center
bounds_year -= center
fig, ax = subplots(figsize=(8,8))
ax.plot(year_grid, partial_year, 'b', linewidth=3)
ax.plot(year_grid, bounds_year[:,0], 'r--', linewidth=3)
ax.plot(year_grid, bounds_year[:,1], 'r--', linewidth=3)
ax.set_xlabel('Year')
ax.set_ylabel('Effect on wage')
ax.set_title('Partial dependence of year on wage', fontsize=20);

```

We now fit the model (7.16) using smoothing splines rather
than natural splines. All of the
terms in (7.16) are fit simultaneously, taking each other
into account to explain the response. The `pygam` package only works with matrices, so we must convert
the categorical series `education` to its array representation, which can be found
with the `cat.codes` attribute of `education`. As `year` only has 7 unique values, we
use only seven basis functions for it.

```{python}
gam_full = LinearGAM(s_gam(0) +
                     s_gam(1, n_splines=7) +
                     f_gam(2, lam=0))
Xgam = np.column_stack([age,
                        Wage['year'],
                        Wage['education'].cat.codes])
gam_full = gam_full.fit(Xgam, y)

```
The two `s_gam()` terms result in smoothing spline fits, and use a default value for $\lambda$ (`lam=0.6`), which is somewhat arbitrary. For the categorical term `education`, specified using an `f_gam()` term, we specify `lam=0` to avoid any shrinkage.
We produce the partial dependence plot in `age` to see the effect of these choices.

The values for the plot
are generated by the `pygam` package. We provide a `plot_gam()`
function for partial-dependence plots in `ISLP.pygam`, which makes this job easier than in our last example with natural splines.

```{python}
fig, ax = subplots(figsize=(8,8))
plot_gam(gam_full, 0, ax=ax)
ax.set_xlabel('Age')
ax.set_ylabel('Effect on wage')
ax.set_title('Partial dependence of age on wage - default lam=0.6', fontsize=20);

```

We see that the function is somewhat wiggly. It is more natural to specify the `df` than a value for `lam`.
We refit a GAM using four degrees of freedom each for
`age` and `year`. Recall that the addition of one below takes into account the intercept
of the smoothing spline.

```{python}
age_term = gam_full.terms[0]
age_term.lam = approx_lam(Xgam, age_term, df=4+1)
year_term = gam_full.terms[1]
year_term.lam = approx_lam(Xgam, year_term, df=4+1)
gam_full = gam_full.fit(Xgam, y)

```
Note that updating `age_term.lam` above updates it in `gam_full.terms[0]` as well! Likewise for `year_term.lam`.

Repeating the plot for `age`, we see that it is much smoother.
We also produce the plot for `year`.

```{python}
fig, ax = subplots(figsize=(8,8))
plot_gam(gam_full,
         1,
         ax=ax)
ax.set_xlabel('Year')
ax.set_ylabel('Effect on wage')
ax.set_title('Partial dependence of year on wage', fontsize=20)

```
Finally, we plot `education`, which is categorical. The partial dependence plot is different, being more suited to the set of fitted constants for each level of this variable.

```{python}
fig, ax = subplots(figsize=(8, 8))
plot_gam(gam_full, 2, ax=ax)
ax.set_xlabel('Education')
ax.set_ylabel('Effect on wage')
ax.set_title('Partial dependence of wage on education',
             fontsize=20);
ax.set_xticklabels(Wage['education'].cat.categories, fontsize=8);

```

### ANOVA Tests for Additive Models
In all of our models, the function of `year` looks rather linear. We can
perform a series of ANOVA tests in order to determine which of these
three models is best: a GAM that excludes `year` ($\mathcal{M}_1$),
a GAM that uses a linear function of `year` ($\mathcal{M}_2$), or a
GAM that uses a spline function of `year` ($\mathcal{M}_3$).

```{python}
gam_0 = LinearGAM(age_term + f_gam(2, lam=0))
gam_0.fit(Xgam, y)
gam_linear = LinearGAM(age_term +
                       l_gam(1, lam=0) +
                       f_gam(2, lam=0))
gam_linear.fit(Xgam, y)

```

Notice our use of `age_term` in the expressions above. We do this because
earlier we set the value for `lam` in this term to achieve four degrees of freedom.

To directly assess the effect of `year` we run an ANOVA on the
three models fit above.

```{python}
anova_gam(gam_0, gam_linear, gam_full)

```
We find that there is compelling evidence that a GAM with a linear
function in `year` is better than a GAM that does not include
`year` at all ($p$-value $= 0.002$). However, there is no
evidence that a non-linear function of `year` is needed
($p$-value $= 0.435$). In other words, based on the results of this
ANOVA, $\mathcal{M}_2$ is preferred.

We can repeat the same process for `age` as well. We see there is very clear evidence that
a non-linear term is required for `age`.

```{python}
gam_0 = LinearGAM(year_term +
                  f_gam(2, lam=0))
gam_linear = LinearGAM(l_gam(0, lam=0) +
                       year_term +
                       f_gam(2, lam=0))
gam_0.fit(Xgam, y)
gam_linear.fit(Xgam, y)
anova_gam(gam_0, gam_linear, gam_full)
```

There is a (verbose) `summary()` method for the GAM fit.

```{python}
gam_full.summary()

```

We can make predictions from fitted GAM objects, just as with other
fitted models, using the `predict()` method. Here we make predictions on the training set.

```{python}
Yhat = gam_full.predict(Xgam)

```

In order to fit a logistic regression GAM, we use `LogisticGAM()`
from `pygam`.

```{python}
gam_logit = LogisticGAM(age_term +
                        l_gam(1, lam=0) +
                        f_gam(2, lam=0))
gam_logit.fit(Xgam, high_earn)

```


```{python}
fig, ax = subplots(figsize=(8, 8))
plot_gam(gam_logit, 2, ax=ax)
ax.set_xlabel('Education')
ax.set_ylabel('Effect on high earner status')
ax.set_title('Partial dependence of high earner status on education',
             fontsize=20);
ax.set_xticklabels(Wage['education'].cat.categories, fontsize=8);

```
The fitted function seems to be very flat, with especially high error bars for the first category.
Let's look at the data a bit more closely.

```{python}
pd.crosstab(Wage['high_earn'], Wage['education'])

```
We see that there are no high earners in the first category of
education, meaning that the model will have a hard time fitting. We
will fit a logistic regression GAM excluding all observations falling into this
category. This provides more sensible results.

To do so,
we could subset the model matrix, though this will not remove the
column from `Xgam`. While we can deduce which column corresponds
to this feature, for reproducibility’s sake we reform the model matrix
on this smaller subset.


```{python}
only_hs = Wage['education'] == '1. < HS Grad'
Wage_ = Wage.loc[~only_hs]
Xgam_ = np.column_stack([Wage_['age'],
                         Wage_['year'],
                         Wage_['education'].cat.codes-1])
high_earn_ = Wage_['high_earn']

```
In the second-to-last line above, we subtract one from the codes of the category, due to a bug in `pygam`. It just relabels
the education values and hence has no effect on the fit.

We now fit the model.

```{python}
gam_logit_ = LogisticGAM(age_term +
                         year_term +
                         f_gam(2, lam=0))
gam_logit_.fit(Xgam_, high_earn_)

```


Let’s look at the effect of `education`, `year` and `age` on high earner status now that we’ve
removed those observations.

```{python}
fig, ax = subplots(figsize=(8, 8))
plot_gam(gam_logit_, 2, ax=ax)
ax.set_xlabel('Education')
ax.set_ylabel('Effect on high earner status')
ax.set_title('Partial dependence of high earner status on education', fontsize=20);
ax.set_xticklabels(Wage['education'].cat.categories[1:],
                   fontsize=8);

```

```{python}
fig, ax = subplots(figsize=(8, 8))
plot_gam(gam_logit_, 1, ax=ax)
ax.set_xlabel('Year')
ax.set_ylabel('Effect on high earner status')
ax.set_title('Partial dependence of high earner status on year',
             fontsize=20);

```

```{python}
fig, ax = subplots(figsize=(8, 8))
plot_gam(gam_logit_, 0, ax=ax)
ax.set_xlabel('Age')
ax.set_ylabel('Effect on high earner status')
ax.set_title('Partial dependence of high earner status on age', fontsize=20);

```


## Local Regression
We illustrate the use of local regression using the `lowess()`
function from `sm.nonparametric`. Some implementations of
GAMs allow terms to be local regression operators; this is not the case in `pygam`.

Here we fit local linear regression models using spans of 0.2
and 0.5; that is, each neighborhood consists of 20% or 50% of
the observations. As expected, the fit with a span of 0.5 is smoother than the one with a span of 0.2.
- -```{python} -lowess = sm.nonparametric.lowess -fig, ax = subplots(figsize=(8,8)) -ax.scatter(age, y, facecolor='gray', alpha=0.5) -for span in [0.2, 0.5]: - fitted = lowess(y, - age, - frac=span, - xvals=age_grid) - ax.plot(age_grid, - fitted, - label='{:.1f}'.format(span), - linewidth=4) -ax.set_xlabel('Age', fontsize=20) -ax.set_ylabel('Wage', fontsize=20); -ax.legend(title='span', fontsize=15); - -``` - - - diff --git a/Ch08-baggboost-lab.Rmd b/Ch08-baggboost-lab.Rmd deleted file mode 100644 index adbdafb..0000000 --- a/Ch08-baggboost-lab.Rmd +++ /dev/null @@ -1,558 +0,0 @@ - - -# Tree-Based Methods - - -Open In Colab - - -[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/intro-stat-learning/ISLP_labs/v2.2.1?labpath=Ch08-baggboost-lab.ipynb) - - -We import some of our usual libraries at this top -level. - -```{python} -import numpy as np -import pandas as pd -from matplotlib.pyplot import subplots -import sklearn.model_selection as skm -from ISLP import load_data, confusion_table -from ISLP.models import ModelSpec as MS - -``` -We also collect the new imports -needed for this lab. - -```{python} -from sklearn.tree import (DecisionTreeClassifier as DTC, - DecisionTreeRegressor as DTR, - plot_tree, - export_text) -from sklearn.metrics import (accuracy_score, - log_loss) -from sklearn.ensemble import \ - (RandomForestRegressor as RF, - GradientBoostingRegressor as GBR) -from ISLP.bart import BART - -``` - - -## Fitting Classification Trees - - -We first use classification trees to analyze the `Carseats` data set. -In these data, `Sales` is a continuous variable, and so we begin -by recoding it as a binary variable. We use the `where()` -function to create a variable, called `High`, which takes on a -value of `Yes` if the `Sales` variable exceeds 8, and takes -on a value of `No` otherwise. - -```{python} -Carseats = load_data('Carseats') -High = np.where(Carseats.Sales > 8, - "Yes", - "No") - -``` - -We now use `DecisionTreeClassifier()` to fit a classification tree in -order to predict `High` using all variables but `Sales`. -To do so, we must form a model matrix as we did when fitting regression -models. - -```{python} -model = MS(Carseats.columns.drop('Sales'), intercept=False) -D = model.fit_transform(Carseats) -feature_names = list(D.columns) -X = np.asarray(D) - -``` -We have converted `D` from a data frame to an array `X`, which is needed in some of the analysis below. We also need the `feature_names` for annotating our plots later. - -There are several options needed to specify the classifier, -such as `max_depth` (how deep to grow the tree), `min_samples_split` -(minimum number of observations in a node to be eligible for splitting) -and `criterion` (whether to use Gini or cross-entropy as the split criterion). -We also set `random_state` for reproducibility; ties in the split criterion are broken at random. - -```{python} -clf = DTC(criterion='entropy', - max_depth=3, - random_state=0) -clf.fit(X, High) - -``` - - -In our discussion of qualitative features in Section 3.3, -we noted that for a linear regression model such a feature could be -represented by including a matrix of dummy variables (one-hot-encoding) in the model -matrix, using the formula notation of `statsmodels`. -As mentioned in Section 8.1, there is a more -natural way to handle qualitative features when building a decision -tree, that does not require such dummy variables; each split amounts to partitioning the levels into two groups. 
-However, -the `sklearn` implementation of decision trees does not take -advantage of this approach; instead it simply treats the one-hot-encoded levels as separate variables. - -```{python} -accuracy_score(High, clf.predict(X)) - -``` - - -With only the default arguments, the training error rate is -21%. -For classification trees, we can -access the value of the deviance using `log_loss()`, -\begin{equation*} -\begin{split} --2 \sum_m \sum_k n_{mk} \log \hat{p}_{mk}, -\end{split} -\end{equation*} -where $n_{mk}$ is the number of observations in the $m$th terminal -node that belong to the $k$th class. - -```{python} -resid_dev = np.sum(log_loss(High, clf.predict_proba(X))) -resid_dev - -``` - -This is closely related to the *entropy*, defined in (8.7). -A small deviance indicates a -tree that provides a good fit to the (training) data. - -One of the most attractive properties of trees is that they can -be graphically displayed. Here we use the `plot()` function -to display the tree structure. - -```{python} -ax = subplots(figsize=(12,12))[1] -plot_tree(clf, - feature_names=feature_names, - ax=ax); - -``` -The most important indicator of `Sales` appears to be `ShelveLoc`. - -We can see a text representation of the tree using -`export_text()`, which displays the split -criterion (e.g. `Price <= 92.5`) for each branch. -For leaf nodes it shows the overall prediction -(`Yes` or `No`). - We can also see the number of observations in that -leaf that take on values of `Yes` and `No` by specifying `show_weights=True`. - -```{python} -print(export_text(clf, - feature_names=feature_names, - show_weights=True)) - -``` - -In order to properly evaluate the performance of a classification tree -on these data, we must estimate the test error rather than simply -computing the training error. We split the observations into a -training set and a test set, build the tree using the training set, -and evaluate its performance on the test data. This pattern is -similar to that in Chapter 6, with the linear models -replaced here by decision trees --- the code for validation -is almost identical. This approach leads to correct predictions -for 68.5% of the locations in the test data set. - -```{python} -validation = skm.ShuffleSplit(n_splits=1, - test_size=200, - random_state=0) -results = skm.cross_validate(clf, - D, - High, - cv=validation) -results['test_score'] - -``` - - - -Next, we consider whether pruning the tree might lead to improved -classification performance. We first split the data into a training and -test set. We will use cross-validation to prune the tree on the training -set, and then evaluate the performance of the pruned tree on the test -set. - -```{python} -(X_train, - X_test, - High_train, - High_test) = skm.train_test_split(X, - High, - test_size=0.5, - random_state=0) - -``` -We first refit the full tree on the training set; here we do not set a `max_depth` parameter, since we will learn that through cross-validation. - - -```{python} -clf = DTC(criterion='entropy', random_state=0) -clf.fit(X_train, High_train) -accuracy_score(High_test, clf.predict(X_test)) - -``` -Next we use the `cost_complexity_pruning_path()` method of -`clf` to extract cost-complexity values. - -```{python} -ccp_path = clf.cost_complexity_pruning_path(X_train, High_train) -kfold = skm.KFold(10, - random_state=1, - shuffle=True) - -``` -This yields a set of impurities and $\alpha$ values -from which we can extract an optimal one by cross-validation. 
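The path object returned by `sklearn` is a simple container with `ccp_alphas` and `impurities` attributes, which we can inspect directly before running the search.

```{python}
# One alpha per candidate subtree, with the corresponding
# total impurity of that subtree's leaves.
ccp_path.ccp_alphas[:5], ccp_path.impurities[:5]

```

We now search over these values of $\alpha$ with `GridSearchCV()`.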
- -```{python} -grid = skm.GridSearchCV(clf, - {'ccp_alpha': ccp_path.ccp_alphas}, - refit=True, - cv=kfold, - scoring='accuracy') -grid.fit(X_train, High_train) -grid.best_score_ - -``` -Let’s take a look at the pruned tree. - -```{python} -ax = subplots(figsize=(12, 12))[1] -best_ = grid.best_estimator_ -plot_tree(best_, - feature_names=feature_names, - ax=ax); - -``` -This is quite a bushy tree. We could count the leaves, or query -`best_` instead. - -```{python} -best_.tree_.n_leaves - -``` -The tree with 30 terminal -nodes results in the lowest cross-validation error rate, with an accuracy of -68.5%. How well does this pruned tree perform on the test data set? Once -again, we apply the `predict()` function. - -```{python} -print(accuracy_score(High_test, - best_.predict(X_test))) -confusion = confusion_table(best_.predict(X_test), - High_test) -confusion - -``` - - -Now 72.0% of the test observations are correctly classified, which is slightly worse than the error for the full tree (with 35 leaves). So cross-validation has not helped us much here; it only pruned off 5 leaves, at a cost of a slightly worse error. These results would change if we were to change the random number seeds above; even though cross-validation gives an unbiased approach to model selection, it does have variance. - - - - -## Fitting Regression Trees -Here we fit a regression tree to the `Boston` data set. The -steps are similar to those for classification trees. - -```{python} -Boston = load_data("Boston") -model = MS(Boston.columns.drop('medv'), intercept=False) -D = model.fit_transform(Boston) -feature_names = list(D.columns) -X = np.asarray(D) - -``` - -First, we split the data into training and test sets, and fit the tree -to the training data. Here we use 30% of the data for the test set. - - -```{python} -(X_train, - X_test, - y_train, - y_test) = skm.train_test_split(X, - Boston['medv'], - test_size=0.3, - random_state=0) - -``` - -Having formed our training and test data sets, we fit the regression tree. - -```{python} -reg = DTR(max_depth=3) -reg.fit(X_train, y_train) -ax = subplots(figsize=(12,12))[1] -plot_tree(reg, - feature_names=feature_names, - ax=ax); - -``` - -The variable `lstat` measures the percentage of individuals with -lower socioeconomic status. The tree indicates that lower -values of `lstat` correspond to more expensive houses. -The tree predicts a median house price of $12,042 for small-sized homes (`rm < 6.8`), in -suburbs in which residents have low socioeconomic status (`lstat > 14.4`) and the crime-rate is moderate (`crim > 5.8`). - - -Now we use the cross-validation function to see whether pruning -the tree will improve performance. - -```{python} -ccp_path = reg.cost_complexity_pruning_path(X_train, y_train) -kfold = skm.KFold(5, - shuffle=True, - random_state=10) -grid = skm.GridSearchCV(reg, - {'ccp_alpha': ccp_path.ccp_alphas}, - refit=True, - cv=kfold, - scoring='neg_mean_squared_error') -G = grid.fit(X_train, y_train) - -``` - -In keeping with the cross-validation results, we use the pruned tree -to make predictions on the test set. - -```{python} -best_ = grid.best_estimator_ -np.mean((y_test - best_.predict(X_test))**2) - -``` - - -In other words, the test set MSE associated with the regression tree -is 28.07. The square root of -the MSE is therefore around -5.30, -indicating that this model leads to test predictions that are within around -$5300 -of the true median home value for the suburb. - -Let’s plot the best tree to see how interpretable it is. 
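Before plotting, a quick check of the dollar-scale claim above: the RMSE is simply the square root of the test MSE (recall that `medv` is recorded in thousands of dollars).

```{python}
# RMSE = sqrt(test MSE), on the scale of $1000s.
np.sqrt(np.mean((y_test - best_.predict(X_test))**2))

```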
- -```{python} -ax = subplots(figsize=(12,12))[1] -plot_tree(G.best_estimator_, - feature_names=feature_names, - ax=ax); - -``` - - - - -## Bagging and Random Forests - - -Here we apply bagging and random forests to the `Boston` data, using -the `RandomForestRegressor()` from the `sklearn.ensemble` package. Recall -that bagging is simply a special case of a random forest with -$m=p$. Therefore, the `RandomForestRegressor()` function can be used to -perform both bagging and random forests. We start with bagging. - -```{python} -bag_boston = RF(max_features=X_train.shape[1], random_state=0) -bag_boston.fit(X_train, y_train) - -``` - - -The argument `max_features` indicates that all 12 predictors should -be considered for each split of the tree --- in other words, that -bagging should be done. How well does this bagged model perform on -the test set? - -```{python} -ax = subplots(figsize=(8,8))[1] -y_hat_bag = bag_boston.predict(X_test) -ax.scatter(y_hat_bag, y_test) -np.mean((y_test - y_hat_bag)**2) - -``` - -The test set MSE associated with the bagged regression tree is -14.63, about half that obtained using an optimally-pruned single -tree. We could change the number of trees grown from the default of -100 by -using the `n_estimators` argument: - -```{python} -bag_boston = RF(max_features=X_train.shape[1], - n_estimators=500, - random_state=0).fit(X_train, y_train) -y_hat_bag = bag_boston.predict(X_test) -np.mean((y_test - y_hat_bag)**2) -``` -There is not much change. Bagging and random forests cannot overfit by -increasing the number of trees, but can underfit if the number is too small. - -Growing a random forest proceeds in exactly the same way, except that -we use a smaller value of the `max_features` argument. By default, -`RandomForestRegressor()` uses $p$ variables when building a random -forest of regression trees (i.e. it defaults to bagging), and `RandomForestClassifier()` uses -$\sqrt{p}$ variables when building a -random forest of classification trees. Here we use `max_features=6`. - -```{python} -RF_boston = RF(max_features=6, - random_state=0).fit(X_train, y_train) -y_hat_RF = RF_boston.predict(X_test) -np.mean((y_test - y_hat_RF)**2) - -``` - - -The test set MSE is 20.04; -this indicates that random forests did somewhat worse than bagging -in this case. Extracting the `feature_importances_` values from the fitted model, we can view the -importance of each variable. - -```{python} -feature_imp = pd.DataFrame( - {'importance':RF_boston.feature_importances_}, - index=feature_names) -feature_imp.sort_values(by='importance', ascending=False) -``` - This -is a relative measure of the total decrease in node impurity that results from -splits over that variable, averaged over all trees (this was plotted in Figure 8.9 for a model fit to the `Heart` data). - -The results indicate that across all of the trees considered in the -random forest, the wealth level of the community (`lstat`) and the -house size (`rm`) are by far the two most important variables. - - - - -## Boosting - - -Here we use `GradientBoostingRegressor()` from `sklearn.ensemble` -to fit boosted regression trees to the `Boston` data -set. For classification we would use `GradientBoostingClassifier()`. -The argument `n_estimators=5000` -indicates that we want 5000 trees, and the option -`max_depth=3` limits the depth of each tree. The -argument `learning_rate` is the $\lambda$ -mentioned earlier in the description of boosting. 
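To make the role of $\lambda$ concrete, here is a schematic sketch of boosting for regression trees (cf. Algorithm 8.2 in the text) --- not `sklearn`'s actual internals --- in which each small tree is fit to the current residuals and its shrunken predictions are added to the running fit.

```{python}
# Schematic sketch of boosting, not sklearn's implementation:
# each tree is fit to the current residuals, and its predictions
# are added to the running fit with weight lam.
lam = 0.001
fhat = np.zeros(X_train.shape[0])
resid = y_train - fhat
for _ in range(5):              # a few illustrative steps
    tree = DTR(max_depth=3).fit(X_train, resid)
    fhat = fhat + lam * tree.predict(X_train)
    resid = y_train - fhat      # refit to what is left over

```

The `GBR()` fit below implements this idea, with many more trees.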
- -```{python} -boost_boston = GBR(n_estimators=5000, - learning_rate=0.001, - max_depth=3, - random_state=0) -boost_boston.fit(X_train, y_train) - -``` - -We can see how the training error decreases with the `train_score_` attribute. -To get an idea of how the test error decreases we can use the -`staged_predict()` method to get the predicted values along the path. - -```{python} -test_error = np.zeros_like(boost_boston.train_score_) -for idx, y_ in enumerate(boost_boston.staged_predict(X_test)): - test_error[idx] = np.mean((y_test - y_)**2) - -plot_idx = np.arange(boost_boston.train_score_.shape[0]) -ax = subplots(figsize=(8,8))[1] -ax.plot(plot_idx, - boost_boston.train_score_, - 'b', - label='Training') -ax.plot(plot_idx, - test_error, - 'r', - label='Test') -ax.legend(); - -``` - -We now use the boosted model to predict `medv` on the test set: - -```{python} -y_hat_boost = boost_boston.predict(X_test); -np.mean((y_test - y_hat_boost)**2) - -``` - - The test MSE obtained is 14.48, -similar to the test MSE for bagging. If we want to, we can -perform boosting with a different value of the shrinkage parameter -$\lambda$ in (8.10). The default value is 0.001, but -this is easily modified. Here we take $\lambda=0.2$. - -```{python} -boost_boston = GBR(n_estimators=5000, - learning_rate=0.2, - max_depth=3, - random_state=0) -boost_boston.fit(X_train, - y_train) -y_hat_boost = boost_boston.predict(X_test); -np.mean((y_test - y_hat_boost)**2) - -``` - - -In this case, using $\lambda=0.2$ leads to almost the same test MSE -as when using $\lambda=0.001$. - - - - -## Bayesian Additive Regression Trees - - -In this section we demonstrate a `Python` implementation of BART found in the -`ISLP.bart` package. We fit a model -to the `Boston` housing data set. This `BART()` estimator is -designed for quantitative outcome variables, though other implementations are available for -fitting logistic and probit models to categorical outcomes. - -```{python} -bart_boston = BART(random_state=0, burnin=5, ndraw=15) -bart_boston.fit(X_train, y_train) - -``` - - -On this data set, with this split into test and training, we see that the test error of BART is similar to that of random forest. - -```{python} -yhat_test = bart_boston.predict(X_test.astype(np.float32)) -np.mean((y_test - yhat_test)**2) - -``` - - -We can check how many times each variable appeared in the collection of trees. -This gives a summary similar to the variable importance plot for boosting and random forests. - -```{python} -var_inclusion = pd.Series(bart_boston.variable_inclusion_.mean(0), - index=D.columns) -var_inclusion - -``` - - - - - diff --git a/Ch09-svm-lab.Rmd b/Ch09-svm-lab.Rmd deleted file mode 100644 index 105a3c6..0000000 --- a/Ch09-svm-lab.Rmd +++ /dev/null @@ -1,550 +0,0 @@ - - -# Support Vector Machines - - -Open In Colab - - -[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/intro-stat-learning/ISLP_labs/v2.2.1?labpath=Ch09-svm-lab.ipynb) - - -In this lab, we use the `sklearn.svm` library to demonstrate the support -vector classifier and the support vector machine. - -We import some of our usual libraries. - -```{python} -import numpy as np -from matplotlib.pyplot import subplots, cm -import sklearn.model_selection as skm -from ISLP import load_data, confusion_table - -``` -We also collect the new imports -needed for this lab. 
- -```{python} -from sklearn.svm import SVC -from ISLP.svm import plot as plot_svm -from sklearn.metrics import RocCurveDisplay - -``` - -We will use the function `RocCurveDisplay.from_estimator()` to -produce several ROC plots, using a shorthand `roc_curve`. - -```{python} -roc_curve = RocCurveDisplay.from_estimator # shorthand - -``` - -## Support Vector Classifier - -We now use the `SupportVectorClassifier()` function (abbreviated `SVC()`) from `sklearn` to fit the support vector -classifier for a given value of the parameter `C`. The -`C` argument allows us to specify the cost of a violation to -the margin. When the `C` argument is small, then the margins -will be wide and many support vectors will be on the margin or will -violate the margin. When the `C` argument is large, then the -margins will be narrow and there will be few support vectors on the -margin or violating the margin. - -Here we demonstrate -the use of `SVC()` on a two-dimensional example, so that we can -plot the resulting decision boundary. We begin by generating the -observations, which belong to two classes, and checking whether the -classes are linearly separable. - -```{python} -rng = np.random.default_rng(1) -X = rng.standard_normal((50, 2)) -y = np.array([-1]*25+[1]*25) -X[y==1] += 1 -fig, ax = subplots(figsize=(8,8)) -ax.scatter(X[:,0], - X[:,1], - c=y, - cmap=cm.coolwarm); - -``` -They are not. We now fit the classifier. - -```{python} -svm_linear = SVC(C=10, kernel='linear') -svm_linear.fit(X, y) - -``` - - -The support vector classifier with two features can -be visualized by plotting values of its *decision function*. -We have included a function for this in the `ISLP` package (inspired by a similar -example in the `sklearn` docs). - -```{python} -fig, ax = subplots(figsize=(8,8)) -plot_svm(X, - y, - svm_linear, - ax=ax) - -``` - -The decision -boundary between the two classes is linear (because we used the -argument `kernel='linear'`). The support vectors are marked with `+` -and the remaining observations are plotted as circles. - -What if we instead used a smaller value of the cost parameter? - -```{python} -svm_linear_small = SVC(C=0.1, kernel='linear') -svm_linear_small.fit(X, y) -fig, ax = subplots(figsize=(8,8)) -plot_svm(X, - y, - svm_linear_small, - ax=ax) - -``` -With a smaller value of the cost parameter, we -obtain a larger number of support vectors, because the margin is now -wider. For linear kernels, we can extract the -coefficients of the linear decision boundary as follows: - -```{python} -svm_linear.coef_ - -``` - - -Since the support vector machine is an estimator in `sklearn`, we -can use the usual machinery to tune it. - -```{python} -kfold = skm.KFold(5, - random_state=0, - shuffle=True) -grid = skm.GridSearchCV(svm_linear, - {'C':[0.001,0.01,0.1,1,5,10,100]}, - refit=True, - cv=kfold, - scoring='accuracy') -grid.fit(X, y) -grid.best_params_ - -``` - - -We can easily access the cross-validation errors for each of these models -in `grid.cv_results_`. This prints out a lot of detail, so we -extract the accuracy results only. - -```{python} -grid.cv_results_[('mean_test_score')] - -``` -We see that `C=1` results in the highest cross-validation -accuracy of 0.74, though -the accuracy is the same for several values of `C`. -The classifier `grid.best_estimator_` can be used to predict the class -label on a set of test observations. Let’s generate a test data set. 
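Before doing so, it can be handy to see each value of `C` alongside its cross-validated accuracy; a small sketch pairing the grid values with the scores extracted above:

```{python}
# Pair each candidate C with its mean cross-validated accuracy.
for C, acc in zip([0.001, 0.01, 0.1, 1, 5, 10, 100],
                  grid.cv_results_['mean_test_score']):
    print(C, acc)

```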
- -```{python} -X_test = rng.standard_normal((20, 2)) -y_test = np.array([-1]*10+[1]*10) -X_test[y_test==1] += 1 - -``` - -Now we predict the class labels of these test observations. Here we -use the best model selected by cross-validation in order to make the -predictions. - -```{python} -best_ = grid.best_estimator_ -y_test_hat = best_.predict(X_test) -confusion_table(y_test_hat, y_test) - -``` - -Thus, with this value of `C`, -70% of the test -observations are correctly classified. What if we had instead used -`C=0.001`? - -```{python} -svm_ = SVC(C=0.001, - kernel='linear').fit(X, y) -y_test_hat = svm_.predict(X_test) -confusion_table(y_test_hat, y_test) - -``` - -In this case 60% of test observations are correctly classified. - -We now consider a situation in which the two classes are linearly -separable. Then we can find an optimal separating hyperplane using the -`SVC()` estimator. We first -further separate the two classes in our simulated data so that they -are linearly separable: - -```{python} -X[y==1] += 1.9; -fig, ax = subplots(figsize=(8,8)) -ax.scatter(X[:,0], X[:,1], c=y, cmap=cm.coolwarm); - -``` - -Now the observations are just barely linearly separable. - -```{python} -svm_ = SVC(C=1e5, kernel='linear').fit(X, y) -y_hat = svm_.predict(X) -confusion_table(y_hat, y) - -``` - -We fit the -support vector classifier and plot the resulting hyperplane, using a -very large value of `C` so that no observations are -misclassified. - -```{python} -fig, ax = subplots(figsize=(8,8)) -plot_svm(X, - y, - svm_, - ax=ax) - -``` -Indeed no training errors were made and only three support vectors were used. -In fact, the large value of `C` also means that these three support points are *on the margin*, and define it. -One may wonder how good the classifier could be on test data that depends on only three data points! - We now try a smaller -value of `C`. - -```{python} -svm_ = SVC(C=0.1, kernel='linear').fit(X, y) -y_hat = svm_.predict(X) -confusion_table(y_hat, y) - -``` - -Using `C=0.1`, we again do not misclassify any training observations, but we -also obtain a much wider margin and make use of twelve support -vectors. These jointly define the orientation of the decision boundary, and since there are more of them, it is more stable. It seems possible that this model will perform better on test -data than the model with `C=1e5` (and indeed, a simple experiment with a large test set would bear this out). - -```{python} -fig, ax = subplots(figsize=(8,8)) -plot_svm(X, - y, - svm_, - ax=ax) - -``` - - -## Support Vector Machine -In order to fit an SVM using a non-linear kernel, we once again use -the `SVC()` estimator. However, now we use a different value -of the parameter `kernel`. To fit an SVM with a polynomial -kernel we use `kernel="poly"`, and to fit an SVM with a -radial kernel we use -`kernel="rbf"`. In the former case we also use the -`degree` argument to specify a degree for the polynomial kernel -(this is $d$ in (9.22)), and in the latter case we use -`gamma` to specify a value of $\gamma$ for the radial basis -kernel (9.24). - -We first generate some data with a non-linear class boundary, as follows: - -```{python} -X = rng.standard_normal((200, 2)) -X[:100] += 2 -X[100:150] -= 2 -y = np.array([1]*150+[2]*50) - -``` - -Plotting the data makes it clear that the class boundary is indeed non-linear. - -```{python} -fig, ax = subplots(figsize=(8,8)) -ax.scatter(X[:,0], - X[:,1], - c=y, - cmap=cm.coolwarm); - -``` - - -The data is randomly split into training and testing groups. 
We then -fit the training data using the `SVC()` estimator with a -radial kernel and $\gamma=1$: - -```{python} -(X_train, - X_test, - y_train, - y_test) = skm.train_test_split(X, - y, - test_size=0.5, - random_state=0) -svm_rbf = SVC(kernel="rbf", gamma=1, C=1) -svm_rbf.fit(X_train, y_train) - -``` - -The plot shows that the resulting SVM has a decidedly non-linear -boundary. - -```{python} -fig, ax = subplots(figsize=(8,8)) -plot_svm(X_train, - y_train, - svm_rbf, - ax=ax) - -``` - -We can see from the figure that there are a fair number of training -errors in this SVM fit. If we increase the value of `C`, we -can reduce the number of training errors. However, this comes at the -price of a more irregular decision boundary that seems to be at risk -of overfitting the data. - -```{python} -svm_rbf = SVC(kernel="rbf", gamma=1, C=1e5) -svm_rbf.fit(X_train, y_train) -fig, ax = subplots(figsize=(8,8)) -plot_svm(X_train, - y_train, - svm_rbf, - ax=ax) - -``` - -We can perform cross-validation using `skm.GridSearchCV()` to select the -best choice of $\gamma$ and `C` for an SVM with a radial -kernel: - -```{python} -kfold = skm.KFold(5, - random_state=0, - shuffle=True) -grid = skm.GridSearchCV(svm_rbf, - {'C':[0.1,1,10,100,1000], - 'gamma':[0.5,1,2,3,4]}, - refit=True, - cv=kfold, - scoring='accuracy'); -grid.fit(X_train, y_train) -grid.best_params_ - -``` - -The best choice of parameters under five-fold CV is achieved at `C=1` -and `gamma=0.5`, though several other values also achieve the same -value. - -```{python} -best_svm = grid.best_estimator_ -fig, ax = subplots(figsize=(8,8)) -plot_svm(X_train, - y_train, - best_svm, - ax=ax) - -y_hat_test = best_svm.predict(X_test) -confusion_table(y_hat_test, y_test) - -``` - -With these parameters, 12% of test -observations are misclassified by this SVM. - - -## ROC Curves - -SVMs and support vector classifiers output class labels for each -observation. However, it is also possible to obtain *fitted values* -for each observation, which are the numerical scores used to -obtain the class labels. For instance, in the case of a support vector -classifier, the fitted value for an observation $X= (X_1, X_2, \ldots, -X_p)^T$ takes the form $\hat{\beta}_0 + \hat{\beta}_1 X_1 + -\hat{\beta}_2 X_2 + \ldots + \hat{\beta}_p X_p$. For an SVM with a -non-linear kernel, the equation that yields the fitted value is given -in (9.23). The sign of the fitted value -determines on which side of the decision boundary the observation -lies. Therefore, the relationship between the fitted value and the -class prediction for a given observation is simple: if the fitted -value exceeds zero then the observation is assigned to one class, and -if it is less than zero then it is assigned to the other. -By changing this threshold from zero to some positive value, -we skew the classifications in favor of one class versus the other. -By considering a range of these thresholds, positive and negative, we produce the ingredients for a ROC plot. -We can access these values by calling the `decision_function()` -method of a fitted SVM estimator. - -The function `ROCCurveDisplay.from_estimator()` (which we have abbreviated to `roc_curve()`) will produce a plot of a ROC curve. It takes a fitted estimator as its first argument, followed -by a model matrix $X$ and labels $y$. The argument `name` is used in the legend, -while `color` is used for the color of the line. Results are plotted -on our axis object `ax`. 
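Before producing the plot, we can verify the thresholding relationship described above. For a binary `SVC`, a positive value of the decision function corresponds to the second entry of `classes_`; the sketch below should return `True`.

```{python}
# The sign of the decision function reproduces the class predictions.
scores = best_svm.decision_function(X_train)
labels = np.where(scores > 0,
                  best_svm.classes_[1],
                  best_svm.classes_[0])
np.all(labels == best_svm.predict(X_train))

```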
- -```{python} -fig, ax = subplots(figsize=(8,8)) -roc_curve(best_svm, - X_train, - y_train, - name='Training', - color='r', - ax=ax); - -``` - In this example, the SVM appears to provide accurate predictions. By increasing -$\gamma$ we can produce a more flexible fit and generate further -improvements in accuracy. - -```{python} -svm_flex = SVC(kernel="rbf", - gamma=50, - C=1) -svm_flex.fit(X_train, y_train) -fig, ax = subplots(figsize=(8,8)) -roc_curve(svm_flex, - X_train, - y_train, - name='Training $\gamma=50$', - color='r', - ax=ax); - -``` - -However, these ROC curves are all on the training data. We are really -more interested in the level of prediction accuracy on the test -data. When we compute the ROC curves on the test data, the model with -$\gamma=0.5$ appears to provide the most accurate results. - -```{python} -roc_curve(svm_flex, - X_test, - y_test, - name='Test $\gamma=50$', - color='b', - ax=ax) -fig; - -``` - -Let’s look at our tuned SVM. - -```{python} -fig, ax = subplots(figsize=(8,8)) -for (X_, y_, c, name) in zip( - (X_train, X_test), - (y_train, y_test), - ('r', 'b'), - ('CV tuned on training', - 'CV tuned on test')): - roc_curve(best_svm, - X_, - y_, - name=name, - ax=ax, - color=c) - -``` - -## SVM with Multiple Classes - -If the response is a factor containing more than two levels, then the -`SVC()` function will perform multi-class classification using -either the one-versus-one approach (when `decision_function_shape=='ovo'`) -or one-versus-rest {One-versus-rest is also known as one-versus-all.} (when `decision_function_shape=='ovr'`). -We explore that setting briefly here by -generating a third class of observations. - -```{python} -rng = np.random.default_rng(123) -X = np.vstack([X, rng.standard_normal((50, 2))]) -y = np.hstack([y, [0]*50]) -X[y==0,1] += 2 -fig, ax = subplots(figsize=(8,8)) -ax.scatter(X[:,0], X[:,1], c=y, cmap=cm.coolwarm); - -``` - -We now fit an SVM to the data: - -```{python} -svm_rbf_3 = SVC(kernel="rbf", - C=10, - gamma=1, - decision_function_shape='ovo'); -svm_rbf_3.fit(X, y) -fig, ax = subplots(figsize=(8,8)) -plot_svm(X, - y, - svm_rbf_3, - scatter_cmap=cm.tab10, - ax=ax) - -``` -The `sklearn.svm` library can also be used to perform support vector -regression with a numerical response using the estimator `SupportVectorRegression()`. - - -## Application to Gene Expression Data - -We now examine the `Khan` data set, which consists of a number of -tissue samples corresponding to four distinct types of small round -blue cell tumors. For each tissue sample, gene expression measurements -are available. The data set consists of training data, `xtrain` -and `ytrain`, and testing data, `xtest` and `ytest`. - -We examine the dimension of the data: - -```{python} -Khan = load_data('Khan') -Khan['xtrain'].shape, Khan['xtest'].shape - -``` - -This data set consists of expression measurements for 2,308 -genes. The training and test sets consist of 63 and 20 -observations, respectively. - -We will use a support vector approach to predict cancer subtype using -gene expression measurements. In this data set, there is a very -large number of features relative to the number of observations. This -suggests that we should use a linear kernel, because the additional -flexibility that will result from using a polynomial or radial kernel -is unnecessary. 
- -```{python} -khan_linear = SVC(kernel='linear', C=10) -khan_linear.fit(Khan['xtrain'], Khan['ytrain']) -confusion_table(khan_linear.predict(Khan['xtrain']), - Khan['ytrain']) - -``` - -We see that there are *no* training -errors. In fact, this is not surprising, because the large number of -variables relative to the number of observations implies that it is -easy to find hyperplanes that fully separate the classes. We are more -interested in the support vector classifier’s performance on the -test observations. - -```{python} -confusion_table(khan_linear.predict(Khan['xtest']), - Khan['ytest']) - -``` - -We see that using `C=10` yields two test set errors on these data. - - diff --git a/Ch10-deeplearning-lab.Rmd b/Ch10-deeplearning-lab.Rmd deleted file mode 100644 index 8a31f0f..0000000 --- a/Ch10-deeplearning-lab.Rmd +++ /dev/null @@ -1,1858 +0,0 @@ - -# Deep Learning - - -Open In Colab - - -[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/intro-stat-learning/ISLP_labs/v2.2.1?labpath=Ch10-deeplearning-lab.ipynb) - - -In this section we demonstrate how to fit the examples discussed -in the text. We use the `Python`{} `torch` package, along with the -`pytorch_lightning` package which provides utilities to simplify -fitting and evaluating models. This code can be impressively fast -with certain special processors, such as Apple’s new M1 chip. The package is well-structured, flexible, and will feel comfortable -to `Python`{} users. A good companion is the site -[pytorch.org/tutorials](https://pytorch.org/tutorials/beginner/basics/intro.html). -Much of our code is adapted from there, as well as the `pytorch_lightning` documentation. {The precise URLs at the time of writing are and .} - -We start with several standard imports that we have seen before. - -```{python} -import numpy as np -import pandas as pd -from matplotlib.pyplot import subplots -from sklearn.linear_model import \ - (LinearRegression, - LogisticRegression, - Lasso) -from sklearn.preprocessing import StandardScaler -from sklearn.model_selection import KFold -from sklearn.pipeline import Pipeline -from ISLP import load_data -from ISLP.models import ModelSpec as MS -from sklearn.model_selection import \ - (train_test_split, - GridSearchCV) -``` - - -### Torch-Specific Imports -There are a number of imports for `torch`. (These are not -included with `ISLP`, so must be installed separately.) -First we import the main library -and essential tools used to specify sequentially-structured networks. - -```{python} -import torch -from torch import nn -from torch.optim import RMSprop -from torch.utils.data import TensorDataset - -``` - -There are several other helper packages for `torch`. For instance, -the `torchmetrics` package has utilities to compute -various metrics to evaluate performance when fitting -a model. The `torchinfo` package provides a useful -summary of the layers of a model. We use the `read_image()` -function when loading test images in Section 10.9.4. - -If you have not already installed the packages `torchvision` -and `torchinfo` you can install them by running -`pip install torchinfo torchvision`. -We can now import from `torchinfo`. - - -```{python} -from torchmetrics import (MeanAbsoluteError, - R2Score) -from torchinfo import summary - -``` - -The package `pytorch_lightning` is a somewhat higher-level -interface to `torch` that simplifies the specification and -fitting of -models by reducing the amount of boilerplate code needed -(compared to using `torch` alone). 
- -```{python} -from pytorch_lightning import Trainer -from pytorch_lightning.loggers import CSVLogger - -``` - -In order to reproduce results we use `seed_everything()`. We will also instruct `torch` to use deterministic algorithms -where possible. - -```{python} -from pytorch_lightning import seed_everything -seed_everything(0, workers=True) -torch.use_deterministic_algorithms(True, warn_only=True) - -``` - -We will use several datasets shipped with `torchvision` for our -examples: a pretrained network for image classification, -as well as some transforms used for preprocessing. - -```{python} -from torchvision.io import read_image -from torchvision.datasets import MNIST, CIFAR100 -from torchvision.models import (resnet50, - ResNet50_Weights) -from torchvision.transforms import (Resize, - Normalize, - CenterCrop, - ToTensor) -``` -We have provided a few utilities in `ISLP` specifically for this lab. -The `SimpleDataModule` and `SimpleModule` are simple -versions of objects used in `pytorch_lightning`, the -high-level module for fitting `torch` models. Although more advanced -uses such as computing on graphical processing units (GPUs) and parallel data processing -are possible in this module, we will not be focusing much on these -in this lab. The `ErrorTracker` handles -collections of targets and predictions over each mini-batch -in the validation or test stage, allowing computation -of the metric over the entire validation or test data set. - -```{python} -from ISLP.torch import (SimpleDataModule, - SimpleModule, - ErrorTracker, - rec_num_workers) - -``` - -In addition we have included some helper -functions to load the -`IMDb` database, as well as a lookup that maps integers -to particular keys in the database. We’ve included -a slightly modified copy of the preprocessed -`IMDb` data from `keras`, a separate package -for fitting deep learning models. This saves us significant -preprocessing and allows us to focus on specifying and fitting -the models themselves. - -```{python} -from ISLP.torch.imdb import (load_lookup, - load_tensor, - load_sparse, - load_sequential) - -``` - -Finally, we introduce some utility imports not directly related to -`torch`. -The `glob()` function from the `glob` module is used -to find all files matching wildcard characters, which we will use -in our example applying the `ResNet50` model -to some of our own images. -The `json` module will be used to load -a JSON file for looking up classes to identify the labels of the -pictures in the `ResNet50` example. We'll also import `warnings` to filter -out warnings when fitting the LASSO to the IMDB data. - -```{python} -from glob import glob -import json -import warnings - -``` - - -## Single Layer Network on Hitters Data -We start by fitting the models in Section 10.6 on the `Hitters` data. - -```{python} -Hitters = load_data('Hitters').dropna() -n = Hitters.shape[0] - -``` - We will fit two linear models (least squares and lasso) and compare their performance -to that of a neural network. For this comparison we will use mean absolute error on a validation dataset. -\begin{equation*} -\begin{split} -\mbox{MAE}(y,\hat{y}) = \frac{1}{n} \sum_{i=1}^n |y_i-\hat{y}_i|. -\end{split} -\end{equation*} -We set up the model matrix and the response. - -```{python} -model = MS(Hitters.columns.drop('Salary'), intercept=False) -X = model.fit_transform(Hitters).to_numpy() -Y = Hitters['Salary'].to_numpy() - -``` -The `to_numpy()` method above converts `pandas` -data frames or series to `numpy` arrays. 
-We do this because we will need to use `sklearn` to fit the lasso model, -and it requires this conversion. -We also use a linear regression method from `sklearn`, rather than the method -in Chapter~3 from `statsmodels`, to facilitate the comparisons. - - -We now split the data into test and training, fixing the random -state used by `sklearn` to do the split. - -```{python} -(X_train, - X_test, - Y_train, - Y_test) = train_test_split(X, - Y, - test_size=1/3, - random_state=1) -``` - -### Linear Models -We fit the linear model and evaluate the test error directly. - -```{python} -hit_lm = LinearRegression().fit(X_train, Y_train) -Yhat_test = hit_lm.predict(X_test) -np.abs(Yhat_test - Y_test).mean() -``` - -Next we fit the lasso using `sklearn`. We are using -mean absolute error to select and evaluate a model, rather than mean squared error. -The specialized solver we used in Section 6.5.2 uses only mean squared error. So here, with a bit more work, we create a cross-validation grid and perform the cross-validation directly. - -We encode a pipeline with two steps: we first normalize the features using a `StandardScaler()` transform, -and then fit the lasso without further normalization. - -```{python} -scaler = StandardScaler(with_mean=True, with_std=True) -lasso = Lasso(warm_start=True, max_iter=30000) -standard_lasso = Pipeline(steps=[('scaler', scaler), - ('lasso', lasso)]) -``` - -We need to create a grid of values for $\lambda$. As is common practice, -we choose a grid of 100 values of $\lambda$, uniform on the log scale from `lam_max` down to `0.01*lam_max`. Here `lam_max` is the smallest value of -$\lambda$ with an all-zero solution. This value equals the largest absolute inner-product between any predictor and the (centered) response. {The derivation of this result is beyond the scope of this book.} - -```{python} -X_s = scaler.fit_transform(X_train) -n = X_s.shape[0] -lam_max = np.fabs(X_s.T.dot(Y_train - Y_train.mean())).max() / n -param_grid = {'lasso__alpha': np.exp(np.linspace(0, np.log(0.01), 100)) - * lam_max} -``` -Note that we had to transform the data first, since the scale of the variables impacts the choice of $\lambda$. -We now perform cross-validation using this sequence of $\lambda$ values. - -```{python} -cv = KFold(10, - shuffle=True, - random_state=1) -grid = GridSearchCV(standard_lasso, - param_grid, - cv=cv, - scoring='neg_mean_absolute_error') -grid.fit(X_train, Y_train); -``` - -We extract the lasso model with best cross-validated mean absolute error, and evaluate its -performance on `X_test` and `Y_test`, which were not used in -cross-validation. - -```{python} -trained_lasso = grid.best_estimator_ -Yhat_test = trained_lasso.predict(X_test) -np.fabs(Yhat_test - Y_test).mean() -``` -This is similar to the results we got for the linear model fit by least squares. However, these results can vary a lot for different train/test splits; we encourage the reader to try a different seed in code block 12 and rerun the subsequent code up to this point. - -### Specifying a Network: Classes and Inheritance -To fit the neural network, we first set up a model structure -that describes the network. -Doing so requires us to define new classes specific to the model we wish to fit. -Typically this is done in `pytorch` by sub-classing a generic -representation of a network, which is the approach we take here. -Although this example is simple, we will go through the steps in some detail, since it will serve us well -for the more complex examples to follow. 
- - -```{python} -class HittersModel(nn.Module): - - def __init__(self, input_size): - super(HittersModel, self).__init__() - self.flatten = nn.Flatten() - self.sequential = nn.Sequential( - nn.Linear(input_size, 50), - nn.ReLU(), - nn.Dropout(0.4), - nn.Linear(50, 1)) - - def forward(self, x): - x = self.flatten(x) - return torch.flatten(self.sequential(x)) - -``` - -The `class` statement identifies the code chunk as a -declaration for a class `HittersModel` -that inherits from the base class `nn.Module`. This base -class is ubiquitous in `torch` and represents the -mappings in the neural networks. - -Indented beneath the `class` statement are the methods of this class: -in this case `__init__` and `forward`. The `__init__` method is -called when an instance of the class is created as in the cell -below. In the methods, `self` always refers to an instance of the -class. In the `__init__` method, we have attached two objects to -`self` as attributes: `flatten` and `sequential`. These are used in -the `forward` method to describe the map that this module implements. - -There is one additional line in the `__init__` method, which -is a call to -`super()`. This function allows subclasses (i.e. `HittersModel`) -to access methods of the class they inherit from. For example, -the class `nn.Module` has its own `__init__` method, which is different from -the `HittersModel.__init__()` method we’ve written above. -Using `super()` allows us to call the method of the base class. For -`torch` models, we will always be making this `super()` call as it is necessary -for the model to be properly interpreted by `torch`. - -The object `nn.Module` has more methods than simply `__init__` and `forward`. These -methods are directly accessible to `HittersModel` instances because of this inheritance. -One such method we will see shortly is the `eval()` method, used -to disable dropout for when we want to evaluate the model on test data. - -```{python} -hit_model = HittersModel(X.shape[1]) - -``` - -The object `self.sequential` is a composition of four maps. The -first maps the 19 features of `Hitters` to 50 dimensions, introducing $50\times 19+50$ parameters -for the weights and *intercept* of the map (often called the *bias*). This layer -is then mapped to a ReLU layer followed by a 40% dropout layer, and finally a -linear map down to 1 dimension, again with a bias. The total number of -trainable parameters is therefore $50\times 19+50+50+1=1051$. - - - - -The package `torchinfo` provides a `summary()` function that neatly summarizes -this information. We specify the size of the input and see the size -of each tensor as it passes through layers of the network. - -```{python} -summary(hit_model, - input_size=X_train.shape, - col_names=['input_size', - 'output_size', - 'num_params']) - -``` -We have truncated the end of the output slightly, here and in subsequent uses. - -We now need to transform our training data into a form accessible to `torch`. -The basic -datatype in `torch` is a `tensor`, which is very similar -to an `ndarray` from early chapters. -We also note here that `torch` typically -works with 32-bit (*single precision*) -rather than 64-bit (*double precision*) floating point numbers. -We therefore convert our data to `np.float32` before -forming the tensor. -The $X$ and $Y$ tensors are then arranged into a `Dataset` -recognized by `torch` -using `TensorDataset()`. 
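Before forming the tensors, we can confirm the parameter count claimed above by summing the sizes of the model's parameter tensors.

```{python}
# Check the count 50*19 + 50 + 50 + 1 = 1051 trainable parameters.
sum(p.numel() for p in hit_model.parameters())

```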
- -```{python} -X_train_t = torch.tensor(X_train.astype(np.float32)) -Y_train_t = torch.tensor(Y_train.astype(np.float32)) -hit_train = TensorDataset(X_train_t, Y_train_t) -``` -We do the same for the test data. - -```{python} -X_test_t = torch.tensor(X_test.astype(np.float32)) -Y_test_t = torch.tensor(Y_test.astype(np.float32)) -hit_test = TensorDataset(X_test_t, Y_test_t) - -``` - -Finally, this dataset is passed to a `DataLoader()` which ultimately -passes data into our network. While this may seem -like a lot of overhead, this structure is helpful for more -complex tasks where data may live on different machines, -or where data must be passed to a GPU. -We provide a helper function `SimpleDataModule()` in `ISLP` to make this task easier for -standard usage. -One of its arguments is `num_workers`, which indicates -how many processes we will use -for loading the data. For small -data like `Hitters` this will have little effect, but -it does provide an advantage for the `MNIST` and `CIFAR100` examples below. -The `torch` package will inspect the process running and determine a -maximum number of workers. {This depends on the computing hardware and the number of cores available.} We’ve included a function -`rec_num_workers()` to compute this so we know how many -workers might be reasonable (here the max was 16). - -```{python} -max_num_workers = rec_num_workers() -``` - -The general training setup in `pytorch_lightning` involves -training, validation and test data. These are each -represented by different data loaders. During each epoch, -we run a training step to learn the model and a validation -step to track the error. The test data is typically -used at the end of training to evaluate the model. - -In this case, as we had split only into test and training, -we’ll use the test data as validation data with the -argument `validation=hit_test`. The -`validation` argument can be a float between 0 and 1, an -integer, or a -`Dataset`. If a float (respectively, integer), it is interpreted -as a percentage (respectively number) of the *training* observations to be used for validation. -If it is a `Dataset`, it is passed directly to a data loader. - -```{python} -hit_dm = SimpleDataModule(hit_train, - hit_test, - batch_size=32, - num_workers=min(4, max_num_workers), - validation=hit_test) - -``` - -Next we must provide a `pytorch_lightning` module that controls -the steps performed during the training process. We provide methods for our -`SimpleModule()` that simply record the value -of the loss function and any additional -metrics at the end of each epoch. These operations -are controlled by the methods `SimpleModule.[training/test/validation]_step()`, though -we will not be modifying these in our examples. - -```{python} -hit_module = SimpleModule.regression(hit_model, - metrics={'mae':MeanAbsoluteError()}) - -``` - - By using the `SimpleModule.regression()` method, we indicate that we will use squared-error loss as in -(10.23). -We have also asked for mean absolute error to be tracked as well -in the metrics that are logged. - -We log our results via `CSVLogger()`, which in this case stores the results in a CSV file within a directory `logs/hitters`. After the fitting is complete, this allows us to load the -results as a `pd.DataFrame()` and visualize them below. There are -several ways to log the results within `pytorch_lightning`, though -we will not cover those here in detail. 
- -```{python} -hit_logger = CSVLogger('logs', name='hitters') -``` - -Finally we are ready to train our model and log the results. We -use the `Trainer()` object from `pytorch_lightning` -to do this work. The argument `datamodule=hit_dm` tells the trainer -how training/validation/test logs are produced, -while the first argument `hit_module` -specifies the network architecture -as well as the training/validation/test steps. -The `callbacks` argument allows for -several tasks to be carried out at various -points while training a model. Here -our `ErrorTracker()` callback will enable -us to compute validation error while training -and, finally, the test error. -We now fit the model for 50 epochs. - -```{python} -hit_trainer = Trainer(deterministic=True, - max_epochs=50, - log_every_n_steps=5, - logger=hit_logger, - callbacks=[ErrorTracker()]) -hit_trainer.fit(hit_module, datamodule=hit_dm) -``` -At each step of SGD, the algorithm randomly selects 32 training observations for -the computation of the gradient. Recall from Section 10.7 -that an epoch amounts to the number of SGD steps required to process $n$ -observations. Since the training set has -$n=175$, and we specified a `batch_size` of 32 in the construction of `hit_dm`, an epoch is $175/32=5.5$ SGD steps. - -After having fit the model, we can evaluate performance on our test -data using the `test()` method of our trainer. - -```{python} -hit_trainer.test(hit_module, datamodule=hit_dm) - -``` - - -The results of the fit have been logged into a CSV file. We can find the -results specific to this run in the `experiment.metrics_file_path` -attribute of our logger. Note that each time the model is fit, the logger will output -results into a new subdirectory of our directory `logs/hitters`. - -We now create a plot of the MAE (mean absolute error) as a function of -the number of epochs. -First we retrieve the logged summaries. - -```{python} -hit_results = pd.read_csv(hit_logger.experiment.metrics_file_path) -``` - -Since we will produce similar plots in later examples, we write a -simple generic function to produce this plot. - -```{python} -def summary_plot(results, - ax, - col='loss', - valid_legend='Validation', - training_legend='Training', - ylabel='Loss', - fontsize=20): - for (column, - color, - label) in zip([f'train_{col}_epoch', - f'valid_{col}'], - ['black', - 'red'], - [training_legend, - valid_legend]): - results.plot(x='epoch', - y=column, - label=label, - marker='o', - color=color, - ax=ax) - ax.set_xlabel('Epoch') - ax.set_ylabel(ylabel) - return ax -``` -We now set up our axes, and use our function to produce the MAE plot. - -```{python} -fig, ax = subplots(1, 1, figsize=(6, 6)) -ax = summary_plot(hit_results, - ax, - col='mae', - ylabel='MAE', - valid_legend='Validation (=Test)') -ax.set_ylim([0, 400]) -ax.set_xticks(np.linspace(0, 50, 11).astype(int)); -``` - - -We can predict directly from the final model, and -evaluate its performance on the test data. -Before fitting, we call the `eval()` method -of `hit_model`. -This tells -`torch` to effectively consider this model to be fitted, so that -we can use it to predict on new data. For our model here, -the biggest change is that the dropout layers will -be turned off, i.e. no weights will be randomly -dropped in predicting on new data. - -```{python} -hit_model.eval() -preds = hit_module(X_test_t) -torch.abs(Y_test_t - preds).mean() -``` - - - -### Cleanup -In setting up our data module, we had initiated -several worker processes that will remain running. 
-We delete all references to the torch objects to ensure these processes -will be killed. - - -```{python} -del(Hitters, - hit_model, hit_dm, - hit_logger, - hit_test, hit_train, - X, Y, - X_test, X_train, - Y_test, Y_train, - X_test_t, Y_test_t, - hit_trainer, hit_module) - -``` - - -## Multilayer Network on the MNIST Digit Data -The `torchvision` package comes with a number of example datasets, -including the `MNIST` digit data. Our first step is to retrieve -the training and test data sets; the `MNIST()` function within -`torchvision.datasets` is provided for this purpose. The -data will be downloaded the first time this function is executed, and stored in the directory `data/MNIST`. - -```{python} -(mnist_train, - mnist_test) = [MNIST(root='data', - train=train, - download=True, - transform=ToTensor()) - for train in [True, False]] -mnist_train - -``` - -There are 60,000 images in the training data and 10,000 in the test -data. The images are $28\times 28$, and stored as a matrix of pixels. We -need to transform each one into a vector. - -Neural networks are somewhat sensitive to the scale of the inputs, much as ridge and -lasso regularization are affected by scaling. Here the inputs are eight-bit -grayscale values between 0 and 255, so we rescale to the unit -interval. {Note: eight bits means $2^8$, which equals 256. Since the convention -is to start at $0$, the possible values range from $0$ to $255$.} -This transformation, along with some reordering -of the axes, is performed by the `ToTensor()` transform -from the `torchvision.transforms` package. - -As in our `Hitters` example, we form a data module -from the training and test datasets, setting aside 20% -of the training images for validation. - -```{python} -mnist_dm = SimpleDataModule(mnist_train, - mnist_test, - validation=0.2, - num_workers=max_num_workers, - batch_size=256) - -``` - -Let’s take a look at the data that will get fed into our network. We loop through the first few -chunks of the test dataset, breaking after 2 batches: - -```{python} -for idx, (X_ ,Y_) in enumerate(mnist_dm.train_dataloader()): - print('X: ', X_.shape) - print('Y: ', Y_.shape) - if idx >= 1: - break - -``` - - -We see that the $X$ for each batch consists of 256 images of size `1x28x28`. -Here the `1` indicates a single channel (greyscale). For RGB images such as `CIFAR100` below, -we will see that the `1` in the size will be replaced by `3` for the three RGB channels. - -Now we are ready to specify our neural network. - -```{python} -class MNISTModel(nn.Module): - def __init__(self): - super(MNISTModel, self).__init__() - self.layer1 = nn.Sequential( - nn.Flatten(), - nn.Linear(28*28, 256), - nn.ReLU(), - nn.Dropout(0.4)) - self.layer2 = nn.Sequential( - nn.Linear(256, 128), - nn.ReLU(), - nn.Dropout(0.3)) - self._forward = nn.Sequential( - self.layer1, - self.layer2, - nn.Linear(128, 10)) - def forward(self, x): - return self._forward(x) -``` - -We see that in the first layer, each `1x28x28` image is flattened, then mapped to -256 dimensions where we apply a ReLU activation with 40% dropout. -A second layer maps the first layer’s output down to -128 dimensions, applying a ReLU activation with 30% dropout. Finally, -the 128 dimensions are mapped down to 10, the number of classes in the -`MNIST` data. - -```{python} -mnist_model = MNISTModel() - -``` - -We can check that the model produces output of expected size based -on our existing batch `X_` above. - -```{python} -mnist_model(X_).size() -``` - -Let’s take a look at the summary of the model. 
Instead of an `input_size` we can pass -a tensor of correct shape. In this case, we pass through the final -batched `X_` from above. - -```{python} -summary(mnist_model, - input_data=X_, - col_names=['input_size', - 'output_size', - 'num_params']) -``` - -Having set up both the model and the data module, fitting this model is -now almost identical to the `Hitters` example. In contrast to our regression model, here we will use the -`SimpleModule.classification()` method which -uses the cross-entropy loss function instead of mean squared error. It must be supplied with the number of -classes in the problem. - -```{python} -mnist_module = SimpleModule.classification(mnist_model, - num_classes=10) -mnist_logger = CSVLogger('logs', name='MNIST') - -``` - -Now we are ready to go. The final step is to supply training data, and fit the model. We disable the progress bar below to avoid lengthy output in the browser when running. - -```{python} -mnist_trainer = Trainer(deterministic=True, - max_epochs=30, - logger=mnist_logger, - enable_progress_bar=False, - callbacks=[ErrorTracker()]) -mnist_trainer.fit(mnist_module, - datamodule=mnist_dm) - -``` -We have suppressed the output here, which is a progress report on the -fitting of the model, grouped by epoch. This is very useful, since on -large datasets fitting can take time. Fitting this model took 245 -seconds on a MacBook Pro with an Apple M1 Pro chip with 10 cores and 16 GB of RAM. -Here we specified a -validation split of 20%, so training is actually performed on -80% of the 60,000 observations in the training set. This is an -alternative to actually supplying validation data, like we did for the `Hitters` data. -SGD uses batches -of 256 observations in computing the gradient, and doing the -arithmetic, we see that an epoch corresponds to 188 gradient steps. - - -`SimpleModule.classification()` includes -an accuracy metric by default. Other -classification metrics can be added from `torchmetrics`. -We will use our `summary_plot()` function to display -accuracy across epochs. - -```{python} -mnist_results = pd.read_csv(mnist_logger.experiment.metrics_file_path) -fig, ax = subplots(1, 1, figsize=(6, 6)) -summary_plot(mnist_results, - ax, - col='accuracy', - ylabel='Accuracy') -ax.set_ylim([0.5, 1]) -ax.set_ylabel('Accuracy') -ax.set_xticks(np.linspace(0, 30, 7).astype(int)); - -``` -Once again we evaluate the accuracy using the `test()` method of our trainer. This model achieves -97% accuracy on the test data. - -```{python} -mnist_trainer.test(mnist_module, - datamodule=mnist_dm) -``` - -Table 10.1 also reports the error rates resulting from LDA (Chapter 4) and multiclass logistic -regression. For LDA we refer the reader to Section 4.7.3. -Although we could use the `sklearn` function `LogisticRegression()` to fit -multiclass logistic regression, we are set up here to fit such a model -with `torch`. -We just have an input layer and an output layer, and omit the hidden layers! 
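For comparison, the `sklearn` route mentioned above might look like the sketch below. We leave it commented out: it requires flattening each image into a vector of $28\times 28=784$ pixel values, and fitting is slow at this scale.

```{python}
# Sketch only (not run): multiclass logistic regression via sklearn
# on flattened 28x28 images; LogisticRegression was imported earlier.
# X_flat = mnist_train.data.numpy().reshape(-1, 784) / 255.0
# y_vec = mnist_train.targets.numpy()
# LogisticRegression(max_iter=1000).fit(X_flat, y_vec)

```

Here is the `torch` version: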
- -```{python} -class MNIST_MLR(nn.Module): - def __init__(self): - super(MNIST_MLR, self).__init__() - self.linear = nn.Sequential(nn.Flatten(), - nn.Linear(784, 10)) - def forward(self, x): - return self.linear(x) - -mlr_model = MNIST_MLR() -mlr_module = SimpleModule.classification(mlr_model, - num_classes=10) -mlr_logger = CSVLogger('logs', name='MNIST_MLR') -``` - -```{python} -mlr_trainer = Trainer(deterministic=True, - max_epochs=30, - enable_progress_bar=False, - callbacks=[ErrorTracker()]) -mlr_trainer.fit(mlr_module, datamodule=mnist_dm) -``` -We fit the model just as before and compute the test results. - -```{python} -mlr_trainer.test(mlr_module, - datamodule=mnist_dm) -``` -The accuracy is above 90% even for this pretty simple model. - -As in the `Hitters` example, we delete some of -the objects we created above. - -```{python} -del(mnist_test, - mnist_train, - mnist_model, - mnist_dm, - mnist_trainer, - mnist_module, - mnist_results, - mlr_model, - mlr_module, - mlr_trainer) -``` - - -## Convolutional Neural Networks -In this section we fit a CNN to the `CIFAR100` data, which is available in the `torchvision` -package. It is arranged in a similar fashion as the `MNIST` data. - -```{python} -(cifar_train, - cifar_test) = [CIFAR100(root="data", - train=train, - download=True) - for train in [True, False]] -``` - -```{python} -transform = ToTensor() -cifar_train_X = torch.stack([transform(x) for x in - cifar_train.data]) -cifar_test_X = torch.stack([transform(x) for x in - cifar_test.data]) -cifar_train = TensorDataset(cifar_train_X, - torch.tensor(cifar_train.targets)) -cifar_test = TensorDataset(cifar_test_X, - torch.tensor(cifar_test.targets)) -``` - -The `CIFAR100` dataset consists of 50,000 training images, each represented by a three-dimensional tensor: -each three-color image is represented as a set of three channels, each of which consists of -$32\times 32$ eight-bit pixels. We standardize as we did for the -digits, but keep the array structure. This is accomplished with the `ToTensor()` transform. - -Creating the data module is similar to the `MNIST` example. - -```{python} -cifar_dm = SimpleDataModule(cifar_train, - cifar_test, - validation=0.2, - num_workers=max_num_workers, - batch_size=128) - -``` -We again look at the shape of typical batches in our data loaders. - -```{python} -for idx, (X_ ,Y_) in enumerate(cifar_dm.train_dataloader()): - print('X: ', X_.shape) - print('Y: ', Y_.shape) - if idx >= 1: - break - -``` - - -Before we start, we look at some of the training images; similar code produced -Figure 10.5 on page 406. The example below also illustrates -that `TensorDataset` objects can be indexed with integers --- we are choosing -random images from the training data by indexing `cifar_train`. In order to display correctly, -we must reorder the dimensions by a call to `np.transpose()`. - -```{python} -fig, axes = subplots(5, 5, figsize=(10,10)) -rng = np.random.default_rng(4) -indices = rng.choice(np.arange(len(cifar_train)), 25, - replace=False).reshape((5,5)) -for i in range(5): - for j in range(5): - idx = indices[i,j] - axes[i,j].imshow(np.transpose(cifar_train[idx][0], - [1,2,0]), - interpolation=None) - axes[i,j].set_xticks([]) - axes[i,j].set_yticks([]) - -``` -Here the `imshow()` method recognizes from the shape of its argument that it is a 3-dimensional array, with the last dimension indexing the three RGB color channels. - -We specify a moderately-sized CNN for -demonstration purposes, similar in structure to Figure 10.8. 
-We use several layers, each consisting of convolution, ReLU, and max-pooling steps. -We first define a module that defines one of these layers. As in our -previous examples, we overwrite the `__init__()` and `forward()` methods -of `nn.Module`. This user-defined module can now be used in ways just like -`nn.Linear()` or `nn.Dropout()`. - -```{python} -class BuildingBlock(nn.Module): - - def __init__(self, - in_channels, - out_channels): - - super(BuildingBlock, self).__init__() - self.conv = nn.Conv2d(in_channels=in_channels, - out_channels=out_channels, - kernel_size=(3,3), - padding='same') - self.activation = nn.ReLU() - self.pool = nn.MaxPool2d(kernel_size=(2,2)) - - def forward(self, x): - return self.pool(self.activation(self.conv(x))) - -``` - -Notice that we used the `padding = "same"` argument to -`nn.Conv2d()`, which ensures that the output channels have the -same dimension as the input channels. There are 32 channels in the first -hidden layer, in contrast to the three channels in the input layer. We -use a $3\times 3$ convolution filter for each channel in all the layers. Each -convolution is followed by a max-pooling layer over $2\times2$ -blocks. - -In forming our deep learning model for the `CIFAR100` data, we use several of our `BuildingBlock()` -modules sequentially. This simple example -illustrates some of the power of `torch`. Users can -define modules of their own, which can be combined in other -modules. Ultimately, everything is fit by a generic trainer. - -```{python} -class CIFARModel(nn.Module): - - def __init__(self): - super(CIFARModel, self).__init__() - sizes = [(3,32), - (32,64), - (64,128), - (128,256)] - self.conv = nn.Sequential(*[BuildingBlock(in_, out_) - for in_, out_ in sizes]) - - self.output = nn.Sequential(nn.Dropout(0.5), - nn.Linear(2*2*256, 512), - nn.ReLU(), - nn.Linear(512, 100)) - def forward(self, x): - val = self.conv(x) - val = torch.flatten(val, start_dim=1) - return self.output(val) - -``` - -We build the model and look at the summary. (We had created examples of `X_` earlier.) - -```{python} -cifar_model = CIFARModel() -summary(cifar_model, - input_data=X_, - col_names=['input_size', - 'output_size', - 'num_params']) -``` - - -The total number of trainable parameters is 964,516. -By studying the size of the parameters, we can see that the channels halve in both -dimensions -after each of these max-pooling operations. After the last of these we -have a layer with 256 channels of dimension $2\times 2$. These are then -flattened to a dense layer of size 1,024; -in other words, each of the $2\times 2$ matrices is turned into a -$4$-vector, and put side-by-side in one layer. This is followed by a -dropout regularization layer, then -another dense layer of size 512, and finally, the -output layer. - -Up to now, we have been using a default -optimizer in `SimpleModule()`. For these data, -experiments show that a smaller learning rate performs -better than the default 0.01. We use a -custom optimizer here with a learning rate of 0.001. -Besides this, the logging and training -follow a similar pattern to our previous examples. The optimizer -takes an argument `params` that informs -the optimizer which parameters are involved in SGD (stochastic gradient descent). - -We saw earlier that entries of a module’s parameters are tensors. In passing -the parameters to the optimizer we are doing more than -simply passing arrays; part of the structure of the graph -is encoded in the tensors themselves. 
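-
-Before doing so, it can be helpful to peek at a few of these parameter tensors. The following check is our addition (not part of the original lab): `named_parameters()` yields each parameter with its name, and the `requires_grad` flag shows that gradients will be tracked for the optimizer.
-
-```{python}
-# our addition: inspect the first few parameter tensors of the CIFAR model
-for name, param in list(cifar_model.named_parameters())[:4]:
-    print(name, tuple(param.shape), param.requires_grad)
-
-```
-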
-
-```{python}
-cifar_optimizer = RMSprop(cifar_model.parameters(), lr=0.001)
-cifar_module = SimpleModule.classification(cifar_model,
-                                           num_classes=100,
-                                           optimizer=cifar_optimizer)
-cifar_logger = CSVLogger('logs', name='CIFAR100')
-
-```
-
-```{python}
-cifar_trainer = Trainer(deterministic=True,
-                        max_epochs=30,
-                        logger=cifar_logger,
-                        enable_progress_bar=False,
-                        callbacks=[ErrorTracker()])
-cifar_trainer.fit(cifar_module,
-                  datamodule=cifar_dm)
-
-```
-
-This model can take 10 minutes or more to run and achieves about 42% accuracy on the test
-data. Although this is not terrible for 100-class data (a random
-classifier gets 1% accuracy), searching the web we see results around
-75%. Typically it takes a lot of architecture carpentry,
-fiddling with regularization, and time, to achieve such results.
-
-Let’s take a look at the validation and training accuracy
-across epochs.
-
-```{python}
-log_path = cifar_logger.experiment.metrics_file_path
-cifar_results = pd.read_csv(log_path)
-fig, ax = subplots(1, 1, figsize=(6, 6))
-summary_plot(cifar_results,
-             ax,
-             col='accuracy',
-             ylabel='Accuracy')
-ax.set_xticks(np.linspace(0, 30, 7).astype(int))
-ax.set_ylabel('Accuracy')
-ax.set_ylim([0, 1]);
-```
-Finally, we evaluate our model on our test data.
-
-```{python}
-cifar_trainer.test(cifar_module,
-                   datamodule=cifar_dm)
-
-```
-
-
-### Hardware Acceleration
-As deep learning has become ubiquitous in machine learning, hardware
-manufacturers have produced special libraries that can
-often speed up the gradient-descent steps.
-
-For instance, macOS devices with the M1 chip may have the *Metal* programming framework
-enabled, which can speed up the `torch`
-computations. We present an example of how to use this acceleration.
-
-The main changes are to the `Trainer()` call as well as to the metrics
-that will be evaluated on the data. These metrics must be told where
-the data will be located at evaluation time. This is
-accomplished with a call to the `to()` method of the metrics.
-
-```{python}
-try:
-    for name, metric in cifar_module.metrics.items():
-        cifar_module.metrics[name] = metric.to('mps')
-    cifar_trainer_mps = Trainer(accelerator='mps',
-                                deterministic=True,
-                                enable_progress_bar=False,
-                                max_epochs=30)
-    cifar_trainer_mps.fit(cifar_module,
-                          datamodule=cifar_dm)
-    cifar_trainer_mps.test(cifar_module,
-                           datamodule=cifar_dm)
-except:
-    pass
-```
-This yields approximately two- or three-fold acceleration for each epoch.
-We have protected this code block using `try:` and `except:`
-clauses; if it works, we get the speedup; if it fails, nothing happens.
-
-
-## Using Pretrained CNN Models
-We now show how to use a CNN pretrained on the `imagenet` database to classify natural
-images, and demonstrate how we produced Figure 10.10.
-We copied six JPEG images from a digital photo album into the
-directory `book_images`. These images are available
-from the data section of the ISLP book website. Download `book_images.zip`; when
-unzipped, it creates the `book_images` directory.
-
-The pretrained network we use is called `resnet50`; specification details can be found on the web.
-We will read in the images, and
-convert them into the array format expected by the `torch`
-software to match the specifications in `resnet50`.
-The conversion involves a resize, a crop and then a predefined standardization for each of the three channels.
-We now read in the images and preprocess them.
-
-```{python}
-resize = Resize((232,232), antialias=True)
-crop = CenterCrop(224)
-normalize = Normalize([0.485,0.456,0.406],
-                      [0.229,0.224,0.225])
-imgfiles = sorted([f for f in glob('book_images/*')])
-imgs = torch.stack([torch.div(crop(resize(read_image(f))), 255)
-                    for f in imgfiles])
-imgs = normalize(imgs)
-imgs.size()
-```
-
-
-We now set up the trained network with the weights we read in code block 6. The model has 50 layers, with a fair bit of complexity.
-
-```{python}
-resnet_model = resnet50(weights=ResNet50_Weights.DEFAULT)
-summary(resnet_model,
-        input_data=imgs,
-        col_names=['input_size',
-                   'output_size',
-                   'num_params'])
-
-```
-We set the mode to `eval()` to ensure that the model is ready to predict on new data.
-
-```{python}
-resnet_model.eval()
-```
-Inspecting the output above, we see that when setting up the
-`resnet_model`, the authors defined a `Bottleneck`, much like our
-`BuildingBlock` module.
-
-We now feed our six images through the fitted network.
-
-```{python}
-img_preds = resnet_model(imgs)
-
-```
-
-Let’s look at the predicted probabilities for each of the top 3 choices. First we compute
-the probabilities by applying the softmax to the logits in `img_preds`. Note that
-we have had to call the `detach()` method on the tensor `img_preds` in order to convert
-it to a more familiar `ndarray`.
-
-```{python}
-img_probs = np.exp(np.asarray(img_preds.detach()))
-img_probs /= img_probs.sum(1)[:,None]
-
-```
-
-In order to see the class labels, we must download the index file associated with `imagenet`. {This is available from the book website and [s3.amazonaws.com/deep-learning-models/image-models/imagenet_class_index.json](https://s3.amazonaws.com/deep-learning-models/image-models/imagenet_class_index.json).}
-
-```{python}
-labs = json.load(open('imagenet_class_index.json'))
-class_labels = pd.DataFrame([(int(k), v[1]) for k, v in
-                             labs.items()],
-                            columns=['idx', 'label'])
-class_labels = class_labels.set_index('idx')
-class_labels = class_labels.sort_index()
-
-```
-
-We’ll now construct a data frame for each image file
-with the labels with the three highest probabilities as
-estimated by the model above.
-
-```{python}
-for i, imgfile in enumerate(imgfiles):
-    img_df = class_labels.copy()
-    img_df['prob'] = img_probs[i]
-    img_df = img_df.sort_values(by='prob', ascending=False)[:3]
-    print(f'Image: {imgfile}')
-    print(img_df.reset_index().drop(columns=['idx']))
-
-```
-
-
-We see that the model
-is quite confident about `Flamingo.jpg`, but a little less so for the
-other images.
-
-We end this section with our usual cleanup.
-
-```{python}
-del(cifar_test,
-    cifar_train,
-    cifar_dm,
-    cifar_module,
-    cifar_logger,
-    cifar_optimizer,
-    cifar_trainer)
-```
-
-
-## IMDB Document Classification
-We now implement models for sentiment classification (Section 10.4) on the `IMDB`
-dataset. As mentioned above code block 8, we are using
-a preprocessed version of the `IMDB` dataset found in the
-`keras` package. As `keras` uses `tensorflow`, a different
-tensor and deep learning library, we have
-converted the data to be suitable for `torch`. The
-code used to convert from `keras` is
-available in the module `ISLP.torch._make_imdb`. It
-requires some of the `keras` packages to run. These data use a dictionary of size 10,000.
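-
-Before turning to those representations, here is a toy illustration (our addition) of what such a dictionary buys us: with a dictionary of size 10 rather than 10,000, a review given by its word indices can be reduced to a binary feature vector with a one wherever a word occurs.
-
-```{python}
-# toy example (ours): word indices 2, 5, 5, 9 from a dictionary of size 10
-toy_review = np.array([2, 5, 5, 9])
-toy_features = np.zeros(10)
-toy_features[toy_review] = 1   # repeated words still give a single 1
-toy_features
-
-```
-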
- -We have stored three different representations of the review data for this lab: - -* `load_tensor()`, a sparse tensor version usable by `torch`; -* `load_sparse()`, a sparse matrix version usable by `sklearn`, since we will compare with a lasso fit; -* `load_sequential()`, a padded -version of the original sequence representation, limited to the last -500 words of each review. - - - -```{python} -(imdb_seq_train, - imdb_seq_test) = load_sequential(root='data/IMDB') -padded_sample = np.asarray(imdb_seq_train.tensors[0][0]) -sample_review = padded_sample[padded_sample > 0][:12] -sample_review[:12] - -``` -The datasets `imdb_seq_train` and `imdb_seq_test` are -both instances of the class `TensorDataset`. The -tensors used to construct them can be found in the `tensors` attribute, with -the first tensor the features `X` and the second the outcome `Y`. -We have taken the first row of features and stored it as `padded_sample`. In the preprocessing -used to form these data, sequences were padded with 0s in the beginning if they were -not long enough, hence we remove this padding by restricting to entries where -`padded_sample > 0`. We then provide the first 12 words of the sample review. - -We can find these words in the `lookup` dictionary from the `ISLP.torch.imdb` module. - -```{python} -lookup = load_lookup(root='data/IMDB') -' '.join(lookup[i] for i in sample_review) -``` - -For our first model, we have created a binary feature for each -of the 10,000 possible words in the dataset, with an entry of one -in the $i,j$ entry if word $j$ appears in review $i$. As most reviews -are quite short, such a feature matrix has over 98% zeros. These data -are accessed using `load_tensor()` from the `ISLP` library. - -```{python} -max_num_workers=10 -(imdb_train, - imdb_test) = load_tensor(root='data/IMDB') -imdb_dm = SimpleDataModule(imdb_train, - imdb_test, - validation=2000, - num_workers=min(6, max_num_workers), - batch_size=512) - -``` -We’ll use a two-layer model for our first model. - -```{python} -class IMDBModel(nn.Module): - - def __init__(self, input_size): - super(IMDBModel, self).__init__() - self.dense1 = nn.Linear(input_size, 16) - self.activation = nn.ReLU() - self.dense2 = nn.Linear(16, 16) - self.output = nn.Linear(16, 1) - - def forward(self, x): - val = x - for _map in [self.dense1, - self.activation, - self.dense2, - self.activation, - self.output]: - val = _map(val) - return torch.flatten(val) - -``` -We now instantiate our model and look at a summary. - -```{python} -imdb_model = IMDBModel(imdb_test.tensors[0].size()[1]) -summary(imdb_model, - input_size=imdb_test.tensors[0].size(), - col_names=['input_size', - 'output_size', - 'num_params']) - -``` - -We’ll again use -a smaller learning rate for these data, -hence we pass an `optimizer` to the -`SimpleModule`. -Since the reviews are classified into -positive or negative sentiment, we use -`SimpleModule.binary_classification()`. {Our use of - `binary_classification()` instead of `classification()` is - due to some subtlety in how `torchmetrics.Accuracy()` works, -as well as the data type of the targets.} - -```{python} -imdb_optimizer = RMSprop(imdb_model.parameters(), lr=0.001) -imdb_module = SimpleModule.binary_classification( - imdb_model, - optimizer=imdb_optimizer) - -``` - -Having loaded the datasets into a data module -and created a `SimpleModule`, the remaining steps -are familiar. 
-
-```{python}
-imdb_logger = CSVLogger('logs', name='IMDB')
-imdb_trainer = Trainer(deterministic=True,
-                       max_epochs=30,
-                       logger=imdb_logger,
-                       enable_progress_bar=False,
-                       callbacks=[ErrorTracker()])
-imdb_trainer.fit(imdb_module,
-                 datamodule=imdb_dm)
-```
-
-Evaluating the test error yields roughly 86% accuracy.
-
-```{python}
-test_results = imdb_trainer.test(imdb_module, datamodule=imdb_dm)
-test_results
-```
-
-
-### Comparison to Lasso
-We now fit a lasso logistic regression model
- using `LogisticRegression()` from `sklearn`. Since `sklearn` does not recognize
-the sparse tensors of `torch`, we use a sparse
-matrix that is recognized by `sklearn`.
-
-```{python}
-((X_train, Y_train),
- (X_valid, Y_valid),
- (X_test, Y_test)) = load_sparse(validation=2000,
-                                 random_state=0,
-                                 root='data/IMDB')
-
-```
-
-Similar to what we did in
-Section 10.9.1,
-we construct a series of 50 values for the lasso regularization parameter $\lambda$.
-
-```{python}
-lam_max = np.abs(X_train.T * (Y_train - Y_train.mean())).max()
-lam_val = lam_max * np.exp(np.linspace(np.log(1),
-                                       np.log(1e-4), 50))
-
-```
-With `LogisticRegression()` the regularization parameter
-$C$ is specified as the inverse of $\lambda$. There are several
-solvers for logistic regression; here we use `liblinear` which
-works well with the sparse input format.
-
-```{python}
-logit = LogisticRegression(penalty='l1',
-                           C=1/lam_max,
-                           solver='liblinear',
-                           warm_start=True,
-                           fit_intercept=True)
-
-```
-The path of 50 values takes approximately 40 seconds to run.
-As in Chapter 6, we will filter out warnings, this time using a context manager.
-
-```{python}
-with warnings.catch_warnings():
-    warnings.simplefilter("ignore")
-
-    coefs = []
-    intercepts = []
-
-    for l in lam_val:
-        logit.C = 1/l
-        logit.fit(X_train, Y_train)
-        coefs.append(logit.coef_.copy())
-        intercepts.append(logit.intercept_)
-
-```
-
-The coefficients and intercepts have an extraneous dimension, which can be removed
-by the `np.squeeze()` function.
-
-```{python}
-coefs = np.squeeze(coefs)
-intercepts = np.squeeze(intercepts)
-
-```
-We’ll now make a plot to compare our neural network results with the
-lasso.
-
-```{python}
-# %%capture
-fig, axes = subplots(1, 2, figsize=(16, 8), sharey=True)
-for ((X_, Y_),
-     data_,
-     color) in zip([(X_train, Y_train),
-                    (X_valid, Y_valid),
-                    (X_test, Y_test)],
-                   ['Training', 'Validation', 'Test'],
-                   ['black', 'red', 'blue']):
-    linpred_ = X_ * coefs.T + intercepts[None,:]
-    label_ = np.array(linpred_ > 0)
-    accuracy_ = np.array([np.mean(Y_ == l) for l in label_.T])
-    axes[0].plot(-np.log(lam_val / X_train.shape[0]),
-                 accuracy_,
-                 '.--',
-                 color=color,
-                 markersize=13,
-                 linewidth=2,
-                 label=data_)
-axes[0].legend()
-axes[0].set_xlabel(r'$-\log(\lambda)$', fontsize=20)
-axes[0].set_ylabel('Accuracy', fontsize=20)
-
-```
-Notice the use of `%%capture`, which suppresses the displaying of the partially completed figure. This is useful
-when making a complex figure, since the steps can be spread across two or more cells.
-We now add a plot of the lasso accuracy, and display the composed figure by simply entering its name at the end of the cell.
- -```{python} -imdb_results = pd.read_csv(imdb_logger.experiment.metrics_file_path) -summary_plot(imdb_results, - axes[1], - col='accuracy', - ylabel='Accuracy') -axes[1].set_xticks(np.linspace(0, 30, 7).astype(int)) -axes[1].set_ylabel('Accuracy', fontsize=20) -axes[1].set_xlabel('Epoch', fontsize=20) -axes[1].set_ylim([0.5, 1]); -axes[1].axhline(test_results[0]['test_accuracy'], - color='blue', - linestyle='--', - linewidth=3) -fig -``` -From the graphs we see that the accuracy of the lasso logistic regression peaks at about $0.88$, as it does for the neural network. - -Once again, we end with a cleanup. - -```{python} -del(imdb_model, - imdb_trainer, - imdb_logger, - imdb_dm, - imdb_train, - imdb_test) -``` - - -## Recurrent Neural Networks -In this lab we fit the models illustrated in -Section 10.5. - - -### Sequential Models for Document Classification -Here we fit a simple LSTM RNN for sentiment prediction to -the `IMDb` movie-review data, as discussed in Section 10.5.1. -For an RNN we use the sequence of words in a document, taking their -order into account. We loaded the preprocessed -data at the beginning of -Section 10.9.5. -A script that details the preprocessing can be found in the -`ISLP` library. Notably, since more than 90% of the documents -had fewer than 500 words, we set the document length to 500. For -longer documents, we used the last 500 words, and for shorter -documents, we padded the front with blanks. - - -```{python} -imdb_seq_dm = SimpleDataModule(imdb_seq_train, - imdb_seq_test, - validation=2000, - batch_size=300, - num_workers=min(6, max_num_workers) - ) - -``` - -The first layer of the RNN is an embedding layer of size 32, which will be -learned during training. This layer one-hot encodes each document -as a matrix of dimension $500 \times 10,003$, and then maps these -$10,003$ dimensions down to $32$. {The extra 3 dimensions -correspond to commonly occurring non-word entries in the reviews.} - Since each word is represented by an -integer, this is effectively achieved by the creation of an embedding -matrix of size $10,003\times 32$; each of the 500 integers in the -document are then mapped to the appropriate 32 real numbers by -indexing the appropriate rows of this matrix. - - -The second layer is an LSTM with 32 units, and the output -layer is a single logit for the binary classification task. -In the last line of the `forward()` method below, -we take the last 32-dimensional output of the LSTM and map it to our response. - -```{python} -class LSTMModel(nn.Module): - def __init__(self, input_size): - super(LSTMModel, self).__init__() - self.embedding = nn.Embedding(input_size, 32) - self.lstm = nn.LSTM(input_size=32, - hidden_size=32, - batch_first=True) - self.dense = nn.Linear(32, 1) - def forward(self, x): - val, (h_n, c_n) = self.lstm(self.embedding(x)) - return torch.flatten(self.dense(val[:,-1])) -``` -We instantiate and take a look at the summary of the model, using the -first 10 documents in the corpus. - -```{python} -lstm_model = LSTMModel(X_test.shape[-1]) -summary(lstm_model, - input_data=imdb_seq_train.tensors[0][:10], - col_names=['input_size', - 'output_size', - 'num_params']) - -``` - -The 10,003 is suppressed in the summary, but we see it in the -parameter count, since $10,003\times 32=320,096$. 
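-
-As a quick sanity check (ours, not in the original lab), the embedding weight alone accounts for those $10,003\times 32$ parameters:
-
-```{python}
-# the embedding matrix has one 32-vector per dictionary entry (our check)
-lstm_model.embedding.weight.shape
-
-```
-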
-
-```{python}
-lstm_module = SimpleModule.binary_classification(lstm_model)
-lstm_logger = CSVLogger('logs', name='IMDB_LSTM')
-
-```
-
-```{python}
-lstm_trainer = Trainer(deterministic=True,
-                       max_epochs=20,
-                       logger=lstm_logger,
-                       enable_progress_bar=False,
-                       callbacks=[ErrorTracker()])
-lstm_trainer.fit(lstm_module,
-                 datamodule=imdb_seq_dm)
-
-```
-The rest is now similar to other networks we have fit. We
-track the test performance as the network is fit, and see that it attains 85% accuracy.
-
-```{python}
-lstm_trainer.test(lstm_module, datamodule=imdb_seq_dm)
-```
-
-We once again show the learning progress, followed by cleanup.
-
-```{python}
-lstm_results = pd.read_csv(lstm_logger.experiment.metrics_file_path)
-fig, ax = subplots(1, 1, figsize=(6, 6))
-summary_plot(lstm_results,
-             ax,
-             col='accuracy',
-             ylabel='Accuracy')
-ax.set_xticks(np.linspace(0, 20, 5).astype(int))
-ax.set_ylabel('Accuracy')
-ax.set_ylim([0.5, 1])
-
-```
-
-
-```{python}
-del(lstm_model,
-    lstm_trainer,
-    lstm_logger,
-    imdb_seq_dm,
-    imdb_seq_train,
-    imdb_seq_test)
-
-```
-
-
-### Time Series Prediction
-We now show how to fit the models in Section 10.5.2
-for time series prediction.
-We first load and standardize the data.
-
-```{python}
-NYSE = load_data('NYSE')
-cols = ['DJ_return', 'log_volume', 'log_volatility']
-X = pd.DataFrame(StandardScaler(
-                     with_mean=True,
-                     with_std=True).fit_transform(NYSE[cols]),
-                 columns=NYSE[cols].columns,
-                 index=NYSE.index)
-
-```
-
-Next we set up the lagged versions of the data, dropping
-any rows with missing values using the `dropna()` method.
-
-```{python}
-for lag in range(1, 6):
-    for col in cols:
-        newcol = np.zeros(X.shape[0]) * np.nan
-        newcol[lag:] = X[col].values[:-lag]
-        X.insert(len(X.columns), "{0}_{1}".format(col, lag), newcol)
-X.insert(len(X.columns), 'train', NYSE['train'])
-X = X.dropna()
-
-```
-
-Finally, we extract the response, training indicator, and drop the current day’s `DJ_return` and
-`log_volatility` to predict only from previous days’ data.
-
-```{python}
-Y, train = X['log_volume'], X['train']
-X = X.drop(columns=['train'] + cols)
-X.columns
-
-```
-
-
-We first fit a simple linear model and compute the $R^2$ on the test data using
-the `score()` method.
-
-```{python}
-M = LinearRegression()
-M.fit(X[train], Y[train])
-M.score(X[~train], Y[~train])
-```
-
-We refit this model, including the factor variable `day_of_week`.
-For a categorical series in `pandas`, we can form the indicators
-using the `get_dummies()` method.
-
-```{python}
-X_day = pd.concat([X,
-                   pd.get_dummies(NYSE['day_of_week'])],
-                  axis=1).dropna()
-```
- Note that we do not have
-to reinstantiate the linear regression model
-as its `fit()` method accepts a design matrix and a response directly.
-
-```{python}
-M.fit(X_day[train], Y[train])
-M.score(X_day[~train], Y[~train])
-```
-This model achieves an $R^2$ of about 46%.
-
-
-To fit the RNN, we must reshape the data, as it will expect a sequence of 5 lagged
-versions of each feature per observation; the 3 features per
-time step correspond to the `input_size` argument of the layer `nn.RNN()` below. We first
-ensure the columns of our data frame are such that a reshaped
-matrix will have the variables correctly lagged. We use the
-`reindex()` method to do this.
-
-For an input shape `(5,3)`, each row represents a lagged version of the three variables.
-The `nn.RNN()` layer also expects the first row of each
-observation to be earliest in time, so we must reverse the current order.
Hence we loop over
-`range(5,0,-1)` below, whose arguments
-work like those of a slice used to index
-iterable objects. The general notation is `start:end:step`.
-
-```{python}
-ordered_cols = []
-for lag in range(5,0,-1):
-    for col in cols:
-        ordered_cols.append('{0}_{1}'.format(col, lag))
-X = X.reindex(columns=ordered_cols)
-X.columns
-
-```
-We now reshape the data.
-
-```{python}
-X_rnn = X.to_numpy().reshape((-1,5,3))
-X_rnn.shape
-```
-By specifying the first size as -1, `numpy.reshape()` deduces its size based on the remaining arguments.
-
-Now we are ready to proceed with the RNN, which uses 12 hidden units, and 10%
-dropout.
-After passing through the RNN, we extract the final time point as `val[:,-1]`
-in `forward()` below. This gets passed through a 10% dropout and then flattened through
-a linear layer.
-
-```{python}
-class NYSEModel(nn.Module):
-    def __init__(self):
-        super(NYSEModel, self).__init__()
-        self.rnn = nn.RNN(3,
-                          12,
-                          batch_first=True)
-        self.dense = nn.Linear(12, 1)
-        self.dropout = nn.Dropout(0.1)
-    def forward(self, x):
-        val, h_n = self.rnn(x)
-        val = self.dense(self.dropout(val[:,-1]))
-        return torch.flatten(val)
-nyse_model = NYSEModel()
-```
-
-We fit the model in a similar fashion to previous networks. We
-supply the test data as validation data below, so that when
-we monitor its progress and plot the history function we can see the
-progress on the test data. Of course we should not use this as a basis for
-early stopping, since then the test performance would be biased.
-
-We form the training dataset similar to
-our `Hitters` example.
-
-```{python}
-datasets = []
-for mask in [train, ~train]:
-    X_rnn_t = torch.tensor(X_rnn[mask].astype(np.float32))
-    Y_t = torch.tensor(np.asarray(Y[mask], np.float32))
-    datasets.append(TensorDataset(X_rnn_t, Y_t))
-nyse_train, nyse_test = datasets
-
-```
-
-Following our usual pattern, we inspect the summary.
-
-```{python}
-summary(nyse_model,
-        input_data=X_rnn_t,
-        col_names=['input_size',
-                   'output_size',
-                   'num_params'])
-
-```
-We again put the two datasets into a data module, with a
-batch size of 64.
-
-```{python}
-nyse_dm = SimpleDataModule(nyse_train,
-                           nyse_test,
-                           num_workers=min(4, max_num_workers),
-                           validation=nyse_test,
-                           batch_size=64)
-```
-We run some data through our model to be sure the sizes match up correctly.
-
-```{python}
-for idx, (x, y) in enumerate(nyse_dm.train_dataloader()):
-    out = nyse_model(x)
-    print(y.size(), out.size())
-    if idx >= 2:
-        break
-
-```
-
-We follow our previous example for setting up a trainer for a
-regression problem, requesting the $R^2$ metric
-to be computed at each epoch.
-
-```{python}
-nyse_optimizer = RMSprop(nyse_model.parameters(),
-                         lr=0.001)
-nyse_module = SimpleModule.regression(nyse_model,
-                                      optimizer=nyse_optimizer,
-                                      metrics={'r2':R2Score()})
-
-```
-
-Fitting the model should by now be familiar.
-The results on the test data are very similar to the linear AR model.
-
-```{python}
-nyse_trainer = Trainer(deterministic=True,
-                       max_epochs=200,
-                       enable_progress_bar=False,
-                       callbacks=[ErrorTracker()])
-nyse_trainer.fit(nyse_module,
-                 datamodule=nyse_dm)
-nyse_trainer.test(nyse_module,
-                  datamodule=nyse_dm)
-```
-
-
-We could also fit a model without the `nn.RNN()` layer by just
-using a `nn.Flatten()` layer instead. This would be a nonlinear AR model. If in addition we excluded the
-hidden layer, this would be equivalent to our earlier linear AR model.
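-
-For concreteness, here is a minimal sketch (ours, not fit in the lab) of the first alternative just described, in which `nn.Flatten()` replaces `nn.RNN()` so that all 15 lagged values enter a nonlinear AR model directly:
-
-```{python}
-class FlattenedARModel(nn.Module):   # hypothetical name, our sketch
-    def __init__(self):
-        super(FlattenedARModel, self).__init__()
-        # flatten the (5,3) lag structure into 15 inputs for a hidden layer
-        self._forward = nn.Sequential(nn.Flatten(),
-                                      nn.Linear(5*3, 12),
-                                      nn.ReLU(),
-                                      nn.Linear(12, 1))
-    def forward(self, x):
-        return torch.flatten(self._forward(x))
-
-```
-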
-
-Instead we will fit a nonlinear AR model using the feature set `X_day` that includes the `day_of_week` indicators.
-To do so, we
-must first create our test and training datasets and a corresponding
-data module. This may seem a little burdensome, but is part of the
-general pipeline for `torch`.
-
-```{python}
-datasets = []
-for mask in [train, ~train]:
-    X_day_t = torch.tensor(
-                  np.asarray(X_day[mask]).astype(np.float32))
-    Y_t = torch.tensor(np.asarray(Y[mask]).astype(np.float32))
-    datasets.append(TensorDataset(X_day_t, Y_t))
-day_train, day_test = datasets
-```
-
-Creating a data module follows a familiar pattern.
-
-```{python}
-day_dm = SimpleDataModule(day_train,
-                          day_test,
-                          num_workers=min(4, max_num_workers),
-                          validation=day_test,
-                          batch_size=64)
-
-```
-
-We build a `NonLinearARModel()` that takes the 20 features as input and has a hidden layer with 32 units. The remaining steps are familiar.
-
-```{python}
-class NonLinearARModel(nn.Module):
-    def __init__(self):
-        super(NonLinearARModel, self).__init__()
-        self._forward = nn.Sequential(nn.Flatten(),
-                                      nn.Linear(20, 32),
-                                      nn.ReLU(),
-                                      nn.Dropout(0.5),
-                                      nn.Linear(32, 1))
-    def forward(self, x):
-        return torch.flatten(self._forward(x))
-
-```
-
-```{python}
-nl_model = NonLinearARModel()
-nl_optimizer = RMSprop(nl_model.parameters(),
-                       lr=0.001)
-nl_module = SimpleModule.regression(nl_model,
-                                    optimizer=nl_optimizer,
-                                    metrics={'r2':R2Score()})
-
-```
-
-We continue with the usual training steps, fit the model,
-and evaluate the test error. We see the test $R^2$ is a slight improvement over the linear AR model that also includes `day_of_week`.
-
-```{python}
-nl_trainer = Trainer(deterministic=True,
-                     max_epochs=20,
-                     enable_progress_bar=False,
-                     callbacks=[ErrorTracker()])
-nl_trainer.fit(nl_module, datamodule=day_dm)
-nl_trainer.test(nl_module, datamodule=day_dm)
-```
-
-
-
-
diff --git a/Ch11-surv-lab.Rmd b/Ch11-surv-lab.Rmd
deleted file mode 100644
index c5e4e21..0000000
--- a/Ch11-surv-lab.Rmd
+++ /dev/null
@@ -1,550 +0,0 @@
-
-
-# Survival Analysis
-
-
-Open In Colab
-
-
-[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/intro-stat-learning/ISLP_labs/v2.2.1?labpath=Ch11-surv-lab.ipynb)
-
-
- In this lab, we perform survival analyses on three separate data
-sets. In Section 11.8.1 we analyze the `BrainCancer`
-data that was first described in Section 11.3. In Section 11.8.2, we examine the `Publication`
-data from Section 11.5.4. Finally, Section 11.8.3 explores
-a simulated call-center data set.
-
-We begin by importing some of our libraries at this top
-level. This makes the code more readable, as scanning the first few
-lines of the notebook tells us what libraries are used in this
-notebook.
-
-```{python}
-from matplotlib.pyplot import subplots
-import numpy as np
-import pandas as pd
-from ISLP.models import ModelSpec as MS
-from ISLP import load_data
-
-```
-
-We also collect the new imports
-needed for this lab.
-
-```{python}
-from lifelines import \
-     (KaplanMeierFitter,
-      CoxPHFitter)
-from lifelines.statistics import \
-     (logrank_test,
-      multivariate_logrank_test)
-from ISLP.survival import sim_time
-
-```
-
-## Brain Cancer Data
-
-We begin with the `BrainCancer` data set, contained in the `ISLP` package.
-
-```{python}
-BrainCancer = load_data('BrainCancer')
-BrainCancer.columns
-
-```
-
-The rows index the 88 patients, while the 8 columns contain the predictors and outcome variables.
-We first briefly examine the data.
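-
-For instance, the dimensions confirm the description above (this quick check is our addition):
-
-```{python}
-# 88 patients (rows) and 8 columns
-BrainCancer.shape
-
-```
-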
- -```{python} -BrainCancer['sex'].value_counts() - -``` - - -```{python} -BrainCancer['diagnosis'].value_counts() - -``` - - -```{python} -BrainCancer['status'].value_counts() - -``` - - -Before beginning an analysis, it is important to know how the -`status` variable has been coded. Most software -uses the convention that a `status` of 1 indicates an -uncensored observation (often death), and a `status` of 0 indicates a censored -observation. But some scientists might use the opposite coding. For -the `BrainCancer` data set 35 patients died before the end of -the study, so we are using the conventional coding. - -To begin the analysis, we re-create the Kaplan-Meier survival curve shown in Figure 11.2. The main -package we will use for survival analysis -is `lifelines`. -The variable `time` corresponds to $y_i$, the time to the $i$th event (either censoring or -death). The first argument to `km.fit` is the event time, and the -second argument is the censoring variable, with a 1 indicating an observed -failure time. The `plot()` method produces a survival curve with pointwise confidence -intervals. By default, these are 90% confidence intervals, but this can be changed -by setting the `alpha` argument to one minus the desired -confidence level. - -```{python} -fig, ax = subplots(figsize=(8,8)) -km = KaplanMeierFitter() -km_brain = km.fit(BrainCancer['time'], BrainCancer['status']) -km_brain.plot(label='Kaplan Meier estimate', ax=ax) - -``` - -Next we create Kaplan-Meier survival curves that are stratified by -`sex`, in order to reproduce Figure 11.3. -We do this using the `groupby()` method of a dataframe. -This method returns a generator that can -be iterated over in the `for` loop. In this case, -the items in the `for` loop are 2-tuples representing -the groups: the first entry is the value -of the grouping column `sex` while the second value -is the dataframe consisting of all rows in the -dataframe matching that value of `sex`. -We will want to use this data below -in the log-rank test, hence we store this -information in the dictionary `by_sex`. Finally, -we have also used the notion of - *string interpolation* to automatically -label the different lines in the plot. String -interpolation is a powerful technique to format strings --- -`Python` has many ways to facilitate such operations. - -```{python} -fig, ax = subplots(figsize=(8,8)) -by_sex = {} -for sex, df in BrainCancer.groupby('sex'): - by_sex[sex] = df - km_sex = km.fit(df['time'], df['status']) - km_sex.plot(label='Sex=%s' % sex, ax=ax) - -``` - -As discussed in Section 11.4, we can perform a -log-rank test to compare the survival of males to females. We use -the `logrank_test()` function from the `lifelines.statistics` module. -The first two arguments are the event times, with the second -denoting the corresponding (optional) censoring indicators. - -```{python} -logrank_test(by_sex['Male']['time'], - by_sex['Female']['time'], - by_sex['Male']['status'], - by_sex['Female']['status']) - -``` - - -The resulting $p$-value is $0.23$, indicating no evidence of a -difference in survival between the two sexes. - -Next, we use the `CoxPHFitter()` estimator -from `lifelines` to fit Cox proportional hazards models. -To begin, we consider a model that uses `sex` as the only predictor. 
-
-```{python}
-coxph = CoxPHFitter # shorthand
-sex_df = BrainCancer[['time', 'status', 'sex']]
-model_df = MS(['time', 'status', 'sex'],
-              intercept=False).fit_transform(sex_df)
-cox_fit = coxph().fit(model_df,
-                      'time',
-                      'status')
-cox_fit.summary[['coef', 'se(coef)', 'p']]
-
-```
-
-The first argument to `fit` should be a data frame containing
-at least the event time (the second argument `time` in this case),
-as well as an optional censoring variable (the argument `status` in this case).
-Note also that the Cox model does not include an intercept, which is why
-we used the `intercept=False` argument to `ModelSpec` above.
-The `summary` attribute delivers many columns; we chose to abbreviate its output here.
-It is possible to obtain the likelihood ratio test comparing this model to the one
-with no features as follows:
-
-```{python}
-cox_fit.log_likelihood_ratio_test()
-
-```
-
-Regardless of which test we use, we see that there is no clear
-evidence for a difference in survival between males and females. As
-we learned in this chapter, the score test from the Cox model is
-exactly equal to the log-rank test statistic!
-
-Now we fit a model that makes use of additional predictors. We first note
-that one of our `diagnosis` values is missing, hence
-we drop that observation before continuing.
-
-```{python}
-cleaned = BrainCancer.dropna()
-all_MS = MS(cleaned.columns, intercept=False)
-all_df = all_MS.fit_transform(cleaned)
-fit_all = coxph().fit(all_df,
-                      'time',
-                      'status')
-fit_all.summary[['coef', 'se(coef)', 'p']]
-
-```
-
- The `diagnosis` variable has been coded so that the baseline
-corresponds to HG glioma. The results indicate that the risk associated with HG glioma
-is more than eight times (i.e. $e^{2.15}=8.62$) the risk associated
-with meningioma. In other words, after adjusting for the other
-predictors, patients with HG glioma have much worse survival compared
-to those with meningioma. In addition, larger values of the Karnofsky
-index, `ki`, are associated with lower risk, i.e. longer survival.
-
-Finally, we plot estimated survival curves for each diagnosis category,
-adjusting for the other predictors. To make these plots, we set the
-values of the other predictors equal to the mean for quantitative variables
-and equal to the mode for categorical variables. To do this, we use the
-`apply()` method across rows (i.e. `axis=0`) with a function
-`representative` that checks if a column is categorical
-or not.
-
-```{python}
-levels = cleaned['diagnosis'].unique()
-def representative(series):
-    if hasattr(series.dtype, 'categories'):
-        return pd.Series.mode(series)
-    else:
-        return series.mean()
-modal_data = cleaned.apply(representative, axis=0)
-
-```
-
-We make four
-copies of these representative values and assign the `diagnosis` column to be the four different
-diagnoses.
-
-```{python}
-modal_df = pd.DataFrame(
-              [modal_data.iloc[0] for _ in range(len(levels))])
-modal_df['diagnosis'] = levels
-modal_df
-
-```
-
-We then construct the model matrix based on the model specification `all_MS` used to fit
-the model, and name the rows according to the levels of `diagnosis`.
-
-```{python}
-modal_X = all_MS.transform(modal_df)
-modal_X.index = levels
-modal_X
-
-```
-
-We can use the `predict_survival_function()` method to obtain the estimated survival function.
-
-```{python}
-predicted_survival = fit_all.predict_survival_function(modal_X)
-predicted_survival
-
-```
-This returns a data frame,
-whose `plot()` method yields the different survival curves.
To avoid clutter in -the plots, we do not display confidence intervals. - -```{python} -fig, ax = subplots(figsize=(8, 8)) -predicted_survival.plot(ax=ax); - -``` - - -## Publication Data -The `Publication` data presented in Section 11.5.4 can be -found in the `ISLP` package. -We first reproduce Figure 11.5 by plotting the Kaplan-Meier curves -stratified on the `posres` variable, which records whether the -study had a positive or negative result. - -```{python} -fig, ax = subplots(figsize=(8,8)) -Publication = load_data('Publication') -by_result = {} -for result, df in Publication.groupby('posres'): - by_result[result] = df - km_result = km.fit(df['time'], df['status']) - km_result.plot(label='Result=%d' % result, ax=ax) - -``` - -As discussed previously, the $p$-values from fitting Cox’s -proportional hazards model to the `posres` variable are quite -large, providing no evidence of a difference in time-to-publication -between studies with positive versus negative results. - -```{python} -posres_df = MS(['posres', - 'time', - 'status'], - intercept=False).fit_transform(Publication) -posres_fit = coxph().fit(posres_df, - 'time', - 'status') -posres_fit.summary[['coef', 'se(coef)', 'p']] - -``` - - -However, the results change dramatically when we include other -predictors in the model. Here we exclude the funding mechanism -variable. - -```{python} -model = MS(Publication.columns.drop('mech'), - intercept=False) -coxph().fit(model.fit_transform(Publication), - 'time', - 'status').summary[['coef', 'se(coef)', 'p']] - -``` - -We see that there are a number of statistically significant variables, -including whether the trial focused on a clinical endpoint, the impact -of the study, and whether the study had positive or negative results. - - -## Call Center Data - -In this section, we will simulate survival data using the relationship -between cumulative hazard and -the survival function explored in Exercise 8. -Our simulated data will represent the observed -wait times (in seconds) for 2,000 customers who have phoned a call -center. In this context, censoring occurs if a customer hangs up -before his or her call is answered. - -There are three covariates: `Operators` (the number of call -center operators available at the time of the call, which can range -from $5$ to $15$), `Center` (either A, B, or C), and -`Time` of day (Morning, Afternoon, or Evening). We generate data -for these covariates so that all possibilities are equally likely: for -instance, morning, afternoon and evening calls are equally likely, and -any number of operators from $5$ to $15$ is equally likely. - -```{python} -rng = np.random.default_rng(10) -N = 2000 -Operators = rng.choice(np.arange(5, 16), - N, - replace=True) -Center = rng.choice(['A', 'B', 'C'], - N, - replace=True) -Time = rng.choice(['Morn.', 'After.', 'Even.'], - N, - replace=True) -D = pd.DataFrame({'Operators': Operators, - 'Center': pd.Categorical(Center), - 'Time': pd.Categorical(Time)}) -``` - -We then build a model matrix (omitting the intercept) - -```{python} -model = MS(['Operators', - 'Center', - 'Time'], - intercept=False) -X = model.fit_transform(D) -``` - -It is worthwhile to take a peek at the model matrix `X`, so -that we can be sure that we understand how the variables have been coded. By default, -the levels of categorical variables are sorted and, as usual, the first column of the one-hot encoding -of the variable is dropped. - -```{python} -X[:5] - -``` - -Next, we specify the coefficients and the hazard function. 
- -```{python} -true_beta = np.array([0.04, -0.3, 0, 0.2, -0.2]) -true_linpred = X.dot(true_beta) -hazard = lambda t: 1e-5 * t - -``` - -Here, we have set the coefficient associated with `Operators` to -equal $0.04$; in other words, each additional operator leads to a -$e^{0.04}=1.041$-fold increase in the “risk” that the call will be -answered, given the `Center` and `Time` covariates. This -makes sense: the greater the number of operators at hand, the shorter -the wait time! The coefficient associated with `Center == B` is -$-0.3$, and `Center == A` is treated as the baseline. This means -that the risk of a call being answered at Center B is 0.74 times the -risk that it will be answered at Center A; in other words, the wait -times are a bit longer at Center B. - -Recall from Section 2.3.7 the use of `lambda` -for creating short functions on the fly. -We use the function -`sim_time()` from the `ISLP.survival` package. This function -uses the relationship between the survival function -and cumulative hazard $S(t) = \exp(-H(t))$ and the specific -form of the cumulative hazard function in the Cox model -to simulate data based on values of the linear predictor -`true_linpred` and the cumulative hazard. - We need to provide the cumulative hazard function, which we do here. - -```{python} -cum_hazard = lambda t: 1e-5 * t**2 / 2 - -``` -We are now ready to generate data under the Cox proportional hazards -model. We truncate the maximum time to 1000 seconds to keep -simulated wait times reasonable. The function -`sim_time()` takes a linear predictor, -a cumulative hazard function and a -random number generator. - -```{python} -W = np.array([sim_time(l, cum_hazard, rng) - for l in true_linpred]) -D['Wait time'] = np.clip(W, 0, 1000) - -``` - -We now simulate our censoring variable, for which we assume -90% of calls were answered (`Failed==1`) before the -customer hung up (`Failed==0`). - -```{python} -D['Failed'] = rng.choice([1, 0], - N, - p=[0.9, 0.1]) -D[:5] - -``` - - -```{python} -D['Failed'].mean() - -``` - -We now plot Kaplan-Meier survival curves. First, we stratify by `Center`. - -```{python} -fig, ax = subplots(figsize=(8,8)) -by_center = {} -for center, df in D.groupby('Center'): - by_center[center] = df - km_center = km.fit(df['Wait time'], df['Failed']) - km_center.plot(label='Center=%s' % center, ax=ax) -ax.set_title("Probability of Still Being on Hold") - -``` - -Next, we stratify by `Time`. - -```{python} -fig, ax = subplots(figsize=(8,8)) -by_time = {} -for time, df in D.groupby('Time'): - by_time[time] = df - km_time = km.fit(df['Wait time'], df['Failed']) - km_time.plot(label='Time=%s' % time, ax=ax) -ax.set_title("Probability of Still Being on Hold") - -``` - -It seems that calls at Call Center B take longer to be answered than -calls at Centers A and C. Similarly, it appears that wait times are -longest in the morning and shortest in the evening hours. We can use a -log-rank test to determine whether these differences are statistically -significant using the function `multivariate_logrank_test()`. - -```{python} -multivariate_logrank_test(D['Wait time'], - D['Center'], - D['Failed']) - -``` - - -Next, we consider the effect of `Time`. - -```{python} -multivariate_logrank_test(D['Wait time'], - D['Time'], - D['Failed']) - -``` - - -As in the case of a categorical variable with 2 levels, these -results are similar to the likelihood ratio test -from the Cox proportional hazards model. First, we -look at the results for `Center`. 
- -```{python} -X = MS(['Wait time', - 'Failed', - 'Center'], - intercept=False).fit_transform(D) -F = coxph().fit(X, 'Wait time', 'Failed') -F.log_likelihood_ratio_test() - -``` - - -Next, we look at the results for `Time`. - -```{python} -X = MS(['Wait time', - 'Failed', - 'Time'], - intercept=False).fit_transform(D) -F = coxph().fit(X, 'Wait time', 'Failed') -F.log_likelihood_ratio_test() - -``` - - -We find that differences between centers are highly significant, as -are differences between times of day. - -Finally, we fit Cox's proportional hazards model to the data. - -```{python} -X = MS(D.columns, - intercept=False).fit_transform(D) -fit_queuing = coxph().fit( - X, - 'Wait time', - 'Failed') -fit_queuing.summary[['coef', 'se(coef)', 'p']] - -``` - - -The $p$-values for Center B and evening time -are very small. It is also clear that the -hazard --- that is, the instantaneous risk that a call will be -answered --- increases with the number of operators. Since we -generated the data ourselves, we know that the true coefficients for - `Operators`, `Center = B`, `Center = C`, -`Time = Even.` and `Time = Morn.` are $0.04$, $-0.3$, -$0$, $0.2$, and $-0.2$, respectively. The coefficient estimates -from the fitted Cox model are fairly accurate. - - diff --git a/Ch12-unsup-lab.Rmd b/Ch12-unsup-lab.Rmd deleted file mode 100644 index a640c11..0000000 --- a/Ch12-unsup-lab.Rmd +++ /dev/null @@ -1,904 +0,0 @@ - - -# Unsupervised Learning - - -Open In Colab - - -[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/intro-stat-learning/ISLP_labs/v2.2.1?labpath=Ch12-unsup-lab.ipynb) - - -In this lab we demonstrate PCA and clustering on several datasets. -As in other labs, we import some of our libraries at this top -level. This makes the code more readable, as scanning the first few -lines of the notebook tells us what libraries are used in this -notebook. - -```{python} -import numpy as np -import pandas as pd -import matplotlib.pyplot as plt -from statsmodels.datasets import get_rdataset -from sklearn.decomposition import PCA -from sklearn.preprocessing import StandardScaler -from ISLP import load_data - -``` -We also collect the new imports -needed for this lab. - -```{python} -from sklearn.cluster import \ - (KMeans, - AgglomerativeClustering) -from scipy.cluster.hierarchy import \ - (dendrogram, - cut_tree) -from ISLP.cluster import compute_linkage - -``` - -## Principal Components Analysis -In this lab, we perform PCA on `USArrests`, a data set in the -`R` computing environment. -We retrieve the data using `get_rdataset()`, which can fetch data from -many standard `R` packages. - -The rows of the data set contain the 50 states, in alphabetical order. - -```{python} -USArrests = get_rdataset('USArrests').data -USArrests - -``` - -The columns of the data set contain the four variables. - -```{python} -USArrests.columns - -``` - -We first briefly examine the data. We notice that the variables have vastly different means. - -```{python} -USArrests.mean() - -``` - - -Dataframes have several useful methods for computing -column-wise summaries. We can also examine the -variance of the four variables using the `var()` method. - -```{python} -USArrests.var() - -``` - -Not surprisingly, the variables also have vastly different variances. -The `UrbanPop` variable measures the percentage of the population -in each state living in an urban area, which is not a comparable -number to the number of rapes in each state per 100,000 individuals. 
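-
-A look at the standard deviations (our addition) makes the same point on the original scale of each variable:
-
-```{python}
-# Assault varies far more than the other three variables
-USArrests.std()
-
-```
-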
-PCA looks for derived variables that account for most of the variance in the data set. -If we do not scale the variables before performing PCA, then the principal components -would mostly be driven by the -`Assault` variable, since it has by far the largest -variance. So if the variables are measured in different units or vary widely in scale, it is recommended to standardize the variables to have standard deviation one before performing PCA. -Typically we set the means to zero as well. - -This scaling can be done via the `StandardScaler()` transform imported above. We first `fit` the -scaler, which computes the necessary means and standard -deviations and then apply it to our data using the -`transform` method. As before, we combine these steps using the `fit_transform()` method. - - -```{python} -scaler = StandardScaler(with_std=True, - with_mean=True) -USArrests_scaled = scaler.fit_transform(USArrests) - -``` -Having scaled the data, we can then -perform principal components analysis using the `PCA()` transform -from the `sklearn.decomposition` package. - -```{python} -pcaUS = PCA() - -``` -(By default, the `PCA()` transform centers the variables to have -mean zero though it does not scale them.) The transform `pcaUS` -can be used to find the PCA -`scores` returned by `fit()`. Once the `fit` method has been called, the `pcaUS` object also contains a number of useful quantities. - -```{python} -pcaUS.fit(USArrests_scaled) - -``` - -After fitting, the `mean_` attribute corresponds to the means -of the variables. In this case, since we centered and scaled the data with -`scaler()` the means will all be 0. - -```{python} -pcaUS.mean_ - -``` - -The scores can be computed using the `transform()` method -of `pcaUS` after it has been fit. - -```{python} -scores = pcaUS.transform(USArrests_scaled) - -``` -We will plot these scores a bit further down. -The `components_` attribute provides the principal component loadings: -each row of `pcaUS.components_` contains the corresponding -principal component loading vector. - - -```{python} -pcaUS.components_ - -``` - -The `biplot` is a common visualization method used with -PCA. It is not built in as a standard -part of `sklearn`, though there are python -packages that do produce such plots. Here we -make a simple biplot manually. - -```{python} -i, j = 0, 1 # which components -fig, ax = plt.subplots(1, 1, figsize=(8, 8)) -ax.scatter(scores[:,0], scores[:,1]) -ax.set_xlabel('PC%d' % (i+1)) -ax.set_ylabel('PC%d' % (j+1)) -for k in range(pcaUS.components_.shape[1]): - ax.arrow(0, 0, pcaUS.components_[i,k], pcaUS.components_[j,k]) - ax.text(pcaUS.components_[i,k], - pcaUS.components_[j,k], - USArrests.columns[k]) - -``` -Notice that this figure is a reflection of Figure 12.1 through the $y$-axis. Recall that the -principal components are only unique up to a sign change, so we can -reproduce that figure by flipping the -signs of the second set of scores and loadings. -We also increase the length of the arrows to emphasize the loadings. 
-
-```{python}
-scale_arrow = s_ = 2
-scores[:,1] *= -1
-pcaUS.components_[1] *= -1 # flip the y-axis
-fig, ax = plt.subplots(1, 1, figsize=(8, 8))
-ax.scatter(scores[:,0], scores[:,1])
-ax.set_xlabel('PC%d' % (i+1))
-ax.set_ylabel('PC%d' % (j+1))
-for k in range(pcaUS.components_.shape[1]):
-    ax.arrow(0, 0, s_*pcaUS.components_[i,k], s_*pcaUS.components_[j,k])
-    ax.text(s_*pcaUS.components_[i,k],
-            s_*pcaUS.components_[j,k],
-            USArrests.columns[k])
-
-```
-
-The standard deviations of the principal component scores are as follows:
-
-```{python}
-scores.std(0, ddof=1)
-```
-
-
-The variance of each score can be extracted directly from the `pcaUS` object via
-the `explained_variance_` attribute.
-
-```{python}
-pcaUS.explained_variance_
-
-```
-The proportion of variance explained by each principal
-component (PVE) is stored as `explained_variance_ratio_`:
-
-```{python}
-pcaUS.explained_variance_ratio_
-
-```
-We see that the first principal component explains 62.0% of the
-variance in the data, the next principal component explains 24.7%
-of the variance, and so forth.
-We can plot the PVE explained by each component, as well as the cumulative PVE. We first
-plot the proportion of variance explained.
-
-```{python}
-# %%capture
-fig, axes = plt.subplots(1, 2, figsize=(15, 6))
-ticks = np.arange(pcaUS.n_components_)+1
-ax = axes[0]
-ax.plot(ticks,
-        pcaUS.explained_variance_ratio_,
-        marker='o')
-ax.set_xlabel('Principal Component');
-ax.set_ylabel('Proportion of Variance Explained')
-ax.set_ylim([0,1])
-ax.set_xticks(ticks)
-
-```
-Notice the use of `%%capture`, which suppresses the displaying of the partially completed figure.
-
-```{python}
-ax = axes[1]
-ax.plot(ticks,
-        pcaUS.explained_variance_ratio_.cumsum(),
-        marker='o')
-ax.set_xlabel('Principal Component')
-ax.set_ylabel('Cumulative Proportion of Variance Explained')
-ax.set_ylim([0, 1])
-ax.set_xticks(ticks)
-fig
-
-```
-The result is similar to that shown in Figure 12.3. Note
-that the method `cumsum()` computes the cumulative sum of
-the elements of a numeric vector. For instance:
-
-```{python}
-a = np.array([1,2,8,-3])
-np.cumsum(a)
-
-```
-## Matrix Completion
-
-We now re-create the analysis carried out on the `USArrests` data in
-Section 12.3.
-
-We saw in Section 12.2.2 that solving the optimization
-problem (12.6) on a centered data matrix $\bf X$ is
-equivalent to computing the first $M$ principal
-components of the data. We use our scaled
-and centered `USArrests` data as $\bf X$ below. The *singular value decomposition*
-(SVD) is a general algorithm for solving
-(12.6).
-
-```{python}
-X = USArrests_scaled
-U, D, V = np.linalg.svd(X, full_matrices=False)
-U.shape, D.shape, V.shape
-
-```
-The `np.linalg.svd()` function returns three components, `U`, `D` and `V`. The matrix `V` is equivalent to the
-loading matrix from principal components (up to an unimportant sign flip). Using the `full_matrices=False` option ensures that
-for a tall matrix the shape of `U` is the same as the shape of `X`.
-
-```{python}
-V
-
-```
-
-```{python}
-pcaUS.components_
-
-```
-The matrix `U` corresponds to a *standardized* version of the PCA score matrix (each column standardized to have sum-of-squares one). If we multiply each column of `U` by the corresponding element of `D`, we recover the PCA scores exactly (up to a meaningless sign flip).
-
-```{python}
-(U * D[None,:])[:3]
-
-```
-
-```{python}
-scores[:3]
-
-```
-While it would be possible to carry out this lab using the `PCA()` estimator,
-here we use the `np.linalg.svd()` function in order to illustrate its use.
-
-We now omit 20 entries in the $50\times 4$ data matrix at random. We do so
-by first selecting 20 rows (states) at random, and then selecting one
-of the four entries in each row at random. This ensures that every row has
-at least three observed values.
-
-```{python}
-n_omit = 20
-np.random.seed(15)
-r_idx = np.random.choice(np.arange(X.shape[0]),
-                         n_omit,
-                         replace=False)
-c_idx = np.random.choice(np.arange(X.shape[1]),
-                         n_omit,
-                         replace=True)
-Xna = X.copy()
-Xna[r_idx, c_idx] = np.nan
-
-```
-
-Here the array `r_idx`
-contains 20 integers from 0 to 49; this represents the states (rows of `X`) that are selected to contain missing values. And `c_idx` contains
-20 integers from 0 to 3, representing the features (columns in `X`) that contain the missing values for each of the selected states.
-
-We now write some code to implement Algorithm 12.1.
-We first write a function that takes in a matrix, and returns a low-rank approximation to that matrix using the `svd()` function.
-This will be needed in Step 2 of Algorithm 12.1.
-
-```{python}
-def low_rank(X, M=1):
-    U, D, V = np.linalg.svd(X)
-    L = U[:,:M] * D[None,:M]
-    return L.dot(V[:M])
-
-```
-To conduct Step 1 of the algorithm, we initialize `Xhat` --- this is $\tilde{\bf X}$ in Algorithm 12.1 --- by replacing
-the missing values with the column means of the non-missing entries. These column means are stored in
-`Xbar` below, computed by running `np.nanmean()` over axis 0 (i.e. averaging over the rows).
-We make a copy so that when we assign values to `Xhat` below we do not also overwrite the
-values in `Xna`.
-
-```{python}
-Xhat = Xna.copy()
-Xbar = np.nanmean(Xhat, axis=0)
-Xhat[r_idx, c_idx] = Xbar[c_idx]
-
-```
-
-Before we begin Step 2, we set ourselves up to measure the progress of our
-iterations:
-
-```{python}
-thresh = 1e-7
-rel_err = 1
-count = 0
-ismiss = np.isnan(Xna)
-mssold = np.mean(Xhat[~ismiss]**2)
-mss0 = np.mean(Xna[~ismiss]**2)
-
-```
-Here `ismiss` is a logical matrix with the same dimensions as `Xna`;
-a given element is `True` if the corresponding matrix element is missing. The notation `~ismiss` negates this boolean matrix, which is useful
-because it allows us to access both the missing and the non-missing entries. We store the mean of the squared non-missing elements in `mss0`.
-We store the mean squared error of the non-missing elements of the old version of `Xhat` in `mssold` (which currently
-agrees with `mss0`). We plan to store the mean squared error of the non-missing elements of the current version of `Xhat` in `mss`, and will then
-iterate Step 2 of Algorithm 12.1 until the *relative error*, defined as
-`(mssold - mss) / mss0`, falls below `thresh = 1e-7`.
-(Algorithm 12.1 tells us to iterate Step 2 until (12.14) is no longer decreasing. Determining whether (12.14) is decreasing requires us only to keep track of `mssold - mss`. However, in practice, we keep track of `(mssold - mss) / mss0` instead: this makes it so that the number of iterations required for Algorithm 12.1 to converge does not depend on whether we multiplied the raw data $\bf X$ by a constant factor.)
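-As a quick sanity check of `low_rank()` before entering the loop (a check added here for
-illustration; it is not part of the original lab), note that with `M` equal to the number of
-columns the "approximation" has full rank and should reproduce its input:
-
-```{python}
-# Illustrative check: a rank-4 approximation of a 50 x 4 matrix is exact,
-# up to floating-point rounding.
-np.allclose(low_rank(Xhat, M=4), Xhat)
-
-```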
-In Step 2(a) of Algorithm 12.1, we approximate `Xhat` using `low_rank()`; we call this `Xapp`. In Step 2(b), we use `Xapp` to update the estimates for elements in `Xhat` that are missing in `Xna`. Finally, in Step 2(c), we compute the relative error. These three steps are contained in the following `while` loop:
-
-```{python}
-while rel_err > thresh:
-    count += 1
-    # Step 2(a)
-    Xapp = low_rank(Xhat, M=1)
-    # Step 2(b)
-    Xhat[ismiss] = Xapp[ismiss]
-    # Step 2(c)
-    mss = np.mean(((Xna - Xapp)[~ismiss])**2)
-    rel_err = (mssold - mss) / mss0
-    mssold = mss
-    print("Iteration: {0}, MSS:{1:.3f}, Rel.Err {2:.2e}"
-          .format(count, mss, rel_err))
-
-```
-
-We see that after eight iterations, the relative error has fallen below `thresh = 1e-7`, and so the algorithm terminates. When this happens, the mean squared error of the non-missing elements equals 0.381.
-
-Finally, we compute the correlation between the 20 imputed values
-and the actual values:
-
-```{python}
-np.corrcoef(Xapp[ismiss], X[ismiss])[0,1]
-
-```
-
-In this lab, we implemented Algorithm 12.1 ourselves for didactic purposes. However, a reader who wishes to apply matrix completion to their data might look to more specialized `Python` implementations.
-
-
-## Clustering
-
-
-### $K$-Means Clustering
-
-The estimator `sklearn.cluster.KMeans()` performs $K$-means clustering in
-`Python`. We begin with a simple simulated example in which there
-truly are two clusters in the data: the first 25 observations have a
-mean shift relative to the next 25 observations.
-
-```{python}
-np.random.seed(0)
-X = np.random.standard_normal((50,2))
-X[:25,0] += 3
-X[:25,1] -= 4
-
-```
-We now perform $K$-means clustering with $K=2$.
-
-```{python}
-kmeans = KMeans(n_clusters=2,
-                random_state=2,
-                n_init=20).fit(X)
-
-```
-We specify `random_state` to make the results reproducible. The cluster assignments of the 50 observations are contained in `kmeans.labels_`.
-
-```{python}
-kmeans.labels_
-
-```
-The $K$-means clustering perfectly separated the observations into two
-clusters even though we did not supply any group information to
-`KMeans()`. We can plot the data, with each observation
-colored according to its cluster assignment.
-
-```{python}
-fig, ax = plt.subplots(1, 1, figsize=(8,8))
-ax.scatter(X[:,0], X[:,1], c=kmeans.labels_)
-ax.set_title("K-Means Clustering Results with K=2");
-
-```
-
-Here the observations can be easily plotted because they are
-two-dimensional. If there were more than two variables then we could
-instead perform PCA and plot the first two principal component score
-vectors to represent the clusters.
-
-In this example, we knew that there really
-were two clusters because we generated the data. However, for real
-data, we do not know the true number of clusters, nor whether they exist in any precise way. We could
-instead have performed $K$-means clustering on this example with
-$K=3$.
-
-```{python}
-kmeans = KMeans(n_clusters=3,
-                random_state=3,
-                n_init=20).fit(X)
-fig, ax = plt.subplots(figsize=(8,8))
-ax.scatter(X[:,0], X[:,1], c=kmeans.labels_)
-ax.set_title("K-Means Clustering Results with K=3");
-
-```
-When $K=3$, $K$-means clustering splits up the two clusters.
-We have used the `n_init` argument to run $K$-means with 20
-initial cluster assignments (the default is 10). If a
-value of `n_init` greater than one is used, then $K$-means
-clustering will be performed using multiple random assignments in
-Step 1 of Algorithm 12.2, and the `KMeans()`
-function will report only the best results. Here we compare using
-`n_init=1` to `n_init=20`.
-
-```{python}
-kmeans1 = KMeans(n_clusters=3,
-                 random_state=3,
-                 n_init=1).fit(X)
-kmeans20 = KMeans(n_clusters=3,
-                  random_state=3,
-                  n_init=20).fit(X);
-kmeans1.inertia_, kmeans20.inertia_
-
-```
-Note that `kmeans.inertia_` is the total within-cluster sum
-of squares (12.17), which we seek to minimize by performing $K$-means
-clustering.
-
-We *strongly* recommend always running $K$-means clustering with
-a large value of `n_init`, such as 20 or 50, since otherwise an
-undesirable local optimum may be obtained.
-
-When performing $K$-means clustering, in addition to using multiple
-initial cluster assignments, it is also important to set a random seed
-using the `random_state` argument to `KMeans()`. This way, the initial
-cluster assignments in Step 1 can be replicated, and the $K$-means
-output will be fully reproducible.
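-As a check on what `inertia_` measures (a check added here for illustration; it is not part
-of the original lab), we can recompute the total within-cluster sum of squares directly from
-the fitted cluster centers:
-
-```{python}
-# Illustrative check: the sum of squared distances from each observation to
-# its assigned cluster center should match kmeans20.inertia_.
-centers = kmeans20.cluster_centers_[kmeans20.labels_]
-np.sum((X - centers)**2), kmeans20.inertia_
-
-```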
-
-### Hierarchical Clustering
-
-The `AgglomerativeClustering()` class from
-the `sklearn.cluster` package implements hierarchical clustering.
-As its
-name is long, we use the shorthand `HClust` for *hierarchical clustering*. Note that this is just an alias; instances will still be of class `AgglomerativeClustering`.
-In the following example we use the simulated data from the $K$-means section above to plot the hierarchical clustering
-dendrogram using complete, single, and average linkage clustering
-with Euclidean distance as the dissimilarity measure. We begin by
-clustering observations using complete linkage.
-
-```{python}
-HClust = AgglomerativeClustering
-hc_comp = HClust(distance_threshold=0,
-                 n_clusters=None,
-                 linkage='complete')
-hc_comp.fit(X)
-
-```
-
-This computes the entire dendrogram.
-We could just as easily perform hierarchical clustering with average or single linkage instead:
-
-```{python}
-hc_avg = HClust(distance_threshold=0,
-                n_clusters=None,
-                linkage='average');
-hc_avg.fit(X)
-hc_sing = HClust(distance_threshold=0,
-                 n_clusters=None,
-                 linkage='single');
-hc_sing.fit(X);
-
-```
-
-To use a precomputed distance matrix, we provide an additional
-argument `metric="precomputed"`. In the code below, the first four lines compute the $50\times 50$ pairwise-distance matrix.
-
-```{python}
-D = np.zeros((X.shape[0], X.shape[0]))
-for i in range(X.shape[0]):
-    x_ = np.multiply.outer(np.ones(X.shape[0]), X[i])
-    D[i] = np.sqrt(np.sum((X - x_)**2, 1))
-hc_sing_pre = HClust(distance_threshold=0,
-                     n_clusters=None,
-                     metric='precomputed',
-                     linkage='single')
-hc_sing_pre.fit(D)
-
-```
-
-We use
-`dendrogram()` from `scipy.cluster.hierarchy` to plot the dendrogram. However,
-`dendrogram()` expects a so-called *linkage-matrix representation*
-of the clustering, which is not provided by `AgglomerativeClustering()`,
-but can be computed. The function `compute_linkage()` in the
-`ISLP.cluster` package is provided for this purpose.
-
-We can now plot the dendrograms. The numbers at the bottom of the plot
-identify each observation. The `dendrogram()` function has a default method that
-colors different branches of the tree, suggesting a pre-defined cut of the tree at a particular depth.
-We prefer to override this default by setting the coloring threshold to negative infinity, so that all branches are drawn in one color. Since we want this behavior for many dendrograms, we store these values in a dictionary `cargs` and pass them as keyword arguments using the notation `**cargs`.
-
-```{python}
-cargs = {'color_threshold':-np.inf,
-         'above_threshold_color':'black'}
-linkage_comp = compute_linkage(hc_comp)
-fig, ax = plt.subplots(1, 1, figsize=(8, 8))
-dendrogram(linkage_comp,
-           ax=ax,
-           **cargs);
-
-```
-
-We may want to color branches of the tree above
-and below a cut-threshold differently. This can be achieved
-by changing the `color_threshold`. Let’s cut the tree at a height of 4,
-coloring links that merge above 4 in black.
-
-```{python}
-fig, ax = plt.subplots(1, 1, figsize=(8, 8))
-dendrogram(linkage_comp,
-           ax=ax,
-           color_threshold=4,
-           above_threshold_color='black');
-
-```
-
-To determine the cluster labels for each observation associated with a
-given cut of the dendrogram, we can use the `cut_tree()`
-function from `scipy.cluster.hierarchy`:
-
-```{python}
-cut_tree(linkage_comp, n_clusters=4).T
-
-```
-
-This can also be achieved by providing an argument `n_clusters`
-to `HClust()`; however, each cut would require recomputing
-the clustering. Similarly, trees may be cut by distance threshold
-with an argument of `distance_threshold` to `HClust()`
-or `height` to `cut_tree()`.
-
-```{python}
-cut_tree(linkage_comp, height=5)
-
-```
-
-To scale the variables before performing hierarchical clustering of
-the observations, we use `StandardScaler()` as in our PCA example:
-
-```{python}
-scaler = StandardScaler()
-X_scale = scaler.fit_transform(X)
-hc_comp_scale = HClust(distance_threshold=0,
-                       n_clusters=None,
-                       linkage='complete').fit(X_scale)
-linkage_comp_scale = compute_linkage(hc_comp_scale)
-fig, ax = plt.subplots(1, 1, figsize=(8, 8))
-dendrogram(linkage_comp_scale, ax=ax, **cargs)
-ax.set_title("Hierarchical Clustering with Scaled Features");
-
-```
-
-Correlation-based distances between observations can be used for
-clustering. The correlation between two observations measures the
-similarity of their feature values. (Suppose each observation has
-$p$ features, each a single numerical value. We measure the
-similarity of two such observations by computing the
-correlation of these $p$ pairs of numbers.)
-With $n$ observations, the $n\times n$ correlation matrix can then be used as a similarity (or affinity) matrix, so that one minus the correlation matrix is the dissimilarity matrix used for clustering.
-
-Note that using correlation only makes sense for
-data with at least three features, since the absolute correlation
-between any two observations with measurements on two features is
-always one. Hence, we will cluster a three-dimensional data set.
-
-```{python}
-X = np.random.standard_normal((30, 3))
-corD = 1 - np.corrcoef(X)
-hc_cor = HClust(linkage='complete',
-                distance_threshold=0,
-                n_clusters=None,
-                metric='precomputed')
-hc_cor.fit(corD)
-linkage_cor = compute_linkage(hc_cor)
-fig, ax = plt.subplots(1, 1, figsize=(8, 8))
-dendrogram(linkage_cor, ax=ax, **cargs)
-ax.set_title("Complete Linkage with Correlation-Based Dissimilarity");
-
-```
-
-
-## NCI60 Data Example
-Unsupervised techniques are often used in the analysis of genomic
-data. In particular, PCA and hierarchical clustering are popular
-tools. We illustrate these techniques on the `NCI60` cancer cell line
-microarray data, which consists of 6830 gene expression
-measurements on 64 cancer cell lines.
-
-```{python}
-NCI60 = load_data('NCI60')
-nci_labs = NCI60['labels']
-nci_data = NCI60['data']
-
-```
-
-Each cell line is labeled with a cancer type. We do not make use of
-the cancer types in performing PCA and clustering, as these are
-unsupervised techniques.
-But after performing PCA and clustering, we
-will check to see the extent to which these cancer types agree with
-the results of these unsupervised techniques.
-
-The data has 64 rows and 6830 columns.
-
-```{python}
-nci_data.shape
-
-```
-
-
-We begin by examining the cancer types for the cell lines.
-
-
-```{python}
-nci_labs.value_counts()
-
-```
-
-
-### PCA on the NCI60 Data
-
-We first perform PCA on the data after scaling the variables (genes)
-to have standard deviation one, although here one could reasonably argue
-that it is better not to scale the genes, as they are measured in the same units.
-
-```{python}
-scaler = StandardScaler()
-nci_scaled = scaler.fit_transform(nci_data)
-nci_pca = PCA()
-nci_scores = nci_pca.fit_transform(nci_scaled)
-
-```
-
-We now plot the first few principal component score vectors, in order
-to visualize the data. The observations (cell lines) corresponding to
-a given cancer type will be plotted in the same color, so that we can
-see to what extent the observations within a cancer type are similar
-to each other.
-
-```{python}
-cancer_types = list(np.unique(nci_labs))
-nci_groups = np.array([cancer_types.index(lab)
-                       for lab in nci_labs.values])
-fig, axes = plt.subplots(1, 2, figsize=(15,6))
-ax = axes[0]
-ax.scatter(nci_scores[:,0],
-           nci_scores[:,1],
-           c=nci_groups,
-           marker='o',
-           s=50)
-ax.set_xlabel('PC1'); ax.set_ylabel('PC2')
-ax = axes[1]
-ax.scatter(nci_scores[:,0],
-           nci_scores[:,2],
-           c=nci_groups,
-           marker='o',
-           s=50)
-ax.set_xlabel('PC1'); ax.set_ylabel('PC3');
-
-```
-On the whole, cell lines corresponding to a single cancer type do tend to
-have similar values on the first few principal component score
-vectors. This indicates that cell lines from the same cancer type tend
-to have pretty similar gene expression levels.
-
-We can also plot the percent variance
-explained by the principal components, as well as the cumulative percent variance explained.
-This is similar to the plots we made earlier for the `USArrests` data.
-
-```{python}
-fig, axes = plt.subplots(1, 2, figsize=(15,6))
-ax = axes[0]
-ticks = np.arange(nci_pca.n_components_)+1
-ax.plot(ticks,
-        nci_pca.explained_variance_ratio_,
-        marker='o')
-ax.set_xlabel('Principal Component');
-ax.set_ylabel('PVE')
-ax = axes[1]
-ax.plot(ticks,
-        nci_pca.explained_variance_ratio_.cumsum(),
-        marker='o');
-ax.set_xlabel('Principal Component')
-ax.set_ylabel('Cumulative PVE');
-
-```
-We see that together, the first seven principal components explain
-around 40% of the variance in the data. This is not a huge amount
-of the variance. However, looking at the scree plot, we see that while
-each of the first seven principal components explains a substantial
-amount of variance, there is a marked decrease in the variance
-explained by further principal components. That is, there is an
-*elbow* in the plot after approximately the seventh
-principal component. This suggests that there may be little benefit
-to examining more than seven or so principal components (though even
-examining seven principal components may be difficult).
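-We can confirm this reading of the scree plot directly (a quick check added here; it is not
-part of the original lab):
-
-```{python}
-# Illustrative check: cumulative PVE of the first seven principal components.
-nci_pca.explained_variance_ratio_[:7].sum()
-
-```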
-
-### Clustering the Observations of the NCI60 Data
-
-We now perform hierarchical clustering of the cell lines in the `NCI60` data using
-complete, single, and average linkage. Once again, the goal is to find out whether or not the observations cluster into distinct types of cancer. Euclidean
-distance is used as the dissimilarity measure. We first write a short
-function to produce
-the three dendrograms.
-
-```{python}
-def plot_nci(linkage, ax, cut=-np.inf):
-    cargs = {'above_threshold_color':'black',
-             'color_threshold':cut}
-    hc = HClust(n_clusters=None,
-                distance_threshold=0,
-                linkage=linkage.lower()).fit(nci_scaled)
-    linkage_ = compute_linkage(hc)
-    dendrogram(linkage_,
-               ax=ax,
-               labels=np.asarray(nci_labs),
-               leaf_font_size=10,
-               **cargs)
-    ax.set_title('%s Linkage' % linkage)
-    return hc
-
-```
-
-Let’s plot our results.
-
-```{python}
-fig, axes = plt.subplots(3, 1, figsize=(15,30))
-ax = axes[0]; hc_comp = plot_nci('Complete', ax)
-ax = axes[1]; hc_avg = plot_nci('Average', ax)
-ax = axes[2]; hc_sing = plot_nci('Single', ax)
-
-```
-We see that the
-choice of linkage certainly does affect the results
-obtained. Typically, single linkage will tend to yield *trailing*
-clusters: very large clusters onto which individual observations
-attach one-by-one. On the other hand, complete and average linkage
-tend to yield more balanced, attractive clusters. For this reason,
-complete and average linkage are generally preferred to single
-linkage. Clearly cell lines within a single cancer type do tend to
-cluster together, although the clustering is not perfect. We will use
-complete linkage hierarchical clustering for the analysis that
-follows.
-
-We can cut the dendrogram at the height that will yield a particular
-number of clusters, say four:
-
-```{python}
-linkage_comp = compute_linkage(hc_comp)
-comp_cut = cut_tree(linkage_comp, n_clusters=4).reshape(-1)
-pd.crosstab(nci_labs['label'],
-            pd.Series(comp_cut, name='Complete'))
-
-```
-
-
-There are some clear patterns. All the leukemia cell lines fall in
-one cluster, while the breast cancer cell lines are spread out over
-three different clusters.
-
-We can plot a cut on the dendrogram that produces these four clusters:
-
-```{python}
-fig, ax = plt.subplots(figsize=(10,10))
-plot_nci('Complete', ax, cut=140)
-ax.axhline(140, c='r', linewidth=4);
-
-```
-
-The `axhline()` function draws a horizontal line on top of any
-existing set of axes. The argument `140` plots a horizontal
-line at height 140 on the dendrogram; this is a height that
-results in four distinct clusters. It is easy to verify that the
-resulting clusters are the same as the ones we obtained in
-`comp_cut`.
-
-We claimed earlier in Section 12.4.2 that
-$K$-means clustering and hierarchical clustering with the dendrogram
-cut to obtain the same number of clusters can yield very different
-results. How do these `NCI60` hierarchical clustering results compare
-to what we get if we perform $K$-means clustering with $K=4$?
-
-```{python}
-nci_kmeans = KMeans(n_clusters=4,
-                    random_state=0,
-                    n_init=20).fit(nci_scaled)
-pd.crosstab(pd.Series(comp_cut, name='HClust'),
-            pd.Series(nci_kmeans.labels_, name='K-means'))
-
-```
-
-We see that the four clusters obtained using hierarchical clustering
-and $K$-means clustering are somewhat different. First we note
-that the labels in the two clusterings are arbitrary: swapping
-the identifiers of two clusters does not
-change the clustering. Bearing this in mind, we see that cluster 3 in
-$K$-means clustering is identical to cluster 2 in hierarchical
-clustering. However, the other clusters differ: for instance,
-cluster 0 in $K$-means clustering contains a portion of the
-observations assigned to cluster 0 by hierarchical clustering, as well
-as all of the observations assigned to cluster 1 by hierarchical
-clustering.
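-Because the labels are arbitrary, a label-invariant measure of agreement between the two
-clusterings can be useful. One such measure (not used in the original lab) is the adjusted
-Rand index from `sklearn.metrics`, which is close to 1 for near-identical partitions and
-close to 0 for unrelated ones:
-
-```{python}
-# Illustrative: agreement between the two clusterings, ignoring label names.
-from sklearn.metrics import adjusted_rand_score
-adjusted_rand_score(comp_cut, nci_kmeans.labels_)
-
-```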
- -Rather than performing hierarchical clustering on the entire data -matrix, we can also perform hierarchical clustering on the first few -principal component score vectors, regarding these first few components -as a less noisy version of the data. - -```{python} -hc_pca = HClust(n_clusters=None, - distance_threshold=0, - linkage='complete' - ).fit(nci_scores[:,:5]) -linkage_pca = compute_linkage(hc_pca) -fig, ax = plt.subplots(figsize=(8,8)) -dendrogram(linkage_pca, - labels=np.asarray(nci_labs), - leaf_font_size=10, - ax=ax, - **cargs) -ax.set_title("Hier. Clust. on First Five Score Vectors") -pca_labels = pd.Series(cut_tree(linkage_pca, - n_clusters=4).reshape(-1), - name='Complete-PCA') -pd.crosstab(nci_labs['label'], pca_labels) - -``` - - - - - diff --git a/Ch13-multiple-lab.Rmd b/Ch13-multiple-lab.Rmd deleted file mode 100644 index f19c7a6..0000000 --- a/Ch13-multiple-lab.Rmd +++ /dev/null @@ -1,574 +0,0 @@ - - -# Multiple Testing - - -Open In Colab - - -[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/intro-stat-learning/ISLP_labs/v2.2.1?labpath=Ch13-multiple-lab.ipynb) - - - - - -We include our usual imports seen in earlier labs. - -```{python} -import numpy as np -import pandas as pd -import matplotlib.pyplot as plt -import statsmodels.api as sm -from ISLP import load_data - -``` - -We also collect the new imports -needed for this lab. - -```{python} -from scipy.stats import \ - (ttest_1samp, - ttest_rel, - ttest_ind, - t as t_dbn) -from statsmodels.stats.multicomp import \ - pairwise_tukeyhsd -from statsmodels.stats.multitest import \ - multipletests as mult_test - -``` - - -## Review of Hypothesis Tests -We begin by performing some one-sample $t$-tests. - -First we create 100 variables, each consisting of 10 observations. The -first 50 variables have mean $0.5$ and variance $1$, while the others -have mean $0$ and variance $1$. - -```{python} -rng = np.random.default_rng(12) -X = rng.standard_normal((10, 100)) -true_mean = np.array([0.5]*50 + [0]*50) -X += true_mean[None,:] - -``` - -To begin, we use `ttest_1samp()` from the -`scipy.stats` module to test $H_{0}: \mu_1=0$, the null -hypothesis that the first variable has mean zero. - -```{python} -result = ttest_1samp(X[:,0], 0) -result.pvalue - -``` - -The $p$-value comes out to 0.931, which is not low enough to -reject the null hypothesis at level $\alpha=0.05$. In this case, -$\mu_1=0.5$, so the null hypothesis is false. Therefore, we have made -a Type II error by failing to reject the null hypothesis when the null -hypothesis is false. - -We now test $H_{0,j}: \mu_j=0$ for $j=1,\ldots,100$. We compute the -100 $p$-values, and then construct a vector recording whether the -$j$th $p$-value is less than or equal to 0.05, in which case we reject -$H_{0j}$, or greater than 0.05, in which case we do not reject -$H_{0j}$, for $j=1,\ldots,100$. - -```{python} -p_values = np.empty(100) -for i in range(100): - p_values[i] = ttest_1samp(X[:,i], 0).pvalue -decision = pd.cut(p_values, - [0, 0.05, 1], - labels=['Reject H0', - 'Do not reject H0']) -truth = pd.Categorical(true_mean == 0, - categories=[True, False], - ordered=True) - -``` -Since this is a simulated data set, we can create a $2 \times 2$ table -similar to Table 13.2. - -```{python} -pd.crosstab(decision, - truth, - rownames=['Decision'], - colnames=['H0']) - -``` -Therefore, at level $\alpha=0.05$, we reject 15 of the 50 false -null hypotheses, and we incorrectly reject 5 of the true null -hypotheses. 
Using the notation from Section 13.3, we have
-$V=5$, $S=15$, $U=45$ and $W=35$.
-We have set $\alpha=0.05$, which means that we expect to reject around
-5% of the true null hypotheses. This is in line with the $2 \times 2$
-table above, which indicates that we rejected $V=5$ of the $50$ true
-null hypotheses.
-
-In the simulation above, for the false null hypotheses, the ratio of
-the mean to the standard deviation was only $0.5/1 = 0.5$. This
-amounts to quite a weak signal, and it resulted in a high number of
-Type II errors. Let’s instead simulate data with a stronger signal,
-so that the ratio of the mean to the standard deviation for the false
-null hypotheses equals $1$. This time we make only 10 Type II errors.
-
-
-```{python}
-true_mean = np.array([1]*50 + [0]*50)
-X = rng.standard_normal((10, 100))
-X += true_mean[None,:]
-for i in range(100):
-    p_values[i] = ttest_1samp(X[:,i], 0).pvalue
-decision = pd.cut(p_values,
-                  [0, 0.05, 1],
-                  labels=['Reject H0',
-                          'Do not reject H0'])
-truth = pd.Categorical(true_mean == 0,
-                       categories=[True, False],
-                       ordered=True)
-pd.crosstab(decision,
-            truth,
-            rownames=['Decision'],
-            colnames=['H0'])
-
-```
-
-
-
-## Family-Wise Error Rate
-Recall from (13.5) that if the null hypothesis is true
-for each of $m$ independent hypothesis tests, then the FWER is equal
-to $1-(1-\alpha)^m$. We can use this expression to compute the FWER
-for $m=1,\ldots, 500$ and $\alpha=0.05$, $0.01$, and $0.001$.
-We plot the FWER for these values of $\alpha$ in order to
-reproduce Figure 13.2.
-
-```{python}
-m = np.linspace(1, 501)
-fig, ax = plt.subplots()
-[ax.plot(m,
-         1 - (1 - alpha)**m,
-         label=r'$\alpha=%s$' % str(alpha))
- for alpha in [0.05, 0.01, 0.001]]
-ax.set_xscale('log')
-ax.set_xlabel('Number of Hypotheses')
-ax.set_ylabel('Family-Wise Error Rate')
-ax.legend()
-ax.axhline(0.05, c='k', ls='--');
-
-```
-
-As discussed previously, even for moderate values of $m$ such as $50$,
-the FWER exceeds $0.05$ unless $\alpha$ is set to a very low value,
-such as $0.001$. Of course, the problem with setting $\alpha$ to such
-a low value is that we are likely to make a number of Type II errors:
-in other words, our power is very low.
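-For instance, with $m=50$ independent tests of true null hypotheses, each conducted at level
-$\alpha=0.05$, the FWER is over 90% (a quick check added here; it is not part of the
-original lab):
-
-```{python}
-# Illustrative: FWER for 50 independent tests of true nulls at alpha=0.05.
-1 - (1 - 0.05)**50
-
-```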
In other -words, for such methods, the adjusted $p$-values resulting from the `multipletests()` -function can simply be compared to the desired FWER in order to -determine whether or not to reject each hypothesis. We will later -see that we can use the same function to control FDR as well. - - -The `mult_test()` function takes $p$-values and a `method` argument, as well as an optional -`alpha` argument. It returns the decisions (`reject` below) -as well as the adjusted $p$-values (`bonf`). - -```{python} -reject, bonf = mult_test(fund_mini_pvals, method = "bonferroni")[:2] -reject - -``` - - -The $p$-values `bonf` are simply the `fund_mini_pvalues` multiplied by 5 and truncated to be less than -or equal to 1. - -```{python} -bonf, np.minimum(fund_mini_pvals * 5, 1) - -``` - -Therefore, using Bonferroni’s method, we are able to reject the null hypothesis only for Manager -One while controlling FWER at $0.05$. - -By contrast, using Holm’s method, the adjusted $p$-values indicate -that we can reject the null -hypotheses for Managers One and Three at a FWER of $0.05$. - -```{python} -mult_test(fund_mini_pvals, method = "holm", alpha=0.05)[:2] - -``` - - -As discussed previously, Manager One seems to perform particularly -well, whereas Manager Two has poor performance. - - -```{python} -fund_mini.mean() - -``` - - -Is there evidence of a meaningful difference in performance between -these two managers? We can check this by performing a paired $t$-test using the `ttest_rel()` function -from `scipy.stats`: - -```{python} -ttest_rel(fund_mini['Manager1'], - fund_mini['Manager2']).pvalue - -``` - -The test results in a $p$-value of 0.038, -suggesting a statistically significant difference. - -However, we decided to perform this test only after examining the data -and noting that Managers One and Two had the highest and lowest mean -performances. In a sense, this means that we have implicitly -performed ${5 \choose 2} = 5(5-1)/2=10$ hypothesis tests, rather than -just one, as discussed in Section 13.3.2. Hence, we use the -`pairwise_tukeyhsd()` function from -`statsmodels.stats.multicomp` to apply Tukey’s method - in order to adjust for multiple testing. This function takes -as input a fitted *ANOVA* regression model, which is -essentially just a linear regression in which all of the predictors -are qualitative. In this case, the response consists of the monthly -excess returns achieved by each manager, and the predictor indicates -the manager to which each return corresponds. - -```{python} -returns = np.hstack([fund_mini.iloc[:,i] for i in range(5)]) -managers = np.hstack([[i+1]*50 for i in range(5)]) -tukey = pairwise_tukeyhsd(returns, managers) -print(tukey.summary()) - -``` - - -The `pairwise_tukeyhsd()` function provides confidence intervals -for the difference between each pair of managers (`lower` and -`upper`), as well as a $p$-value. All of these quantities have -been adjusted for multiple testing. Notice that the $p$-value for the -difference between Managers One and Two has increased from $0.038$ to -$0.186$, so there is no longer clear evidence of a difference between -the managers’ performances. We can plot the confidence intervals for -the pairwise comparisons using the `plot_simultaneous()` method -of `tukey`. Any pair of intervals that don’t overlap indicates a significant difference at the nominal level of 0.05. In this case, -no differences are considered significant as reported in the table above. 
-
-We can plot the confidence intervals for
-the pairwise comparisons using the `plot_simultaneous()` method
-of `tukey`. Any pair of intervals that don’t overlap indicates a significant difference at the nominal level of 0.05. In this case,
-no differences are considered significant as reported in the table above.
-
-```{python}
-fig, ax = plt.subplots(figsize=(8,8))
-tukey.plot_simultaneous(ax=ax);
-
-```
-
-## False Discovery Rate
-Now we perform hypothesis tests for all 2,000 fund managers in the
-`Fund` dataset. We perform a one-sample $t$-test
-of $H_{0,j}: \mu_j=0$, which states that the
-$j$th fund manager’s mean return is zero.
-
-```{python}
-fund_pvalues = np.empty(2000)
-for i, manager in enumerate(Fund.columns):
-    fund_pvalues[i] = ttest_1samp(Fund[manager], 0).pvalue
-
-```
-
-There are far too many managers to consider trying to control the FWER.
-Instead, we focus on controlling the FDR: that is, the expected fraction of rejected null hypotheses that are actually false positives.
-The `multipletests()` function (abbreviated `mult_test()`) can be used to carry out the Benjamini--Hochberg procedure.
-
-```{python}
-fund_qvalues = mult_test(fund_pvalues, method = "fdr_bh")[1]
-fund_qvalues[:10]
-
-```
-
-The *q-values* output by the
-Benjamini--Hochberg procedure can be interpreted as the smallest FDR
-threshold at which we would reject a particular null hypothesis. For
-instance, a $q$-value of $0.1$ indicates that we can reject the
-corresponding null hypothesis at an FDR of 10% or greater, but that
-we cannot reject the null hypothesis at an FDR below 10%.
-
-If we control the FDR at 10%, then for how many of the fund managers can we reject $H_{0,j}: \mu_j=0$?
-
-```{python}
-(fund_qvalues <= 0.1).sum()
-
-```
-We find that 146 of the 2,000 fund managers have a $q$-value of 0.1 or less;
-therefore, we are able to conclude that 146 of the fund managers
-beat the market at an FDR of 10%. Only about 15 (10% of 146) of
-these fund managers are likely to be false discoveries.
-
-By contrast, if we had instead used Bonferroni’s method to control the
-FWER at level $\alpha=0.1$, then we would have failed to reject any
-null hypotheses!
-
-```{python}
-(fund_pvalues <= 0.1 / 2000).sum()
-
-```
-
-
-Figure 13.6 displays the ordered
-$p$-values, $p_{(1)} \leq p_{(2)} \leq \cdots \leq p_{(2000)}$, for
-the `Fund` dataset, as well as the threshold for rejection by the
-Benjamini--Hochberg procedure. Recall that the Benjamini--Hochberg
-procedure identifies the largest $p$-value such that $p_{(j)} < qj/m$,
-in order to control the FDR at level $q$. The code below implements this
-rule directly for $q=0.1$.
-
-```{python}
-sorted_ = np.sort(fund_pvalues)
-m = sorted_.shape[0]
-q = 0.1
-sorted_set_ = np.where(sorted_ < q * np.linspace(1, m, m) / m)[0]
-if sorted_set_.shape[0] > 0:
-    selected_ = fund_pvalues < sorted_[sorted_set_].max()
-    sorted_set_ = np.arange(sorted_set_.max())
-else:
-    selected_ = []
-    sorted_set_ = []
-
-```
-
-We now reproduce the middle panel of Figure 13.6.
-
-```{python}
-fig, ax = plt.subplots()
-ax.scatter(np.arange(0, sorted_.shape[0]) + 1,
-           sorted_, s=10)
-ax.set_yscale('log')
-ax.set_xscale('log')
-ax.set_ylabel('P-Value')
-ax.set_xlabel('Index')
-ax.scatter(sorted_set_+1, sorted_[sorted_set_], c='r', s=20)
-ax.axline((0, 0), (1,q/m), c='k', ls='--', linewidth=3);
-
-```
-
-
-## A Re-Sampling Approach
-Here, we implement the re-sampling approach to hypothesis testing
-using the `Khan` dataset, which we investigated in
-Section 13.5. First, we merge the training and
-testing data, which results in observations on 83 patients for
-2,308 genes.
-
-```{python}
-Khan = load_data('Khan')
-D = pd.concat([Khan['xtrain'], Khan['xtest']])
-D['Y'] = pd.concat([Khan['ytrain'], Khan['ytest']])
-D['Y'].value_counts()
-
-```
-
-
-There are four classes of cancer. For each gene, we compare the mean
-expression in the second class (rhabdomyosarcoma) to the mean
-expression in the fourth class (Burkitt’s lymphoma).
-Performing a
-standard two-sample $t$-test
-using `ttest_ind()` from `scipy.stats` on the $11$th
-gene produces a test statistic of -2.09 and an associated $p$-value
-of 0.0412, suggesting modest evidence of a difference in mean
-expression levels between the two cancer types.
-
-```{python}
-D2 = D[lambda df:df['Y'] == 2]
-D4 = D[lambda df:df['Y'] == 4]
-gene_11 = 'G0011'
-observedT, pvalue = ttest_ind(D2[gene_11],
-                              D4[gene_11],
-                              equal_var=True)
-observedT, pvalue
-
-```
-
-
-However, this $p$-value relies on the assumption that under the null
-hypothesis of no difference between the two groups, the test statistic
-follows a $t$-distribution with $29+25-2=52$ degrees of freedom.
-Instead of using this theoretical null distribution, we can randomly
-split the 54 patients into two groups of 29 and 25, and compute a new
-test statistic. Under the null hypothesis of no difference between
-the groups, this new test statistic should have the same distribution
-as our original one. Repeating this process 10,000 times allows us to
-approximate the null distribution of the test statistic. We compute
-the fraction of the time that the re-sampled test statistics exceed
-our observed test statistic in absolute value.
-
-```{python}
-B = 10000
-Tnull = np.empty(B)
-D_ = np.hstack([D2[gene_11], D4[gene_11]])
-n_ = D2[gene_11].shape[0]
-D_null = D_.copy()
-for b in range(B):
-    rng.shuffle(D_null)
-    ttest_ = ttest_ind(D_null[:n_],
-                       D_null[n_:],
-                       equal_var=True)
-    Tnull[b] = ttest_.statistic
-(np.abs(Tnull) >= np.abs(observedT)).mean()
-
-```
-
-
-This fraction, 0.0398,
-is our re-sampling-based $p$-value.
-It is almost identical to the $p$-value of 0.0412 obtained using the theoretical null distribution.
-We can plot a histogram of the re-sampling-based test statistics in order to reproduce Figure 13.7.
-
-```{python}
-fig, ax = plt.subplots(figsize=(8,8))
-ax.hist(Tnull,
-        bins=100,
-        density=True,
-        facecolor='y',
-        label='Null')
-xval = np.linspace(-4.2, 4.2, 1001)
-ax.plot(xval,
-        t_dbn.pdf(xval, D_.shape[0]-2),
-        c='r')
-ax.axvline(observedT,
-           c='b',
-           label='Observed')
-ax.legend()
-ax.set_xlabel("Null Distribution of Test Statistic");
-
-```
-The re-sampling-based null distribution is almost identical to the theoretical null distribution, which is displayed in red.
-
-Finally, we implement the plug-in re-sampling FDR approach outlined in
-Algorithm 13.4. Depending on the speed of your
-computer, calculating the FDR for all 2,308 genes in the `Khan`
-dataset may take a while. Hence, we will illustrate the approach on a
-random subset of 100 genes. For each gene, we first compute the
-observed test statistic, and then produce 10,000 re-sampled test
-statistics. This may take a few minutes to run. If you are in a rush,
-then you could set `B` equal to a smaller value (e.g. `B=500`).
-
-```{python}
-m, B = 100, 10000
-idx = rng.choice(Khan['xtest'].columns, m, replace=False)
-T_vals = np.empty(m)
-Tnull_vals = np.empty((m, B))
-
-for j in range(m):
-    col = idx[j]
-    T_vals[j] = ttest_ind(D2[col],
-                          D4[col],
-                          equal_var=True).statistic
-    D_ = np.hstack([D2[col], D4[col]])
-    D_null = D_.copy()
-    for b in range(B):
-        rng.shuffle(D_null)
-        ttest_ = ttest_ind(D_null[:n_],
-                           D_null[n_:],
-                           equal_var=True)
-        Tnull_vals[j,b] = ttest_.statistic
-
-```
-
-Next, we compute the number of rejected null hypotheses $R$, the
-estimated number of false positives $\widehat{V}$, and the estimated
-FDR, for a range of threshold values $c$ in
-Algorithm 13.4.
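-In symbols, for a cutoff $c$ (restating Algorithm 13.4 in this lab’s notation),
-
-$$R(c) = \sum_{j=1}^{m} 1_{\{|T_j| \geq c\}}, \qquad
-\widehat{V}(c) = \frac{1}{B}\sum_{j=1}^{m}\sum_{b=1}^{B} 1_{\{|T_j^{(b)}| \geq c\}}, \qquad
-\widehat{\mathrm{FDR}}(c) = \frac{\widehat{V}(c)}{R(c)},$$
-
-where $T_j$ is the observed test statistic for the $j$th gene and $T_j^{(b)}$ is its $b$th
-re-sampled value.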
-The threshold values are chosen
-using the absolute values of the test statistics from the 100 genes.
-
-```{python}
-cutoffs = np.sort(np.abs(T_vals))
-FDRs, Rs, Vs = np.empty((3, m))
-for j in range(m):
-    R = np.sum(np.abs(T_vals) >= cutoffs[j])
-    V = np.sum(np.abs(Tnull_vals) >= cutoffs[j]) / B
-    Rs[j] = R
-    Vs[j] = V
-    FDRs[j] = V / R
-
-```
-
-Now, for any given FDR, we can find the genes that will be
-rejected. For example, with FDR controlled at 0.1, we reject 15 of the
-100 null hypotheses. On average, we would expect about one or two of
-these genes (i.e. 10% of 15) to be false discoveries. At an FDR of
-0.2, we can reject the null hypothesis for 28 genes, of which we
-expect around six to be false discoveries.
-
-The variable `idx` stores the identities of
-the 100 randomly selected genes. Let’s look at
-the genes whose estimated FDR is less than 0.1.
-
-```{python}
-sorted(idx[np.abs(T_vals) >= cutoffs[FDRs < 0.1].min()])
-
-```
-
-At an FDR threshold of 0.2, more genes are selected, at the cost of having a higher expected
-proportion of false discoveries.
-
-```{python}
-sorted(idx[np.abs(T_vals) >= cutoffs[FDRs < 0.2].min()])
-
-```
-
-The following code generates Figure 13.11, which is similar
-to Figure 13.9,
-except that it is based on only a subset of the genes.
-
-```{python}
-fig, ax = plt.subplots()
-ax.plot(Rs, FDRs, 'b', linewidth=3)
-ax.set_xlabel("Number of Rejections")
-ax.set_ylabel("False Discovery Rate");
-
-```
-
-
diff --git a/README.md b/README.md
index 539c30e..642de12 100644
--- a/README.md
+++ b/README.md
@@ -95,7 +95,7 @@ Open your terminal and run the following command to set up the environment for v
 * `--python-version 3.12`: This will use Python 3.12 for the environment.
 * `Ch02-statlearn-lab.ipynb`: This is an optional argument to run a specific notebook after the setup is complete. It is meant for testing to be sure given notebooks run but is not required. You can list more than one notebook.
 
-### 2. Activate the environment
+### 3. Activate the environment
 
 Once the script is finished, you can activate the virtual environment to run other notebooks or work with the lab materials.
 
@@ -109,7 +109,7 @@ Once the script is finished, you can activate the virtual environment to run oth
 
     ISLP_v2.2.2\.venv\Scripts\activate
     ```
 
-### 3. Run other notebooks
+### 4. Run other notebooks
 
 After activating the environment, you can start Jupyter Lab to run other noteboo
 
    jupyter lab
   ```
 
+## R Markdown files
+
+Earlier versions of this repo maintained R Markdown versions of the labs alongside the Jupyter notebooks. These have been removed from the repo, but they can be rebuilt from the
+notebooks at any time by running:
+
+```bash
+make Rmd
+```
 
 ## Contributors ✨