diff --git a/Ch03-linreg-lab.Rmd b/Ch03-linreg-lab.Rmd index 36b3491..7759395 100644 --- a/Ch03-linreg-lab.Rmd +++ b/Ch03-linreg-lab.Rmd @@ -8,7 +8,7 @@ jupyter: extension: .Rmd format_name: rmarkdown format_version: '1.2' - jupytext_version: 1.16.7 + jupytext_version: 1.19.1 --- # Linear Regression @@ -326,8 +326,8 @@ Let’s use our new function to add this regression line to a plot of ```{python} ax = Boston.plot.scatter('lstat', 'medv') abline(ax, - results.params[0], - results.params[1], + results.params['intercept'], + results.params['lstat'], 'r--', linewidth=3) diff --git a/Ch04-classification-lab.Rmd b/Ch04-classification-lab.Rmd index f4b7329..0c9339f 100644 --- a/Ch04-classification-lab.Rmd +++ b/Ch04-classification-lab.Rmd @@ -8,7 +8,7 @@ jupyter: extension: .Rmd format_name: rmarkdown format_version: '1.2' - jupytext_version: 1.16.7 + jupytext_version: 1.19.1 --- # Logistic Regression, LDA, QDA, and KNN diff --git a/Ch05-resample-lab.Rmd b/Ch05-resample-lab.Rmd index 56f0ec2..24b089f 100644 --- a/Ch05-resample-lab.Rmd +++ b/Ch05-resample-lab.Rmd @@ -8,7 +8,7 @@ jupyter: extension: .Rmd format_name: rmarkdown format_version: '1.2' - jupytext_version: 1.16.7 + jupytext_version: 1.19.1 --- # Cross-Validation and the Bootstrap diff --git a/Ch06-varselect-lab.Rmd b/Ch06-varselect-lab.Rmd index cfbd674..f1cbda8 100644 --- a/Ch06-varselect-lab.Rmd +++ b/Ch06-varselect-lab.Rmd @@ -8,7 +8,7 @@ jupyter: extension: .Rmd format_name: rmarkdown format_version: '1.2' - jupytext_version: 1.16.7 + jupytext_version: 1.19.1 --- # Linear Models and Regularization Methods @@ -148,7 +148,6 @@ hitters_MSE = sklearn_selected(OLS, strategy) hitters_MSE.fit(Hitters, Y) hitters_MSE.selected_state_ - ``` Using `neg_Cp` results in a smaller model, as expected, with just 10 variables selected. 
diff --git a/Ch07-nonlin-lab.Rmd b/Ch07-nonlin-lab.Rmd index 6af825f..a4cc703 100644 --- a/Ch07-nonlin-lab.Rmd +++ b/Ch07-nonlin-lab.Rmd @@ -8,7 +8,7 @@ jupyter: extension: .Rmd format_name: rmarkdown format_version: '1.2' - jupytext_version: 1.16.7 + jupytext_version: 1.19.1 --- # Non-Linear Modeling diff --git a/Ch08-baggboost-lab.Rmd b/Ch08-baggboost-lab.Rmd index 6671a7e..d6f3ef7 100644 --- a/Ch08-baggboost-lab.Rmd +++ b/Ch08-baggboost-lab.Rmd @@ -8,7 +8,7 @@ jupyter: extension: .Rmd format_name: rmarkdown format_version: '1.2' - jupytext_version: 1.16.7 + jupytext_version: 1.19.1 --- # Tree-Based Methods diff --git a/Ch09-svm-lab.Rmd b/Ch09-svm-lab.Rmd index 998e71b..66f9bb6 100644 --- a/Ch09-svm-lab.Rmd +++ b/Ch09-svm-lab.Rmd @@ -8,7 +8,7 @@ jupyter: extension: .Rmd format_name: rmarkdown format_version: '1.2' - jupytext_version: 1.16.7 + jupytext_version: 1.19.1 --- # Support Vector Machines diff --git a/Ch10-deeplearning-lab.Rmd b/Ch10-deeplearning-lab.Rmd index 494c3f0..77385bb 100644 --- a/Ch10-deeplearning-lab.Rmd +++ b/Ch10-deeplearning-lab.Rmd @@ -8,7 +8,7 @@ jupyter: extension: .Rmd format_name: rmarkdown format_version: '1.2' - jupytext_version: 1.16.7 + jupytext_version: 1.19.1 --- # Deep Learning diff --git a/Ch11-surv-lab.Rmd b/Ch11-surv-lab.Rmd index aa5f85a..ce6c9cf 100644 --- a/Ch11-surv-lab.Rmd +++ b/Ch11-surv-lab.Rmd @@ -8,7 +8,7 @@ jupyter: extension: .Rmd format_name: rmarkdown format_version: '1.2' - jupytext_version: 1.16.7 + jupytext_version: 1.19.1 --- # Survival Analysis diff --git a/Ch12-unsup-lab.Rmd b/Ch12-unsup-lab.Rmd index 1e720ff..4fd16ac 100644 --- a/Ch12-unsup-lab.Rmd +++ b/Ch12-unsup-lab.Rmd @@ -8,7 +8,7 @@ jupyter: extension: .Rmd format_name: rmarkdown format_version: '1.2' - jupytext_version: 1.16.7 + jupytext_version: 1.19.1 --- # Unsupervised Learning diff --git a/Ch13-multiple-lab.Rmd b/Ch13-multiple-lab.Rmd index 31c215f..a250670 100644 --- a/Ch13-multiple-lab.Rmd +++ b/Ch13-multiple-lab.Rmd @@ -8,7 +8,7 @@ jupyter: 
extension: .Rmd format_name: rmarkdown format_version: '1.2' - jupytext_version: 1.16.7 + jupytext_version: 1.19.1 --- # Multiple Testing diff --git a/daniela.Rmd b/daniela.Rmd new file mode 100644 index 0000000..1c4caf0 --- /dev/null +++ b/daniela.Rmd @@ -0,0 +1,278 @@ +# current items on ISLP errata page + +## On page 44, “Out[22]:” should not be numbered. The authors. + +This is related to rendering the labs into PDF, not the labs themselves. + +## On page 49, the input block after “In[43]:” should be numbered (this will affect the numbering of downstream input blocks as well). The authors. + +This is related to rendering the labs into PDF, not the labs themselves. + +## On the bottom of page 50 of the Chapter 2 lab, the sentence “To fine-tune the output of the ax.contour() function, take a look at the help file by typing ?plt.contour” should instead say “To fine-tune the output of the ax.contour() function, take a look at the help file by typing ?ax.contour” Thanks to Hargen Zheng. + +This is fixed with this [commit](https://github.com/intro-stat-learning/ISLP_labs/commit/f132c18a1cf2bbcdd377a17118f32b3c527c9948) + +## On page 54, last line above the third code cell: "TRUE" should be "True". Thanks to Pedro Zühlke. + +This is fixed now. [commit](https://github.com/intro-stat-learning/ISLP_labs/commit/132bda168d16e9c2c7d772e996bd2333846cfede) + +## On page 59, in the last line before the second code cell, there is a repeated “of” in “attribute of of the dataframe”. Thanks to Pedro Zühlke. + +Fixed. [commit](https://github.com/intro-stat-learning/ISLP_labs/commit/132bda168d16e9c2c7d772e996bd2333846cfede) + +## On page 61, block 103, there should be a semi-colon in the last line to indicate that the output should be suppressed. Also, the semi-colon in the first line is superfluous, and should be removed. Thanks to Julien Gomes. 
+ +This was fixed here: [commit](https://github.com/intro-stat-learning/ISLP_labs/commit/065a1ae9932952358995a13f897e5b390fdb0ee7) + +## On page 65, “Its because they are different…” should say “It’s because they are different.” Thanks to Pedro Zühlke. + +Fixed. [commit](https://github.com/intro-stat-learning/ISLP_labs/commit/132bda168d16e9c2c7d772e996bd2333846cfede) + +## On page 66, there is an error in the code in Exercise 2(f): the line + +college['Elite'] = pd.cut(college['Top10perc'], [0,0.5,1], labels=['No', 'Yes']) + +should be replaced with + +college[“Elite”] = pd.cut(college[“Top10perc”]/100, [0, 0.5, 1], labels = [“No”, “Yes”]). + +Thanks to Dylan Owens. + +**This is in a pending PR in the non-public LaTeX source.** + +## On page 66, Exercise 8(f): the second argument of `pd.cut` should be `[0, 50, 100]`. Thanks to Pedro Zühlke. + +This is the same as the previous item. + +## In the footnote on the bottom of page 76, the sentence "Details of how to compute the 95% confidence interval precisely in R will be provided later in this chapter" should mention Python instead of R. Thanks to Rush Kirubi. + +Not related to code, Trevor should change in the source for the chapter. + +## On the bottom of page 81, the sentence “Any statistical software package can be used to compute these coefficient estimates, and later in this chapter we will show how this can be done in R.” should mention Python instead of R. Thanks to Jasmin Bogatinovski and Omar Mallick. + +Not related to code, Trevor should change in the source for the chapter. + +## On pages 87, 236, 601, “Mallow’s Cp” should be written as “Mallows’ Cp”. Thanks to James MacKinnon. + +Not related to code, Trevor should change in the source for the chapter. Wikipedia writes it Mallows's. This error also appears in ISLR on the website. 
+ +## On the top of page 94: The sentence “It is estimated that those in the South will have $18.69 less debt than those in the East, and that those in the West will have $12.50 less debt than those in the East” should instead say “It is estimated that those in the West will have $18.69 less debt than those in the East, and that those in the South will have $12.50 less debt than those in the East.” Thanks to Yongjun Zhu and Felipe Provezano Coutinho. + +Not related to code, Trevor should change in the source for the chapter. This error also appears in ISLR on the website. + +## On page 117, "python" should be "Python", and “rmvar” should be “rm”. Thanks to Pedro Zühlke. + +I don't see the lowercase "python" but yes `rmvar` should have the name changed. Fixed [commit](https://github.com/intro-stat-learning/ISLP_labs/commit/132bda168d16e9c2c7d772e996bd2333846cfede) + +## On page 120, “Prediction intervals are computing” should say “Prediction intervals are computed.” Thanks to Pedro Zühlke. + +Fixed [commit](https://github.com/intro-stat-learning/ISLP_labs/commit/132bda168d16e9c2c7d772e996bd2333846cfede) + +## On page 121, third line after the first code cell: "exisiting" should be "existing". Thanks to Pedro Zühlke. + +Fixed [commit](https://github.com/intro-stat-learning/ISLP_labs/commit/132bda168d16e9c2c7d772e996bd2333846cfede) + +## On page 121, 2nd line below 2nd code cell: `*kwargs` should be `**kwargs`. Thanks to Pedro Zühlke. + +Fixed here: [commit](https://github.com/intro-stat-learning/ISLP_labs/commit/f132c18a1cf2bbcdd377a17118f32b3c527c9948) + +## On page 126, penultimate line before the first code cell: "why their are" should be "why there are". Thanks to Pedro Zühlke and Guilherme Roma. 
+ +Fixed [commit](https://github.com/intro-stat-learning/ISLP_labs/commit/132bda168d16e9c2c7d772e996bd2333846cfede) + +## On page 131, exercise 11d: "Show algebraically, and confirm numerically in R" should read "Show algebraically, and confirm numerically in Python". Thanks to Julien Gomes. + +**This is in a pending PR in the non-public LaTeX source.** + +## On page 131, exercise 11f should mention Python, not R. Thanks to anonymous. + +**This is in a pending PR in the non-public LaTeX source.** + +## On page 141, second paragraph, 6th line: "using statistical software such as R” should say “using statistical software”. Thanks to Pedro Zühlke. + +Not related to code, Trevor should change in the source for the chapter. + +## On page 158, fourth paragraph, 2nd and 3rd lines: Double "instead" in "Instead of assuming..., we instead make ...". Thanks to Pedro Zühlke. + +Not related to code, Trevor should change in the source for the chapter. + +## On the bottom of page 184, the last sentence is missing two words. It should read: “In this case Purchase has only Yes and No values and the method returns how many values of each there are.” Thanks to Johannes Ruf. + +Fixed in this [commit:](https://github.com/intro-stat-learning/ISLP_labs/commit/dc38c6c5262306a418a724900328a5b6a8b5ccc1) + +## On page 187, the printed text under “In[60]:” should not be in green. The authors. + +This is just a rendering issue, not related to the source of the labs. + +## On page 188, there are a series of typos, all due to an error in code block 61. In code block 61, the line + +logit_labels = np.where(logit_pred[:,1] > 5, 'Yes', 'No') + +should instead say + +logit_labels = np.where(logit_pred[:,1] > 0.5, 'Yes', 'No') + +With this typo corrected, a correction is also needed before code block 62: the first column of the contingency table should contain “931, 2” instead of “933, 0”. 
+ +Finally, in the text that follows, the sentence “If we use 0.5 as the predicted probability cut-off for the classifier, then we have a problem: none of the test observations are predicted to purchase insurance.” should be corrected as follows: “If we use 0.5 as the predicted probability cut-off for the classifier, then we have a problem: only two of the test observations are predicted to purchase insurance.” + +Thanks to Lauren Chen. + +This was fixed in this [commit:](https://github.com/intro-stat-learning/ISLP_labs/commit/dc38c6c5262306a418a724900328a5b6a8b5ccc1) + +## On page 196, exercise 12d, the last two estimates should have the subscript “apple” instead of “orange”. Thanks to Sundong Kim. + +Seems to be something that came over from ISLR but is corrected in ISLR. In a pending PR for non-public source. + +## On page 212, line 7: “R” should be replaced with “Python”. Thanks to Salena Torres Ashton. + +Not related to code, Trevor should change in the source for the chapter. + +## On page 214, Figure 5.10: it would be better for the histogram axis to be labeled $\hat\alpha$ rather than $\alpha$. Thanks to Salena Torres Ashton. + +I disagree with this, actually: the $x$-axis would be an argument for the density, not necessarily +the random variable. Up to Trevor if he wants to change this. + +## On page 214, 7th line from the bottom: in the line “In particular the bootstrap estimate SE(\hat\alpha) from (5.8) is 0.087,” there should be a subscript “B” on “SE”. Thanks to Pedro Zühlke. + +Not related to code, Trevor should change in the source for the chapter. Also appears in ISLR. + +## On page 216, line preceding the last code cell: "training and test set" should be "training and test sets". Thanks to Pedro Zühlke. + +Fixed, but not exactly with this fix as it is also grammatically incorrect. 
Fixed [commit](https://github.com/intro-stat-learning/ISLP_labs/commit/132bda168d16e9c2c7d772e996bd2333846cfede) + +## On page 218, 4th line below the first output (Out[9]): for consistency with the remainder of the chapter, the 'K' in "K results in K-fold ..." should be in lowercase. A similar comment applies on page 219 for the three occurrences of K in the paragraph above the second cell, and the single occurrence in each of the two paragraphs below that same code cell; moreover, this last occurrence should be italicized. Thanks to Pedro Zühlke. + +Yes, for consistency's sake. Fixed [commit](https://github.com/intro-stat-learning/ISLP_labs/commit/132bda168d16e9c2c7d772e996bd2333846cfede) + +## On page 219, penultimate line above the last code cell: "funtion to implement" should be "function to implement". Thanks to Pedro Zühlke and Titus Teodorescu. + +Fixed [commit](https://github.com/intro-stat-learning/ISLP_labs/commit/132bda168d16e9c2c7d772e996bd2333846cfede) + +## On page 223, in the penultimate paragraph, the line “Now although the formula for the standard errors do not…” should say “Now although the formulas for the standard errors do not….” Thanks to Pedro Zühlke. + +Fixed [commit](https://github.com/intro-stat-learning/ISLP_labs/commit/132bda168d16e9c2c7d772e996bd2333846cfede) + +## On page 225, there’s an error in the code for performing the bootstrap. The line + +store[i] = np.sum(rng.choice(100, replace=True) == 4) > 0 + +should be replaced with + +store[i] = np.sum(rng.choice(100, size=100, replace=True) == 4) > 0 + +Thanks to Alistair Bertrand Sands Keiller. + +**In a pending PR for non-public source.** + +## On page 227, Exercise 8c): data.frame() should be replaced by pd.DataFrame(). Thanks to Adrian Hayler. + +**In a pending PR for non-public source.** + +## On page 231, Algorithm 6.1, Step 3: delete the extra word “using”. Thanks to Mario Pepe. + +Not related to code, Trevor should change in the source for the chapter. Also appears in ISLR. 
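A quick sanity check on the page 225 correction listed above (a sketch only, not the book's code; the seed and the names `single` and `prop` are arbitrary choices). Without `size`, `rng.choice(100, replace=True)` returns a single index, so each iteration tests one draw rather than a full bootstrap sample. With `size=100`, the estimated proportion should land near $1 - (1 - 1/100)^{100} \approx 0.634$.

```python
import numpy as np

rng = np.random.default_rng(10)  # arbitrary seed, chosen for this sketch

# Without `size`, choice() returns one scalar index, not a bootstrap sample.
single = rng.choice(100, replace=True)
print(np.ndim(single))  # number of dimensions of the result

# Corrected line: each iteration draws a full bootstrap sample of 100 indices
# and records whether index 4 appears at least once.
store = np.empty(10000)
for i in range(10000):
    store[i] = np.sum(rng.choice(100, size=100, replace=True) == 4) > 0

prop = np.mean(store)
print(prop)
```

The uncorrected line would instead estimate the chance that one single draw equals 4, which is about 0.01.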
+ +## On page 235, 2nd paragraph after Algorithm 6.3, 1st line: "requires that the number ... is larger" should be "requires that the number ... be larger". Thanks to Pedro Zühlke. + +Not related to code, Trevor should change in the source for the chapter. Also appears in ISLR. + +## On page 263, 5 lines from the bottom: “a simple least squares regression line” should say “a least squares regression”. Thanks to Pedro Zühlke. + +Not related to code, Trevor should change in the source for the chapter. Also appears in ISLR. + +## On page 274, middle of the page: “….corresponding to the value 0.114 for the….” should say “….corresponding to the value 0.0114 for the”. Thanks to Pedro Zühlke. + +Fixed [commit](https://github.com/intro-stat-learning/ISLP_labs/commit/132bda168d16e9c2c7d772e996bd2333846cfede) + +## On page 282, bottom of the page: “is little noticable difference” should say “is little noticeable difference”. Thanks to Pedro Zühlke. + +Fixed [commit](https://github.com/intro-stat-learning/ISLP_labs/commit/132bda168d16e9c2c7d772e996bd2333846cfede) + +## On page 316, the output of command "In[18]" should have "bs(age)" instead of "bs(age, knots)". Thanks to Marcin Łukasik. + +This must happen when Trevor rendered this notebook to LaTeX. The jupyter lab run is fine. + +## On page 334, line preceding (8.3): "minimize the equation" should be "minimize the expression". Thanks to Pedro Zühlke. + +Not sure of the "correct way to see this". Same phrase appears in ISLR. + +## Figure 8.3, bottom left: To be consistent with the text, the labels at the nodes should have the form "X < t" instead of "X <= t". Thanks to Pedro Zühlke. + +Agree a bit with this. This also appears in ISLR + +## On page 355, the output of cell [6] should be 0.79 instead of 0.7275. Thanks to Karlo Delic. + +The value of [6] in the labs is 0.79: [commit](https://github.com/intro-stat-learning/ISLP_labs/blob/main/Ch08-baggboost-lab.ipynb) + +## On page 358, there is an error in the confusion table. 
Instead of [108 61, 10 21] it should say [94 32, 24 50]. Thanks to Lauren Chen. + +The value of the confusion table here agrees with this [commit](https://github.com/intro-stat-learning/ISLP_labs/blob/main/Ch08-baggboost-lab.ipynb) + +## On page 358, “pruned true” should say “pruned tree”. Thanks to Pedro Zühlke. + +Fixed [commit](https://github.com/intro-stat-learning/ISLP_labs/commit/132bda168d16e9c2c7d772e996bd2333846cfede) + +## On page 362, middle of the page: “leads to a almost the same test MSE as when” should say “leads to almost the same test MSE as when”. Thanks to Pedro Zühlke. + +Fixed [commit](https://github.com/intro-stat-learning/ISLP_labs/commit/132bda168d16e9c2c7d772e996bd2333846cfede) + +## On page 363, Exercise 3 should mention Python, not R. Thanks to Marcin Łukasik. + +Pending PR on private repo for latex + +## On page 365, Exercise 9f and Excercise 9h are redundant. Thanks to Pedro Zühlke. + +I think I agree with this. Similar issue in ISLR + +## On page 387, in the first paragraph of Section 9.6.1, “When the cost argument is small” should say “When the C argument is small”. Thanks to Ameer Dharamshi. + +Fixed: [commit](https://github.com/intro-stat-learning/ISLP_labs/commit/132bda168d16e9c2c7d772e996bd2333846cfede) + +## On the bottom of page 414, the mention of glmnet should be replaced with a mention of sklearn. Thanks to Pedro Zühlke. + +Trevor should change in the source + +## On page 420, second paragraph: the word “accompanying” is misspelled. Thanks to Pedro Zühlke. + +Trevor should change in the source + +## On page 438, we define standard_lasso in cell [14] and never use it. We have changed the lab slightly, and now cell [15] and [16] are slightly modified. Pick up the modified lab from the GitHub site linked here. Thanks to Martin Storath. 
+ +Fixed here: [commit](https://github.com/intro-stat-learning/ISLP_labs/commit/e5bbb1a5bc264a7508e103e21649d6f589a2ed95) + +## On page 460, before code block 65: “convert it to our a more familiar” should say “convert it to a more familiar”. Thanks to Pedro Zühlke. + +Fixed [commit](https://github.com/intro-stat-learning/ISLP_labs/commit/132bda168d16e9c2c7d772e996bd2333846cfede) + +## On page 486, the x-axis of Figure 11.7 is missing a vertical line in the denominator (i.e. a single vertical line should be replaced with a double vertical line in the norm symbol). + +Trevor should change in source + +## On the bottom of page 511: “we can use (12.11) to see that the PVE defined in (12.10) equals . . . ” should be replaced with “we can use (12.11) to see that the PVE defined in (12.10), summed over the first $M$ principal components, equals . . .”. Thanks to Zhuyun Yin. + +Trevor should change in source. I might use the term "cumulative PVE" there as was used above rather than this suggested change. Appears also in ISLR + +## On page 508, 4th line after the table: the reference to Table 12.2.1 should mention Table 12.1. Thanks to Pedro Zühlke. + +Trevor should change in source. Must be a latex \ref issue + +## On page 512, Figure 12.3: it would be better for the x-axis to not display increments of 0.5, since the figure displays principal component indices, which are discrete. Thanks to Salena Torres Ashton. + +This seems like a good point. I don't know who controls the source for these plots. + +## On page 535, top of Section 12.5: “scanning the first few lines . . . tell us” should be “scanning the first few lines . . . tells us”. Thanks to Pedro Zühlke. + +Fixed [commit](https://github.com/intro-stat-learning/ISLP_labs/commit/132bda168d16e9c2c7d772e996bd2333846cfede) + +## On page 549, Figure 12.18 caption: “100,%” should be “100%”. Thanks to Pedro Zühlke. + +Trevor should check latex source. 
+ +## On page 551, 1st line after code cell [56]: extra word "line" in "draws a horizontal line line". Thanks to Pedro Zühlke. + +Fixed [commit](https://github.com/intro-stat-learning/ISLP_labs/commit/132bda168d16e9c2c7d772e996bd2333846cfede) + +## On page 561, the sentence “Typically, the R function that is used to compute a test statistic will make…” should mention Python, not R. Thanks to Yongjun Zhu. + +Trevor should change in source diff --git a/foo.Rmd b/foo.Rmd new file mode 100644 index 0000000..3c9d894 --- /dev/null +++ b/foo.Rmd @@ -0,0 +1,1395 @@ +# Introduction to Python + + +Open In Colab + + +[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/intro-stat-learning/ISLP_labs/v2.2?labpath=Ch02-statlearn-lab.ipynb) + + + + + + +## Getting Started + + +To run the labs in this book, you will need two things: + +* An installation of `Python3`, which is the specific version of `Python` used in the labs. +* Access to `Jupyter`, a very popular `Python` interface that runs code through a file called a *notebook*. + + +You can download and install `Python3` by following the instructions available at [anaconda.com](http://anaconda.com). + + + There are a number of ways to get access to `Jupyter`. Here are just a few: + + * Using Google's `Colaboratory` service: [colab.research.google.com/](https://colab.research.google.com/). + * Using `JupyterHub`, available at [jupyter.org/hub](https://jupyter.org/hub). + * Using your own `jupyter` installation. Installation instructions are available at [jupyter.org/install](https://jupyter.org/install). + +Please see the `Python` resources page on the book website [statlearning.com](https://www.statlearning.com) for up-to-date information about getting `Python` and `Jupyter` working on your computer. + +You will need to install the `ISLP` package, which provides access to the datasets and custom-built functions that we provide. 
+Inside a macOS or Linux terminal type `pip install ISLP`; this also installs most other packages needed in the labs. The `Python` resources page has a link to the `ISLP` documentation website. + +To run this lab, download the file `Ch2-statlearn-lab.ipynb` from the `Python` resources page. +Now run the following code at the command line: `jupyter lab Ch2-statlearn-lab.ipynb`. + +If you're using Windows, you can use the `start menu` to access `anaconda`, and follow the links. For example, to install `ISLP` and run this lab, you can run the same code above in an `anaconda` shell. + + + +## Basic Commands + + + +In this lab, we will introduce some simple `Python` commands. + For more resources about `Python` in general, readers may want to consult the tutorial at [docs.python.org/3/tutorial/](https://docs.python.org/3/tutorial/). + + + + + + +Like most programming languages, `Python` uses *functions* +to perform operations. To run a +function called `fun`, we type +`fun(input1,input2)`, where the inputs (or *arguments*) +`input1` and `input2` tell +`Python` how to run the function. A function can have any number of +inputs. For example, the +`print()` function outputs a text representation of all of its arguments to the console. + +```{python} +print('fit a model with', 11, 'variables') + +``` + + The following command will provide information about the `print()` function. + +```{python} +# print? + +``` + +Adding two integers in `Python` is pretty intuitive. + +```{python} +3 + 5 + +``` + +In `Python`, textual data is handled using +*strings*. For instance, `"hello"` and +`'hello'` +are strings. +We can concatenate them using the addition `+` symbol. + +```{python} +"hello" + " " + "world" + +``` + + A string is actually a type of *sequence*: this is a generic term for an ordered list. + The three most important types of sequences are lists, tuples, and strings. +We introduce lists now. 
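As a small aside on what these sequence types have in common, the sketch below is illustrative only and is not part of the original lab; the variable names are made up for this example. Strings, lists, and tuples all report a length through `len()` and allow their entries to be read by position.

```python
# Illustrative only: three sequence types that share the same basic operations.
a_string = "hello"
a_list = [3, 4, 5]
a_tuple = (3, 4, 5)

for seq in (a_string, a_list, a_tuple):
    # len() gives the number of entries; seq[0] reads the first entry
    print(type(seq).__name__, len(seq), seq[0])
```

The differences between the three types, such as whether their entries can be modified, matter more than these similarities once we start changing data.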
+ + +The following command instructs `Python` to join together +the numbers 3, 4, and 5, and to save them as a +*list* named `x`. When we +type `x`, it gives us back the list. + +```{python} +x = [3, 4, 5] +x + +``` + +Note that we used the brackets +`[]` to construct this list. + +We will often want to add two sets of numbers together. It is reasonable to try the following code, +though it will not produce the desired results. + +```{python} +y = [4, 9, 7] +x + y + +``` + +The result may appear slightly counterintuitive: why did `Python` not add the entries of the lists +element-by-element? + In `Python`, lists hold *arbitrary* objects, and are added using *concatenation*. + In fact, concatenation is the behavior that we saw earlier when we entered `"hello" + " " + "world"`. + + + +This example reflects the fact that + `Python` is a general-purpose programming language. Much of `Python`'s data-specific +functionality comes from other packages, notably `numpy` +and `pandas`. +In the next section, we will introduce the `numpy` package. +See [docs.scipy.org/doc/numpy/user/quickstart.html](https://docs.scipy.org/doc/numpy/user/quickstart.html) for more information about `numpy`. + + + +## Introduction to Numerical Python + +As mentioned earlier, this book makes use of functionality that is contained in the `numpy` + *library*, or *package*. A package is a collection of modules that are not necessarily included in + the base `Python` distribution. The name `numpy` is an abbreviation for *numerical Python*. + + + To access `numpy`, we must first `import` it. + +```{python} +import numpy as np +``` +In the previous line, we named the `numpy` *module* `np`, an abbreviation for easier referencing. + + +In `numpy`, an *array* is a generic term for a multidimensional +set of numbers. +We use the `np.array()` function to define `x` and `y`, which are one-dimensional arrays, i.e. vectors. 
+ +```{python} +x = np.array([3, 4, 5]) +y = np.array([4, 9, 7]) +``` +Note that if you forgot to run the `import numpy as np` command earlier, then +you will encounter an error in calling the `np.array()` function in the previous line. + The syntax `np.array()` indicates that the function being called +is part of the `numpy` package, which we have abbreviated as `np`. + + +Since `x` and `y` have been defined using `np.array()`, we get a sensible result when we add them together. Compare this to our results in the previous section, + when we tried to add two lists without using `numpy`. + +```{python} +x + y +``` + + + + + +In `numpy`, matrices are typically represented as two-dimensional arrays, and vectors as one-dimensional arrays. {While it is also possible to create matrices using `np.matrix()`, we will use `np.array()` throughout the labs in this book.} +We can create a two-dimensional array as follows. + +```{python} +x = np.array([[1, 2], [3, 4]]) +x +``` + + + + + +The object `x` has several +*attributes*, or associated objects. To access an attribute of `x`, we type `x.attribute`, where we replace `attribute` +with the name of the attribute. +For instance, we can access the `ndim` attribute of `x` as follows. + +```{python} +x.ndim +``` + +The output indicates that `x` is a two-dimensional array. +Similarly, `x.dtype` is the *data type* attribute of the object `x`. This indicates that `x` is +comprised of 64-bit integers: + +```{python} +x.dtype +``` +Why is `x` comprised of integers? This is because we created `x` by passing in exclusively integers to the `np.array()` function. + If +we had passed in any decimals, then we would have obtained an array of +*floating point numbers* (i.e. real-valued numbers). + +```{python} +np.array([[1, 2], [3.0, 4]]).dtype + +``` + + +Typing `fun?` will cause `Python` to display +documentation associated with the function `fun`, if it exists. +We can try this for `np.array()`. + +```{python} +# np.array? 
+ +``` +This documentation indicates that we could create a floating point array by passing a `dtype` argument into `np.array()`. + +```{python} +np.array([[1, 2], [3, 4]], float).dtype + +``` + + +The array `x` is two-dimensional. We can find out the number of rows and columns by looking +at its `shape` attribute. + +```{python} +x.shape + +``` + + +A *method* is a function that is associated with an +object. +For instance, given an array `x`, the expression +`x.sum()` sums all of its elements, using the `sum()` +method for arrays. +The call `x.sum()` automatically provides `x` as the +first argument to its `sum()` method. + +```{python} +x = np.array([1, 2, 3, 4]) +x.sum() +``` +We could also sum the elements of `x` by passing in `x` as an argument to the `np.sum()` function. + +```{python} +x = np.array([1, 2, 3, 4]) +np.sum(x) +``` + As another example, the +`reshape()` method returns a new array with the same elements as +`x`, but a different shape. + We do this by passing in a `tuple` in our call to + `reshape()`, in this case `(2, 3)`. This tuple specifies that we would like to create a two-dimensional array with +$2$ rows and $3$ columns. {Like lists, tuples represent a sequence of objects. Why do we need more than one way to create a sequence? There are a few differences between tuples and lists, but perhaps the most important is that elements of a tuple cannot be modified, whereas elements of a list can be.} + +In what follows, the +`\n` character creates a *new line*. + +```{python} +x = np.array([1, 2, 3, 4, 5, 6]) +print('beginning x:\n', x) +x_reshape = x.reshape((2, 3)) +print('reshaped x:\n', x_reshape) + +``` + +The previous output reveals that `numpy` arrays are specified as a sequence +of *rows*. This is called *row-major ordering*, as opposed to *column-major ordering*. + + +`Python` (and hence `numpy`) uses 0-based +indexing. This means that to access the top left element of `x_reshape`, +we type in `x_reshape[0,0]`. 
+ +```{python} +x_reshape[0, 0] +``` +Similarly, `x_reshape[1,2]` yields the element in the second row and the third column +of `x_reshape`. + +```{python} +x_reshape[1, 2] +``` +Similarly, `x[2]` yields the +third entry of `x`. + +Now, let's modify the top left element of `x_reshape`. To our surprise, we discover that the first element of `x` has been modified as well! + + + +```{python} +print('x before we modify x_reshape:\n', x) +print('x_reshape before we modify x_reshape:\n', x_reshape) +x_reshape[0, 0] = 5 +print('x_reshape after we modify its top left element:\n', x_reshape) +print('x after we modify top left element of x_reshape:\n', x) + +``` + +Modifying `x_reshape` also modified `x` because the two objects occupy the same space in memory. + + + + + +We just saw that we can modify an element of an array. Can we also modify a tuple? It turns out that we cannot --- and trying to do so introduces +an *exception*, or error. + +```{python} +my_tuple = (3, 4, 5) +my_tuple[0] = 2 + +``` + + +We now briefly mention some attributes of arrays that will come in handy. An array's `shape` attribute contains its dimension; this is always a tuple. +The `ndim` attribute yields the number of dimensions, and `T` provides its transpose. + +```{python} +x_reshape.shape, x_reshape.ndim, x_reshape.T + +``` + +Notice that the three individual outputs `(2,3)`, `2`, and `array([[5, 4],[2, 5], [3,6]])` are themselves output as a tuple. + +We will often want to apply functions to arrays. +For instance, we can compute the +square root of the entries using the `np.sqrt()` function: + +```{python} +np.sqrt(x) + +``` + +We can also square the elements: + +```{python} +x**2 + +``` + +We can compute the square roots using the same notation, raising to the power of $1/2$ instead of 2. + +```{python} +x**0.5 + +``` + + +Throughout this book, we will often want to generate random data. +The `np.random.normal()` function generates a vector of random +normal variables. 
We can learn more about this function by looking at the help page, via a call to `np.random.normal?`. +The first line of the help page reads `normal(loc=0.0, scale=1.0, size=None)`. + This *signature* line tells us that the function's arguments are `loc`, `scale`, and `size`. These are *keyword* arguments, which means that when they are passed into + the function, they can be referred to by name (in any order). {`Python` also uses *positional* arguments. Positional arguments do not need to use a keyword. To see an example, type in `np.sum?`. We see that `a` is a positional argument, i.e. this function assumes that the first unnamed argument that it receives is the array to be summed. By contrast, `axis` and `dtype` are keyword arguments: the position in which these arguments are entered into `np.sum()` does not matter.} + By default, this function will generate random normal variable(s) with mean (`loc`) $0$ and standard deviation (`scale`) $1$; furthermore, + a single random variable will be generated unless the argument to `size` is changed. + +We now generate 50 independent random variables from a $N(0,1)$ distribution. + +```{python} +x = np.random.normal(size=50) +x + +``` + +We create an array `y` by adding an independent $N(50,1)$ random variable to each element of `x`. + +```{python} +y = x + np.random.normal(loc=50, scale=1, size=50) +``` +The `np.corrcoef()` function computes the correlation matrix between `x` and `y`. The off-diagonal elements give the +correlation between `x` and `y`. + +```{python} +np.corrcoef(x, y) +``` + +If you're following along in your own `Jupyter` notebook, then you probably noticed that you got a different set of results when you ran the past few +commands. In particular, + each +time we call `np.random.normal()`, we will get a different answer, as shown in the following example. 
+ +```{python} +print(np.random.normal(scale=5, size=2)) +print(np.random.normal(scale=5, size=2)) + +``` + + + +In order to ensure that our code provides exactly the same results +each time it is run, we can set a *random seed* +using the +`np.random.default_rng()` function. +This function takes an arbitrary, user-specified integer argument. If we set a random seed before +generating random data, then re-running our code will yield the same results. The +object `rng` has essentially all the random number generating methods found in `np.random`. Hence, to +generate normal data we use `rng.normal()`. + +```{python} +rng = np.random.default_rng(1303) +print(rng.normal(scale=5, size=2)) +rng2 = np.random.default_rng(1303) +print(rng2.normal(scale=5, size=2)) +``` + +Throughout the labs in this book, we use `np.random.default_rng()` whenever we +perform calculations involving random quantities within `numpy`. In principle, this +should enable the reader to exactly reproduce the stated results. However, as new versions of `numpy` become available, it is possible +that some small discrepancies may occur between the output +in the labs and the output +from `numpy`. + +The `np.mean()`, `np.var()`, and `np.std()` functions can be used +to compute the mean, variance, and standard deviation of arrays. These functions are also +available as methods on the arrays. + +```{python} +rng = np.random.default_rng(3) +y = rng.standard_normal(10) +np.mean(y), y.mean() +``` + + + +```{python} +np.var(y), y.var(), np.mean((y - y.mean())**2) +``` + + +Notice that by default `np.var()` divides by the sample size $n$ rather +than $n-1$; see the `ddof` argument in `np.var?`. + + +```{python} +np.sqrt(np.var(y)), np.std(y) +``` + +The `np.mean()`, `np.var()`, and `np.std()` functions can also be applied to the rows and columns of a matrix. +To see this, we construct a $10 \times 3$ matrix of $N(0,1)$ random variables, and consider computing its row sums. 
+ +```{python} +X = rng.standard_normal((10, 3)) +X +``` + +Since arrays are row-major ordered, the first axis, i.e. `axis=0`, refers to its rows. We pass this argument into the `mean()` method for the object `X`. + +```{python} +X.mean(axis=0) +``` + +The following yields the same result. + +```{python} +X.mean(0) +``` + + + +## Graphics +In `Python`, common practice is to use the library +`matplotlib` for graphics. +However, since `Python` was not written with data analysis in mind, + the notion of plotting is not intrinsic to the language. +We will use the `subplots()` function +from `matplotlib.pyplot` to create a figure and the +axes onto which we plot our data. +For many more examples of how to make plots in `Python`, +readers are encouraged to visit [matplotlib.org/stable/gallery/](https://matplotlib.org/stable/gallery/index.html). + +In `matplotlib`, a plot consists of a *figure* and one or more *axes*. You can think of the figure as the blank canvas upon which +one or more plots will be displayed: it is the entire plotting window. +The *axes* contain important information about each plot, such as its $x$- and $y$-axis labels, +title, and more. (Note that in `matplotlib`, the word *axes* is not the plural of *axis*: a plot's *axes* contains much more information +than just the $x$-axis and the $y$-axis.) + +We begin by importing the `subplots()` function +from `matplotlib`. We use this function +throughout when creating figures. +The function returns a tuple of length two: a figure +object as well as the relevant axes object. We will typically +pass `figsize` as a keyword argument. +Having created our axes, we attempt our first plot using its `plot()` method. +To learn more about it, +type `ax.plot?`. 
+ +```{python} +from matplotlib.pyplot import subplots +fig, ax = subplots(figsize=(8, 8)) +x = rng.standard_normal(100) +y = rng.standard_normal(100) +ax.plot(x, y); + +``` + +We pause here to note that we have *unpacked* the tuple of length two returned by `subplots()` into the two distinct +variables `fig` and `ax`. Unpacking +is typically preferred to the following equivalent but slightly more verbose code: + +```{python} +output = subplots(figsize=(8, 8)) +fig = output[0] +ax = output[1] +``` + +We see that our earlier cell produced a line plot, which is the default. To create a scatterplot, we provide an additional argument to `ax.plot()`, indicating that circles should be displayed. + +```{python} +fig, ax = subplots(figsize=(8, 8)) +ax.plot(x, y, 'o'); +``` +Different values +of this additional argument can be used to produce different colored lines +as well as different linestyles. + + + +As an alternative, we could use the `ax.scatter()` function to create a scatterplot. + +```{python} +fig, ax = subplots(figsize=(8, 8)) +ax.scatter(x, y, marker='o'); +``` + +Notice that in the code blocks above, we have ended +the last line with a semicolon. This prevents `ax.plot(x, y)` from printing +text to the notebook. However, it does not prevent a plot from being produced. + If we omit the trailing semi-colon, then we obtain the following output: + +```{python} +fig, ax = subplots(figsize=(8, 8)) +ax.scatter(x, y, marker='o') + +``` +In what follows, we will use + trailing semicolons whenever the text that would be output is not +germane to the discussion at hand. + + + + + + +To label our plot, we make use of the `set_xlabel()`, `set_ylabel()`, and `set_title()` methods +of `ax`. 
+ + +```{python} +fig, ax = subplots(figsize=(8, 8)) +ax.scatter(x, y, marker='o') +ax.set_xlabel("this is the x-axis") +ax.set_ylabel("this is the y-axis") +ax.set_title("Plot of X vs Y"); +``` + + Having access to the figure object `fig` itself means that we can go in and change some aspects and then redisplay it. Here, we change + the size from `(8, 8)` to `(12, 3)`. + + +```{python} +fig.set_size_inches(12,3) +fig +``` + + + +Occasionally we will want to create several plots within a figure. This can be +achieved by passing additional arguments to `subplots()`. +Below, we create a $2 \times 3$ grid of plots +in a figure of size determined by the `figsize` argument. In such +situations, there is often a relationship between the axes in the plots. For example, +all plots may have a common $x$-axis. The `subplots()` function can automatically handle +this situation when passed the keyword argument `sharex=True`. +The `axes` object below is an array pointing to different plots in the figure. + +```{python} +fig, axes = subplots(nrows=2, + ncols=3, + figsize=(15, 5)) +``` +We now produce a scatter plot with `'o'` in the second column of the first row and +a scatter plot with `'+'` in the third column of the second row. + +```{python} +axes[0,1].plot(x, y, 'o') +axes[1,2].scatter(x, y, marker='+') +fig +``` +Type `subplots?` to learn more about +`subplots()`. + + + + + +To save the output of `fig`, we call its `savefig()` +method. The argument `dpi` is the dots per inch, used +to determine how large the figure will be in pixels. + +```{python} +fig.savefig("Figure.png", dpi=400) +fig.savefig("Figure.pdf", dpi=200); + +``` + + +We can continue to modify `fig` using step-by-step updates; for example, we can modify the range of the $x$-axis, re-save the figure, and even re-display it. + +```{python} +axes[0,1].set_xlim([-1,1]) +fig.savefig("Figure_updated.jpg") +fig +``` + +We now create some more sophisticated plots. 
The +`ax.contour()` method produces a *contour plot* +in order to represent three-dimensional data, similar to a +topographical map. It takes three arguments: + +* A vector of `x` values (the first dimension), +* A vector of `y` values (the second dimension), and +* A matrix whose elements correspond to the `z` value (the third +dimension) for each pair of `(x,y)` coordinates. + +To create `x` and `y`, we’ll use the command `np.linspace(a, b, n)`, +which returns a vector of `n` evenly spaced numbers starting at `a` and ending at `b`. + +```{python} +fig, ax = subplots(figsize=(8, 8)) +x = np.linspace(-np.pi, np.pi, 50) +y = x +f = np.multiply.outer(np.cos(y), 1 / (1 + x**2)) +ax.contour(x, y, f); + +``` +We can increase the resolution by adding more levels to the image. + +```{python} +fig, ax = subplots(figsize=(8, 8)) +ax.contour(x, y, f, levels=45); +``` +To fine-tune the output of the +`ax.contour()` method, take a +look at the help file by typing `ax.contour?`. + +The `ax.imshow()` method is similar to +`ax.contour()`, except that it produces a color-coded plot +whose colors depend on the `z` value. This is known as a +*heatmap*, and is sometimes used to plot temperature in +weather forecasts. + +```{python} +fig, ax = subplots(figsize=(8, 8)) +ax.imshow(f); + +``` + + +## Sequences and Slice Notation + + +As seen above, the +function `np.linspace()` can be used to create a sequence +of numbers. + +```{python} +seq1 = np.linspace(0, 10, 11) +seq1 + +``` + + +The function `np.arange()` + returns a sequence of numbers spaced out by `step`. If `step` is not specified, then a default value of $1$ is used. Let's create a sequence + that starts at $0$ and ends at $10$. + +```{python} +seq2 = np.arange(0, 10) +seq2 + +``` + +Why isn't $10$ output above? This has to do with *slice* notation in `Python`. +Slice notation +is used to index sequences such as lists, tuples and arrays. +Suppose we want to retrieve the fourth through sixth (inclusive) entries +of a string.
We obtain a slice of the string using the indexing notation `[3:6]`. + +```{python} +"hello world"[3:6] +``` +In the code block above, the notation `3:6` is shorthand for `slice(3,6)` when used inside +`[]`. + +```{python} +"hello world"[slice(3,6)] + +``` + +You might have expected `slice(3,6)` to output the fourth through seventh characters in the text string (recalling that `Python` begins its indexing at zero), but instead it output the fourth through sixth. + This also explains why the earlier `np.arange(0, 10)` command output only the integers from $0$ to $9$. +See the documentation `slice?` for useful options in creating slices. + + + + + + + + + + + + + + + + + + + + + + + +## Indexing Data +To begin, we create a two-dimensional `numpy` array. + +```{python} +A = np.array(np.arange(16)).reshape((4, 4)) +A + +``` + +Typing `A[1,2]` retrieves the element corresponding to the second row and third +column. (As usual, `Python` indexes from $0.$) + +```{python} +A[1,2] + +``` + +The first number after the open-bracket symbol `[` + refers to the row, and the second number refers to the column. + +### Indexing Rows, Columns, and Submatrices + To select multiple rows at a time, we can pass in a list + specifying our selection. For instance, `[1,3]` will retrieve the second and fourth rows: + +```{python} +A[[1,3]] + +``` + +To select the first and third columns, we pass in `[0,2]` as the second argument in the square brackets. +In this case we need to supply the first argument `:` +which selects all rows. + +```{python} +A[:,[0,2]] + +``` + +Now, suppose that we want to select the submatrix made up of the second and fourth +rows as well as the first and third columns. This is where +indexing gets slightly tricky. It is natural to try to use lists to retrieve the rows and columns: + +```{python} +A[[1,3],[0,2]] + +``` + + Oops --- what happened? 
We got a one-dimensional array of length two identical to + +```{python} +np.array([A[1,0],A[3,2]]) + +``` + + Similarly, the following code fails to extract the submatrix made up of the second and fourth rows and the first, third, and fourth columns: + +```{python} +A[[1,3],[0,2,3]] + +``` + +We can see what has gone wrong here. When supplied with two indexing lists, the `numpy` interpretation is that these provide pairs of $i,j$ indices for a series of entries. That is why the pair of lists must have the same length. However, that was not our intent, since we are looking for a submatrix. + +One easy way to do this is as follows. We first create a submatrix by subsetting the rows of `A`, and then on the fly we make a further submatrix by subsetting its columns. + + +```{python} +A[[1,3]][:,[0,2]] + +``` + + + +There are more efficient ways of achieving the same result. + +The *convenience function* `np.ix_()` allows us to extract a submatrix +using lists, by creating an intermediate *mesh* object. + +```{python} +idx = np.ix_([1,3],[0,2,3]) +A[idx] + +``` + + +Alternatively, we can subset matrices efficiently using slices. + +The slice +`1:4:2` captures the second and fourth items of a sequence, while the slice `0:3:2` captures +the first and third items (the third element in a slice sequence is the step size). + +```{python} +A[1:4:2,0:3:2] + +``` + + + +Why are we able to retrieve a submatrix directly using slices but not using lists? +It's because they are different `Python` types, and +are treated differently by `numpy`. +Slices can be used to extract objects from arbitrary sequences, such as strings, lists, and tuples, while the use of lists for indexing is more limited. + + + + + + + + + + + + + +### Boolean Indexing +In `numpy`, a *Boolean* is a type that equals either `True` or `False` (also represented as $1$ and $0$, respectively). +The next line creates a vector of $0$'s, represented as Booleans, of length equal to the first dimension of `A`.
+ +```{python} +keep_rows = np.zeros(A.shape[0], bool) +keep_rows +``` +We now set two of the elements to `True`. + +```{python} +keep_rows[[1,3]] = True +keep_rows + +``` + +Note that the elements of `keep_rows`, when viewed as integers, are the same as the +values of `np.array([0,1,0,1])`. Below, we use `==` to verify their equality. When +applied to two arrays, the `==` operation is applied elementwise. + +```{python} +np.all(keep_rows == np.array([0,1,0,1])) + +``` + +(Here, the function `np.all()` has checked whether +all entries of an array are `True`. A similar function, `np.any()`, can be used to check whether any entries of an array are `True`.) + + + However, even though `np.array([0,1,0,1])` and `keep_rows` are equal according to `==`, they index different sets of rows! +The former retrieves the first, second, first, and second rows of `A`. + +```{python} +A[np.array([0,1,0,1])] + +``` + + By contrast, `keep_rows` retrieves only the second and fourth rows of `A` --- i.e. the rows for which the Boolean equals `True`. + +```{python} +A[keep_rows] + +``` + +This example shows that Booleans and integers are treated differently by `numpy`. + + +We again make use of the `np.ix_()` function + to create a mesh containing the second and fourth rows, and the first, third, and fourth columns. This time, we apply the function to Booleans, + rather than lists. + +```{python} +keep_cols = np.zeros(A.shape[1], bool) +keep_cols[[0, 2, 3]] = True +idx_bool = np.ix_(keep_rows, keep_cols) +A[idx_bool] + +``` + +We can also mix a list with an array of Booleans in the arguments to `np.ix_()`: + +```{python} +idx_mixed = np.ix_([1,3], keep_cols) +A[idx_mixed] + +``` + + + +For more details on indexing in `numpy`, readers are referred +to the `numpy` tutorial mentioned earlier. + + + +## Loading Data + +Data sets often contain different types of data, and may have names associated with the rows or columns.
+For these reasons, they typically are best accommodated using a + *data frame*. + We can think of a data frame as a sequence +of arrays of identical length; these are the columns. Entries in the +different arrays can be combined to form a row. + The `pandas` +library can be used to create and work with data frame objects. + + +### Reading in a Data Set + +The first step of most analyses involves importing a data set into +`Python`. + Before attempting to load +a data set, we must make sure that `Python` knows where to find the file containing it. +If the +file is in the same location +as this notebook file, then we are all set. +Otherwise, +the command +`os.chdir()` can be used to *change directory*. (You will need to call `import os` before calling `os.chdir()`.) + + +We will begin by reading in `Auto.csv`, available on the book website. This is a comma-separated file, and can be read in using `pd.read_csv()`: + +```{python} +import pandas as pd +Auto = pd.read_csv('Auto.csv') +Auto + +``` + +The book website also has a whitespace-delimited version of this data, called `Auto.data`. This can be read in as follows: + +```{python} +Auto = pd.read_csv('Auto.data', delim_whitespace=True) + +``` + Both `Auto.csv` and `Auto.data` are simply text +files. Before loading data into `Python`, it is a good idea to view it using +a text editor or other software, such as Microsoft Excel. + + + + +We now take a look at the column of `Auto` corresponding to the variable `horsepower`: + +```{python} +Auto['horsepower'] + +``` +We see that the `dtype` of this column is `object`. +It turns out that all values of the `horsepower` column were interpreted as strings when reading +in the data. +We can find out why by looking at the unique values. + +```{python} +np.unique(Auto['horsepower']) + +``` +We see the culprit is the value `?`, which is being used to encode missing values. + + + + +To fix the problem, we must provide `pd.read_csv()` with an argument called `na_values`. 
+Now, each instance of `?` in the file is replaced with the +value `np.nan`, which means *not a number*: + +```{python} +Auto = pd.read_csv('Auto.data', + na_values=['?'], + delim_whitespace=True) +Auto['horsepower'].sum() + +``` + + +The `Auto.shape` attribute tells us that the data has 397 +observations, or rows, and nine variables, or columns. + +```{python} +Auto.shape + +``` + +There are +various ways to deal with missing data. +In this case, since only five of the rows contain missing +observations, we choose to use the `Auto.dropna()` method to simply remove these rows. + +```{python} +Auto_new = Auto.dropna() +Auto_new.shape + +``` + + +### Basics of Selecting Rows and Columns + +We can use `Auto.columns` to check the variable names. + +```{python} +Auto = Auto_new # overwrite the previous value +Auto.columns + +``` + + +Accessing the rows and columns of a data frame is similar, but not identical, to accessing the rows and columns of an array. +Recall that the first argument to the `[]` method +is always applied to the rows of the array. +Similarly, +passing in a slice to the `[]` method creates a data frame whose *rows* are determined by the slice: + +```{python} +Auto[:3] + +``` +Similarly, an array of Booleans can be used to subset the rows: + +```{python} +idx_80 = Auto['year'] > 80 +Auto[idx_80] + +``` +However, if we pass in a list of strings to the `[]` method, then we obtain a data frame containing the corresponding set of *columns*. + +```{python} +Auto[['mpg', 'horsepower']] + +``` +Since we did not specify an *index* column when we loaded our data frame, the rows are labeled using integers +0 to 396. + +```{python} +Auto.index + +``` +We can use the +`set_index()` method to re-name the rows using the contents of `Auto['name']`. + +```{python} +Auto_re = Auto.set_index('name') +Auto_re + +``` + +```{python} +Auto_re.columns + +``` +We see that the column `'name'` is no longer there. 
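
The effect of `set_index()` can be checked on a small example. The following is a minimal sketch that uses a hypothetical three-row data frame `cars` (not the `Auto` data): after the call, `name` labels the rows and no longer appears among the columns.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration (not the Auto data set).
cars = pd.DataFrame({'name': ['car a', 'car b', 'car c'],
                     'mpg': [18.0, 25.5, 31.0],
                     'year': [70, 76, 82]})

# set_index() returns a new data frame; `cars` itself is left unchanged.
cars_re = cars.set_index('name')

print(list(cars.columns))     # 'name' is still an ordinary column here
print(list(cars_re.columns))  # 'name' has moved out of the columns
print(cars_re.index.name)     # ... and now names the row index
```

Because `set_index()` returns a new object rather than modifying the original, we assign its result to a new variable.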
+ +Now that the index has been set to `name`, we can access rows of the data +frame by `name` using the `loc[]` method of +`Auto_re`: + +```{python} +rows = ['amc rebel sst', 'ford torino'] +Auto_re.loc[rows] + +``` +As an alternative to using the index name, we could retrieve the 4th and 5th rows of `Auto_re` using the `iloc[]` method: + +```{python} +Auto_re.iloc[[3,4]] + +``` +We can also use it to retrieve the 1st, 3rd and 4th columns of `Auto_re`: + +```{python} +Auto_re.iloc[:,[0,2,3]] + +``` +We can extract the 4th and 5th rows, as well as the 1st, 3rd and 4th columns, using +a single call to `iloc[]`: + +```{python} +Auto_re.iloc[[3,4],[0,2,3]] + +``` +Index entries need not be unique: there are several cars in the data frame named `ford galaxie 500`. + +```{python} +Auto_re.loc['ford galaxie 500', ['mpg', 'origin']] + +``` +### More on Selecting Rows and Columns +Suppose now that we want to create a data frame consisting of the `weight` and `origin` of the subset of cars with +`year` greater than 80 --- i.e. those built after 1980. +To do this, we first create a Boolean array that indexes the rows. +The `loc[]` method allows for Boolean entries as well as strings: + +```{python} +idx_80 = Auto_re['year'] > 80 +Auto_re.loc[idx_80, ['weight', 'origin']] + +``` + + +To do this more concisely, we can use an anonymous function called a `lambda`: + +```{python} +Auto_re.loc[lambda df: df['year'] > 80, ['weight', 'origin']] + +``` +The `lambda` call creates a function that takes a single +argument, here `df`, and returns `df['year']>80`. +Since it is created inside the `loc[]` method for the +data frame `Auto_re`, that data frame will be the argument supplied. +As another example of using a `lambda`, suppose that +we want all cars built after 1980 that achieve greater than 30 miles per gallon: + +```{python} +Auto_re.loc[lambda df: (df['year'] > 80) & (df['mpg'] > 30), + ['weight', 'origin'] + ] + +``` +The symbol `&` computes an element-wise *and* operation.
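
To see the element-wise behavior of `&` in isolation, here is a minimal sketch on a hypothetical three-row data frame (the values are invented for illustration). Each comparison yields a Boolean column, and `&` combines the two columns entry by entry.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration.
df = pd.DataFrame({'year': [79, 81, 82],
                   'mpg': [35.0, 28.0, 34.0]})

newer = df['year'] > 80       # Boolean column: False, True, True
efficient = df['mpg'] > 30    # Boolean column: True, False, True

# `&` is applied to corresponding entries of the two Boolean columns.
both = newer & efficient
print(both.tolist())
```

Note that `&` binds more tightly than comparison operators such as `>`, which is why each comparison is wrapped in parentheses when the whole condition is written on a single line.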
+As another example, suppose that we want to retrieve all `Ford` and `Datsun` +cars with `displacement` less than 300. We check whether each `name` entry contains either the string `ford` or `datsun` using the `str.contains()` method of the `index` attribute +of the data frame: + +```{python} +Auto_re.loc[lambda df: (df['displacement'] < 300) + & (df.index.str.contains('ford') + | df.index.str.contains('datsun')), + ['weight', 'origin'] + ] + +``` +Here, the symbol `|` computes an element-wise *or* operation. + +In summary, a powerful set of operations is available to index the rows and columns of data frames. For integer-based queries, use the `iloc[]` method. For string and Boolean +selections, use the `loc[]` method. For functional queries that filter rows, use the `loc[]` method +with a function (typically a `lambda`) in the rows argument. + +## For Loops +A `for` loop is a standard tool in many languages that +repeatedly evaluates some chunk of code while +varying different values inside the code. +For example, suppose we loop over elements of a list and compute their sum. + +```{python} +total = 0 +for value in [3,2,19]: +    total += value +print('Total is: {0}'.format(total)) + +``` +The indented code beneath the line with the `for` statement is run +for each value in the sequence +specified in the `for` statement. The loop ends either +when the cell ends or when code is indented at the same level +as the original `for` statement. +We see that the final line above, which prints the total, is executed +only once, after the `for` loop has terminated. Loops +can be nested by additional indentation. + +```{python} +total = 0 +for value in [2,3,19]: +    for weight in [3, 2, 1]: +        total += value * weight +print('Total is: {0}'.format(total)) +``` +Above, we summed over each combination of `value` and `weight`. +We also took advantage of the *increment* notation +in `Python`: the expression `a += b` is equivalent +to `a = a + b`.
Besides +being a convenient notation, this can save time in computationally +heavy tasks in which the intermediate value of `a+b` need not +be explicitly created. + +Perhaps a more +common task would be to sum over `(value, weight)` pairs. For instance, +to compute the average value of a random variable that takes on +possible values 2, 3 or 19 with probability 0.2, 0.3, 0.5, respectively, +we would compute the weighted sum. Tasks such as this +can often be accomplished using the `zip()` function, which pairs up +the elements of several sequences so that we can loop over the resulting tuples. + +```{python} +total = 0 +for value, weight in zip([2,3,19], + [0.2,0.3,0.5]): +    total += weight * value +print('Weighted average is: {0}'.format(total)) + +``` + +### String Formatting +In the code chunk above we also printed a string +displaying the total. However, the object `total` +is a number and not a string. +Inserting the value of something into +a string is a common task, made +simple using +some of the powerful string formatting +tools in `Python`. +Many data cleaning tasks involve +manipulating and programmatically +producing strings. + +For example, we may want to loop over the columns of a data frame and +print the percent missing in each column. +Let’s create a data frame `D` with columns in which about 20% of the entries are missing, i.e. set +to `np.nan`. We’ll create the +values in `D` from a normal distribution with mean 0 and variance 1 using `rng.standard_normal()` +and then overwrite some random entries using `rng.choice()`.
+ +```{python} +rng = np.random.default_rng(1) +A = rng.standard_normal((127, 5)) +M = rng.choice([0, np.nan], p=[0.8,0.2], size=A.shape) +A += M +D = pd.DataFrame(A, columns=['food', + 'bar', + 'pickle', + 'snack', + 'popcorn']) +D[:3] + +``` + + +```{python} +for col in D.columns: + template = 'Column "{0}" has {1:.2%} missing values' + print(template.format(col, + np.isnan(D[col]).mean())) + +``` +We see that the `template.format()` method expects two arguments `{0}` +and `{1:.2%}`, and the latter includes some formatting +information. In particular, it specifies that the second argument should be expressed as a percent with two decimal digits. + +The reference +[docs.python.org/3/library/string.html](https://docs.python.org/3/library/string.html) +includes many helpful and more complex examples. + + +## Additional Graphical and Numerical Summaries +We can use the `ax.plot()` or `ax.scatter()` functions to display the quantitative variables. However, simply typing the variable names will produce an error message, +because `Python` does not know to look in the `Auto` data set for those variables. + +```{python} +fig, ax = subplots(figsize=(8, 8)) +ax.plot(horsepower, mpg, 'o'); +``` +We can address this by accessing the columns directly: + +```{python} +fig, ax = subplots(figsize=(8, 8)) +ax.plot(Auto['horsepower'], Auto['mpg'], 'o'); + +``` +Alternatively, we can use the `plot()` method with the call `Auto.plot()`. +Using this method, +the variables can be accessed by name. +The plot methods of a data frame return a familiar object: +an axes. We can use it to update the plot as we did previously: + +```{python} +ax = Auto.plot.scatter('horsepower', 'mpg') +ax.set_title('Horsepower vs. 
MPG'); +``` +If we want to save +the figure that contains a given axes, we can find the relevant figure +by accessing the `figure` attribute: + +```{python} +fig = ax.figure +fig.savefig('horsepower_mpg.png'); +``` + +We can further instruct the data frame to plot to a particular axes object. In this +case the corresponding `plot()` method will return the +modified axes we passed in as an argument. Note that +when we request a one-dimensional grid of plots, the object `axes` is similarly +one-dimensional. We place our scatter plot in the middle plot of a row of three plots +within a figure. + +```{python} +fig, axes = subplots(ncols=3, figsize=(15, 5)) +Auto.plot.scatter('horsepower', 'mpg', ax=axes[1]); + +``` + +Note also that the columns of a data frame can be accessed as attributes: try typing in `Auto.horsepower`. + + +We now consider the `cylinders` variable. Typing in `Auto.cylinders.dtype` reveals that it is being treated as a quantitative variable. +However, since there is only a small number of possible values for this variable, we may wish to treat it as + qualitative. Below, we replace +the `cylinders` column with a categorical version of `Auto.cylinders`. The function `pd.Series()` owes its name to the fact that `pandas` is often used in time series applications. + +```{python} +Auto.cylinders = pd.Series(Auto.cylinders, dtype='category') +Auto.cylinders.dtype + +``` + Now that `cylinders` is qualitative, we can display it using + the `boxplot()` method. + +```{python} +fig, ax = subplots(figsize=(8, 8)) +Auto.boxplot('mpg', by='cylinders', ax=ax); + +``` + +The `hist()` method can be used to plot a *histogram*. + +```{python} +fig, ax = subplots(figsize=(8, 8)) +Auto.hist('mpg', ax=ax); + +``` +The color of the bars and the number of bins can be changed: + +```{python} +fig, ax = subplots(figsize=(8, 8)) +Auto.hist('mpg', color='red', bins=12, ax=ax); + +``` + See `Auto.hist?` for more plotting +options. 
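
To get a feel for what the `bins` argument controls, we can compute the counts behind a histogram directly with `np.histogram()`. The following is a minimal sketch on simulated values that serve as a hypothetical stand-in for a column such as `mpg`. It illustrates binning in general and is not a description of how `Auto.hist()` is implemented.

```python
import numpy as np

# Simulated values, used only for illustration (not the Auto data).
rng = np.random.default_rng(0)
values = rng.normal(loc=23, scale=8, size=392)

# With bins=12, the range of the data is split into 12 equal-width intervals.
counts, edges = np.histogram(values, bins=12)

print(counts)        # number of observations falling in each bin
print(len(edges))    # 12 bins are delimited by 13 edges
print(counts.sum())  # every observation lands in exactly one bin
```

Increasing `bins` gives narrower intervals and a more detailed, but noisier, picture of the distribution.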
+ +We can use the `pd.plotting.scatter_matrix()` function to create a *scatterplot matrix* to visualize all of the pairwise relationships between the columns in +a data frame. + +```{python} +pd.plotting.scatter_matrix(Auto); + +``` + We can also produce scatterplots +for a subset of the variables. + +```{python} +pd.plotting.scatter_matrix(Auto[['mpg', + 'displacement', + 'weight']]); + +``` +The `describe()` method produces a numerical summary of each column in a data frame. + +```{python} +Auto[['mpg', 'weight']].describe() + +``` +We can also produce a summary of just a single column. + +```{python} +Auto['cylinders'].describe() +Auto['mpg'].describe() + +``` +To exit `Jupyter`, select `File / Shut Down`. + + + +