pairing notebooks

This commit is contained in:
Jonathan Taylor
2023-08-20 19:41:01 -07:00
parent c82e9d5067
commit 058e89ef1c
22 changed files with 489 additions and 346 deletions


@@ -1,3 +1,16 @@
---
jupyter:
jupytext:
cell_metadata_filter: -all
formats: ipynb,Rmd
main_language: python
text_representation:
extension: .Rmd
format_name: rmarkdown
format_version: '1.2'
jupytext_version: 1.14.7
---
# Chapter 12
@@ -31,7 +44,7 @@ from scipy.cluster.hierarchy import \
from ISLP.cluster import compute_linkage
```
## Principal Components Analysis
In this lab, we perform PCA on `USArrests`, a data set in the
`R` computing environment.
@@ -45,22 +58,22 @@ USArrests = get_rdataset('USArrests').data
USArrests
```
The columns of the data set contain the four variables.
```{python}
USArrests.columns
```
We first briefly examine the data. We notice that the variables have vastly different means.
```{python}
USArrests.mean()
```
Dataframes have several useful methods for computing
column-wise summaries. We can also examine the
variance of the four variables using the `var()` method.
@@ -69,7 +82,7 @@ variance of the four variables using the `var()` method.
USArrests.var()
```
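Beyond `mean()` and `var()`, the `describe()` method collects several of these column-wise summaries at once. A minimal sketch on a small synthetic frame (the values are hypothetical stand-ins, since `USArrests` itself requires a download):

```{python}
import pandas as pd

# A tiny synthetic stand-in for USArrests (illustrative values only).
df = pd.DataFrame({'Murder': [13.2, 10.0, 8.1],
                   'Assault': [236, 263, 294],
                   'UrbanPop': [58, 48, 80],
                   'Rape': [21.2, 44.5, 31.0]})
df.mean()        # column-wise means
df.var()         # column-wise sample variances (ddof=1 by default)
df.describe()    # count, mean, std, and quartiles for every numeric column
```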
Not surprisingly, the variables also have vastly different variances.
The `UrbanPop` variable measures the percentage of the population
in each state living in an urban area, which is not a comparable
@@ -119,7 +132,7 @@ of the variables. In this case, since we centered and scaled the data with
pcaUS.mean_
```
The scores can be computed using the `transform()` method
of `pcaUS` after it has been fit.
@@ -137,7 +150,7 @@ principal component loading vector.
pcaUS.components_
```
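Under the hood, `transform()` simply centers the data and projects it onto the rows of `components_`. A numpy-only sketch on synthetic data (a stand-in for the scaled matrix; `sklearn` may additionally flip the signs of some components):

```{python}
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 4))   # synthetic stand-in for the scaled data
Xc = X - X.mean(0)                 # PCA centers each column internally
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
# The rows of Vt play the role of pcaUS.components_ (up to sign flips);
# projecting the centered data onto them reproduces transform().
scores = Xc @ Vt.T                 # equals U * s
```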
The `biplot` is a common visualization method used with
PCA. It is not built in as a standard
part of `sklearn`, though there are python
@@ -178,14 +191,14 @@ for k in range(pcaUS.components_.shape[1]):
USArrests.columns[k])
```
The standard deviations of the principal component scores are as follows:
```{python}
scores.std(0, ddof=1)
```
The variance of each score can be extracted directly from the `pcaUS` object via
the `explained_variance_` attribute.
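The relationship between the singular values, the score standard deviations, and `explained_variance_` can be checked directly. A sketch on synthetic data (stand-in for the scaled matrix):

```{python}
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((50, 4))   # synthetic stand-in data
Xc = X - X.mean(0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = U * s
n = X.shape[0]
# explained_variance_ is the per-component score variance s**2 / (n - 1);
# its square root matches scores.std(0, ddof=1).
explained_variance = s**2 / (n - 1)
pve = explained_variance / explained_variance.sum()  # explained_variance_ratio_
```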
@@ -207,7 +220,7 @@ We can plot the PVE explained by each component, as well as the cumulative PVE.
plot the proportion of variance explained.
```{python}
%%capture
# %%capture
fig, axes = plt.subplots(1, 2, figsize=(15, 6))
ticks = np.arange(pcaUS.n_components_)+1
ax = axes[0]
@@ -307,7 +320,7 @@ Xna = X.copy()
Xna[r_idx, c_idx] = np.nan
```
Here the array `r_idx`
contains 20 integers from 0 to 49, representing the states (rows of `X`) selected to contain missing values, and `c_idx` contains
20 integers from 0 to 3, giving the feature (column of `X`) that is missing for each selected state.
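One way to construct such an index pair is with `np.random.choice()`. A sketch on synthetic data (the seed and sizes are assumptions for illustration):

```{python}
import numpy as np

rng = np.random.default_rng(4)
X = rng.standard_normal((50, 4))            # stand-in for the scaled data matrix
# Pick 20 distinct states and, for each, one feature to blank out.
r_idx = rng.choice(50, 20, replace=False)
c_idx = rng.choice(4, 20, replace=True)
Xna = X.copy()
Xna[r_idx, c_idx] = np.nan
ismiss = np.isnan(Xna)                      # boolean mask of the missing entries
```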
@@ -335,7 +348,7 @@ Xbar = np.nanmean(Xhat, axis=0)
Xhat[r_idx, c_idx] = Xbar[c_idx]
```
Before we begin Step 2, we set ourselves up to measure the progress of our
iterations:
@@ -374,7 +387,7 @@ while rel_err > thresh:
.format(count, mss, rel_err))
```
We see that after eight iterations, the relative error has fallen below `thresh = 1e-7`, and so the algorithm terminates. When this happens, the mean squared error of the non-missing elements equals 0.381.
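The whole iteration can be sketched compactly: replace missing entries with column means, then repeatedly refill them from a rank-$M$ SVD approximation. This is a minimal sketch on synthetic data, with a hypothetical helper `low_rank()` and a fixed iteration count in place of the relative-error stopping rule:

```{python}
import numpy as np

def low_rank(X, M=1):
    # Rank-M reconstruction from the SVD (step 2(a) of the algorithm).
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return (U[:, :M] * s[:M]) @ Vt[:M]

rng = np.random.default_rng(5)
X = rng.standard_normal((50, 4))                 # synthetic stand-in data
Xna = X.copy()
Xna[rng.choice(50, 20, replace=False), rng.choice(4, 20)] = np.nan
ismiss = np.isnan(Xna)

Xhat = Xna.copy()
Xbar = np.nanmean(Xna, axis=0)
Xhat[ismiss] = Xbar[np.where(ismiss)[1]]         # Step 1: column-mean imputation
for _ in range(8):                               # Step 2: refill from a rank-1 fit
    Xhat[ismiss] = low_rank(Xhat, M=1)[ismiss]
```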
Finally, we compute the correlation between the 20 imputed values
@@ -384,8 +397,8 @@ and the actual values:
np.corrcoef(Xapp[ismiss], X[ismiss])[0,1]
```
In this lab, we implemented Algorithm 12.1 ourselves for didactic purposes. However, a reader who wishes to apply matrix completion to their data might look to more specialized `Python` implementations.
@@ -431,7 +444,7 @@ ax.scatter(X[:,0], X[:,1], c=kmeans.labels_)
ax.set_title("K-Means Clustering Results with K=2");
```
Here the observations can be easily plotted because they are
two-dimensional. If there were more than two variables then we could
instead perform PCA and plot the first two principal component score
@@ -506,7 +519,7 @@ hc_comp = HClust(distance_threshold=0,
hc_comp.fit(X)
```
This computes the entire dendrogram.
We could just as easily perform hierarchical clustering with average or single linkage instead:
@@ -521,7 +534,7 @@ hc_sing = HClust(distance_threshold=0,
hc_sing.fit(X);
```
To use a precomputed distance matrix, we provide an additional
argument `metric="precomputed"`. In the code below, the first four lines compute the $50\times 50$ pairwise-distance matrix.
@@ -537,7 +550,7 @@ hc_sing_pre = HClust(distance_threshold=0,
hc_sing_pre.fit(D)
```
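Such a distance matrix can also be built with `scipy`'s `pdist()` and `squareform()`. A sketch on synthetic data:

```{python}
import numpy as np
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(6)
X = rng.standard_normal((50, 2))   # synthetic stand-in data
# squareform() expands the condensed pdist() output into the full 50x50
# symmetric matrix with zeros on the diagonal.
D = squareform(pdist(X))           # Euclidean distances by default
```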
We use
`dendrogram()` from `scipy.cluster.hierarchy` to plot the dendrogram. However,
`dendrogram()` expects a so-called *linkage-matrix representation*
@@ -560,7 +573,7 @@ dendrogram(linkage_comp,
**cargs);
```
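A linkage matrix of the kind `dendrogram()` expects can be computed directly with `scipy.cluster.hierarchy.linkage()`. A sketch on synthetic data:

```{python}
import numpy as np
from scipy.cluster.hierarchy import linkage

rng = np.random.default_rng(7)
X = rng.standard_normal((50, 2))      # synthetic stand-in data
# Each of the n-1 rows records one merge: the ids of the two clusters
# joined, the height of the merge, and the size of the new cluster.
link = linkage(X, method='complete')  # Euclidean distances by default
```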
We may want to color branches of the tree above
and below a cut-threshold differently. This can be achieved
by changing the `color_threshold`. Let's cut the tree at a height of 4,
@@ -574,7 +587,7 @@ dendrogram(linkage_comp,
above_threshold_color='black');
```
To determine the cluster labels for each observation associated with a
given cut of the dendrogram, we can use the `cut_tree()`
function from `scipy.cluster.hierarchy`:
@@ -594,7 +607,7 @@ or `height` to `cut_tree()`.
cut_tree(linkage_comp, height=5)
```
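Both ways of calling `cut_tree()` can be sketched on synthetic data (the height of 5 is an arbitrary choice here):

```{python}
import numpy as np
from scipy.cluster.hierarchy import linkage, cut_tree

rng = np.random.default_rng(8)
X = rng.standard_normal((50, 2))      # synthetic stand-in data
link = linkage(X, method='complete')
# Ask for a fixed number of clusters...
labels4 = cut_tree(link, n_clusters=4).reshape(-1)
# ...or cut the dendrogram at a given height instead.
labels_h = cut_tree(link, height=5).reshape(-1)
```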
To scale the variables before performing hierarchical clustering of
the observations, we use `StandardScaler()` as in our PCA example:
@@ -638,7 +651,7 @@ dendrogram(linkage_cor, ax=ax, **cargs)
ax.set_title("Complete Linkage with Correlation-Based Dissimilarity");
```
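A correlation-based dissimilarity can be built as one minus the pairwise correlation between observations and then passed to `linkage()` in condensed form. A sketch on synthetic data (30 observations on 3 features, an assumption for illustration):

```{python}
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

rng = np.random.default_rng(9)
X = rng.standard_normal((30, 3))   # synthetic stand-in data
# np.corrcoef treats rows as variables, so this gives the correlation
# between observations; 1 minus it is a dissimilarity with zero diagonal.
D = 1 - np.corrcoef(X)
linkage_cor = linkage(squareform(D, checks=False), method='complete')
```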
## NCI60 Data Example
Unsupervised techniques are often used in the analysis of genomic
@@ -653,7 +666,7 @@ nci_labs = NCI60['labels']
nci_data = NCI60['data']
```
Each cell line is labeled with a cancer type. We do not make use of
the cancer types in performing PCA and clustering, as these are
unsupervised techniques. But after performing PCA and clustering, we
@@ -666,8 +679,8 @@ The data has 64 rows and 6830 columns.
nci_data.shape
```
We begin by examining the cancer types for the cell lines.
@@ -675,7 +688,7 @@ We begin by examining the cancer types for the cell lines.
nci_labs.value_counts()
```
### PCA on the NCI60 Data
@@ -690,7 +703,7 @@ nci_pca = PCA()
nci_scores = nci_pca.fit_transform(nci_scaled)
```
We now plot the first few principal component score vectors, in order
to visualize the data. The observations (cell lines) corresponding to
a given cancer type will be plotted in the same color, so that we can
@@ -726,7 +739,7 @@ to have pretty similar gene expression levels.
We can also plot the percent variance
explained by the principal components as well as the cumulative percent variance explained.
This is similar to the plots we made earlier for the `USArrests` data.
@@ -785,7 +798,7 @@ def plot_nci(linkage, ax, cut=-np.inf):
return hc
```
Let's plot our results.
```{python}
@@ -817,7 +830,7 @@ pd.crosstab(nci_labs['label'],
pd.Series(comp_cut.reshape(-1), name='Complete'))
```
There are some clear patterns. All the leukemia cell lines fall in
one cluster, while the breast cancer cell lines are spread out over
@@ -831,7 +844,7 @@ plot_nci('Complete', ax, cut=140)
ax.axhline(140, c='r', linewidth=4);
```
The `axhline()` function draws a horizontal line on top of any
existing set of axes. The argument `140` plots a horizontal
line at height 140 on the dendrogram; this is a height that
@@ -853,7 +866,7 @@ pd.crosstab(pd.Series(comp_cut, name='HClust'),
pd.Series(nci_kmeans.labels_, name='K-means'))
```
We see that the four clusters obtained using hierarchical clustering
and $K$-means clustering are somewhat different. First we note
that the labels in the two clusterings are arbitrary. That is, swapping