pairing notebooks
---
jupyter:
  jupytext:
    cell_metadata_filter: -all
    formats: ipynb,Rmd
    main_language: python
    text_representation:
      extension: .Rmd
      format_name: rmarkdown
      format_version: '1.2'
      jupytext_version: 1.14.7
---

# Chapter 12

```{python}
from scipy.cluster.hierarchy import \
     (dendrogram, cut_tree)
from ISLP.cluster import compute_linkage
```
## Principal Components Analysis
In this lab, we perform PCA on `USArrests`, a data set in the
`R` computing environment.

```{python}
USArrests = get_rdataset('USArrests').data
USArrests
```

The columns of the data set contain the four variables.

```{python}
USArrests.columns
```

We first briefly examine the data. We notice that the variables have vastly different means.

```{python}
USArrests.mean()
```

Dataframes have several useful methods for computing
column-wise summaries. We can also examine the
variance of the four variables using the `var()` method.

```{python}
USArrests.var()
```

Not surprisingly, the variables also have vastly different variances.
The `UrbanPop` variable measures the percentage of the population
in each state living in an urban area, which is not a comparable

of the variables. In this case, since we centered and scaled the data with

```{python}
pcaUS.mean_
```

The scores can be computed using the `transform()` method
of `pcaUS` after it has been fit.

principal component loading vector.

```{python}
pcaUS.components_
```

The `biplot` is a common visualization method used with
PCA. It is not built in as a standard
part of `sklearn`, though there are python

```{python}
for k in range(pcaUS.components_.shape[1]):
    # ...
    USArrests.columns[k])
```

The standard deviations of the principal component scores are as follows:

```{python}
scores.std(0, ddof=1)
```

The variance of each score can be extracted directly from the `pcaUS` object via
the `explained_variance_` attribute.

We can plot the PVE explained by each component, as well as the cumulative PVE. Below, we plot the proportion of variance explained.
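As a side check of the relationship just described, the following sketch (on synthetic data, not `USArrests`) confirms that `explained_variance_` matches the sample variance of each score column, and that `explained_variance_ratio_` gives the PVE directly:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 4))

pca = PCA()
scores = pca.fit_transform(X)

# explained_variance_ equals the variance (ddof=1) of each score column
same_var = np.allclose(pca.explained_variance_, scores.var(0, ddof=1))

# explained_variance_ratio_ is the PVE; with all components kept it sums to 1
pve = pca.explained_variance_ratio_
```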

```{python}
# %%capture
fig, axes = plt.subplots(1, 2, figsize=(15, 6))
ticks = np.arange(pcaUS.n_components_)+1
ax = axes[0]
```

```{python}
Xna = X.copy()
Xna[r_idx, c_idx] = np.nan
```

Here the array `r_idx`
contains 20 integers from 0 to 49; this represents the states (rows of `X`) that are selected to contain missing values, and `c_idx` contains
20 integers from 0 to 3, representing the features (columns in `X`) that contain the missing values for each of the selected states.

```{python}
Xbar = np.nanmean(Xhat, axis=0)
Xhat[r_idx, c_idx] = Xbar[c_idx]
```

Before we begin Step 2, we set ourselves up to measure the progress of our
iterations:

```{python}
while rel_err > thresh:
    # ...
        .format(count, mss, rel_err))
```

We see that after eight iterations, the relative error has fallen below `thresh = 1e-7`, and so the algorithm terminates. When this happens, the mean squared error of the non-missing elements equals 0.381.

Finally, we compute the correlation between the 20 imputed values
and the actual values:

```{python}
np.corrcoef(Xapp[ismiss], X[ismiss])[0,1]
```

In this lab, we implemented Algorithm 12.1 ourselves for didactic purposes. However, a reader who wishes to apply matrix completion to their data might look to more specialized `Python` implementations.

```{python}
ax.scatter(X[:,0], X[:,1], c=kmeans.labels_)
ax.set_title("K-Means Clustering Results with K=2");
```

Here the observations can be easily plotted because they are
two-dimensional. If there were more than two variables then we could
instead perform PCA and plot the first two principal component score
vectors.

```{python}
hc_comp = HClust(distance_threshold=0,
                 n_clusters=None,        # assumed: compute the full tree
                 linkage='complete')     # assumed from the name hc_comp
hc_comp.fit(X)
```

This computes the entire dendrogram.
We could just as easily perform hierarchical clustering with average or single linkage instead:

```{python}
hc_sing = HClust(distance_threshold=0,
                 n_clusters=None,        # assumed: compute the full tree
                 linkage='single')       # assumed from the name hc_sing
hc_sing.fit(X);
```

To use a precomputed distance matrix, we provide an additional
argument `metric="precomputed"`. In the code below, the first four lines compute the $50\times 50$ pairwise-distance matrix.

```{python}
hc_sing_pre = HClust(distance_threshold=0,
                     n_clusters=None,    # assumed: compute the full tree
                     metric='precomputed',
                     linkage='single')
hc_sing_pre.fit(D)
```

We use
`dendrogram()` from `scipy.cluster.hierarchy` to plot the dendrogram. However,
`dendrogram()` expects a so-called *linkage-matrix representation*

```{python}
dendrogram(linkage_comp,
           # ...
           **cargs);
```

We may want to color branches of the tree above
and below a cut-threshold differently. This can be achieved
by changing the `color_threshold`. Let’s cut the tree at a height of 4,

```{python}
dendrogram(linkage_comp,
           color_threshold=4,    # assumed from the cut height of 4 named above
           above_threshold_color='black');
```

To determine the cluster labels for each observation associated with a
given cut of the dendrogram, we can use the `cut_tree()`
function from `scipy.cluster.hierarchy`:

or `height` to `cut_tree()`.

```{python}
cut_tree(linkage_comp, height=5)
```

To scale the variables before performing hierarchical clustering of
the observations, we use `StandardScaler()` as in our PCA example:

```{python}
dendrogram(linkage_cor, ax=ax, **cargs)
ax.set_title("Complete Linkage with Correlation-Based Dissimilarity");
```
## NCI60 Data Example
Unsupervised techniques are often used in the analysis of genomic
data.

```{python}
nci_labs = NCI60['labels']
nci_data = NCI60['data']
```

Each cell line is labeled with a cancer type. We do not make use of
the cancer types in performing PCA and clustering, as these are
unsupervised techniques. But after performing PCA and clustering, we

The data has 64 rows and 6830 columns.

```{python}
nci_data.shape
```

We begin by examining the cancer types for the cell lines.

```{python}
nci_labs.value_counts()
```
### PCA on the NCI60 Data

```{python}
nci_pca = PCA()
nci_scores = nci_pca.fit_transform(nci_scaled)
```

We now plot the first few principal component score vectors, in order
to visualize the data. The observations (cell lines) corresponding to
a given cancer type will be plotted in the same color, so that we can

to have pretty similar gene expression levels.

We can also plot the percent variance
explained by the principal components as well as the cumulative percent variance explained.
This is similar to the plots we made earlier for the `USArrests` data.

```{python}
def plot_nci(linkage, ax, cut=-np.inf):
    # ...
    return hc
```

Let’s plot our results.

```{python}
pd.crosstab(nci_labs['label'],
            pd.Series(comp_cut.reshape(-1), name='Complete'))
```

There are some clear patterns. All the leukemia cell lines fall in
one cluster, while the breast cancer cell lines are spread out over

```{python}
plot_nci('Complete', ax, cut=140)
ax.axhline(140, c='r', linewidth=4);
```

The `axhline()` function draws a horizontal line on top of any
existing set of axes. The argument `140` plots a horizontal
line at height 140 on the dendrogram; this is a height that

```{python}
pd.crosstab(pd.Series(comp_cut, name='HClust'),
            pd.Series(nci_kmeans.labels_, name='K-means'))
```

We see that the four clusters obtained using hierarchical clustering
and $K$-means clustering are somewhat different. First we note
that the labels in the two clusterings are arbitrary. That is, swapping
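The point that cluster labels are arbitrary can be sketched concretely with a tiny made-up example: two labelings that name the clusters differently but induce the same partition produce a crosstab with exactly one nonzero cell per row:

```python
import pandas as pd

# hypothetical labelings of six observations: the same partition,
# but with cluster names 0 and 1 swapped between the two methods
hclust = pd.Series([0, 0, 1, 1, 2, 2], name='HClust')
kmeans = pd.Series([1, 1, 0, 0, 2, 2], name='K-means')

tab = pd.crosstab(hclust, kmeans)
# one nonzero entry per row: the partitions agree despite the renaming
```

Only the pattern of the crosstab matters when comparing clusterings, not which numeric label each cluster happens to receive.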