moved to parent directory

2023-08-06 18:31:39 -07:00
parent 677599530e
commit 2295708cb4
33 changed files with 2454 additions and 2446 deletions
--- a/Ch13-multiple-lab.Rmd
+++ b/Ch13-multiple-lab.Rmd
@@ -0,0 +1,579 @@
+---
+jupyter:
+  jupytext:
+    cell_metadata_filter: -all
+    formats: ipynb,Rmd
+    main_language: python
+    text_representation:
+      extension: .Rmd
+      format_name: rmarkdown
+      format_version: '1.2'
+      jupytext_version: 1.14.7
+---
+
+
+# Chapter 13
+
+# Lab: Multiple Testing
+ 
+
+
+We include our usual imports seen in earlier labs.
+
+```{python}
+import numpy as np
+import pandas as pd
+import matplotlib.pyplot as plt
+import statsmodels.api as sm
+from ISLP import load_data
+
+```
+
+We also collect the new imports
+needed for this lab.
+
+```{python}
+from scipy.stats import \
+    (ttest_1samp,
+     ttest_rel,
+     ttest_ind,
+     t as t_dbn)
+from statsmodels.stats.multicomp import \
+     pairwise_tukeyhsd
+from statsmodels.stats.multitest import \
+     multipletests as mult_test
+
+```
+
+
+## Review of Hypothesis Tests
+We begin by performing some one-sample $t$-tests.
+
+First we create 100 variables, each consisting of 10 observations. The
+first 50 variables have mean $0.5$ and variance $1$, while the others
+have mean $0$ and variance $1$.
+
+```{python}
+rng = np.random.default_rng(12)
+X = rng.standard_normal((10, 100))
+true_mean = np.array([0.5]*50 + [0]*50)
+X += true_mean[None,:]
+
+```
+
+To begin, we use `ttest_1samp()`  from the
+`scipy.stats` module to test $H_{0}: \mu_1=0$, the null
+hypothesis that the first variable has mean zero.
+
+```{python}
+result = ttest_1samp(X[:,0], 0)
+result.pvalue
+
+```
+
+The $p$-value comes out to 0.931, which is not low enough to
+reject the null hypothesis at level $\alpha=0.05$.  In this case,
+$\mu_1=0.5$, so the null hypothesis is false. Therefore, we have made
+a Type II error by failing to reject the null hypothesis when the null
+hypothesis is false. 
+
+We now test $H_{0,j}: \mu_j=0$ for $j=1,\ldots,100$. We compute the
+100 $p$-values, and then construct a vector recording whether the
+$j$th $p$-value is less than or equal to 0.05, in which case we reject
+$H_{0j}$, or greater than 0.05, in which case we do not reject
+$H_{0j}$, for $j=1,\ldots,100$.
+
+```{python}
+p_values = np.empty(100)
+for i in range(100):
+   p_values[i] = ttest_1samp(X[:,i], 0).pvalue
+decision = pd.cut(p_values,
+                  [0, 0.05, 1],
+                  labels=['Reject H0',
+                          'Do not reject H0'])
+truth = pd.Categorical(true_mean == 0,
+                       categories=[True, False],
+                       ordered=True)
+
+```
+Since this is a simulated data set, we can create a $2 \times 2$ table
+similar to  Table 13.2.
+
+```{python}
+pd.crosstab(decision,
+            truth,
+     rownames=['Decision'],
+     colnames=['H0'])
+
+```
+Therefore, at level $\alpha=0.05$, we reject 15 of the 50 false
+null hypotheses, and we incorrectly reject 5 of the true null
+hypotheses. Using the notation from Section 13.3, we have
+$V=5$, $S=15$, $U=45$ and $W=35$.
+We have set $\alpha=0.05$, which means that we expect to reject around
+5% of the true null hypotheses. This is in line with the $2 \times 2$
+table above, which indicates that we rejected $V=5$ of the $50$ true
+null hypotheses.
+
+In the simulation above, for the false null hypotheses, the ratio of
+the mean to the standard deviation was only $0.5/1 = 0.5$. This
+amounts to quite a weak signal, and it resulted in a high number of
+Type II errors.  Let’s instead simulate data with a stronger signal,
+so that the ratio of the mean to the standard deviation for the false
+null hypotheses equals $1$. We make only 10 Type II errors.
+
+
+```{python}
+true_mean = np.array([1]*50 + [0]*50)
+X = rng.standard_normal((10, 100))
+X += true_mean[None,:]
+for i in range(100):
+   p_values[i] = ttest_1samp(X[:,i], 0).pvalue
+decision = pd.cut(p_values,
+                  [0, 0.05, 1],
+                  labels=['Reject H0',
+                          'Do not reject H0'])
+truth = pd.Categorical(true_mean == 0,
+                       categories=[True, False],
+                       ordered=True)
+pd.crosstab(decision,
+            truth,
+            rownames=['Decision'],
+            colnames=['H0'])
+
+```
+    
+
+
+## Family-Wise Error Rate
+Recall from  (13.5)  that if the null hypothesis is true
+for each of $m$ independent hypothesis tests, then the FWER is equal
+to $1-(1-\alpha)^m$. We can use this expression to compute the FWER
+for $m=1,\ldots, 500$ and $\alpha=0.05$, $0.01$, and $0.001$.
+We plot the FWER for these values of $\alpha$ in order to
+reproduce  Figure 13.2.
+
+```{python}
+m = np.linspace(1, 501)
+fig, ax = plt.subplots()
+[ax.plot(m,
+         1 - (1 - alpha)**m,
+         label=r'$\alpha=%s$' % str(alpha))
+         for alpha in [0.05, 0.01, 0.001]]
+ax.set_xscale('log')
+ax.set_xlabel('Number of Hypotheses')
+ax.set_ylabel('Family-Wise Error Rate')
+ax.legend()
+ax.axhline(0.05, c='k', ls='--');
+
+```
+
+As discussed previously, even for moderate values of $m$ such as $50$,
+the FWER exceeds $0.05$ unless $\alpha$ is set to a very low value,
+such as $0.001$.  Of course, the problem with setting $\alpha$ to such
+a low value is that we are likely to make a number of Type II errors:
+in other words, our power is very low.
+
+We now conduct a one-sample $t$-test for each of the first five
+managers in the  
+`Fund`   dataset, in order to test the null
+hypothesis that the $j$th fund manager’s mean return equals zero,
+$H_{0,j}: \mu_j=0$.
+
+```{python}
+Fund = load_data('Fund')
+fund_mini = Fund.iloc[:,:5]
+fund_mini_pvals = np.empty(5)
+for i in range(5):
+    fund_mini_pvals[i] = ttest_1samp(fund_mini.iloc[:,i], 0).pvalue
+fund_mini_pvals
+
+```
+
+The $p$-values are low for Managers One and Three, and high for the
+other three managers.  However, we cannot simply reject $H_{0,1}$ and
+$H_{0,3}$, since this would fail to account for the multiple testing
+that we have performed. Instead, we will conduct Bonferroni’s method
+and Holm’s method to control the FWER.
+
+To do this, we use the `multipletests()`  function from the
+`statsmodels` module (abbreviated to `mult_test()`). Given the $p$-values,
+for methods like Holm and Bonferroni the function outputs
+adjusted $p$-values,  which
+can be thought of as a new set of $p$-values that have been corrected
+for multiple testing. If the adjusted $p$-value for a given hypothesis
+is less than or equal to $\alpha$, then that hypothesis can be
+rejected while maintaining a FWER of no more than $\alpha$. In other
+words, for such methods, the adjusted $p$-values resulting from the `multipletests()`
+function can simply be compared to the desired FWER in order to
+determine whether or not to reject each hypothesis. We will later
+see that we can use the same function to control FDR as well.
+
+
+The `mult_test()` function takes $p$-values and a `method` argument, as well as an optional
+`alpha` argument. It returns the  decisions (`reject` below)
+as well as the adjusted $p$-values (`bonf`).
+
+```{python}
+reject, bonf = mult_test(fund_mini_pvals, method = "bonferroni")[:2]
+reject
+
+```
+
+
+The $p$-values `bonf` are simply the `fund_mini_pvalues` multiplied by 5 and truncated to be less than
+or equal to 1.
+
+```{python}
+bonf, np.minimum(fund_mini_pvals * 5, 1)
+
+```
+
+Therefore, using Bonferroni’s method, we are able to reject the null hypothesis only for Manager
+One while controlling FWER at $0.05$.
+
+By contrast, using Holm’s method, the adjusted $p$-values indicate
+that we can  reject the null
+hypotheses for Managers One and Three at a FWER of $0.05$.
+
+```{python}
+mult_test(fund_mini_pvals, method = "holm", alpha=0.05)[:2]
+
+```
+
+
+As discussed previously, Manager One seems to perform particularly
+well, whereas Manager Two has poor performance.
+
+```{python}
+fund_mini.mean()
+
+```
+
+
+Is there evidence of a meaningful difference in performance between
+these two managers?  We can check this by performing a  paired $t$-test  using the `ttest_rel()` function
+from `scipy.stats`:
+
+```{python}
+ttest_rel(fund_mini['Manager1'],
+          fund_mini['Manager2']).pvalue
+
+```
+
+The test results in a $p$-value of 0.038,
+suggesting a statistically significant difference.
+
+However, we decided to perform this test only after examining the data
+and noting that Managers One and Two had the highest and lowest mean
+performances.  In a sense, this means that we have implicitly
+performed ${5 \choose 2} = 5(5-1)/2=10$ hypothesis tests, rather than
+just one, as discussed in  Section 13.3.2.  Hence, we use the
+`pairwise_tukeyhsd()`  function from
+`statsmodels.stats.multicomp` to apply Tukey’s method
+  in order to adjust for multiple testing.  This function takes
+as input a fitted *ANOVA*  regression model, which is
+essentially just a linear regression in which all of the predictors
+are qualitative.  In this case, the response consists of the monthly
+excess returns achieved by each manager, and the predictor indicates
+the manager to which each return corresponds.
+
+```{python}
+returns = np.hstack([fund_mini.iloc[:,i] for i in range(5)])
+managers = np.hstack([[i+1]*50 for i in range(5)])
+tukey = pairwise_tukeyhsd(returns, managers)
+print(tukey.summary())
+
+```
+
+
+The `pairwise_tukeyhsd()` function provides confidence intervals
+for the difference between each pair of managers (`lower` and
+`upper`), as well as a $p$-value. All of these quantities have
+been adjusted for multiple testing. Notice that the $p$-value for the
+difference between Managers One and Two has increased from $0.038$ to
+$0.186$, so there is no longer clear evidence of a difference between
+the managers’ performances.  We can plot the confidence intervals for
+the pairwise comparisons using the `plot_simultaneous()` method
+of `tukey`. Any pair of intervals that don’t overlap indicates a significant difference at the nominal level of 0.05. In this case,
+no differences are considered significant as reported in the table above.
+
+```{python}
+fig, ax = plt.subplots(figsize=(8,8))
+tukey.plot_simultaneous(ax=ax);
+
+```
+
+## False Discovery Rate
+Now we perform hypothesis tests for all 2,000 fund managers in the
+`Fund`  dataset. We perform a one-sample $t$-test
+of   $H_{0,j}: \mu_j=0$, which states that the
+$j$th fund manager’s mean return is zero.
+
+```{python}
+fund_pvalues = np.empty(2000)
+for i, manager in enumerate(Fund.columns):
+    fund_pvalues[i] = ttest_1samp(Fund[manager], 0).pvalue
+
+```
+
+There are far too many managers to consider trying to control the FWER.
+Instead, we focus on controlling the FDR: that is, the expected fraction of rejected null hypotheses that are actually false positives.
+The `multipletests()` function (abbreviated `mult_test()`) can be used to carry out the Benjamini--Hochberg procedure.
+
+```{python}
+fund_qvalues = mult_test(fund_pvalues, method = "fdr_bh")[1]
+fund_qvalues[:10]
+
+```
+
+The  *q-values* output by the
+Benjamini--Hochberg procedure can be interpreted as the smallest FDR
+threshold at which we would reject a particular null hypothesis. For
+instance, a $q$-value of $0.1$ indicates that we can reject the
+corresponding null hypothesis at an FDR of 10% or greater, but that
+we cannot reject the null hypothesis at an FDR below 10%.
+
+If we control the FDR at 10%, then for how many of the fund managers can we reject $H_{0,j}: \mu_j=0$?
+
+```{python}
+(fund_qvalues <= 0.1).sum()
+
+```
+We find that 146 of the 2,000 fund managers have a $q$-value below
+0.1; therefore, we are able to conclude that 146 of the fund managers
+beat the market at an FDR of 10%.  Only about 15 (10% of 146) of
+these fund managers are likely to be false discoveries.
+
+By contrast, if we had instead used Bonferroni’s method to control the
+FWER at level $\alpha=0.1$, then we would have failed to reject any
+null hypotheses!
+
+```{python}
+(fund_pvalues <= 0.1 / 2000).sum()
+
+```
+
+
+Figure 13.6 displays the ordered
+$p$-values, $p_{(1)} \leq p_{(2)} \leq \cdots \leq p_{(2000)}$, for
+the  `Fund`  dataset, as well as the threshold for rejection by the
+Benjamini--Hochberg procedure.  Recall that the Benjamini--Hochberg
+procedure identifies the largest $p$-value such that $p_{(j)}<qj/m$,
+and rejects all hypotheses for which the $p$-value is less than or
+equal to $p_{(j)}$. In the code below, we implement the
+Benjamini--Hochberg procedure ourselves, in order to illustrate how it
+works. We first order the $p$-values. We then identify all $p$-values
+that satisfy $p_{(j)}<qj/m$ (`sorted_set_`).  Finally, `selected_`
+is a boolean array indicating which $p$-values
+ are less than or equal to the largest
+$p$-value in `sorted_[sorted_set_]`. Therefore, `selected_` indexes the
+$p$-values rejected by the Benjamini--Hochberg procedure.
+
+```{python}
+sorted_ = np.sort(fund_pvalues)
+m = fund_pvalues.shape[0]
+q = 0.1
+sorted_set_ = np.where(sorted_ < q * np.linspace(1, m, m) / m)[0]
+if sorted_set_.shape[0] > 0:
+    selected_ = fund_pvalues < sorted_[sorted_set_].max()
+    sorted_set_ = np.arange(sorted_set_.max())
+else:
+    selected_ = []
+    sorted_set_ = []
+
+```
+
+We now reproduce  the middle panel of Figure 13.6.
+
+```{python}
+fig, ax = plt.subplots()
+ax.scatter(np.arange(0, sorted_.shape[0]) + 1,
+           sorted_, s=10)
+ax.set_yscale('log')
+ax.set_xscale('log')
+ax.set_ylabel('P-Value')
+ax.set_xlabel('Index')
+ax.scatter(sorted_set_+1, sorted_[sorted_set_], c='r', s=20)
+ax.axline((0, 0), (1,q/m), c='k', ls='--', linewidth=3);
+
+```
+
+
+## A Re-Sampling Approach
+Here, we implement the re-sampling approach to hypothesis testing
+using the  `Khan`  dataset, which we investigated in
+Section 13.5.  First, we merge the training and
+testing data, which results in observations on 83 patients for
+2,308 genes.
+
+```{python}
+Khan = load_data('Khan')      
+D = pd.concat([Khan['xtrain'], Khan['xtest']])
+D['Y'] = pd.concat([Khan['ytrain'], Khan['ytest']])
+D['Y'].value_counts()
+
+```
+
+
+There are four classes of cancer. For each gene, we compare the mean
+expression in the second class (rhabdomyosarcoma) to the mean
+expression in the fourth class (Burkitt’s lymphoma).  Performing a
+standard two-sample $t$-test  
+using `ttest_ind()`  from `scipy.stats` on the $11$th
+gene produces a test-statistic of -2.09 and an associated $p$-value
+of 0.0412, suggesting modest evidence of a difference in mean
+expression levels between the two cancer types.
+
+```{python}
+D2 = D[lambda df:df['Y'] == 2]
+D4 = D[lambda df:df['Y'] == 4]
+gene_11 = 'G0011'
+observedT, pvalue = ttest_ind(D2[gene_11],
+                              D4[gene_11],
+                              equal_var=True)
+observedT, pvalue
+
+```
+
+
+However, this $p$-value relies on the assumption that under the null
+hypothesis of no difference between the two groups, the test statistic
+follows a $t$-distribution with $29+25-2=52$ degrees of freedom.
+Instead of using this theoretical null distribution, we can randomly
+split the 54 patients into two groups of 29 and 25, and compute a new
+test statistic.  Under the null hypothesis of no difference between
+the groups, this new test statistic should have the same distribution
+as our original one.  Repeating this process 10,000 times allows us to
+approximate the null distribution of the test statistic.  We compute
+the fraction of the time that our observed test statistic exceeds the
+test statistics obtained via re-sampling.
+
+```{python}
+B = 10000
+Tnull = np.empty(B)
+D_ = np.hstack([D2[gene_11], D4[gene_11]])
+n_ = D2[gene_11].shape[0]
+D_null = D_.copy()
+for b in range(B):
+    rng.shuffle(D_null)
+    ttest_ = ttest_ind(D_null[:n_],
+                       D_null[n_:],
+                       equal_var=True)
+    Tnull[b] = ttest_.statistic
+(np.abs(Tnull) > np.abs(observedT)).mean()
+
+```
+
+
+This fraction, 0.0398,
+is our re-sampling-based $p$-value.
+It is almost identical to the $p$-value of  0.0412 obtained using the theoretical null distribution.
+We can plot  a histogram of the re-sampling-based test statistics in order to reproduce  Figure 13.7.
+
+```{python}
+fig, ax = plt.subplots(figsize=(8,8))
+ax.hist(Tnull,
+        bins=100,
+        density=True,
+        facecolor='y',
+        label='Null')
+xval = np.linspace(-4.2, 4.2, 1001)
+ax.plot(xval,
+        t_dbn.pdf(xval, D_.shape[0]-2),
+        c='r')
+ax.axvline(observedT,
+           c='b',
+           label='Observed')
+ax.legend()
+ax.set_xlabel("Null Distribution of Test Statistic");
+
+```
+The re-sampling-based null distribution is almost identical to the theoretical null distribution, which is displayed in red.
+
+Finally, we implement the plug-in re-sampling FDR approach outlined in
+Algorithm 13.4. Depending on the speed of your
+computer, calculating the FDR for all 2,308 genes in the `Khan`
+dataset may take a while.  Hence, we will illustrate the approach on a
+random subset of 100 genes.  For each gene, we first compute the
+observed test statistic, and then produce 10,000 re-sampled test
+statistics. This may take a few minutes to run.  If you are in a rush,
+then you could set `B` equal to a smaller value (e.g. `B=500`).
+
+```{python}
+m, B = 100, 10000
+idx = rng.choice(Khan['xtest'].columns, m, replace=False)
+T_vals = np.empty(m)
+Tnull_vals = np.empty((m, B))
+
+for j in range(m):
+    col = idx[j]
+    T_vals[j] = ttest_ind(D2[col],
+                          D4[col],
+                          equal_var=True).statistic
+    D_ = np.hstack([D2[col], D4[col]])
+    D_null = D_.copy()
+    for b in range(B):
+        rng.shuffle(D_null)
+        ttest_ = ttest_ind(D_null[:n_],
+                           D_null[n_:],
+                           equal_var=True)
+        Tnull_vals[j,b] = ttest_.statistic
+
+```
+
+Next, we compute the number of rejected null hypotheses $R$, the
+estimated number of false positives $\widehat{V}$, and the estimated
+FDR, for a range of threshold values $c$ in
+Algorithm 13.4. The threshold values are chosen
+using the absolute values of the test statistics from the 100 genes.
+
+```{python}
+cutoffs = np.sort(np.abs(T_vals))
+FDRs, Rs, Vs = np.empty((3, m))
+for j in range(m):
+   R = np.sum(np.abs(T_vals) >= cutoffs[j])
+   V = np.sum(np.abs(Tnull_vals) >= cutoffs[j]) / B
+   Rs[j] = R
+   Vs[j] = V
+   FDRs[j] = V / R
+
+```
+
+Now, for any given FDR, we can find the genes that will be
+rejected. For example, with FDR controlled at 0.1, we reject 15 of the
+100 null hypotheses. On average, we would expect about one or two of
+these genes (i.e. 10% of 15) to be false discoveries.  At an FDR of
+0.2, we can reject the null hypothesis for 28 genes, of which we
+expect around six to be false discoveries.
+
+The variable `idx` stores which
+genes were included in our 100 randomly-selected genes. Let’s look at
+the genes whose estimated FDR is less than 0.1.
+
+```{python}
+sorted(idx[np.abs(T_vals) >= cutoffs[FDRs < 0.1].min()])
+
+```
+
+At an FDR threshold of 0.2, more genes are selected, at the cost of having a higher expected
+proportion of false discoveries.
+
+```{python}
+sorted(idx[np.abs(T_vals) >= cutoffs[FDRs < 0.2].min()])
+
+```
+
+The next line  generates  Figure 13.11, which is similar
+to  Figure 13.9,
+except that it is based on only  a subset of the genes.
+
+```{python}
+fig, ax = plt.subplots()
+ax.plot(Rs, FDRs, 'b', linewidth=3)
+ax.set_xlabel("Number of Rejections")
+ax.set_ylabel("False Discovery Rate");
+
+```
+
+