# Julia for Data Analysis

## Bogumił Kamiński, Daniel Kaszyński

# Chapter 13

# Problems

### Exercise 1

Download https://archive.ics.uci.edu/ml/machine-learning-databases/00615/MushroomDataset.zip archive and extract primary_data.csv and secondary_data.csv files from it. Save the files to disk.

Solution

```
using Downloads
import ZipFile
Downloads.download("https://archive.ics.uci.edu/ml/machine-learning-databases/00615/MushroomDataset.zip", "MushroomDataset.zip")
archive = ZipFile.Reader("MushroomDataset.zip")
idx = only(findall(x -> contains(x.name, "primary_data.csv"), archive.files))
open("primary_data.csv", "w") do io
    write(io, read(archive.files[idx]))
end
idx = only(findall(x -> contains(x.name, "secondary_data.csv"), archive.files))
open("secondary_data.csv", "w") do io
    write(io, read(archive.files[idx]))
end
close(archive)
```

### Exercise 2

Load primary_data.csv into the `primary` data frame. Load secondary_data.csv into the `secondary` data frame. Describe the contents of both data frames.

Solution

```
using CSV
using DataFrames
primary = CSV.read("primary_data.csv", DataFrame; delim=';')
secondary = CSV.read("secondary_data.csv", DataFrame; delim=';')
describe(primary)
describe(secondary)
```

### Exercise 3

Start with the `primary` data. Note that columns starting from column 4 have their data encoded using vector notation, but they have been read in as strings. Convert these columns to hold proper vectors. Note that some columns have `missing` values. Most of the columns hold nominal data, but three columns, namely `cap-diameter`, `stem-height`, and `stem-width`, have numeric data. These should be parsed as vectors storing numeric values. After parsing, put these three columns just after the `class` column in the `parsed_primary` data frame. Check the `renamecols` keyword argument of `select` to avoid renaming the produced columns.

Solution

```
parse_nominal(s::AbstractString) = split(strip(s, ['[', ']']), ", ")
parse_nominal(::Missing) = missing
parse_numeric(s::AbstractString) = parse.(Float64, split(strip(s, ['[', ']']), ", "))
parse_numeric(::Missing) = missing
idcols = ["family", "name", "class"]
numericcols = ["cap-diameter", "stem-height", "stem-width"]
parsed_primary = select(primary,
                        idcols,
                        numericcols .=> ByRow(parse_numeric),
                        Not([idcols; numericcols]) .=> ByRow(parse_nominal);
                        renamecols=false)
```

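
To see what the two parsers produce, the sketch below applies them to a few hand-written strings. The input strings are made up for illustration and are not taken from the data set; the helper definitions are repeated only so that the snippet runs on its own.

```julia
# Same parsing helpers as in the solution, repeated to keep the snippet self-contained.
parse_nominal(s::AbstractString) = split(strip(s, ['[', ']']), ", ")
parse_nominal(::Missing) = missing
parse_numeric(s::AbstractString) = parse.(Float64, split(strip(s, ['[', ']']), ", "))
parse_numeric(::Missing) = missing

# Made-up inputs written in the bracketed vector notation.
nominal_example = parse_nominal("[x, f]")    # vector of two substrings
numeric_example = parse_numeric("[10, 20]")  # vector of two Float64 values
single_example = parse_numeric("[5]")        # vector holding one value
missing_example = parse_nominal(missing)     # missing is passed through

println(nominal_example)
println(numeric_example)
```

Note that a single-element numeric vector such as `single_example` is kept as is at this stage; how to interpret it is decided later, when rows are compared.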
### Exercise 4

In the `parsed_primary` data frame find all pairs of mushrooms (rows) that might be confused and have a different class, using the information about them we have (so all information except their family, name, and class).

Use the following rules:
* if for some pair of mushrooms the data in some column for either of them is `missing`, then skip matching on this column; for numeric columns, if there is only one value in a vector, then treat it as `missing`;
* otherwise:
  - for numeric columns check if there is an overlap in the intervals specified by the min and max values of the ranges;
  - for nominal columns check if the intersection of nominal values is nonempty.

For each found pair print to the screen the row number, family, name, and class.

Solution

```
function overlap_numeric(v1, v2)
    # there are no missing values in numeric columns
    if length(v1) == 1 || length(v2) == 1
        return true
    else
        return max(v1[1], v2[1]) <= min(v1[2], v2[2])
    end
end

function overlap_nominal(v1, v2)
    if ismissing(v1) || ismissing(v2)
        return true
    else
        return !isempty(intersect(v1, v2))
    end
end

function rows_overlap(row1, row2)
    # note that in parsed_primary numeric columns have indices 4 to 6
    # and nominal columns have indices 7 to 23
    return all(i -> overlap_numeric(row1[i], row2[i]), 4:6) &&
           all(i -> overlap_nominal(row1[i], row2[i]), 7:23)
end

for i in 1:nrow(parsed_primary), j in i+1:nrow(parsed_primary)
    row1 = parsed_primary[i, :]
    row2 = parsed_primary[j, :]
    if rows_overlap(row1, row2) && row1.class != row2.class
        println((i, Tuple(row1[1:3]), j, Tuple(row2[1:3])))
    end
end
```

Note that in this exercise working with `DataFrameRow` objects (obtained here with `parsed_primary[i, :]`, and which `eachrow` would also produce) is not a problem (although it is not type stable) because the data is small.

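
The two matching rules can be checked on small hand-made inputs. The sketch below uses a hypothetical one-line helper, `intervals_overlap`, that mirrors the comparison in `overlap_numeric`; the numbers and labels are invented for illustration.

```julia
# Hypothetical helper mirroring the comparison used in overlap_numeric:
# two closed intervals [lo, hi] overlap when the larger lower bound
# does not exceed the smaller upper bound.
intervals_overlap(a, b) = max(a[1], b[1]) <= min(a[2], b[2])

overlapping = intervals_overlap([10.0, 20.0], [15.0, 30.0])  # 15 <= 20 holds
touching = intervals_overlap([10.0, 20.0], [20.0, 30.0])     # 20 <= 20 holds
disjoint = intervals_overlap([10.0, 20.0], [25.0, 30.0])     # 25 <= 20 fails

# Nominal columns: one shared value is enough for a potential confusion.
shared = !isempty(intersect(["x", "f"], ["f", "s"]))
no_shared = !isempty(intersect(["x"], ["s"]))

println((overlapping, touching, disjoint, shared, no_shared))
```

Because the comparison uses `<=`, intervals that only touch at an endpoint count as overlapping, which is the more cautious choice when looking for mushrooms that could be confused.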
### Exercise 5

Still using `parsed_primary`, find the average probability of the class being `p` by `family`. Additionally, add the number of observations in each group. Sort these results by the probability. Try using DataFramesMeta.jl to do this exercise (this requirement is optional). Store the result in the `agg_primary` data frame.

Solution

```
using Statistics
using DataFramesMeta
agg_primary = @chain parsed_primary begin
    groupby(:family)
    @combine(:pr_p = mean(:class .== "p"), $nrow)
    sort(:pr_p)
end
```

### Exercise 6

Now, using the `agg_primary` data frame, collapse it so that for each unique `pr_p` it gives the total number of rows that had this probability and a tuple of mushroom family names.

Optionally: try to display the produced table so that the tuple containing the list of families for each group is not cropped (this will require a large terminal).

Solution

```
show(combine(groupby(agg_primary, :pr_p),
             :nrow => sum => :nrow,
             :family => Tuple => :families);
     truncate=140)
```

### Exercise 7

From our preliminary analysis of the `primary` data we see that a `missing` value in the primary data is non-informative, so in the `secondary` data we should be cautious when building a model if we allowed for missing data (in practice, if we were investigating some real mushroom, we most likely would know its characteristics).

Therefore, as a first step, drop in place all columns in the `secondary` data frame that have missing values.

Solution

```
select!(secondary, [!any(ismissing, col) for col in eachcol(secondary)])
```

Note that we select based on actual contents of the columns and not by their element type (column could allow for missing values but not have them).

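
The difference between the two kinds of check can be seen on plain vectors. This is a minimal sketch with made-up values: the first vector has an element type that allows `missing` but holds no missing entries, so a type-based test and a content-based test disagree.

```julia
# Element type allows missing, but no missing value is actually stored.
col_allows_missing = Union{Missing, Int}[1, 2, 3]
# A vector that really contains a missing value.
col_has_missing = [1, missing, 3]

type_allows = Missing <: eltype(col_allows_missing)  # type-based check
has_missing = any(ismissing, col_allows_missing)     # content-based check
really_missing = any(ismissing, col_has_missing)

println((type_allows, has_missing, really_missing))
```

A selection based on element type would drop the first column even though it is complete, while the content-based test used in the solution keeps it.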
### Exercise 8

Create a logistic regression predicting `class` based on all remaining features in the data frame. You might need to check the `Term` usage in the StatsModels.jl documentation.

You will notice that for the `stem-color` and `habitat` columns you get strange estimation results (large absolute values of estimated parameters and even larger standard errors). Explain why this happens by analyzing frequency tables of these variables against the `class` column.

Solution

```
using GLM
using FreqTables
secondary.class = secondary.class .== "p"
model = glm(Term(:class) ~ sum(Term.(Symbol.(names(secondary, Not(:class))))),
            secondary, Binomial(), LogitLink())
freqtable(secondary, "stem-color", "class")
freqtable(secondary, "habitat", "class")
```

We can see that for certain levels of `stem-color` and `habitat` variables there is a perfect separation of classes.

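
Perfect separation means that some level of a feature occurs with only one class, so one cell of the frequency table is zero. In that situation the likelihood keeps improving as the corresponding coefficient grows in absolute value, which is why very large estimates and standard errors are reported. The sketch below illustrates the zero-cell check on made-up toy data using only Base Julia; the variable names are hypothetical and the counting is done by hand rather than with `freqtable`.

```julia
# Made-up toy data: level "c" occurs only together with class == true.
level = ["a", "a", "b", "b", "c", "c", "c"]
class = [true, false, true, false, true, true, true]

# Count (level, class) combinations by hand, similar in spirit to a frequency table.
counts = Dict{Tuple{String, Bool}, Int}()
for (l, c) in zip(level, class)
    counts[(l, c)] = get(counts, (l, c), 0) + 1
end

# A level is perfectly separated if one of the two classes never occurs for it.
separated = [l for l in unique(level)
             if get(counts, (l, true), 0) == 0 || get(counts, (l, false), 0) == 0]

println(separated)
```

In the frequency tables from the solution, the same pattern shows up as rows in which one of the two class columns contains zero.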
### Exercise 9

Add a `class_p` column to `secondary` as the second column. It should contain the probability, predicted by the model created in exercise 8, that a given observation has class `p`. Print descriptive statistics of the `class_p` column by `class`.

Solution

```
insertcols!(secondary, 2, :class_p => predict(model))
combine(groupby(secondary, :class)) do sdf
    return describe(sdf, :detailed, cols=:class_p)
end
```

We can see that the model has some discriminatory power, but there is still a significant overlap between classes.

### Exercise 10

Plot the FPR-TPR ROC curve for our model and compute the associated AUC value.

Solution

```
using Plots
using ROCAnalysis
roc_data = roc(secondary; score=:class_p, target=:class)
plot(roc_data.pfa, 1 .- roc_data.pmiss;
     title="AUC=$(round(100*(1 - auc(roc_data)), digits=2))%",
     xlabel="FPR", ylabel="TPR", legend=false)
```
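
As a cross-check on what the AUC of an FPR-TPR curve measures, it can also be computed directly as the fraction of (positive, negative) pairs in which the positive observation gets the higher score, with ties counted as one half. The sketch below does this in Base Julia on made-up scores and labels; it does not use ROCAnalysis.jl and is not part of the original solution.

```julia
# Made-up scores and labels for illustration only.
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]
labels = [true, true, false, true, false, false]

pos = scores[labels]    # scores of positive observations
neg = scores[.!labels]  # scores of negative observations

# Share of (positive, negative) pairs ranked correctly; ties count as 0.5.
auc_manual = sum((p > n) + 0.5 * (p == n) for p in pos, n in neg) /
             (length(pos) * length(neg))

println(round(auc_manual, digits=4))
```

Here 8 of the 9 pairs are ranked correctly, so the value is 8/9. Applied to `secondary.class_p` and `secondary.class`, the same computation would be expected to agree with the `1 - auc(roc_data)` value shown in the plot title, although the pairwise sum is quadratic in the number of rows and so is only practical for small samples.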