JuliaForDataAnalysis/exercises/exercises13.md
2022-10-14 13:43:12 +02:00

8.0 KiB

Julia for Data Analysis

Bogumił Kamiński, Daniel Kaszyński

Chapter 13

Problems

Exercise 1

Download https://archive.ics.uci.edu/ml/machine-learning-databases/00615/MushroomDataset.zip archive and extract primary_data.csv and secondary_data.csv files from it. Save the files to disk.

Solution
using Downloads
import ZipFile
Downloads.download("https://archive.ics.uci.edu/ml/machine-learning-databases/00615/MushroomDataset.zip", "MushroomDataset.zip")
archive = ZipFile.Reader("MushroomDataset.zip")
idx = only(findall(x -> contains(x.name, "primary_data.csv"), archive.files))
open("primary_data.csv", "w") do io
    write(io, read(archive.files[idx]))
end
idx = only(findall(x -> contains(x.name, "secondary_data.csv"), archive.files))
open("secondary_data.csv", "w") do io
    write(io, read(archive.files[idx]))
end
close(archive)

Exercise 2

Load primary_data.csv into the primary data frame. Load secondary_data.csv into the secondary data frame. Describe the contents of both data frames.

Solution
using CSV
using DataFrames
primary = CSV.read("primary_data.csv", DataFrame; delim=';')
secondary = CSV.read("secondary_data.csv", DataFrame; delim=';')
describe(primary)
describe(secondary)

Exercise 3

Start with primary data. Note that columns starting from column 4 have their data encoded using vector notation, but they have been read-in as strings. Convert these columns to hold proper vectors. Note that some columns have missing values. Most of the columns hold nominal data, but three columns, i.e. cap-diameter, stem-height, and stem-width have numeric data. These should be parsed as vectors storing numeric values. After parsing, put these three columns just after class column in the parsed_primary data frame.

Check renamecols keyword argument of select to avoid renaming of the produced columns.

Solution
parse_nominal(s::AbstractString) = split(strip(s, ['[', ']']), ", ")
parse_nominal(::Missing) = missing
parse_numeric(s::AbstractString) = parse.(Float64, split(strip(s, ['[', ']']), ", "))
parse_numeric(::Missing) = missing
idcols = ["family", "name", "class"]
numericcols = ["cap-diameter", "stem-height", "stem-width"]
parsed_primary = select(primary,
                        idcols,
                        numericcols .=> ByRow(parse_numeric),
                        Not([idcols; numericcols]) .=> ByRow(parse_nominal);
                        renamecols=false)

Exercise 4

In parsed_primary data frame find all pairs of mushrooms (rows) that might be confused and have a different class, using the information about them we have (so all information except their family, name, and class).

Use the following rules: * if for some pair of mushrooms the data in some column for either of them is missing then skip matching on this column; for numeric columns if there is only one value in a vector then treat it as missing; * otherwise: - for numeric columns check if there is an overlap in the interval specified by the min and max values for the range passed; - for nominal columns check if the intersection of nominal values is nonempty.

For each found pair print to the screen the row number, family, name, and class.

Solution
function overlap_numeric(v1, v2)
    # there are no missing values in numeric columns
    if length(v1) == 1 || length(v2) == 1
        return true
    else
        return max(v1[1], v2[1]) <= min(v1[2], v2[2])
    end
end

function overlap_nominal(v1, v2)
    if ismissing(v1) || ismissing(v2)
        return true
    else
        return !isempty(intersect(v1, v2))
    end
end

function rows_overlap(row1, row2)
    # note that in parsed_primary numeric columns have indices 4 to 6
    # and nominal columns have indices 7 to 23
    return all(i -> overlap_numeric(row1[i], row2[i]), 4:6) &&
           all(i -> overlap_nominal(row1[i], row2[i]), 7:23)
end

for i in 1:nrow(parsed_primary), j in i+1:nrow(parsed_primary)
    row1 = parsed_primary[i, :]
    row2 = parsed_primary[j, :]
    if rows_overlap(row1, row2) && row1.class != row2.class
        println((i, Tuple(row1[1:3]), j, Tuple(row2[1:3])))
    end
end

Note that in this exercise using eachrow is not a problem (although it is not type stable) because the data is small.

Exercise 5

Still using parsed_primary find what is the average probability of class being p by family. Additionally add number of observations in each group. Sort these results by the probability. Try using DataFramesMeta.jl to do this exercise (this requirement is optional).

Store the result in agg_primary data frame.

Solution
using Statistics
using DataFramesMeta
agg_primary = @chain parsed_primary begin
    groupby(:family)
    @combine(:pr_p = mean(:class .== "p"), $nrow)
    sort(:pr_p)
end

Exercise 6

Now using agg_primary data frame collapse it so that for each unique pr_p it gives us a total number of rows that had this probability and a tuple of mushroom family names.

Optionally: try to display the produced table so that the tuple containing the list of families for each group is not cropped (this will require large terminal).

Solution
show(combine(groupby(agg_primary, :pr_p), :nrow => sum => :nrow, :family => Tuple => :families); truncate=140)

Exercise 7

From our preliminary analysis of primary data we see that missing value in the primary data is non-informative, so in secondary data we should be cautious when building a model if we allowed for missing data (in practice if we were investigating some real mushroom we most likely would know its characteristics).

Therefore as a first step drop in-place all columns in secondary data frame that have missing values.

Solution
select!(secondary, [!any(ismissing, col) for col in eachcol(secondary)])

Note that we select based on actual contents of the columns and not by their element type (column could allow for missing values but not have them).

Exercise 8

Create a logistic regression predicting class based on all remaining features in the data frame. You might need to check the Term usage in StatsModels.jl documentation.

You will notice that for stem-color and habitat columns you get strange estimation results (large absolute values of estimated parameters and even larger standard errors). Explain why this happens by analyzing frequency tables of these variables against class column.

Solution
using GLM
using FreqTables
secondary.class = secondary.class .== "p"
model = glm(Term(:class)~sum(Term.(Symbol.(names(secondary, Not(:class))))),
           secondary, Binomial(), LogitLink())
freqtable(secondary, "stem-color", "class")
freqtable(secondary, "habitat", "class")

We can see that for certain levels of stem-color and habitat variables there is a perfect separation of classes.

Exercise 9

Add class_p column to secondary as a second column that will contain predicted probability from the model created in exercise 8 of a given observation having class p.

Print descriptive statistics of column class_p by class.

Solution
insertcols!(secondary, 2, :class_p => predict(model))
combine(groupby(secondary, :class)) do sdf
    return describe(sdf, :detailed, cols=:class_p)
end

We can see that the model has some discriminatory power, but there is still a significant overlap between classes.

Exercise 10

Plot FPR-TPR ROC curve for our model and compute associated AUC value.

Solution
using Plots
using ROCAnalysis
roc_data = roc(secondary; score=:class_p, target=:class)
plot(roc_data.pfa, 1 .- roc_data.pmiss;
     title="AUC=$(round(100*(1 - auc(roc_data)), digits=2))%",
     xlabel="FPR", ylabel="TPR", legend=false)