Revised with 1.6.1

David Doblas Jiménez 2021-06-27 16:34:46 +02:00
parent a3b5cd7deb
commit 85271e0ea6
3 changed files with 3169 additions and 0 deletions

@ -0,0 +1,479 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Final examples\n",
"\n",
"### Bogumił Kamiński"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let us wrap up our tutorial with examples of joining and reshaping data."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Joining and reshaping data frames"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"using DataFrames\n",
"using CSV\n",
"using Pipe\n",
"using Unitful\n",
"using Dates"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Load the rainfall forecast data for two cities in Poland."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<table class=\"data-frame\"><thead><tr><th></th><th>city</th><th>date</th><th>rainfall</th></tr><tr><th></th><th>String</th><th>Date</th><th>Float64</th></tr></thead><tbody><p>10 rows × 3 columns</p><tr><th>1</th><td>Olecko</td><td>2020-11-16</td><td>2.9</td></tr><tr><th>2</th><td>Olecko</td><td>2020-11-17</td><td>4.1</td></tr><tr><th>3</th><td>Olecko</td><td>2020-11-19</td><td>4.3</td></tr><tr><th>4</th><td>Olecko</td><td>2020-11-20</td><td>2.0</td></tr><tr><th>5</th><td>Olecko</td><td>2020-11-21</td><td>0.6</td></tr><tr><th>6</th><td>Olecko</td><td>2020-11-22</td><td>1.0</td></tr><tr><th>7</th><td>Ełk</td><td>2020-11-16</td><td>3.9</td></tr><tr><th>8</th><td>Ełk</td><td>2020-11-19</td><td>1.2</td></tr><tr><th>9</th><td>Ełk</td><td>2020-11-20</td><td>2.0</td></tr><tr><th>10</th><td>Ełk</td><td>2020-11-22</td><td>2.0</td></tr></tbody></table>"
],
"text/latex": [
"\\begin{tabular}{r|ccc}\n",
"\t& city & date & rainfall\\\\\n",
"\t\\hline\n",
"\t& String & Date & Float64\\\\\n",
"\t\\hline\n",
"\t1 & Olecko & 2020-11-16 & 2.9 \\\\\n",
"\t2 & Olecko & 2020-11-17 & 4.1 \\\\\n",
"\t3 & Olecko & 2020-11-19 & 4.3 \\\\\n",
"\t4 & Olecko & 2020-11-20 & 2.0 \\\\\n",
"\t5 & Olecko & 2020-11-21 & 0.6 \\\\\n",
"\t6 & Olecko & 2020-11-22 & 1.0 \\\\\n",
"\t7 & Ełk & 2020-11-16 & 3.9 \\\\\n",
"\t8 & Ełk & 2020-11-19 & 1.2 \\\\\n",
"\t9 & Ełk & 2020-11-20 & 2.0 \\\\\n",
"\t10 & Ełk & 2020-11-22 & 2.0 \\\\\n",
"\\end{tabular}\n"
],
"text/plain": [
"\u001b[1m10×3 DataFrame\u001b[0m\n",
"\u001b[1m Row \u001b[0m│\u001b[1m city \u001b[0m\u001b[1m date \u001b[0m\u001b[1m rainfall \u001b[0m\n",
"\u001b[1m \u001b[0m│\u001b[90m String \u001b[0m\u001b[90m Date \u001b[0m\u001b[90m Float64 \u001b[0m\n",
"─────┼──────────────────────────────\n",
" 1 │ Olecko 2020-11-16 2.9\n",
" 2 │ Olecko 2020-11-17 4.1\n",
" 3 │ Olecko 2020-11-19 4.3\n",
" 4 │ Olecko 2020-11-20 2.0\n",
" 5 │ Olecko 2020-11-21 0.6\n",
" 6 │ Olecko 2020-11-22 1.0\n",
" 7 │ Ełk 2020-11-16 3.9\n",
" 8 │ Ełk 2020-11-19 1.2\n",
" 9 │ Ełk 2020-11-20 2.0\n",
" 10 │ Ełk 2020-11-22 2.0"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"rainfall_long = CSV.File(\"data/rainfall_forecast.csv\") |> DataFrame"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The data contains rainfall amounts, so it would be nice to attach units to the values. `Unitful.jl` makes this easy, and it works because a `DataFrame` can store vectors of any Julia objects."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<table class=\"data-frame\"><thead><tr><th></th><th>city</th><th>date</th><th>rainfall</th></tr><tr><th></th><th>String</th><th>Date</th><th>Quantit…</th></tr></thead><tbody><p>10 rows × 3 columns</p><tr><th>1</th><td>Olecko</td><td>2020-11-16</td><td>2.9 mm</td></tr><tr><th>2</th><td>Olecko</td><td>2020-11-17</td><td>4.1 mm</td></tr><tr><th>3</th><td>Olecko</td><td>2020-11-19</td><td>4.3 mm</td></tr><tr><th>4</th><td>Olecko</td><td>2020-11-20</td><td>2.0 mm</td></tr><tr><th>5</th><td>Olecko</td><td>2020-11-21</td><td>0.6 mm</td></tr><tr><th>6</th><td>Olecko</td><td>2020-11-22</td><td>1.0 mm</td></tr><tr><th>7</th><td>Ełk</td><td>2020-11-16</td><td>3.9 mm</td></tr><tr><th>8</th><td>Ełk</td><td>2020-11-19</td><td>1.2 mm</td></tr><tr><th>9</th><td>Ełk</td><td>2020-11-20</td><td>2.0 mm</td></tr><tr><th>10</th><td>Ełk</td><td>2020-11-22</td><td>2.0 mm</td></tr></tbody></table>"
],
"text/latex": [
"\\begin{tabular}{r|ccc}\n",
"\t& city & date & rainfall\\\\\n",
"\t\\hline\n",
"\t& String & Date & Quantit…\\\\\n",
"\t\\hline\n",
"\t1 & Olecko & 2020-11-16 & 2.9 mm \\\\\n",
"\t2 & Olecko & 2020-11-17 & 4.1 mm \\\\\n",
"\t3 & Olecko & 2020-11-19 & 4.3 mm \\\\\n",
"\t4 & Olecko & 2020-11-20 & 2.0 mm \\\\\n",
"\t5 & Olecko & 2020-11-21 & 0.6 mm \\\\\n",
"\t6 & Olecko & 2020-11-22 & 1.0 mm \\\\\n",
"\t7 & Ełk & 2020-11-16 & 3.9 mm \\\\\n",
"\t8 & Ełk & 2020-11-19 & 1.2 mm \\\\\n",
"\t9 & Ełk & 2020-11-20 & 2.0 mm \\\\\n",
"\t10 & Ełk & 2020-11-22 & 2.0 mm \\\\\n",
"\\end{tabular}\n"
],
"text/plain": [
"\u001b[1m10×3 DataFrame\u001b[0m\n",
"\u001b[1m Row \u001b[0m│\u001b[1m city \u001b[0m\u001b[1m date \u001b[0m\u001b[1m rainfall \u001b[0m\n",
"\u001b[1m \u001b[0m│\u001b[90m String \u001b[0m\u001b[90m Date \u001b[0m\u001b[90m Quantity… \u001b[0m\n",
"─────┼───────────────────────────────\n",
" 1 │ Olecko 2020-11-16 2.9 mm\n",
" 2 │ Olecko 2020-11-17 4.1 mm\n",
" 3 │ Olecko 2020-11-19 4.3 mm\n",
" 4 │ Olecko 2020-11-20 2.0 mm\n",
" 5 │ Olecko 2020-11-21 0.6 mm\n",
" 6 │ Olecko 2020-11-22 1.0 mm\n",
" 7 │ Ełk 2020-11-16 3.9 mm\n",
" 8 │ Ełk 2020-11-19 1.2 mm\n",
" 9 │ Ełk 2020-11-20 2.0 mm\n",
" 10 │ Ełk 2020-11-22 2.0 mm"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"transform!(rainfall_long, :rainfall => x -> x .* u\"mm\", renamecols=false)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"With `renamecols=false` the transformed column keeps its original name; by default the result of an anonymous function would be stored in a new column named `rainfall_function`. The `transform!` function updates the data frame in place."
]
},
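{
"cell_type": "markdown",
"metadata": {},
"source": [
"The effect of `renamecols` can be checked on a toy example. This is only a sketch: the data frame `df` and its values are made up for illustration, and the non-mutating `transform` is used so that `df` itself stays unchanged.\n",
"\n",
"```julia\n",
"using DataFrames\n",
"\n",
"df = DataFrame(rainfall=[2.9, 4.1])  # hypothetical values\n",
"\n",
"# Default: the result of an anonymous function goes to an auto-named column\n",
"default_names = names(transform(df, :rainfall => (x -> 2 .* x)))\n",
"# [\"rainfall\", \"rainfall_function\"]\n",
"\n",
"# renamecols=false: the result overwrites the source column\n",
"kept_names = names(transform(df, :rainfall => (x -> 2 .* x), renamecols=false))\n",
"# [\"rainfall\"]\n",
"```"
]
},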
{
"cell_type": "markdown",
"metadata": {},
"source": [
"It would be nice to see the data in a wide format, so that each city is represented by a single column. We can achieve this using the `unstack` function:"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<table class=\"data-frame\"><thead><tr><th></th><th>date</th><th>Olecko</th><th>Ełk</th></tr><tr><th></th><th>Date</th><th>Quantit…?</th><th>Quantit…?</th></tr></thead><tbody><p>6 rows × 3 columns</p><tr><th>1</th><td>2020-11-16</td><td>2.9 mm</td><td>3.9 mm</td></tr><tr><th>2</th><td>2020-11-17</td><td>4.1 mm</td><td><em>missing</em></td></tr><tr><th>3</th><td>2020-11-19</td><td>4.3 mm</td><td>1.2 mm</td></tr><tr><th>4</th><td>2020-11-20</td><td>2.0 mm</td><td>2.0 mm</td></tr><tr><th>5</th><td>2020-11-21</td><td>0.6 mm</td><td><em>missing</em></td></tr><tr><th>6</th><td>2020-11-22</td><td>1.0 mm</td><td>2.0 mm</td></tr></tbody></table>"
],
"text/latex": [
"\\begin{tabular}{r|ccc}\n",
"\t& date & Olecko & Ełk\\\\\n",
"\t\\hline\n",
"\t& Date & Quantit…? & Quantit…?\\\\\n",
"\t\\hline\n",
"\t1 & 2020-11-16 & 2.9 mm & 3.9 mm \\\\\n",
"\t2 & 2020-11-17 & 4.1 mm & \\emph{missing} \\\\\n",
"\t3 & 2020-11-19 & 4.3 mm & 1.2 mm \\\\\n",
"\t4 & 2020-11-20 & 2.0 mm & 2.0 mm \\\\\n",
"\t5 & 2020-11-21 & 0.6 mm & \\emph{missing} \\\\\n",
"\t6 & 2020-11-22 & 1.0 mm & 2.0 mm \\\\\n",
"\\end{tabular}\n"
],
"text/plain": [
"\u001b[1m6×3 DataFrame\u001b[0m\n",
"\u001b[1m Row \u001b[0m│\u001b[1m date \u001b[0m\u001b[1m Olecko \u001b[0m\u001b[1m Ełk \u001b[0m\n",
"\u001b[1m \u001b[0m│\u001b[90m Date \u001b[0m\u001b[90m Quantity…? \u001b[0m\u001b[90m Quantity…? \u001b[0m\n",
"─────┼──────────────────────────────────────\n",
" 1 │ 2020-11-16 2.9 mm 3.9 mm\n",
" 2 │ 2020-11-17 4.1 mm \u001b[90m missing \u001b[0m\n",
" 3 │ 2020-11-19 4.3 mm 1.2 mm\n",
" 4 │ 2020-11-20 2.0 mm 2.0 mm\n",
" 5 │ 2020-11-21 0.6 mm \u001b[90m missing \u001b[0m\n",
" 6 │ 2020-11-22 1.0 mm 2.0 mm"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"rainfall_wide = unstack(rainfall_long, :date, :city, :rainfall)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can see that the \"gaps\" in the rainfall information for the `\"Ełk\"` column were automatically filled with `missing`."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"There is also a `stack` function that does the reverse: it transforms a data frame from wide to long format."
]
},
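{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch of `stack`, using a small made-up table (the column names `A` and `B` and all values are assumptions for illustration, not the tutorial data). The `variable_name` and `value_name` keyword arguments set the names of the two new columns:\n",
"\n",
"```julia\n",
"using DataFrames\n",
"\n",
"# Hypothetical wide table: one column per city, with one gap\n",
"wide = DataFrame(day=[16, 17], A=[1.0, 2.0], B=[3.0, missing])\n",
"\n",
"# Stack every column except :day into (city, rainfall) pairs\n",
"long = stack(wide, Not(:day), variable_name=:city, value_name=:rainfall)\n",
"\n",
"# The gap from the wide table shows up as a row with a missing value;\n",
"# dropmissing removes it to recover a compact long table\n",
"long_clean = dropmissing(long, :rainfall)\n",
"```\n",
"\n",
"The same call with `Not(:date)` should work for `rainfall_wide`."
]
},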
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Also note that one of the cities is `\"Ełk\"`, whose name contains the non-ASCII character `ł`. This is not a problem for `DataFrames.jl`. As an exercise, let us extract this column in two ways:"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"6-element Vector{Union{Missing, Quantity{Float64, 𝐋, Unitful.FreeUnits{(mm,), 𝐋, nothing}}}}:\n",
" 3.9 mm\n",
" missing\n",
" 1.2 mm\n",
" 2.0 mm\n",
" missing\n",
" 2.0 mm"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"rainfall_wide.Ełk"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"6-element Vector{Union{Missing, Quantity{Float64, 𝐋, Unitful.FreeUnits{(mm,), 𝐋, nothing}}}}:\n",
" 3.9 mm\n",
" missing\n",
" 1.2 mm\n",
" 2.0 mm\n",
" missing\n",
" 2.0 mm"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"rainfall_wide.\"Ełk\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Looking at the data, we note that there are still gaps: one of the days, 2020-11-18, is absent altogether, as no rainfall is forecast for it in either city.\n",
"\n",
"It would be nice to have information for all days in the considered period. Here is the way to do it:"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<table class=\"data-frame\"><thead><tr><th></th><th>date</th></tr><tr><th></th><th>Date</th></tr></thead><tbody><p>7 rows × 1 columns</p><tr><th>1</th><td>2020-11-16</td></tr><tr><th>2</th><td>2020-11-17</td></tr><tr><th>3</th><td>2020-11-18</td></tr><tr><th>4</th><td>2020-11-19</td></tr><tr><th>5</th><td>2020-11-20</td></tr><tr><th>6</th><td>2020-11-21</td></tr><tr><th>7</th><td>2020-11-22</td></tr></tbody></table>"
],
"text/latex": [
"\\begin{tabular}{r|c}\n",
"\t& date\\\\\n",
"\t\\hline\n",
"\t& Date\\\\\n",
"\t\\hline\n",
"\t1 & 2020-11-16 \\\\\n",
"\t2 & 2020-11-17 \\\\\n",
"\t3 & 2020-11-18 \\\\\n",
"\t4 & 2020-11-19 \\\\\n",
"\t5 & 2020-11-20 \\\\\n",
"\t6 & 2020-11-21 \\\\\n",
"\t7 & 2020-11-22 \\\\\n",
"\\end{tabular}\n"
],
"text/plain": [
"\u001b[1m7×1 DataFrame\u001b[0m\n",
"\u001b[1m Row \u001b[0m│\u001b[1m date \u001b[0m\n",
"\u001b[1m \u001b[0m│\u001b[90m Date \u001b[0m\n",
"─────┼────────────\n",
" 1 │ 2020-11-16\n",
" 2 │ 2020-11-17\n",
" 3 │ 2020-11-18\n",
" 4 │ 2020-11-19\n",
" 5 │ 2020-11-20\n",
" 6 │ 2020-11-21\n",
" 7 │ 2020-11-22"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"all_days = DataFrame(date=Date.(2020, 11, 16:22))"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<table class=\"data-frame\"><thead><tr><th></th><th>date</th><th>Olecko</th><th>Ełk</th></tr><tr><th></th><th>Date</th><th>Quantit…</th><th>Quantit…</th></tr></thead><tbody><p>7 rows × 3 columns</p><tr><th>1</th><td>2020-11-16</td><td>2.9 mm</td><td>3.9 mm</td></tr><tr><th>2</th><td>2020-11-17</td><td>4.1 mm</td><td>0.0 mm</td></tr><tr><th>3</th><td>2020-11-19</td><td>4.3 mm</td><td>1.2 mm</td></tr><tr><th>4</th><td>2020-11-20</td><td>2.0 mm</td><td>2.0 mm</td></tr><tr><th>5</th><td>2020-11-21</td><td>0.6 mm</td><td>0.0 mm</td></tr><tr><th>6</th><td>2020-11-22</td><td>1.0 mm</td><td>2.0 mm</td></tr><tr><th>7</th><td>2020-11-18</td><td>0.0 mm</td><td>0.0 mm</td></tr></tbody></table>"
],
"text/latex": [
"\\begin{tabular}{r|ccc}\n",
"\t& date & Olecko & Ełk\\\\\n",
"\t\\hline\n",
"\t& Date & Quantit… & Quantit…\\\\\n",
"\t\\hline\n",
"\t1 & 2020-11-16 & 2.9 mm & 3.9 mm \\\\\n",
"\t2 & 2020-11-17 & 4.1 mm & 0.0 mm \\\\\n",
"\t3 & 2020-11-19 & 4.3 mm & 1.2 mm \\\\\n",
"\t4 & 2020-11-20 & 2.0 mm & 2.0 mm \\\\\n",
"\t5 & 2020-11-21 & 0.6 mm & 0.0 mm \\\\\n",
"\t6 & 2020-11-22 & 1.0 mm & 2.0 mm \\\\\n",
"\t7 & 2020-11-18 & 0.0 mm & 0.0 mm \\\\\n",
"\\end{tabular}\n"
],
"text/plain": [
"\u001b[1m7×3 DataFrame\u001b[0m\n",
"\u001b[1m Row \u001b[0m│\u001b[1m date \u001b[0m\u001b[1m Olecko \u001b[0m\u001b[1m Ełk \u001b[0m\n",
"\u001b[1m \u001b[0m│\u001b[90m Date \u001b[0m\u001b[90m Quantity… \u001b[0m\u001b[90m Quantity… \u001b[0m\n",
"─────┼──────────────────────────────────\n",
" 1 │ 2020-11-16 2.9 mm 3.9 mm\n",
" 2 │ 2020-11-17 4.1 mm 0.0 mm\n",
" 3 │ 2020-11-19 4.3 mm 1.2 mm\n",
" 4 │ 2020-11-20 2.0 mm 2.0 mm\n",
" 5 │ 2020-11-21 0.6 mm 0.0 mm\n",
" 6 │ 2020-11-22 1.0 mm 2.0 mm\n",
" 7 │ 2020-11-18 0.0 mm 0.0 mm"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"@pipe leftjoin(all_days, rainfall_wide, on=:date) |>\n",
" coalesce.(_, 0.0u\"mm\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note that we additionally broadcast `coalesce` over the whole data frame returned by `leftjoin` to replace all `missing` values in it with `0.0u\"mm\"`, as in this case `missing` means that no rain is forecast for that day. Also note that the row for 2020-11-18, which has no match in `rainfall_wide`, ends up last in the result; sort by `:date` if you need chronological order.\n",
"\n",
"This was safe to do here, as we knew that the `:date` column does not contain missing values. In particular, note that `leftjoin` would error by default if we tried to perform a join on a column that contains `missing` values (use the `matchmissing` keyword argument in joins to change this behavior)."
]
},
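{
"cell_type": "markdown",
"metadata": {},
"source": [
"The two points above can be sketched on toy data. All table names and values below are made up for illustration:\n",
"\n",
"```julia\n",
"using DataFrames\n",
"\n",
"left = DataFrame(id=[1, 2, 3])\n",
"right = DataFrame(id=[1, 3], value=[10.0, 30.0])\n",
"\n",
"# id == 2 has no match, so its value is missing after the join\n",
"joined = leftjoin(left, right, on=:id)\n",
"sort!(joined, :id)               # restore the key order explicitly\n",
"filled = coalesce.(joined, 0.0)  # replace every missing with 0.0\n",
"\n",
"# Keys containing missing: the default would throw an error,\n",
"# matchmissing=:equal treats missing as matching missing\n",
"left_m = DataFrame(id=[1, missing])\n",
"right_m = DataFrame(id=[1, missing], value=[10.0, 99.0])\n",
"joined_m = leftjoin(left_m, right_m, on=:id, matchmissing=:equal)\n",
"```"
]
},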
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Conclusions"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before we finish, let us summarize the major features that `DataFrames.jl` provides:\n",
"1. A data frame is a matrix-like data structure. You can index it just like a matrix. The differences are:\n",
" - you can use strings or `Symbol`s to select columns\n",
" - if you use `!` as the row selector, you get the whole column of the data frame without copying (while `:` returns a copy)\n",
"2. You can quickly summarize the contents of a data frame using the `describe` function\n",
"3. You can add rows to a data frame in-place using `push!` (similarly, `append!` allows you to add multiple rows at the same time); `repeat`/`repeat!`, `hcat` and `vcat` are also provided\n",
"4. You can work on a grouped data frame that is created using the `groupby` function. It is a view and works as if you had created a lookup index to a data frame.\n",
"5. There are `select`/`select!`/`transform`/`transform!`/`combine` functions that allow you to quickly transform/aggregate columns of a data frame or grouped data frame; there are also `mapcols`/`mapcols!` functions for applying a function to each column of a data frame\n",
"6. You can filter rows of a data frame using `filter` and `filter!` functions (also `subset` and `subset!` starting from version 1.0)\n",
"7. Use `sort` and `sort!` functions to sort data frames\n",
"8. You can join multiple data frames using `innerjoin`, `outerjoin`, `leftjoin`, `rightjoin`, `semijoin`, `antijoin`, and `crossjoin` functions (they work as you would expect if you know SQL)\n",
"9. If you want to iterate over rows or columns of a data frame, use `eachrow` and `eachcol` functions (we have not discussed them, but they work just like their counterparts in Julia Base)\n",
"10. You can change names of columns in a data frame using `rename` and `rename!` functions; to get names of columns of a data frame use `names` (strings) or `propertynames` (`Symbol`s)\n",
"11. To get the number of rows and columns of a data frame use `nrow` and `ncol` functions\n",
"12. To flatten nested columns of a data frame use `flatten`\n",
"13. You can easily allow/disallow missing values in columns of a data frame using `allowmissing`/`allowmissing!`/`disallowmissing`/`disallowmissing!` functions\n",
"14. You can drop rows with missing data with `dropmissing`/`dropmissing!` functions\n",
"15. You can switch between [long and wide](https://en.wikipedia.org/wiki/Wide_and_narrow_data) representation of a data frame using `stack` and `unstack`"
]
},
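{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a small illustration of the first point in the list, here is a sketch of the difference between the `!` and `:` row selectors. The data frame `df` is a made-up example:\n",
"\n",
"```julia\n",
"using DataFrames\n",
"\n",
"df = DataFrame(a=[1, 2, 3])  # hypothetical data\n",
"\n",
"no_copy = df[!, :a]  # the very vector stored in the data frame\n",
"a_copy = df[:, :a]   # a freshly allocated copy\n",
"\n",
"no_copy === df.a     # true: same object\n",
"a_copy === df.a      # false: different object ...\n",
"a_copy == df.a       # true: ... with equal contents\n",
"```\n",
"\n",
"In practice this means that mutating `no_copy` changes `df`, while mutating `a_copy` does not."
]
},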
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Additionally, we have covered `freqtable` from FreqTables.jl, `@pipe` from Pipe.jl, and `lm` from GLM.jl, which are often useful when wrangling data.\n",
"\n",
"You can use many formats to store and read data frames; we have discussed the CSV.jl and Arrow.jl packages, which provide such functionality.\n",
"\n",
"Finally, we have shown how to integrate DataFrames.jl with plotting using PyPlot.jl and with physical units using Unitful.jl."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Of course, this tutorial was just an introduction.\n",
"\n",
"You can find more about the functionality of DataFrames.jl in:\n",
"* the official manual at https://juliadata.github.io/DataFrames.jl/stable/\n",
"* a tutorial going through all functionalities of DataFrames.jl at https://github.com/bkamins/Julia-DataFrames-Tutorial\n",
"* documentation strings of the respective functions"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Julia 1.6.1",
"language": "julia",
"name": "julia-1.6"
},
"language_info": {
"file_extension": ".jl",
"mimetype": "application/julia",
"name": "julia",
"version": "1.6.1"
}
},
"nbformat": 4,
"nbformat_minor": 4
}