Revised with 1.6.1
parent a3b5cd7deb
commit 85271e0ea6
1768
DataFrames/04__Grouping_data_frames.ipynb
Normal file
File diff suppressed because it is too large
922
DataFrames/05__Collecting_experiments_data_in_a_data_frame.ipynb
Normal file
File diff suppressed because one or more lines are too long
479
DataFrames/06__Next_steps.ipynb
Normal file
@ -0,0 +1,479 @@
|
||||
{
|
||||
"cells": [
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# Final examples\n",
|
||||
"\n",
|
||||
"### Bogumił Kamiński"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Let us wrap up our tutorial with examples of joining and reshaping data."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### Joining and reshaping data frames"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 1,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"using DataFrames\n",
|
||||
"using CSV\n",
|
||||
"using Pipe\n",
|
||||
"using Unitful\n",
|
||||
"using Dates"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Load the rainfall forecast data for two cities in Poland."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 2,
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/html": [
|
||||
"<table class=\"data-frame\"><thead><tr><th></th><th>city</th><th>date</th><th>rainfall</th></tr><tr><th></th><th>String</th><th>Date</th><th>Float64</th></tr></thead><tbody><p>10 rows × 3 columns</p><tr><th>1</th><td>Olecko</td><td>2020-11-16</td><td>2.9</td></tr><tr><th>2</th><td>Olecko</td><td>2020-11-17</td><td>4.1</td></tr><tr><th>3</th><td>Olecko</td><td>2020-11-19</td><td>4.3</td></tr><tr><th>4</th><td>Olecko</td><td>2020-11-20</td><td>2.0</td></tr><tr><th>5</th><td>Olecko</td><td>2020-11-21</td><td>0.6</td></tr><tr><th>6</th><td>Olecko</td><td>2020-11-22</td><td>1.0</td></tr><tr><th>7</th><td>Ełk</td><td>2020-11-16</td><td>3.9</td></tr><tr><th>8</th><td>Ełk</td><td>2020-11-19</td><td>1.2</td></tr><tr><th>9</th><td>Ełk</td><td>2020-11-20</td><td>2.0</td></tr><tr><th>10</th><td>Ełk</td><td>2020-11-22</td><td>2.0</td></tr></tbody></table>"
|
||||
],
|
||||
"text/latex": [
|
||||
"\\begin{tabular}{r|ccc}\n",
|
||||
"\t& city & date & rainfall\\\\\n",
|
||||
"\t\\hline\n",
|
||||
"\t& String & Date & Float64\\\\\n",
|
||||
"\t\\hline\n",
|
||||
"\t1 & Olecko & 2020-11-16 & 2.9 \\\\\n",
|
||||
"\t2 & Olecko & 2020-11-17 & 4.1 \\\\\n",
|
||||
"\t3 & Olecko & 2020-11-19 & 4.3 \\\\\n",
|
||||
"\t4 & Olecko & 2020-11-20 & 2.0 \\\\\n",
|
||||
"\t5 & Olecko & 2020-11-21 & 0.6 \\\\\n",
|
||||
"\t6 & Olecko & 2020-11-22 & 1.0 \\\\\n",
|
||||
"\t7 & Ełk & 2020-11-16 & 3.9 \\\\\n",
|
||||
"\t8 & Ełk & 2020-11-19 & 1.2 \\\\\n",
|
||||
"\t9 & Ełk & 2020-11-20 & 2.0 \\\\\n",
|
||||
"\t10 & Ełk & 2020-11-22 & 2.0 \\\\\n",
|
||||
"\\end{tabular}\n"
|
||||
],
|
||||
"text/plain": [
|
||||
"\u001b[1m10×3 DataFrame\u001b[0m\n",
|
||||
"\u001b[1m Row \u001b[0m│\u001b[1m city \u001b[0m\u001b[1m date \u001b[0m\u001b[1m rainfall \u001b[0m\n",
|
||||
"\u001b[1m \u001b[0m│\u001b[90m String \u001b[0m\u001b[90m Date \u001b[0m\u001b[90m Float64 \u001b[0m\n",
|
||||
"─────┼──────────────────────────────\n",
|
||||
" 1 │ Olecko 2020-11-16 2.9\n",
|
||||
" 2 │ Olecko 2020-11-17 4.1\n",
|
||||
" 3 │ Olecko 2020-11-19 4.3\n",
|
||||
" 4 │ Olecko 2020-11-20 2.0\n",
|
||||
" 5 │ Olecko 2020-11-21 0.6\n",
|
||||
" 6 │ Olecko 2020-11-22 1.0\n",
|
||||
" 7 │ Ełk 2020-11-16 3.9\n",
|
||||
" 8 │ Ełk 2020-11-19 1.2\n",
|
||||
" 9 │ Ełk 2020-11-20 2.0\n",
|
||||
" 10 │ Ełk 2020-11-22 2.0"
|
||||
]
|
||||
},
|
||||
"execution_count": 2,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"rainfall_long = CSV.File(\"data/rainfall_forecast.csv\") |> DataFrame"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Since the data holds rainfall measurements, it would be nice to attach units to the values. This is easy with `Unitful.jl`, because a `DataFrame` can store vectors of arbitrary Julia objects."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 3,
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/html": [
|
||||
"<table class=\"data-frame\"><thead><tr><th></th><th>city</th><th>date</th><th>rainfall</th></tr><tr><th></th><th>String</th><th>Date</th><th>Quantit…</th></tr></thead><tbody><p>10 rows × 3 columns</p><tr><th>1</th><td>Olecko</td><td>2020-11-16</td><td>2.9 mm</td></tr><tr><th>2</th><td>Olecko</td><td>2020-11-17</td><td>4.1 mm</td></tr><tr><th>3</th><td>Olecko</td><td>2020-11-19</td><td>4.3 mm</td></tr><tr><th>4</th><td>Olecko</td><td>2020-11-20</td><td>2.0 mm</td></tr><tr><th>5</th><td>Olecko</td><td>2020-11-21</td><td>0.6 mm</td></tr><tr><th>6</th><td>Olecko</td><td>2020-11-22</td><td>1.0 mm</td></tr><tr><th>7</th><td>Ełk</td><td>2020-11-16</td><td>3.9 mm</td></tr><tr><th>8</th><td>Ełk</td><td>2020-11-19</td><td>1.2 mm</td></tr><tr><th>9</th><td>Ełk</td><td>2020-11-20</td><td>2.0 mm</td></tr><tr><th>10</th><td>Ełk</td><td>2020-11-22</td><td>2.0 mm</td></tr></tbody></table>"
|
||||
],
|
||||
"text/latex": [
|
||||
"\\begin{tabular}{r|ccc}\n",
|
||||
"\t& city & date & rainfall\\\\\n",
|
||||
"\t\\hline\n",
|
||||
"\t& String & Date & Quantit…\\\\\n",
|
||||
"\t\\hline\n",
|
||||
"\t1 & Olecko & 2020-11-16 & 2.9 mm \\\\\n",
|
||||
"\t2 & Olecko & 2020-11-17 & 4.1 mm \\\\\n",
|
||||
"\t3 & Olecko & 2020-11-19 & 4.3 mm \\\\\n",
|
||||
"\t4 & Olecko & 2020-11-20 & 2.0 mm \\\\\n",
|
||||
"\t5 & Olecko & 2020-11-21 & 0.6 mm \\\\\n",
|
||||
"\t6 & Olecko & 2020-11-22 & 1.0 mm \\\\\n",
|
||||
"\t7 & Ełk & 2020-11-16 & 3.9 mm \\\\\n",
|
||||
"\t8 & Ełk & 2020-11-19 & 1.2 mm \\\\\n",
|
||||
"\t9 & Ełk & 2020-11-20 & 2.0 mm \\\\\n",
|
||||
"\t10 & Ełk & 2020-11-22 & 2.0 mm \\\\\n",
|
||||
"\\end{tabular}\n"
|
||||
],
|
||||
"text/plain": [
|
||||
"\u001b[1m10×3 DataFrame\u001b[0m\n",
|
||||
"\u001b[1m Row \u001b[0m│\u001b[1m city \u001b[0m\u001b[1m date \u001b[0m\u001b[1m rainfall \u001b[0m\n",
|
||||
"\u001b[1m \u001b[0m│\u001b[90m String \u001b[0m\u001b[90m Date \u001b[0m\u001b[90m Quantity… \u001b[0m\n",
|
||||
"─────┼───────────────────────────────\n",
|
||||
" 1 │ Olecko 2020-11-16 2.9 mm\n",
|
||||
" 2 │ Olecko 2020-11-17 4.1 mm\n",
|
||||
" 3 │ Olecko 2020-11-19 4.3 mm\n",
|
||||
" 4 │ Olecko 2020-11-20 2.0 mm\n",
|
||||
" 5 │ Olecko 2020-11-21 0.6 mm\n",
|
||||
" 6 │ Olecko 2020-11-22 1.0 mm\n",
|
||||
" 7 │ Ełk 2020-11-16 3.9 mm\n",
|
||||
" 8 │ Ełk 2020-11-19 1.2 mm\n",
|
||||
" 9 │ Ełk 2020-11-20 2.0 mm\n",
|
||||
" 10 │ Ełk 2020-11-22 2.0 mm"
|
||||
]
|
||||
},
|
||||
"execution_count": 3,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"transform!(rainfall_long, :rainfall => x -> x .* u\"mm\", renamecols=false)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"With `renamecols=false` we left the name of the transformed column unchanged when we did an in-place update of the data frame using the `transform!` function."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"It would be nice to see the data in a wide format, so that each city is represented by a single column. We can achieve this using the `unstack` function:"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 4,
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/html": [
|
||||
"<table class=\"data-frame\"><thead><tr><th></th><th>date</th><th>Olecko</th><th>Ełk</th></tr><tr><th></th><th>Date</th><th>Quantit…?</th><th>Quantit…?</th></tr></thead><tbody><p>6 rows × 3 columns</p><tr><th>1</th><td>2020-11-16</td><td>2.9 mm</td><td>3.9 mm</td></tr><tr><th>2</th><td>2020-11-17</td><td>4.1 mm</td><td><em>missing</em></td></tr><tr><th>3</th><td>2020-11-19</td><td>4.3 mm</td><td>1.2 mm</td></tr><tr><th>4</th><td>2020-11-20</td><td>2.0 mm</td><td>2.0 mm</td></tr><tr><th>5</th><td>2020-11-21</td><td>0.6 mm</td><td><em>missing</em></td></tr><tr><th>6</th><td>2020-11-22</td><td>1.0 mm</td><td>2.0 mm</td></tr></tbody></table>"
|
||||
],
|
||||
"text/latex": [
|
||||
"\\begin{tabular}{r|ccc}\n",
|
||||
"\t& date & Olecko & Ełk\\\\\n",
|
||||
"\t\\hline\n",
|
||||
"\t& Date & Quantit…? & Quantit…?\\\\\n",
|
||||
"\t\\hline\n",
|
||||
"\t1 & 2020-11-16 & 2.9 mm & 3.9 mm \\\\\n",
|
||||
"\t2 & 2020-11-17 & 4.1 mm & \\emph{missing} \\\\\n",
|
||||
"\t3 & 2020-11-19 & 4.3 mm & 1.2 mm \\\\\n",
|
||||
"\t4 & 2020-11-20 & 2.0 mm & 2.0 mm \\\\\n",
|
||||
"\t5 & 2020-11-21 & 0.6 mm & \\emph{missing} \\\\\n",
|
||||
"\t6 & 2020-11-22 & 1.0 mm & 2.0 mm \\\\\n",
|
||||
"\\end{tabular}\n"
|
||||
],
|
||||
"text/plain": [
|
||||
"\u001b[1m6×3 DataFrame\u001b[0m\n",
|
||||
"\u001b[1m Row \u001b[0m│\u001b[1m date \u001b[0m\u001b[1m Olecko \u001b[0m\u001b[1m Ełk \u001b[0m\n",
|
||||
"\u001b[1m \u001b[0m│\u001b[90m Date \u001b[0m\u001b[90m Quantity…? \u001b[0m\u001b[90m Quantity…? \u001b[0m\n",
|
||||
"─────┼──────────────────────────────────────\n",
|
||||
" 1 │ 2020-11-16 2.9 mm 3.9 mm\n",
|
||||
" 2 │ 2020-11-17 4.1 mm \u001b[90m missing \u001b[0m\n",
|
||||
" 3 │ 2020-11-19 4.3 mm 1.2 mm\n",
|
||||
" 4 │ 2020-11-20 2.0 mm 2.0 mm\n",
|
||||
" 5 │ 2020-11-21 0.6 mm \u001b[90m missing \u001b[0m\n",
|
||||
" 6 │ 2020-11-22 1.0 mm 2.0 mm"
|
||||
]
|
||||
},
|
||||
"execution_count": 4,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"rainfall_wide = unstack(rainfall_long, :date, :city, :rainfall)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"We can see that the \"gaps\" in the rainfall information in the `\"Ełk\"` column were automatically filled with `missing`."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"There is also a `stack` function that does the reverse: transforms a data frame from wide to long format."
|
||||
]
|
||||
},
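{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a minimal sketch (not executed in this notebook), going back from `rainfall_wide` to the long format could look like the snippet below. The `variable_name` and `value_name` keyword arguments restore the original column names, and `dropmissing` removes the rows that `unstack` filled with `missing`. The name `rainfall_long2` is only illustrative.\n",
"\n",
"```julia\n",
"# stack every column except :date; the old column names become values of :city\n",
"rainfall_long2 = stack(rainfall_wide, Not(:date),\n",
"                       variable_name=:city, value_name=:rainfall)\n",
"# drop the rows that unstack introduced as `missing`\n",
"dropmissing(rainfall_long2, :rainfall)\n",
"```"
]
},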
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Also note that one of the cities is `\"Ełk\"`, whose name contains the non-ASCII character `ł`. This is not a problem in `DataFrames.jl`. As an exercise, let us extract this column in two equivalent ways:"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 5,
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"6-element Vector{Union{Missing, Quantity{Float64, 𝐋, Unitful.FreeUnits{(mm,), 𝐋, nothing}}}}:\n",
|
||||
" 3.9 mm\n",
|
||||
" missing\n",
|
||||
" 1.2 mm\n",
|
||||
" 2.0 mm\n",
|
||||
" missing\n",
|
||||
" 2.0 mm"
|
||||
]
|
||||
},
|
||||
"execution_count": 5,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"rainfall_wide.Ełk"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 6,
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"6-element Vector{Union{Missing, Quantity{Float64, 𝐋, Unitful.FreeUnits{(mm,), 𝐋, nothing}}}}:\n",
|
||||
" 3.9 mm\n",
|
||||
" missing\n",
|
||||
" 1.2 mm\n",
|
||||
" 2.0 mm\n",
|
||||
" missing\n",
|
||||
" 2.0 mm"
|
||||
]
|
||||
},
|
||||
"execution_count": 6,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"rainfall_wide.\"Ełk\""
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Looking at the data, we note that there are still gaps in it: one of the days, 2020-11-18, is missing entirely because no rainfall is forecast for it.\n",
|
||||
"\n",
|
||||
"It would be nice to have information for all days in the considered period. Here is the way to do it:"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 7,
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/html": [
|
||||
"<table class=\"data-frame\"><thead><tr><th></th><th>date</th></tr><tr><th></th><th>Date</th></tr></thead><tbody><p>7 rows × 1 columns</p><tr><th>1</th><td>2020-11-16</td></tr><tr><th>2</th><td>2020-11-17</td></tr><tr><th>3</th><td>2020-11-18</td></tr><tr><th>4</th><td>2020-11-19</td></tr><tr><th>5</th><td>2020-11-20</td></tr><tr><th>6</th><td>2020-11-21</td></tr><tr><th>7</th><td>2020-11-22</td></tr></tbody></table>"
|
||||
],
|
||||
"text/latex": [
|
||||
"\\begin{tabular}{r|c}\n",
|
||||
"\t& date\\\\\n",
|
||||
"\t\\hline\n",
|
||||
"\t& Date\\\\\n",
|
||||
"\t\\hline\n",
|
||||
"\t1 & 2020-11-16 \\\\\n",
|
||||
"\t2 & 2020-11-17 \\\\\n",
|
||||
"\t3 & 2020-11-18 \\\\\n",
|
||||
"\t4 & 2020-11-19 \\\\\n",
|
||||
"\t5 & 2020-11-20 \\\\\n",
|
||||
"\t6 & 2020-11-21 \\\\\n",
|
||||
"\t7 & 2020-11-22 \\\\\n",
|
||||
"\\end{tabular}\n"
|
||||
],
|
||||
"text/plain": [
|
||||
"\u001b[1m7×1 DataFrame\u001b[0m\n",
|
||||
"\u001b[1m Row \u001b[0m│\u001b[1m date \u001b[0m\n",
|
||||
"\u001b[1m \u001b[0m│\u001b[90m Date \u001b[0m\n",
|
||||
"─────┼────────────\n",
|
||||
" 1 │ 2020-11-16\n",
|
||||
" 2 │ 2020-11-17\n",
|
||||
" 3 │ 2020-11-18\n",
|
||||
" 4 │ 2020-11-19\n",
|
||||
" 5 │ 2020-11-20\n",
|
||||
" 6 │ 2020-11-21\n",
|
||||
" 7 │ 2020-11-22"
|
||||
]
|
||||
},
|
||||
"execution_count": 7,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"all_days = DataFrame(date=Date.(2020, 11, 16:22))"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 8,
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/html": [
|
||||
"<table class=\"data-frame\"><thead><tr><th></th><th>date</th><th>Olecko</th><th>Ełk</th></tr><tr><th></th><th>Date</th><th>Quantit…</th><th>Quantit…</th></tr></thead><tbody><p>7 rows × 3 columns</p><tr><th>1</th><td>2020-11-16</td><td>2.9 mm</td><td>3.9 mm</td></tr><tr><th>2</th><td>2020-11-17</td><td>4.1 mm</td><td>0.0 mm</td></tr><tr><th>3</th><td>2020-11-19</td><td>4.3 mm</td><td>1.2 mm</td></tr><tr><th>4</th><td>2020-11-20</td><td>2.0 mm</td><td>2.0 mm</td></tr><tr><th>5</th><td>2020-11-21</td><td>0.6 mm</td><td>0.0 mm</td></tr><tr><th>6</th><td>2020-11-22</td><td>1.0 mm</td><td>2.0 mm</td></tr><tr><th>7</th><td>2020-11-18</td><td>0.0 mm</td><td>0.0 mm</td></tr></tbody></table>"
|
||||
],
|
||||
"text/latex": [
|
||||
"\\begin{tabular}{r|ccc}\n",
|
||||
"\t& date & Olecko & Ełk\\\\\n",
|
||||
"\t\\hline\n",
|
||||
"\t& Date & Quantit… & Quantit…\\\\\n",
|
||||
"\t\\hline\n",
|
||||
"\t1 & 2020-11-16 & 2.9 mm & 3.9 mm \\\\\n",
|
||||
"\t2 & 2020-11-17 & 4.1 mm & 0.0 mm \\\\\n",
|
||||
"\t3 & 2020-11-19 & 4.3 mm & 1.2 mm \\\\\n",
|
||||
"\t4 & 2020-11-20 & 2.0 mm & 2.0 mm \\\\\n",
|
||||
"\t5 & 2020-11-21 & 0.6 mm & 0.0 mm \\\\\n",
|
||||
"\t6 & 2020-11-22 & 1.0 mm & 2.0 mm \\\\\n",
|
||||
"\t7 & 2020-11-18 & 0.0 mm & 0.0 mm \\\\\n",
|
||||
"\\end{tabular}\n"
|
||||
],
|
||||
"text/plain": [
|
||||
"\u001b[1m7×3 DataFrame\u001b[0m\n",
|
||||
"\u001b[1m Row \u001b[0m│\u001b[1m date \u001b[0m\u001b[1m Olecko \u001b[0m\u001b[1m Ełk \u001b[0m\n",
|
||||
"\u001b[1m \u001b[0m│\u001b[90m Date \u001b[0m\u001b[90m Quantity… \u001b[0m\u001b[90m Quantity… \u001b[0m\n",
|
||||
"─────┼──────────────────────────────────\n",
|
||||
" 1 │ 2020-11-16 2.9 mm 3.9 mm\n",
|
||||
" 2 │ 2020-11-17 4.1 mm 0.0 mm\n",
|
||||
" 3 │ 2020-11-19 4.3 mm 1.2 mm\n",
|
||||
" 4 │ 2020-11-20 2.0 mm 2.0 mm\n",
|
||||
" 5 │ 2020-11-21 0.6 mm 0.0 mm\n",
|
||||
" 6 │ 2020-11-22 1.0 mm 2.0 mm\n",
|
||||
" 7 │ 2020-11-18 0.0 mm 0.0 mm"
|
||||
]
|
||||
},
|
||||
"execution_count": 8,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"@pipe leftjoin(all_days, rainfall_wide, on=:date) |>\n",
|
||||
" coalesce.(_, 0.0u\"mm\")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Note that we additionally broadcast `coalesce` over the whole data frame returned by `leftjoin` to replace all `missing` values with `0.0u\"mm\"`, since in this case `missing` means that no rain is forecast for that day.\n",
|
||||
"\n",
|
||||
"This was safe to do here, as we knew that the `:date` column does not contain missing values. In particular, note that `leftjoin` would error by default if we tried to perform a join on a column that contains `missing` values (use the `matchmissing` keyword argument in joins to change this behavior)."
|
||||
]
|
||||
},
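{
"cell_type": "markdown",
"metadata": {},
"source": [
"Two small follow-ups, given as a sketch rather than as executed cells. First, the joined table above is not ordered by date (2020-11-18 ends up in the last row), so one option is to append `sort` to the pipeline. Second, the toy data frames `left` and `right` (hypothetical, not part of the tutorial data) illustrate `matchmissing=:equal`, which makes `missing` keys match each other instead of raising an error.\n",
"\n",
"```julia\n",
"# same pipeline as above, with the rows ordered by date at the end\n",
"@pipe leftjoin(all_days, rainfall_wide, on=:date) |>\n",
"      coalesce.(_, 0.0u\"mm\") |>\n",
"      sort(_, :date)\n",
"\n",
"# hypothetical data frames whose key column contains `missing`\n",
"left = DataFrame(id=[1, missing], x=[\"a\", \"b\"])\n",
"right = DataFrame(id=[1, missing], y=[10, 20])\n",
"# without matchmissing=:equal this join would throw an error\n",
"leftjoin(left, right, on=:id, matchmissing=:equal)\n",
"```"
]
},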
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### Conclusions"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Before we finish let us summarize the major functions that `DataFrames.jl` provides:\n",
|
||||
"1. A data frame is a matrix-like data structure. You can index it just like a matrix. The differences are:\n",
|
||||
" - you can use strings or `Symbol`s to select columns\n",
|
||||
" - if you use `!` as the row selector, you get the whole column of the data frame without copying\n",
|
||||
"2. You can quickly summarize the contents of a data frame using the `describe` function\n",
|
||||
"3. You can add rows to a data frame in-place using `push!`; similarly, `append!` lets you add multiple rows at once (`repeat`/`repeat!`, `hcat`, and `vcat` are also provided)\n",
|
||||
"4. You can work on a grouped data frame that is created using the `groupby` function. It is a view and works as if you had created a lookup index into a data frame.\n",
|
||||
"5. There are `select`/`select!`/`transform`/`transform!`/`combine` functions that allow you to quickly transform/aggregate columns of a data frame or grouped data frame; there are also `mapcols`/`mapcols!` functions for quick aggregation of columns of a data frame\n",
|
||||
"6. You can filter rows of a data frame using `filter` and `filter!` functions (also `subset` and `subset!` starting from version 1.0)\n",
|
||||
"7. Use `sort` and `sort!` functions to sort data frames\n",
|
||||
"8. You can join multiple data frames using `innerjoin`, `outerjoin`, `leftjoin`, `rightjoin`, `semijoin`, `antijoin`, and `crossjoin` functions (they work as you would expect if you know SQL)\n",
|
||||
"9. If you want to iterate over rows or columns of a data frame, use the `eachrow` and `eachcol` functions (we have not discussed them, but they work just like their counterparts in Julia Base)\n",
|
||||
"10. You can change names of columns in a data frame using `rename` and `rename!` functions; to get names of columns of a data frame use `names` (strings) or `propertynames` (`Symbol`s)\n",
|
||||
"11. To get the number of rows and columns of a data frame use `nrow` and `ncol` functions\n",
|
||||
"12. To flatten nested columns of a data frame use `flatten`\n",
|
||||
"13. You can easily allow/disallow missing values in columns of a data frame using `allowmissing`/`allowmissing!`/`disallowmissing`/`disallowmissing!` functions\n",
|
||||
"14. You can drop rows with missing data with `dropmissing`/`dropmissing!` functions\n",
|
||||
"15. You can switch between [long and wide](https://en.wikipedia.org/wiki/Wide_and_narrow_data) representation of a data frame using `stack` and `unstack`"
|
||||
]
|
||||
},
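{
"cell_type": "markdown",
"metadata": {},
"source": [
"To illustrate the indexing rules from item 1, here is a small sketch with a hypothetical data frame `df` (not part of the tutorial data): `df[!, :x]` returns the stored column itself, while `df[:, :x]` returns a copy.\n",
"\n",
"```julia\n",
"df = DataFrame(x=[1, 2, 3])\n",
"df[!, :x] === df.x   # true: the same vector, no copy is made\n",
"df[:, :x] === df.x   # false: a freshly allocated copy\n",
"df[!, \"x\"] === df.x  # true: columns can be selected with strings as well\n",
"```"
]
},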
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Additionally, we have covered `freqtable` from FreqTables.jl, `@pipe` from Pipe.jl, and `lm` from GLM.jl, packages that are often useful when wrangling data.\n",
|
||||
"\n",
|
||||
"You can use many formats to store and read data frames; we have discussed the CSV.jl and Arrow.jl packages, which provide such functionality.\n",
|
||||
"\n",
|
||||
"Finally, we have shown how to integrate DataFrames.jl with PyPlot.jl (for plotting) and Unitful.jl (for units)."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Of course, this course was just an introduction.\n",
|
||||
"\n",
|
||||
"You can find a fuller overview of the functionality of DataFrames.jl in:\n",
|
||||
"* an official manual at https://juliadata.github.io/DataFrames.jl/stable/\n",
|
||||
"* a tutorial covering all the functionality of DataFrames.jl at https://github.com/bkamins/Julia-DataFrames-Tutorial\n",
|
||||
"* documentation strings of the respective functions"
|
||||
]
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
"kernelspec": {
|
||||
"display_name": "Julia 1.6.1",
|
||||
"language": "julia",
|
||||
"name": "julia-1.6"
|
||||
},
|
||||
"language_info": {
|
||||
"file_extension": ".jl",
|
||||
"mimetype": "application/julia",
|
||||
"name": "julia",
|
||||
"version": "1.6.1"
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 4
|
||||
}