\[ [Index](index.md) | [Exercise 1.6](ex1_6.md) | [Exercise 2.2](ex2_2.md) \]

# Exercise 2.1

*Objectives:*

- Figure out the most memory-efficient way to store a lot of data.
- Learn about different ways of representing records, including tuples,
  dictionaries, classes, and named tuples.

In this exercise, we look at different choices for representing data
structures with an eye toward memory use and efficiency. A lot of
people use Python to perform various kinds of data analysis, so knowing
about the different options and their tradeoffs is useful information.

## (a) Stuck on the bus

The file `Data/ctabus.csv` is a CSV file containing
daily ridership data for the Chicago Transit Authority (CTA) bus
system from January 1, 2001 to August 31, 2013. It contains
approximately 577,000 rows of data. Use Python to view a few lines
of data to see what it looks like:

```python
>>> f = open('Data/ctabus.csv')
>>> next(f)
'route,date,daytype,rides\n'
>>> next(f)
'3,01/01/2001,U,7354\n'
>>> next(f)
'4,01/01/2001,U,9288\n'
>>>
```

There are 4 columns of data:

- route: Column 0. The bus route name.
- date: Column 1. A date string of the form MM/DD/YYYY.
- daytype: Column 2. A day type code (U=Sunday/Holiday, A=Saturday, W=Weekday).
- rides: Column 3. The total number of riders (integer).

The `rides` column records the total number of people who boarded a
bus on that route on a given day. Thus, from the example, 7354 people
rode the number 3 bus on January 1, 2001.
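
For instance, one raw line of the file can be split into its four fields like this (a minimal sketch; the sample line is taken from the session above):

```python
# Parse one raw line from ctabus.csv into its four fields.
# The sample line comes from the session shown above.
line = '3,01/01/2001,U,7354\n'
route, date, daytype, rides = line.strip().split(',')
rides = int(rides)   # convert the ride count from string to integer
print(route, date, daytype, rides)   # -> 3 01/01/2001 U 7354
```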

## (b) Basic memory use of text

Let's get a baseline of the memory required to work with this
data file. First, restart Python and try a very simple experiment of
simply grabbing the file and storing its data in a single string:

```python
>>> # --- RESTART
>>> import tracemalloc
>>> f = open('Data/ctabus.csv')
>>> tracemalloc.start()
>>> data = f.read()
>>> len(data)
12361039
>>> current, peak = tracemalloc.get_traced_memory()
>>> current
12369664
>>> peak
24730766
>>>
```

Your results might vary somewhat, but you should see current
memory use in the range of 12MB with a peak of 24MB.

What happens if you read the entire file into a list of strings
instead? Restart Python and try this:

```python
>>> # --- RESTART
>>> import tracemalloc
>>> f = open('Data/ctabus.csv')
>>> tracemalloc.start()
>>> lines = f.readlines()
>>> len(lines)
577564
>>> current, peak = tracemalloc.get_traced_memory()
>>> current
45828030
>>> peak
45867371
>>>
```

You should see the memory use go up significantly, into the range of 40-50MB.
Point to ponder: what might be the source of that extra overhead?
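
One likely contributor: each line now becomes a separate `str` object, and every Python object carries a fixed bookkeeping cost beyond the characters it holds. A rough sketch with `sys.getsizeof()` (the exact numbers vary by Python version and platform):

```python
import sys

# A typical line from the file, stored as its own string object.
line = '3,01/01/2001,U,7354\n'
print(sys.getsizeof(line))   # size of the string object in bytes
print(len(line))             # number of characters it actually holds

# The difference is per-object overhead.  Multiplied across ~577,000
# separate line objects (plus the list that holds references to them),
# it adds up to the extra tens of megabytes seen above.
overhead = sys.getsizeof(line) - len(line)
print(overhead)
```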

## (c) A List of Tuples

In practice, you might read the data into a list and convert each line
into some other data structure. Here is a program `readrides.py` that
reads the entire file into a list of tuples using the `csv` module:

```python
# readrides.py

import csv

def read_rides_as_tuples(filename):
    '''
    Read the bus ride data as a list of tuples
    '''
    records = []
    with open(filename) as f:
        rows = csv.reader(f)
        headings = next(rows)     # Skip headers
        for row in rows:
            route = row[0]
            date = row[1]
            daytype = row[2]
            rides = int(row[3])
            record = (route, date, daytype, rides)
            records.append(record)
    return records

if __name__ == '__main__':
    import tracemalloc
    tracemalloc.start()
    rows = read_rides_as_tuples('Data/ctabus.csv')
    print('Memory Use: Current %d, Peak %d' % tracemalloc.get_traced_memory())
```

Run this program using `python3 -i readrides.py` and look at the
resulting contents of `rows`. You should get a list of tuples like
this:

```python
>>> len(rows)
577563
>>> rows[0]
('3', '01/01/2001', 'U', 7354)
>>> rows[1]
('4', '01/01/2001', 'U', 9288)
```

Look at the resulting memory use. It should be substantially higher
than in part (b).
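
To get a feel for where the extra memory goes, you can inspect a single record with `sys.getsizeof()`. Keep in mind that `getsizeof()` reports only the container's own size, not the objects it refers to (a sketch; exact numbers vary by platform):

```python
import sys

# One row, as built by read_rides_as_tuples() above.
record = ('3', '01/01/2001', 'U', 7354)
print(sys.getsizeof(record))   # size of the tuple itself (references only)

# Each field is a separate object with its own per-object overhead,
# so a rough per-record footprint is the tuple plus its contents.
total = sys.getsizeof(record) + sum(sys.getsizeof(x) for x in record)
print(total)
```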

## (d) Memory Use of Other Data Structures

Python has many different choices for representing data structures.
For example:

```python
# A tuple
row = (route, date, daytype, rides)

# A dictionary
row = {
    'route': route,
    'date': date,
    'daytype': daytype,
    'rides': rides,
}

# A class
class Row:
    def __init__(self, route, date, daytype, rides):
        self.route = route
        self.date = date
        self.daytype = daytype
        self.rides = rides

# A named tuple
from collections import namedtuple
Row = namedtuple('Row', ['route', 'date', 'daytype', 'rides'])

# A class with __slots__
class Row:
    __slots__ = ['route', 'date', 'daytype', 'rides']
    def __init__(self, route, date, daytype, rides):
        self.route = route
        self.date = date
        self.daytype = daytype
        self.rides = rides
```

Your task is as follows: create different versions of the `read_rides()`
function that use each of these data structures to represent a single row
of data. Then measure the resulting memory use of each option and
determine which approach offers the most efficient storage if you were
working with a lot of data all at once.

\[ [Solution](soln2_1.md) | [Index](index.md) | [Exercise 1.6](ex1_6.md) | [Exercise 2.2](ex2_2.md) \]

----
`>>>` Advanced Python Mastery
`...` A course by [dabeaz](https://www.dabeaz.com)
`...` Copyright 2007-2023

This work is licensed under a [Creative Commons Attribution-ShareAlike 4.0 International License](http://creativecommons.org/licenses/by-sa/4.0/)