196 lines
5.4 KiB
Markdown
196 lines
5.4 KiB
Markdown
|
\[ [Index](index.md) | [Exercise 1.6](ex1_6.md) | [Exercise 2.2](ex2_2.md) \]
|
||
|
|
||
|
# Exercise 2.1
|
||
|
|
||
|
*Objectives:*
|
||
|
|
||
|
- Figure out the most memory-efficient way to store a lot of data.
|
||
|
- Learn about different ways of representing records including tuples,
|
||
|
dictionaries, classes, and named tuples.
|
||
|
|
||
|
In this exercise, we look at different choices for representing data
|
||
|
structures with an eye towards memory use and efficiency. A lot of
|
||
|
people use Python to perform various kinds of data analysis so knowing
|
||
|
about different options and their tradeoffs is useful information.
|
||
|
|
||
|
## (a) Stuck on the bus
|
||
|
|
||
|
The file `Data/ctabus.csv` is a CSV file containing
|
||
|
daily ridership data for the Chicago Transit Authority (CTA) bus
|
||
|
system from January 1, 2001 to August 31, 2013. It contains
|
||
|
approximately 577000 rows of data. Use Python to view a few lines
|
||
|
of data to see what it looks like:
|
||
|
|
||
|
```python
|
||
|
>>> f = open('Data/ctabus.csv')
|
||
|
>>> next(f)
|
||
|
'route,date,daytype,rides\n'
|
||
|
>>> next(f)
|
||
|
'3,01/01/2001,U,7354\n'
|
||
|
>>> next(f)
|
||
|
'4,01/01/2001,U,9288\n'
|
||
|
>>>
|
||
|
```
|
||
|
|
||
|
There are 4 columns of data.
|
||
|
|
||
|
- route: Column 0. The bus route name.
|
||
|
- date: Column 1. A date string of the form MM/DD/YYYY.
|
||
|
- daytype: Column 2. A day type code (U=Sunday/Holiday, A=Saturday, W=Weekday)
|
||
|
- rides: Column 3. Total number of riders (integer)
|
||
|
|
||
|
The `rides` column records the total number of people who boarded a
|
||
|
bus on that route on a given day. Thus, from the example, 7354 people
|
||
|
rode the number 3 bus on January 1, 2001.
|
||
|
|
||
|
## (b) Basic memory use of text
|
||
|
|
||
|
Let's get a baseline of the memory required to work with this
|
||
|
datafile. First, restart Python and try a very simple experiment of
|
||
|
simply grabbing the file and storing its data in a single string:
|
||
|
|
||
|
```python
|
||
|
>>> # --- RESTART
|
||
|
>>> import tracemalloc
|
||
|
>>> f = open('Data/ctabus.csv')
|
||
|
>>> tracemalloc.start()
|
||
|
>>> data = f.read()
|
||
|
>>> len(data)
|
||
|
12361039
|
||
|
>>> current, peak = tracemalloc.get_traced_memory()
|
||
|
>>> current
|
||
|
12369664
|
||
|
>>> peak
|
||
|
24730766
|
||
|
>>>
|
||
|
```
|
||
|
|
||
|
Your results might vary somewhat, but you should see current
|
||
|
memory use in the range of 12MB with a peak of 24MB.
|
||
|
|
||
|
What happens if you read the entire file into a list of strings
|
||
|
instead? Restart Python and try this:
|
||
|
|
||
|
```python
|
||
|
>>> # --- RESTART
|
||
|
>>> import tracemalloc
|
||
|
>>> f = open('Data/ctabus.csv')
|
||
|
>>> tracemalloc.start()
|
||
|
>>> lines = f.readlines()
|
||
|
>>> len(lines)
|
||
|
577564
|
||
|
>>> current, peak = tracemalloc.get_traced_memory()
|
||
|
>>> current
|
||
|
45828030
|
||
|
>>> peak
|
||
|
45867371
|
||
|
>>>
|
||
|
```
|
||
|
|
||
|
You should see the memory use go up significantly into the range of 40-50MB.
|
||
|
Point to ponder: what might be the source of that extra overhead?
|
||
|
|
||
|
## (c) A List of Tuples
|
||
|
|
||
|
In practice, you might read the data into a list and convert each line
|
||
|
into some other data structure. Here is a program `readrides.py` that
|
||
|
reads the entire file into a list of tuples using the `csv` module:
|
||
|
|
||
|
```python
|
||
|
# readrides.py
|
||
|
|
||
|
import csv
|
||
|
|
||
|
def read_rides_as_tuples(filename):
|
||
|
'''
|
||
|
Read the bus ride data as a list of tuples
|
||
|
'''
|
||
|
records = []
|
||
|
with open(filename) as f:
|
||
|
rows = csv.reader(f)
|
||
|
headings = next(rows) # Skip headers
|
||
|
for row in rows:
|
||
|
route = row[0]
|
||
|
date = row[1]
|
||
|
daytype = row[2]
|
||
|
rides = int(row[3])
|
||
|
record = (route, date, daytype, rides)
|
||
|
records.append(record)
|
||
|
return records
|
||
|
|
||
|
if __name__ == '__main__':
|
||
|
import tracemalloc
|
||
|
tracemalloc.start()
|
||
|
rows = read_rides_as_tuples('Data/ctabus.csv')
|
||
|
print('Memory Use: Current %d, Peak %d' % tracemalloc.get_traced_memory())
|
||
|
```
|
||
|
|
||
|
Run this program using `python3 -i readrides.py` and look at the
|
||
|
resulting contents of `rows`. You should get a list of tuples like
|
||
|
this:
|
||
|
|
||
|
```python
|
||
|
>>> len(rows)
|
||
|
577563
|
||
|
>>> rows[0]
|
||
|
('3', '01/01/2001', 'U', 7354)
|
||
|
>>> rows[1]
|
||
|
('4', '01/01/2001', 'U', 9288)
|
||
|
```
|
||
|
|
||
|
Look at the resulting memory use. It should be substantially higher
|
||
|
than in part (b).
|
||
|
|
||
|
## (d) Memory Use of Other Data Structures
|
||
|
|
||
|
Python has many different choices for representing data structures.
|
||
|
For example:
|
||
|
|
||
|
```python
|
||
|
# A tuple
|
||
|
row = (route, date, daytype, rides)
|
||
|
|
||
|
# A dictionary
|
||
|
row = {
|
||
|
'route': route,
|
||
|
'date': date,
|
||
|
'daytype': daytype,
|
||
|
'rides': rides,
|
||
|
}
|
||
|
|
||
|
# A class
|
||
|
class Row:
|
||
|
def __init__(self, route, date, daytype, rides):
|
||
|
self.route = route
|
||
|
self.date = date
|
||
|
self.daytype = daytype
|
||
|
self.rides = rides
|
||
|
|
||
|
# A named tuple
|
||
|
from collections import namedtuple
|
||
|
Row = namedtuple('Row', ['route', 'date', 'daytype', 'rides'])
|
||
|
|
||
|
# A class with __slots__
|
||
|
class Row:
|
||
|
__slots__ = ['route', 'date', 'daytype', 'rides']
|
||
|
def __init__(self, route, date, daytype, rides):
|
||
|
self.route = route
|
||
|
self.date = date
|
||
|
self.daytype = daytype
|
||
|
self.rides = rides
|
||
|
```
|
||
|
Your task is as follows: Create different versions of the `read_rides()` function
|
||
|
that use each of these data structures to represent a single row of data.
|
||
|
Then, find out the resulting memory use of each option. Find out which
|
||
|
approach offers the most efficient storage if you were working with a lot
|
||
|
of data all at once.
|
||
|
|
||
|
\[ [Solution](soln2_1.md) | [Index](index.md) | [Exercise 1.6](ex1_6.md) | [Exercise 2.2](ex2_2.md) \]
|
||
|
|
||
|
----
|
||
|
`>>>` Advanced Python Mastery
|
||
|
`...` A course by [dabeaz](https://www.dabeaz.com)
|
||
|
`...` Copyright 2007-2023
|
||
|
|
||
|
. This work is licensed under a [Creative Commons Attribution-ShareAlike 4.0 International License](http://creativecommons.org/licenses/by-sa/4.0/)
|