Exercise 2.1
Objectives:
- Figure out the most memory-efficient way to store a lot of data.
- Learn about different ways of representing records including tuples, dictionaries, classes, and named tuples.
In this exercise, we look at different choices for representing data structures with an eye towards memory use and efficiency. A lot of people use Python to perform various kinds of data analysis so knowing about different options and their tradeoffs is useful information.
(a) Stuck on the bus
The file Data/ctabus.csv is a CSV file containing daily ridership data
for the Chicago Transit Authority (CTA) bus system from January 1, 2001
to August 31, 2013. It contains approximately 577,000 rows of data. Use
Python to view a few lines of data to see what it looks like:
>>> f = open('Data/ctabus.csv')
>>> next(f)
'route,date,daytype,rides\n'
>>> next(f)
'3,01/01/2001,U,7354\n'
>>> next(f)
'4,01/01/2001,U,9288\n'
>>>
There are 4 columns of data.
- route: Column 0. The bus route name.
- date: Column 1. A date string of the form MM/DD/YYYY.
- daytype: Column 2. A day type code (U=Sunday/Holiday, A=Saturday, W=Weekday)
- rides: Column 3. Total number of riders (integer)
The rides column records the total number of people who boarded a bus
on that route on a given day. Thus, from the example, 7354 people rode
the number 3 bus on January 1, 2001.
(b) Basic memory use of text
Let’s get a baseline of the memory required to work with this data file. First, restart Python and try a simple experiment: read the entire file and store its contents in a single string:
>>> # --- RESTART
>>> import tracemalloc
>>> f = open('Data/ctabus.csv')
>>> tracemalloc.start()
>>> data = f.read()
>>> len(data)
12361039
>>> current, peak = tracemalloc.get_traced_memory()
>>> current
12369664
>>> peak
24730766
>>>
Your results might vary somewhat, but you should see current memory use in the range of 12MB with a peak of 24MB.
What happens if you read the entire file into a list of strings instead? Restart Python and try this:
>>> # --- RESTART
>>> import tracemalloc
>>> f = open('Data/ctabus.csv')
>>> tracemalloc.start()
>>> lines = f.readlines()
>>> len(lines)
577564
>>> current, peak = tracemalloc.get_traced_memory()
>>> current
45828030
>>> peak
45867371
>>>
You should see the memory use go up significantly into the range of 40-50MB. Point to ponder: what might be the source of that extra overhead?
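One way to investigate (a sketch, not part of the exercise): sys.getsizeof() reports the memory footprint of a single object, and every Python string carries a fixed per-object overhead beyond its characters. The exact byte counts vary by Python version and build.

```python
import sys

# A typical line from the file (taken from the example above)
line = '3,01/01/2001,U,7354\n'

# getsizeof() reports the size of the string object itself; it is
# noticeably larger than the number of characters because every str
# carries fixed per-object overhead
print(len(line))                       # number of characters
print(sys.getsizeof(line))             # size of the str object in bytes

# Multiplied across ~577,000 separate line objects, that per-object
# overhead accounts for much of the jump from one big string to a
# list of strings
overhead = sys.getsizeof(line) - len(line)
print(overhead * 577000)
```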
(c) A List of Tuples
In practice, you might read the data into a list and convert each line
into some other data structure. Here is a program readrides.py that
reads the entire file into a list of tuples using the csv module:
# readrides.py

import csv

def read_rides_as_tuples(filename):
    '''
    Read the bus ride data as a list of tuples
    '''
    records = []
    with open(filename) as f:
        rows = csv.reader(f)
        headings = next(rows)     # Skip headers
        for row in rows:
            route = row[0]
            date = row[1]
            daytype = row[2]
            rides = int(row[3])
            record = (route, date, daytype, rides)
            records.append(record)
    return records

if __name__ == '__main__':
    import tracemalloc
    tracemalloc.start()
    rows = read_rides_as_tuples('Data/ctabus.csv')
    print('Memory Use: Current %d, Peak %d' % tracemalloc.get_traced_memory())
Run this program using python3 -i readrides.py and look at the
resulting contents of rows. You should get a list of tuples like this:
>>> len(rows)
577563
>>> rows[0]
('3', '01/01/2001', 'U', 7354)
>>> rows[1]
('4', '01/01/2001', 'U', 9288)
>>>
Look at the resulting memory use. It should be substantially higher than in part (b).
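For a hint at where the memory goes (a rough sketch; exact sizes vary by Python version), sys.getsizeof() can be applied to a single record. Note that it measures only the tuple object itself, not the separate objects the tuple refers to:

```python
import sys

# One record from the list of tuples
record = ('3', '01/01/2001', 'U', 7354)

# Size of the tuple object alone: its header plus one pointer per item
print(sys.getsizeof(record))

# The contained strings and int are separate objects, each with its own
# per-object overhead; a fuller (still shallow) estimate adds them up
total = sys.getsizeof(record) + sum(sys.getsizeof(item) for item in record)
print(total)
```

Multiply a figure like that by 577,563 records and the gap versus part (b) becomes less surprising.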
(d) Memory Use of Other Data Structures
Python has many different choices for representing data structures. For example:
# A tuple
row = (route, date, daytype, rides)

# A dictionary
row = {
    'route': route,
    'date': date,
    'daytype': daytype,
    'rides': rides,
}

# A class
class Row:
    def __init__(self, route, date, daytype, rides):
        self.route = route
        self.date = date
        self.daytype = daytype
        self.rides = rides

# A named tuple
from collections import namedtuple
Row = namedtuple('Row', ['route', 'date', 'daytype', 'rides'])

# A class with __slots__
class Row:
    __slots__ = ['route', 'date', 'daytype', 'rides']
    def __init__(self, route, date, daytype, rides):
        self.route = route
        self.date = date
        self.daytype = daytype
        self.rides = rides
Your task is as follows: Create different versions of the
read_rides()
function that use each of these data
structures to represent a single row of data. Then, find out the
resulting memory use of each option. Find out which approach offers the
most efficient storage if you were working with a lot of data all at
once.
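To get started, here is a sketch of one possible variant (the function name and structure are my own choices, not the official solution), storing each row as a dictionary. The other versions follow the same pattern with a different record-construction step:

```python
# A sketch of one variant: each row stored as a dictionary
import csv

def read_rides_as_dicts(filename):
    '''
    Read the bus ride data as a list of dicts
    '''
    records = []
    with open(filename) as f:
        rows = csv.reader(f)
        headings = next(rows)     # Skip headers
        for row in rows:
            record = {
                'route': row[0],
                'date': row[1],
                'daytype': row[2],
                'rides': int(row[3]),
            }
            records.append(record)
    return records
```

Add the same tracemalloc harness used in readrides.py to each variant, run them one at a time from a fresh interpreter, and compare the reported current and peak numbers.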
Advanced Python Mastery
A course by dabeaz
Copyright 2007-2023.
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.