\[ [Index](index.md) | [Exercise 1.6](ex1_6.md) | [Exercise 2.2](ex2_2.md) \]
# Exercise 2.1

*Objectives:*

- Figure out the most memory-efficient way to store a lot of data.
- Learn about different ways of representing records, including tuples,
  dictionaries, classes, and named tuples.

In this exercise, we look at different choices for representing data
structures with an eye towards memory use and efficiency. A lot of
people use Python to perform various kinds of data analysis, so knowing
about the different options and their tradeoffs is useful.
## (a) Stuck on the bus
The file `Data/ctabus.csv` is a CSV file containing
daily ridership data for the Chicago Transit Authority (CTA) bus
system from January 1, 2001 to August 31, 2013. It contains
approximately 577000 rows of data. Use Python to view a few lines
of data to see what it looks like:
```python
>>> f = open('Data/ctabus.csv')
>>> next(f)
'route,date,daytype,rides\n'
>>> next(f)
'3,01/01/2001,U,7354\n'
>>> next(f)
'4,01/01/2001,U,9288\n'
>>>
```
There are 4 columns of data:

- route: Column 0. The bus route name.
- date: Column 1. A date string of the form MM/DD/YYYY.
- daytype: Column 2. A day type code (U=Sunday/Holiday, A=Saturday, W=Weekday)
- rides: Column 3. Total number of riders (integer)

The `rides` column records the total number of people who boarded a
bus on that route on a given day. Thus, from the example, 7354 people
rode the number 3 bus on January 1, 2001.
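As a quick sanity check on the layout, a single raw line can be pulled apart into those four fields by hand (the `csv` module used later in this exercise handles this more robustly):

```python
>>> line = '3,01/01/2001,U,7354\n'
>>> route, date, daytype, rides = line.strip().split(',')
>>> route, date, daytype, int(rides)
('3', '01/01/2001', 'U', 7354)
>>>
```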
## (b) Basic memory use of text
Let's get a baseline of the memory required to work with this
datafile. First, restart Python and try a very simple experiment of
simply grabbing the file and storing its data in a single string:
```python
>>> # --- RESTART
>>> import tracemalloc
>>> f = open('Data/ctabus.csv')
>>> tracemalloc.start()
>>> data = f.read()
>>> len(data)
12361039
>>> current, peak = tracemalloc.get_traced_memory()
>>> current
12369664
>>> peak
24730766
>>>
```
Your results might vary somewhat, but you should see current
memory use in the range of 12MB with a peak of 24MB.
What happens if you read the entire file into a list of strings
instead? Restart Python and try this:
```python
>>> # --- RESTART
>>> import tracemalloc
>>> f = open('Data/ctabus.csv')
>>> tracemalloc.start()
>>> lines = f.readlines()
>>> len(lines)
577564
>>> current, peak = tracemalloc.get_traced_memory()
>>> current
45828030
>>> peak
45867371
>>>
```
You should see the memory use go up significantly into the range of 40-50MB.
Point to ponder: what might be the source of that extra overhead?
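If you want a hint, `sys.getsizeof()` reports the memory footprint of a single Python object, including its bookkeeping overhead. Comparing the length of one line with the size of that same line stored as its own string object is one way to start investigating (exact numbers vary by Python version and platform):

```python
>>> import sys
>>> line = '3,01/01/2001,U,7354\n'
>>> len(line)            # number of characters in the line
20
>>> sys.getsizeof(line)  # bytes used by the string object itself
69
>>>
```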
## (c) A List of Tuples
In practice, you might read the data into a list and convert each line
into some other data structure. Here is a program `readrides.py` that
reads the entire file into a list of tuples using the `csv` module:
```python
# readrides.py

import csv

def read_rides_as_tuples(filename):
    '''
    Read the bus ride data as a list of tuples
    '''
    records = []
    with open(filename) as f:
        rows = csv.reader(f)
        headings = next(rows)     # Skip headers
        for row in rows:
            route = row[0]
            date = row[1]
            daytype = row[2]
            rides = int(row[3])
            record = (route, date, daytype, rides)
            records.append(record)
    return records

if __name__ == '__main__':
    import tracemalloc
    tracemalloc.start()
    rows = read_rides_as_tuples('Data/ctabus.csv')
    print('Memory Use: Current %d, Peak %d' % tracemalloc.get_traced_memory())
```
Run this program using `python3 -i readrides.py` and look at the
resulting contents of `rows`. You should get a list of tuples like
this:
```python
>>> len(rows)
577563
>>> rows[0]
('3', '01/01/2001', 'U', 7354)
>>> rows[1]
('4', '01/01/2001', 'U', 9288)
```
Look at the resulting memory use. It should be substantially higher
than in part (b).
## (d) Memory Use of Other Data Structures
Python has many different choices for representing data structures.
For example:
```python
# A tuple
row = (route, date, daytype, rides)

# A dictionary
row = {
    'route': route,
    'date': date,
    'daytype': daytype,
    'rides': rides,
}

# A class
class Row:
    def __init__(self, route, date, daytype, rides):
        self.route = route
        self.date = date
        self.daytype = daytype
        self.rides = rides

# A named tuple
from collections import namedtuple
Row = namedtuple('Row', ['route', 'date', 'daytype', 'rides'])

# A class with __slots__
class Row:
    __slots__ = ['route', 'date', 'daytype', 'rides']
    def __init__(self, route, date, daytype, rides):
        self.route = route
        self.date = date
        self.daytype = daytype
        self.rides = rides
```
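Whichever representation you choose, construction and access look roughly the same. This is only an illustration using the sample record from part (a), with `Row` standing for any of the class-based definitions above:

```python
# A tuple is accessed by position
row = ('3', '01/01/2001', 'U', 7354)
row[3]            # 7354

# A dictionary is accessed by key
row = {'route': '3', 'date': '01/01/2001', 'daytype': 'U', 'rides': 7354}
row['rides']      # 7354

# A class, named tuple, or __slots__ class is accessed by attribute
row = Row('3', '01/01/2001', 'U', 7354)
row.rides         # 7354
```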
Your task is as follows: create different versions of the `read_rides()` function
that use each of these data structures to represent a single row of data.
Then, measure the resulting memory use of each option and determine which
approach offers the most efficient storage when working with a lot of
data all at once.
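One way to organize the comparison is a small driver that measures each reader separately. This is only a sketch; the `read_rides_as_*` names other than `read_rides_as_tuples` are placeholders for the versions you will write:

```python
# compare_rides.py (sketch) -- measure each reader's memory use in turn
import tracemalloc
import readrides

readers = [
    readrides.read_rides_as_tuples,
    # readrides.read_rides_as_dicts,        # placeholder: your dictionary version
    # readrides.read_rides_as_instances,    # placeholder: your class version
    # readrides.read_rides_as_namedtuples,  # placeholder: your named tuple version
    # readrides.read_rides_as_slots,        # placeholder: your __slots__ version
]

if __name__ == '__main__':
    for reader in readers:
        tracemalloc.start()
        rows = reader('Data/ctabus.csv')
        current, peak = tracemalloc.get_traced_memory()
        print('%s: Current %d, Peak %d' % (reader.__name__, current, peak))
        del rows
        tracemalloc.stop()   # discard traces so each reader is measured on its own
```

Running each version in a completely fresh interpreter, as in parts (b) and (c), is the most reliable comparison, since nothing left over from a previous run can skew the numbers.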
\[ [Solution](soln2_1.md) | [Index](index.md) | [Exercise 1.6](ex1_6.md) | [Exercise 2.2](ex2_2.md) \]
----
`>>>` Advanced Python Mastery
`...` A course by [dabeaz](https://www.dabeaz.com)
`...` Copyright 2007-2023
![](https://i.creativecommons.org/l/by-sa/4.0/88x31.png). This work is licensed under a [Creative Commons Attribution-ShareAlike 4.0 International License](http://creativecommons.org/licenses/by-sa/4.0/)