
\[ [Index](index.md) | [Exercise 2.4](ex2_4.md) | [Exercise 2.6](ex2_6.md) \]

# Exercise 2.5

*Objectives:*

- Look at memory allocation behavior of lists and dicts
- Make a custom container

*Files Created:* None

## (a) List growth

Python lists are highly optimized for performing `append()`
operations. Each time a list grows, it grabs a larger chunk of memory
than it actually needs with the expectation that more data will be
added to the list later.  If new items are added and space is
available, the `append()` operation stores the item without
allocating more memory.

Experiment with this feature of lists by using
the `sys.getsizeof()` function on a list and appending a few
more items.

```python
>>> import sys
>>> items = []
>>> sys.getsizeof(items)
64
>>> items.append(1)
>>> sys.getsizeof(items)
96
>>> items.append(2)
>>> sys.getsizeof(items)    # Notice how the size does not increase
96
>>> items.append(3)
>>> sys.getsizeof(items)    # It still doesn't increase here
96
>>> items.append(4)
>>> sys.getsizeof(items)    # Not yet.
96
>>> items.append(5)
>>> sys.getsizeof(items)    # Notice the size has jumped
128
>>>
```

A list stores its items by reference, so the memory required for
each item is a single memory address.  On a 64-bit machine, an address
is typically 8 bytes.  However, if Python has been compiled for
32 bits, an address might be 4 bytes and the numbers for the above example
will be roughly half of what's shown.  The exact sizes also vary
between Python versions.
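To see the over-allocation pattern without typing each `append()` by hand, the experiment can be put in a loop.  This is only a sketch: the name `growth_points` and the limit of 64 items are arbitrary choices for illustration, and the sizes printed depend on your Python version and platform.

```python
import sys

items = []
last_size = sys.getsizeof(items)
growth_points = []      # (length, size) pairs recorded whenever the allocation grows

for n in range(1, 65):
    items.append(n)
    size = sys.getsizeof(items)
    if size != last_size:
        # The list had to reserve a bigger block to hold this item
        growth_points.append((len(items), size))
        last_size = size

for length, size in growth_points:
    print(f'len={length:3d}  size={size} bytes')
```

Only a handful of the 64 appends should show up in the output.  All of the others reuse space that was already reserved.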

## (b) Dictionary/Class Growth

Python dictionaries (and classes) allow up to 5 values to be stored
before their reserved memory doubles.  Investigate by making a dictionary
and adding a few more values to it:

```python
>>> row = { 'route': '22', 'date': '01/01/2001', 'daytype': 'U', 'rides': 7354 }
>>> sys.getsizeof(row)
240
>>> row['a'] = 1
>>> sys.getsizeof(row)
240
>>> row['b'] = 2
>>> sys.getsizeof(row)
368
>>>
```

Does the memory go down if you delete the item you just added?

Food for thought:  If you are creating large numbers of records,
representing each record as a dictionary might not be the most
efficient approach--you could be paying a heavy price for the convenience
of having a dictionary.   It might be better to consider the use of tuples,
named tuples, or classes that define `__slots__`.
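As a rough comparison, here is a sketch that stores the same record in each of those forms and reports the container size.  The names `SlotsRow` and `RowTuple` are made up for this illustration.  Keep in mind that `sys.getsizeof()` measures only the container itself, not the values it refers to, and that the numbers vary by Python version.

```python
import sys
from collections import namedtuple

class SlotsRow:
    # __slots__ removes the per-instance __dict__
    __slots__ = ('route', 'date', 'daytype', 'rides')

    def __init__(self, route, date, daytype, rides):
        self.route = route
        self.date = date
        self.daytype = daytype
        self.rides = rides

RowTuple = namedtuple('RowTuple', ['route', 'date', 'daytype', 'rides'])

values = ('22', '01/01/2001', 'U', 7354)

sizes = {
    'dict': sys.getsizeof(dict(zip(RowTuple._fields, values))),
    'tuple': sys.getsizeof(values),
    'namedtuple': sys.getsizeof(RowTuple(*values)),
    'slots': sys.getsizeof(SlotsRow(*values)),
}

for name, size in sizes.items():
    print(f'{name:12s} {size} bytes')
```

The dictionary should come out as the largest of the four by a wide margin.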

## (c) Changing Your Orientation (to Columns)

You can often save a lot of memory if you change your view of data.
For example, what happens if you read all of the bus data into
columns using this function?

```python
# readrides.py

...

def read_rides_as_columns(filename):
    '''
    Read the bus ride data into 4 lists, representing columns
    '''
    routes = []
    dates = []
    daytypes = []
    numrides = []
    with open(filename) as f:
        rows = csv.reader(f)
        headings = next(rows)     # Skip headers
        for row in rows:
            routes.append(row[0])
            dates.append(row[1])
            daytypes.append(row[2])
            numrides.append(int(row[3]))
    return dict(routes=routes, dates=dates, daytypes=daytypes, numrides=numrides)
```

In theory, this function should save a lot of memory.  Let's analyze it before trying it.

First, the datafile contained 577563 rows of data where each row contained
four values. If each row is stored as a dictionary, then those dictionaries
are minimally 240 bytes in size.

```python
>>> nrows = 577563     # Number of rows in original file
>>> nrows * 240
138615120
>>>
```

So, that's 138MB just for the dictionaries themselves.  This does not
include any of the values actually stored in the dictionaries.

By switching to columns, the data is stored in 4 separate lists.
Each list requires 8 bytes per item to store a pointer.  So, here's
a rough estimate of the list requirements:

```python
>>> nrows * 4 * 8
18482016
>>>
```

That's about 18MB in list overhead.  So, switching to a column orientation
should save about 120MB of memory solely from eliminating all of the extra information that
needs to be stored in dictionaries.

Try using this function to read the bus data and look at the memory use.

```python
>>> import tracemalloc
>>> tracemalloc.start()
>>> columns = read_rides_as_columns('Data/ctabus.csv')
>>> tracemalloc.get_traced_memory()
... look at the result ...
>>>
```

Does the result reflect the expected savings in memory from our rough calculations above?
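If you don't have the data file handy, the same comparison can be sketched with synthetic data.  Everything here is an assumption made for illustration: the `measure()` helper, the row count `N`, and the made-up row values.  The absolute numbers won't match the real file, but the relative difference between the two layouts should be visible.

```python
import tracemalloc

N = 100_000     # Synthetic row count (much smaller than the real file)

def measure(build):
    '''
    Call build() under tracemalloc and return the bytes still allocated afterwards.
    '''
    tracemalloc.start()
    result = build()                       # Keep a reference so the data stays alive
    current, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return current

def build_dicts():
    # One dictionary per row
    return [{'route': str(i % 100), 'date': '01/01/2001', 'daytype': 'U', 'rides': i}
            for i in range(N)]

def build_columns():
    # One list per column
    routes, dates, daytypes, numrides = [], [], [], []
    for i in range(N):
        routes.append(str(i % 100))
        dates.append('01/01/2001')
        daytypes.append('U')
        numrides.append(i)
    return dict(routes=routes, dates=dates, daytypes=daytypes, numrides=numrides)

dict_bytes = measure(build_dicts)
column_bytes = measure(build_columns)
print('dicts:  ', dict_bytes)
print('columns:', column_bytes)
```

Both versions create the same strings and integers, so the difference between the two numbers comes almost entirely from the containers.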

## (d) Making a Custom Container - The Great Fake Out

Storing the data in columns offers much better memory savings, but
the data is now rather weird to work with.  In fact, none of our
earlier analysis code from [Exercise 2.2](ex2_2.md) can work
with columns. Everything breaks because the column layout violates
the data abstraction that was used in earlier exercises: namely, the
assumption that data is stored as a list of dictionaries.

This can be fixed if you're willing to make a custom container object
that "fakes" it. Let's do that.

The earlier analysis code assumes the data is stored in a sequence of
records.  Each record is represented as a dictionary.  Let's start
by making a new "Sequence" class.  In this class, we store the
four columns of data that were used in the `read_rides_as_columns()`
function.

```python
# readrides.py

import collections.abc     # Import the submodule explicitly; don't rely on it being loaded
...
class RideData(collections.abc.Sequence):
    def __init__(self):
        self.routes = []      # Columns
        self.dates = []
        self.daytypes = []
        self.numrides = []
```

Try creating a `RideData` instance.  You'll find that it fails with an
error message like this:

```python
>>> records = RideData()
Traceback (most recent call last):
...
TypeError: Can't instantiate abstract class RideData with abstract methods __getitem__, __len__
>>>
```

Carefully read the error message. It tells us what we need to
implement. Let's add a `__len__()` and `__getitem__()` method.  In the
`__getitem__()` method, we'll make a dictionary.  In addition, we'll
create an `append()` method that takes a dictionary and unpacks it
into 4 separate `append()` operations.

```python
# readrides.py
...

class RideData(collections.abc.Sequence):
    def __init__(self):
        # Each value is a list with all of the values (a column)
        self.routes = []
        self.dates = []
        self.daytypes = []
        self.numrides = []

    def __len__(self):
        # All lists assumed to have the same length
        return len(self.routes)

    def __getitem__(self, index):
        return { 'route': self.routes[index],
                 'date': self.dates[index],
                 'daytype': self.daytypes[index],
                 'rides': self.numrides[index] }

    def append(self, d):
        self.routes.append(d['route'])
        self.dates.append(d['date'])
        self.daytypes.append(d['daytype'])
        self.numrides.append(d['rides'])
```

If you've done this correctly, you should be able to drop this object into
the previously written `read_rides_as_dicts()` function.  It involves
changing only one line of code:

```python
# readrides.py
...

def read_rides_as_dicts(filename):
    '''
    Read the bus ride data as a list of dicts
    '''
    records = RideData()      # <--- CHANGE THIS
    with open(filename) as f:
        rows = csv.reader(f)
        headings = next(rows)     # Skip headers
        for row in rows:
            route = row[0]
            date = row[1]
            daytype = row[2]
            rides = int(row[3])
            record = {
                'route': route,
                'date': date,
                'daytype': daytype,
                'rides' : rides
                }
            records.append(record)
    return records
```

If you've done this right, old code should work exactly as it did before.
For example:

```python
>>> rows = readrides.read_rides_as_dicts('Data/ctabus.csv')
>>> rows
<readrides.RideData object at 0x10f5054a8>
>>> len(rows)
577563
>>> rows[0]
{'route': '3', 'date': '01/01/2001', 'daytype': 'U', 'rides': 7354}
>>> rows[1]
{'route': '4', 'date': '01/01/2001', 'daytype': 'U', 'rides': 9288}
>>> rows[2]
{'route': '6', 'date': '01/01/2001', 'daytype': 'U', 'rides': 6048}
>>>
```

Run your earlier CTA code from [Exercise 2.2](ex2_2.md).  It
should work without modification, but use substantially less memory.
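Part of the reason old code keeps working is that `collections.abc.Sequence` fills in several methods for you once `__len__()` and `__getitem__()` exist, including iteration, `in`, `index()`, `count()`, and `reversed()`.  Here is a small self-contained sketch using two hand-made records instead of the data file (the variable names `first`, `second`, and `total` are just for illustration):

```python
from collections.abc import Sequence

class RideData(Sequence):
    def __init__(self):
        # Each attribute is a column
        self.routes = []
        self.dates = []
        self.daytypes = []
        self.numrides = []

    def __len__(self):
        return len(self.routes)

    def __getitem__(self, index):
        # Builds a fresh dictionary on demand
        return {'route': self.routes[index],
                'date': self.dates[index],
                'daytype': self.daytypes[index],
                'rides': self.numrides[index]}

    def append(self, d):
        self.routes.append(d['route'])
        self.dates.append(d['date'])
        self.daytypes.append(d['daytype'])
        self.numrides.append(d['rides'])

first = {'route': '3', 'date': '01/01/2001', 'daytype': 'U', 'rides': 7354}
second = {'route': '4', 'date': '01/01/2001', 'daytype': 'U', 'rides': 9288}

records = RideData()
records.append(first)
records.append(second)

total = sum(r['rides'] for r in records)    # Iteration is supplied by Sequence
print(total)                                # 16642
print(second in records)                    # True
print(records.index(second))                # 1
```

Note that each lookup creates a new dictionary, so you are trading some speed for the memory savings.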

## (e) Challenge

What happens when you take a slice of ride data?

```python
>>> r = rows[0:10]
>>> r
... look at result ...
>>>
```

It's probably going to look a little crazy.  Can you modify
the `RideData` class so that it produces a proper slice that
looks like a list of dictionaries? For example, like this:

```python
>>> rows = readrides.read_rides_as_dicts('Data/ctabus.csv')
>>> rows
<readrides.RideData object at 0x10f5054a8>
>>> len(rows)
577563
>>> r = rows[0:10]
>>> r
<readrides.RideData object at 0x10f5068c8>
>>> len(r)
10
>>> r[0]
{'route': '3', 'date': '01/01/2001', 'daytype': 'U', 'rides': 7354}
>>> r[1]
{'route': '4', 'date': '01/01/2001', 'daytype': 'U', 'rides': 9288}
>>>
```

\[ [Solution](soln2_5.md) | [Index](index.md) | [Exercise 2.4](ex2_4.md) | [Exercise 2.6](ex2_6.md) \]

----
`>>>` Advanced Python Mastery
`...` A course by [dabeaz](https://www.dabeaz.com)
`...` Copyright 2007-2023

![](https://i.creativecommons.org/l/by-sa/4.0/88x31.png). This work is licensed under a [Creative Commons Attribution-ShareAlike 4.0 International License](http://creativecommons.org/licenses/by-sa/4.0/)