\[ [Index](index.md) | [Exercise 2.4](ex2_4.md) | [Exercise 2.6](ex2_6.md) \] # Exercise 2.5 *Objectives:* - Look at memory allocation behavior of lists and dicts - Make a custom container *Files Created:* None ## (a) List growth Python lists are highly optimized for performing `append()` operations. Each time a list grows, it grabs a larger chunk of memory than it actually needs with the expectation that more data will be added to the list later. If new items are added and space is available, the `append()` operation stores the item without allocating more memory. Experiment with this feature of lists by using the `sys.getsizeof()` function on a list and appending a few more items. ```python >>> import sys >>> items = [] >>> sys.getsizeof(items) 64 >>> items.append(1) >>> sys.getsizeof(items) 96 >>> items.append(2) >>> sys.getsizeof(items) # Notice how the size does not increase 96 >>> items.append(3) >>> sys.getsizeof(items) # It still doesn't increase here 96 >>> items.append(4) >>> sys.getsizeof(items) # Not yet. 96 >>> items.append(5) >>> sys.getsizeof(items) # Notice the size has jumped 128 >>> ``` A list stores its items by reference. So, the memory required for each item is a single memory address. On a 64-bit machine, an address is typically 8 bytes. However, if Python has been compiled for 32-bits, it might be 4 bytes and the numbers for the above example will be half of what's shown. ## (b) Dictionary/Class Growth Python dictionaries (and classes) allow up to 5 values to be stored before their reserved memory doubles. Investigate by making a dictionary and adding a few more values to it: ```python >>> row = { 'route': '22', 'date': '01/01/2001', 'daytype': 'U', 'rides': 7354 } >>> sys.getsizeof(row) >>> sys.getsizeof(row) 240 >>> row['a'] = 1 >>> sys.getsizeof(row) 240 >>> row['b'] = 2 >>> sys.getsizeof(row) 368 >>> ``` Does the memory go down if you delete the item you just added? Food for thought: If you are creating large numbers of records, representing each record as a dictionary might not be the most efficient approach--you could be paying a heavy price for the convenience of having a dictionary. It might be better to consider the use of tuples, named tuples, or classes that define `__slots__`. ## (c) Changing Your Orientation (to Columns) You can often save a lot of memory if you change your view of data. For example, what happens if you read all of the bus data into a columns using this function? ```python # readrides.py ... def read_rides_as_columns(filename): ''' Read the bus ride data into 4 lists, representing columns ''' routes = [] dates = [] daytypes = [] numrides = [] with open(filename) as f: rows = csv.reader(f) headings = next(rows) # Skip headers for row in rows: routes.append(row[0]) dates.append(row[1]) daytypes.append(row[2]) numrides.append(int(row[3])) return dict(routes=routes, dates=dates, daytypes=daytypes, numrides=numrides) ``` In theory, this function should save a lot of memory. Let's analyze it before trying it. First, the datafile contained 577563 rows of data where each row contained four values. If each row is stored as a dictionary, then those dictionaries are minimally 240 bytes in size. ```python >>> nrows = 577563 # Number of rows in original file >>> nrows * 240 138615120 >>> ``` So, that's 138MB just for the dictionaries themselves. This does not include any of the values actually stored in the dictionaries. By switching to columns, the data is stored in 4 separate lists. Each list requires 8 bytes per item to store a pointer. So, here's a rough estimate of the list requirements: ```python >>> nrows * 4 * 8 18482016 >>> ``` That's about 18MB in list overhead. So, switching to a column orientation should save about 120MB of memory solely from eliminating all of the extra information that needs to be stored in dictionaries. Try using this function to read the bus data and look at the memory use. ```python >>> import tracemalloc >>> tracemalloc.start() >>> columns = read_rides_as_columns('Data/ctabus.csv') >>> tracemalloc.get_traced_memory() ... look at the result ... >>> ``` Does the result reflect the expected savings in memory from our rough calculations above? ## (d) Making a Custom Container - The Great Fake Out Storing the data in columns offers a much better memory savings, but the data is now rather weird to work with. In fact, none of our earlier analysis code from [Exercise 2.2](ex2_2.md) can work with columns. The reason everything is broken is that you've broken the data abstraction that was used in earlier exercises--namely the assumption that data is stored as a list of dictionaries. This can be fixed if you're willing to make a custom container object that "fakes" it. Let's do that. The earlier analysis code assumes the data is stored in a sequence of records. Each record is represented as a dictionary. Let's start by making a new "Sequence" class. In this class, we store the four columns of data that were being using in the `read_rides_as_columns()` function. ```python # readrides.py import collections ... class RideData(collections.abc.Sequence): def __init__(self): self.routes = [] # Columns self.dates = [] self.daytypes = [] self.numrides = [] ``` Try creating a `RideData` instance. You'll find that it fails with an error message like this: ```python >>> records = RideData() Traceback (most recent call last): ... TypeError: Can't instantiate abstract class RideData with abstract methods __getitem__, __len__ >>> ``` Carefully read the error message. It tells us what we need to implement. Let's add a `__len__()` and `__getitem__()` method. In the `__getitem__()` method, we'll make a dictionary. In addition, we'll create an `append()` method that takes a dictionary and unpacks it into 4 separate `append()` operations. ```python # readrides.py ... class RideData(collections.abc.Sequence): def __init__(self): # Each value is a list with all the values (a column) self.routes = [] self.dates = [] self.daytypes = [] self.numrides = [] def __len__(self): # All lists assumed to have the same length return len(self.routes) def __getitem__(self, index): return { 'route': self.routes[index], 'date': self.dates[index], 'daytype': self.daytypes[index], 'rides': self.numrides[index] } def append(self, d): self.routes.append(d['route']) self.dates.append(d['date']) self.daytypes.append(d['daytype']) self.numrides.append(d['rides']) ``` If you've done this correctly, you should be able to drop this object into the previously written `read_rides_as_dicts()` function. It involves changing only one line of code: ```python # readrides.py ... def read_rides_as_dicts(filename): ''' Read the bus ride data as a list of dicts ''' records = RideData() # <--- CHANGE THIS with open(filename) as f: rows = csv.reader(f) headings = next(rows) # Skip headers for row in rows: route = row[0] date = row[1] daytype = row[2] rides = int(row[3]) record = { 'route': route, 'date': date, 'daytype': daytype, 'rides' : rides } records.append(record) return records ``` If you've done this right, old code should work exactly as it did before. For example: ```python >>> rows = readrides.read_rides_as_dicts('Data/ctabus.csv') >>> rows >>> len(rows) 577563 >>> rows[0] {'route': '3', 'date': '01/01/2001', 'daytype': 'U', 'rides': 7354} >>> rows[1] {'route': '4', 'date': '01/01/2001', 'daytype': 'U', 'rides': 9288} >>> rows[2] {'route': '6', 'date': '01/01/2001', 'daytype': 'U', 'rides': 6048} >>> ``` Run your earlier CTA code from [Exercise 2.2](ex2_2.md). It should work without modification, but use substantially less memory. ## (e) Challenge What happens when you take a slice of ride data? ```python >>> r = rows[0:10] >>> r ... look at result ... >>> ``` It's probably going to look a little crazy. Can you modify the `RideData` class so that it produces a proper slice that looks like a list of dictionaries? For example, like this: ```python >>> rows = readrides.read_rides_as_dicts('Data/ctabus.csv') >>> rows >>> len(rows) 577563 >>> r = rows[0:10] >>> r >>> len(r) 10 >>> r[0] {'route': '3', 'date': '01/01/2001', 'daytype': 'U', 'rides': 7354} >>> r[1] {'route': '4', 'date': '01/01/2001', 'daytype': 'U', 'rides': 9288} >>> ``` \[ [Solution](soln2_5.md) | [Index](index.md) | [Exercise 2.4](ex2_4.md) | [Exercise 2.6](ex2_6.md) \] ---- `>>>` Advanced Python Mastery `...` A course by [dabeaz](https://www.dabeaz.com) `...` Copyright 2007-2023 ![](https://i.creativecommons.org/l/by-sa/4.0/88x31.png). This work is licensed under a [Creative Commons Attribution-ShareAlike 4.0 International License](http://creativecommons.org/licenses/by-sa/4.0/)