\[ [Index](index.md) | [Exercise 1.6](ex1_6.md) | [Exercise 2.2](ex2_2.md) \]

# Exercise 2.1

*Objectives:*

- Figure out the most memory-efficient way to store a lot of data.
- Learn about different ways of representing records, including tuples,
  dictionaries, classes, and named tuples.

In this exercise, we look at different choices for representing data
structures with an eye toward memory use and efficiency. A lot of
people use Python to perform various kinds of data analysis, so knowing
about the different options and their tradeoffs is useful information.

## (a) Stuck on the bus

The file `Data/ctabus.csv` is a CSV file containing
daily ridership data for the Chicago Transit Authority (CTA) bus
system from January 1, 2001 to August 31, 2013. It contains
approximately 577,000 rows of data. Use Python to view a few lines
of data to see what it looks like:

```python
>>> f = open('Data/ctabus.csv')
>>> next(f)
'route,date,daytype,rides\n'
>>> next(f)
'3,01/01/2001,U,7354\n'
>>> next(f)
'4,01/01/2001,U,9288\n'
>>>
```

There are 4 columns of data:

- route: Column 0. The bus route name.
- date: Column 1. A date string of the form MM/DD/YYYY.
- daytype: Column 2. A day type code (U=Sunday/Holiday, A=Saturday, W=Weekday).
- rides: Column 3. The total number of riders (integer).

The `rides` column records the total number of people who boarded a
bus on that route on a given day. Thus, from the example, 7354 people
rode the number 3 bus on January 1, 2001.
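
For instance, one raw line of the file can be split into its four fields like this (a minimal sketch; the sample line is taken from the session above):

```python
# Parse one raw line from ctabus.csv into its four fields.
# The sample line comes from the session shown above.
line = '3,01/01/2001,U,7354\n'
route, date, daytype, rides = line.strip().split(',')
rides = int(rides)   # convert the ride count from string to integer
print(route, date, daytype, rides)   # -> 3 01/01/2001 U 7354
```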

## (b) Basic memory use of text

Let's get a baseline of the memory required to work with this
data file. First, restart Python and try a very simple experiment of
simply grabbing the file and storing its data in a single string:

```python
>>> # --- RESTART
>>> import tracemalloc
>>> f = open('Data/ctabus.csv')
>>> tracemalloc.start()
>>> data = f.read()
>>> len(data)
12361039
>>> current, peak = tracemalloc.get_traced_memory()
>>> current
12369664
>>> peak
24730766
>>>
```

Your results might vary somewhat, but you should see current
memory use in the range of 12MB with a peak of 24MB.

What happens if you read the entire file into a list of strings
instead? Restart Python and try this:

```python
>>> # --- RESTART
>>> import tracemalloc
>>> f = open('Data/ctabus.csv')
>>> tracemalloc.start()
>>> lines = f.readlines()
>>> len(lines)
577564
>>> current, peak = tracemalloc.get_traced_memory()
>>> current
45828030
>>> peak
45867371
>>>
```

You should see the memory use go up significantly, into the range of 40-50MB.
Point to ponder: what might be the source of that extra overhead?
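
One likely contributor: each line now becomes a separate `str` object, and every Python object carries a fixed bookkeeping cost beyond the characters it holds. A rough sketch with `sys.getsizeof()` (the exact numbers vary by Python version and platform):

```python
import sys

# A typical line from the file, stored as its own string object.
line = '3,01/01/2001,U,7354\n'
print(sys.getsizeof(line))   # size of the string object in bytes
print(len(line))             # number of characters it actually holds

# The difference is per-object overhead.  Multiplied across ~577,000
# separate line objects (plus the list that holds references to them),
# it adds up to the extra tens of megabytes seen above.
overhead = sys.getsizeof(line) - len(line)
print(overhead)
```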

## (c) A List of Tuples

In practice, you might read the data into a list and convert each line
into some other data structure. Here is a program `readrides.py` that
reads the entire file into a list of tuples using the `csv` module:

```python
# readrides.py

import csv

def read_rides_as_tuples(filename):
    '''
    Read the bus ride data as a list of tuples
    '''
    records = []
    with open(filename) as f:
        rows = csv.reader(f)
        headings = next(rows)     # Skip headers
        for row in rows:
            route = row[0]
            date = row[1]
            daytype = row[2]
            rides = int(row[3])
            record = (route, date, daytype, rides)
            records.append(record)
    return records

if __name__ == '__main__':
    import tracemalloc
    tracemalloc.start()
    rows = read_rides_as_tuples('Data/ctabus.csv')
    print('Memory Use: Current %d, Peak %d' % tracemalloc.get_traced_memory())
```

Run this program using `python3 -i readrides.py` and look at the
resulting contents of `rows`. You should get a list of tuples like
this:

```python
>>> len(rows)
577563
>>> rows[0]
('3', '01/01/2001', 'U', 7354)
>>> rows[1]
('4', '01/01/2001', 'U', 9288)
```

Look at the resulting memory use. It should be substantially higher
than in part (b).
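
To get a feel for where the extra memory goes, you can inspect a single record with `sys.getsizeof()`. Keep in mind that `getsizeof()` reports only the container's own size, not the objects it refers to (a sketch; exact numbers vary by platform):

```python
import sys

# One row, as built by read_rides_as_tuples() above.
record = ('3', '01/01/2001', 'U', 7354)
print(sys.getsizeof(record))   # size of the tuple itself (references only)

# Each field is a separate object with its own per-object overhead,
# so a rough per-record footprint is the tuple plus its contents.
total = sys.getsizeof(record) + sum(sys.getsizeof(x) for x in record)
print(total)
```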

## (d) Memory Use of Other Data Structures

Python has many different choices for representing data structures.
For example:

```python
# A tuple
row = (route, date, daytype, rides)

# A dictionary
row = {
    'route': route,
    'date': date,
    'daytype': daytype,
    'rides': rides,
}

# A class
class Row:
    def __init__(self, route, date, daytype, rides):
        self.route = route
        self.date = date
        self.daytype = daytype
        self.rides = rides

# A named tuple
from collections import namedtuple
Row = namedtuple('Row', ['route', 'date', 'daytype', 'rides'])

# A class with __slots__
class Row:
    __slots__ = ['route', 'date', 'daytype', 'rides']
    def __init__(self, route, date, daytype, rides):
        self.route = route
        self.date = date
        self.daytype = daytype
        self.rides = rides
```

Your task is as follows: create different versions of the `read_rides()`
function that use each of these data structures to represent a single row
of data. Then measure the resulting memory use of each option and
determine which approach offers the most efficient storage if you were
working with a lot of data all at once.

\[ [Solution](soln2_1.md) | [Index](index.md) | [Exercise 1.6](ex1_6.md) | [Exercise 2.2](ex2_2.md) \]

----
`>>>` Advanced Python Mastery
`...` A course by [dabeaz](https://www.dabeaz.com)
`...` Copyright 2007-2023

This work is licensed under a [Creative Commons Attribution-ShareAlike 4.0 International License](http://creativecommons.org/licenses/by-sa/4.0/)