<div style="text-align: right"><i>Peter Norvig<br>April 2024</i></div> 

# Counting Cluster Sizes

Zach Wissner-Gross's *Fiddler on the Proof!* column [poses a question](https://thefiddler.substack.com/p/can-you-paint-by-number) that I will restate as:

Consider a two-dimensional grid of squares, where each square is colored either red or blue. A **cluster** is a group of contiguous squares of the same color (contiguity must be horizontal or vertical, not diagonal). For example, in the following 10 × 2 grid there are four red clusters and five blue clusters:

![](https://substackcdn.com/image/fetch/w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F079ab505-66c7-427b-ad6a-a2cf6a1794a6_1600x384.png)

Altogether there are 20 squares and 9 clusters, so the average cluster size is 20/9 ≈ 2.22 squares. Under the assumption that every square is equally likely to be red or blue, Zach poses two questions (and I'll add three more):

- What is the average cluster size for a grid consisting of a single infinitely long row?
- What is the average cluster size for a grid consisting of two infinitely long rows?
- What is the average cluster size for a grid of any given size *w* × *h*?
- What if there are three or more colors rather than just two?
- If you pick a random square, what is the average size of its cluster?

# Code to Make Grids and to Count Clusters

I can see three approaches to answering these questions:
1) Enumerate all possible grids up to a certain size, compute the average cluster size for each grid, and average the averages. This gives an exact answer for grids of a specific size, but it can't say anything about grids of infinite size. In fact it starts getting slow for grids with more than about 20 squares, because there are 2<sup>*n*</sup> grids with *n* squares.
2) Randomly select some grids and average cluster sizes over them. This can handle grids with thousands of squares, but the averages will be only estimates.
3) Come up with a mathematical proof of the answer for grids of any width *w*.


I can easily write code to implement the first two approaches. I'll start with some imports and the definition of three data types:

In [1]:
import itertools 
import random
from statistics import mean, stdev
from typing import *

Square = Tuple[int, int] # A square is a pair of `(x, y)` coordinates, e.g. `(2, 1)`.
Grid   = Dict # A dict of `{square: contents}`, e.g. `{(1, 1): 1, (2, 1): 'B'}`
Color  = str  # A color is represented by a one-character string, e.g. 'R' for red and 'B' for blue
COLORS = ('R', 'B')


Now I'll define the function `all_grids` to make a list of all possible grids of a given size (all possible ways to color each square), and `random_grids` to sample `N` random grids of a given size (using `random_grid` to make each one). The helper function `one_grid` makes a single grid from a sequence of colors.

In [2]:
def all_grids(width: int, height: int, colors=COLORS) -> List[Grid]:
    """All possible grids with given width, height, and color set."""
    return [one_grid(width, height, colorseq) 
            for colorseq in itertools.product(colors, repeat=(width * height))]

def random_grids(width: int, height: int, N: int, colors=COLORS) -> List[Grid]:
    """N grids of size width × height, filled with random colors."""
    return [random_grid(width, height, colors) for _ in range(N)]

def random_grid(width: int, height: int, colors=COLORS) -> Grid:
    """A single random grid."""
    return one_grid(width, height, [random.choice(colors) for _ in range(width * height)])

def one_grid(width: int, height: int, colorseq: Sequence[Color]) -> Grid: 
    """A grid of given size made from the sequence of colors."""
    squares = [(x, y) for y in range(height) for x in range(width)]
    return dict(zip(squares, colorseq))

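Finally, the function `cluster` mutates the grid so that each square's contents change from a color to a cluster number, where 1 is the number of the first cluster, 2 of the next cluster, and so on. It uses a [flood fill](https://en.wikipedia.org/wiki/Flood_fill) algorithm. The function `mean_cluster_size` then divides each grid's number of squares by its number of clusters (the highest cluster number) and averages the results.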

In [3]:
def cluster(grid: Grid[Square, Color]) -> Grid[Square, int]:
    """Mutate grid, replacing colors with cluster numbers.
    Do a flood fill, replacing one cluster of adjacent colors with an integer cluster number,
    then incrementing the cluster number and continuing on to find the next cluster."""
    cluster_number = 0 
    for square in grid:
        c = grid[square]
        if isinstance(c, Color):
            cluster_number += 1
            # Assign `cluster_number` to `square` and every same-colored square connected to it
            Q = [square] # squares still to visit for cluster `cluster_number` (popped last-in, first-out)
            while Q: 
                sq = Q.pop()
                if sq in grid and grid[sq] == c: 
                    grid[sq] = cluster_number
                    Q.extend(neighbors(sq))
    return grid

def mean_cluster_size(grids: Collection[Grid]) -> float: 
    """Mean size of the clusters in a collection of grids."""
    return mean(len(grid) / max(cluster(grid).values()) for grid in grids)
    
def neighbors(square: Square) -> List[Square]:
    """The four neighbors of a square."""
    (x, y) = square
    return [(x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)]
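
As a cross-check on the flood fill, here is a minimal sketch of a different technique, union-find (disjoint sets), that counts clusters without mutating the grid. The names `count_clusters_uf` and `find` are hypothetical and not part of the code above; the grid is assumed to be a dict of `{(x, y): color}` as in `one_grid`.

```python
from typing import Dict, Tuple

def count_clusters_uf(grid: Dict[Tuple[int, int], str]) -> int:
    """Count same-color clusters with union-find; `grid` is left unchanged."""
    parent = {sq: sq for sq in grid}  # every square starts as its own root

    def find(sq):
        """Root of sq's set, with path halving to keep the trees shallow."""
        while parent[sq] != sq:
            parent[sq] = parent[parent[sq]]
            sq = parent[sq]
        return sq

    for (x, y), color in grid.items():
        # Looking only right and down visits every adjacent pair exactly once
        for nbr in [(x + 1, y), (x, y + 1)]:
            if grid.get(nbr) == color:
                parent[find(nbr)] = find((x, y))
    return len({find(sq) for sq in grid})

# The 10 × 2 example grid from the top of the notebook
rows = ['RBRBRRRBBR', 'RRRRBBRBRB']
example = {(x, y): c for y, row in enumerate(rows) for x, c in enumerate(row)}
print(count_clusters_uf(example))
```

Because this version does not overwrite the colors, the same grid can be counted more than once, which is not true of `cluster`.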

# Answering the Questions

Here's a function to help answer the questions:

In [4]:
def do(W, h, N=30_000, colors='RB') -> None: 
    """For each width w from 1 to W, print the mean cluster size of w x h grids.
    If `N` is an integer, randomly sample `N` grids.
    If `N` is `all`, exhaustively enumerate all possible grids."""
    which = "all possible" if N is all else f"{N:,d} randomly sampled"
    print(f' Average cluster size over {which} grids of width 1–{W} and height {h}:')
    for w in range(1, W + 1):
        grids = all_grids(w, h, colors) if N is all else random_grids(w, h, N, colors)
        print(f'{w:2} × {h} grids: {mean_cluster_size(grids):6.4f}')

# One-Row Grids

Let's see what we get. First an exact calculation over all possible grids up to size 14 × 1:

In [5]:
do(14, 1, all)

 Average cluster size over all possible grids of width 1–14 and height 1:
 1 × 1 grids: 1.0000
 2 × 1 grids: 1.5000
 3 × 1 grids: 1.7500
 4 × 1 grids: 1.8750
 5 × 1 grids: 1.9375
 6 × 1 grids: 1.9688
 7 × 1 grids: 1.9844
 8 × 1 grids: 1.9922
 9 × 1 grids: 1.9961
10 × 1 grids: 1.9980
11 × 1 grids: 1.9990
12 × 1 grids: 1.9995
13 × 1 grids: 1.9998
14 × 1 grids: 1.9999


Let's compare that to the random sampling approach, which can handle wider grids:

In [6]:
do(20, 1, 30_000)

 Average cluster size over 30,000 randomly sampled grids of width 1–20 and height 1:
 1 × 1 grids: 1.0000
 2 × 1 grids: 1.5025
 3 × 1 grids: 1.7502
 4 × 1 grids: 1.8762
 5 × 1 grids: 1.9374
 6 × 1 grids: 1.9726
 7 × 1 grids: 1.9912
 8 × 1 grids: 1.9987
 9 × 1 grids: 1.9960
10 × 1 grids: 2.0011
11 × 1 grids: 1.9928
12 × 1 grids: 1.9976
13 × 1 grids: 2.0020
14 × 1 grids: 1.9991
15 × 1 grids: 2.0034
16 × 1 grids: 1.9952
17 × 1 grids: 1.9995
18 × 1 grids: 1.9984
19 × 1 grids: 1.9925
20 × 1 grids: 2.0022


The answer seems to be converging on 2. The random sampling approach has pretty good agreement with the exhaustive approach, always agreeing to within 0.01, and sometimes to within 0.001.
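
The size of those disagreements can be compared with the standard error of the sample mean. The following is a minimal sketch, assuming one-row, two-color grids, where the number of clusters is one plus the number of color changes along the row. The helpers `one_row_ratio` and `estimate` are hypothetical names introduced only for this illustration.

```python
import random
from math import sqrt
from statistics import mean, stdev

def one_row_ratio(w: int, rng: random.Random) -> float:
    """Mean cluster size (w / number of clusters) of one random w × 1 two-color grid."""
    row = [rng.choice('RB') for _ in range(w)]
    clusters = 1 + sum(a != b for a, b in zip(row, row[1:]))
    return w / clusters

def estimate(w: int, N: int, seed: int = 0):
    """Sample mean of the ratio over N random grids, and its standard error."""
    rng = random.Random(seed)
    ratios = [one_row_ratio(w, rng) for _ in range(N)]
    return mean(ratios), stdev(ratios) / sqrt(N)

m, se = estimate(w=20, N=30_000)
print(f'{m:.4f} ± {se:.4f}')
```

With `N` = 30,000 the standard error should come out at a few thousandths, which would be consistent with the deviations seen in the table.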

Now that I see these results, I can describe a mathematical justification of why the limit is 2. In fact, I have two justifications:

**First**: what is the average number of new clusters introduced per column? Half the time column *k* will be different from the previous column *k* - 1, and thus half the time a column will introduce a new cluster. So with *W* columns there will be about *W*/2 clusters (for large *W*), and the average cluster size will approach *W*/(*W*/2) = 2.

**Second**: what is the average length of a cluster that starts in a given column? Every cluster starts with one first square. Half the time, the next square will be the same color, giving at least two in a row. Continuing, we would get at least three in a row a quarter of the time, at least four in a row an eighth of the time, and so on. Adding up these probabilities gives the expected length, so using the [formula for the sum of a geometric series](https://en.wikipedia.org/wiki/Geometric_series) we get:

mean cluster size &nbsp;  =  &nbsp; Σ<sub>i∈{0,1,...∞}</sub> (1/2)<sup>i</sup> &nbsp;  =  &nbsp; 1 / (1 - (1/2)) &nbsp;  =  &nbsp;2
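
The exhaustive one-row numbers can also be reproduced exactly without enumerating grids. This is a sketch under the assumption that a *w* × 1 grid has one cluster plus one more for each color change, and that each of the *w* - 1 adjacent pairs differs independently with probability (*c* - 1)/*c* for *c* colors. The function name `exact_one_row_mean` is hypothetical.

```python
from fractions import Fraction
from math import comb

def exact_one_row_mean(w: int, ncolors: int = 2) -> Fraction:
    """Exact expected value of w / (number of clusters) for a random w × 1 grid."""
    p = Fraction(ncolors - 1, ncolors)  # chance that two adjacent squares differ
    return sum(comb(w - 1, j) * p**j * (1 - p)**(w - 1 - j) * Fraction(w, j + 1)
               for j in range(w))       # j = number of color changes in the row

for w in (1, 2, 3, 14):
    print(w, float(exact_one_row_mean(w)))
```

For two colors the sum works out to 2 - 2<sup>1 - *w*</sup>, which matches the exhaustive table (1, 1.5, 1.75, ...) and tends to 2 as *w* grows.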

# Two-Row Grids

Now consider grids that are two rows tall, using both approaches:

In [7]:
do(9, 2, all)

 Average cluster size over all possible grids of width 1–9 and height 2:
 1 × 2 grids: 1.5000
 2 × 2 grids: 2.1250
 3 × 2 grids: 2.4688
 4 × 2 grids: 2.6682
 5 × 2 grids: 2.7920
 6 × 2 grids: 2.8737
 7 × 2 grids: 2.9305
 8 × 2 grids: 2.9717
 9 × 2 grids: 3.0027


With widths up to 9, the mean is getting close to 3, but I don't think it is converging to exactly 3. 

In [8]:
do(25, 2)

 Average cluster size over 30,000 randomly sampled grids of width 1–25 and height 2:
 1 × 2 grids: 1.5005
 2 × 2 grids: 2.1190
 3 × 2 grids: 2.4684
 4 × 2 grids: 2.6611
 5 × 2 grids: 2.7899
 6 × 2 grids: 2.8772
 7 × 2 grids: 2.9399
 8 × 2 grids: 2.9784
 9 × 2 grids: 3.0066
10 × 2 grids: 3.0256
11 × 2 grids: 3.0587
12 × 2 grids: 3.0636
13 × 2 grids: 3.0728
14 × 2 grids: 3.0804
15 × 2 grids: 3.0969
16 × 2 grids: 3.1046
17 × 2 grids: 3.1094
18 × 2 grids: 3.1201
19 × 2 grids: 3.1187
20 × 2 grids: 3.1297
21 × 2 grids: 3.1326
22 × 2 grids: 3.1296
23 × 2 grids: 3.1403
24 × 2 grids: 3.1367
25 × 2 grids: 3.1383


Now it is clear that the mean is well above 3. At this point I tried to come up with a mathematical analysis but made a mistake. Fortunately, Zach [provides the answer](https://thefiddler.substack.com/p/can-you-eclipse-via-ellipse) by breaking down the possibilities into cases. With the single-row grid, we only needed two cases: column *k* introduced a new cluster 1/2 the time. But with two rows there are 16 cases: each of the two squares in column *k* and the two in column *k* - 1 can be either red or blue. Overall, as Zach's diagram below shows, there are 10 new clusters introduced in 16 cases, so the number of new clusters per column is 10/16, and the average cluster size over *W* columns is 2*W* / ((10/16)·*W*) = 16/5 = 3.2.

![](https://substackcdn.com/image/fetch/w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb06d848c-37f6-4e9b-be38-a88bfe3e7316_1342x1154.png)
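
The count of 10 new clusters in 16 cases can be checked mechanically. The following is a sketch of that counting, not Zach's own derivation; `new_clusters` is a hypothetical helper that represents each column as a two-character string of (top, bottom) colors.

```python
from collections import Counter
from itertools import product

def new_clusters(prev: str, col: str) -> int:
    """Number of clusters that start in two-square column `col`,
    given the column `prev` immediately to its left."""
    if col[0] == col[1]:
        # One vertical block; it is new only if neither left neighbor shares its color
        return int(col[0] != prev[0] and col[1] != prev[1])
    # Two separate squares; each is new if its left neighbor has a different color
    return int(col[0] != prev[0]) + int(col[1] != prev[1])

columns = [''.join(c) for c in product('RB', repeat=2)]
tally = Counter(new_clusters(prev, col) for prev in columns for col in columns)
total = sum(n * cases for n, cases in tally.items())
print(tally, total)
```

Of the 16 cases, 8 introduce no new cluster, 6 introduce one, and 2 introduce two, for a total of 10.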

# Three-Row Grids

Let's next consider grids of size *w* × 3:

In [9]:
do(6, 3, all)

 Average cluster size over all possible grids of width 1–6 and height 3:
 1 × 3 grids: 1.7500
 2 × 3 grids: 2.4688
 3 × 3 grids: 2.8986
 4 × 3 grids: 3.1591
 5 × 3 grids: 3.3266
 6 × 3 grids: 3.4408


In [10]:
do(25, 3)

 Average cluster size over 30,000 randomly sampled grids of width 1–25 and height 3:
 1 × 3 grids: 1.7507
 2 × 3 grids: 2.4668
 3 × 3 grids: 2.8953
 4 × 3 grids: 3.1560
 5 × 3 grids: 3.3236
 6 × 3 grids: 3.4452
 7 × 3 grids: 3.5230
 8 × 3 grids: 3.5881
 9 × 3 grids: 3.6304
10 × 3 grids: 3.6695
11 × 3 grids: 3.6897
12 × 3 grids: 3.7266
13 × 3 grids: 3.7455
14 × 3 grids: 3.7636
15 × 3 grids: 3.7772
16 × 3 grids: 3.7995
17 × 3 grids: 3.8018
18 × 3 grids: 3.8090
19 × 3 grids: 3.8220
20 × 3 grids: 3.8290
21 × 3 grids: 3.8318
22 × 3 grids: 3.8514
23 × 3 grids: 3.8500
24 × 3 grids: 3.8580
25 × 3 grids: 3.8639


The mathematical analysis here is much trickier. With two rows, we can tell how many new clusters are introduced in column *k* just by looking at column *k* - 1. But with three or more rows, that's no longer true. Suppose the top and bottom squares in a column are red. Is that one red cluster or two? We have to look a potentially *unbounded* number of columns away to see if they are connected (by a column that has red in all three squares). So no local analysis can determine the number of new clusters per column; we'll need some new method of analysis. 
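
A small example makes the non-locality concrete. This sketch builds two three-row grids that end in the same two columns but differ in an earlier column, and measures how the last column changes the cluster count. The helpers `count_clusters_in_columns` and `change_from_last_column` are hypothetical, and grids are given as lists of column strings purely for this illustration.

```python
from typing import List

def count_clusters_in_columns(columns: List[str]) -> int:
    """Number of clusters in a grid given as a list of columns (strings of colors)."""
    grid = {(x, y): c for x, col in enumerate(columns) for y, c in enumerate(col)}
    seen, count = set(), 0
    for start in grid:
        if start in seen:
            continue
        count += 1
        stack = [start]  # flood fill outward from `start`
        while stack:
            sq = stack.pop()
            if sq in seen or grid.get(sq) != grid[start]:
                continue
            seen.add(sq)
            x, y = sq
            stack.extend([(x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)])
    return count

def change_from_last_column(columns: List[str]) -> int:
    """Net change in the cluster count caused by appending the last column."""
    return count_clusters_in_columns(columns) - count_clusters_in_columns(columns[:-1])

connected = ['RRR', 'RBR', 'RRR']  # top and bottom reds already joined through column 0
separate  = ['BBB', 'RBR', 'RRR']  # top and bottom reds are separate until the last column
print(change_from_last_column(connected), change_from_last_column(separate))
```

Both grids end in the columns `RBR`, `RRR`, yet the last column changes the cluster count by 0 in the first grid and by -1 in the second, because only in the second does it merge two previously separate red clusters.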

Interestingly, this is the same limitation that Minsky and Papert noticed in the book [Perceptrons](https://direct.mit.edu/books/monograph/3132/PerceptronsAn-Introduction-to-Computational); their analysis caused many researchers to draw the conclusion that neural networks were no good; they should have instead drawn the conclusion that single-layer neural networks are no good but multi-layer ones avoid the limitations.

# Adding Colors

What if we add a third color? That should make the average cluster size smaller (since there are more chances for neighboring squares to be different colors). I'll start with a single row:

In [11]:
do(11, 1, all, colors='RGB')

 Average cluster size over all possible grids of width 1–11 and height 1:
 1 × 1 grids: 1.0000
 2 × 1 grids: 1.3333
 3 × 1 grids: 1.4444
 4 × 1 grids: 1.4815
 5 × 1 grids: 1.4938
 6 × 1 grids: 1.4979
 7 × 1 grids: 1.4993
 8 × 1 grids: 1.4998
 9 × 1 grids: 1.4999
10 × 1 grids: 1.5000
11 × 1 grids: 1.5000


This is straightforward: the mean cluster size converges to 3/2; the analysis says that this is true because each column starts a new cluster 2/3 of the time, on average.

Now for a two-row grid with three colors. I'll use random sampling:

In [12]:
do(25, 2, colors='RGB')

 Average cluster size over 30,000 randomly sampled grids of width 1–25 and height 2:
 1 × 2 grids: 1.3320
 2 × 2 grids: 1.6609
 3 × 2 grids: 1.7763
 4 × 2 grids: 1.8300
 5 × 2 grids: 1.8569
 6 × 2 grids: 1.8685
 7 × 2 grids: 1.8816
 8 × 2 grids: 1.8874
 9 × 2 grids: 1.8936
10 × 2 grids: 1.9009
11 × 2 grids: 1.9026
12 × 2 grids: 1.9045
13 × 2 grids: 1.9031
14 × 2 grids: 1.9056
15 × 2 grids: 1.9151
16 × 2 grids: 1.9119
17 × 2 grids: 1.9110
18 × 2 grids: 1.9121
19 × 2 grids: 1.9106
20 × 2 grids: 1.9144
21 × 2 grids: 1.9152
22 × 2 grids: 1.9139
23 × 2 grids: 1.9156
24 × 2 grids: 1.9165
25 × 2 grids: 1.9156


This seems to converge to about 1.92. To analyze the two-color case we had to look at 2<sup>4</sup> = 16 cases; with three colors we have 3<sup>4</sup> = 81 cases; I haven't gone through them all to do the calculation. (But you can.)
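
The cases can be enumerated by program rather than by hand. This is a sketch that assumes the two-row argument carries over unchanged to any number of colors, namely that the limiting mean cluster size is 2 divided by the average number of new clusters per column. The function name `two_row_limit` is hypothetical.

```python
from fractions import Fraction
from itertools import product

def two_row_limit(colors: str) -> Fraction:
    """Limiting mean cluster size for two infinitely long rows, assuming it equals
    2 / (average number of new clusters introduced per column)."""
    columns = list(product(colors, repeat=2))  # all (top, bottom) color pairs
    new = 0
    for (a, b), (c, d) in product(columns, repeat=2):  # (previous column, current column)
        if c == d:
            new += (c != a and d != b)  # one block; new only if it matches nothing to its left
        else:
            new += (c != a) + (d != b)  # two squares; each new if it differs from its left neighbor
    per_column = Fraction(new, len(columns) ** 2)
    return 2 / per_column

print(two_row_limit('RB'), two_row_limit('RGB'))
```

Under that assumption, `two_row_limit('RB')` reproduces 16/5, and `two_row_limit('RGB')` gives 27/14 ≈ 1.929. That is a little above the sampled values at width 25, just as 3.2 is a little above the sampled two-color values.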

# Larger Grids

What about a grid that is unbounded in all directions? I can't fit that into a finite computer, but I can easily examine random grids of size 100 × 100 or more:

In [13]:
mean_cluster_size(random_grids(100, 100, N=200))

7.2028626172827535

In [14]:
mean_cluster_size(random_grids(200, 200, N=200))

7.400513340948056

In [15]:
mean_cluster_size(random_grids(300, 300, N=200))

7.468919989722078

I think that what's happening is that the clusters that are near the edge of the grid get arbitrarily cut off, and since the edge is a smaller percentage of larger grids, a larger grid has a larger mean cluster size, one that is a better representative of what would happen on an infinite grid. But I can't say exactly what the mean converges to; probably somewhere around 7.5.

# Examining Random Clusters

On an *n* × 1 grid a cluster is a straight line. But what do clusters look like on an arbitrarily large grid? I'll define `show` to print a grid (and return the number of colored squares in the grid), and `cluster_at(grid, square)` to return just the part of `grid` that contains `square` and all the other members of square's cluster.

In [16]:
def show(grid: Grid) -> int:
    """Print a representation of the grid and return the number of squares in it."""
    xs = sorted({x for (x, y) in grid})
    ys = sorted({y for (x, y) in grid})
    for y in ys:
        print(*[grid.get((x, y), ' ') for x in xs])
    return len(grid)

def cluster_at(grid: Grid[Square, int], square: Square, color='#') -> Grid[Square, Color]:
    """The cluster that square belongs to in grid. Grid must already be clustered."""
    cluster_number = grid[square]
    assert isinstance(cluster_number, int), "grid must be already clustered by cluster(grid)"
    return {sq: color for sq in grid if grid[sq] == cluster_number}

I'll make a 300 x 300 grid and then show the cluster at four different squares, chosen more-or-less randomly:

In [17]:
random.seed(1234)

grid300 = cluster(random_grid(300, 300))

In [18]:
show(cluster_at(grid300, (150, 150)))

                                            # # #               # #    
                                    #       # # # # #             #    
                                    # #     # # #               # # #  
                                    # #       #               # #   #  
                                # # #       # # # #       #   # #      
                                # # # #     #       #     #   # #      
                                  #   #     # # # # # #   # # # #      
                                    # # # # #     # # #       #   # # #
                                  #         #       # #   #   # # # # #
                              # # # # #   # #       # #   #   #     #  
                # #         # # # #     #     # # # # # # # # # #   #  
                # # # #   #     # # # # # # #   # # # #   #         #  
                # #   # # # # # # #     #   # # #   # #   # # # #   #  
      # #         #           #   #     #     #     # #     #   

240

In [19]:
show(cluster_at(grid300, (150, 200)))

                              # #              
                        # # # #                
                          #                    
                          #                    
                        # #                    
                      # #                      
                      #                        
                # # # # # # #   #     # #      
                #   #   #   #   #   # # # #    
                            # # # #   # # #    
                # #     # # #   #   #   #      
              # # # # # # #     # # # # #      
          # # # #               #   # #        
        # # # #   #   #           # #   #      
          #   # # #   # #       # #     #   #  
        # # #   #   # # #         # #   # # # #
      #     #       # # #       # # # # # # #  
  # # # # # # # # # # #   # # # #   #          
# #     # # #   # #   #       # # #            
# #     #           # # # #   #                
    # # #           # #                 

164

In [20]:
show(cluster_at(grid300, (200, 150)))

                                                                  #   #   #                  
                                                                # # # # # #                  
                                                                    #     # #     #          
                                                                    # #   #       # #        
                                                                      #   # #       # # # #  
                                                                    # # #   #     # #     # #
                                                        #         # #   # # # # # #          
                                                      # #   # # # # # #     # #              
                                                        # # #   # #                          
                                                            #   #                            
                                                            

293

In [21]:
show(cluster_at(grid300, (200, 200)))

#


1

We see that there is a lot of variation in size and shape of the clusters. To get a better idea of the variation I introduce `cluster_counts(S)` to do the following: make an S x S grid; cluster it; define `sizes` such that `sizes[cluster_number]` gives the cluster's size; define the set of squares in the `perimeter`; define the set of clusters that touch the perimeter (and thus have a size that might be smaller than it should be); and finally return a Counter that for each cluster size gives the number of squares that are in a cluster of that size:

In [22]:
def cluster_counts(S: int) -> Counter:
    """For a random SxS grid, return a counter of {cluster_size: squares_with_that_size},
    for all squares that are not part of a cluster that touches the perimeter."""
    grid = cluster(random_grid(S, S)) # Grid of {square: cluster_number}
    sizes = Counter(grid.values()) # Counter of {cluster_number: cluster_size}
    perimeter = cross([0, S - 1], range(S)) | cross(range(S), [0, S - 1]) # Squares on perimeter
    perimeter_clusters = {grid[p] for p in perimeter} # Cluster numbers on perimeter squares
    return Counter(sizes[grid[sq]] for sq in grid if grid[sq] not in perimeter_clusters)

def cross(xs, ys) -> Set[Square]: return {(x, y) for x in xs for y in ys}

I'll then define `counter_stats` to return a dict of statistics about a counter, and use it to explore our cluster counts:

In [23]:
def counter_stats(counter: Counter, common=10) -> dict:
    """Return a dict of some statistics for the values in this counter."""
    return dict(mean=mean(counter.elements()), N=sum(counter.values()), 
                range=range(min(counter), max(counter) + 1), common=counter.most_common(common))

counter_stats(cluster_counts(1000))

{'mean': 58.88548665248869,
 'N': 974751,
 'range': range(1, 734),
 'common': [(1, 62777),
  (2, 31428),
  (3, 28983),
  (4, 25780),
  (5, 23480),
  (6, 21900),
  (7, 20412),
  (8, 18168),
  (9, 17316),
  (10, 16280)]}

Wait a minute ... earlier we saw that the average cluster size on a 300 × 300 grid was about 7.5. But the `counter_stats` here give a `mean` cluster size of somewhere around 60. How can that be? It is because the two numbers are answering two different questions. **7.5** is the answer to "if we randomly pick a **cluster**, what is its expected size?" **60** is the answer to "if we randomly pick a **square**, what is the expected size of the square's cluster?"
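
The difference between the two questions shows up even on the small 10 × 2 example grid from the top of the notebook. This is a minimal, self-contained sketch; `cluster_sizes` is a hypothetical helper that takes the grid as a list of row strings rather than the `Grid` dict used above.

```python
def cluster_sizes(rows):
    """Sizes of all clusters in a grid given as a list of equal-length row strings."""
    grid = {(x, y): c for y, row in enumerate(rows) for x, c in enumerate(row)}
    sizes, seen = [], set()
    for start in grid:
        if start in seen:
            continue
        size, stack = 0, [start]  # flood fill outward from `start`, counting squares
        while stack:
            sq = stack.pop()
            if sq in seen or grid.get(sq) != grid[start]:
                continue
            seen.add(sq)
            size += 1
            x, y = sq
            stack.extend([(x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)])
        sizes.append(size)
    return sizes

sizes = cluster_sizes(['RBRBRRRBBR', 'RRRRBBRBRB'])  # the 10 × 2 example grid
per_cluster = sum(sizes) / len(sizes)                # pick a random cluster
per_square = sum(s * s for s in sizes) / sum(sizes)  # pick a random square
print(per_cluster, per_square)
```

For that grid the first number is 20/9 ≈ 2.22 and the second is 70/20 = 3.5. A cluster of size *s* is counted once in the first average but *s* times in the second, so the large clusters dominate.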

In the output from `counter_stats`, the `common` entry gives a list of the ten most common sizes (which happen to be the integers from 1 to 10), paired with the counts of how many squares are in a cluster of that size. About 1/16 of the squares are in a cluster of size 1 (which happens when all 4 neighbors of the square get the opposite color). About half as many squares are in a cluster of size 2. A cluster with hundreds of squares is far less common, but certainly possible.



# Tests

Some unit tests to give confidence in the code, and show examples of use:

In [24]:
assert all_grids(1, 3) == [
    {(0, 0): 'R', (0, 1): 'R', (0, 2): 'R'},
    {(0, 0): 'R', (0, 1): 'R', (0, 2): 'B'},
    {(0, 0): 'R', (0, 1): 'B', (0, 2): 'R'},
    {(0, 0): 'R', (0, 1): 'B', (0, 2): 'B'},
    {(0, 0): 'B', (0, 1): 'R', (0, 2): 'R'},
    {(0, 0): 'B', (0, 1): 'R', (0, 2): 'B'},
    {(0, 0): 'B', (0, 1): 'B', (0, 2): 'R'},
    {(0, 0): 'B', (0, 1): 'B', (0, 2): 'B'}]

assert all_grids(1, 2, 'RGB') == [
    {(0, 0): 'R', (0, 1): 'R'},
    {(0, 0): 'R', (0, 1): 'G'},
    {(0, 0): 'R', (0, 1): 'B'},
    {(0, 0): 'G', (0, 1): 'R'},
    {(0, 0): 'G', (0, 1): 'G'},
    {(0, 0): 'G', (0, 1): 'B'},
    {(0, 0): 'B', (0, 1): 'R'},
    {(0, 0): 'B', (0, 1): 'G'},
    {(0, 0): 'B', (0, 1): 'B'}]

grid6x1 = one_grid(6, 1, 'RRBBBR')
assert grid6x1 == {(0, 0): 'R', (1, 0): 'R', (2, 0): 'B', (3, 0): 'B', (4, 0): 'B', (5, 0): 'R'}
assert cluster(grid6x1) == {(0, 0): 1, (1, 0): 1, (2, 0): 2, (3, 0): 2, (4, 0): 2, (5, 0): 3}
assert mean_cluster_size([grid6x1]) == 2

grid5x3 = one_grid(5, 3, 'RR:RR'
                         '.R:R.'
                         '.RRR.')
assert cluster(grid5x3) == {
    (0, 0): 1, (1, 0): 1, (2, 0): 2, (3, 0): 1, (4, 0): 1,
    (0, 1): 3, (1, 1): 1, (2, 1): 2, (3, 1): 1, (4, 1): 4,
    (0, 2): 3, (1, 2): 1, (2, 2): 1, (3, 2): 1, (4, 2): 4}
assert mean_cluster_size([grid5x3]) == 3.75

grid10x2 = one_grid(10, 2, 'RBRBRRRBBR' # Example from diagram at top of notebook
                           'RRRRBBRBRB') 
assert mean_cluster_size([grid10x2]) == 20/9

grid3x4 = random_grid(3, 4) # width 3, height 4
assert len(grid3x4) == 12

assert cross((10, 20, 30), (1, 2, 3)) == {(10, 1), (10, 2), (10, 3), (20, 1), (20, 2), (20, 3), (30, 1), (30, 2), (30, 3)}