# Analyzing Theoretical and Practical Implementations of Set Algorithms: Part 1

Here, I implemented two different set algorithms in Python, sorted list and unsorted list. I also compared their performances for inserting and searching a value in a set. The link to my code on GitHub is here. This is the first part of 2-part series. In the next blog entry, I will write about balanced and unbalanced trees, and their performances.

## Algorithms

The two algorithms I compared are sorted list and unsorted list. These are the steps I took for my project:

With a set algorithm, for each n = 2^{5}, 2^{9}, 2^{13}, and 2^{17}:

1. Create a set S1, and time the following:

a. Insert each even number in range(0, n) in order into a set S1.

b. Check whether S1 contains each value in range(0, n).

2. Create a list of n random numbers, and a new set S2. Time the following:

a. Insert the first n/2 random numbers into S2.

b. Check whether S2 contains each random number.

In summary, I tested insertion and containment check for n sizes of 2^{4},
2^{8}, 2^{12}, and 2^{16}, in incremental and random orders.

### Sorted List Set

When a `SortedListSet`

is initialized, it creates an empty list:

For sorting, I used binary search algorithm for finding the index of the value when inserting and checking containment. The binary search algorithm returns the index of the value:

`insert`

inserts the value at the correct index that
`binary_search`

returned, with Python's built-in function,
insert. `Contains`

check returns True or False depending on
whether the value at the index matches the queuried number:

Insertion with sorted list algorithm has big-O of
O(log_{2}n)+O(n), which rounds to O(n). This would be when the
inserting needs to occur at the first of the list, and all the
items in the list are scooted to the right. The big-O of
containment check is O(log_{2}(n)).

### Unsorted List Set

The code for unsorted list is simple. When the `UnsortedListSet`

is initialized, it creates an empty list:

For `insert`

and `contains`

, the algorithm checks one
number at a time from index of zero to the end of the list:

The big-O of `insert`

and `contains`

for
`UnsortedListSet`

are both O(n).

## Visualizing Performance

I collected 100 data points of each set implementation. I used
R and ggplot2
for visualizing the data. The plot of the data on a log_{2}
x-scale and y-scale looks like below:

While it appears that I collected each operation data at a different n-sizes, each group of four boxes actually belongs to one n-size. ggplot2 automatically spreads the box plots so that it's easier to see the individual boxes. I used box plot because it accentuates the subtle differences between operations with random and non-random ordered numbers, rather than overlapping them like a point plot would. Let's dive into the data a little more, and look at insertion and containment check separately.

### Containment Check: non-random and random

In the above plot in the `UnsortedListSet`

plot, you can see
that there are two extreme outliers that took about 2^{7} times
longer for completing the `contains`

operation. It is
odd that only two points are much slower than the other data
points. I think it could have been either the order of the
queries tested the worst case performance, or something
interrupted the operations somehow. When sorted in decreasing
order, you can see that these outliers are quite large
compared to the third largest value.

Either way, I decided to exclude the two outliers so that plots look cleaner. I wouldn't recommend completely excluding outliers usually, but in this case, there are only 2 points so I removed them from the data.

The big-O of containment-check would be O(log_{2}(n))
for `SortedListSet`

, and O(n)
for `UnsortedListSet`

.

It is clear that the time taken for `contains`

in
`SortedListSets`

increases at a slower rate than in
`UnsortedListSets`

, as expected. It is interesting that the
`SortedListSet`

starts to significantly outperform
`UnsortedListSet`

at n > 2^{8}. More
interestingly, `SortedListSet`

is in fact slower
than `UnsortedListSet`

when n-size is small. The
overhead cost of sorting via binary search must be larger than
the benefit of checking containment in a sorted list.

The difference between checking containment of numbers in
non-random and random orders into a `SortedListSet`

almost does not exist. This is probably because every time a
new number is added to the sorted list, the algorithm does
binary search to find the correct index for the number. For
a `SortedListSet`

, whether the input numbers are in
random or non-random order does not make a difference in the
resulting list.

This, however, is not true for a `UnsortedListSet`

. If the input
numbers are given in a random order, the resulting list will
have numbers in a random order. `contains`

in a
randomly ordered list takes longer than in an ordered list, as
you can see in the plot.

### Insertion: non-random and random

The big-O of insertion would be O(log_{2}n + n), which
rounds to O(n) for `SortedListSet`

. This is the sum
of bineary search
and Python's
built-in `insert`

operation. For `UnsortedListSet`

, the big-O of
insertion is O(n).

Like in `contains`

, the time it takes to insert numbers
increases at a slower rate in `SortedListSet`

. Again,
`SortedListSet`

actually takes longer insertion time for n < 2^{8},
and starts to outperform `UnsortedListSet`

for n > 2^{8}.

Unlike `contains`

, the difference between inserting
numbers in random and non-random orders into
a `SortedListSet`

exists. This is because when
inserting numbers in order, insertion is more like
Python's `append`

, because the new number is always
added to the last of the list. In other words, none of the
items in the list is moved, and its big-O is O(1). This is
much smaller than when the inserting numbers in random order,
which will have big-O of O(n). Meanwhile, insertion
in `UnsortedListSet`

follows a similar trend as
its `contains`

.

One thing I cannot explain about insertion is how inserting randomly
ordered numbers take shorter time than ordered numbers at n = 2^{4}. It
very well could be noise in data, or an interesting trend.

## Conclusion and Closing Thoughts

I wanted to compare the real world performace to theoretical big-O
in the plots, but I had difficult time representing the big-O. For
example, I understand that insertion for `UnsortedListSet`

is O(n) but
that doesn't mean it should be plotted as y = x. What would be the
coefficient? What about the intercept?

I'm also looking forward to implementing and visualizing
TreeSets. I wrote a script for
a `UnbalancedTreeSet`

, but for n > 2^{8},
it would reach the recursion limit and crash. Currently, the
algorithm is implemented recursively, and I'm thinking about
how to do it iteratively.