Here, I implemented two different set algorithms in Python, sorted list and unsorted list. I also compared their performances for inserting and searching a value in a set. The link to my code on GitHub is here. This is the first part of 2-part series. In the next blog entry, I will write about balanced and unbalanced trees, and their performances.
The two algorithms I compared are sorted list and unsorted list. These are the steps I took for my project:
With a set algorithm, for each n = 25, 29, 213, and 217:
1. Create a set S1, and time the following:
a. Insert each even number in range(0, n) in order into a set S1.
b. Check whether S1 contains each value in range(0, n).
2. Create a list of n random numbers, and a new set S2. Time the following:
a. Insert the first n/2 random numbers into S2.
b. Check whether S2 contains each random number.
In summary, I tested insertion and containment check for n sizes of 24, 28, 212, and 216, in incremental and random orders.
Sorted List Set
SortedListSet is initialized, it creates an empty list:
For sorting, I used binary search algorithm for finding the index of the value when inserting and checking containment. The binary search algorithm returns the index of the value:
insert inserts the value at the correct index that
binary_search returned, with Python's built-in function,
Contains check returns True or False depending on
whether the value at the index matches the queuried number:
Insertion with sorted list algorithm has big-O of O(log2n)+O(n), which rounds to O(n). This would be when the inserting needs to occur at the first of the list, and all the items in the list are scooted to the right. The big-O of containment check is O(log2(n)).
Unsorted List Set
The code for unsorted list is simple. When the
is initialized, it creates an empty list:
contains, the algorithm checks one
number at a time from index of zero to the end of the list:
The big-O of
UnsortedListSet are both O(n).
I collected 100 data points of each set implementation. I used R and ggplot2 for visualizing the data. The plot of the data on a log2 x-scale and y-scale looks like below:
While it appears that I collected each operation data at a different n-sizes, each group of four boxes actually belongs to one n-size. ggplot2 automatically spreads the box plots so that it's easier to see the individual boxes. I used box plot because it accentuates the subtle differences between operations with random and non-random ordered numbers, rather than overlapping them like a point plot would. Let's dive into the data a little more, and look at insertion and containment check separately.
Containment Check: non-random and random
In the above plot in the
UnsortedListSet plot, you can see
that there are two extreme outliers that took about 27 times
longer for completing the
contains operation. It is
odd that only two points are much slower than the other data
points. I think it could have been either the order of the
queries tested the worst case performance, or something
interrupted the operations somehow. When sorted in decreasing
order, you can see that these outliers are quite large
compared to the third largest value.
Either way, I decided to exclude the two outliers so that plots look cleaner. I wouldn't recommend completely excluding outliers usually, but in this case, there are only 2 points so I removed them from the data.
The big-O of containment-check would be O(log2(n))
SortedListSet, and O(n)
It is clear that the time taken for
SortedListSets increases at a slower rate than in
UnsortedListSets, as expected. It is interesting that the
SortedListSet starts to significantly outperform
UnsortedListSet at n > 28. More
SortedListSet is in fact slower
UnsortedListSet when n-size is small. The
overhead cost of sorting via binary search must be larger than
the benefit of checking containment in a sorted list.
The difference between checking containment of numbers in
non-random and random orders into a
almost does not exist. This is probably because every time a
new number is added to the sorted list, the algorithm does
binary search to find the correct index for the number. For
SortedListSet, whether the input numbers are in
random or non-random order does not make a difference in the
This, however, is not true for a
UnsortedListSet. If the input
numbers are given in a random order, the resulting list will
have numbers in a random order.
contains in a
randomly ordered list takes longer than in an ordered list, as
you can see in the plot.
Insertion: non-random and random
The big-O of insertion would be O(log2n + n), which
rounds to O(n) for
SortedListSet. This is the sum
of bineary search
UnsortedListSet, the big-O of
insertion is O(n).
contains, the time it takes to insert numbers
increases at a slower rate in
SortedListSet actually takes longer insertion time for n < 28,
and starts to outperform
UnsortedListSet for n > 28.
contains, the difference between inserting
numbers in random and non-random orders into
SortedListSet exists. This is because when
inserting numbers in order, insertion is more like
append, because the new number is always
added to the last of the list. In other words, none of the
items in the list is moved, and its big-O is O(1). This is
much smaller than when the inserting numbers in random order,
which will have big-O of O(n). Meanwhile, insertion
UnsortedListSet follows a similar trend as
One thing I cannot explain about insertion is how inserting randomly ordered numbers take shorter time than ordered numbers at n = 24. It very well could be noise in data, or an interesting trend.
Conclusion and Closing Thoughts
I wanted to compare the real world performace to theoretical big-O
in the plots, but I had difficult time representing the big-O. For
example, I understand that insertion for
UnsortedListSet is O(n) but
that doesn't mean it should be plotted as y = x. What would be the
coefficient? What about the intercept?
I'm also looking forward to implementing and visualizing
TreeSets. I wrote a script for
UnbalancedTreeSet, but for n > 28,
it would reach the recursion limit and crash. Currently, the
algorithm is implemented recursively, and I'm thinking about
how to do it iteratively.