Diving Into Data Structures In Swift: Sets

Part 2 of a 3-part series.

Sets

Table of contents:

A. Overview of sets
B. How sets work under the hood
C. Operations for sets
D. Collection type implementation in Swift (copy-on-write)

Overview of sets

Sets hold elements of one single type. All elements stored in sets must conform to the Hashable protocol. This is a requirement since sets store elements in what’s called a hash table data structure.

Conforming to the Hashable protocol makes the element able to compute a hash value, which is important in indicating the elements’ storage location. And to get that value, the element must go through a hash function.

  • A hash function calculates a hash value that ultimately indicates where elements are stored in memory.
  • A hash value is the representation of data in an integer value. It reduces a sequence of bytes to a single integer. These values are generally used to quickly retrieve or map data to a collection.

Now that we know these terms, they make sets have a constant — O(1) lookup time, insertion, and deletion on average. Each item you search for maps straight to where that item is stored in memory — so you don’t have to traverse through the entire database to find the item.

There’s more to this. The different operation runtimes depend entirely on how well the hash function is implemented. The more hash functions have the ability to produce an even distribution of values, the more efficient lookup time is.

But with bad hash functions, multiple elements are mapped to the same point in memory. Thus, lookup time becomes proportional to the number of elements stored in the collection. Later, we’ll see a diagram of how sets look under the hood with good and bad hash functions.

This seems complex since sets have a very unique way of storing their elements. In practice, they’re actually stored similar to dictionaries — which we’ll dive into in the final part of this series.

Ways to create sets include initializing them as empty, which requires declaring an explicit type, or predefining set elements using array literals:

// 1. empty set initialization
var numbers = Set<Int>()
var nums: Set<Int> = Set()

// 2. empty set initialization with array literal
var numbersSet: Set<Int> = []

// 3. creating set with predefined elements using array literals
var newNums: Set = [1, 2, 3, 4]

Sets: Under the Hood

As shown in the diagram above, a hash table data structure is used to implement sets. Two of the most important ingredients in a hash table are a hash function and an array. The hash function computes an integer that will indicate which index or bucket in which the array will be stored. Then depending on whether the index is available, the element will reside there or in the next available slot.

There are two solutions to resolving collisions; chaining and open addressing with linear probing. Here, along with the following diagrams, we’ll be using linear probing examples since this how Swift resolves collisions in sets and dictionaries—open addressing with linear probing.

Adding elements to a set

In step one, we’re adding the string “Lily” into the set. To know where to store it in the set, it gets passed into a hash function. This hash function specifies where to store the string “Lily” in memory, which is the 4th index.

You can use this image to envision how elements are stored in a set. Each time, it goes through the three steps. This is true except in cases where you’re adding onto a set that’s just about to exceed its capacity. In these cases, the set has to allocate more memory.

var numbers = Set<Int>()

numbers.insert(3)   // (true, 3)
numbers.insert(22)  // (true, 22)
numbers.insert(22)  // (false, 22)

We use the insert(_:) method to add elements into a set. The return value of this set is a tuple, where the value at index 0 is a bool indicating whether the element was successfully added, and the value at index 1 indicates whether the element has been added, or not.

Reallocation on sets

Sets have a fixed amount of elements they can store before having to reallocate new memory to store more elements—just like dynamic arrays. These elements are stored in what’s called the storage buffer, which is just another name for storage.

Reallocation is costly since all elements have to be copied over to the set at a different point in memory, requiring all elements to go through another hash function — rehashing — to know the new index where it will be stored. We resolve this reallocation by using the minimumCapacity(_:), which preallocates space for at least the specified number of elements.

So if we know in advance how many elements will be stored, we can initialize the set using the minimumCapacity(_:) method. As such, until we add to one greater than the specified number, reallocation does not occur.

var numbers = Set<Int>(minimumCapacity: 30)
numbers.capacity  // 48

As seen in the above code snippet, minimumCapacity(_:) initialization only determines that there will be at least that many elements before reallocation, but it doesn’t necessarily mean that it can only contain that many elements until having to reallocate more memory.

Hash functions play a big role in determining how sets look like under the hood. As mentioned above, hash functions have a direct correlation with the run time of different operations on sets. Let’s compare and contrast.

There are things as good and bad hash functions. In this case, good vs bad is dependent on the ability to compute different values for the different data passed in. In bad hash functions, the values computed are the same for mostly every data passed in. This makes a lot of data point to the same index and requires that it traverse down the array until it finds an empty index to fill from the initial point.

Operations for Sets

A. Adding

  • insert(_:) Adds the element to the set if it’s not yet a part of it. The return value is a tuple, where index 0 is a bool indicating whether the element was successfully added, and index 1 indicating the element added.

B. Removing

  • remove(_:) Removes the element in the set. The return value is the element removed from the set; otherwise, it’s nil if not found in the set.
  • removeAll() Removes all elements in the set.
  • removeAll(keepingCapacity: ) Removes all elements in the set but reserves the original capacity size.
  • removeFirst() Removes first elements of the set. But the element may not be the first element added to the set.

C. Modifying

Note: These functions actually mutate the original set.

  • formUnion(_:) Inserts elements of another set or sequence into the current set.
  • formIntersection(_:) Removes all elements in the current set that are not contained in the other set.
  • formSymmetricDifference(_:) Replaces set with elements contained in the current set or other set/sequence, but not in both.
  • subtract(_:) Removes elements from the other set or sequence from the current set.

D. Set operations

Note: These functions create a new set and return it based on the original set and other set or sequence.

  • union(_:) Returns a new set that contains elements in both the original set and the other set or sequence.
  • intersection(_:) Returns a new set of all elements that are present in both the current set and the other set or sequence, but not in only one.
  • subtracting(_:) Returns a new set containing all elements, minus elements that are in the other set or sequence.
  • symmetricDifference(_:) Returns a new set of elements either in the current set or other set or sequence, but not in both.

E. Comparing

  • (==) Equal operator. Bool return value indicating if both sets contain the same elements.
  • elementsEqual(_:) Bool return value indicating if all elements in the current set are equal to the other sequence in the same order.
  • isSubset(_:) Bool return value indicating if all members of the set are contained in other set or sequence. However, the set must not contain an equal amount of elements.
  • isStrictSubset(_:) Bool return value indicating if all elements in the set are contained in the other set or sequence. Requirement for the other set is that it must contain at least one other element.
  • isSuperset(_:) Bool return value indicating if all members in the set are contained in the other set or sequence.
  • isStrictSuperset(_:) Bool return value indicating if all members in the set are contained in the other set or sequence. Requirement for the set to contain at least one more member than that of the other.
  • isDisjoint(_:) Bool return value indicating if all elements in current set are not members in the other set or sequence.

To top this discussion off, I want to include another important piece of the puzzle included in understanding data structures / collections, that I’m covering in this 3-part series (Arrays, Sets, and Dictionaries). Continue below for more on how these collections are structs and how they’re implemented using a special technique called copy-on-write.

Collection Type Implementation in Swift (copy-on-write)

Most types in Swift are defined as structs, with enums and classes making up a smaller percentage. As mentioned, all collection types in Swift — arrays, sets, and dictionaries — are also defined as structs. This is because they can be accessed directly — being a value type — making it extremely fast compared to reference types. And the compiler puts structs on the stack rather than the heap in memory. Read more on memory management with heaps and stacks.

Here we can see where and how collections are implemented using a technique called copy-on-write.

copy-on-write is a technique where the value gets copied only when we mutate it. In the code snippet above, we create a set x and another variable y that assigns the value of x to it. Internally, each of these Set structs contains a reference to some memory. This memory stores all the elements within the set and is located on the heap. Both the sets point to the same location in memory — but the moment we mutate either one, the shared reference gets detected and the memory is copied.

So this means that there’s the expense in copying the elements only when it has to — which is when the value gets mutated.

copy-on-write doesn’t come with all structs we create. They must be implemented by creating a custom class that contains NSMutableData and the isKnownUniquelyReferenced function to find the uniqueness of a reference. And then we can include that custom class in the struct we created.

We’ve completed the second part of Diving into Data Structures—this one on sets! Thank you for sticking with me this far. I hope you’ve become more familiar with sets and their various operations. Drop some claps before you decide to go! 👏👏👏 Have any questions? 💡 Contact me or comment down below. 😃

Resources

Advance Swift — Chris Eidhof, Ole Begemann, and Airspeed Velocity
Collection Types — The Swift Programming Language
Hashable/Hasher — by Mattt
Hashable — SwiftDoc

📌 Find me on LinkedIn and GitHub here. 😄

Avatar photo

Fritz

Our team has been at the forefront of Artificial Intelligence and Machine Learning research for more than 15 years and we're using our collective intelligence to help others learn, understand and grow using these new technologies in ethical and sustainable ways.

Comments 0 Responses

Leave a Reply

Your email address will not be published. Required fields are marked *