Union-Find Algorithm
Robert Sedgewick and Kevin Wayne
Copyright © 2009 · January 29, 2009 11:37:05 PM
· From two words: UNION which means “to connect two objects” and FIND which defines “the path connecting the two objects.”
· Transitivity
o If p is connected to q and q is connected to r, then p is connected to r
o Connected components. Maximal set of objects that are mutually connected
· Union-Find applications involve manipulating objects of all types.
• Variable name aliases.
• Pixels in a digital photo.
• Computers in a network.
• Web pages on the Internet.
• Transistors in a computer chip.
• Metallic sites in a composite system.
· Union-find applications
Network connectivity.
• Percolation.
• Image processing.
• Least common ancestor.
• Equivalence of finite state automata.
• Hinley-Milner polymorphic type inference.
• Kruskal's minimum spanning tree algorithm.
• Games (Go, Hex)
• Compiling equivalence statements in Fortran.
Lecturer: Richard Karp
Scribe: Cheng Tien Ee
Spring 2006
A Union-Find data structure maintains a partition of a set X of size n. For each set A in the partition it maintains a representative S(A) contained in A. The initial partition has each element in a set by itself.
Operations include:
Union(S(A), S(B)), where S(A) 6= S(B), replaces A and B by A [ B, and specifies a representative for A [ B.
Find(x), where x 2 X, returns S(Ax), the representative of the set containing x.
We consider a Union-Find data structure in which each set A is represented by a forest where nodes are the elements of A, with S(A) at the root and an edge from each node to its parent. The entire data structure is a forest F with a tree for each set in the partition. The parent pointers are stored in an array.
Figure 8.1: (a)-(c) Examples of trees, with the black nodes representing roots of the corresponding trees. (d) Representation of parent pointers in array form for tree shown in (a); the root node is represented using a *
Union(S(A), S(B)) is easy to execute, just add a pointer from S(A) to S(B), or from S(B) to S(A).
Find(x) is executed by following pointers from x up to S(x).
The cost of Union(S(A), S(B)) is 0 and that of Find(x) is the number of pointers changed. At most n − 1
Union operations can be executed.
We consider an implementation of Find involving two stages:
1. Following pointers from x to S(Ax),
2. Retracing the path, replacing the parent pointer of each node (except the last) by a pointer to S(Ax).
In general, Union operations are interspersed with Find operations. However, given any sequence of operations, we can define another in which the Unions are all performed first, and then the same paths as in the original algorithm are compressed. However, these paths no longer must go all the way to the root. This separation of Unions from Finds simplifies the analysis.
Lemma 8.1 Given any sequence of Union and path compression operations, there is a sequence with the same number of Unions and the same cost, in which the Unions come first, followed by partial path compressions in 1-1 correspondence with the original path compressions.
A partial path compression takes a path p ending at a vertex y, which is not the root, and causes every vertex in p to point to the parent of y.
We now begin our analysis of sequences of partial path compressions, starting with a given forest F. A partition of X into disjoint sets Xb and Xt is called a dissection of X if Xt is ancestor-closed, i.e. if x 2 Xt then every ancestor of x in F is in Xt. Figure 8.3 shows an example of Xt in F.
Lemma 8.2 Let C be a sequence of |C| compressions in a forest F with node set X. Let Xb,Xt be a dissection of F, and let the forests induced respectively by Xb and Xt be denoted F(Xb) and F(Xt). Let C be a sequence of |C| compressions in F. Then there exists a sequence Cb of compressions in F(Xb) and a sequence Ct of compressions in F(Xt) such that
|Cb| + |Ct| = |C| (8.1)
and
cost(C) _ cost(Cb) + cost(Ct) + |Xb| + |Ct| (8.2)
Proof: For each compression in C, we define a corresponding compression in Cb or Ct, and charge the cost of the compression to one of four “accounts”. We then complete the proof by summing the amounts charged to the different accounts. Associated with each compression in C is a path p in F, ending at node y.
Case 1: All nodes of p lie in Xb and the parent of y lies in Xb. Assign p to Cb and charge cost(p) to account 1.
Case 2: All nodes of p lie in Xt. Assign p to Ct and charge its cost to account 2.
Case 3: Some nodes of p lie in Xb and some lie in Xt. Let p2 be the part of p containing nodes in Xt. Assign p2 to Ct and charge cost(p2) to account 2. For each node in p \Xb whose parent is in Xb, charge 1 to account 3, and charge 1 to account 4. Then |C| = |Cb| + |Ct| since each compression assigns a path to Cb or to Ct but not both. The total amount charged to all accounts is cost(C), and one can verify that the amounts charged are
Account 1: cost(Cb)
Account 2: cost(Ct)
Account 3: at most |Xb|, since each element of Xb causes a charge of at most 1
Account 4: at most |Ct|, since the compression of each path in Ct causes a charge of at most one to this account
Union by Rank
Recall that, in executing Union(S(A), S(B)), one can either create a pointer from S(A) to S(B), or a pointer from S(B) to S(A). The Union-by-Rank rule specifies a way of making this choice which tends to keep the trees shallow. Associate with each node x a rank rank(x) which is initially zero. When a Union requires choosing either a pointer from x to y or y to x, the pointer should run from the node of lower rank to the one of higher. If the ranks are equal, then the choice is arbitrary, but the node to which the new pointer is directed has its rank incremented by 1.
In the scenario where all Unions precede all partial path compressions, the forest created by the Unions has the following properties:
(1) a node of rank k has at least one child of each rank < k,
(2) the rank of any node is less than the rank of its parent, and
(3) for any s, the number of nodes of rank greater than s is at most n/2s+1, where n = |X|.
A forest produced using the Union-by-Rank operation, and therefore respecting (1), (2) and (3), is called a rank forest. Let f(m, n, r) be the maximum cost of a sequence of m compressions applied to a rank forest of size n, in which the root is of rank r. We shall derive amazingly low upper bounds on f(n, m, r).
Theorem 8.3 If f(m, n, r) _ km+2ng(n) for all m, n and r, then f(m, n, r) _ (k +1)m+2ng0(n) for all
m, n and r.