Speeding up anticlustering

Martin Papenberg

library(anticlust)

This vignette documents various ways by which the speed of the anticlustering method implemented in the anticlust package can be adjusted. Speedup is particularly useful for large data sets when the default anticlustering algorithm becomes too slow. A fast method may also be desirable for testing purposes, even if the final anticlustering is based on a slower method.

The exchange algorithm

The default anticlustering algorithm works by exchanging data points between clusters in such a way that exchanges improve the anticlustering objective as much as possible. Details on the exchange method may be found in Papenberg and Klau (2021; https://doi.org/10.1037/met0000301), Papenberg (2023; https://doi.org/10.1111/bmsp.12315), or the anticlust documentation (?anticlustering). Basically, running more exchanges tends to improve the results, but the improvements diminish with many repetitions, especially for large data sets. So, to speed up anticlustering(), we can reduce the number of exchanges. Here we will learn how to do that. However, first we learn how to slow down anticlustering. Slowing down can lead to better results and is recommended if you have the time.

Slowing down

The default exchange algorithm (anticlustering(..., method = "exchange")) iterates through all input elements and attempts to improve the anticlustering by swapping each input element with an element that is currently assigned to a different cluster. No swap is conducted if an element cannot be swapped in such a way that the anticlustering objective is improved. The process stops after all possible exchanges have been evaluated for each element. When the number of input elements is \(N\), this process leads to approximately \(N^2\) (that is, \(O(N^2)\)) attempted exchanges, because each element is swapped with all elements that are currently assigned to a different cluster. To give a concrete example, with \(N = 100\) data points and \(K = 4\) equal-sized groups, 75 swaps are evaluated for each element and the best swap is realized. This leads to 100 * 75 = 7500 exchanges that have to be conducted during the entire exchange algorithm, and for each exchange the objective function has to be re-evaluated. This is fewer exchanges than \(N^2 = 100^2 = 10000\) because we skip exchanges with the 25 elements that are currently in the same cluster (including the element itself). However, in terms of Big O notation, the number of exchanges is still classified as \(O(N^2)\), independent of the number of groups. Thus, the total theoretical run time of the exchange method is \(O(N^2)\) multiplied by the effort needed to compute an anticlustering objective.
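
As a quick sanity check of this arithmetic, the numbers from the example can be reproduced in a few lines of R (the toy values for N and K are the ones used in the paragraph above):

N <- 100
K <- 4
group_size <- N / K
# each element is swapped with all elements that are currently in a different cluster
partners_per_element <- N - group_size
partners_per_element      # 75 attempted swaps per element
N * partners_per_element  # 7500 attempted swaps in one pass through the data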

The results of the exchange method can be improved by not stopping after a single iteration through the data set; instead we may repeat the process until no single exchange is able to further improve the anticlustering objective, i.e., until a local maximum is found. This happens if we use anticlustering(..., method = "local-maximum"). This method corresponds to the algorithm “LCW” in Weitz and Lakshminarayanan (1998). Using the local maximum method leads to more exchanges and thus to longer running time, but also better results than the default exchange method.

Let’s compare the two exchange methods with regard to their running time, using the iris data set, which contains 150 elements.


K <- 3
system.time(anticlustering(iris[, -5], K = K, method = "exchange"))
#>        User      System verstrichen 
#>       0.007       0.000       0.008
system.time(anticlustering(iris[, -5], K = K, method = "local-maximum"))
#>        User      System verstrichen 
#>       0.038       0.000       0.039

Depending on how many iterations are needed, the default exchange method can be much faster than the local maximum method. In general, I would recommend using method = "local-maximum" for better results, but if speed is an issue, stick with the default.

To slow down even more: The exchange process may be restarted several times, each time using a different initial grouping of the elements. This is accomplished when specifying the repetitions argument, which defaults to 1 repetition of the exchange / local maximum algorithm. Thus, for better results, we may increase the number of repetitions:


K <- 3
system.time(anticlustering(iris[, -5], K = K, method = "exchange"))
#>        User      System verstrichen 
#>       0.008       0.000       0.008
system.time(anticlustering(iris[, -5], K = K, method = "local-maximum"))
#>        User      System verstrichen 
#>       0.034       0.000       0.034
system.time(anticlustering(iris[, -5], K = K, method = "local-maximum", repetitions = 10))
#>        User      System verstrichen 
#>       0.344       0.004       0.348

In this case, sticking with the default leads to much faster run times. Still, if your data set is not too large, using several repetitions may be useful (you can judge for yourself based on the results). The good news is that fewer exchanges may be enough for large data sets: anticlustering generally becomes easier with more data.

Getting fast: Using fewer exchange partners

If the default exchange method is not fast enough for your taste, it is possible to use fewer exchange partners during the anticlustering process. By default, the exchange method evaluates each possible exchange with all elements that are currently assigned to a different cluster, leading to \(O(N^2)\) exchanges. If we only use a fixed number of exchange partners per element, we can reduce the number of exchanges to \(O(N)\), corresponding to a gain of an order of magnitude in terms of run time. Using fewer exchange partners for each element may decrease the quality of the results, and is generally only recommended for (very) large data sets. But the run time will be considerably faster.
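
To see what this means in numbers, compare the approximate number of attempted exchanges with and without a fixed number of exchange partners (toy numbers; the exact counts are somewhat lower because elements in the same cluster are skipped):

N <- 1000
K <- 4
N * (N - N / K)   # about 750000 attempted exchanges when all other-cluster elements serve as partners
n_partners <- 10
N * n_partners    # about 10000 attempted exchanges with a fixed number of 10 partners per element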

One way of using fewer exchange partners is to include “preclustering” restrictions (see ?anticlustering). When setting preclustering = TRUE, the optimization restricts the exchange partners of each element to K - 1 very similar elements. Note that the preclustering algorithm itself has \(O(N^2)\) run time and has to be performed prior to the anticlustering algorithm. Thus, we have to use a larger data set to see the improvement in running time:

N <- 1000
M <- 5
K <- 3
data <- matrix(rnorm(N*M), ncol = M)
system.time(anticlustering(data, K = K))
#>        User      System verstrichen 
#>       4.279       0.000       4.279
system.time(anticlustering(data, K = K, preclustering = TRUE))
#>        User      System verstrichen 
#>       0.097       0.008       0.105

As we can see, for N = 1000, the speedup is enormous. There is an additional “hidden” method to make the anticlustering() function run faster. This method also relies on using fewer exchange partners during the exchange process, but does not use preclustering. It is documented here and mostly relies on a dirty “hack” involving the anticlustering() argument categories:

The first step used here is not strictly necessary (in the next section, we will learn more about what it accomplishes): we create the initial clusters ourselves before calling anticlustering(). This grouping is the starting point from which the exchange procedure improves the anticlustering:

N <- nrow(iris)
initial_clusters <- sample(rep_len(1:K, N))
initial_clusters
#>   [1] 2 2 3 1 3 3 3 3 3 2 1 1 1 3 3 2 1 1 3 1 3 1 1 3 2 1 3 2 2 2 3 2 1 3 2 1 2
#>  [38] 3 3 3 2 3 2 1 2 2 1 3 1 1 2 1 1 3 2 2 2 3 2 1 3 3 3 3 1 1 2 1 1 2 1 2 2 1
#>  [75] 3 1 1 2 3 3 3 2 1 2 2 3 2 1 3 2 1 2 3 1 1 2 2 1 1 1 3 1 1 2 1 3 3 3 3 3 1
#> [112] 3 1 2 2 2 3 2 3 2 2 3 2 3 2 1 2 2 2 1 3 3 3 1 3 1 2 2 1 1 3 3 1 1 2 2 2 1
#> [149] 1 3

Now, the argument categories can be used to define which elements serve as exchange partners for each other. Let’s create random groups of 10 elements that serve as exchange partners for each other:

n_exchange_partners <- 10
exchange_partners <- sample(rep_len(1:(N/n_exchange_partners), N))
exchange_partners
#>   [1]  5 13  2 11  7  5  6  1  7  3  1  1 15  3 10 10  8  4 14 15 15  7  8  3  2
#>  [26] 13  1  9 11  6 13  6  9 13 13  1  9 11 15  7 12  8 12 13 14  7  3 10 12  8
#>  [51] 11 11  7  5 15  5  6 14 15  4  9  5 15 12 14  4  4  6 11 10  1  9  6 12  1
#>  [76]  8 10 11 14 11 15 14 10  7  2  8  4  8  5 15  3  1  2  4  5 12 12  3  9 13
#> [101]  3  8  5 10  1  2  7  9  6 10  1  8 10  5  4  4  9  2  3 14  9  2  5  7 10
#> [126] 12 11 13 12  6  9  6 14 14 15 13 13  4  3  4 14 11  2  6  2  3  8  2 12  7

The variable exchange_partners now defines groups of elements that are exchanged with each other. Only elements having the same value in exchange_partners serve as exchange partners for each other. Thus, each element is only swapped with the (at most) 9 other elements in its group of 10, instead of with the 100 elements that are currently assigned to a different cluster.
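
We can verify the construction by tabulating the exchange-partner groups; with \(N = 150\) and groups of size 10, there should be 15 groups of 10 elements each:

table(exchange_partners)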

Now let’s call anticlustering using the exchange partners we just defined:

system.time(anticlustering(iris[, -5], K = initial_clusters))
#>        User      System verstrichen 
#>       0.007       0.000       0.007
system.time(anticlustering(iris[, -5], K = initial_clusters, categories = exchange_partners))
#>        User      System verstrichen 
#>       0.003       0.000       0.003

Well, there is not a lot going on here with this very small data set (N = 150), so let’s do this for a larger data set with 1000 data points.

N <- 1000
M <- 2
K <- 5
data <- matrix(rnorm(M*N), ncol = M)

initial_clusters <- sample(rep_len(1:K, N))
n_exchange_partners <- 10
exchange_partners <- sample(rep_len(1:(N/n_exchange_partners), N))

system.time(anticlustering(data, K = initial_clusters))
#>        User      System verstrichen 
#>       3.035       0.008       3.042
system.time(anticlustering(data, K = initial_clusters, categories = exchange_partners))
#>        User      System verstrichen 
#>       0.047       0.008       0.056

The speedup is enormous! Note that this approach even reduces the theoretical run time of the algorithm, because the number of exchange partners per element no longer depends on the size of the input data but is fixed, so the total number of exchanges grows as \(O(N)\) rather than \(O(N^2)\).

Including categorical variables

The previous approach to speeding up the anticlustering by requiring fewer exchange partners used the categories argument. Let’s now reflect on how this was accomplished, and first note that the categories argument usually has a different purpose: it is used to distribute a categorical variable evenly across groups (we did not care about that in the previous example).

For example, coming back to the iris data set, we may want to distribute the species of the iris plants evenly across 5 groups of plants:

groups <- anticlustering(iris[, -5], K = 5, categories = iris$Species)
table(groups, iris$Species)
#>       
#> groups setosa versicolor virginica
#>      1     10         10        10
#>      2     10         10        10
#>      3     10         10        10
#>      4     10         10        10
#>      5     10         10        10

How does the categories argument accomplish the even spread of the species? First, the initial grouping of the elements is not random, but instead a “stratified split”, which ensures that a categorical variable occurs an equal number of times in each split. anticlust has the function categorical_sampling() for this purpose. After conducting the initial stratified split, only plants belonging to the same species serve as exchange partners for each other. This second purpose of the categories argument is the one that we used above to restrict the number of exchange partners to speed up anticlustering: Only elements that have the same value in categories serve as exchange partners for each other.

In the example above, we prevented anticlustering() from conducting a stratified split on the basis of the categories argument because we passed the initial grouping of the elements ourselves. The insight that the categories argument has a twofold purpose, one of which can be shut down by passing the initial grouping vector to the K argument, leads to the following approach, where I combine the speedup aspect of categories with a stratified split.

First, we conduct a manual stratified split as the initial grouping vector for the K argument in anticlustering():

initial_groups <- categorical_sampling(iris$Species, K = 5)
table(initial_groups, iris$Species) # even!
#>               
#> initial_groups setosa versicolor virginica
#>              1     10         10        10
#>              2     10         10        10
#>              3     10         10        10
#>              4     10         10        10
#>              5     10         10        10

Next, as in the previous section, we generate a vector that defines groups of elements that serve as exchange partners for each other.

N <- nrow(iris)
n_exchange_partners <- 10
exchange_partners <- sample(rep_len(1:(N/n_exchange_partners), N))

Now, and this is the crucial part, we pass to the argument categories a matrix that contains both the species and the exchange_partners vector, and to the argument K the vector that encodes the stratified split. This way we ensure that:

  1. the species is split evenly between groups at the start of the algorithm (argument K)
  2. the number of exchange partners is restricted to 10 (one column of the argument categories)
  3. the exchange partners are from the same species, thereby ensuring that the species remains evenly distributed between groups (the other column of categories)

groups <- anticlustering(
  iris[, -5],
  K = initial_groups, 
  categories = cbind(iris$Species, exchange_partners)
)

The groups are still balanced after anticlustering() has been called:

table(groups, iris$Species)
#>       
#> groups setosa versicolor virginica
#>      1     10         10        10
#>      2     10         10        10
#>      3     10         10        10
#>      4     10         10        10
#>      5     10         10        10

Objective function

The package anticlust primarily implements two objective functions for anticlustering: k-means and the diversity.1 The k-means criterion is well-known in cluster analysis and is computed as the sum of the squared Euclidean distances between all data points and the centroid of their respective cluster. The diversity is the overall sum of pairwise distances between elements that are grouped in the same set (for details, see Papenberg & Klau, 2021). By default, anticlustering() optimizes the “Euclidean diversity”: the diversity objective using the Euclidean distance as measure of pairwise dissimilarity. However, any kind of dissimilarity matrix can be used as input.
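
To make the two criteria concrete, here is a small from-scratch sketch that computes both objectives for an arbitrary grouping of the iris data. It only serves to illustrate the definitions in base R; it is not how anticlust computes the objectives internally, and the variable names are my own:

features <- as.matrix(iris[, -5])
groups <- sample(rep_len(1:3, nrow(features)))

# k-means criterion: sum of squared Euclidean distances between elements and their cluster centroid
centroids <- apply(features, 2, function(x) tapply(x, groups, mean))
kmeans_value <- sum((features - centroids[groups, ])^2)

# diversity: sum of pairwise (here: Euclidean) distances between elements in the same cluster
distances <- as.matrix(dist(features))
diversity_value <- sum(distances[outer(groups, groups, "==")]) / 2

kmeans_value
diversity_value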

As stated above, the anticlustering algorithms recompute the objective function for each attempted exchange. Thus, computing the objective is the major contributor to overall run time. In anticlust, I exploit the fact that anticlustering objectives can be recomputed much faster when only two items have swapped between clusters and the objective value prior to the exchange is known. For example, computing the diversity objective “from scratch” is in \(O(N^2)\), but when only two items are swapped, it is not necessary to spend the entire \(O(N^2)\) time during each exchange. Instead, by “cleverly” updating the objective using only the two swapped elements, the computation reduces to \(O(N)\), leading to about \(O(N^3)\) for the entire exchange method instead of \(O(N^4)\), which is a huge difference.

Computing the k-means objective (objective = "variance") is in \(O(M \cdot N)\), where \(M\) is the number of variables. anticlust uses some optimizations during the exchange process to avoid recomputing the cluster centroids from scratch for each exchange, which otherwise consumes most of the run time; however, this does not change the theoretical \(O(M \cdot N)\) run time of the computation. Thus, theoretically, re-computing the k-means objective is slower than re-computing the diversity objective. In practice, however, I observe that k-means anticlustering is often faster than diversity anticlustering, at least when \(N\) is large compared to \(M\) and \(K\). As an example, below I compare running times for k-means and diversity anticlustering with \(N = 1000\), \(M = 2\) variables and \(K = 5\) groups; but first, here is a small sketch of the update idea for the diversity.
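
The following function illustrates the update idea in plain R: it computes the change in the diversity that results from swapping the cluster memberships of two elements i and j, touching only the rows of the distance matrix that belong to i and j. This is a sketch of the principle, not the actual C implementation in anticlust, and the function and variable names are my own.

# Change in the diversity when elements i and j swap clusters (illustrative sketch)
diversity_delta <- function(distances, groups, i, j) {
  others_i <- which(groups == groups[i] & seq_along(groups) != i)  # i's current cluster mates
  others_j <- which(groups == groups[j] & seq_along(groups) != j)  # j's current cluster mates
  gain <- sum(distances[i, others_j]) + sum(distances[j, others_i])  # pairs created by the swap
  loss <- sum(distances[i, others_i]) + sum(distances[j, others_j])  # pairs removed by the swap
  gain - loss
}

# Quick check against a full O(N^2) recomputation
distances <- as.matrix(dist(iris[, -5]))
groups <- sample(rep_len(1:3, nrow(distances)))
full_diversity <- function(d, g) sum(d[outer(g, g, "==")]) / 2
i <- which(groups == 1)[1]
j <- which(groups == 2)[1]
swapped <- groups
swapped[c(i, j)] <- groups[c(j, i)]
all.equal(full_diversity(distances, swapped) - full_diversity(distances, groups),
          diversity_delta(distances, groups, i, j))

Each of the four sums in diversity_delta() runs over at most \(N\) elements, which is where the \(O(N)\) update comes from. Now for the timing comparison: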

N <- 1000
M <- 2
K <- 5
data <- matrix(rnorm(M*N), ncol = M)

system.time(anticlustering(data, K = K, objective = "diversity"))
#>        User      System verstrichen 
#>       3.041       0.004       3.046
system.time(anticlustering(data, K = K, objective = "variance")) # k-means anticlustering
#>        User      System verstrichen 
#>       1.762       0.000       1.762

When \(M\) and/or \(K\) increase, the diversity implementation may be faster, e.g. using \(N = 1000\), \(M = 20\) variables and \(K = 50\) yields:

N <- 1000
M <- 20
K <- 50
data <- matrix(rnorm(M*N), ncol = M)

system.time(anticlustering(data, K = K, objective = "diversity"))
#>        User      System verstrichen 
#>       0.205       0.004       0.209
system.time(anticlustering(data, K = K, objective = "variance"))
#>        User      System verstrichen 
#>       0.826       0.000       0.827

In most applications I am aware of, the k-means method is faster because \(M\) and \(K\) are usually not very large. However, if we happen to have (approximately) “\(M = N\)”, the diversity implementation is much faster because its theoretical run time is better. Note, however, that for very large data sets, using the diversity objective may not be feasible at all, because a quadratic (\(N \times N\)) matrix of between-item distances has to be computed and stored in memory. It is my experience that on a personal computer this becomes difficult for more than about 20000 elements (where the distance matrix has 20000^2 = 400000000 entries). Thus, for large data sets a change of the objective is reasonable; the k-means objective is computationally more efficient and even very large data sets can be processed, at least as long as the number of variables is (much) smaller than the number of elements. I live in a world where this is usually the case.
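
As a rough back-of-the-envelope check of the memory requirement, assuming the distance matrix is stored as a dense matrix of 8-byte doubles:

N <- 20000
N^2 * 8 / 1e9   # size of a dense N x N distance matrix in gigabytes (about 3.2 GB)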

Because the k-means objective has some disadvantages for anticlustering (see Papenberg, 2023),2 you may consider using the k-plus objective, which extends (and effectively re-uses) the k-means criterion. For example, the following code, using \(N = 50000\), 2 features and 2 exchange partners per element, ran in just about 15 seconds on my 10-year-old personal computer:


N <- 50000
M <- 2
K <- 5
data <- matrix(rnorm(M*N), ncol = M)

initial_clusters <- sample(rep_len(1:K, N))
n_exchange_partners <- 2
exchange_partners <- sample(rep_len(1:(N/n_exchange_partners), N))

kplus_anticlustering(data, K = initial_clusters, categories = exchange_partners)

My computer crashes when I try to run diversity anticlustering on 50000 data points.

Addendum: fast_anticlustering()

The anticlust package contains a function called fast_anticlustering(), which (as the name suggests) was introduced as a faster version of anticlustering() for processing large data sets. fast_anticlustering() optimizes the k-means criterion and has an argument k_neighbours, which adjusts the number of exchange partners. Details are found in the documentation (?fast_anticlustering) and in Papenberg and Klau (2021). In my view, the fast_anticlustering() approach to adjusting the number of exchange partners is theoretically and practically more satisfying than the previous “hack” of using the categories argument. However, while the function may still be useful in some settings, I currently do not recommend its use. This is because the exchange algorithm that is called in anticlustering() (and therefore also in kplus_anticlustering(), which calls anticlustering()) has been rewritten in C after fast_anticlustering() was included in the package. The C implementation is considerably faster than the R implementation in fast_anticlustering(). Due to the way in which exchange partners are generated in fast_anticlustering(), it is also not straightforward to use the C implementation of the exchange method in fast_anticlustering(). I would like to change this in the future, but I am not sure when this will happen. So, ironically, fast_anticlustering() is currently not the fastest method for anticlustering, by a long shot:

N <- 1000
M <- 2
K <- 5
data <- matrix(rnorm(M*N), ncol = M)

initial_clusters <- sample(rep_len(1:K, N))
n_exchange_partners <- 10
exchange_partners <- sample(rep_len(1:(N/n_exchange_partners), N))

# fast_anticlustering() always optimizes the k-means criterion (i.e., objective = "variance")
system.time(anticlustering(data, K = initial_clusters, categories = exchange_partners, objective = "variance"))
#>        User      System verstrichen 
#>       0.024       0.000       0.025
system.time(fast_anticlustering(data, K = initial_clusters, k_neighbours = n_exchange_partners))
#>        User      System verstrichen 
#>       1.191       0.000       1.192

References

Papenberg, M., & Klau, G. W. (2021). Using anticlustering to partition data sets into equivalent parts. Psychological Methods, 26(2), 161–174. https://doi.org/10.1037/met0000301

Papenberg, M. (2023). K-plus anticlustering: An improved k-means criterion for maximizing between-group similarity. British Journal of Mathematical and Statistical Psychology. Advance online publication. https://doi.org/10.1111/bmsp.12315

Weitz, R., & Lakshminarayanan, S. (1998). An empirical comparison of heuristic methods for creating maximally diverse groups. Journal of the Operational Research Society, 49(6), 635–646. https://doi.org/10.1057/palgrave.jors.2600510


  1. Actually, four objectives are natively supported for the anticlustering() argument objective: "diversity", "variance" (i.e., k-means), "kplus" and "dispersion". However, the k-plus objective as implemented in anticlust effectively re-uses the original k-means criterion and just extends the input data internally. The dispersion objective has a different goal, as it does not strive for between-group similarity, so it cannot be used as an alternative to the other objectives.↩︎

  2. The primary disadvantage is that k-means anticlustering only leads to similarity in means, but not in standard deviations or other aspects of the distribution; the k-plus criterion can be used to equalize arbitrary distribution moments between groups.↩︎