Some best practices for anticlustering

library(anticlust)

This vignette documents some “best practices” for using the function anticlustering() (most of them also apply to kplus_anticlustering()). In many cases, the suggestions amount to overriding the default values of arguments, a step that users often seem reluctant to take. However, I advise you: Do not stick with the defaults; check out the results of different anticlustering specifications; repeat the process; play around; read the documentation (especially ?anticlustering); change arguments arbitrarily; compare the output. Nothing can break.[1]
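
For example (a minimal sketch using the built-in iris data as a stand-in for your own; see ?anticlustering for the available objectives):

features <- iris[, 1:4]

# Two specifications for splitting the same data into K = 3 similar groups:
groups_diversity <- anticlustering(features, K = 3, objective = "diversity")
groups_variance  <- anticlustering(features, K = 3, objective = "variance")

# Inspect the group means under both assignments (base R only):
aggregate(features, by = list(group = groups_diversity), FUN = mean)
aggregate(features, by = list(group = groups_variance), FUN = mean)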

This document uses somewhat imperative language; nuance and explanations are given in the package documentation, the other vignettes, and the papers by Papenberg and Klau (2021; https://doi.org/10.1037/met0000301) and Papenberg (2023; https://doi.org/10.1111/bmsp.12315). Note that deciding which anticlustering objective to use usually depends on substantive, content-related considerations and cannot be reduced to asking “which one is better”. However, some hints are given below.

References

Papenberg, M., & Klau, G. W. (2021). Using anticlustering to partition data sets into equivalent parts. Psychological Methods, 26(2), 161–174. https://doi.org/10.1037/met0000301

Papenberg, M. (2023). K-plus anticlustering: An improved k-means criterion for maximizing between-group similarity. British Journal of Mathematical and Statistical Psychology. Advance online publication. https://doi.org/10.1111/bmsp.12315


  1. Well, actually your R session can break if you use an optimal method (method = "ilp") with a data set that is too large.

  2. You might ask why standardize = TRUE is not the default. Actually, there are two reasons. First, the argument was not always available in anticlust, and changing the default behaviour of a function when releasing a new version is often undesirable. Second, it seems like a big decision to me to just change users’ data by default. If in doubt, compare the results of standardize = TRUE and standardize = FALSE and decide for yourself which you like best; standardization is not necessarily the best choice in all settings.
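
     For example (a minimal sketch, again using the iris data as a stand-in for your own):

     features <- iris[, 1:4]
     groups_raw <- anticlustering(features, K = 2, standardize = FALSE)
     groups_std <- anticlustering(features, K = 2, standardize = TRUE)
     # Cross-tabulate the two assignments to see how much they differ:
     table(raw = groups_raw, standardized = groups_std)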

  3. The maximum dispersion problem can be solved optimally for rather large data sets, especially for \(K = 2\). For \(K = 3\) and \(K = 4\), several hundred elements can usually be processed, especially when the SYMPHONY solver is installed.
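
     In code, maximizing the dispersion with the regular heuristic looks like this (a sketch using the iris data; the interface of the exact solver is an assumption here, so check ?optimal_dispersion and the package index before relying on it):

     features <- iris[, 1:4]
     # Heuristic: maximize the minimum distance between elements in the same group
     groups <- anticlustering(features, K = 2, objective = "dispersion")
     # Assumed interface of the exact solver (needs an ILP solver such as SYMPHONY):
     # opt <- optimal_dispersion(features, K = 2)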

  4. What counts as very large may depend on the specifications of your computer. I used to have problems computing a distance matrix for \(N > 20000\).
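
     Some quick arithmetic shows why (a full \(N \times N\) matrix of doubles needs \(8 N^2\) bytes; dist() stores only the lower triangle):

     N <- 20000
     8 * N^2 / 1024^3               # full matrix: roughly 3 GB
     8 * N * (N - 1) / 2 / 1024^3   # dist() object: roughly 1.5 GB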