Multi-label Classification with MLPUGS

Mikhail Popov



Suppose we have a dataset \(D\) of \(n\) observations and a label set \(L\). Each \(i\)-th instance can have one or more labels \((S_i \subseteq L)\). In the multi-label classification problem, we are given a new instance \(x_*\) and we want to predict its label set \(\hat{S}_*\). The best-known approach to this problem is binary relevance (BR), which transforms the problem into one binary classification problem per label. The drawback of this approach is that it ignores the relationships between the response variables, akin to fitting several separate multiple regression models when one should instead be using multivariate regression.

Classifier chains (CC) (Read et al., 2011) are an extension of BR that makes use of the correlations between the labels. Each link in the chain is trained to predict label \(l_j \in L\) using a feature space extended to include the previous labels \((l_1, \ldots, l_{j-1})\). An Ensemble of Classifier Chains (ECC) trains \(m\) CC classifiers, where each \(C_k\) is trained with a random subset of \(D\); each chain is therefore likely to be unique and able to give different multi-label predictions.

Of course, the problem with this approach is that to predict any one label, we need to know the values of the other labels, which we must also predict. This is where the idea of Gibbs sampling comes in. We start with random predictions, then proceed label by label, conditioning on the most recent predictions within each iteration. After the burn-in, we should have a stable multivariate distribution of the labels for each observation.
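The label-by-label sampling scheme can be sketched in a few lines of base R. This is a toy illustration, not the package internals: the per-label classifiers are stand-ins (a made-up probability that depends on the other labels), and all names here (`gibbs_pass`, `draws`) are hypothetical.

```r
# Toy Gibbs sampler over three hypothetical labels.
set.seed(42)
labels <- c("l1", "l2", "l3")
y <- setNames(sample(0:1, length(labels), replace = TRUE), labels)  # random start

gibbs_pass <- function(y, labels) {
  for (j in seq_along(labels)) {
    # Condition on the most recent values of all *other* labels.
    others <- y[-j]
    p_j <- 0.3 + 0.4 * mean(others)   # stand-in for a trained per-label classifier
    y[j] <- rbinom(1, 1, p_j)
  }
  y
}

n_iters <- 100; burn_in <- 50
draws <- matrix(NA_integer_, n_iters - burn_in, length(labels),
                dimnames = list(NULL, labels))
for (i in seq_len(n_iters)) {
  y <- gibbs_pass(y, labels)
  if (i > burn_in) draws[i - burn_in, ] <- y   # keep post-burn-in draws only
}
colMeans(draws)  # approximate marginal probability of each label
```

Averaging the retained draws gives the per-label probabilities; this is the same idea MLPUGS applies with real trained classifiers in place of `p_j`.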

Classification Pipeline

MLPUGS (Multi-label prediction using Gibbs sampling) is a wrapper that takes any binary classification algorithm as a base classifier, constructs an ECC, and uses Gibbs sampling to make the multi-label predictions. In short, the steps are:

  1. Train an ECC using any base classifier that can predict classes and probabilities.
  2. Use it to make predictions (using Gibbs sampling).
  3. Collapse multiple iterations and models into a final set of predictions.
ecc(x, y) %>% predict(newdata) %>% [summary|validate]

We will go through each of the steps, including an evaluation step, in the example below.

Note on Parallelization

This package was designed to take advantage of multiple cores, except on Windows. On a quad-core processor it is recommended to train an ensemble of 3 models in parallel; on 6-core and 8-core processors the recommended numbers of models are 5 and 7, respectively. Predictions can also be performed in parallel, if the user allows it.
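The recommendations above amount to leaving one core free. A small base-R heuristic for choosing the ensemble size (the variable names here are our own, not part of the package API):

```r
# Pick an ensemble size from the core count, leaving one core for the OS/R session.
n_cores <- parallel::detectCores()
m <- max(1, n_cores - 1)  # 4 cores -> 3 models, 6 -> 5, 8 -> 7
m
```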

Example: Movies

Suppose we wanted to predict whether a movie will have a good (at least 80%) critic rating on Rotten Tomatoes, Metacritic, and Fandango, based on the user ratings on those websites along with the user ratings on IMDb. Multi-label prediction via classifier chains allows us to use the correlation between those websites (a movie that is critically acclaimed on one review-score-aggregation website is likely to have a high rating on another).

Table 1: Critic-rating labels (0/1) and user ratings for a sample of movies.

title year rotten_tomatoes metacritic fandango rotten_tomatoes_user metacritic_user imdb_user fandango_user
Avengers: Age of Ultron 2015 0 0 1 86 7.1 7.8 4.5
Cinderella 2015 1 0 1 80 7.5 7.1 4.5
Ant-Man 2015 1 0 1 90 8.1 7.8 4.5
Do You Believe? 2015 0 0 1 84 4.7 5.4 4.5
Hot Tub Time Machine 2 2015 0 0 0 28 3.4 5.1 3.0
The Water Diviner 2015 0 0 1 62 6.8 7.2 4.0

We are going to use a pre-made training dataset movies_train (60% of movies) to train our ECC and movies_test (remaining 40%) for assessing the accuracy of our predictions.

data("movies_train"); data("movies_test")
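For readers without the pre-made datasets, a 60/40 split of this kind can be sketched in base R. The `movies` data frame below is a tiny synthetic stand-in, not the package's data, and `movies_train_demo`/`movies_test_demo` are hypothetical names:

```r
# Build a 60/40 train/test split from a combined data frame.
set.seed(123)
movies <- data.frame(rotten_tomatoes = rbinom(10, 1, 0.4),  # synthetic labels
                     metacritic      = rbinom(10, 1, 0.3),
                     fandango        = rbinom(10, 1, 0.7),
                     imdb_user       = runif(10, 4, 9))     # synthetic feature
in_train <- sample(nrow(movies), size = round(0.6 * nrow(movies)))
movies_train_demo <- movies[in_train, ]
movies_test_demo  <- movies[-in_train, ]
c(train = nrow(movies_train_demo), test = nrow(movies_test_demo))  # 6 and 4
```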

Training an Ensemble of Classifier Chains (ECC)

There is no built-in classifier as of the writing of this vignette, so ecc requires us to supply an appropriate classifier to train. A future release of MLPUGS will include a classifier that works out-of-the-box, although the package was written to allow for user-supplied classifiers. We could, for example, use the randomForest package, in which case our code will look like:

fit <- ecc(movies_train[, -(1:3)], movies_train[, 1:3], m = 3,
           .f = randomForest::randomForest, replace = TRUE)

This will give us 3 models, forming an ensemble of classifier chains. Each set of classifier chains was trained on a random 95% of the available training data. (If we had trained 1 set of classifier chains, that model would have used all of the training data.)

Prediction Using Gibbs Sampling (PUGS)


Multi-label prediction is tricky with classifier chains because each label’s classifier was trained using the other labels. That is, to predict an unobserved set of values we must condition on \(|L| - 1\) sets of values that are themselves unobserved. To this end, we employ the Gibbs sampler.

There is no built-in classifier as of the writing of this vignette, so predict requires us to supply an appropriate prediction function. Since we chose to train a randomForest, we must supply the corresponding function for making the predictions.

pugs <- predict(fit, movies_test[, -(1:3)], burn.in = 500, n.iters = 1500, thin = 15,
                .f = randomForest:::predict.randomForest, type = "prob")

After discarding the burn-in and thinning, this will give us approximately 66 retained iterations per chain ((1500 − 500) / 15), or roughly 200 predictions across the 3 sets of chains, for each movie’s rating on Rotten Tomatoes, Metacritic, and Fandango.

Gathering Predictions

When we collapse the predictions using summary, we can ask it to provide us with binary classifications or probabilistic classifications. Binary classification works by taking a majority vote, first across the iterations and then across the models.
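The nested majority vote can be sketched as follows. This is a toy illustration rather than the package's code: `preds` is a hypothetical 0/1 array of dimension observations × labels × iterations × models.

```r
# Collapse predictions: majority vote across iterations, then across models.
set.seed(1)
preds <- array(rbinom(5 * 3 * 10 * 3, 1, 0.5), dim = c(5, 3, 10, 3))
vote_iters  <- apply(preds, c(1, 2, 4), function(x) as.integer(mean(x) >= 0.5))
vote_models <- apply(vote_iters, c(1, 2), function(x) as.integer(mean(x) >= 0.5))
dim(vote_models)  # 5 observations x 3 labels
```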

y_pred <- summary(pugs, type = "prob")

Table 2: Probabilistic classifications for the first 5 movies in the validation set.

rotten_tomatoes metacritic fandango
Avengers: Age of Ultron (2015) 0.767 0.287 0.979
Do You Believe? (2015) 0.310 0.131 0.883
Hot Tub Time Machine 2 (2015) 0.001 0.000 0.018
The Water Diviner (2015) 0.223 0.023 0.983
Top Five (2014) 0.541 0.016 0.766

y_pred <- summary(pugs, type = "class")

Table 3: Binary classifications for the first 5 movies in the validation set.

rotten_tomatoes metacritic fandango
Avengers: Age of Ultron (2015) 1 0 1
Do You Believe? (2015) 0 0 1
Hot Tub Time Machine 2 (2015) 0 0 0
The Water Diviner (2015) 0 0 1
Top Five (2014) 1 0 1

An advantage of using probabilistic classification over binary is that you can experiment with the threshold you use. You may find, for example, that the intuitive and initial threshold of 0.5 doesn’t work as well as, say, 0.62.
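Applying a custom threshold to the probabilistic output is a one-liner in base R. Here `probs` stands in for the matrix returned by summary(pugs, type = "prob") (the values are the first two rows of Table 2), and `classify` is a hypothetical helper, not part of the package:

```r
# Turn probabilistic classifications into binary ones at a chosen threshold.
probs <- matrix(c(0.767, 0.287, 0.979,
                  0.310, 0.131, 0.883), nrow = 2, byrow = TRUE,
                dimnames = list(NULL, c("rotten_tomatoes", "metacritic", "fandango")))
classify <- function(p, threshold = 0.5) ifelse(p >= threshold, 1L, 0L)
classify(probs)        # intuitive default threshold of 0.5
classify(probs, 0.62)  # a stricter threshold, tunable on a validation set
```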


So how well did our ensemble of classifier chains perform? Here are the true classifications of the first 5 movies in the test (validation) set:

Table 4: True classifications for the first 5 movies in the test (validation) set.
rotten_tomatoes metacritic fandango
Avengers: Age of Ultron (2015) 0 0 1
Do You Believe? (2015) 0 0 1
Hot Tub Time Machine 2 (2015) 0 0 0
The Water Diviner (2015) 0 0 1
Top Five (2014) 1 1 1

Let’s calculate some established measures of accuracy for multi-label classification:

validate(pugs, movies_test[, 1:3])
Measure Value Description
Logarithmic.Loss 0.3180 provides a steep penalty for predictions that are both confident and wrong
Exact.Match.Ratio 0.6441 average per-obs exact classification
Labelling.F.score 0.7853 average per-obs classification with partial matches
Retrieval.F.score 0.6034 per-label classification with partial matches
Hamming.Loss 0.1299 average per-example per-class total error
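As a sanity check, Hamming loss is simple to compute by hand: it is the fraction of individual label assignments that differ between prediction and truth. Below, `truth` and `pred` hold just the five movies shown in Tables 3 and 4 (so the result differs slightly from the 0.1299 reported over the full test set):

```r
# Hamming loss for the five movies from Tables 3 (predictions) and 4 (truth).
truth <- matrix(c(0,0,1, 0,0,1, 0,0,0, 0,0,1, 1,1,1), ncol = 3, byrow = TRUE)
pred  <- matrix(c(1,0,1, 0,0,1, 0,0,0, 0,0,1, 1,0,1), ncol = 3, byrow = TRUE)
hamming_loss <- mean(truth != pred)
hamming_loss  # 2 mismatched cells out of 15, i.e. 2/15
```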

Not bad!

Further Reading

Read, J., Pfahringer, B., Holmes, G., & Frank, E. (2011). Classifier chains for multi-label classification. Machine Learning, 85(3), 333–359.

Casella, G., & George, E. I. (1992). Explaining the Gibbs sampler. The American Statistician, 46(3), 167–174.