Errorlocate uses validation rules from package validate
to locate faulty values in observations (or, in database slang,
erroneous fields in records).
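It follows this simple recipe (Fellegi-Holt):
Check whether a record satisfies all validation rules.
If it does not, find the smallest (weighted) set of values that must change to make the record valid.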
errorlocate does this by translating the problem into a mixed
integer problem (see
vignette("inspect_mip", package="errorlocate")) and solving it.
errorlocate has two main functions:
locate_errors for detecting errors
replace_errors for replacing faulty values with NA
Let’s start with a simple example:
We have a rule that age cannot be negative:
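A minimal sketch of how such a rule could be declared with validate::validator. The object name rules is an assumption made here; the expression age > 0 is taken from the rule summaries shown in this document.

```r
library(validate)

# Declare a single validation rule: age must be positive.
# The object name `rules` is illustrative.
rules <- validator(age > 0)
rules
```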
And we have the following data set
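The original data listing is not reproduced here. The data frame below is an illustrative stand-in: its values and the name d are assumptions, chosen so that they agree with the rule summaries and error locations reported in this document (four records, one negative age, one missing age).

```r
# Illustrative data only; the values and the name `d` are assumptions.
d <- data.frame(
  age    = c(-10, 15, 25, NA),
  income = c(0, 2000, 3000, 1000)
)
d
```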
summary(le), where le is the result of locate_errors, gives an
overview of the errors found in this data set. The complete error
listing can be found in le$errors,
which shows that record 1 has a faulty value for age.
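The error localization described above can be sketched as follows. The data values and the object names rules, d and le are assumptions used for illustration; locate_errors, summary and the errors field are used as described in this document.

```r
library(validate)
library(errorlocate)

# Illustrative rule and data (assumed values, see the note above).
rules <- validator(age > 0)
d <- data.frame(
  age    = c(-10, 15, 25, NA),
  income = c(0, 2000, 3000, 1000)
)

# Locate the values that have to change to satisfy the rule.
le <- locate_errors(d, rules)

summary(le)  # overview of the errors found
le$errors    # logical matrix: TRUE marks a value located as faulty
```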
Suppose we expand our rules with a second rule, r2:
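A sketch of an expanded rule set. The conditional form of r2 is an assumption: it is one way to write a rule that validate reports as income <= 0 | (age > 16), which is the expression shown in the summaries in this document.

```r
library(validate)

# Two named rules. The if-form of r2 is an assumed spelling of the
# rule "income <= 0 | (age > 16)".
rules <- validator(
  r1 = age > 0,
  r2 = if (income > 0) age > 16
)
rules
```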
With validate::confront we can see that rule
r2 is violated (record 2).
|name|items|passes|fails|nNA|error|warning|expression|
|---|---|---|---|---|---|---|---|
|r1|4|2|1|1|FALSE|FALSE|age > 0|
|r2|4|2|1|1|FALSE|FALSE|income <= 0 \| (age > 16)|
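The check summarized in the table above can be sketched like this. The data values and the spelling of r2 are assumptions chosen to be consistent with that table; confront and summary come from package validate.

```r
library(validate)

# Illustrative rules and data (assumed, not the original listing).
rules <- validator(
  r1 = age > 0,
  r2 = if (income > 0) age > 16
)
d <- data.frame(
  age    = c(-10, 15, 25, NA),
  income = c(0, 2000, 3000, 1000)
)

# Confront the data with the rules and summarize the result per rule.
s <- summary(confront(d, rules))
s
```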
What errors will be found by locate_errors?
It now detects that
age in observation 2 is also faulty,
since it violates the second rule. Note that we use
set.seed. This is needed because in this example either age or
income can be considered faulty.
set.seed ensures that the procedure is reproducible.
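A sketch of the seeded call, under the same assumptions about the data and the spelling of r2 as elsewhere in this document. The seed value 1 is arbitrary; any fixed seed makes the choice between age and income in record 2 repeatable.

```r
library(validate)
library(errorlocate)

# Illustrative rules and data (assumed values).
rules <- validator(
  r1 = age > 0,
  r2 = if (income > 0) age > 16
)
d <- data.frame(
  age    = c(-10, 15, 25, NA),
  income = c(0, 2000, 3000, 1000)
)

# Fix the random seed so that ties between equally weighted
# variables are broken the same way on every run.
set.seed(1)
le <- locate_errors(d, rules)
le$errors
```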
With replace_errors we can remove the errors (the removed values
still need to be imputed).
|name|items|passes|fails|nNA|error|warning|expression|
|---|---|---|---|---|---|---|---|
|r1|4|1|0|3|FALSE|FALSE|age > 0|
|r2|4|2|0|2|FALSE|FALSE|income <= 0 \| (age > 16)|
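The replacement step summarized above can be sketched as follows. The data values and the names d_fixed and s are assumptions; replace_errors is given the error locations found by locate_errors.

```r
library(validate)
library(errorlocate)

# Illustrative rules and data (assumed values).
rules <- validator(
  r1 = age > 0,
  r2 = if (income > 0) age > 16
)
d <- data.frame(
  age    = c(-10, 15, 25, NA),
  income = c(0, 2000, 3000, 1000)
)

set.seed(1)
le <- locate_errors(d, rules)

# Remove the located errors from the data.
d_fixed <- replace_errors(d, le)

# No rule is violated any more; the removed values show up as missing.
s <- summary(confront(d_fixed, rules))
s
d_fixed
```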
Here replace_errors has set all faulty values to NA.
locate_errors allows for supplying weights for the
variables. It is common that the quality of the observed variables
differs. When we have more trust in
age, we can give it more
weight so that locate_errors chooses income when it has to decide between the two.
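A sketch of a weighted call. The weight values 2 and 1 are arbitrary illustrations, and the data and the spelling of r2 are assumptions as before; the weights are passed as a named numeric vector through the weight argument.

```r
library(validate)
library(errorlocate)

# Illustrative rules and data (assumed values).
rules <- validator(
  r1 = age > 0,
  r2 = if (income > 0) age > 16
)
d <- data.frame(
  age    = c(-10, 15, 25, NA),
  income = c(0, 2000, 3000, 1000)
)

set.seed(1)  # good practice, although the weights already break the tie

# A higher weight for age makes changing age more costly than
# changing income.
weight <- c(age = 2, income = 1)
le <- locate_errors(d, rules, weight = weight)
le$errors
```

In record 1 only age can repair the violation of r1, so it is still located there regardless of its higher weight.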
Weights can be specified in different ways (see also expand_weights):
named numeric vector: all records get the same set of weights. Unspecified columns get weight 1.
matrix or data.frame with the same dimensions as the data: specifies weights per record.
Inf weights fix a variable, so it won’t be changed.
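Two of the weight forms listed above, sketched under the same assumptions about data and rules. The weight values and the names w, le_record and le_fixed are illustrative only.

```r
library(validate)
library(errorlocate)

# Illustrative rules and data (assumed values).
rules <- validator(
  r1 = age > 0,
  r2 = if (income > 0) age > 16
)
d <- data.frame(
  age    = c(-10, 15, 25, NA),
  income = c(0, 2000, 3000, 1000)
)

# Per-record weights: a data.frame with the same dimensions as the data.
# Only record 2 gets a higher weight for age here.
w <- data.frame(
  age    = c(1, 2, 1, 1),
  income = c(1, 1, 1, 1)
)
set.seed(1)
le_record <- locate_errors(d, rules, weight = w)
le_record$errors

# An infinite weight: income is treated as fixed, so only age can be
# marked as faulty.
le_fixed <- locate_errors(d, rules, weight = c(age = 1, income = Inf))
le_fixed$errors
```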
locate_errors solves a mixed integer problem. When the
number of interactions between validation rules is large, finding an
optimal solution can become computationally intensive. Both
locate_errors and
replace_errors have a
parallelization option, Ncpus, which makes use of multiple
processors. The $duration (s) property of each solution
indicates the time spent to find a solution for each record. This can be
restricted using the argument timeout (s).
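A sketch of these options on a toy data set. The values Ncpus = 2 and timeout = 10 are arbitrary choices for illustration, and the data and rules are the same assumed examples as before; with only four records the parallel option will not show a real speed-up.

```r
library(validate)
library(errorlocate)

# Illustrative rules and data (assumed values).
rules <- validator(
  r1 = age > 0,
  r2 = if (income > 0) age > 16
)
d <- data.frame(
  age    = c(-10, 15, 25, NA),
  income = c(0, 2000, 3000, 1000)
)

set.seed(1)

# Use two processes and allow at most 10 seconds per record.
le <- locate_errors(d, rules, Ncpus = 2, timeout = 10)

# Time in seconds spent on each record.
le$duration
```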