Imagine waking up one morning feeling unwell, prompting you to seek medical attention. Upon arriving at the clinic, the medic might inquire, “Is there something that’s bothering you?” to which you would respond, “Since this morning, I’ve been experiencing a stomachache.” The medic’s next question could be, “What was the last thing you ate?” You then proceed to describe the delightful dinner you had the previous night. After a series of questions, the medic informs you that you simply have minor indigestion due to overeating and prescribes treatment for the stomachache. The medic arrives at this diagnosis based on their experience with similar symptoms and expert knowledge, leading them to infer that it is just indigestion.
This technique mirrors the approach employed in constructing classification and regression trees, where we possess the values of certain predictor variables (the symptoms) and a target variable of interest (the disease). The primary objective is to determine a function that takes these predictors as input and infers the value of the target variable.
Both regression and classification trees achieve this through the following process: the entire predictor space forms the root of the tree and is divided into distinct regions; each region is then subdivided into subregions, and so on, until a stopping rule is met. The split within a cell is chosen by maximising a splitting criterion. Various splitting criteria exist in the literature, but the most popular is the CART split criterion, which seeks to make the values of the target variable within each cell as homogeneous as possible while making the resulting cells as different from each other as possible.
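For regression, the CART criterion can be read as a search for the split that most reduces the within-cell variance of the target variable. The following minimal Python sketch shows the idea for a single predictor (the function name is ours, purely illustrative):

```python
import numpy as np

def best_cart_split(x, y):
    """Find the threshold on a single predictor x that maximises the
    CART criterion for regression: the reduction in within-cell
    variance of y. A one-dimensional sketch, not production code."""
    order = np.argsort(x)
    x, y = x[order], y[order]
    n = len(y)
    parent_var = np.var(y)
    best_gain, best_threshold = -np.inf, None
    for i in range(1, n):
        if x[i] == x[i - 1]:
            continue  # no valid threshold between equal values
        left, right = y[:i], y[i:]
        # Weighted within-cell variance after the split
        child_var = (len(left) * np.var(left) + len(right) * np.var(right)) / n
        gain = parent_var - child_var
        if gain > best_gain:
            best_gain = gain
            best_threshold = (x[i - 1] + x[i]) / 2
    return best_threshold, best_gain
```

Scanning every midpoint between consecutive sorted values is exactly what CART implementations do for a numeric predictor; the chosen threshold is the one whose two cells are internally homogeneous yet far apart.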
What is a random forest?
A random forest consists of multiple randomised trees. The randomisation between these trees can be achieved in various ways, but the most common approach involves two distinct parts of the algorithm. First, before constructing a tree, the observations are randomly subsampled, and only the selected observations are used to build that particular tree.
Second, whenever a cell is split during the tree’s construction, the splitting criterion is maximised by considering only a subset of randomly chosen directions. To extend this analogy, if we think of a single tree as a medic, then a random forest would be akin to the collective opinion of several medics, each offering their observations and perspectives.
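The two sources of randomness can be sketched in a few lines of Python (names such as `randomise_tree_inputs` and `mtry` are ours for illustration, not the paper's):

```python
import numpy as np

def randomise_tree_inputs(n_obs, n_features, mtry, rng):
    """Illustrate the two randomisation steps used per tree in a
    random forest. A sketch of the mechanism, not a full forest:
    (1) bootstrap: sample the observations with replacement;
    (2) at each split only `mtry` randomly chosen directions are
        tried; here we draw one such subset to show the idea."""
    bootstrap_idx = rng.integers(0, n_obs, size=n_obs)                   # step 1
    split_directions = rng.choice(n_features, size=mtry, replace=False)  # step 2
    return bootstrap_idx, split_directions
```

In a full implementation, step 1 happens once per tree and step 2 is repeated at every cell that gets split; averaging the resulting trees gives the forest's "collective opinion".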
An issue arises when certain observations have missing values for some predictor variables, akin to responding “I don’t know” to specific questions posed by the medic in our analogy. To address this, we suggest adapting the splitting criterion to accommodate incomplete observations. The idea is straightforward: each time we split a cell, we assign the observations with missing values to the subcell that maximises the criterion.
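As a sketch of this idea (not the paper's implementation), the snippet below tries sending the incomplete observations to each side of a candidate split and keeps the assignment that yields the larger variance reduction:

```python
import numpy as np

def split_with_missing(x, y, threshold):
    """Assign observations whose split variable is missing (NaN) to
    whichever side of the split maximises the CART-style variance
    reduction. Illustrative sketch of the idea, not the paper's code."""
    miss = np.isnan(x)
    left = np.zeros(len(x), dtype=bool)
    left[~miss] = x[~miss] < threshold   # observed values go left/right as usual
    right = ~miss & ~left

    def gain(left_mask, right_mask):
        n = len(y)
        child = sum(m.sum() * np.var(y[m]) for m in (left_mask, right_mask) if m.any())
        return np.var(y) - child / n

    gain_left = gain(left | miss, right)    # missing values sent left
    gain_right = gain(left, right | miss)   # missing values sent right
    side = "left" if gain_left >= gain_right else "right"
    return side, max(gain_left, gain_right)
```

Because the assignment of the incomplete observations is itself part of the criterion being maximised, the missing values influence where the tree splits rather than being discarded or filled in beforehand.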
In our analogy, we referred to a missing value as the equivalent of responding “I don’t know” to a question. However, missing values can arise from various scenarios, such as refusing to answer certain questions. Interestingly, the reason behind the missing data is as crucial as the missing data itself. There is a distinction between someone unintentionally omitting to answer a survey question and someone consciously deciding not to provide an answer. This connection between the data and the reason for the missingness is what we refer to as the data-missing mechanism.
Data-missing mechanisms can be classified into three categories: MCAR (Missing Completely At Random), which occurs when the probability of missingness is unrelated to the predictors’ values or the target variable; MAR (Missing At Random), where the probability of missingness is associated with some measured variables but not the missing values themselves; and MNAR (Missing Not At Random), which arises when the probability of missingness depends on the missing values themselves.
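A small simulation makes the three mechanisms concrete. In this sketch (our own illustration; the missingness probability and thresholds are made up), values are deleted only from the first column of a data matrix:

```python
import numpy as np

def make_missing(X, mechanism, rng, p=0.3):
    """Inject NaNs into the first column of X under one of the three
    data-missing mechanisms. Illustrative sketch:
    MCAR: missingness ignores the data entirely;
    MAR : missingness depends on an always-observed column (X[:, 1]);
    MNAR: missingness depends on the value that goes missing (X[:, 0])."""
    X = X.copy()
    n = len(X)
    if mechanism == "MCAR":
        miss = rng.random(n) < p
    elif mechanism == "MAR":
        miss = X[:, 1] > np.quantile(X[:, 1], 1 - p)
    elif mechanism == "MNAR":
        miss = X[:, 0] > np.quantile(X[:, 0], 1 - p)
    else:
        raise ValueError(f"unknown mechanism: {mechanism}")
    X[miss, 0] = np.nan
    return X
```

Note the key practical difference: under MNAR the largest values of the first column are precisely the ones you never observe, so any method that treats the observed values as representative will be biased.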
Properly planning a computational study for machine learning algorithms is essential to obtain meaningful results. Therefore, the primary aim of our work was to conduct a comprehensive comparison of different strategies for handling missing values using random forests. We considered a diverse set of seven missing-data mechanisms in our simulation experiments.
To ensure a thorough investigation, we selected numerous algorithms as benchmarks, enabling a comparison between straightforward methods for addressing missing data and more complex state-of-the-art algorithms. We categorised these algorithms into three groups:
- Listwise deletion, which eliminates all observations with any missing value from the dataset.
- Imputation algorithms, which generate a complete dataset by inferring the missing values.
- Algorithms that handle the missing values directly during the construction of the trees, as our approach does.
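The first two groups are easy to illustrate on a toy data matrix with NaNs marking the missing entries (a NumPy sketch; median imputation is one of the simplest members of the imputation group):

```python
import numpy as np

# Toy design matrix: NaN marks a missing entry.
X = np.array([[1.0,    2.0],
              [np.nan, 3.0],
              [4.0,    np.nan],
              [5.0,    6.0]])

# Group 1 -- listwise deletion: drop every row containing a missing value.
complete = X[~np.isnan(X).any(axis=1)]

# Group 2 -- median imputation: fill each column's NaNs with the median
# of that column's observed entries.
medians = np.nanmedian(X, axis=0)
imputed = np.where(np.isnan(X), medians, X)
```

Listwise deletion here throws away half of the rows, which previews why it performs poorly once missing values are at all common; imputation keeps every observation at the cost of inventing values.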
The following figure illustrates the average mean squared error (MSE) associated with different approaches. A lower MSE value indicates better overall algorithm performance. In the figure, listwise deletion is represented in green, approaches implementing imputation in the dataset are shown in blue, and methods directly handling missing values during tree construction are depicted in red.
Notably, listwise deletion (denoted ‘NoRows’) yields the largest MSE, serving as a benchmark for the minimum performance expected of any method that estimates the regression function in the presence of missing values. It is worth noting that even a straightforward approach, such as imputing with the median, can outperform, or perform comparably to, most of the methods considered across the various missing-data mechanisms.
Discussion and conclusions
Unsurprisingly, listwise deletion exhibited the poorest performance, and it is advisable to avoid this approach unless the percentage of observations with missing values is minimal, allowing their removal without significant consequences. As the percentage of missing values diminishes (approximately 20% or less), even a straightforward technique like median imputation can yield comparable results to more complex algorithms.
To date, many approaches have not provided clear guidance on how to predict new observations when only some of the predictor variables are available. This step holds significant importance and should not be overlooked, as the same mechanism that produced missing values during the training phase could also be at work during the prediction phase. Addressing this aspect is crucial to ensure robust and reliable predictions in practical applications.
Gómez-Méndez, I., & Joly, E. (2023). Regression with missing data, a comparison study of techniques based on random forests. Journal of Statistical Computation and Simulation, 1-26. https://doi.org/10.1080/00949655.2022.2163646