Random forest

3/24/2023

The Random Forest (RF) algorithm for regression and classification has gained considerable popularity since its introduction in 2001. Meanwhile, it has grown into a standard classification approach competing with logistic regression (LR) in many innovation-friendly scientific fields.

In this context, we present a large-scale benchmarking experiment based on 243 real datasets, comparing the prediction performance of the original version of RF with default parameters and LR as binary classification tools. Most importantly, the design of our benchmark experiment is inspired by clinical trial methodology, thus avoiding common pitfalls and major sources of bias.

RF performed better than LR according to the considered accuracy measure in approximately 69% of the datasets. The mean difference between RF and LR was 0.029 (95%-CI =) for the accuracy, 0.041 (95%-CI =) for the Area Under the Curve, and −0.027 (95%-CI =) for the Brier score, all measures thus suggesting a significantly better performance of RF. As a side result of our benchmarking experiment, we observed that the results were noticeably dependent on the inclusion criteria used to select the example datasets, which emphasizes the importance of clear statements regarding this dataset selection process.

We also stress that neutral studies similar to ours, carefully designed and based on a large number of datasets, will be necessary in the future to evaluate further variants, implementations or parameters of random forests which may yield improved accuracy compared to the original version with default values.

In the context of low-dimensional data (i.e. when the number of covariates is small compared to the sample size), logistic regression is considered a standard approach for binary classification. This is especially true in scientific fields such as medicine or psycho-social sciences, where the focus is not only on prediction but also on explanation; see Shmueli for a discussion of this distinction. Since its invention 17 years ago, the random forest (RF) prediction algorithm, which focuses on prediction rather than explanation, has strongly gained popularity and is increasingly becoming a common “standard tool” also used by scientists without any strong background in statistics or machine learning. Our experience as authors, reviewers and readers is that random forest can now be used routinely in many scientific fields without particular justification and without the audience strongly questioning this choice. While its use was limited in the early years to innovation-friendly scientists interested (or expert) in machine learning, random forests are now increasingly well known in various non-computational communities.
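The three performance measures reported above (accuracy, Area Under the Curve, Brier score) and the mean RF-minus-LR difference across datasets can be illustrated with a minimal Python sketch. It makes several assumptions that are not taken from the text. The function names (`accuracy`, `auc`, `brier`, `mean_difference`) are hypothetical, and the 0.5 classification threshold is an assumed default. The toy labels and predicted probabilities are invented for illustration and are not results from the benchmark. The study's own software and exact computations may differ.

```python
# Minimal sketch (standard library only). All names and numbers below are
# illustrative assumptions, not the benchmark's actual code or data.

def accuracy(y_true, p_hat, threshold=0.5):
    """Share of observations whose thresholded probability matches the label."""
    hits = sum(int((p >= threshold) == (y == 1)) for y, p in zip(y_true, p_hat))
    return hits / len(y_true)

def auc(y_true, p_hat):
    """AUC as the probability that a random positive case receives a higher
    predicted probability than a random negative case (ties count 1/2)."""
    pos = [p for y, p in zip(y_true, p_hat) if y == 1]
    neg = [p for y, p in zip(y_true, p_hat) if y == 0]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in pos for b in neg)
    return wins / (len(pos) * len(neg))

def brier(y_true, p_hat):
    """Mean squared difference between predicted probability and 0/1 label.
    Unlike accuracy and AUC, lower values are better."""
    return sum((p - y) ** 2 for y, p in zip(y_true, p_hat)) / len(y_true)

def mean_difference(scores_a, scores_b):
    """Mean of the per-dataset differences a - b (e.g. RF minus LR)."""
    return sum(a - b for a, b in zip(scores_a, scores_b)) / len(scores_a)

# Toy test-set labels and predicted probabilities for class 1 (invented).
y = [1, 0, 1, 1, 0, 0]
p_rf = [0.9, 0.2, 0.7, 0.6, 0.4, 0.1]  # hypothetical RF predictions
p_lr = [0.8, 0.3, 0.4, 0.6, 0.6, 0.2]  # hypothetical LR predictions

for name, measure in [("accuracy", accuracy), ("AUC", auc), ("Brier", brier)]:
    diff = measure(y, p_rf) - measure(y, p_lr)
    print(f"{name}: RF - LR = {diff:+.3f}")
```

With differences taken as RF minus LR, a positive value for accuracy or AUC and a negative value for the Brier score both favor RF, which matches the signs of the mean differences quoted in the text. In a benchmark over many datasets, `mean_difference` would be applied to the per-dataset scores of the two methods.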
