JdS2012



Communication abstract



Abstract 16:

Biases in random forests variable importance measures: new developments
Boulesteix, Anne-Laure ; Bender, Andreas ; Lorenzo Bermejo, Justo ; Strobl, Carolin
Universität München

Random forests are increasingly used in a range of application fields, including genetic association studies. The variable importance measures (VIMs) that the algorithm computes automatically as a by-product are often used to rank predictors by their ability to predict the investigated response. It is now well known that VIMs may be affected by substantial biases, for instance in favour of categorical predictors with many categories. After a brief survey of these issues, we address another type of bias: for a fixed number of categories, VIMs tend to favour categorical predictors with approximately balanced categories over predictors with unbalanced categories. This bias is particularly relevant to genetic association studies. The goal of our study is threefold: (i) to assess this effect quantitatively using simulation studies for different types of random forests (classical random forests and conditional inference forests, which employ unbiased variable selection criteria) and for different importance measures (Gini-based and permutation-based); (ii) to explore the trees and compare the behaviour of random forests with that of the standard logistic regression model in order to understand the statistical mechanisms behind the bias; and (iii) to summarize these results together with previously investigated properties of random forest VIMs and to make practical recommendations regarding the choice of random forest type and variable importance type.
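As a rough illustration of the kind of null simulation design described in (i), the following Python sketch draws a binary predictor with a chosen minority-category frequency and a response that is independent of it, then computes the Gini impurity decrease of a single split on that predictor. It is only a building block under simplifying assumptions (binary predictor, binary response, one split rather than a full forest), not the authors' simulation code; the function names and the parameters `minority_freq` and `n_rep` are hypothetical.

```python
import random


def gini(labels):
    """Gini impurity of a list of 0/1 class labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)


def gini_decrease(x, y):
    """Impurity decrease obtained by splitting y on a binary predictor x (values 0/1)."""
    left = [yi for xi, yi in zip(x, y) if xi == 0]
    right = [yi for xi, yi in zip(x, y) if xi == 1]
    n = len(y)
    return gini(y) - len(left) / n * gini(left) - len(right) / n * gini(right)


def simulate_null(n, minority_freq, rng):
    """Draw a binary predictor with the given minority-category frequency
    and a binary response that is independent of it (null scenario)."""
    x = [1 if rng.random() < minority_freq else 0 for _ in range(n)]
    y = [1 if rng.random() < 0.5 else 0 for _ in range(n)]
    return x, y


def mean_null_decrease(n, minority_freq, n_rep, seed=0):
    """Average single-split Gini decrease over n_rep null data sets."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_rep):
        x, y = simulate_null(n, minority_freq, rng)
        total += gini_decrease(x, y)
    return total / n_rep


if __name__ == "__main__":
    # Compare unbalanced and balanced predictors that carry no information.
    for freq in (0.05, 0.25, 0.50):
        print(freq, mean_null_decrease(n=200, minority_freq=freq, n_rep=500))
```

A study of forest-level importance would replace the single split with complete forests and their Gini or permutation importance; this sketch covers only the data-generating step and the split criterion.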