Adapting and Evaluating Influence-Estimation Methods for Gradient-Boosted Decision Trees

Jonathan Brophy, Zayd Hammoudeh, Daniel Lowd; 24(154):1−48, 2023.

Abstract

Analyzing the influence of changes to the training data on model predictions can provide valuable insights into the predictions, the models themselves, and the training datasets. However, most influence-estimation techniques are designed for deep learning models with continuous parameters. Gradient-boosted decision trees (GBDTs) are widely used and powerful models, but their decision-making processes are often opaque. To improve our understanding of GBDT predictions and enhance these models, we adapt popular influence-estimation methods designed for deep learning models to GBDTs. We introduce our new methods, TREX and BoostIn, which are adaptations of representer-point methods and TracIn, respectively. The source code for these methods is available at https://github.com/jjbrophy47/treeinfluence. We evaluate the performance of TREX, BoostIn, LeafInfluence, and other baselines using five different evaluation measures on 22 real-world datasets with four popular GBDT implementations. Through these experiments, we gain a comprehensive understanding of the effectiveness of different influence-estimation approaches for GBDT models. Our results show that BoostIn is an efficient influence-estimation method for GBDTs, performing as well as or better than existing work while being four orders of magnitude faster. Additionally, our evaluation suggests that the leave-one-out (LOO) retraining approach, while identifying the single most influential training example, performs poorly in identifying the most influential set of training examples for a given target prediction.
