Can we prevent target leakage?

Data & Data Governance Category
Design Phase, Input Phase, Model Phase, Monitor Phase

Target leakage is present when your features contain information that your model should not legitimately be allowed to use, which leads to an overestimate of the model's performance. It can occur when information from outside the training dataset, such as validation or test data, is improperly included in the model during training. The result is unrealistically high performance during evaluation that does not carry over to new data.

If you answered No, then you are at risk

If you are not sure, then you might be at risk too

Recommendations

  • Avoid using proxies for the outcome variable as features.
  • Do not use the entire dataset for imputation, data-based transformations, or feature selection. Fit these steps on the training split only and then apply them to the validation and test splits.
  • Avoid standard k-fold cross-validation when you have temporal data. Use a time-ordered split in which every validation fold comes after the data used to train on it.
  • Avoid using data that happened before model training time but is not available until later. This is common where there is a delay in data collection.
  • Do not include data in the training set that is based on information from the future: if X happened after Y, you shouldn’t build a model that uses X to predict Y.
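
As an illustration of two of the recommendations above, the following minimal Python sketch fits a mean imputer on the training rows only and evaluates on time-ordered splits instead of standard k-fold. It uses only the standard library. The helper names (fit_mean_imputer, apply_imputer, forward_chaining_splits) and the sample feature column are hypothetical and made up for this example, and it assumes the rows are already sorted by time. It is not a complete safeguard against leakage.

```python
from statistics import mean


def fit_mean_imputer(train_values):
    """Learn the fill value from training rows only (None marks a missing value).

    Raises statistics.StatisticsError if the training rows have no observed values.
    """
    observed = [v for v in train_values if v is not None]
    return mean(observed)


def apply_imputer(values, fill_value):
    """Replace missing values with a fill value that was learned elsewhere."""
    return [fill_value if v is None else v for v in values]


def forward_chaining_splits(n_samples, n_splits):
    """Yield (train_idx, test_idx) pairs for rows sorted by time.

    Every test index is later than every train index, so no fold
    trains on rows that come after the rows it is evaluated on.
    """
    fold = n_samples // (n_splits + 1)
    for k in range(1, n_splits + 1):
        train_end = fold * k
        test_end = fold * (k + 1) if k < n_splits else n_samples
        yield list(range(train_end)), list(range(train_end, test_end))


# Hypothetical feature column, ordered from oldest to newest row.
feature = [1.0, None, 3.0, 5.0, None, 100.0, 7.0, None]

splits = list(forward_chaining_splits(len(feature), n_splits=3))
results = []
for train_idx, test_idx in splits:
    train = [feature[i] for i in train_idx]
    test = [feature[i] for i in test_idx]

    # Fit on the training rows only; the test rows never influence the fill value.
    fill = fit_mean_imputer(train)
    results.append((apply_imputer(train, fill), apply_imputer(test, fill)))

# The leaky alternative: a fill value computed from every row, including test rows.
leaky_fill = fit_mean_imputer(feature)
```

In this sketch the fill value used for each fold depends only on rows that precede the test rows, whereas leaky_fill is influenced by later values such as the 100.0 outlier. The same pattern applies to scaling, encoding, and feature selection: fit on the training split, then apply to the held-out split.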