Can we prevent target leakage?
Target leakage is present when your features contain information that your model should not legitimately be allowed to use, typically because that information would not be available at prediction time. It can occur when information from outside the training dataset is improperly included during training. The result is unrealistically high performance during evaluation, which overestimates how well the model will do in practice.
If you answered No, you are at risk
If you are not sure, you might be at risk too
Recommendations
- Avoid using proxies for the outcome variable as features.
- Do not use the entire dataset for imputations, data-based transformations, or feature selection; fit these steps on the training split only and then apply them to the held-out data.
- Avoid standard k-fold cross-validation when you have temporal data; use splits in which the training data always precedes the validation data.
- Avoid using data about events that happened before model training time but that is not available until later. This is common when there is a delay in data collection.
- Do not include data in the training set that is based on information from the future: if X happened after Y, you shouldn’t build a model that uses X to predict Y.
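
Two of the recommendations above (fitting imputations on the training split only, and avoiding standard k-fold on temporal data) can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not a prescribed implementation: the helper names `fit_mean_imputer`, `apply_imputer`, and `expanding_window_splits` are hypothetical, the `feature` column is made-up data, and the rows are assumed to be sorted by time. In practice a library such as scikit-learn offers comparable building blocks (`Pipeline`, `TimeSeriesSplit`).

```python
from statistics import mean


def fit_mean_imputer(train_values):
    """Learn the fill value from the training rows only."""
    observed = [v for v in train_values if v is not None]
    return mean(observed)


def apply_imputer(values, fill_value):
    """Replace missing entries with a fill value learned elsewhere."""
    return [fill_value if v is None else v for v in values]


def expanding_window_splits(n_samples, n_splits, test_size):
    """Yield (train_idx, test_idx) pairs in which every training index
    precedes every test index. Assumes rows are sorted by time."""
    for i in range(n_splits):
        test_start = n_samples - (n_splits - i) * test_size
        if test_start <= 0:
            raise ValueError("not enough samples for the requested splits")
        train_idx = list(range(test_start))
        test_idx = list(range(test_start, test_start + test_size))
        yield train_idx, test_idx


# Hypothetical feature column, ordered by time, with missing readings.
feature = [1.0, None, 3.0, 5.0, None, 100.0, 120.0, None]

for train_idx, test_idx in expanding_window_splits(
    len(feature), n_splits=2, test_size=2
):
    train = [feature[i] for i in train_idx]
    test = [feature[i] for i in test_idx]

    # The fill value is computed from past (training) rows only.
    fill = fit_mean_imputer(train)
    train_filled = apply_imputer(train, fill)

    # The same fill value is reused on the held-out rows; it is never
    # refit on them, so no test-period information reaches training.
    test_filled = apply_imputer(test, fill)
    print(train_idx, test_idx, fill, test_filled)
```

Computing the mean over the whole column instead would fold the later, much larger readings into the value used to fill early gaps, which is the kind of leakage the recommendations warn against.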