Are we preventing Data Leakage?

Technique & Processes Category
Design Phase, Input Phase, Model Phase, Output Phase

Data leakage occurs when your features contain information that your model should not legitimately be allowed to use, typically because that information would not be available at prediction time. The result is an overestimate of the model's performance.

If you answered No, then you are at risk.

If you are not sure, then you might be at risk too.

Recommendations

  • Avoid using proxies for the outcome variable as a feature.
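  • Do not use the entire data set for imputations, data-based transformations or feature selection. Fit these steps on the training split only, then apply them unchanged to the validation and test splits.
  • Avoid standard k-fold cross-validation when you have temporal data, because shuffled folds let the model train on observations that come after the ones it is validated on.
  • Avoid using data that was generated before model training time but does not become available until later. This is common where there is a delay in data collection.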
  • Do not use data in the training set based on information from the future: if X happened after Y, you shouldn’t build a model that uses X to predict Y.
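
As an illustration of the recommendations on imputation and temporal data, here is a minimal sketch in plain Python. It is not part of the original card: the helper names (`forward_chaining_splits`, `fit_mean_imputer`, `apply_imputer`) and the sample values are hypothetical, and it assumes the rows are already sorted from oldest to newest. Each fold trains only on earlier rows, and the imputation value is learned from the training rows alone.

```python
from statistics import mean


def forward_chaining_splits(n_samples, n_folds):
    """Yield (train_idx, test_idx) pairs for time-ordered data.

    Every test index comes after every train index, so the model is
    never validated on rows that precede its training data.
    Assumes rows are sorted by time, oldest first.
    """
    fold_size = n_samples // (n_folds + 1)
    if fold_size < 1:
        raise ValueError("not enough samples for the requested number of folds")
    for k in range(1, n_folds + 1):
        train_end = k * fold_size
        # The last fold absorbs any remainder rows.
        test_end = n_samples if k == n_folds else train_end + fold_size
        yield list(range(train_end)), list(range(train_end, test_end))


def fit_mean_imputer(values):
    """Learn the fill value from the given (training) values only."""
    observed = [v for v in values if v is not None]
    return mean(observed)


def apply_imputer(values, fill_value):
    """Replace missing entries (None) with a previously learned fill value."""
    return [fill_value if v is None else v for v in values]


# Hypothetical feature column, oldest to newest; None marks a missing value.
feature = [1.0, None, 3.0, 5.0, None, 100.0, 7.0, None]

for train_idx, test_idx in forward_chaining_splits(len(feature), n_folds=3):
    train = [feature[i] for i in train_idx]
    test = [feature[i] for i in test_idx]
    fill = fit_mean_imputer(train)           # statistic comes from training rows only
    train_filled = apply_imputer(train, fill)
    test_filled = apply_imputer(test, fill)  # same fill value, never refit on test rows
```

Computing the mean over the whole column before splitting would let later values (such as the 100.0 above) influence how earlier training rows are filled, which is the kind of leakage the recommendations warn against. The same pattern applies to scaling, encoding and feature selection: learn the parameters on the training rows, then reuse them as-is.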