Is the dataset representative of the different real-world groups, populations and environments?

Categories: Bias, Fairness & Discrimination; Data & Data Governance
Phases: Design, Input, Model, Output, Monitor

Have you considered the diversity and representativeness of individuals, user groups, and environments in the data? When applying statistical generalisation, there is a risk of making inferences based on misrepresentation. For instance, in a postal code area where mostly young families live, the few older families living there can be discriminated against because they are not properly represented in the group.

  • Deployment bias arises when there is a mismatch between the environment where the AI is developed and the environment where it is deployed. Key data-related biases that contribute to it include:
    • Mismatch between the target population and the actual user base.
    • Underrepresentation of certain groups.
    • Flaws in the data collection/selection process, such as:
      • Sampling bias: data isn't randomly collected, skewing the representation.
      • Self-selection bias: certain groups opt out, leading to gaps in the data.
      • Coverage bias: the data collection method fails to include all relevant segments of the population.

If you answered No, then you are at risk

If you are not sure, then you might be at risk too

Recommendations

  • Consider who is represented in the data, and who might be underrepresented.
  • Prevent disparate impact: when outcomes for members of a minority group are disparate because the group is poorly represented. Consider measuring the accuracy on minority classes as well, instead of measuring only the total accuracy (see the per-group accuracy sketch after this list). Adjusting the weighting factors to avoid disparate impact can result in positive discrimination, which has its own issues: disparate treatment.
  • One approach to addressing the problem of class imbalance is to randomly resample the training dataset (see the resampling sketch after this list). This technique can help to rebalance the class distribution when classes are under- or over-represented:
    • random oversampling (i.e. duplicating samples from the minority class)
    • random undersampling (i.e. deleting samples from the majority class)
  • There are trade-offs when determining an AI system’s metrics for success. It is important to balance performance metrics against the risk of negatively impacting vulnerable populations.
  • When using techniques like statistical generalisation, it is important to know your data well and become familiar with who is and who is not represented in the samples. Check the samples against expectations that can be easily verified: for example, if half the population is known to be female, you can check whether approximately half the sample is female (see the representation-check sketch after this list).
  • After deployment, monitor the AI’s performance to catch any unexpected issues.
  • Focus on making the model interpretable so that deployment problems can be quickly identified and addressed.
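
A minimal sketch of the per-group accuracy check mentioned above, assuming a classification setting where each record carries a group label (for example a protected attribute). The function name and the toy data are illustrative only:

```python
import numpy as np

def per_group_accuracy(y_true, y_pred, groups):
    """Accuracy per group alongside the overall accuracy.

    groups: one group label per record (e.g. a protected attribute).
    """
    y_true, y_pred, groups = map(np.asarray, (y_true, y_pred, groups))
    report = {"overall": float(np.mean(y_true == y_pred))}
    for group in np.unique(groups):
        mask = groups == group
        report[str(group)] = float(np.mean(y_true[mask] == y_pred[mask]))
    return report

# A high overall accuracy can hide poor accuracy on a small minority group ("B" here).
y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0, 1, 1])
y_pred = np.array([1, 0, 1, 1, 0, 0, 1, 0, 1, 1])
groups = np.array(["A", "A", "A", "A", "A", "B", "B", "A", "A", "A"])
print(per_group_accuracy(y_true, y_pred, groups))  # overall 0.8, group A 1.0, group B 0.0
```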
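
A minimal sketch of random oversampling and undersampling with NumPy, assuming a binary classification setting where `minority_label` marks the underrepresented class; dedicated libraries such as imbalanced-learn offer equivalent, more robust functionality:

```python
import numpy as np

def random_oversample(X, y, minority_label, seed=None):
    """Duplicate random minority-class samples until the classes are balanced."""
    rng = np.random.default_rng(seed)
    minority_idx = np.flatnonzero(y == minority_label)
    majority_idx = np.flatnonzero(y != minority_label)
    extra_idx = rng.choice(minority_idx, size=len(majority_idx) - len(minority_idx), replace=True)
    keep = np.concatenate([np.arange(len(y)), extra_idx])
    return X[keep], y[keep]

def random_undersample(X, y, minority_label, seed=None):
    """Delete random majority-class samples until the classes are balanced."""
    rng = np.random.default_rng(seed)
    minority_idx = np.flatnonzero(y == minority_label)
    majority_idx = np.flatnonzero(y != minority_label)
    kept_majority = rng.choice(majority_idx, size=len(minority_idx), replace=False)
    keep = np.concatenate([minority_idx, kept_majority])
    return X[keep], y[keep]

# Example: 8 majority samples (label 0) and 2 minority samples (label 1).
X = np.arange(20).reshape(10, 2)
y = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
X_over, y_over = random_oversample(X, y, minority_label=1, seed=0)     # 16 samples, 8 per class
X_under, y_under = random_undersample(X, y, minority_label=1, seed=0)  # 4 samples, 2 per class
```

Note that oversampling can encourage overfitting to the duplicated samples, while undersampling discards data; both trade-offs should be weighed against the impact of the imbalance itself.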
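
A minimal sketch of the representation check described above, assuming you know the population share of each group; the 5% tolerance threshold is an illustrative choice, and a statistical test (for example a chi-square goodness-of-fit test) could replace the simple comparison:

```python
import numpy as np

def check_representation(sample_groups, expected_shares, tolerance=0.05):
    """Compare observed group shares in a sample with known population shares.

    sample_groups: one group label per record (e.g. "female" / "male").
    expected_shares: known population share per group label.
    """
    sample_groups = np.asarray(sample_groups)
    report = {}
    for group, expected in expected_shares.items():
        observed = float(np.mean(sample_groups == group))
        report[group] = {
            "observed": observed,
            "expected": expected,
            "flagged": abs(observed - expected) > tolerance,
        }
    return report

# Example: half the population is known to be female, but only 32% of the sample is.
sample = ["female"] * 320 + ["male"] * 680
print(check_representation(sample, {"female": 0.5, "male": 0.5}))
```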