Can the data be representative of the different groups/populations?
- It is important to reduce the risk of bias and of different types of discrimination. Did you consider the diversity and representativeness of the users/individuals in the data?
- When applying statistical generalisation, there is a risk of drawing incorrect inferences from a misrepresentative sample. For instance, in a postal code area where mostly young families live, the few older families living there can be discriminated against because they are not properly represented in the group.
If you answered No then you are at risk
If you are not sure, then you might be at risk too
Recommendations
- Who is covered and who is underrepresented?
- Prevent disparate impact: when a model's outcomes disproportionately disadvantage members of a minority group relative to that group's representation. Consider measuring the accuracy of minority classes too, instead of measuring only the total accuracy (see the first sketch after this list). Adjusting the weighting factors to avoid disparate impact can result in positive discrimination, which has its own issues: disparate treatment.
- One approach to addressing the problem of class imbalance is to randomly resample the training dataset. This technique can help to rebalance the class distribution when classes are under- or over-represented (see the second sketch after this list):
- random oversampling (i.e. duplicating samples from the minority class)
- random undersampling (i.e. deleting samples from the majority class)
- There are trade-offs when determining an AI system’s metrics for success. It is important to balance performance metrics against the risk of negatively impacting vulnerable populations.
- When using techniques like statistical generalisation, it is important to know your data well and to become familiar with who is and who is not represented in the samples. Check the samples against expectations that can be easily verified; for example, if half the population is known to be female, check whether approximately half the sample is female (see the third sketch after this list).
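
As a first sketch, the following Python snippet shows how per-class recall can reveal minority-class errors that a high overall accuracy hides, together with a simple disparate impact ratio. It assumes scikit-learn and NumPy; the arrays `y_true`, `y_pred` and `group` are made-up toy data, not part of any prescribed API:

```python
# Minimal sketch: per-class recall and a disparate impact ratio.
# `y_true`, `y_pred` and `group` are illustrative toy arrays.
import numpy as np
from sklearn.metrics import confusion_matrix

def per_class_recall(y_true, y_pred):
    """Recall for each class, so minority-class errors are not
    hidden by a high overall accuracy."""
    cm = confusion_matrix(y_true, y_pred)
    return cm.diagonal() / cm.sum(axis=1)

def disparate_impact_ratio(y_pred, group, positive=1):
    """Selection rate of the unprivileged group (group == 0) divided by
    the selection rate of the privileged group (group == 1).
    A common rule of thumb flags ratios below 0.8."""
    rate_unpriv = np.mean(y_pred[group == 0] == positive)
    rate_priv = np.mean(y_pred[group == 1] == positive)
    return rate_unpriv / rate_priv

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])
group  = np.array([0, 0, 0, 1, 1, 1, 1, 0])  # hypothetical protected attribute
print(per_class_recall(y_true, y_pred))       # recall per class: [0.75 0.75]
print(disparate_impact_ratio(y_pred, group))  # ~0.33, well below 0.8
```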
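
For the resampling recommendation, a minimal sketch using the imbalanced-learn package could look like the following (the card does not prescribe a specific library, so this choice is an assumption). Note that resampling should be applied to the training set only, so that evaluation still reflects the real class distribution:

```python
# Minimal sketch of random resampling with imbalanced-learn
# (assumed installed as `imbalanced-learn`).
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler

# Toy imbalanced dataset: roughly 90% majority class, 10% minority class.
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)
print("original:", Counter(y))

# Random oversampling: duplicate minority-class samples until balanced.
X_over, y_over = RandomOverSampler(random_state=42).fit_resample(X, y)
print("oversampled:", Counter(y_over))

# Random undersampling: delete majority-class samples until balanced.
X_under, y_under = RandomUnderSampler(random_state=42).fit_resample(X, y)
print("undersampled:", Counter(y_under))
```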
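
Finally, the representativeness check described above can be sketched as a simple binomial test with SciPy; the sample counts below are made up for illustration:

```python
# Minimal sketch: does the sample share of a group match the known
# population share? The counts here are hypothetical.
from scipy.stats import binomtest

n_samples = 1200  # size of the collected sample (assumed)
n_female = 480    # sampled individuals who are female (assumed)

result = binomtest(n_female, n_samples, p=0.5)  # expected 50% female
print(f"sample share: {n_female / n_samples:.2%}")
print(f"p-value vs. expected 50%: {result.pvalue:.4f}")
# A small p-value (e.g. < 0.05) suggests the sample deviates from the
# expected population share and may under-represent a group.
```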
Interesting resources/references
- Related to disparate impact
- AI Fairness - Explanation of Disparate Impact Remover
- Mitigating Bias in AI/ML Models with Disparate Impact Analysis
- Certifying and removing disparate impact
- Avoiding Disparate Impact with Counterfactual Distributions
- Related to random resampling
- Oversampling and Undersampling
- Random Oversampling and Undersampling for Imbalanced Classification
- Related to Statistical Generalization
- Generalization in quantitative and qualitative research: Myths and strategies
- Generalizing Statistical Results to the Entire Population