This page is a fallback for search engines and for cases where JavaScript fails or is disabled.
Please view this card in the library, where you can also find the rest of the plot4ai cards.
Can we minimize the amount of personal data used while preserving model performance?
The principle of data minimization, as outlined in the General Data Protection Regulation (GDPR) and reflected in many global privacy standards, requires that only the data necessary for achieving the system's purpose is collected and processed. However, reducing the data too aggressively can degrade the accuracy and performance of AI models, with potentially serious consequences. Balancing regulatory compliance with operational effectiveness is essential: the goal is to adhere to privacy principles without undermining the model's reliability.
If you answered No, then you are at risk
If you are not sure, then you might be at risk too
Recommendations
- Achieve data minimization by starting with a smaller dataset and iteratively adding data only as needed, based on observed performance improvements, to justify why additional data is necessary.
- Use high-quality data to reduce the need for large datasets while ensuring sufficient diversity and representativeness for your model.
- Apply advanced privacy-preserving techniques such as pseudonymization, perturbation, differential privacy, federated learning, or synthetic data generation to comply with privacy regulations while using larger datasets.
- Collaborate with experts to select the minimum set of features needed, ensuring relevance to the objective and avoiding issues like the Curse of Dimensionality, which can degrade model performance when unnecessary features are included.
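The first recommendation can be sketched as a simple growth loop: train on an increasing subset of records and stop as soon as the improvement from each added batch falls below a threshold, so every extra record is justified by a measurable gain. In this sketch, `train_and_score` is a hypothetical stand-in that returns a toy learning curve with diminishing returns, not a real training routine:

```python
def train_and_score(data):
    # Hypothetical stand-in for training a model and measuring
    # validation accuracy; saturates as more data is added.
    n = len(data)
    return n / (n + 50)

def minimal_dataset(full_data, step=100, tolerance=0.01):
    """Grow the training set in increments of `step` records and
    stop once the score improvement drops below `tolerance`."""
    used = full_data[:step]
    best = train_and_score(used)
    for end in range(2 * step, len(full_data) + 1, step):
        candidate = full_data[:end]
        score = train_and_score(candidate)
        if score - best < tolerance:
            break  # extra personal data no longer pays off
        used, best = candidate, score
    return used, best

records = list(range(2000))  # placeholder for personal-data records
subset, score = minimal_dataset(records)
print(len(subset), round(score, 3))
```

With the toy curve above, the loop stops well before consuming all 2000 records, which is exactly the justification trail the recommendation asks for: each batch retained corresponds to an observed performance gain.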
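Of the privacy-preserving techniques listed above, differential privacy is the easiest to sketch concretely. Below is a minimal Laplace-mechanism example for releasing the mean of bounded values; the dataset, bounds, and epsilon are illustrative, and production systems should use a vetted library rather than hand-rolled noise:

```python
import math
import random

def dp_mean(values, lower, upper, epsilon):
    """Differentially private mean of values clipped to
    [lower, upper], using the Laplace mechanism. The sensitivity
    of the mean of n bounded values is (upper - lower) / n."""
    clipped = [min(max(v, lower), upper) for v in values]
    n = len(clipped)
    sensitivity = (upper - lower) / n
    # Sample Laplace(0, sensitivity / epsilon) via inverse CDF.
    u = random.random() - 0.5
    noise = -(sensitivity / epsilon) * math.copysign(
        math.log(1 - 2 * abs(u)), u)
    return sum(clipped) / n + noise

random.seed(0)
ages = [23, 35, 41, 29, 52, 38, 44, 31]  # illustrative records
print(round(dp_mean(ages, 18, 90, epsilon=1.0), 2))
```

Smaller epsilon values add more noise (stronger privacy, less accuracy), which is the same compliance-versus-performance trade-off this card describes, made tunable.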
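Selecting the minimum set of features can be sketched as greedy backward elimination: repeatedly drop any feature whose removal barely changes the score. Here `feature_score` is a hypothetical stand-in for cross-validated accuracy, built so that two features carry the signal and the rest only add noise; the column names are illustrative:

```python
def feature_score(features):
    # Hypothetical stand-in for cross-validated accuracy using
    # only the given features: two carry all of the signal, the
    # rest slightly hurt (curse of dimensionality in miniature).
    informative = {"age_band", "postcode_region"}
    signal = 0.4 * len(informative & set(features))
    noise_penalty = 0.01 * len(set(features) - informative)
    return 0.1 + signal - noise_penalty

def minimal_features(features, max_drop=0.005):
    """Greedy backward elimination: drop any feature whose
    removal costs at most `max_drop` in score."""
    current = list(features)
    improved = True
    while improved:
        improved = False
        base = feature_score(current)
        for f in list(current):
            trial = [x for x in current if x != f]
            if trial and feature_score(trial) >= base - max_drop:
                current = trial  # f was not pulling its weight
                improved = True
                break
    return current

cols = ["age_band", "postcode_region", "full_name", "email", "browser"]
print(sorted(minimal_features(cols)))
```

The loop keeps only the informative columns, which serves both goals at once: fewer personal-data fields are processed, and the noisy extra dimensions that degrade the model are gone.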
Interesting resources/references
- Artificial Intelligence and Data Protection: How the GDPR Regulates AI (page 13)
- Data Minimization for GDPR Compliance in Machine Learning Models: methods like the one proposed in this paper can inspire ways to mitigate the accuracy risk. The authors show how to reduce the amount of personal data needed for predictions by removing or generalizing some of the input features.
- The answer to this post also explains how this problem affects different types of models: Does Dimensionality curse effect some models more than others?
- Towards Breaking the Curse of Dimensionality for High-Dimensional Privacy