This page is a fallback for search engines and cases when javascript fails or is disabled.
Please view this card in the library, where you can also find the rest of the plot4ai cards.
Could the AI system deliberately mislead users or hide its capabilities during deployment or evaluation?
Could the AI system deliberately mislead users or hide its capabilities during deployment or evaluation?
- Advanced models may learn to present false information or appear compliant during oversight, while internally pursuing misaligned goals.
- Deceptive behavior poses a serious safety risk if systems adapt strategically to evade human control or auditing.
If you answered Yes then you are at risk
If you are not sure, then you might be at risk too
Recommendations
- Conduct adversarial testing for deception and misalignment.
- Use interpretability tools to identify goal misgeneralization.
- Include behavior probes during training and monitoring.
- Flag deceptive responses in benchmark datasets.