This page is a fallback for search engines and cases when javascript fails or is disabled.
Please view this card in the library, where you can also find the rest of the plot4ai cards.
Could malicious fine-tuning compromise the safety or alignment of our GenAI model?
Could malicious fine-tuning compromise the safety or alignment of our GenAI model?
- Adversaries can fine-tune or subtly manipulate your LLM using harmful data, leading to unsafe, biased, or deceptive behaviors.
- Common fine-tuning attacks include:
- Instruction Manipulation: Injects unsafe instructions into fine-tuning data, teaching the model to follow harmful prompts.
- Output Manipulation: Poisons target outputs in the fine-tuning data, causing the model to generate malicious or biased responses, even when prompts seem neutral.
- Backdoor Attacks: Implant hidden triggers during fine-tuning that activate malicious behavior only when specific input patterns appear. The model behaves normally otherwise, making these attacks hard to detect.
- Alignment Degradation: Subtly erodes the model’s safety alignment during fine-tuning, making it gradually more permissive to unsafe behavior without explicit instructions.
- Reward Hijacking: Tricks the reward model into preferring harmful outputs, effectively training the model to give unsafe or misleading responses.
- Semantic Drift: Slightly alters wording or context in fine-tuning data to shift the model’s behavior, causing it to appear aligned while subtly reinforcing harmful stereotypes or unsafe reasoning.
- These threats can be introduced via fine-tuning-as-a-service platforms, open-source model reuse, or contaminated user-provided datasets.
- Even small amounts of harmful fine-tuning data can significantly degrade model alignment and safety.
If you answered Yes then you are at risk
If you are not sure, then you might be at risk too
Recommendations
- Vet and sanitize fine-tuning datasets, including user-submitted data and third-party sources.
- Implement anomaly detection and alignment regression tests before and after fine-tuning.
- Restrict or audit fine-tuning privileges, especially on shared infrastructure or open APIs.
- Use differential privacy, prompt injection detection, and trigger auditing tools to detect backdoors.
- Conduct red-teaming to assess the effects of adversarial fine-tuning and monitor for misalignment drift over time.