IRepair | Fraol Batole

Co-author | ESEC/FSE 2025 | Model repair

IRepair treats harmful LLM behavior as a repair problem rather than a full retraining problem. The method localizes error-concentrated model components and applies a targeted update so the faulty behavior is reduced while unrelated capabilities are preserved.

Problem

LLMs can acquire undesired behaviors, such as toxic output, from their training data. Correcting those behaviors with broad fine-tuning can introduce regressions in capabilities that should remain stable.

Approach

IRepair localizes the model components most responsible for the faulty behavior and applies a targeted update guided by the intended output distribution.
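
The localize-then-update idea can be illustrated on a toy model. The sketch below is not the paper's implementation. It assumes a small softmax classifier split into three hypothetical "blocks", uses the gradient norm on faulty inputs as the localization signal, and uses a KL term against the original model's outputs as a stand-in for the intended output distribution. All function names, data, and hyperparameters are made up for illustration.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def forward(blocks, x):
    """Class probabilities; block k only reads its own slice of the features."""
    parts = np.split(x, len(blocks), axis=1)
    return softmax(sum(part @ W.T for part, W in zip(parts, blocks)))

def block_grads(blocks, x, target):
    """Per-block gradient of the mean cross-entropy against `target` rows.

    For a fixed soft target this is also the gradient of KL(target || model).
    """
    err = forward(blocks, x) - target
    parts = np.split(x, len(blocks), axis=1)
    return [err.T @ part / len(x) for part in parts]

def localize(blocks, x_bad, y_intended):
    """Index of the block with the largest gradient norm on the faulty inputs."""
    norms = [np.linalg.norm(g) for g in block_grads(blocks, x_bad, y_intended)]
    return int(np.argmax(norms))

def repair(blocks, x_bad, y_intended, x_ref, kl_weight=1.0, lr=0.2, steps=300):
    """Update only the localized block; a KL term anchors it to the original outputs."""
    p_ref = forward(blocks, x_ref)  # original model's distribution on reference data
    k = localize(blocks, x_bad, y_intended)
    repaired = [W.copy() for W in blocks]
    for _ in range(steps):
        g_fix = block_grads(repaired, x_bad, y_intended)[k]
        g_kl = block_grads(repaired, x_ref, p_ref)[k]
        repaired[k] -= lr * (g_fix + kl_weight * g_kl)
    return repaired, k

def intended_nll(blocks, x, y):
    """Mean negative log-likelihood of the intended class."""
    return float(-np.mean(np.sum(y * np.log(forward(blocks, x) + 1e-12), axis=1)))

def make_toy_problem(seed=0):
    """Hypothetical data: the fault is planted in block 1 (class 1 = faulty output)."""
    rng = np.random.default_rng(seed)
    blocks = [0.5 * rng.normal(size=(2, 2)) for _ in range(3)]
    blocks[1] = np.array([[-2.0, -2.0], [2.0, 2.0]])
    x_bad = 0.1 * rng.normal(size=(64, 6))
    x_bad[:, 2:4] = np.abs(rng.normal(size=(64, 2))) + 0.5  # triggers block 1
    y_intended = np.tile([1.0, 0.0], (64, 1))
    x_ref = rng.normal(size=(256, 6)) * np.array([1, 1, 0.3, 0.3, 1, 1])
    return blocks, x_bad, y_intended, x_ref

blocks, x_bad, y_intended, x_ref = make_toy_problem()
repaired, k = repair(blocks, x_bad, y_intended, x_ref)
print("localized block:", k)
print("NLL on faulty inputs, before:", intended_nll(blocks, x_bad, y_intended))
print("NLL on faulty inputs, after: ", intended_nll(repaired, x_bad, y_intended))
```

In this toy setup only the selected block changes, so the other blocks are preserved exactly. The sketch localizes once before the update; whether and how often the real method re-selects components is described in the paper, not here.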

Results

Across three GPT-family models from 800M to 1.6B parameters, IRepair+KL reduced toxicity by 88.7% with an 11% perplexity increase. Compared with DPO, IRepair was 43.6% more effective at repair and caused 46% less disruption.
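
For readers unfamiliar with how such relative figures are usually derived, here is a minimal sketch. It assumes the common definitions (percentage change relative to the unrepaired model, and perplexity as the exponential of the mean per-token negative log-likelihood); the paper's exact metric definitions may differ, and the input numbers below are hypothetical, not results from the paper.

```python
import math

def relative_change(before, after):
    """Signed relative change in percent; a negative value means a reduction."""
    return 100.0 * (after - before) / before

def perplexity(token_nlls):
    """exp of the mean per-token negative log-likelihood (natural log)."""
    return math.exp(sum(token_nlls) / len(token_nlls))

# Hypothetical counts of toxic generations per 100 prompts, before and after repair.
print(relative_change(before=8, after=2))

# Hypothetical per-token NLLs for a held-out text, before and after repair.
ppl_before = perplexity([2.0, 2.0, 2.0])
ppl_after = perplexity([2.1, 2.1, 2.1])
print(relative_change(ppl_before, ppl_after))
```

A repair is attractive when the first number is strongly negative while the second stays close to zero.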

Why it matters

Targeted repair supports incremental maintenance of deployed models, where localized behavior changes are preferable to repeated full-model adaptation.

See the paper for the method and evaluation (Imtiaz et al., 2025).

References

2025