IRepair
A targeted repair method that localizes and edits faulty LLM behavior while limiting disruption to general model performance.
IRepair treats harmful LLM behavior as a repair problem rather than a full retraining problem. The method localizes error-concentrated model components and applies a targeted update so the faulty behavior is reduced while unrelated capabilities are preserved.
LLMs can acquire undesired behaviors, such as toxic output, from their training data, and broad fine-tuning aimed at removing them can introduce regressions in capabilities that should remain stable.
IRepair localizes the model components most responsible for the faulty behavior and applies a targeted update guided by the intended output distribution.
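The localize-then-update idea can be illustrated with a toy sketch. It is not the paper's implementation. It assumes that "most responsible" is approximated by a size-normalized gradient norm per block, and that the targeted update is a plain gradient step restricted to the selected blocks. The function names, block names, and numbers are all hypothetical.

```python
import math

def block_sensitivity(grads):
    """Score each block by the L2 norm of its gradient, divided by its size.

    Assumption: a larger gradient on faulty examples marks a block as more
    responsible for the faulty behavior.
    """
    return {
        name: math.sqrt(sum(g * g for g in gs)) / len(gs)
        for name, gs in grads.items()
    }

def select_blocks(scores, k):
    """Return the names of the k highest-scoring blocks."""
    return sorted(scores, key=scores.get, reverse=True)[:k]

def targeted_update(params, grads, selected, lr):
    """Take a gradient step on the selected blocks; copy the rest unchanged."""
    return {
        name: [w - lr * g for w, g in zip(ws, grads[name])]
        if name in selected else list(ws)
        for name, ws in params.items()
    }

# Hypothetical three-block "model" and gradients from a batch of faulty outputs.
params = {"block_0": [0.5, -0.2], "block_1": [1.0, 0.3], "block_2": [-0.7, 0.1]}
grads = {"block_0": [0.01, -0.02], "block_1": [0.9, -0.4], "block_2": [0.05, 0.0]}

scores = block_sensitivity(grads)
selected = select_blocks(scores, k=1)
repaired = targeted_update(params, grads, set(selected), lr=0.1)
```

Here only `block_1` is selected and modified. The other blocks keep their original values, which is the property that limits disruption to unrelated behavior.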
Across three GPT-family models from 800M to 1.6B parameters, IRepair+KL reduced toxicity by 88.7% with an 11% perplexity increase. Compared with direct preference optimization (DPO), IRepair was 43.6% more effective at repair and caused 46% less disruption to general performance.
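For readers who want to see how percentages of this kind are typically derived, the sketch below computes a relative toxicity reduction and a relative perplexity increase from before-and-after scores. It assumes both figures are relative changes against the unrepaired model; the paper's exact metric definitions may differ. The scores are hypothetical and are not taken from the paper.

```python
def relative_change(before, after):
    """Signed relative change from `before` to `after`, in percent."""
    if before == 0:
        raise ValueError("baseline must be non-zero")
    return 100.0 * (after - before) / before

# Hypothetical scores for illustration only.
toxicity_before, toxicity_after = 0.40, 0.05
perplexity_before, perplexity_after = 20.0, 22.0

# A drop in toxicity is reported as a positive reduction.
toxicity_reduction = -relative_change(toxicity_before, toxicity_after)
# A rise in perplexity is reported as a positive increase (a cost).
perplexity_increase = relative_change(perplexity_before, perplexity_after)
```

With these made-up inputs the reduction is about 87.5% and the perplexity increase is about 10%, which shows how a large repair effect can come with a modest cost in general language-modeling quality.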
Targeted repair supports incremental maintenance of deployed models, where localized behavior changes are preferable to repeated full-model adaptation.
See the paper for the method and evaluation (Imtiaz et al., 2025).