[Key Point] How to Avoid Catastrophic Forgetting in SFT

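1. Elastic Weight Consolidation (EWC)

Overview: EWC adds a regularization term to the loss that penalizes important parameters for drifting away from the values they reached on the previous task. The penalty is weighted by the Fisher information matrix, which estimates how important each parameter is to the previous task.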

Formula:

$$L(\theta) = L_{\text{new}}(\theta) + \sum_i \frac{\lambda}{2} F_i (\theta_i - \theta^*_{i})^2$$

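where $F_i$ is the $i$-th diagonal entry of the Fisher information matrix, $\theta^*_{i}$ is the value of parameter $i$ after training on the old task, and $\lambda$ controls how strongly the old task is protected.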

Code example:

python
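import torch

def ewc_loss(new_loss, model, fisher_matrix, old_params, lambda_ewc):
    # fisher_matrix and old_params are lists of tensors aligned with model.parameters()
    ewc_penalty = 0
    for param, fisher, old_param in zip(model.parameters(), fisher_matrix, old_params):
        # Fisher-weighted squared distance from the old task's parameters
        ewc_penalty += torch.sum(fisher * (param - old_param).pow(2))
    return new_loss + (lambda_ewc / 2) * ewc_penalty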
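
The `ewc_loss` function takes `fisher_matrix` as an input but does not show where it comes from. One common approximation is the empirical diagonal Fisher: the average of squared per-sample gradients of the log-likelihood on old-task data. The sketch below assumes a classification model and uses the true labels rather than labels sampled from the model; the function name `estimate_diagonal_fisher` is hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def estimate_diagonal_fisher(model, inputs, labels):
    """Empirical diagonal Fisher: mean squared per-sample gradient of the log-likelihood."""
    params = list(model.parameters())
    fisher = [torch.zeros_like(p) for p in params]
    model.eval()
    for x, y in zip(inputs, labels):
        model.zero_grad()
        # Negative log-likelihood of a single old-task example
        log_probs = F.log_softmax(model(x.unsqueeze(0)), dim=1)
        nll = F.nll_loss(log_probs, y.unsqueeze(0))
        nll.backward()
        for f, p in zip(fisher, params):
            f += p.grad.detach() ** 2
    # Average over the examples; the list is aligned with model.parameters()
    return [f / len(inputs) for f in fisher]

# Toy usage (assumed shapes): 8 examples, 4 features, 3 classes
torch.manual_seed(0)
toy_model = nn.Linear(4, 3)
toy_inputs = torch.randn(8, 4)
toy_labels = torch.randint(0, 3, (8,))
toy_fisher = estimate_diagonal_fisher(toy_model, toy_inputs, toy_labels)
# Snapshot of the old-task parameters, to be passed as old_params
toy_old_params = [p.detach().clone() for p in toy_model.parameters()]
```

The returned list and the parameter snapshot can then be passed as `fisher_matrix` and `old_params` when fine-tuning on the new task.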

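2. Progressive Neural Networks

Overview: Progressive neural networks add a new network module for each new task while keeping the existing modules unchanged. The new module reuses knowledge from the old modules through lateral connections.

Code example: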

python
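import torch.nn as nn

class ProgressiveNN(nn.Module):
    def __init__(self, task_modules):
        super().__init__()
        # nn.ModuleList registers every module's parameters with the model
        self.task_modules = nn.ModuleList(task_modules)

    def add_task(self, new_module):
        # Freeze all existing modules so training the new task cannot change them
        for param in self.task_modules.parameters():
            param.requires_grad_(False)
        self.task_modules.append(new_module)

    def forward(self, x, task_id):
        # Simplified: routes to a single module; lateral connections are omitted
        return self.task_modules[task_id](x)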
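
The class above routes each input to a single module and does not show the lateral connections mentioned in the overview. As a minimal sketch, assuming just two tasks and one hidden layer per module, the new module can receive the frozen old module's hidden activation through an extra linear layer. The class name `TwoColumnProgressiveNet` and all layer names are hypothetical.

```python
import torch
import torch.nn as nn

class TwoColumnProgressiveNet(nn.Module):
    def __init__(self, in_dim, hidden_dim, out_dim):
        super().__init__()
        # Module for the old task (frozen below)
        self.old_hidden = nn.Linear(in_dim, hidden_dim)
        self.old_out = nn.Linear(hidden_dim, out_dim)
        # Module for the new task
        self.new_hidden = nn.Linear(in_dim, hidden_dim)
        self.new_out = nn.Linear(hidden_dim, out_dim)
        # Lateral connection: old hidden features feed the new output layer
        self.lateral = nn.Linear(hidden_dim, out_dim, bias=False)
        for p in list(self.old_hidden.parameters()) + list(self.old_out.parameters()):
            p.requires_grad_(False)

    def forward(self, x, task_id):
        h_old = torch.relu(self.old_hidden(x))
        if task_id == 0:
            # The old task uses only its own, unchanged module
            return self.old_out(h_old)
        h_new = torch.relu(self.new_hidden(x))
        # The new task combines its own features with the old module's features
        return self.new_out(h_new) + self.lateral(h_old)

torch.manual_seed(0)
net = TwoColumnProgressiveNet(in_dim=4, hidden_dim=8, out_dim=3)
batch = torch.randn(5, 4)
new_task_output = net(batch, task_id=1)
new_task_output.sum().backward()
```

Because the old module's parameters never receive gradients, its outputs on the old task are identical before and after training on the new task.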

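3. Knowledge Distillation

Overview: Knowledge distillation uses the old model's outputs as soft labels when training the new model, so that knowledge of the old tasks is retained.

Formula: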

$$L_{\text{distill}} = \alpha L_{\text{hard}} + (1 - \alpha) L_{\text{soft}}$$

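where $L_{\text{hard}}$ is the loss against the ground-truth labels, $L_{\text{soft}}$ is the loss against the old model's outputs, and $\alpha \in [0, 1]$ balances the two.

Code example: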

python
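import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, temperature, alpha):
    # Soft loss: KL divergence between the temperature-softened distributions
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=1),
        F.softmax(teacher_logits.detach() / temperature, dim=1),
        reduction='batchmean'
    ) * (temperature ** 2)  # T^2 keeps the gradient scale comparable across temperatures
    # Hard loss: cross-entropy against the ground-truth labels
    hard_loss = F.cross_entropy(student_logits, labels)
    return alpha * hard_loss + (1 - alpha) * soft_loss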

4. Memory Augmentation

Overview: An external memory module lets the model access data or features from old tasks while it trains on a new task.
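
The text does not specify a memory design. As one illustrative possibility, the sketch below stores old-task feature vectors as keys with associated values and retrieves the nearest entries by cosine similarity. The class `KeyValueMemory` and its methods are hypothetical, and a real system would typically use learned embeddings and an approximate nearest-neighbor index.

```python
import math

class KeyValueMemory:
    """Stores (feature vector, value) pairs and returns the values of the closest keys."""

    def __init__(self):
        self.keys = []
        self.values = []

    def write(self, key, value):
        self.keys.append(list(key))
        self.values.append(value)

    @staticmethod
    def _cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return dot / norm if norm > 0 else 0.0

    def read(self, query, k=1):
        # Rank stored entries by similarity to the query, most similar first
        ranked = sorted(
            zip(self.keys, self.values),
            key=lambda entry: self._cosine(query, entry[0]),
            reverse=True,
        )
        return [value for _, value in ranked[:k]]

memory = KeyValueMemory()
memory.write([1.0, 0.0], "old_task_a")
memory.write([0.0, 1.0], "old_task_b")
nearest = memory.read([0.9, 0.1], k=1)
```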

5. Multi-task Learning

Overview: Train on several tasks at the same time, so that the model keeps optimizing the old tasks while it learns the new one.
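
In SFT, one simple way to train on several tasks at once is to draw every batch from a weighted mix of task datasets. The sketch below assumes each dataset is an in-memory list of examples; the function `mixed_batches`, the task names, and the weights are all hypothetical.

```python
import random

def mixed_batches(task_datasets, weights, batch_size, num_batches, seed=0):
    """Yield batches whose examples are drawn from several tasks in proportion to weights."""
    rng = random.Random(seed)
    names = list(task_datasets)
    probs = [weights[name] for name in names]
    for _ in range(num_batches):
        batch = []
        for _ in range(batch_size):
            # Pick a task according to its weight, then a random example from that task
            task = rng.choices(names, weights=probs, k=1)[0]
            batch.append((task, rng.choice(task_datasets[task])))
        yield batch

datasets = {
    "general": ["g1", "g2"],        # stands in for general instruction data
    "domain": ["d1", "d2", "d3"],   # stands in for the new target-domain data
}
batches = list(mixed_batches(datasets, {"general": 0.3, "domain": 0.7}, batch_size=4, num_batches=5))
```

The mixing weights are a tuning choice: a larger share of general data tends to protect existing abilities at the cost of slower adaptation to the new domain.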

6. Experience Replay

Overview: While training on the new task, periodically replay data from old tasks to help the model retain what it learned on them.
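
A minimal sketch of replay, assuming old-task examples arrive as a stream and only a fixed number can be kept: reservoir sampling keeps a uniform random subset in the buffer, and each new-task batch is extended with a few replayed examples. The names `ReplayBuffer`, `build_batch`, and `replay_fraction` are hypothetical.

```python
import random

class ReplayBuffer:
    """Fixed-size buffer that keeps a uniform random sample of everything it has seen."""

    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.items = []
        self.seen = 0
        self.rng = random.Random(seed)

    def add(self, example):
        self.seen += 1
        if len(self.items) < self.capacity:
            self.items.append(example)
        else:
            # Reservoir sampling: keep the new example with probability capacity / seen
            slot = self.rng.randrange(self.seen)
            if slot < self.capacity:
                self.items[slot] = example

    def sample(self, n):
        return self.rng.sample(self.items, min(n, len(self.items)))

def build_batch(new_examples, buffer, replay_fraction):
    # Append replayed old-task examples to the new-task batch
    n_replay = int(len(new_examples) * replay_fraction)
    return list(new_examples) + buffer.sample(n_replay)

buffer = ReplayBuffer(capacity=3)
for i in range(10):
    buffer.add(f"old_{i}")
new_batch = ["new_0", "new_1", "new_2", "new_3"]
train_batch = build_batch(new_batch, buffer, replay_fraction=0.5)
```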

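7. Regularization Methods

Overview: Add a regularization term to the loss that limits how far the model's parameters can move, which reduces forgetting of old tasks.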
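
The text does not name a specific regularizer. One simple choice, sometimes called L2-SP, penalizes the squared distance between the current parameters and the pre-trained starting point; it is equivalent to EWC with every Fisher weight set to 1. The function name `l2_sp_penalty` and the `strength` parameter below are hypothetical.

```python
import torch
import torch.nn as nn

def l2_sp_penalty(model, reference_params, strength):
    """Half the squared distance to the reference parameters, scaled by strength."""
    penalty = sum(
        ((p - p_ref) ** 2).sum()
        for p, p_ref in zip(model.parameters(), reference_params)
    )
    return 0.5 * strength * penalty

torch.manual_seed(0)
layer = nn.Linear(2, 1)
# Snapshot the starting weights before fine-tuning begins
reference = [p.detach().clone() for p in layer.parameters()]
penalty_at_start = l2_sp_penalty(layer, reference, strength=2.0)

# Move every parameter by 1.0 to see the penalty grow (3 parameters in total)
with torch.no_grad():
    for p in layer.parameters():
        p.add_(1.0)
penalty_after_shift = l2_sp_penalty(layer, reference, strength=2.0)
```

During fine-tuning the penalty would be added to the new-task loss before calling `backward()`.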
