Understood. I will focus on finding reinforcement learning papers whose approach resembles the method in the PDF you provided, in particular papers that:

  • train a model on a source task (not limited to DDPG, TD3, PPO, SAC, etc.),
  • directly transfer (copy) its network parameters (e.g. the hidden layers or the entire network) to a target task, and
  • continue training or fine-tuning on the target task.

I will include tasks with both continuous and discrete action spaces, covering different RL algorithms, and only briefly mention comparable hard-parameter transfer work in the sim-to-real setting. Please hold on while I compile the results.

Atari (Discrete) Tasks

Actor-Mimic (Parisotto et al., 2016) trained a single DQN policy network on multiple Atari games via “distillation” from expert teachers. The learned network weights were then copied (except the final layer) to initialize new DQN agents on novel Atari games. On targets like Breakout, Star Gunner and Video Pinball, this initialization significantly accelerated learning: for example, pretraining on Pong and related games saved up to 5 million frames of training time on Breakout and Video Pinball. Their experiments (training on 8 source games, fine-tuning on 7 targets) showed large positive transfer when source and target had similar mechanics (e.g. Pong→Breakout).
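
To make the mechanism concrete, the following is a minimal PyTorch sketch of the "copy everything except the final layer" initialization described above. It is not the Actor-Mimic code; the network sizes and per-game action counts are illustrative assumptions.

```python
import torch.nn as nn

class DQN(nn.Module):
    """Small convolutional Q-network in the style of the Atari DQN (illustrative sizes, 84x84x4 input)."""
    def __init__(self, n_actions: int):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),
        )
        self.q_head = nn.Linear(512, n_actions)  # final layer: NOT transferred

    def forward(self, x):
        return self.q_head(self.features(x))

# Source network pretrained on the source games (here just randomly initialized for the sketch).
source_net = DQN(n_actions=6)   # e.g. Pong's minimal action set
target_net = DQN(n_actions=4)   # e.g. Breakout's minimal action set

# Copy every parameter except the final Q head, whose shape differs between games anyway.
pretrained = {k: v for k, v in source_net.state_dict().items()
              if not k.startswith("q_head")}
missing, unexpected = target_net.load_state_dict(pretrained, strict=False)
print("kept random init for:", missing)   # ['q_head.weight', 'q_head.bias']

# target_net is now the warm-started agent; standard DQN training on the target game continues from here.
```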

Transferability of DQN (Sabatelli & Geurts, 2020) performed extensive experiments fine-tuning DQN/Double-DQN agents across many Atari games. They found that naively copying weights often gave little or negative benefit: most fine-tuned agents did not outperform scratch training. The authors report that “transferring neural networks in a DRL context… in most cases results in negative transfer”. In heatmaps of transfer scores, pretraining often slowed learning or produced no gain. This suggests that Atari policies can be brittle and that direct weight transfer may fail if source/target dynamics differ widely.

Other discrete examples: Early studies (e.g. Glatt et al. 2016) also tried initializing DQN weights from one game to another and fine-tuning, finding mixed success depending on similarity. In general, when action/state structure is similar, weight transfer can help Atari agents learn faster; when it’s not, it can hurt performance. These works emphasize that weight-level transfer in games can yield large speed-ups only in favorable cases of high task similarity.

Continuous Control and Robotics

Cao et al. (2021) studied fine-tuning RL policies across continuous control tasks with varying reward shapes and obstacles. They showed that a vanilla fine-tune (copy all hidden-layer weights then train) often fails when source/target require qualitatively different trajectories (different “homotopy classes”). In MuJoCo and simulated robotics domains (navigation with barriers, 3D navigation, Lunar Lander, Fetch reach, MuJoCo Ant, and an assistive feeding task), they compared naïve fine-tuning to baselines (Progressive NN, batch spectral shrinkage) and to their proposed Ease-In-Ease-Out two-stage method. Their results (measuring steps to reach a target return) found that with large differences (e.g. big obstacles), fine-tuning required far more experience than their method. In simple cases, naive fine-tune can work, but in hard transfer cases it often needed thousands more steps or even failed.
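
The sample-efficiency comparison in such studies is typically reported as "environment steps until a target return is first reached". Below is a hedged, generic sketch of that measurement loop, not Cao et al.'s code: it assumes a Gymnasium-style environment API, and the agent interface (`act`, `observe`) and the `make_agent` factory in the usage comment are hypothetical.

```python
import numpy as np

def steps_to_target_return(env, agent, target_return, max_steps=1_000_000,
                           eval_every=5_000, eval_episodes=10):
    """Train `agent` on `env` and return the number of environment steps taken
    before its mean evaluation return first reaches `target_return`.

    Assumed (hypothetical) agent interface:
        agent.act(obs, explore=True) -> action
        agent.observe(obs, action, reward, next_obs, done)  # store transition + update
    """
    obs, _ = env.reset()
    for step in range(1, max_steps + 1):
        action = agent.act(obs, explore=True)
        next_obs, reward, terminated, truncated, _ = env.step(action)
        agent.observe(obs, action, reward, next_obs, terminated)
        obs = next_obs if not (terminated or truncated) else env.reset()[0]

        if step % eval_every == 0:                    # periodic greedy evaluation
            returns = []
            for _ in range(eval_episodes):
                o, _ = env.reset()
                done, ep_ret = False, 0.0
                while not done:
                    o, r, term, trunc, _ = env.step(agent.act(o, explore=False))
                    ep_ret += r
                    done = term or trunc
                returns.append(ep_ret)
            if np.mean(returns) >= target_return:
                return step
    return max_steps  # target never reached within the budget

# Usage sketch: compare scratch training against source-initialized fine-tuning.
# scratch   = make_agent(env)                              # hypothetical factory
# warmstart = make_agent(env); warmstart.load("source.pt") # copy source-task weights
# print(steps_to_target_return(env, scratch,   target_return=200.0))
# print(steps_to_target_return(env, warmstart, target_return=200.0))
```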

NerveNet (Wang et al., 2018) introduced a structured GNN policy for locomotion. They pre-trained NerveNet on a MuJoCo centipede agent and then fine-tuned it on morphed agents (different size or missing legs). In “size” and “disability” transfer tasks, simply initializing with the NerveNet weights cut training time dramatically: the pretrained model required far fewer episodes to reach the solved reward level than training from scratch. In their experiments (e.g. transferring from a 6-legged to 8-legged centipede), fine-tuning the shared-weights policy converged much faster than a new policy, showing clear speedup from direct parameter transfer.
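
The reason the NerveNet weights can be reused across morphologies is that all parameters are shared across nodes of the body graph, so their shapes do not depend on how many joints the agent has. The toy message-passing policy below illustrates only that structural point; it is not NerveNet itself, and the 4-dimensional per-joint observation and identity adjacency are placeholder assumptions.

```python
import torch
import torch.nn as nn

class TinyGraphPolicy(nn.Module):
    """Toy message-passing policy: every weight is shared across nodes, so the
    same parameters apply to bodies with any number of joints."""
    def __init__(self, obs_dim=4, hidden=64):
        super().__init__()
        self.encode = nn.Linear(obs_dim, hidden)       # per-node observation encoder (shared)
        self.message = nn.Linear(hidden, hidden)       # shared message function
        self.node_update = nn.GRUCell(hidden, hidden)  # shared node-state update
        self.action_head = nn.Linear(hidden, 1)        # one torque per joint (shared)

    def forward(self, node_obs, adjacency):
        # node_obs: (num_nodes, obs_dim); adjacency: (num_nodes, num_nodes) 0/1 matrix
        h = torch.tanh(self.encode(node_obs))
        for _ in range(3):                             # a few propagation steps
            msgs = adjacency @ self.message(h)         # aggregate neighbour messages
            h = self.node_update(msgs, h)
        return self.action_head(h).squeeze(-1)         # (num_nodes,) joint actions

policy = TinyGraphPolicy()   # pretend these weights were trained on the smaller agent

# Identity adjacency stands in for the real body graph in this sketch.
obs_12, adj_12 = torch.randn(12, 4), torch.eye(12)   # 12 joints (e.g. smaller centipede)
obs_16, adj_16 = torch.randn(16, 4), torch.eye(16)   # 16 joints (e.g. larger centipede)

a_small = policy(obs_12, adj_12)    # works for 12 joints
a_large = policy(obs_16, adj_16)    # the very same parameters work for 16 joints
# Fine-tuning on the new morphology is then ordinary policy-gradient training
# starting from these transferred weights.
```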

Robotic Manipulation (Julian et al., 2020) explored fine-tuning vision-based manipulation policies for real robots. They found that simple off-policy RL fine-tuning on new conditions (e.g. new backgrounds, object shapes, lighting) was surprisingly effective and sample-efficient. For example, fine-tuning a policy pretrained in one scenario to a new scenario required only about 0.2% of the data that training from scratch would need. They stress that starting from an RL-pretrained policy was essential: neither scratch training nor adapting a supervised vision model could handle the distribution shifts with so little data. In short, an RL agent’s own pretrained weights served as a powerful initialization that fine-tuned robustly to the real-world variations.

Additional continuous-domain notes: Other transfer RL works have explored similar ideas. For example, Wulfmeier et al. (2017) note that in sim-to-real scenarios, “fine-tuning pretrained policies from simulation on the real platform is a straightforward approach” (though they also augment it with adversarial domain alignment). In practice, many robot studies simply copy the sim-trained network weights and continue training on the real robot, which often yields reasonable results. Overall, these continuous-control studies demonstrate that direct network initialization (with PPO, DDPG, SAC, etc.) can drastically reduce training time on new but related tasks.
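
In practice, this “copy the trained weights and keep training” recipe is often just a checkpoint load followed by more training. As a hedged illustration (not tied to any of the cited papers), the pattern with Stable-Baselines3 looks roughly like the sketch below; the environment IDs and step counts are placeholders, and it assumes the source and target environments share observation and action spaces, since the full network is loaded unchanged.

```python
import gymnasium as gym
from stable_baselines3 import SAC

# 1) Train on the source task (e.g. a simulated or simplified variant of the task).
source_env = gym.make("Pendulum-v1")              # placeholder source environment
model = SAC("MlpPolicy", source_env, verbose=0)
model.learn(total_timesteps=50_000)
model.save("source_policy")

# 2) Reload the learned weights and continue training on the target task.
#    This is pure hard-parameter transfer: no architectural changes, same algorithm.
target_env = gym.make("Pendulum-v1")              # placeholder: a shifted or real-world variant
finetuned = SAC.load("source_policy", env=target_env)
finetuned.learn(total_timesteps=10_000, reset_num_timesteps=False)

# 3) Train from scratch on the target with the same budget, for comparison.
scratch = SAC("MlpPolicy", target_env, verbose=0)
scratch.learn(total_timesteps=10_000)
```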

Multi-Task Pretraining and Fine-Tuning

Learning-to-Modulate (Schmied et al., 2023) investigated fine-tuning of large pre-trained RL models. They pretrained a transformer-based agent on 50 Meta-World and 16 DMControl tasks, then studied various fine-tuning protocols on held-out tasks. As in supervised learning, they found that a naive full fine-tune of the large pre-trained model often led to catastrophic forgetting of the pretraining skills. In response, they proposed a learn-to-modulate approach: freezing the base model and only learning small per-layer “modulation” parameters. This kept the pre-trained knowledge intact while adapting to the new task. They showed L2M achieved state-of-the-art continual learning on these benchmarks while preserving prior-task performance. Their key insight was that simply copying all weights and fine-tuning can erode valuable pretraining in RL, much as in supervised learning.
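
The parameter-efficient alternative can be pictured as: keep the pretrained weights frozen and train only small per-layer scale/shift (“modulation”) parameters. The sketch below is a generic, hedged illustration of that idea applied to a simple MLP trunk; it is not the authors' L2M implementation (which modulates a transformer), and the layer sizes are arbitrary.

```python
import torch
import torch.nn as nn

class ModulatedLinear(nn.Module):
    """Wraps a frozen pretrained linear layer with small learnable per-feature
    scale/shift parameters; only the modulation is updated during fine-tuning."""
    def __init__(self, pretrained: nn.Linear):
        super().__init__()
        self.base = pretrained
        for p in self.base.parameters():
            p.requires_grad = False                  # freeze the pretrained weights
        d = pretrained.out_features
        self.scale = nn.Parameter(torch.ones(d))     # learnable per-feature gain
        self.shift = nn.Parameter(torch.zeros(d))    # learnable per-feature bias

    def forward(self, x):
        return self.base(x) * self.scale + self.shift

# Pretend this trunk holds the pretrained multi-task policy weights (sizes arbitrary).
pretrained_trunk = nn.Sequential(nn.Linear(39, 256), nn.ReLU(),
                                 nn.Linear(256, 256), nn.ReLU())

# Wrap each linear layer; the base weights stay intact, only the modulation is new.
modulated = nn.Sequential(*[ModulatedLinear(m) if isinstance(m, nn.Linear) else m
                            for m in pretrained_trunk])

trainable = [p for p in modulated.parameters() if p.requires_grad]
optimizer = torch.optim.Adam(trainable, lr=3e-4)     # only modulation (plus any new head) is optimized
print(sum(p.numel() for p in trainable), "trainable vs",
      sum(p.numel() for p in pretrained_trunk.parameters()), "frozen parameters")
```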

Other large-scale pretraining: Although not always phrased as “transfer”, some RL works train on many tasks then fine-tune. For example, Agent57 (DeepMind, 2020) and Dreamer (Hafner et al.) train on collections of games or environments and then evaluate on held-out tasks via fine-tuning. Such methods implicitly rely on weight initialization from multi-task training. These efforts generally find that a good shared representation can speed new-task learning. For instance, multi-task DQN or policy networks (beyond Actor-Mimic) often serve as weight initializations for new games, echoing the success of related approaches described above.

Challenges and Empirical Findings

Across methods, a common theme is that task similarity matters greatly. When source and target share similar features (e.g. Pong → Breakout, or one robot morphology → a slightly altered morphology), transferred weights often accelerate learning. For example, Parisotto et al. found that Pong-trained features helped Breakout and Video Pinball, and Wang et al. showed centipede-learned features transfer to new agents. In these cases the low-level visual and physical skills encoded in the network are directly useful, so fine-tuning quickly adapts.

However, when tasks differ fundamentally, naive weight transfer can fail or even harm learning. Sabatelli et al. demonstrated many Atari transfers gave no benefit, and Cao et al. showed that if the required trajectories lie in different “homotopy classes”, a fine-tuned policy may never reach the optimal behavior. In other words, direct parameter transfer sometimes locks the agent into suboptimal behavior in parts of the state space. As Wolczyk et al. (2024) put it, standard fine-tuning “catastrophically” forgets pre-trained capabilities on unseen states, degrading performance. They found that without countermeasures, fine-tuned RL agents can lose earlier skills on parts of the environment they ignore early on. These studies highlight the practical need for care: fine-tuning alone is not always a panacea in RL, especially when dynamics or goals shift.

In contrast, specialized methods often combine weight transfer with additional mechanisms (e.g. distillation or experience replay) to mitigate forgetting. But the central mechanism in the works above is the same: copy network weights from the source, then continue standard RL on the target. For example, Actor-Mimic and NerveNet copy whole layers (or entire networks) and then fine-tune, Julian et al. fine-tune pretrained Q-networks, and Wulfmeier et al. fine-tune simulation-trained policies. Their experiments consistently show large sample-efficiency gains when the source is well-chosen: e.g. saving days of training on Atari or requiring orders of magnitude fewer real robot trials.

In summary, direct parameter transfer – initializing a new RL agent with a pretrained network’s weights – can dramatically speed learning if the tasks are compatible. Many papers (e.g. Parisotto et al., Cao et al., Wang et al.) report successful fine-tuning on both Atari and continuous domains. Others (Sabatelli et al., Wolczyk et al.) document its limits. Together, these works form a picture: weight-level transfer in RL is powerful when source/target similarities exist, but practitioners should be mindful of task mismatch and potential forgetting.

Sources: We cite representative examples of each category. For Atari/discrete tasks: Actor-Mimic (positive transfers) and Sabatelli et al. (largely negative results). For continuous control: Cao et al. (homotopy curricula), Wang et al. (NerveNet), and Julian et al. (robotic fine-tuning). For multi-task pretraining: Schmied et al. (L2M). For sim-to-real: Wulfmeier et al. Each demonstrates “copy weights, then fine-tune” under different settings, with quantitative results as noted.


