Received Signals.SIGHUP death signal, shutting down workers

Published: 2024-05-10

While training a large model on a single machine with multiple GPUs, the job suddenly crashed with:


  3%|▎         | 146/4992 [2:08:21<72:57:12, 54.20s/it][2024-05-10 13:27:11,479] torch.distributed.elastic.agent.server.api: [WARNING] Received Signals.SIGHUP death signal, shutting down workers
[2024-05-10 13:27:11,481] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 46635 closing signal SIGHUP
[2024-05-10 13:27:11,481] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 46636 closing signal SIGHUP
[2024-05-10 13:27:11,481] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 46637 closing signal SIGHUP
Traceback (most recent call last):
  File "/home/wangguisen/miniconda3/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/home/wangguisen/miniconda3/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 46, in main
    args.func(args)
  File "/home/wangguisen/miniconda3/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1073, in launch_command
    multi_gpu_launcher(args)
  File "/home/wangguisen/miniconda3/lib/python3.10/site-packages/accelerate/commands/launch.py", line 718, in multi_gpu_launcher
    distrib_run.run(args)
  File "/home/wangguisen/miniconda3/lib/python3.10/site-packages/torch/distributed/run.py", line 803, in run
    elastic_launch(
  File "/home/wangguisen/miniconda3/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 135, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/wangguisen/miniconda3/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 259, in launch_agent
    result = agent.run()
  File "/home/wangguisen/miniconda3/lib/python3.10/site-packages/torch/distributed/elastic/metrics/api.py", line 123, in wrapper
    result = f(*args, **kwargs)
  File "/home/wangguisen/miniconda3/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 727, in run
    result = self._invoke_run(role)
  File "/home/wangguisen/miniconda3/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 868, in _invoke_run
    time.sleep(monitor_interval)
  File "/home/wangguisen/miniconda3/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 62, in _terminate_process_handler
    raise SignalException(f"Process {os.getpid()} got signal: {sigval}", sigval=sigval)
torch.distributed.elastic.multiprocessing.api.SignalException: Process 46460 got signal: 1
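The `got signal: 1` in the last line is just the numeric value of SIGHUP, which you can confirm from Python's standard library:

```python
import signal

# On POSIX systems, signal number 1 is SIGHUP (terminal hangup),
# which is exactly what the traceback reports as "got signal: 1".
name = signal.Signals(1).name
print(name)  # SIGHUP
```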

My environment:

  • torch: 2.2
  • Python: 3.10

I was running it in the background via nohup:

nohup sh ./dk/multi_run_demo.sh &

After going through the related issues, the fix is to run the job under screen or tmux instead of nohup. The reason is visible in the traceback itself: torchelastic registers its own SIGHUP handler (`_terminate_process_handler`), and registering a Python handler overrides the ignore-SIGHUP disposition that nohup set up, so the process still dies when the terminal hangs up. See the links at the end for details.
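This interaction can be demonstrated in a few lines (POSIX only; `fake_terminate_handler` below is an illustrative stand-in, not torch's actual code):

```python
import os
import signal

# nohup starts the child process with SIGHUP ignored (SIG_IGN)
signal.signal(signal.SIGHUP, signal.SIG_IGN)

received = []

def fake_terminate_handler(signum, frame):
    # stand-in for torchelastic's _terminate_process_handler,
    # which raises SignalException instead of recording the signal
    received.append(signum)

# torchelastic re-registers its own SIGHUP handler on startup,
# silently undoing nohup's protection
signal.signal(signal.SIGHUP, fake_terminate_handler)

os.kill(os.getpid(), signal.SIGHUP)  # simulate the terminal hanging up
print(received)  # non-empty: the "nohup'd" process saw SIGHUP anyway
```

tmux avoids this entirely because the job's controlling terminal belongs to the tmux server, which keeps running after you disconnect, so no SIGHUP is ever sent.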

  1. Install tmux:
sudo apt-get install tmux  # Ubuntu
sudo yum install tmux  # CentOS

Typing tmux on the command line drops you into a tmux session. Its common commands are:

  • List all current tmux sessions: tmux ls

  • Create a new session: tmux new -s [session-name]

  • Detach from the session and return to the original shell: tmux detach

  • Re-attach to a session:

    • by index: tmux attach -t 0
    • by name: tmux attach -t [session-name]
  • Kill a session:

    • by index: tmux kill-session -t 0
    • by name: tmux kill-session -t [session-name]
    • or simply type exit inside the session
  • Shortcuts all start with Ctrl+b, then:

    • shortcut help: Ctrl+b, then ?, press Esc to exit
    • list all current tmux sessions: Ctrl+b, then s
    • detach and return to the original shell: Ctrl+b, then d
  2. Run the job in the background

Our shell script looks like this:

CUDA_VISIBLE_DEVICES=0,1,2 accelerate launch \
    --config_file yamls/accelerate_single_config.yaml \
    src/train.py yamls/qwen_lora_sft_multi_gpu_demo.yaml \
    > weights/run.log 2>&1

The script is named multi_run_demo.sh; make it executable:

chmod +x ./dk/multi_run_demo.sh

Then create a new tmux session:

tmux new -s multi_run_demo

Then, inside the new tmux window, run:

./dk/multi_run_demo.sh

At this point the program is running.


Use the detach shortcut to leave the session and return to the original shell: Ctrl+b, then d.


To check on the run later, re-attach with:

tmux attach -t multi_run_demo

Additionally, accelerate's --main_process_port XXX option lets you re-specify the main process port.
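For example (29501 is an arbitrary port chosen for illustration; the config and script paths are the ones from the script above):

```shell
CUDA_VISIBLE_DEVICES=0,1,2 accelerate launch \
    --main_process_port 29501 \
    --config_file yamls/accelerate_single_config.yaml \
    src/train.py yamls/qwen_lora_sft_multi_gpu_demo.yaml \
    > weights/run.log 2>&1
```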

ref:

https://github.com/pytorch/pytorch/issues/76894

https://github.com/hiyouga/ChatGLM-Efficient-Tuning/issues/72

