The client socket has timed out after 900000ms while trying to connect to (10.130.18.56, 58509).
NCCL communication timed out in LLaMA-Factory. The log shows that saving the model took only 36 seconds, but after that NCCL could no longer connect.
export NCCL_SOCKET_IFNAME=brainpf0
By default NCCL already selects the high-performance NIC ens20f0np0, so the timeout may have another cause, such as a problem in the training code.
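One way to confirm which interface NCCL actually picks is to turn on its debug logging before launching the job. The snippet below is a minimal sketch: `NCCL_DEBUG` and `NCCL_DEBUG_SUBSYS` are standard NCCL environment variables, and unsetting `NCCL_SOCKET_IFNAME` is only an assumption about how to observe the default choice on this machine.

```shell
# Ask NCCL to log its initialization and network selection.
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=INIT,NET

# Leave NCCL_SOCKET_IFNAME unset to observe NCCL's default choice
# (assumption: no interface override is wanted for this check).
unset NCCL_SOCKET_IFNAME
```

After this, start the training job as usual. The initialization lines in each rank's log should name the interface NCCL selected, which can be compared with the expectation that it is ens20f0np0.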
sudo apt update
sudo apt install net-tools -y
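Main physical NIC:
ens20f0np0: the main physical NIC, MTU=9050 (high-performance configuration)
IP: 10.130.18.86
Very heavy traffic (RX: 339.1 TB / TX: 189.4 TB)
Possibly an RDMA NIC (MTU=9050 is typical of a high-throughput configuration)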
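The timeout message names the peer 10.130.18.56, while ens20f0np0 holds 10.130.18.86 with netmask 255.255.255.224 (taken from the ifconfig output further down). A quick way to see whether the two addresses sit on the same subnet is to mask both and compare. The following is a minimal sketch; the helper names `ip_to_int` and `same_subnet` are made up for illustration.

```shell
# Convert a dotted-quad IPv4 address to a 32-bit integer.
ip_to_int() {
  local a b c d
  IFS=. read -r a b c d <<EOF
$1
EOF
  echo $(( (a << 24) | (b << 16) | (c << 8) | d ))
}

# Exit 0 when both addresses fall in the same network under the given mask.
# Arguments: <ip1> <ip2> <netmask>
same_subnet() {
  local ip1 ip2 mask
  ip1=$(ip_to_int "$1")
  ip2=$(ip_to_int "$2")
  mask=$(ip_to_int "$3")
  [ $(( ip1 & mask )) -eq $(( ip2 & mask )) ]
}
```

With the values from this post, `same_subnet 10.130.18.86 10.130.18.56 255.255.255.224` fails: under that mask .86 belongs to the network ending in .64 and .56 to the one ending in .32. Being on different subnets does not by itself mean the peer is unreachable. It only means the traffic is routed through a gateway, which is one more place to look when a connection times out.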
root@gpu-a800-0061:/app# ifconfig -a
3b9dff80e729_h: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 9000
ether 72:c5:81:9d:40:8b txqueuelen 1000 (以太网)
RX packets 2566 bytes 179716 (179.7 KB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 311834 bytes 81040172 (81.0 MB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
411e5f73fa97_h: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 9000
ether d2:14:6f:89:92:46 txqueuelen 1000 (以太网)
RX packets 402 bytes 28236 (28.2 KB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 47413 bytes 12327380 (12.3 MB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
691193de7103_h: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 9000
ether e2:2f:75:60:2b:8c txqueuelen 1000 (以太网)
RX packets 2851 bytes 199666 (199.6 KB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 346096 bytes 89981458 (89.9 MB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
7fa5c8d3db4e_h: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 9000
ether be:8d:10:3c:0f:15 txqueuelen 1000 (以太网)
RX packets 251236 bytes 169876695 (169.8 MB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 272663 bytes 29576550 (29.5 MB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
98f931b627bf_h: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 9000
ether f6:dd:7f:8f:38:cc txqueuelen 1000 (以太网)
RX packets 13436094 bytes 1671969435 (1.6 GB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 24959555 bytes 818168434071 (818.1 GB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
br-int: flags=4098<BROADCAST,MULTICAST> mtu 9000
ether 88:66:64:60:a0:01 txqueuelen 1000 (以太网)
RX packets 0 bytes 0 (0.0 B)
RX errors 0 dropped 6 overruns 0 frame 0
TX packets 0 bytes 0 (0.0 B)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
brainpf0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 9000
inet 10.131.70.12 netmask 255.255.254.0 broadcast 10.131.71.255
inet6 fe80::ba3f:d2ff:fe51:eca0 prefixlen 64 scopeid 0x20<link>
ether b8:3f:d2:51:ec:a0 txqueuelen 1000 (以太网)
RX packets 15911615 bytes 3472724855 (3.4 GB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 8908442 bytes 2594423502 (2.5 GB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
brainpf1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 9000
inet 10.131.72.11 netmask 255.255.254.0 broadcast 10.131.73.255
inet6 fe80::ba3f:d2ff:fe51:f75c prefixlen 64 scopeid 0x20<link>
ether b8:3f:d2:51:f7:5c txqueuelen 1000 (以太网)
RX packets 15907074 bytes 3472450827 (3.4 GB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 8907226 bytes 2594265488 (2.5 GB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
brainpf2: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 9000
inet 10.131.74.11 netmask 255.255.254.0 broadcast 10.131.75.255
inet6 fe80::ba3f:d2ff:fe51:fd50 prefixlen 64 scopeid 0x20<link>
ether b8:3f:d2:51:fd:50 txqueuelen 1000 (以太网)
RX packets 18106452 bytes 4119626024 (4.1 GB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 11100077 bytes 3241021806 (3.2 GB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
brainpf3: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 9000
inet 10.131.76.55 netmask 255.255.254.0 broadcast 10.131.77.255
inet6 fe80::ba3f:d2ff:fe51:f7d8 prefixlen 64 scopeid 0x20<link>
ether b8:3f:d2:51:f7:d8 txqueuelen 1000 (以太网)
RX packets 17231529 bytes 3859409313 (3.8 GB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 10227404 bytes 2980944724 (2.9 GB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
brainvf0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 9000
ether f4:d1:fa:a8:e3:9e txqueuelen 1000 (以太网)
RX packets 6815937 bytes 942186162 (942.1 MB)
RX errors 0 dropped 2021773 overruns 0 frame 0
TX packets 396428 bytes 99233961 (99.2 MB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
brainvf1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 9000
ether c4:42:f9:ff:a2:5c txqueuelen 1000 (以太网)
RX packets 6812921 bytes 942061972 (942.0 MB)
RX errors 0 dropped 2021530 overruns 0 frame 0
TX packets 396206 bytes 99213701 (99.2 MB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
brainvf2: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 9000
ether 44:03:d2:95:14:bf txqueuelen 1000 (以太网)
RX packets 6818101 bytes 942386138 (942.3 MB)
RX errors 0 dropped 2021281 overruns 0 frame 0
TX packets 395870 bytes 99200745 (99.2 MB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
brainvf3: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 9000
ether 34:e9:11:5d:dd:c4 txqueuelen 1000 (以太网)
RX packets 6815508 bytes 942238665 (942.2 MB)
RX errors 0 dropped 2021015 overruns 0 frame 0
TX packets 396213 bytes 99212290 (99.2 MB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
e1e19ae7737d_h: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 9000
ether 0a:7c:e4:e4:9c:74 txqueuelen 1000 (以太网)
RX packets 1253797 bytes 983650649 (983.6 MB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 1485469 bytes 221404124 (221.4 MB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
ens20f0np0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 9050
inet 10.130.18.86 netmask 255.255.255.224 broadcast 10.130.18.95
inet6 fe80::eaeb:d3ff:fe55:b2b4 prefixlen 64 scopeid 0x20<link>
ether e8:eb:d3:55:b2:b4 txqueuelen 1000 (以太网)
RX packets 50004672435 bytes 339123544936425 (339.1 TB)
RX errors 0 dropped 62236 overruns 0 frame 0
TX packets 35496985439 bytes 189439234466451 (189.4 TB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
ens20f1np1: flags=4099<UP,BROADCAST,MULTICAST> mtu 1500
ether e8:eb:d3:55:b2:b5 txqueuelen 1000 (以太网)
RX packets 0 bytes 0 (0.0 B)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 0 bytes 0 (0.0 B)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
f545a49a2c0a_h: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 9000
ether ba:13:a7:3d:cb:69 txqueuelen 1000 (以太网)
RX packets 10567576 bytes 9825328934 (9.8 GB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 10725014 bytes 1173934024 (1.1 GB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
host0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 9000
inet 100.96.160.1 netmask 255.255.255.0 broadcast 100.96.160.255
ether 88:66:64:60:a0:01 txqueuelen 1000 (以太网)
RX packets 432530206 bytes 2631251242981 (2.6 TB)
RX errors 0 dropped 7 overruns 0 frame 0
TX packets 2424984 bytes 1656608891 (1.6 GB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
kube-ipvs0: flags=130<BROADCAST,NOARP> mtu 1500
inet 10.69.152.183 netmask 255.255.255.255 broadcast 0.0.0.0
ether e6:f9:2b:e5:d8:b9 txqueuelen 0 (以太网)
RX packets 0 bytes 0 (0.0 B)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 0 bytes 0 (0.0 B)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
lo: flags=73<UP,LOOPBACK,RUNNING> mtu 65536
inet 127.0.0.1 netmask 255.0.0.0
loop txqueuelen 1000 (本地环回)
RX packets 51464151060 bytes 49276175203724 (49.2 TB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 51464151060 bytes 49276175203724 (49.2 TB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
nodelocaldns: flags=130<BROADCAST,NOARP> mtu 1500
inet 169.254.20.10 netmask 255.255.255.255 broadcast 169.254.20.10
ether 66:b5:07:64:af:4b txqueuelen 0 (以太网)
RX packets 0 bytes 0 (0.0 B)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 0 bytes 0 (0.0 B)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
ovs-system: flags=4098<BROADCAST,MULTICAST> mtu 1500
ether 7e:7c:ff:c3:ba:31 txqueuelen 1000 (以太网)
RX packets 0 bytes 0 (0.0 B)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 0 bytes 0 (0.0 B)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
tap-all: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 9000
ether 12:88:7b:61:83:ea txqueuelen 1000 (以太网)
RX packets 991 bytes 59160 (59.1 KB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 150648 bytes 38113944 (38.1 MB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
vxlan_sys_4789: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 65000
ether e2:19:83:b1:8a:c7 txqueuelen 1000 (以太网)
RX packets 530780089 bytes 811932561899 (811.9 GB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 818743220 bytes 3670147642243 (3.6 TB)
TX errors 1 dropped 0 overruns 0 carrier 0 collisions 0
root@gpu-a800-0061:/app#
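To pick out candidate values for NCCL_SOCKET_IFNAME from a dump like the one above, it helps to list only the interfaces that carry an IPv4 address. The sketch below assumes the net-tools `ifconfig` layout shown here, where each interface header line ends with `mtu N`. The function name `list_ipv4_ifaces` is hypothetical.

```shell
# Read `ifconfig -a` output on stdin and print "<name> <ipv4> <mtu>"
# for every interface that has an IPv4 address.
list_ipv4_ifaces() {
  awk '
    # Header line, e.g. "brainpf0: flags=...  mtu 9000"
    /^[^ \t]/ && /mtu/ { name = $1; sub(/:$/, "", name); mtu = $NF }
    # IPv4 line, e.g. "inet 10.131.70.12  netmask ..." (inet6 is skipped)
    $1 == "inet"       { print name, $2, mtu }
  '
}
```

Run it as `ifconfig -a | list_ipv4_ifaces`. On the output above it would keep brainpf0 to brainpf3, ens20f0np0, host0, kube-ipvs0, lo and nodelocaldns, and drop the bridge, VF and veth-style interfaces that have no IPv4 address.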
Author: Dong
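Copyright notice: Unless otherwise stated, all articles on this blog are licensed under CC BY-NC (Creative Commons Attribution-NonCommercial 4.0 International). You may freely repost and modify them for non-commercial purposes, provided you credit the source and link to the original author.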