1. Overview

This article records the steps for installing the GPU driver.

2. NVIDIA driver installation

2.1 Check the GPU driver

    # Install aplay; the ubuntu-drivers command calls it
    sudo apt install alsa-utils
    sudo ubuntu-drivers devices

Output:

    udevadm hwdb is deprecated. Use systemd-hwdb instead.
    (the same warning is printed many more times)
    ERROR:root:aplay command not found
    == /sys/devices/pci0000:00/0000:00:0f.0 ==
    modalias : pci:v000015ADd00000405sv000015ADsd00000405bc03sc00i00
    vendor   : VMware
    model    : SVGA II Adapter
    driver   : open-vm-tools-desktop - distro free

    == /sys/devices/pci0000:00/0000:00:15.0/0000:03:00.0 ==
    modalias : pci:v000010DEd00002204sv00001458sd0000403Bbc03sc00i00
    vendor   : NVIDIA Corporation
    model    : GA102 [GeForce RTX 3090]
    manual_install: True
    driver   : nvidia-driver-570-server - distro non-free
    driver   : nvidia-driver-470-server - distro non-free
    driver   : nvidia-driver-580-open - distro non-free
    driver   : nvidia-driver-580 - distro non-free
    driver   : nvidia-driver-570-open - distro non-free
    driver   : nvidia-driver-470 - distro non-free
    driver   : nvidia-driver-590-server-open - distro non-free
    driver   : nvidia-driver-535-server-open - distro non-free
    driver   : nvidia-driver-535-open - distro non-free
    driver   : nvidia-driver-580-server - distro non-free
    driver   : nvidia-driver-580-server-open - distro non-free
    driver   : nvidia-driver-570-server-open - distro non-free
    driver   : nvidia-driver-570 - distro non-free
    driver   : nvidia-driver-535 - distro non-free
    driver   : nvidia-driver-590-server - distro non-free
    driver   : nvidia-driver-535-server - distro non-free
    driver   : nvidia-driver-590 - distro non-free
    driver   : nvidia-driver-590-open - distro non-free recommended
    driver   : xserver-xorg-video-nouveau - distro free builtin

2.2 Choose the driver with the -open suffix

The output above marks this driver as recommended: nvidia-driver-590-open - distro non-free recommended

    sudo apt install nvidia-driver-590-open

2.3 Reboot after installation

    reboot

3. Installing nvidia-fabricmanager (optional)

nvidia-fabricmanager is software dedicated to managing multiple NVIDIA GPUs interconnected through NVLink/NVSwitch. On a single-GPU machine, installing it causes error messages when the server starts.

3.1 Check the driver version

    nvidia-smi

Output:

    root@ubuntu:~# nvidia-smi
    Tue Jan 20 14:43:07 2026
    +-----------------------------------------------------------------------------------------+
    | NVIDIA-SMI 590.48.01              Driver Version: 590.48.01      CUDA Version: 13.1     |
    +-----------------------------------------+------------------------+----------------------+
    | GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
    |                                         |                        |               MIG M. |
    |=========================================+========================+======================|
    |   0  NVIDIA GeForce RTX 3090        Off |   00000000:03:00.0 Off |                  N/A |
    | 30%   26C    P8             10W /  350W |       4MiB /  24576MiB |      0%      Default |
    |                                         |                        |                  N/A |
    +-----------------------------------------+------------------------+----------------------+

    +-----------------------------------------------------------------------------------------+
    | Processes:                                                                              |
    |  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
    |        ID   ID                                                               Usage      |
    |=========================================================================================|
    |  No running processes found                                                             |
    +-----------------------------------------------------------------------------------------+

3.2 Download the fabricmanager package

Download it from the official NVIDIA download site.

1. Download the fabricmanager package whose version matches the driver version exactly. Here that is 590.48.01.

- nvidia-fabricmanager_*.deb: the main package, required to run the service.
- nvidia-fabricmanager-dev_*.deb: the development package (header files and so on), needed only if you compile software against this component. Ignore it otherwise.

    # Example
    export DRIVER_VERSION=590.48.01
    wget https://developer.download.nvidia.cn/compute/cuda/repos/ubuntu2404/x86_64/nvidia-fabricmanager-$(echo $DRIVER_VERSION | awk -F. '{print $1}')_${DRIVER_VERSION}-1_amd64.deb
    wget https://developer.download.nvidia.cn/compute/cuda/repos/ubuntu2404/x86_64/nvidia-fabricmanager_${DRIVER_VERSION}-0ubuntu1_amd64.deb

3.3 Install fabricmanager

    # Example
    dpkg -i nvidia-fabricmanager-$(echo $DRIVER_VERSION | awk -F. '{print $1}')_${DRIVER_VERSION}-1_amd64.deb

Output:

    $ dpkg -i nvidia-fabricmanager_590.48.01-0ubuntu1_amd64.deb
    Selecting previously unselected package nvidia-fabricmanager.
    (Reading database ... 147187 files and directories currently installed.)
    Preparing to unpack nvidia-fabricmanager_590.48.01-0ubuntu1_amd64.deb ...
    Unpacking nvidia-fabricmanager (590.48.01-0ubuntu1) ...
    Setting up nvidia-fabricmanager (590.48.01-0ubuntu1) ...
    Created symlink /etc/systemd/system/multi-user.target.wants/nvidia-fabricmanager.service → /usr/lib/systemd/system/nvidia-fabricmanager.service.
    Could not execute systemctl:  at /usr/bin/deb-systemd-invoke line 148.
    Processing triggers for libc-bin (2.39-0ubuntu8.6) ...

Check that the service is running normally:

    systemctl status nvidia-fabricmanager

Output:

    ● nvidia-fabricmanager.service - NVIDIA fabric manager service
         Loaded: loaded (/lib/systemd/system/nvidia-fabricmanager.service; enabled; vendor preset: enabled)
         Active: active (running) since Sun 2025-10-19 13:47:06 CST; 3 months 1 day ago
       Main PID: 3020 (nv-fabricmanage)
          Tasks: 19 (limit: 629145)
         Memory: 32.0M
            CPU: 1h 25min 41.831s
         CGroup: /system.slice/nvidia-fabricmanager.service
                 └─3020 /usr/bin/nv-fabricmanager -c /usr/share/nvidia/nvswitch/fabricmanager.cfg

    10月 19 13:46:53 ubuntu22-172-027-003-002 systemd[1]: Starting NVIDIA fabric manager service...
    10月 19 13:46:53 ubuntu22-172-027-003-002 nvidia-fabricmanager-start.sh[2994]: Detected Pre-NVL5 system
    10月 19 13:46:55 ubuntu22-172-027-003-002 nv-fabricmanager[3020]: Connected to 1 node.
    10月 19 13:47:06 ubuntu22-172-027-003-002 nv-fabricmanager[3020]: Successfully configured all the available GPUs and NVSwitches to route NVL
    10月 19 13:47:06 ubuntu22-172-027-003-002 nvidia-fabricmanager-start.sh[2994]: Started Nvidia Fabric Manager
    10月 19 13:47:06 ubuntu22-172-027-003-002 systemd[1]: Started NVIDIA fabric manager service.

Check the installed Fabric Manager version:

    dpkg -l | grep nvidia-fabricmanager

Output:

    ii  nvidia-fabricmanager  590.48.01-0ubuntu1  amd64  Fabric Manager for NVSwitch based systems

Disable automatic upgrades of nvidia-fabricmanager. Replace 590 below with the version you installed:

    sudo apt-mark hold nvidia-fabricmanager-590

Output:

    nvidia-fabricmanager-590 set on hold.

List the held packages. If the package appears in the output, it is on hold:

    sudo apt-mark showhold

Output:

    nvidia-fabricmanager-590

4. Installing software

4.1 Installing the CUDA Toolkit and the NVIDIA Container Toolkit

- Developing or running any application that calls the GPU directly: requires the CUDA Toolkit.
- Using the GPU inside Docker containers: requires the NVIDIA Container Toolkit.
- Many AI development and scientific computing scenarios: require both.

4.2 About the CUDA Toolkit

The CUDA Toolkit is NVIDIA's complete software platform for developing and running GPU-accelerated applications. It includes a compiler, math libraries, debugging tools, and more. Specifically:

- CUDA driver (nvidia-driver): already installed as nvidia-driver-590-open.
- CUDA Runtime and development tools (the nvcc compiler, libraries such as cuBLAS): these make up the bulk of the CUDA Toolkit package.

If you need to compile or run C/Python programs that call the GPU directly (for example, building PyTorch/TensorFlow from source or running a CUDA C project), you must install the CUDA Toolkit.

4.3 Installing the CUDA Toolkit

Check the currently installed version, if one was installed before:

    nvcc -V

Output:

    (base) root@ubuntu:/public/software# nvcc -V
    nvcc: NVIDIA (R) Cuda compiler driver
    Copyright (c) 2005-2025 NVIDIA Corporation
    Built on Wed_Jan_15_19:20:09_PST_2025
    Cuda compilation tools, release 12.8, V12.8.61
    Build cuda_12.8.r12.8/compiler.35404655_0

Go to the official NVIDIA CUDA Toolkit download page (see also the official manual) and select Linux - x86_64 - Ubuntu - 24.04 - deb (network). Then follow the command-line instructions on that page exactly. The network installer configures the apt repository automatically and helps you install a version compatible with the system driver.

    wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/x86_64/cuda-keyring_1.1-1_all.deb
    sudo dpkg -i cuda-keyring_1.1-1_all.deb
    sudo apt-get update
    apt-cache search cuda-toolkit

Output:

    cuda-toolkit-12 - CUDA Toolkit 12 meta-package
    cuda-toolkit - CUDA Toolkit meta-package
    cuda-toolkit-12-5 - CUDA Toolkit 12.5 meta-package
    cuda-toolkit-12-5-config-common - Common config package for CUDA Toolkit 12.5.
    cuda-toolkit-12-config-common - Common config package for CUDA Toolkit 12.
    cuda-toolkit-config-common - Common config package for CUDA Toolkit.
    cuda-toolkit-12-6 - CUDA Toolkit 12.6 meta-package
    cuda-toolkit-12-6-config-common - Common config package for CUDA Toolkit 12.6.
    cuda-toolkit-12-8 - CUDA Toolkit 12.8 meta-package
    cuda-toolkit-12-8-config-common - Common config package for CUDA Toolkit 12.8.
    cuda-toolkit-12-9 - CUDA Toolkit 12.9 meta-package
    cuda-toolkit-12-9-config-common - Common config package for CUDA Toolkit 12.9.
    cuda-toolkit-13-0 - CUDA Toolkit 13.0 meta-package
    cuda-toolkit-13-0-config-common - Common config package for CUDA Toolkit 13.0.
    cuda-toolkit-13 - CUDA Toolkit 13 meta-package
    cuda-toolkit-13-config-common - Common config package for CUDA Toolkit 13.
    cuda-toolkit-13-1 - CUDA Toolkit 13.1 meta-package
    cuda-toolkit-13-1-config-common - Common config package for CUDA Toolkit 13.1.
    nvidia-cuda-toolkit - NVIDIA CUDA development toolkit
    nvidia-cuda-toolkit-doc - NVIDIA CUDA and OpenCL documentation
    nvidia-cuda-toolkit-gcc - NVIDIA CUDA development toolkit (GCC compatibility)

List the versions currently available for installation:

    apt list | grep -E 'cuda-toolkit-[0-9]{2}-[0-9]{1,2}'

Output:

    WARNING: apt does not have a stable CLI interface. Use with caution in scripts.

    cuda-toolkit-11-7-config-common/unknown 11.7.99-1 all
    cuda-toolkit-11-7/unknown 11.7.1-1 amd64
    cuda-toolkit-11-8-config-common/unknown 11.8.89-1 all
    cuda-toolkit-11-8/unknown 11.8.0-1 amd64
    cuda-toolkit-12-0-config-common/unknown 12.0.146-1 all
    cuda-toolkit-12-0/unknown 12.0.1-1 amd64
    cuda-toolkit-12-1-config-common/unknown 12.1.105-1 all
    cuda-toolkit-12-1/unknown 12.1.1-1 amd64
    cuda-toolkit-12-2-config-common/unknown 12.2.140-1 all
    cuda-toolkit-12-2/unknown 12.2.2-1 amd64
    cuda-toolkit-12-3-config-common/unknown 12.3.101-1 all
    cuda-toolkit-12-3/unknown 12.3.2-1 amd64
    cuda-toolkit-12-4-config-common/unknown 12.4.127-1 all
    cuda-toolkit-12-4/unknown 12.4.1-1 amd64
    cuda-toolkit-12-5-config-common/unknown 12.5.82-1 all
    cuda-toolkit-12-5/unknown 12.5.1-1 amd64
    cuda-toolkit-12-6-config-common/unknown 12.6.77-1 all
    cuda-toolkit-12-6/unknown 12.6.3-1 amd64
    cuda-toolkit-12-8-config-common/unknown 12.8.90-1 all
    cuda-toolkit-12-8/unknown 12.8.2-1 amd64
    cuda-toolkit-12-9-config-common/unknown 12.9.79-1 all
    cuda-toolkit-12-9/unknown 12.9.2-1 amd64
    cuda-toolkit-13-0-config-common/unknown 13.0.96-1 all
    cuda-toolkit-13-0/unknown 13.0.3-1 amd64
    cuda-toolkit-13-1-config-common/unknown 13.1.80-1 all
    cuda-toolkit-13-1/unknown 13.1.2-1 amd64
    cuda-toolkit-13-2-config-common/unknown 13.2.75-1 all
    cuda-toolkit-13-2/unknown 13.2.1-1 amd64
    cuda-toolkit-13-3-config-common/unknown,now 13.3.29-1 all [installed,auto-removable]
    cuda-toolkit-13-3/unknown 13.3.1-1 amd64

The CUDA version shown in the top-right corner of the nvidia-smi output is the highest version the current driver supports.

Note: installing the newest version is not automatically the right choice. Install the version that the default prebuilt vLLM packages target. That is currently 13.0 and will change as vLLM evolves.

    apt install cuda-toolkit-13-0 cuda-toolkit-13-0-config-common

Add the environment variables. Open /etc/profile:

    vim /etc/profile

Append these lines:

    export PATH=/usr/local/cuda-13.0/bin:$PATH
    export LD_LIBRARY_PATH=/usr/local/cuda-13.0/lib64:$LD_LIBRARY_PATH

Reload the profile and verify:

    source /etc/profile
    nvcc -V

Output:

    (vllm-env) root@xunku:/public/vLLM# nvcc -V
    nvcc: NVIDIA (R) Cuda compiler driver
    Copyright (c) 2005-2025 NVIDIA Corporation
    Built on Wed_Aug_20_01:58:59_PM_PDT_2025
    Cuda compilation tools, release 13.0, V13.0.88
    Build cuda_13.0.r13.0/compiler.36424714_0

Hold the driver and toolkit packages so they are not upgraded automatically:

    sudo apt-mark hold nvidia-driver-590-open cuda-toolkit-13-0 cuda-toolkit-13-0-config-common

4.4 About Docker GPU support

Docker GPU support (the NVIDIA Container Toolkit) is a set of tools that lets Docker containers access and use the NVIDIA GPU of the host. In essence it creates a compatibility layer that maps the host's GPU driver into the container. The main component is the nvidia-container-toolkit package, which modifies the Docker configuration.

If you need to run any GPU-dependent image inside a Docker container (for example, docker run --rm --gpus all nvidia/cuda:13.0.1-runtime-ubuntu22.04 nvidia-smi, or the official PyTorch/TensorFlow Docker images), you must install this toolkit.

4.5 Installing Docker GPU support

See the official installation guide.

Install the prerequisite tools:

    sudo apt-get update
    sudo apt-get install -y --no-install-recommends curl gnupg2

Output:

    Hit:1 http://mirrors.aliyun.com/ubuntu noble InRelease
    Hit:2 http://mirrors.aliyun.com/ubuntu noble-updates InRelease
    Hit:3 http://mirrors.aliyun.com/ubuntu noble-security InRelease
    Hit:4 http://mirrors.aliyun.com/ubuntu noble-backports InRelease
    Hit:5 http://mirrors.aliyun.com/ubuntu noble-proposed InRelease
    Hit:6 https://developer.download.nvidia.cn/compute/cuda/repos/ubuntu2404/x86_64  InRelease
    Reading package lists... Done
    Reading package lists... Done
    Building dependency tree... Done
    Reading state information... Done
    curl is already the newest version (8.5.0-2ubuntu10.6).
    gnupg2 is already the newest version (2.4.4-2ubuntu17.4).
    The following packages were automatically installed and are no longer required:
      liburing2 mailcap plocate
    Use 'sudo apt autoremove' to remove them.
    0 upgraded, 0 newly installed, 0 to remove and 4 not upgraded.

Configure the repository:

    curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
      && curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
        sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
        sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

Update the package lists:

    sudo apt-get update

Install the toolkit packages:

    export NVIDIA_CONTAINER_TOOLKIT_VERSION=1.18.1-1
    sudo apt-get install -y \
        nvidia-container-toolkit=${NVIDIA_CONTAINER_TOOLKIT_VERSION} \
        nvidia-container-toolkit-base=${NVIDIA_CONTAINER_TOOLKIT_VERSION} \
        libnvidia-container-tools=${NVIDIA_CONTAINER_TOOLKIT_VERSION} \
        libnvidia-container1=${NVIDIA_CONTAINER_TOOLKIT_VERSION}

Configure the container runtime:

    sudo nvidia-ctk runtime configure --runtime=docker

Output:

    INFO[0000] Loading config from /etc/docker/daemon.json
    INFO[0000] Wrote updated config to /etc/docker/daemon.json
    INFO[0000] It is recommended that docker daemon be restarted.

Restart Docker:

    sudo systemctl restart docker

Run a test container. After a long search, this was the first image that could actually be downloaded:

    docker pull nvidia/cuda:13.0.1-runtime-ubuntu22.04

Output:

    13.0.1-runtime-ubuntu22.04: Pulling from nvidia/cuda
    60d98d907669: Pulling fs layer
    7e9d3b636b44: Pulling fs layer
    3a2ba8ed1759: Pulling fs layer
    5eaf68e8e556: Waiting
    c03b8ec8dd33: Waiting
    47140273f24a: Waiting
    f1e29f967bcf: Waiting
    48feaf8fd5bd: Waiting
    8006ce821e80: Waiting
    13.0.1-runtime-ubuntu22.04: Pulling from nvidia/cuda
    60d98d907669: Pull complete
    7e9d3b636b44: Pull complete
    3a2ba8ed1759: Pull complete
    5eaf68e8e556: Pull complete
    c03b8ec8dd33: Pull complete
    47140273f24a: Pull complete
    f1e29f967bcf: Pull complete
    48feaf8fd5bd: Pull complete
    8006ce821e80: Pull complete
    Digest: sha256:e4511e846c49e5495ef3d80c82b8f5dd597c6ef5c7f355601ead776ae3e96c67
    Status: Downloaded newer image for nvidia/cuda:13.0.1-runtime-ubuntu22.04
    docker.io/nvidia/cuda:13.0.1-runtime-ubuntu22.04

Then run nvidia-smi inside the container:

    docker run --rm --gpus all nvidia/cuda:13.0.1-runtime-ubuntu22.04 nvidia-smi

Output:

    ==========
    == CUDA ==
    ==========

    CUDA Version 13.0.1

    Container image Copyright (c) 2016-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

    This container image and its contents are governed by the NVIDIA Deep Learning Container License.
    By pulling and using the container, you accept the terms and conditions of this license:
    https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

    A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.

    Tue Jan 20 13:37:44 2026
    +-----------------------------------------------------------------------------------------+
    | NVIDIA-SMI 590.48.01              Driver Version: 590.48.01      CUDA Version: 13.1     |
    +-----------------------------------------+------------------------+----------------------+
    | GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
    |                                         |                        |               MIG M. |
    |=========================================+========================+======================|
    |   0  NVIDIA GeForce RTX 3090        Off |   00000000:03:00.0 Off |                  N/A |
    | 30%   26C    P8             10W /  350W |       4MiB /  24576MiB |      0%      Default |
    |                                         |                        |                  N/A |
    +-----------------------------------------+------------------------+----------------------+

    +-----------------------------------------------------------------------------------------+
    | Processes:                                                                              |
    |  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
    |        ID   ID                                                               Usage      |
    |=========================================================================================|
    |  No running processes found                                                             |
    +-----------------------------------------------------------------------------------------+
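The rule from section 4.3 (the toolkit version must not exceed the CUDA version that nvidia-smi reports for the driver) can be checked before running apt. The following is only a minimal sketch: the function names cuda_toolkit_supported and cuda_toolkit_package are hypothetical helpers, not part of any NVIDIA tool, and the script assumes plain dotted version numbers and GNU sort with the -V option.

```shell
#!/bin/sh
# Sketch only: both helper names below are made up for illustration.
# Assumption: versions are plain dotted numbers such as 13.0 or 13.1.

# Succeeds (exit status 0) when toolkit version $2 does not exceed the
# driver-supported maximum $1 (the value in the nvidia-smi header).
cuda_toolkit_supported() {
    # sort -V orders version strings; if the toolkit version sorts first
    # (or is equal), the driver can run it.
    [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# Maps a toolkit version such as 13.0 to the apt package naming scheme
# seen in the apt list output (cuda-toolkit-13-0).
cuda_toolkit_package() {
    printf 'cuda-toolkit-%s\n' "$(printf '%s' "$1" | tr . -)"
}

# Demo with the values from this article: the driver reports CUDA 13.1.
driver_max=13.1
for version in 13.0 13.3; do
    if cuda_toolkit_supported "$driver_max" "$version"; then
        echo "$version: supported, package $(cuda_toolkit_package "$version")"
    else
        echo "$version: newer than the driver supports"
    fi
done
```

The driver maximum is hard-coded here to keep the sketch independent of a GPU. On a real machine it would be read from the nvidia-smi header instead.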
Installing the GPU Environment
Published: 2026/7/4 20:54:56