NVIDIA

Last updated: 2021-02-12

 


BIOS

 

The NVIDIA BIOS is encrypted, so no BIOS mod is available.

 


NVIDIA SLI (Scalable Link Interface)

 

 * Requires graphics cards of the same model

  • 3-way SLI
  • Quad SLI

 


Install NVIDIA

 

Check HW Installed

lspci | grep NVIDIA

01:00.0 VGA compatible controller: NVIDIA Corporation Device 1c03 (rev a1)
01:00.1 Audio device: NVIDIA Corporation Device 10f1 (rev a1)
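A small sketch for scripting the check above. It counts NVIDIA display controllers in lspci-style text read from stdin. The name count_nvidia_gpus is made up for this note, and also matching the "3D controller" class (reported by cards without display outputs) is an assumption that goes beyond the output shown here.

```shell
# Count NVIDIA display controllers in lspci-style output read from stdin.
# count_nvidia_gpus is a hypothetical helper name, not part of any NVIDIA tool.
count_nvidia_gpus() {
    # grep -c prints 0 and exits non-zero when nothing matches,
    # so the exit status is masked to keep "set -e" scripts alive.
    grep -E -c '(VGA compatible|3D) controller: NVIDIA' || true
}

# Usage: lspci | count_nvidia_gpus
```

The audio function of the card (01:00.1 above) is deliberately not counted.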

lshw -C display

  ...
  *-display
       description: VGA compatible controller
       product: NVIDIA Corporation
       vendor: NVIDIA Corporation
       physical id: 0
       bus info: pci@0000:01:00.0
       version: a1
       width: 64 bits
       clock: 33MHz
       capabilities: pm msi pciexpress vga_controller bus_master cap_list rom
       configuration: driver=nvidia latency=0
       resources: irq:42 memory:fd000000-fdffffff memory:c0000000-cfffffff memory:d0000000-d1ffffff ioport:e000(size=128) memory:fe000000-fe07ffff

Install Driver

apt-get install build-essential

# The Ubuntu package nvidia-384 does not include nvidia-settings or OpenCL, so the installer is downloaded manually

wget http://us.download.nvidia.com/XFree86/Linux-x86_64/384.111/NVIDIA-Linux-...

chmod 700 NVIDIA-Linux-x86_64-384.111.run

./NVIDIA-Linux-x86_64-384.111.run

P.S.

If you install the latest kernel on Ubuntu (linux-image-4.15.0-999-generic_4.15.0-999.201801292100_amd64.deb), you also need to install the following packages:

 - linux-headers-4.15.0-999_4.15.0-999.201801292100_all.deb

 - linux-headers-4.15.0-999-generic_4.15.0-999.201801292100_amd64.deb

Check Driver Version

# These device nodes should appear after installation; card0 is probably the onboard display

ls -1 /dev/dri/card*

# Version

cat /proc/driver/nvidia/version

NVRM version: NVIDIA UNIX x86_64 Kernel Module  384.111  Tue Dec 19 23:51:45 PST 2017
GCC version:  gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.5)
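To use the version in a script, the NVRM line can be parsed. This is a minimal sketch that assumes the line layout shown above ("... Kernel Module  <version>  <date>"); other driver builds may word the line differently. nvrm_version is a hypothetical helper name.

```shell
# Print the driver version from /proc/driver/nvidia/version-style text on stdin.
# Assumes the "NVRM version: ... Kernel Module  <version> ..." layout shown above.
nvrm_version() {
    sed -n 's/^NVRM version:.*Kernel Module  *\([0-9][0-9.]*\).*/\1/p'
}

# Usage: nvrm_version < /proc/driver/nvidia/version
```

Nothing is printed when the line does not match, so an empty result means "driver not loaded or unknown format".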

Error msg

ERROR: The Nouveau kernel driver is currently in use by your system.  
This driver is incompatible with the NVIDIA driver, and must be disabled before proceeding. 

cat /etc/modprobe.d/nvidia-installer-disable-nouveau.conf

blacklist nouveau
options nouveau modeset=0

update-initramfs -u
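After a reboot it is worth confirming that nouveau is really gone before re-running the installer. A minimal sketch, assuming the usual /proc/modules layout where the module name is the first field; module_loaded is a hypothetical helper name.

```shell
# Succeed (exit 0) if the named module appears in /proc/modules-style text on stdin.
# $1 = module name, e.g. nouveau
module_loaded() {
    awk -v m="$1" '$1 == m { found = 1 } END { exit !found }'
}

# Usage:
#   if module_loaded nouveau < /proc/modules; then
#       echo "nouveau is still loaded; check the blacklist and the initramfs"
#   fi
```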

Or

kernel parameter

# for Nvidia graphics

nomodeset nouveau.modeset=0

nomodeset

Instructs the kernel not to load video drivers and to use BIOS modes instead until X is loaded.

nouveau.modeset=0

nouveau is the free/libre driver project for NVIDIA cards.

xxx.modeset=0 disables kernel mode setting for driver xxx, so nouveau.modeset=0 disables it for nouveau.
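Whether the parameters actually reached the running kernel can be checked from /proc/cmdline. A minimal sketch; cmdline_has is a hypothetical helper name and it only matches a whole, space-separated parameter.

```shell
# Succeed (exit 0) if the kernel command line on stdin contains parameter $1 exactly.
cmdline_has() {
    # One parameter per line, then a fixed-string whole-line match.
    tr ' ' '\n' | grep -qxF -- "$1"
}

# Usage:
#   cmdline_has nouveau.modeset=0 < /proc/cmdline && echo "nouveau KMS disabled"
```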

 


Driver with Cuda Version

 

# Check current cuda version

nvidia-smi

List

NVIDIA-Linux-x86_64-455.45.01       # CUDA Version: 11.1

 


nvidia-settings

 

After lightdm and Xorg have started successfully:

# List GPU unit in X

DISPLAY=:0 XAUTHORITY=/var/run/lightdm/root/:0 nvidia-settings -c :0 -q gpus

1 GPU on home:0

    [0] home:0[gpu:0] (GeForce GTX 1060 6GB)

      Has the following names:
        GPU-0
        GPU-x-x-x-x-x

# Check which perf level the GPU is currently in:

DISPLAY=:0 XAUTHORITY=/var/run/lightdm/root/:0 nvidia-settings -q '[gpu:0]/GPUCurrentPerfLevel'

  Attribute 'GPUCurrentPerfLevel' (home:0[gpu:0]): 2.
    'GPUCurrentPerfLevel' is an integer attribute.
    'GPUCurrentPerfLevel' is a read-only attribute.
    'GPUCurrentPerfLevel' can use the following target types: X Screen, GPU.

# Check how fast the GPU can run (available perf modes)

DISPLAY=:0 XAUTHORITY=/var/run/lightdm/root/:0 nvidia-settings -q '[gpu:0]/GPUPerfModes'

  Attribute 'GPUPerfModes' (home:0[gpu:0]): perf=0, nvclock=139, nvclockmin=139, nvclockmax=607, nvclockeditable=1, memclock=405, memclockmin=405,
  memclockmax=405, memclockeditable=1, memTransferRate=810, memTransferRatemin=810, memTransferRatemax=810, memTransferRateeditable=1 ; perf=1, nvclock=139,
  nvclockmin=139, nvclockmax=1911, nvclockeditable=1, memclock=810, memclockmin=810, memclockmax=810, memclockeditable=1, memTransferRate=1620,
  memTransferRatemin=1620, memTransferRatemax=1620, memTransferRateeditable=1 ; perf=2, nvclock=202, nvclockmin=202, nvclockmax=1974, nvclockeditable=1,
  memclock=3802, memclockmin=3802, memclockmax=3802, memclockeditable=1, memTransferRate=7604, memTransferRatemin=7604, memTransferRatemax=7604,
  memTransferRateeditable=1 ; perf=3, nvclock=202, nvclockmin=202, nvclockmax=1974, nvclockeditable=1, memclock=4004, memclockmin=4004, memclockmax=4004,
  memclockeditable=1, memTransferRate=8008, memTransferRatemin=8008, memTransferRatemax=8008, memTransferRateeditable=1
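The GPUPerfModes string is hard to read as one long line. The sketch below reduces it to one line per perf level with only the maximum clocks. It expects just the value part ("perf=0, nvclock=..., ... ; perf=1, ...") on stdin; getting that with the terse flag (nvidia-settings -t -q ...) is an assumption about how the query would be run, and perf_modes_summary is a hypothetical helper name.

```shell
# Summarise a GPUPerfModes value string read from stdin:
# one line per perf level, showing nvclockmax and memclockmax only.
perf_modes_summary() {
    tr ';' '\n' | awk -F', *' '
    {
        perf = ""; nv = ""; mem = ""
        for (i = 1; i <= NF; i++) {
            f = $i
            gsub(/^[ \t]+|[ \t]+$/, "", f)   # trim blanks around key=value
            split(f, kv, "=")
            if (kv[1] == "perf") perf = kv[2]
            else if (kv[1] == "nvclockmax") nv = kv[2]
            else if (kv[1] == "memclockmax") mem = kv[2]
        }
        if (perf != "")
            printf "perf=%s nvclockmax=%s memclockmax=%s\n", perf, nv, mem
    }'
}
```

Lines without a perf= key (for example blank lines) are skipped.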

 


OC

 

Fan

# Check

DISPLAY=:0 XAUTHORITY=/var/run/lightdm/root/:0 \
 nvidia-settings -c :0 -q 'GPUTargetFanSpeed'

-c                            CTRL-DISPLAY    # Control the specified X display.

# Set speed to 70%

DISPLAY=:0 XAUTHORITY=/var/run/lightdm/root/:0 \
 nvidia-settings -c :0 \
 -a 'GPUFanControlState=1' \
 -a 'GPUTargetFanSpeed=70'

# Controlling the fans when there are multiple GPUs

#!/bin/bash

DISPLAY=:0 XAUTHORITY=/var/run/lightdm/root/:0 \
        nvidia-settings -c :0 \
        -a '[gpu:0]/GPUFanControlState=1' \
        -a '[fan:0]/GPUTargetFanSpeed=70'

DISPLAY=:0 XAUTHORITY=/var/run/lightdm/root/:0 \
        nvidia-settings -c :0 \
        -a '[gpu:1]/GPUFanControlState=1' \
        -a '[fan:1]/GPUTargetFanSpeed=60'
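With many cards the per-GPU blocks above get repetitive. The sketch below only prints the -a assignments for N GPUs so they can be reviewed first (a dry run). It assumes fan index i belongs to GPU index i, as in the script above, which may not hold for cards with more than one fan. fan_args is a hypothetical helper name.

```shell
# Print the nvidia-settings assignments for GPUs 0..N-1, one GPU per line.
# $1 = number of GPUs, $2 = target fan speed in percent
# Assumption: [fan:i] is the fan of [gpu:i].
fan_args() {
    local count=$1 speed=$2 i=0
    while [ "$i" -lt "$count" ]; do
        printf -- "-a [gpu:%d]/GPUFanControlState=1 -a [fan:%d]/GPUTargetFanSpeed=%d\n" \
            "$i" "$i" "$speed"
        i=$((i + 1))
    done
}

# Usage: fan_args 4 70
```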

GPUGraphicsClockOffset & GPUMemoryTransferRateOffset

# You cannot overclock without PowerMizer in Xorg, and that requires Coolbits.

# There is no way to overclock an NVIDIA GPU without X; nvidia-settings is currently the only way to overclock 10xx cards.

nvidia-smi -ac MEM_CLOCK,GRAPHICS_CLOCK   <-- does not work

Powermizer with coolbits

# Use nvidia-xconfig to generate a minimal xorg.conf

# It sets Coolbits in the Xorg configuration, which is needed to enable overclocking

nvidia-xconfig -a --enable-all-gpus \
  --allow-empty-initial-configuration \
  --cool-bits=31 \
  --use-display-device="DFP-0" \
  --connected-monitor="DFP-0"

.........
Section "Device"
    Identifier     "Device0"
    Driver         "nvidia"
    VendorName     "NVIDIA Corporation"
    BoardName      "GeForce GTX 1060 6GB"
    BusID          "PCI:1:0:0"
EndSection

Section "Screen"
    Identifier     "Screen0"
    Device         "Device0"
    Monitor        "Monitor0"
    DefaultDepth    24
    Option         "AllowEmptyInitialConfiguration" "True"
    Option         "UseDisplayDevice" "DFP-0"
    Option         "Coolbits" "31"
    Option         "ConnectedMonitor" "DFP-0"
    SubSection     "Display"
        Depth       24
    EndSubSection
EndSection

/etc/init.d/lightdm restart
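Coolbits is a bit mask, and 31 simply turns on the five lowest bits. The sketch below decodes a value into its bits. The bit descriptions are paraphrased from memory of the NVIDIA driver README and should be checked against the README of the installed driver version; decode_coolbits is a hypothetical helper name.

```shell
# Print which Coolbits bits are set in the given value.
# Descriptions are paraphrased; verify against the driver README.
decode_coolbits() {
    local value=$1
    if [ $((value & 1)) -ne 0 ];  then echo "1: overclocking of older (pre-Fermi) GPUs"; fi
    if [ $((value & 2)) -ne 0 ];  then echo "2: attempt SLI with mismatched video memory"; fi
    if [ $((value & 4)) -ne 0 ];  then echo "4: manual fan speed control"; fi
    if [ $((value & 8)) -ne 0 ];  then echo "8: clock offsets on the PowerMizer page"; fi
    if [ $((value & 16)) -ne 0 ]; then echo "16: overvoltage via nvidia-settings"; fi
    return 0
}

# Usage: decode_coolbits 31
```

For fan control plus clock offsets only, 4 + 8 = 12 would be the smaller value.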

Checking

nvidia-smi

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     11964      G   /usr/lib/xorg/Xorg                            12MiB |
|    1     11964      G   /usr/lib/xorg/Xorg                             6MiB |
|    2     11964      G   /usr/lib/xorg/Xorg                             6MiB |
|    3     11964      G   /usr/lib/xorg/Xorg                             6MiB |
+-----------------------------------------------------------------------------+

PowerMizer

 * values are offsets

# check

DISPLAY=:0 XAUTHORITY=/var/run/lightdm/root/:0 \
 nvidia-settings -c :0 -q 'GPUCurrentClockFreqsString'

# set

#!/bin/bash

# The attributes are quoted so the shell does not treat [..] as a glob pattern
DISPLAY=:0 XAUTHORITY=/var/run/lightdm/root/:0 \
        nvidia-settings -a '[gpu:0]/GPUPowerMizerMode=1' \
        -a '[gpu:0]/GPUGraphicsClockOffset[3]=50' \
        -a '[gpu:0]/GPUMemoryTransferRateOffset[3]=400'

Be wary of very high overclocks: they are often reached only because error correction is masking errors, and performance drops off as a result.

To find the right amount, raise the offset in small steps and benchmark at each step until performance drops.

When it does, back off one step; that is your actual overclock.
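The step-and-benchmark procedure can be sketched as a loop. This is only an outline under assumptions: the benchmark is a command or function you supply that applies the given offset, runs a workload, and prints an integer score (higher is better). Neither find_best_offset nor the benchmark callback exists in any NVIDIA tool.

```shell
# Raise the offset step by step and stop at the first step that does not improve the score.
# $1 = start offset, $2 = step, $3 = max offset,
# $4 = name of a command/function that prints an integer score for an offset (hypothetical)
find_best_offset() {
    local offset=$1 step=$2 max=$3 bench=$4
    local best_offset=$offset best_score score
    best_score=$("$bench" "$offset")
    offset=$((offset + step))
    while [ "$offset" -le "$max" ]; do
        score=$("$bench" "$offset")
        if [ "$score" -le "$best_score" ]; then
            break                      # performance dropped or stalled: back off
        fi
        best_offset=$offset
        best_score=$score
        offset=$((offset + step))
    done
    echo "$best_offset"
}

# Usage: find_best_offset 0 50 400 my_benchmark
```

A real benchmark callback should also detect crashes and artifacts, which a score alone does not show.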

Remark

... \
-a '[gpu:0]/GPUGraphicsClockOffsetAllPerformanceLevels=1100' \
-a '[gpu:0]/GPUMemoryTransferRateOffsetAllPerformanceLevels=-400'

 


Card Hang

 

[215304.057110] Process accounting resumed
[240674.872126] NVRM: GPU at PCI:0000:05:00: GPU-6c4480a9-?-?-?-?
[240674.872133] NVRM: GPU Board Serial Number:
[240674.872138] NVRM: Xid (PCI:0000:05:00): 31, Ch 00000013, engmask 00000101, intr 10000000
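To spot hangs across many cards, the Xid lines can be pulled out of the kernel log. A minimal sketch that assumes the "NVRM: Xid (PCI:<address>): <number>, ..." layout shown above; extract_xid is a hypothetical helper name. The meaning of each Xid number has to be looked up in NVIDIA's Xid documentation.

```shell
# Print "<pci-address> <xid-number>" for every NVRM Xid line on stdin.
extract_xid() {
    sed -n 's/.*NVRM: Xid (PCI:\([^)]*\)): \([0-9][0-9]*\),.*/\1 \2/p'
}

# Usage: dmesg | extract_xid | sort | uniq -c
```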

 

 

 


OC technology

 

GPU Boost requires no configuration. It only increases the core clock, not the memory clock.

GPU Boost 1           HW: GTX 680 ....

GPU Boost 2           HW: GTX 700 ....

GPU Boost 3           HW: GTX 1080

The frequency offset is no longer fixed; each voltage point has its own frequency offset.
 

 

 


nvidia-smi

 

# -i, --id=ID      Display data for a single specified GPU or Unit.

# --query-gpu must be used together with --format

nvidia-smi -i 0 --format=csv --query-gpu=power.limit

power.limit [W]
120.00 W

# -lms ms, --loop-ms=ms: Same as -l,--loop but in milliseconds.

nvidia-smi -i 0 -lms 500 --format=csv,noheader --query-gpu=temperature.gpu

37
37
37
...

Other "--query-gpu" opt (--help-query-gpu)

  • index                         # Zero based index of the GPU. Can change at each boot.
  • count                         # The number of NVIDIA GPUs in the system.
  • pstate                        # P0 (maximum performance) to P12 (minimum performance)
  • memory.used
  • utilization.memory
  • timestamp                 # YYYY/MM/DD HH:MM:SS.msec
  • pci.bus_id                  # PCI bus id as "domain:bus:device.function" in hex. ( 00000000:07:00.0 )
  • pci.bus                       # PCI bus number, in hex. (0x07)
  • pci.device                  # PCI device number, in hex. (0x00)
  • serial                         # [Not Supported]

 

nvidia-smi --format=csv,noheader --query-gpu=index,pci.bus_id

Useful checking

nvidia-smi -l 1 --format=csv,noheader --query-gpu=index,utilization.gpu,temperature.gpu,fan.speed,power.draw

0, 100 %, 68, 67 %, 92.75 W
1, 100 %, 63, 48 %, 90.93 W
2, 100 %, 66, 60 %, 92.56 W
3, 100 %, 63, 50 %, 91.15 W
4, 100 %, 67, 64 %, 91.69 W
5, 100 %, 59, 35 %, 89.09 W
6, 100 %, 66, 58 %, 92.77 W
7, 100 %, 61, 42 %, 89.66 W
8, 100 %, 63, 51 %, 91.53 W
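For a rig with many cards, a one-line summary of that output is handy. The sketch below reads the CSV lines in exactly the field order queried above (index, utilization.gpu, temperature.gpu, fan.speed, power.draw) and reports the hottest GPU and the total power draw. gpu_summary is a hypothetical helper name.

```shell
# Summarise "index, util, temp, fan, power" CSV lines read from stdin.
gpu_summary() {
    awk -F', *' '
        NR == 1 || $3 + 0 > max { max = $3 + 0; idx = $1 }   # track the hottest GPU
        { total += $5 + 0 }                                  # "92.75 W" + 0 -> 92.75
        END {
            if (NR > 0)
                printf "hottest=%s temp=%d total_power=%.2f\n", idx, max, total
        }'
}

# Usage (single sample, without -l):
#   nvidia-smi --format=csv,noheader \
#     --query-gpu=index,utilization.gpu,temperature.gpu,fan.speed,power.draw | gpu_summary
```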

nvidia-smi -q -d PERFORMANCE

Timestamp                           : Mon Jan 29 21:53:05 2018
Driver Version                      : 384.111

Attached GPUs                       : 1
GPU 00000000:01:00.0
    Performance State               : P2
    Clocks Throttle Reasons
        Idle                        : Not Active
        Applications Clocks Setting : Not Active
        SW Power Cap                : Not Active
        HW Slowdown                 : Not Active
        Sync Boost                  : Not Active
        SW Thermal Slowdown         : Not Active

nvidia-smi -q -d CLOCK

     Clocks
        Graphics                    : 139 MHz
        SM                          : 139 MHz
        Memory                      : 405 MHz
        Video                       : 544 MHz
...
   Max Clocks
        Graphics                    : 1974 MHz
        SM                          : 1974 MHz
        Memory                      : 4004 MHz
        Video                       : 1708 MHz
...

nvidia-smi -q -d SUPPORTED_CLOCKS

Timestamp                           : Tue Jan 30 19:29:25 2018
Driver Version                      : 390.25

Attached GPUs                       : 1
GPU 00000000:01:00.0
    Supported Clocks                : N/A

# monitoring of detail stats such as power

nvidia-smi stats -i 0 -d pwrDraw   

Viewing PCIe information

# The current PCI-E link width. These may be reduced when the GPU is not in use.

nvidia-smi -i 0 --format=csv,noheader --query-gpu=pcie.link.width.current

16

# 1 = 1x
# 16 = 16x

# The current PCI-E link generation. These may be reduced when the GPU is not in use.

nvidia-smi -i 0 --format=csv,noheader --query-gpu="pcie.link.gen.current"

2
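The two numbers together give the theoretical one-way link bandwidth. A rough sketch using the commonly quoted per-lane figures after encoding overhead (250, 500, about 985 and about 1969 MB/s for generations 1 to 4); real throughput is lower. pcie_bandwidth_mb is a hypothetical helper name.

```shell
# Print the approximate theoretical one-way PCIe bandwidth in MB/s.
# $1 = link generation, $2 = link width (number of lanes)
pcie_bandwidth_mb() {
    local gen=$1 width=$2 per_lane
    case "$gen" in
        1) per_lane=250 ;;    # 2.5 GT/s, 8b/10b encoding
        2) per_lane=500 ;;    # 5 GT/s, 8b/10b encoding
        3) per_lane=985 ;;    # 8 GT/s, 128b/130b encoding (rounded)
        4) per_lane=1969 ;;   # 16 GT/s, 128b/130b encoding (rounded)
        *) echo "unknown PCIe generation: $gen" >&2; return 1 ;;
    esac
    echo $((per_lane * width))
}

# Usage: pcie_bandwidth_mb 2 16
```

For the generation 2, x16 link shown above this works out to about 8000 MB/s.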

For other PCIe information, see

https://datahunter.org/pcie

 

Setting

# Set Power Limit

nvidia-smi -pl 130

Power limit for GPU 00000000:01:00.0 was set to 130.00 W from 120.00 W.
All done.
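nvidia-smi rejects a limit outside the range the card allows, so a script may want to clamp the requested value first. A minimal sketch: the range is assumed to come from the power.min_limit and power.max_limit query fields with --format=csv,noheader,nounits, and clamp_power_limit is a hypothetical helper name.

```shell
# Clamp a requested power limit (watts) into [min, max] and print it as an integer.
# $1 = requested, $2 = min limit, $3 = max limit (decimals such as 60.00 are accepted)
clamp_power_limit() {
    awk -v want="$1" -v lo="$2" -v hi="$3" 'BEGIN {
        if (want < lo) want = lo
        if (want > hi) want = hi
        printf "%d\n", want
    }'
}

# Usage: nvidia-smi -i 0 -pl "$(clamp_power_limit 130 60.00 140.00)"
```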

# GPU Accounting

#  keep track of usage of resources throughout lifespan of a single process.
#  0|DISABLED or 1|ENABLED

-am, --accounting-mode=MODE

# Enable persistence mode for the GPU

# Clock and power settings are reset between program runs unless persistence mode (PM) is enabled for the driver.
# Persistence mode keeps the NVIDIA driver loaded even when no applications are accessing the cards.

nvidia-smi -pm 1

 


Adding a display card

 

nvidia-xconfig --enable-all-gpus

nvidia-xconfig --cool-bits=31

This creates the following in /etc/X11/xorg.conf:

Section "Device"
    Identifier     "Device0"
    Driver         "nvidia"
    VendorName     "NVIDIA Corporation"
    BoardName      "GeForce GTX 1060 6GB"
    BusID          "PCI:1:0:0"
EndSection

Section "Device"
    Identifier     "Device1"
    Driver         "nvidia"
    VendorName     "NVIDIA Corporation"
    BoardName      "GeForce GTX 1070"
    BusID          "PCI:2:0:0"
EndSection

Section "Screen"
    Identifier     "Screen0"
    Device         "Device0"
    Monitor        "Monitor0"
    DefaultDepth    24
    Option         "Coolbits" "31"
    SubSection     "Display"
        Depth       24
    EndSubSection
EndSection

Section "Screen"
    Identifier     "Screen1"
    Device         "Device1"
    Monitor        "Monitor1"
    DefaultDepth    24
    Option         "Coolbits" "31"
    SubSection     "Display"
        Depth       24
    EndSubSection
EndSection
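When a Device section has to be written by hand, note that lspci prints the bus address in hexadecimal while the BusID in xorg.conf is decimal. The two only look the same for small numbers such as 01:00.0. A minimal conversion sketch; lspci_to_busid is a hypothetical helper name and it expects the short bus:device.function form without a PCI domain.

```shell
# Convert an lspci address (hex, e.g. 0a:00.0) to an xorg.conf BusID (decimal).
lspci_to_busid() {
    local addr=$1 bus dev fn
    bus=${addr%%:*}                 # text before the first ':'
    dev=${addr#*:}; dev=${dev%%.*}  # text between ':' and '.'
    fn=${addr##*.}                  # text after the last '.'
    printf 'PCI:%d:%d:%d\n' "0x$bus" "0x$dev" "0x$fn"
}

# Usage: lspci_to_busid 01:00.0
```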

Additional settings for /etc/X11/xorg.conf

Section "Screen"
    Option         "ConnectedMonitor" "DFP-0"
    Option         "AllowEmptyInitialConfiguration" "True"
    Option         "Interactive" "False"
...
EndSection

Remark

Option "ConnectedMonitor" "string"

# string: "CRT" (cathode ray tube), "DFP" (digital flat panel), or "TV" (television);

Allows you to override what the NVIDIA kernel module detects is connected to your graphics card.

Option "Interactive" "boolean"

# Default: on

This option controls the behavior of the driver's watchdog,
which attempts to detect and terminate GPU programs that get stuck,
in order to ensure that the GPU remains available for other processes.

Option "AllowEmptyInitialConfiguration" "boolean"

# Default: off.

Normally, the NVIDIA X driver will fail to start if it cannot find any display devices connected to the NVIDIA GPU.

This option overrides that behavior so that the X server will start anyway, even if no display devices are connected.

/etc/init.d/lightdm restart

ps aux | grep X

 


 
