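Hard disk health check (SMART) - smartmontools

Last updated: 2016-06-05

Introduction

Modern hard disks all ship with a built-in self-monitoring feature called S.M.A.R.T.

The full name is Self-Monitoring, Analysis and Reporting Technology.

Its reports are generally trustworthy: according to official figures (2008), the hit rate is at least about 70%.

The sections below show how to use it.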


Learning resources

man 8 smartctl

 


Installing the tool

 

Debian

apt-get install smartmontools -y

CentOS

yum install smartmontools -y

On Windows:

Download the win32 build from sourceforge.net

 


Viewing basic drive information and checking whether SMART is enabled

 

Basic information

smartctl -i /dev/hda

If the output shows "SMART support is: Disabled", SMART has to be enabled first.

=== START OF INFORMATION SECTION ===
Device Model:     ST500DM002-1BC142
Serial Number:    S2A38N69
Firmware Version: JC4B
User Capacity:    500,107,862,016 bytes
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   8
ATA Standard is:  ATA-8-ACS revision 4
Local Time is:    Wed May  2 01:32:34 2012 HKT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
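
The check above can be scripted. The following is a minimal sketch, assuming the ATA output format shown above. smart_enabled_state is a hypothetical helper name, not part of smartmontools. It reads the output of smartctl -i on standard input, so it can also be tried on saved output.

```shell
# Hypothetical helper: reads `smartctl -i` output on stdin and prints
# "enabled", "disabled", or "unknown" (when no matching line is found).
smart_enabled_state() {
    awk '
        /^SMART support is:[ \t]*Enabled/  { state = "enabled" }
        /^SMART support is:[ \t]*Disabled/ { state = "disabled" }
        END { if (state == "") state = "unknown"; print state }
    '
}
```

Usage: smartctl -i /dev/hda | smart_enabled_state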

 

Enabling SMART on the drive

smartctl -s on /dev/hda

-s VALUE, --smart=VALUE

 

Saving SMART data

# With autosave on, the SMART attribute data is kept across power cycles

-S VALUE, --saveauto=VALUE

Example:

smartctl -S on /dev/sdb

=== START OF ENABLE/DISABLE COMMANDS SECTION ===
SMART Attribute Autosave Enabled.

smartctl -a /dev/sdb

SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.

 

 


Quick health check

 

smartctl -H /dev/hda

The report has only two possible results: PASSED or FAILED.

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
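
For use in scripts, the result line can be pulled out of the report. This is a sketch only, assuming the ATA wording of the result line shown above. health_result and health_ok are hypothetical helper names.

```shell
# Hypothetical helper: reads `smartctl -H` output on stdin and prints the
# text after "test result:" (e.g. PASSED); prints nothing if the line is absent.
health_result() {
    sed -n 's/^SMART overall-health self-assessment test result: *//p'
}

# Hypothetical helper: succeeds (exit status 0) only when the result is PASSED.
health_ok() {
    [ "$(health_result)" = "PASSED" ]
}
```

Usage: smartctl -H /dev/hda | health_ok || echo "check this drive"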

 


Running a test immediately (Self-test)

 

Executes TEST immediately

-t <type>

 * Running a self-test can degrade the performance of the drive.

Test type:

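offline              # 1. No entry is placed in the self-test log.
                     # 2. The effects of this test are visible only in that it updates the SMART offline Attribute values.

short                # You can hear the drive start working right away (checks the electrical and mechanical performance)

long                 # Equivalent to the extended self-test (shown as "Extended offline" in the self-test log)

conveyance           # Intended to identify damage incurred during transport of the device

select,M-N           # Tests a range of disk LBAs

pending,N            # (ATA only) Sets the pending offline read scan timer to N minutes; used together with select

afterselect,on afterselect,off    # (ATA only) Whether to read-scan the rest of the disk after a selective self-test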

Example:

smartctl -t short /dev/sde

Testing has begun.
Please wait 2 minutes for test to complete.
Test will complete after Sat Dec 18 17:10:35 2010

Use smartctl -X to abort test.

Testing a USB disk

smartctl -d sat -t offline /dev/sdd

=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART off-line routine immediately in off-line mode".
Drive command "Execute SMART off-line routine immediately in off-line mode" successful.
Testing has begun.
Please wait 426 seconds for test to complete.
Test will complete after Sat May  6 01:06:43 2017
Use smartctl -X to abort test.

Time required for each test

For an 80 GB, 7200 rpm drive:

  • short takes about 2 min
  • conveyance takes about 5 min
  • long takes about 38 min

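Aborting a test in progress

smartctl -X /dev/hda

Checking how long until the test finishes

If you are impatient and want to see how much of the test remains, use the -c (--capabilities) option.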

# -c, --capabilities (SMART  commands)

smartctl -c /dev/hda

Self-test execution status:      ( 247) Self-test routine in progress...
                                        70% of test remaining.
....................................
Short self-test routine
recommended polling time:        (   1) minutes.
....................................

After a while:

Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
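
To read just the progress figure from that output, something like the following could be used. It is a sketch that assumes the "NN% of test remaining." wording shown above. selftest_remaining is a hypothetical helper name.

```shell
# Hypothetical helper: reads `smartctl -c` output on stdin and prints the
# percentage of the self-test still remaining. Prints nothing when no
# "% of test remaining." line is present (for example, no test in progress).
selftest_remaining() {
    sed -n 's/^[[:space:]]*\([0-9][0-9]*\)% of test remaining\..*$/\1/p'
}
```

Usage: smartctl -c /dev/hda | selftest_remaining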

Capabilities list (shows which tests are supported)

capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.

If the '-c' option shows that the device has the "Abort Offline collection upon new command" capability, then most commands will abort the Immediate Offline Test, so you should not try to track the progress of the test with '-c', as it will abort the test.

Captive Mode

-C            # Used in conjunction with -t short, long, select or conveyance:

              # runs the self-test in captive mode (known as "foreground mode" for SCSI devices)

 


Viewing test results

 

 

-l TYPE     # Prints the log of the given TYPE: error, selftest, selective, directory, ssd

-l error    => use this to see the results of an offline test

                    Prints the summary SMART error log.

                    SMART disks maintain a log of the five most recent non-trivial errors.

                    The disk power-on lifetime at which each error occurred is recorded.

-l selftest => use this to see the results of "short" and "long" tests

                    Each entry records the time at which the test took place, measured in hours of disk lifetime.

-l ssd      => prints the Solid State Media percentage used endurance indicator

                    (0 indicates as-new condition, while 100 indicates the device is at the end of its lifetime)

If any errors were detected, the Logical Block Address (LBA) of the first error is printed in decimal notation.

e.g.

smartctl -l error /dev/sdd

SMART Error Log Version: 1
No Errors Logged

smartctl -l selftest /dev/sda

# Output when no test has ever been run:

=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]

# Output after some tests: # 1 is the most recent test, then # 2, # 3 and so on; LifeTime = the drive's power-on hours when the test was run

=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%       959         -
# 2  Conveyance offline  Completed without error       00%       168         -
# 3  Short offline       Completed without error       00%       167         -
# 4  Short offline       Completed without error       00%       126         -
# 5  Extended offline    Aborted by host               90%       400         -
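
On a drive with a long history it can help to list only the tests that did not finish cleanly. This is a minimal sketch, assuming the log layout shown above. selftest_problems is a hypothetical helper name. Note that an entry such as "Aborted by host" means the test was stopped, not necessarily that the drive is faulty.

```shell
# Hypothetical helper: reads `smartctl -l selftest` output on stdin and
# prints every log entry (lines starting with "# N") whose status is
# anything other than "Completed without error".
selftest_problems() {
    awk '/^#[ \t]*[0-9]+[ \t]/ && !/Completed without error/'
}
```

Usage: smartctl -l selftest /dev/sda | selftest_problems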

 


"long" test timeout

 

Symptom

smartctl -d sat -l selftest /dev/sdd

Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended captive    Interrupted (host reset)      00%     17222         -
# 2  Extended offline    Interrupted (host reset)      00%     17222         -

The "Extended" test never finishes.

Extended offline

The host may send a standby command to the drive after some time of I/O inactivity.

This also aborts any running self-test.

The self-test log then reports Aborted by host or Interrupted (host reset) as status.

This is typical for drives behind USB bridges.

Workaround: read one block from the drive every 30 seconds so that it is never idle long enough to be put into standby (use the device under test, /dev/sdd in this example):

while true ; do dd if=/dev/sdd iflag=direct count=1 of=/dev/null ; sleep 30 ; done

Extended captive

"Captive" means the test runs in exclusive mode and the drive responds to no commands until the test has completed.

Unfortunately, staying busy for that long can cause many symptoms.

For example, a RAID member may be expelled from its array.

e.g. the kernel log on Linux:

[ 2510.698368] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[ 2510.698376] ata1.00: failed command: SMART
[ 2510.698382] ata1.00: cmd b0/d4:00:82:4f:c2/00:00:00:00:00/00 tag 24
                        res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[ 2510.698385] ata1.00: status: { DRDY }
[ 2510.698391] ata1: hard resetting link
[ 2511.003326] ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[ 2511.024026] ACPI Error: [DSSP] Namespace lookup failure, AE_NOT_FOUND (20150930/psargs-359)
[ 2511.024035] ACPI Error: Method parse/execution failed [\_SB.PCI0.SAT0.SPT0._GTF] (Node ffff8802938d4a00),
                           AE_NOT_FOUND (20150930/psparse-542)
[ 2511.025051] ACPI Error: [DSSP] Namespace lookup failure, AE_NOT_FOUND (20150930/psargs-359)
[ 2511.025060] ACPI Error: Method parse/execution failed [\_SB.PCI0.SAT0.SPT0._GTF] (Node ffff8802938d4a00),
                           AE_NOT_FOUND (20150930/psparse-542)
[ 2511.025471] ata1.00: configured for UDMA/133
[ 2511.025506] ata1: EH complete

 

 


Viewing all health attributes of the drive

 

-A, --attributes

The data is held as RAW_VALUE and is recorded in the SA of the HDD.

In general, an attribute is considered safe as long as VALUE is well above THRESH.

These attributes help us understand different aspects of the drive's health.

  • "VALUE"   - the current normalized value
  • "WORST"  - the worst value SMART has ever assigned to this attribute
  • "THRESH" - the value at or below which SMART considers the attribute a failure

e.g.

smartctl -A /dev/hda

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   111   099   006    Pre-fail  Always       -       35476009
  3 Spin_Up_Time            0x0003   100   100   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       15
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   075   060   030    Pre-fail  Always       -       31189422
  9 Power_On_Hours          0x0032   099   099   000    Old_age   Always       -       1637
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       15
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0
189 High_Fly_Writes         0x003a   097   097   000    Old_age   Always       -       3
190 Airflow_Temperature_Cel 0x0022   059   056   045    Old_age   Always       -       41 (Lifetime Min/Max 37/42)
194 Temperature_Celsius     0x0022   041   044   000    Old_age   Always       -       41 (0 20 0 0)
195 Hardware_ECC_Recovered  0x001a   042   023   000    Old_age   Always       -       35476009
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       66060891981450
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       2183356688
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       2225147360
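
The VALUE versus THRESH rule described above can be checked mechanically. The sketch below assumes the column layout shown in the table (VALUE in column 4, THRESH in column 6). attrs_at_threshold is a hypothetical helper name. Attributes with a THRESH of 000 are skipped, on the assumption that they are informational only.

```shell
# Hypothetical helper: reads `smartctl -A` output on stdin and prints
# "ID NAME VALUE THRESH" for each attribute whose VALUE is at or below a
# non-zero THRESH. Rows with THRESH 000 are skipped (assumed informational).
attrs_at_threshold() {
    awk '$1 ~ /^[0-9]+$/ && $6 + 0 > 0 && $4 + 0 <= $6 + 0 { print $1, $2, $4, $6 }'
}
```

Usage: smartctl -A /dev/hda | attrs_at_threshold

No output means no attribute has reached its threshold.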

More detail

-a, --all   # For ATA devices, equivalent to: -H -i -c -A -l error -l selftest -l selective

 


Offline Scan

 

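Overview

Finally, there is the offline scan. It takes a long time to complete. Despite the name "offline scan", the disk remains usable while it runs.

A 1 TB drive takes about 7 hours.

Some attributes, such as Multi_Zone_Error_Rate, are only collected during an offline scan, so running the offline test is necessary.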

Offline test

smartctl -t offline /dev/sda

=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART off-line routine immediately in off-line mode".
Drive command "Execute SMART off-line routine immediately in off-line mode" successful.
Testing has begun.
Please wait 584 seconds for test to complete.
Test will complete after Mon Nov 11 16:23:12 2019

Checking the offline test capabilities

smartctl -c /dev/sdc

 * If the "-c" option shows that the device has the "Abort Offline collection upon new command" capability, then most commands will abort the Immediate Offline Test, so you should not try to track the progress of the test with "-c", as it will abort the test.

    In that case, wait the full 584 seconds before checking the status.

=== START OF READ SMART DATA SECTION ===
General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (  584) seconds.
Offline data collection
capabilities:                    (0x73) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new command.
                                        No Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
...

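Checking the self-test log afterwards (as noted earlier, the offline test itself adds no entry here; its effect shows up in the attribute values from -A):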

smartctl -l selftest /dev/sda

=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Conveyance offline  Completed without error       00%      4895         -
# 2  Conveyance offline  Completed without error       00%      1398         -
...

Automatic offline test

# We can also tell the drive to run an offline test every 4 hours (ATA only)

# Obsolete (However it is implemented and used by many vendors.)

# -o, --offlineauto

smartctl -o on /dev/hda

SMART Automatic Offline Testing Enabled every four hours.

Remark: Online vs. Offline Check

SMART provides three basic categories of testing.

Online: has no effect on the performance of the device. (-s on)

Offline: may degrade the device performance. (-o on)

           Normally, the disk will suspend offline testing while disk accesses are taking place,

           and then automatically resume it when the disk would otherwise be idle.

Self-test: run on demand. (-t type)

 


SCT Error Recovery Control (-l scterc)

 

Western Digital: Time-Limited Error Recovery (TLER)

Seagate: Error Recovery Control (ERC)

When using ZFS with a mirror or raidz, the suggested setting is 0.5 seconds (value 5) or Disabled (value 0). The value is given in units of 100 ms.

Checking

smartctl -l scterc /dev/sdd

SCT Error Recovery Control:
           Read: Disabled
          Write: Disabled

Remark

SCT = SMART Command Transport (protocol)

scterc = prints values and descriptions of the SCT Error Recovery Control settings.
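
The timers are set with -l scterc,READTIME,WRITETIME, where both times are given in units of 100 ms. The small sketch below converts seconds into that unit. scterc_units is a hypothetical helper name, and the device name in the usage line is only an example.

```shell
# Hypothetical helper: converts a time in seconds (for example 0.5 or 7)
# into the 100 ms units expected by `-l scterc,READTIME,WRITETIME`,
# rounding to the nearest whole unit.
scterc_units() {
    awk -v s="$1" 'BEGIN { printf "%d\n", s * 10 + 0.5 }'
}
```

Usage: smartctl -l scterc,"$(scterc_units 0.5)","$(scterc_units 0.5)" /dev/sdd

This is the same as running smartctl -l scterc,5,5 /dev/sdd. Running smartctl -l scterc /dev/sdd afterwards shows whether the drive accepted the values.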

 

 


Compile from source

 

Get the latest version from the following URL:

http://sourceforge.net/projects/smartmontools/files/

Download

wget http://sourceforge.net/projects/smartmontools/files/smartmontools/6.4/sm...

tar zxvf smartmontools-6.4.tar.gz

cd smartmontools-6.4

Compile

./configure

make

Install

cp -a smartctl /usr/sbin

Usage

smartctl -A /dev/sdb

 


SMART for disks behind a third-party RAID controller

 

 *  To reach each physical disk directly, smartmontools has to use vendor-specific pass-through I/O controls.

# Link

https://www.smartmontools.org/wiki/Supported_RAID-Controllers

# Device

# LSI 3ware SATA RAID controller

# LSI MegaRAID SAS RAID controller, Dell PERC 5/i and 6/i controllers

# Intel ICHxR RAID (Intel Rapid/Matrix Storage)

# Adaptec SAS RAID controller (devices supported by the aacraid driver)

# Areca SATA[/SAS] RAID controller

# HighPoint RocketRAID SATA RAID controller

# CCISS (HP/Compaq Smart Array Controller)

# Dell's PERC 6/i is really just a rebranded LSI MegaRAID controller.

# Usage

smartctl -a -d megaraid,n /dev/sdx

-d TYPE           # Specifies the type of the device

megaraid,N       # The device consists of one or more SCSI/SAS disks connected to a MegaRAID controller. N (0 to 127) selects which disk on the controller is queried.

e.g.

smartctl -d megaraid,2 -a /dev/sda

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       932
  3 Spin_Up_Time            0x0027   144   143   021    Pre-fail  Always       -       3791
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       40
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   070   070   000    Old_age   Always       -       22273
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       38
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       37
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       2
194 Temperature_Celsius     0x0022   112   096   000    Old_age   Always       -       7711 (0 0 0 32)
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   194   194   000    Old_age   Offline      -       1361
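
When the N of each physical disk is not known, one option is to try a range of values. The sketch below only prints the commands instead of running them, so the list can be reviewed first. megaraid_probe_cmds is a hypothetical helper name, and the upper bound passed to it is up to the reader (the valid range is 0 to 127).

```shell
# Hypothetical helper: prints one `smartctl -H -d megaraid,N DEVICE` command
# for each N from 0 up to and including LAST. Nothing is executed.
# Arguments: $1 = device node (e.g. /dev/sda), $2 = LAST (highest N to try).
megaraid_probe_cmds() {
    mr_dev=$1
    mr_last=$2
    mr_n=0
    while [ "$mr_n" -le "$mr_last" ]; do
        printf 'smartctl -H -d megaraid,%d %s\n' "$mr_n" "$mr_dev"
        mr_n=$((mr_n + 1))
    done
}
```

Usage: megaraid_probe_cmds /dev/sda 7

To actually run the printed commands, pipe the output to sh.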

 


Updating smartctl's drive database

 

# current drive database

file /var/lib/smartmontools/drivedb/drivedb.h

/var/lib/smartmontools/drivedb/drivedb.h: ASCII text

update-smart-drivedb

/var/lib/smartmontools/drivedb/drivedb.h updated from branches/RELEASE_7_2_DRIVEDB

 


Tips

 

Using smartctl on a USB disk

smartctl -d sat -s on /dev/sdb

Usage on Windows

# List the hard disks installed in the system

smartctl --scan

/dev/sda -d scsi # /dev/sda, SCSI device
/dev/sdb -d scsi # /dev/sdb, SCSI device
/dev/sdc -d scsi # /dev/sdc, SCSI device
/dev/sdd -d scsi # /dev/sdd, SCSI device
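
The scan output is easy to feed into a loop. This is a minimal sketch, assuming the "DEVICE -d TYPE # comment" layout shown above. scan_devices is a hypothetical helper name.

```shell
# Hypothetical helper: reads `smartctl --scan` output on stdin and prints
# "DEVICE TYPE" for each line of the form "DEVICE -d TYPE # comment".
scan_devices() {
    awk '$2 == "-d" { print $1, $3 }'
}
```

Usage: smartctl --scan | scan_devices | while read -r dev type; do smartctl -H -d "$type" "$dev"; done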

# View a disk's information

smartctl -i /dev/sdb

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Caviar Black
Device Model:     WDC WD2002FAEX-007BA0
Serial Number:    WD-WMAWP0493116
LU WWN Device Id: 5 0014ee 058bf4a6f
Firmware Version: 05.01D05
User Capacity:    2,000,398,934,016 bytes [2.00 TB]
Sector Size:      512 bytes logical/physical
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Tue May 16 15:57:23 2023 HKT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

 

