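Last updated: 2016-06-05
Introduction
Virtually every current hard disk ships with a built-in self-monitoring feature called S.M.A.R.T.
The full name is Self-Monitoring, Analysis and Reporting Technology
Its reports are generally trustworthy: according to official figures (2008), it predicts at least about 70% of failures.
The sections below show how to use it.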
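Contents
- Installing the tool
- Viewing basic drive information and checking whether SMART is enabled
- The simplest health check
- Running a test immediately (Self-test)
- Viewing test results
- Viewing all health attributes of the drive
- Offline Scan
- SCT Error Recovery Control(-l scterc)
- Compile from source
- Update smartctl's HDD DB
- Tips
- See also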
Learning more
man 8 smartctl
Installing the tool
Debian
apt-get install smartmontools -y
CentOS
yum install smartmontools -y
On Windows:
Download the win32 build from sourceforge.net
Viewing basic drive information and checking whether SMART is enabled
Basic information
smartctl -i /dev/hda
If the output shows "SMART support is: Disabled", SMART has to be enabled first.
=== START OF INFORMATION SECTION ===
Device Model: ST500DM002-1BC142
Serial Number: S2A38N69
Firmware Version: JC4B
User Capacity: 500,107,862,016 bytes
Device is: Not in smartctl database [for details use: -P showall]
ATA Version is: 8
ATA Standard is: ATA-8-ACS revision 4
Local Time is: Wed May 2 01:32:34 2012 HKT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
Enabling SMART on the drive
smartctl -s on /dev/hda
-s VALUE, --smart=VALUE
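The check described above ("if it says Disabled, turn it on") can be scripted. The following is only a minimal sketch: smart_support_state is a hypothetical helper name, and it assumes the "SMART support is:" lines look like the sample output shown earlier.

```shell
# Reads `smartctl -i` output on stdin and prints "enabled", "disabled" or
# "unknown", based on the "SMART support is:" lines.
# smart_support_state is a hypothetical helper, not part of smartmontools.
smart_support_state() {
    awk '
        /^SMART support is:[ \t]*Enabled/  { state = "enabled" }
        /^SMART support is:[ \t]*Disabled/ { state = "disabled" }
        END { print (state == "" ? "unknown" : state) }
    '
}

# Example usage (needs root and a real device):
#   if [ "$(smartctl -i /dev/hda | smart_support_state)" = "disabled" ]; then
#       smartctl -s on /dev/hda
#   fi
```

The "Available - device has SMART capability." line is ignored on purpose, since only the Enabled/Disabled line says whether anything needs to be done.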
Saving SMART data
# SMART data is preserved even after the hard disk loses power
-S VALUE, --saveauto=VALUE
Example:
smartctl -S on /dev/sdb
=== START OF ENABLE/DISABLE COMMANDS SECTION ===
SMART Attribute Autosave Enabled.
smartctl -a /dev/sdb
SMART capabilities: (0x0003) Saves SMART data before entering power-saving mode.
                             Supports SMART auto save timer.
The simplest health check
smartctl -H /dev/hda
The report gives only one of two answers: PASSED or FAILED.
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
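Because the answer is a plain PASSED or FAILED, it is easy to use in a script. This is a rough sketch only: smart_health_ok is a hypothetical helper name, and it assumes the ATA-style result line shown in the sample above (other device types may word it differently).

```shell
# Reads `smartctl -H` output on stdin.
# Returns 0 if the overall result line says PASSED, non-zero otherwise.
# smart_health_ok is a hypothetical helper, not part of smartmontools.
smart_health_ok() {
    grep -q 'overall-health self-assessment test result: PASSED'
}

# Example usage (needs root and a real device):
#   if smartctl -H /dev/hda | smart_health_ok; then
#       echo "health check passed"
#   else
#       echo "health check did NOT pass, look at smartctl -a"
#   fi
```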
Running a test immediately (Self-test)
Executes TEST immediately
-t <type>
* Running a self-test can degrade the performance of the drive.
Test type:
offline # 1. No entry is placed in the selftest log.
# 2. The effects of this test are visible only in that it updates the SMART offline Attribute values
short # you can hear the hard disk working right away (checks the electrical and mechanical performance)
long # equivalent to the Offline Extended self-test
conveyance # intended to identify damage incurred during transport of the device
select,M-N # to test a range of disk LBAs
pending,N
afterselect,on afterselect,off
Example:
smartctl -t short /dev/sde
Testing has begun.
Please wait 2 minutes for test to complete.
Test will complete after Sat Dec 18 17:10:35 2010
Use smartctl -X to abort test.
Testing a USB disk
smartctl -d sat -t offline /dev/sdd
=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART off-line routine immediately in off-line mode".
Drive command "Execute SMART off-line routine immediately in off-line mode" successful.
Testing has begun.
Please wait 426 seconds for test to complete.
Test will complete after Sat May 6 01:06:43 2017
Use smartctl -X to abort test.
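The "Please wait ..." line is given in minutes for one test and in seconds for another, so a script that wants to sleep until the test is done has to normalise it. A small sketch, assuming the message wording shown in the two samples above; test_wait_seconds is a hypothetical helper name.

```shell
# Reads the output of `smartctl -t <type>` on stdin and prints the
# announced waiting time in seconds.
# Handles both "Please wait N minutes" and "Please wait N seconds".
# test_wait_seconds is a hypothetical helper, not part of smartmontools.
test_wait_seconds() {
    awk '
        /Please wait [0-9]+ (seconds|minutes)/ {
            for (i = 1; i < NF; i++)
                if ($i == "wait") { n = $(i + 1); unit = $(i + 2) }
            print (unit == "minutes" ? n * 60 : n)
            exit
        }
    '
}

# Example usage (needs root and a real device):
#   secs=$(smartctl -t short /dev/sde | test_wait_seconds)
#   sleep "$secs"
#   smartctl -l selftest /dev/sde
```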
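Time required for each test
For an 80 GB / 7200 rpm drive:
- short takes about 2 min
- conveyance takes about 5 min
- long takes about 38 min
To abort a test while it is running
smartctl -X /dev/hda
Checking how long until the test finishes
If you are impatient and want to see how much longer the test will take, use the -c (--capabilities) option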
# -c, --capabilities (SMART commands)
smartctl -c /dev/hda
Self-test execution status: ( 247) Self-test routine in progress...
                                   70% of test remaining.
....................................
Short self-test routine recommended polling time: ( 1) minutes.
....................................
After a while:
Self-test execution status: ( 0) The previous self-test routine completed
                                 without error or no self-test has ever been run.
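To poll the progress from a script, the percentage can be picked out of the -c output. This is a sketch under the assumption that the progress line reads "NN% of test remaining." as in the sample; selftest_remaining is a hypothetical helper name.

```shell
# Reads `smartctl -c` output on stdin and prints the percentage of the
# self-test that is still remaining. Prints 0 when no "of test remaining"
# line is present (test finished, or never started).
# selftest_remaining is a hypothetical helper, not part of smartmontools.
selftest_remaining() {
    awk '
        /of test remaining/ {
            for (i = 1; i <= NF; i++)
                if ($i ~ /^[0-9]+%$/) {
                    sub(/%/, "", $i)
                    print $i
                    found = 1
                    exit
                }
        }
        END { if (!found) print 0 }
    '
}

# Example usage (needs root and a real device):
#   echo "$(smartctl -c /dev/hda | selftest_remaining)% left"
```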
Capabilities list (shows which tests are supported)
capabilities: (0x7b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new command.
Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
If the '-c' option shows that the device has the "Abort Offline collection upon new command" capability, then
most commands will abort the Immediate Offline Test, so you should not try to track the progress of the test with '-c', as doing so will abort the test.
Captive Mode
'-C'
Use in conjunction with the short, long, selective or conveyance tests to run the
self-test in captive mode (known as "foreground mode" for SCSI devices)
Viewing test results
-l TYPE # Prints the selected log (TYPE: error, selftest, selective, directory, ssd)
-l error => use this to view offline test results
Prints the summary SMART error log.
SMART disks maintain a log of the five most recent non-trivial errors.
For each error, the disk power-on lifetime at which it occurred is recorded.
-l selftest => use this to view "short" and "long" test results
The time at which each test took place is given in hours of disk lifetime.
If any errors were detected, the Logical Block Address (LBA) of the first error is printed in decimal notation.
-l ssd => prints the Solid State Media percentage used endurance indicator
(0 indicates as-new condition, while 100 indicates the device is at the end of its lifetime)
e.g.
smartctl -l error /dev/sdd
SMART Error Log Version: 1
No Errors Logged
smartctl -l selftest /dev/sda
# No test has been run yet. Output:
=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
No self-tests have been logged. [To run self-tests, use: smartctl -t]
# 1 is the most recent test, and the numbering continues with # 2, # 3 and so on. LifeTime = the drive's age in power-on hours when the test was run
=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%       959         -
# 2  Conveyance offline  Completed without error       00%       168         -
# 3  Short offline       Completed without error       00%       167         -
# 4  Short offline       Completed without error       00%       126         -
# 5  Extended offline    Aborted by host               90%       400         -
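When the log gets long, it helps to print only the entries that did not finish cleanly. A minimal sketch, assuming log rows start with "# N" and clean rows contain "Completed without error" as in the sample; selftest_problems is a hypothetical helper name. Note that a test that is still running would also be listed.

```shell
# Reads `smartctl -l selftest` output on stdin, prints every log entry
# whose status is not "Completed without error", and returns 1 if at
# least one such entry was found (0 otherwise).
# selftest_problems is a hypothetical helper, not part of smartmontools.
selftest_problems() {
    awk '
        /^# *[0-9]+ / && !/Completed without error/ { print; bad = 1 }
        END { exit bad }
    '
}

# Example usage (needs root and a real device):
#   smartctl -l selftest /dev/sda | selftest_problems || echo "check the entries above"
```

With the sample log above, only the "# 5 Extended offline Aborted by host" row would be printed.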
"long" test timeout
Symptom
smartctl -d sat -l selftest /dev/sdd
Num  Test_Description    Status                      Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended captive    Interrupted (host reset)      00%     17222         -
# 2  Extended offline    Interrupted (host reset)      00%     17222         -
The "Extended test" never finishes
Extended offline
The host may send a standby command to the drive after some time of I/O inactivity.
This also aborts any running self-test.
The self-test log then reports Aborted by host or Interrupted (host reset) as status.
This is typical for drives behind USB bridges.
Workaround:
while true ; do dd if=/dev/sda iflag=direct count=1 of=/dev/null ; sleep 30 ; done
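The loop above runs forever and has to be stopped by hand. One possible variation, shown only as a sketch, stops by itself once smartctl -c no longer reports a test in progress. keep_awake_during_selftest and the SMARTCTL_OPTS variable are hypothetical names; the status text "Self-test routine in progress" is assumed to match the -c sample shown earlier. Polling with -c is not safe for the Immediate Offline Test on drives that abort offline collection on a new command, so this is meant for short/long self-tests only.

```shell
# Keeps a drive busy while a self-test is running, so the host does not
# put it into standby. Stops as soon as `smartctl -c` no longer reports
# "Self-test routine in progress".
# Arguments: DEVICE [INTERVAL_SECONDS, default 30]
# SMARTCTL_OPTS (hypothetical variable) may carry extra options such as
# "-d sat" for USB bridges.
keep_awake_during_selftest() {
    dev=$1
    interval=${2:-30}
    while smartctl ${SMARTCTL_OPTS:-} -c "$dev" | grep -q 'Self-test routine in progress'; do
        # Read one block directly from the disk, bypassing the page cache.
        dd if="$dev" iflag=direct count=1 of=/dev/null 2>/dev/null
        sleep "$interval"
    done
}

# Example usage (needs root and a real device):
#   smartctl -d sat -t long /dev/sdd
#   SMARTCTL_OPTS="-d sat" keep_awake_during_selftest /dev/sdd 30
```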
Extended captive
"Captive" means the test runs in exclusive mode and the drive responds to no commands until the test has completed.
Unfortunately, staying busy for that long can cause various symptoms.
For example, a drive that is a RAID member may be expelled from the array.
e.g. Linux kernel log:
[ 2510.698368] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[ 2510.698376] ata1.00: failed command: SMART
[ 2510.698382] ata1.00: cmd b0/d4:00:82:4f:c2/00:00:00:00:00/00 tag 24
         res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[ 2510.698385] ata1.00: status: { DRDY }
[ 2510.698391] ata1: hard resetting link
[ 2511.003326] ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[ 2511.024026] ACPI Error: [DSSP] Namespace lookup failure, AE_NOT_FOUND (20150930/psargs-359)
[ 2511.024035] ACPI Error: Method parse/execution failed [\_SB.PCI0.SAT0.SPT0._GTF] (Node ffff8802938d4a00), AE_NOT_FOUND (20150930/psparse-542)
[ 2511.025051] ACPI Error: [DSSP] Namespace lookup failure, AE_NOT_FOUND (20150930/psargs-359)
[ 2511.025060] ACPI Error: Method parse/execution failed [\_SB.PCI0.SAT0.SPT0._GTF] (Node ffff8802938d4a00), AE_NOT_FOUND (20150930/psparse-542)
[ 2511.025471] ata1.00: configured for UDMA/133
[ 2511.025506] ata1: EH complete
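Viewing all health attributes of the drive
-A, --attributes
The data is kept in the form of RAW_VALUE and is recorded in the drive's SA
In general, an attribute is considered safe as long as VALUE is well above THRESH
These attributes help us understand different aspects of the drive's health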
- "VALUE" - means the current value
- "WORST" - tells you what worst value SMART has ever assigned to this attribute
- "THRESH" - indicates the value at/below which SMART considers the attribute a failure
e.g.
smartctl -A /dev/hda
=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   111   099   006    Pre-fail  Always       -       35476009
  3 Spin_Up_Time            0x0003   100   100   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       15
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   075   060   030    Pre-fail  Always       -       31189422
  9 Power_On_Hours          0x0032   099   099   000    Old_age   Always       -       1637
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       15
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0
189 High_Fly_Writes         0x003a   097   097   000    Old_age   Always       -       3
190 Airflow_Temperature_Cel 0x0022   059   056   045    Old_age   Always       -       41 (Lifetime Min/Max 37/42)
194 Temperature_Celsius     0x0022   041   044   000    Old_age   Always       -       41 (0 20 0 0)
195 Hardware_ECC_Recovered  0x001a   042   023   000    Old_age   Always       -       35476009
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       66060891981450
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       2183356688
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       2225147360
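The "VALUE well above THRESH" rule can be checked mechanically by computing the margin for every attribute. This is a sketch only, assuming the ten-column layout shown in the sample (ID#, name, flag, VALUE, WORST, THRESH, ...); smart_attr_margin is a hypothetical helper name. Attributes with THRESH 000 are never flagged, since the normalised value cannot meaningfully fail against a zero threshold.

```shell
# Reads `smartctl -A` output on stdin and prints, for each attribute row:
#   ATTRIBUTE_NAME  (VALUE - THRESH)  ok|FAIL
# A row is marked FAIL when THRESH is non-zero and VALUE <= THRESH.
# smart_attr_margin is a hypothetical helper, not part of smartmontools.
smart_attr_margin() {
    awk '
        $1 ~ /^[0-9]+$/ && NF >= 10 {
            value  = $4 + 0     # normalised current value, e.g. "075" -> 75
            thresh = $6 + 0     # failure threshold, e.g. "030" -> 30
            status = (thresh > 0 && value <= thresh) ? "FAIL" : "ok"
            printf "%s %d %s\n", $2, value - thresh, status
        }
    '
}

# Example usage (needs root and a real device):
#   smartctl -A /dev/hda | smart_attr_margin | sort -k2 -n | head
```

Sorting by the second column, as in the usage line, puts the attributes closest to their threshold at the top.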
More detail
-a, --all = -H -i -c -A -l error -l selftest -l selective (for ATA devices)
Offline Scan
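Description
Finally, there is the offline scan. It takes a long time to complete. Despite the name "offline scan", the disk remains usable while it runs
Roughly 7 hours for 1 TB
Some attributes, such as Multi_Zone_Error_Rate, are only collected during an offline scan, so running the offline test is necessary.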
Offline test
smartctl -t offline /dev/sda
=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART off-line routine immediately in off-line mode".
Drive command "Execute SMART off-line routine immediately in off-line mode" successful.
Testing has begun.
Please wait 584 seconds for test to complete.
Test will complete after Mon Nov 11 16:23:12 2019
Checking the offline-collection capabilities
smartctl -c /dev/sdc
* If the "-c" option shows that the device has the "Abort Offline collection upon new command" capability, then
most commands will abort the Immediate Offline Test,
so you should not try to track the progress of the test with '-c', as doing so will abort the test.
It is best to wait the full 584 seconds before checking the status
=== START OF READ SMART DATA SECTION ===
General SMART Values:
Offline data collection status: (0x00) Offline data collection activity was never started.
                                       Auto Offline Data Collection: Disabled.
Self-test execution status: ( 0) The previous self-test routine completed without error or no self-test has ever been run.
Total time to complete Offline data collection: ( 584) seconds.
Offline data collection capabilities: (0x73) SMART execute Offline immediate.
                                             Auto Offline data collection on/off support.
                                             Suspend Offline collection upon new command.
                                             No Offline surface scan supported.
                                             Self-test supported.
                                             Conveyance Self-test supported.
                                             Selective Self-test supported.
...
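The waiting time for the offline test can also be read from the -c output before starting it. A small sketch, assuming the "data collection: ( N) seconds." wording shown above (smartctl may print "Total time to complete Offline" on a separate line, which this tolerates); offline_collection_seconds is a hypothetical helper name.

```shell
# Reads `smartctl -c` output on stdin and prints the number of seconds
# the drive says it needs to complete offline data collection.
# Prints nothing if the line is not present.
# offline_collection_seconds is a hypothetical helper, not part of smartmontools.
offline_collection_seconds() {
    awk '
        /data collection:.*seconds/ {
            # Pick the number inside the first "( N)" group on this line.
            if (match($0, /\([ \t]*[0-9]+\)/)) {
                n = substr($0, RSTART + 1, RLENGTH - 2)
                gsub(/[ \t]/, "", n)
                print n
                exit
            }
        }
    '
}

# Example usage (needs root and a real device):
#   secs=$(smartctl -c /dev/sda | offline_collection_seconds)
#   smartctl -t offline /dev/sda && sleep "$secs"
```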
Viewing the offline scan result:
smartctl -l selftest /dev/sda
=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Conveyance offline  Completed without error       00%      4895         -
# 2  Conveyance offline  Completed without error       00%      1398         -
...
Automatic offline test
# We can also tell the drive to run an offline test every 4 hours (ATA only)
# Obsolete (However it is implemented and used by many vendors.)
# -o, --offlineauto
smartctl -o on /dev/hda
SMART Automatic Offline Testing Enabled every four hours.
Remark: Online vs. Offline Check
SMART provides three basic categories of testing.
Online: no effect on the performance of the device. (-s on)
Offline: degrade the device performance. (-o on)
Normally, the disk will suspend offline testing while disk accesses are taking place,
and then automatically resume it when the disk would otherwise be idle
Self-test: (-t type)
SCT Error Recovery Control(-l scterc)
Western Digital: Time-Limited Error Recovery (TLER)
Seagate: Error Recovery Control (ERC)
When using ZFS with mirror/raidz, it is recommended to set this to 0.5 seconds (5) or Disabled (0)
Checking
smartctl -l scterc /dev/sdd
SCT Error Recovery Control:
Read: Disabled
Write: Disabled
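The numbers in brackets above are in units of 100 ms, so 0.5 seconds is written as 5. smartctl takes them as "-l scterc,READTIME,WRITETIME". The helper below is only a convenience sketch for converting seconds to that form; scterc_arg is a hypothetical name, and whether a given drive accepts the value is up to the drive.

```shell
# Builds the argument for `smartctl -l` from timeouts given in seconds.
# Arguments: READ_SECONDS [WRITE_SECONDS, default = READ_SECONDS]
# The SCT ERC timers are expressed in tenths of a second; 0 disables ERC.
# scterc_arg is a hypothetical helper, not part of smartmontools.
scterc_arg() {
    awk -v r="$1" -v w="${2:-$1}" \
        'BEGIN { printf "scterc,%d,%d\n", r * 10 + 0.5, w * 10 + 0.5 }'
}

# Example usage (needs root and a real device):
#   smartctl -l "$(scterc_arg 0.5)" /dev/sdd    # read/write limit 0.5 s
#   smartctl -l "$(scterc_arg 0)" /dev/sdd      # disable
#   smartctl -l scterc /dev/sdd                 # verify
```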
Remark
SCT = SMART Command Transport(protocol)
scterc = prints values and descriptions of the SCT Error Recovery Control settings.
Compile from source
Download the latest version from the following URL
http://sourceforge.net/projects/smartmontools/files/
Download
wget http://sourceforge.net/projects/smartmontools/files/smartmontools/6.4/sm...
tar zxvf smartmontools-6.4.tar.gz
cd smartmontools-6.4
compile
./configure
make
Install
cp -a smartctl /usr/sbin
Usage
smartctl -A /dev/sdb
Disk SMART behind third-party RAID controllers
* smartmontools has to use vendor-specific pass-through I/O controls,
which provide direct access to each physical disk.
# Link
https://www.smartmontools.org/wiki/Supported_RAID-Controllers
# Device
# LSI 3ware SATA RAID controller
# LSI MegaRAID SAS RAID controller Dell PERC 5/i,6/i controller
# Intel ICHxR RAID(Intel Rapid/Matrix Storage)
# Adaptec SAS RAID controller(devices supported by aacraid driver)
# Areca SATA[/SAS] RAID controller
# HighPoint RocketRAID SATA RAID controller
# CCISS (HP/Compaq Smart Array Controller)
# Dell's PERC 6/i is really just a rebranded LSI MegaRAID controller.
# Check Smart Usage
smartctl -a -d megaraid,n /dev/sdx
-d TYPE # Specifies the type of the device
megaraid,N # The device consists of one or more SCSI/SAS disks connected to a MegaRAID controller. (0~127)
i.e.
smartctl -d megaraid,2 -a /dev/sda
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       932
  3 Spin_Up_Time            0x0027   144   143   021    Pre-fail  Always       -       3791
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       40
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   070   070   000    Old_age   Always       -       22273
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       38
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       37
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       2
194 Temperature_Celsius     0x0022   112   096   000    Old_age   Always       -       7711 (0 0 0 32)
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   194   194   000    Old_age   Offline      -       1361
Update smartctl's HDD DB
# current drive database
file /var/lib/smartmontools/drivedb/drivedb.h
/var/lib/smartmontools/drivedb/drivedb.h: ASCII text
update-smart-drivedb
/var/lib/smartmontools/drivedb/drivedb.h updated from branches/RELEASE_7_2_DRIVEDB
Tips
Using smartctl on usb disk
smartctl -d sat -s on /dev/sdb
Usage on Windows
# List the hard disks installed in the system
smartctl --scan
/dev/sda -d scsi # /dev/sda, SCSI device
/dev/sdb -d scsi # /dev/sdb, SCSI device
/dev/sdc -d scsi # /dev/sdc, SCSI device
/dev/sdd -d scsi # /dev/sdd, SCSI device
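The scan output already contains the device name and the -d type that smartctl detected, so it can drive a loop over every disk. A minimal sketch, assuming each line has the "DEVICE -d TYPE # comment" shape shown above; scan_to_args is a hypothetical helper name.

```shell
# Reads `smartctl --scan` output on stdin and prints one "DEVICE TYPE"
# pair per line, dropping the trailing "# ..." comment.
# scan_to_args is a hypothetical helper, not part of smartmontools.
scan_to_args() {
    awk '$2 == "-d" { print $1, $3 }'
}

# Example usage (needs administrator rights and real devices):
#   smartctl --scan | scan_to_args | while read -r dev type; do
#       echo "== $dev =="
#       smartctl -H -d "$type" "$dev"
#   done
```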
# View its information
smartctl -i /dev/sdb
=== START OF INFORMATION SECTION ===
Model Family: Western Digital Caviar Black
Device Model: WDC WD2002FAEX-007BA0
Serial Number: WD-WMAWP0493116
LU WWN Device Id: 5 0014ee 058bf4a6f
Firmware Version: 05.01D05
User Capacity: 2,000,398,934,016 bytes [2.00 TB]
Sector Size: 512 bytes logical/physical
Device is: In smartctl database [for details use: -P show]
ATA Version is: ATA8-ACS (minor revision not indicated)
SATA Version is: SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Tue May 16 15:57:23 2023 HKT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
See also