Last updated: 2020-02-03
SAS disk health indicators
Relevant parameters
- Total uncorrected errors
- Elements in grown defect list
Disk on HW RAID
lspci | grep -i -E 'raid|adaptec'
01:00.0 RAID bus controller: Broadcom / LSI MegaRAID SAS-3 3008 [Fury] (rev 02)
smartctl --scan
/dev/bus/8 -d megaraid,0 # /dev/bus/8 [megaraid_disk_00], SCSI device
/dev/bus/8 -d megaraid,2 # /dev/bus/8 [megaraid_disk_02], SCSI device
smartctl -H -d megaraid,0 /dev/bus/8
=== START OF READ SMART DATA SECTION ===
SMART Health Status: FAILURE PREDICTION THRESHOLD EXCEEDED: ascq=0x5 [asc=5d, ascq=5]
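For scripted checks, the health line can be reduced to a pass/fail result. The following is a minimal sketch, assuming a healthy SAS drive reports exactly "SMART Health Status: OK"; the function names health_status and is_healthy are hypothetical.

```shell
#!/bin/sh
# Sketch only: health_status and is_healthy are hypothetical helper names.

# Print the text after "SMART Health Status:" from `smartctl -H` output on stdin.
health_status() {
    sed -n 's/^SMART Health Status:[[:space:]]*//p' | head -n 1
}

# Succeed only when the reported status is exactly "OK" (assumed healthy value).
is_healthy() {
    [ "$(health_status)" = "OK" ]
}
```

Possible usage: smartctl -H -d megaraid,0 /dev/bus/8 | is_healthy || echo "disk 0 needs attention"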
smartctl -A -d megaraid,0 /dev/bus/8
=== START OF READ SMART DATA SECTION ===
Current Drive Temperature:     44 C
Drive Trip Temperature:        68 C
Accumulated power on time, hours:minutes 11909:58
Elements in grown defect list: 1962
Normal: an empty grown defect list (a few drives may show up to about 5 entries).
If the number is not zero, monitor the defect list for some time to see whether it is still growing.
A steadily growing defect list is a strong sign that the drive will fail in the near future.
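Watching the defect list over time can be automated by saving the last count and comparing it on the next run. This is a rough sketch, not a finished monitoring tool; grown_defects, check_glist_growth and the state file path are hypothetical names chosen for illustration.

```shell
#!/bin/sh
# Sketch only: grown_defects and check_glist_growth are hypothetical helpers.

# Print the grown defect count found in `smartctl -A` output read from stdin.
grown_defects() {
    awk -F': *' '/Elements in grown defect list/ { print $2 + 0; exit }'
}

# Compare the current count ($2) with the count saved in the state file ($1),
# then save the current count for the next run.
check_glist_growth() {
    state_file=$1
    current=$2
    if [ ! -f "$state_file" ]; then
        echo "$current" > "$state_file"
        echo "BASELINE: $current"
        return 0
    fi
    previous=$(cat "$state_file")
    echo "$current" > "$state_file"
    if [ "$current" -gt "$previous" ]; then
        echo "GROWING: $previous -> $current"
    else
        echo "STABLE: $current"
    fi
}
```

Possible usage from a daily cron job (the state file location is an assumption): check_glist_growth /var/tmp/glist_megaraid0 "$(smartctl -A -d megaraid,0 /dev/bus/8 | grown_defects)"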
How to check
1. Quick overview
MegaCli64
MegaCli64 -PDlist -a0
MegaCli64 -PDinfo -PhysDrv[252:2] -a0
...
Media Error Count: 0
Other Error Count: 0
Predictive Failure Count: 0
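When the controller has many drives, it helps to print only the counters that are not zero. The sketch below assumes the MegaCli output also contains a "Slot Number: N" line for each drive; nonzero_pd_counters is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch only: nonzero_pd_counters is a hypothetical helper name.
# Reads `MegaCli64 -PDlist -a0` or `-PDinfo ...` output on stdin and prints
# every error counter that is not zero, tagged with the slot it belongs to.
nonzero_pd_counters() {
    awk -F': *' '
        /^Slot Number/ { slot = $2 + 0 }   # assumed to precede the counters
        /^(Media Error Count|Other Error Count|Predictive Failure Count)/ {
            if ($2 + 0 != 0) print "slot " slot ": " $1 " = " ($2 + 0)
        }'
}
```

Possible usage: MegaCli64 -PDlist -a0 | nonzero_pd_counters (no output means all three counters are zero on every drive).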
smartctl
smartctl --scan
...
/dev/bus/0 -d megaraid,16 # /dev/bus/0 [megaraid_disk_16], SCSI device
smartctl -d megaraid,16 -a /dev/bus/0
Remark: other useful smartctl options
- -l error # error counter log
- -l selftest # self-test log
- -x # extended output; includes the "Background scan results log"
smartctl -d megaraid,16 -t short /dev/bus/0
smartctl -d megaraid,16 -a /dev/bus/0
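The scan output can drive a loop over every disk behind the controller, so the device and megaraid number do not have to be typed by hand. This is a sketch under the assumption that each scan line has the form "device -d megaraid,N # ..." as shown above; megaraid_targets and health_all are hypothetical names.

```shell
#!/bin/sh
# Sketch only: megaraid_targets and health_all are hypothetical helper names.

# Print "<device> <type>" for each megaraid entry in `smartctl --scan` output (stdin).
megaraid_targets() {
    awk '$2 == "-d" && $3 ~ /^megaraid,[0-9]+$/ { print $1, $3 }'
}

# Run the overall health check on every megaraid disk found by the scan.
health_all() {
    smartctl --scan | megaraid_targets | while read -r dev type; do
        echo "== $dev $type"
        smartctl -H -d "$type" "$dev"
    done
}
```

Calling health_all prints one health section per physical disk; replacing -H with -a in the loop gives the full report instead.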
2. Detailed check
MegaCli64 -CfgLdAdd -r0[252:2] WT NORA -a0 # create a single-disk RAID 0 virtual drive (write-through, no read-ahead)
MegaCli64 -LDInfo -Lall -a0
smartctl -d megaraid,16 -t long /dev/bus/0
smartctl -d megaraid,16 -a /dev/bus/0
...
Self-test execution status:             89% of test remaining
SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                              number   (hours)
# 1  Background long   Self test in progress ...   64     NOW                 - [-   -    -]
# 2  Background short  Completed                   64   39758                 - [-   -    -]
Long (extended) Self-test duration: 5616 seconds [93.6 minutes]
LifeTime: corresponds to the drive's "number of hours powered up" at the time of the test
Vendor (Seagate Cache) information
Blocks sent to initiator = 2668630296 # does not increase during the self-test
Blocks received from initiator = 1675807082 # does not increase during the self-test
Blocks read from cache and sent to initiator = 4272401990
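Since the long self-test runs in the background for over an hour, its progress can be polled instead of rerunning smartctl by hand. The sketch below assumes the "Self-test execution status: N% of test remaining" line is only printed while a test is running; selftest_remaining and wait_for_selftest are hypothetical names.

```shell
#!/bin/sh
# Sketch only: selftest_remaining and wait_for_selftest are hypothetical helpers.

# Print the remaining percentage from `smartctl -a` output on stdin,
# or nothing when no "... of test remaining" line is present.
selftest_remaining() {
    sed -n 's/.*Self-test execution status:[[:space:]]*\([0-9][0-9]*\)% of test remaining.*/\1/p' | head -n 1
}

# Poll until the progress line disappears.
# $1 = device type (e.g. megaraid,16), $2 = device, $3 = poll interval in seconds.
wait_for_selftest() {
    while :; do
        left=$(smartctl -a -d "$1" "$2" | selftest_remaining)
        [ -n "$left" ] || break
        echo "$left% of test remaining"
        sleep "${3:-300}"
    done
}
```

Possible usage: wait_for_selftest megaraid,16 /dev/bus/0 600, followed by smartctl -d megaraid,16 -l selftest /dev/bus/0 to read the result.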
3. Wipe data
dmesg # identify the correct disk to wipe
dd if=/dev/zero of=/dev/sdX bs=32M oflag=direct # "Blocks received from initiator" keeps increasing while this runs
smartctl -d megaraid,16 -a /dev/bus/0
MegaCli64 -CfgLdDel -L1 -a0
Adapter 0: Deleted Virtual Drive-1(target id-1)
MegaCli64 -PDPrpRmv -PhysDrv[252:2] -a0
Prepare for removal Success
MegaCli64 -PDinfo -PhysDrv[252:2] -a0
...
Firmware state: Unconfigured(good), Spun down
Troubleshooting
smartctl -d megaraid,16 -a /dev/bus/0
...
SMART support is:     Unavailable - device lacks SMART capability.
=== START OF READ SMART DATA SECTION ===
Current Drive Temperature:     0 C
Drive Trip Temperature:        0 C
Error Counter logging not supported
Device does not support Self Test logging
Cause: the drive is in Firmware state: Unconfigured(good), Spun down
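To avoid misreading the all-zero SMART output, the firmware state can be checked first. This is a minimal sketch; pd_firmware_state and is_spun_down are hypothetical helper names, and the check only looks for the text "Spun down" in the state line.

```shell
#!/bin/sh
# Sketch only: pd_firmware_state and is_spun_down are hypothetical helpers.

# Print the firmware state from `MegaCli64 -PDinfo ...` output on stdin.
pd_firmware_state() {
    sed -n 's/^Firmware state:[[:space:]]*//p' | head -n 1
}

# Succeed when the given state string says the drive is spun down.
is_spun_down() {
    case "$1" in
        *"Spun down"*) return 0 ;;
        *) return 1 ;;
    esac
}
```

Possible usage: state=$(MegaCli64 -PDinfo -PhysDrv[252:2] -a0 | pd_firmware_state); is_spun_down "$state" && echo "SMART data will not be available: $state"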
Disk Error Info
# SAS Disk
smartctl -d megaraid,20 -l error /dev/bus/0 | less
Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:   593604873        1         0  593604874          1    1150264.607           0
write:          0        0         3          3          3     159035.025           0
verify: 2060722324        0         0  2060722324          0     212460.568           0
Non-medium error count:       57
Errors Corrected by ECC: fast (00h)
- Errors corrected without substantial delay (the correction did not postpone reading of later sectors).
- Two different blocks corrected during the same command are counted as two events.
Errors Corrected by ECC: delayed (01h)
- "With possible delay" means the correction took longer than one sector time, so reading or writing of subsequent sectors was delayed.
Errors corrected by rereads/rewrites (02h)
- This counts errors recovered, not the number of retries.
- If five retries were required to recover one block of data, the counter increments by one, not five.
- If an error cannot be recovered by retries and is recovered by ECC instead, it is not counted by this counter.
Correction algorithm invocations (04h)
- If a counter 02h type error is recovered after five attempts, five is added to this counter.
- If three retries are required to get a stable ECC syndrome before a counter 01h type error is corrected, those three retries are also counted here.
- The number of retries applied while unsuccessfully trying to recover an error (counter 06h type error) is also counted by this counter.
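Of all these columns, "Total uncorrected errors" (the last field of each row) is the one listed under the relevant parameters at the top. A small sketch for pulling it out of the error counter log follows; uncorrected_errors is a hypothetical helper name, and the parsing assumes the read/write/verify rows look like the sample above.

```shell
#!/bin/sh
# Sketch only: uncorrected_errors is a hypothetical helper name.
# Reads `smartctl -l error` output on stdin and prints, for the read, write
# and verify rows, the last column ("Total uncorrected errors").
uncorrected_errors() {
    awk '/^(read|write|verify):/ { sub(/:$/, "", $1); print $1 "=" $NF }'
}
```

Possible usage: smartctl -d megaraid,20 -l error /dev/bus/0 | uncorrected_errors. Any value other than 0 means data was lost or could not be read back, which is more serious than a large corrected count.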
Background Media Scan info
A background media scan reads the whole media; recoverable errors are acted on and unrecoverable errors are noted.
A recoverable error is one where the error correction codes (ECC) detect a problem but contain enough redundant information to fix it.
If a sector (block) is found with a recoverable error, it may be fixed with a rewrite "in place".
Alternatively, the disk may reassign the recovered data to another physical sector, which is given the same logical block address.
The original faulty sector is then unmapped and placed on the grown defect list (GLIST).
smartctl -d megaraid,20 -l background /dev/bus/0
=== START OF READ SMART DATA SECTION ===
Background scan results log
  Status: scan is active
    Accumulated power on time, hours:minutes 39686:25 [2381185 minutes]
    Number of background scans performed: 554,  scan progress: 47.28%
    Number of background medium scans performed: 554
   #  when        lba(hex)    [sk,asc,ascq]    reassign_status
   1     793:29  0000000001d44b4e  [1,17,1]   Recovered via rewrite in-place
   2    2162:12  00000000055af5de  [1,17,2]   Recovered via rewrite in-place
   3    2738:24  0000000001d4415b  [1,17,1]   Recovered via rewrite in-place
   4    2882:29  0000000002d5cba8  [1,17,1]   Recovered via rewrite in-place
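A long scan log is easier to judge when the entries are grouped by reassign status. The following sketch counts the entries per status; bms_summary is a hypothetical helper name, and the parsing assumes each entry line has the columns shown above (number, hours:minutes, LBA, [sk,asc,ascq], status text).

```shell
#!/bin/sh
# Sketch only: bms_summary is a hypothetical helper name.
# Reads `smartctl -l background` output on stdin and prints
# "<count><TAB><reassign_status>" for each distinct status.
bms_summary() {
    awk '$1 ~ /^[0-9]+$/ && $2 ~ /^[0-9]+:[0-9]+$/ && $4 ~ /^\[/ {
        status = $5
        for (i = 6; i <= NF; i++) status = status " " $i
        count[status]++
    }
    END { for (s in count) print count[s] "\t" s }' | sort
}
```

Possible usage: smartctl -d megaraid,20 -l background /dev/bus/0 | bms_summary. Entries recovered by an in-place rewrite did not consume a spare sector; statuses that mention reassignment are the ones that add to the grown defect list.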
Other