Cleanup & Check Harddisk Under LSI RAID

Posted by datahunter on Tue, 02/02/2021 - 17:47

Last updated: 2020-02-03

SAS Disk Health Indicators

Relevant parameters

Total uncorrected errors
Elements in grown defect list

Disk on HW RAID

lspci | grep -i -E 'raid|adaptec'

01:00.0 RAID bus controller: Broadcom / LSI MegaRAID SAS-3 3008 [Fury] (rev 02)

smartctl --scan

/dev/bus/8 -d megaraid,0 # /dev/bus/8 [megaraid_disk_00], SCSI device
/dev/bus/8 -d megaraid,2 # /dev/bus/8 [megaraid_disk_02], SCSI device
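
To run the same health check against every disk behind the controller, the scan output can be turned into device/ID pairs. This is a minimal sketch that assumes the scan lines keep the format shown above; the function name list_megaraid_disks is made up for illustration.

```shell
# Read `smartctl --scan` output on stdin and print "<device> <megaraid-id>"
# for every entry of type megaraid. Other device types are skipped.
list_megaraid_disks() {
    awk '$2 == "-d" && $3 ~ /^megaraid,[0-9]+$/ {
        split($3, parts, ",")
        print $1, parts[2]
    }'
}

# Example (needs smartmontools and root, so it is left commented out):
#   smartctl --scan | list_megaraid_disks | while read -r dev id; do
#       smartctl -H -d "megaraid,$id" "$dev"
#   done
```

Parsing the scan output avoids hard-coding the bus number, which differs between the hosts shown on this page (/dev/bus/8 and /dev/bus/0).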

smartctl -H -d megaraid,0 /dev/bus/8

=== START OF READ SMART DATA SECTION ===
SMART Health Status: FAILURE PREDICTION THRESHOLD EXCEEDED: ascq=0x5 [asc=5d, ascq=5]

smartctl -A -d megaraid,0 /dev/bus/8

=== START OF READ SMART DATA SECTION ===
Current Drive Temperature:     44 C
Drive Trip Temperature:        68 C

Accumulated power on time, hours:minutes 11909:58
Elements in grown defect list: 1962

Elements in grown defect list

Expected on a healthy drive: an empty grown defect list (a few drives may show up to about 5 entries).

If the number is not zero, monitor the defect list for some time to see whether it is still growing.

A steadily growing defect list is a strong indication that the drive will fail in the near future.
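
Watching the list over time can be scripted by recording the count and comparing it with the previous sample. The sketch below assumes the "Elements in grown defect list" line looks as shown above; the function names and the state-file path in the comment are hypothetical.

```shell
# Extract the "Elements in grown defect list" value from `smartctl -A` output on stdin.
grown_defects() {
    awk -F': *' '/^Elements in grown defect list:/ { print $2 + 0; exit }'
}

# Compare the current count with the previously recorded one.
# Prints "first-sample" (no earlier value), "growing", or "stable".
defect_trend() {
    prev=$1
    cur=$2
    if [ -z "$prev" ]; then
        echo "first-sample"
    elif [ "$cur" -gt "$prev" ]; then
        echo "growing"
    else
        echo "stable"
    fi
}

# Example (commented out; the state file location is an assumption):
#   state=/var/tmp/glist.megaraid0
#   cur=$(smartctl -A -d megaraid,0 /dev/bus/8 | grown_defects)
#   defect_trend "$(cat "$state" 2>/dev/null)" "$cur"
#   echo "$cur" > "$state"
```

Run from cron once a day, a repeated "growing" result is the signal described above.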

How to Check

1. Quick overview

MegaCli64

MegaCli64 -PDlist -a0

MegaCli64 -PDinfo -PhysDrv[252:2] -a0

...
Media Error Count: 0
Other Error Count: 0
Predictive Failure Count: 0
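
With many drives it is easier to print only the counters that are not zero. This is a rough sketch that assumes the three counter lines appear exactly as in the output above; nonzero_pd_counters is an illustrative name, not part of MegaCli.

```shell
# Read `MegaCli64 -PDinfo` or `-PDlist` output on stdin and print
# "<counter name>=<value>" for every error counter greater than zero.
nonzero_pd_counters() {
    awk -F': *' '
        /^(Media Error Count|Other Error Count|Predictive Failure Count):/ {
            if ($2 + 0 > 0) print $1 "=" ($2 + 0)
        }'
}

# Example (commented out; needs MegaCli64 and root):
#   MegaCli64 -PDinfo -PhysDrv[252:2] -a0 | nonzero_pd_counters
```

An empty result means all three counters are zero for the drives in the input.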

smartctl

smartctl --scan

...
/dev/bus/0 -d megaraid,16 # /dev/bus/0 [megaraid_disk_16], SCSI device

smartctl -d megaraid,16 -a /dev/bus/0

Remark

-l error
-l selftest
-x # includes the "Background scan results log"

smartctl -d megaraid,16 -t short /dev/bus/0

smartctl -d megaraid,16 -a /dev/bus/0

2. Detailed check

MegaCli64 -CfgLdAdd -r0[252:2] WT NORA -a0 # create a single-disk RAID 0 virtual drive so the OS can access the disk

MegaCli64 -LDInfo -Lall -a0

smartctl -d megaraid,16 -t long /dev/bus/0

smartctl -d megaraid,16 -a /dev/bus/0

...
Self-test execution status:             89% of test remaining
SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                              number   (hours)
# 1  Background long   Self test in progress ...  64     NOW                 - [-   -    -]
# 2  Background short  Completed                  64   39758                 - [-   -    -]

Long (extended) Self-test duration: 5616 seconds [93.6 minutes]

LifeTime: equivalent to "number of hours powered up"

Vendor (Seagate Cache) information
  Blocks sent to initiator = 2668630296                # does not increase
  Blocks received from initiator = 1675807082          # does not increase
  Blocks read from cache and sent to initiator = 4272401990
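
Progress of the long test can be followed by pulling the remaining percentage out of the status line and scaling the reported test duration by it. This is a sketch only: it assumes the status line format shown above and linear progress, and both function names are made up here.

```shell
# Extract the percentage of the self-test still remaining from `smartctl -a` output (stdin).
# Prints nothing when no "...% of test remaining" line is present.
selftest_remaining() {
    sed -n 's/^Self-test execution status:.* \([0-9][0-9]*\)% of test remaining.*/\1/p'
}

# Rough time left in seconds: total duration scaled by the remaining percentage.
# Linear progress is an assumption, so treat the result as an estimate.
selftest_eta_seconds() {
    total=$1
    remaining_pct=$2
    echo $(( total * remaining_pct / 100 ))
}

# Example (commented out; needs smartmontools and root):
#   pct=$(smartctl -d megaraid,16 -a /dev/bus/0 | selftest_remaining)
#   [ -n "$pct" ] && selftest_eta_seconds 5616 "$pct"
```

With the values above (5616 seconds in total, 89% remaining) the estimate is 4998 seconds, roughly 83 minutes.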

3. Wipe data

dmesg # identify the correct disk to wipe

dd if=/dev/zero of=/dev/sdX bs=32M oflag=direct # "Blocks received from initiator" keeps increasing
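
Because dd overwrites the target without asking, an extra guard before the wipe can help. The sketch below is an addition and not part of the original procedure: it checks a mounts table (for example /proc/mounts) for the device or any of its partitions. The function name is hypothetical, and the prefix match is deliberately conservative (/dev/sda also matches /dev/sdaa).

```shell
# Succeeds (exit status 0) if the given device, or anything whose name starts
# with it (such as its partitions), appears as a source in the mounts table on stdin.
has_mounted_partitions() {
    dev=$1
    awk -v dev="$dev" '
        index($1, dev) == 1 { found = 1 }
        END { exit (found ? 0 : 1) }'
}

# Example (commented out; /dev/sdX is the placeholder used above):
#   if has_mounted_partitions /dev/sdX < /proc/mounts; then
#       echo "refusing to wipe: /dev/sdX is mounted" >&2
#   else
#       dd if=/dev/zero of=/dev/sdX bs=32M oflag=direct
#   fi
```

This does not replace checking dmesg; it only catches the case where the chosen device is visibly in use.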

smartctl -d megaraid,16 -a /dev/bus/0

MegaCli64 -CfgLdDel -L1 -a0

Adapter 0: Deleted Virtual Drive-1(target id-1)

MegaCli64 -PDPrpRmv -PhysDrv[252:2] -a0

Prepare for removal Success

MegaCli64 -PDinfo -PhysDrv[252:2] -a0

...
Firmware state: Unconfigured(good), Spun down

Troubleshooting

smartctl -d megaraid,16 -a /dev/bus/0

...
SMART support is:     Unavailable - device lacks SMART capability.

=== START OF READ SMART DATA SECTION ===
Current Drive Temperature:     0 C
Drive Trip Temperature:        0 C

Error Counter logging not supported

Device does not support Self Test logging

Cause: the drive's firmware state is "Unconfigured(good), Spun down", so it does not respond to SMART queries.

Disk Error Info

# SAS Disk

smartctl -d megaraid,20 -l error /dev/bus/0 | less

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:   593604873        1         0  593604874          1    1150264.607           0
write:          0        0         3          3          3     159035.025           0
verify: 2060722324       0         0  2060722324         0     212460.568           0

Non-medium error count:       57

Errors Corrected by ECC: fast (00h)

Errors corrected without substantial delay
(correction did not postpone reading of later sectors)
Two different blocks corrected during the same command are counted as two events.

Errors Corrected by ECC: delayed (01h)

"With possible delay" means the correction took longer than a sector time
so that reading/writing of subsequent sectors was delayed

Errors corrected by rereads/rewrites (02h)

This counts errors recovered, not the number of retries.
If five retries were required to recover one block of data, the counter increments by one, not five.
If an error is not recoverable while applying retries and is recovered by ECC, it isn't counted by this counter; it is counted by the delayed ECC counter (01h) instead.

Correction algorithm invocations (04h)

If after five attempts a counter 02h type error is recovered, then five is added to this counter.
If three retries are required to get stable ECC syndrome before a counter 01h type error is corrected,
then those three retries are also counted here.
The number of retries applied while unsuccessfully trying to recover an error (counter 06h type error) is also counted by this counter.
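
The column that matters most for a quick verdict is the last one, "Total uncorrected errors". A minimal sketch for extracting it per operation, assuming the eight-column row layout of the error counter log shown above; uncorrected_errors is an illustrative name.

```shell
# Read `smartctl -l error` output on stdin and print "<operation> <total uncorrected errors>"
# for the read, write and verify rows of the error counter log.
uncorrected_errors() {
    awk '$1 ~ /^(read|write|verify):$/ && NF == 8 {
        sub(/:$/, "", $1)   # drop the trailing colon from the row label
        print $1, $NF       # the last column is "Total uncorrected errors"
    }'
}

# Example (commented out; needs smartmontools and root):
#   smartctl -d megaraid,20 -l error /dev/bus/0 | uncorrected_errors
```

For the log above this prints 0 for all three rows, meaning every error so far was corrected by ECC or by a reread/rewrite.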

Background Media Scan info

A background media scan reads the whole media, acting on recoverable errors and noting unrecoverable ones.

If a sector (block) is found with a recoverable error, meaning the error correction code (ECC) detects a problem but contains enough redundant information to fix it, the sector may be repaired with a rewrite "in place".

Alternatively, the disk may re-assign the recovered data to another physical sector, which is then given the same logical block address. The original faulty sector is unmapped and placed on the grown defect list (GLIST).

smartctl -d megaraid,20 -l background /dev/bus/0

=== START OF READ SMART DATA SECTION ===
Background scan results log
  Status: scan is active
    Accumulated power on time, hours:minutes 39686:25 [2381185 minutes]
    Number of background scans performed: 554,  scan progress: 47.28%
    Number of background medium scans performed: 554

   #  when        lba(hex)    [sk,asc,ascq]    reassign_status
   1  793:29  0000000001d44b4e  [1,17,1]   Recovered via rewrite in-place
   2 2162:12  00000000055af5de  [1,17,2]   Recovered via rewrite in-place
   3 2738:24  0000000001d4415b  [1,17,1]   Recovered via rewrite in-place
   4 2882:29  0000000002d5cba8  [1,17,1]   Recovered via rewrite in-place
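
When the log holds many entries, a count per reassign_status shows at a glance whether sectors were only rewritten in place or actually re-assigned. This is a sketch that assumes the row layout shown above (index, time, LBA in hex, bracketed sense codes, then the status text); bms_summary is a made-up name.

```shell
# Read `smartctl -l background` output on stdin and print "<count> <reassign_status>"
# for each distinct status found in the scan result rows.
bms_summary() {
    awk '$1 ~ /^[0-9]+$/ && $3 ~ /^[0-9a-f]+$/ && $4 ~ /^\[/ {
        status = $5
        for (i = 6; i <= NF; i++) status = status " " $i
        count[status]++
    }
    END { for (s in count) print count[s], s }'
}

# Example (commented out; needs smartmontools and root):
#   smartctl -d megaraid,20 -l background /dev/bus/0 | bms_summary
```

For the four rows above the result is a single line, "4 Recovered via rewrite in-place". The order of lines is unspecified when several statuses are present.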

Other

sdparm
