Cleanup & Check Harddisk Under LSI RAID (revision of Wed, 03/02/2021 - 16:20)


Last updated: 2020-02-03

 

 


SAS Disk Health Indicators

 

Relevant parameters

  • Total uncorrected errors
  • Elements in grown defect list

Normal state: an empty grown defect list (a few drives may show up to 5 entries).

If the number is not zero, monitor the defect list for some time to see whether it is still growing.

A steadily growing defect list is a strong sign that the drive will fail in the near future.
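
The monitoring rule above can be sketched in Python. This is only a minimal sketch: it assumes the `smartctl -a` output of the SAS disk contains a line of the form `Elements in grown defect list: N`, and the function names and the sample readings are hypothetical.

```python
import re

def grown_defects(smartctl_output):
    """Return the grown defect list size found in smartctl text, or None if absent."""
    m = re.search(r"Elements in grown defect list:\s*(\d+)", smartctl_output)
    return int(m.group(1)) if m else None

def is_growing(samples):
    """True if the count increased between any two consecutive readings."""
    return any(later > earlier for earlier, later in zip(samples, samples[1:]))

# Hypothetical readings collected on different days (not real drive data)
readings = [3, 3, 5, 9]
print(is_growing(readings))
```

A list that stays at the same small value would not trigger `is_growing`; only an increase between readings does.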

 


How to check

 

1. Quick overview

MegaCli64 -PDinfo -PhysDrv[252:2] -a0

...
Media Error Count: 0
Other Error Count: 0
Predictive Failure Count: 0
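
As a rough illustration, the three counters in the output above could be pulled out of the MegaCli text with a few lines of Python. The function name `parse_pd_counters` is hypothetical, and the sketch only assumes the `Name: value` layout shown above.

```python
import re

COUNTERS = ("Media Error Count", "Other Error Count", "Predictive Failure Count")

def parse_pd_counters(pdinfo_text):
    """Extract the three error counters from `MegaCli64 -PDinfo` text."""
    result = {}
    for name in COUNTERS:
        m = re.search(rf"^{re.escape(name)}\s*:\s*(\d+)", pdinfo_text, re.MULTILINE)
        if m:
            result[name] = int(m.group(1))
    return result

# Sample taken from the output shown above
sample = """Media Error Count: 0
Other Error Count: 0
Predictive Failure Count: 0
"""
counters = parse_pd_counters(sample)
print(all(value == 0 for value in counters.values()))
```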

smartctl --scan

...
/dev/bus/0 -d megaraid,16 # /dev/bus/0 [megaraid_disk_16], SCSI device
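
The device id used in the following commands (the `16` in `megaraid,16`) comes from this scan output. A minimal Python sketch for collecting the ids, assuming each disk appears as `-d megaraid,N` as in the line above (the helper name `megaraid_ids` is hypothetical):

```python
import re

def megaraid_ids(scan_output):
    """Return the megaraid device ids listed in `smartctl --scan` text."""
    return [int(n) for n in re.findall(r"-d megaraid,(\d+)", scan_output)]

# Line taken from the scan output shown above
line = "/dev/bus/0 -d megaraid,16 # /dev/bus/0 [megaraid_disk_16], SCSI device"
print(megaraid_ids(line))  # [16]
```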

smartctl -d megaraid,16 -a /dev/bus/0

Remark

  • -l error
  • -l selftest
  • -x                # includes the "Background scan results log"

smartctl -d megaraid,16 -t short /dev/bus/0

smartctl -d megaraid,16 -a /dev/bus/0
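
All smartctl calls in this section share the same shape: `-d megaraid,N`, some options, then the bus device. A small Python sketch of a helper that builds that argument list (the helper name `smartctl_cmd` and its `bus` default are assumptions for illustration; the list could be passed to `subprocess.run` on a host that has the controller):

```python
def smartctl_cmd(dev_id, *options, bus="/dev/bus/0"):
    """Build the argv list for a smartctl call against a disk behind the RAID controller."""
    return ["smartctl", "-d", f"megaraid,{dev_id}", *options, bus]

# Reproduces the short self-test command shown above
print(" ".join(smartctl_cmd(16, "-t", "short")))
```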

2. Detailed check

MegaCli64 -cfgldadd R0[252:2] WT NORA -a0

MegaCli64 -LDInfo -Lall -a0

smartctl -d megaraid,16 -t long /dev/bus/0

smartctl -d megaraid,16 -a /dev/bus/0

...
Self-test execution status:             89% of test remaining
SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                              number   (hours)
# 1  Background long   Self test in progress ...  64     NOW                 - [-   -    -]
# 2  Background short  Completed                  64   39758                 - [-   -    -]

Long (extended) Self-test duration: 5616 seconds [93.6 minutes]
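
The two figures above (percentage remaining and total duration) give a rough estimate of how long the long test still needs. The sketch below assumes progress is roughly linear, which the drive does not guarantee; the function name `remaining_seconds` is hypothetical.

```python
import re

def remaining_seconds(smartctl_output):
    """Estimate seconds left in a long self-test, or None if either figure is missing."""
    pct = re.search(r"(\d+)% of test remaining", smartctl_output)
    dur = re.search(r"Long \(extended\) Self-test duration:\s*(\d+) seconds", smartctl_output)
    if not (pct and dur):
        return None
    return int(dur.group(1)) * int(pct.group(1)) / 100

# Lines taken from the output shown above
sample = """Self-test execution status:             89% of test remaining
Long (extended) Self-test duration: 5616 seconds [93.6 minutes]
"""
left = remaining_seconds(sample)
print(f"about {left / 60:.1f} minutes left")
```

With the sample values this is 89% of 5616 seconds, i.e. roughly 83 minutes.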

LifeTime: equivalent to "number of hours powered up"

Vendor (Seagate Cache) information
  Blocks sent to initiator = 2668630296                # does not increase during the self-test
  Blocks received from initiator = 1675807082          # does not increase during the self-test
  Blocks read from cache and sent to initiator = 4272401990

3. Wipe data

dmesg                                                                    # identify the correct disk to wipe

dd if=/dev/zero of=/dev/sdX bs=32M oflag=direct     # "Blocks received from initiator" keeps increasing

smartctl -d megaraid,16 -a /dev/bus/0
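
Comparing the "Blocks received from initiator" counter from two smartctl runs is one way to confirm that dd is writing to the intended physical disk. A minimal Python sketch, where the function names are hypothetical and the "after" value is invented purely for illustration:

```python
import re

def blocks_received(smartctl_output):
    """Return the 'Blocks received from initiator' counter, or None if absent."""
    m = re.search(r"Blocks received from initiator\s*=\s*(\d+)", smartctl_output)
    return int(m.group(1)) if m else None

def wipe_is_hitting_disk(before, after):
    """True if the counter increased between two smartctl snapshots."""
    a, b = blocks_received(before), blocks_received(after)
    return a is not None and b is not None and b > a

before = "  Blocks received from initiator = 1675807082"   # value from the output above
after = "  Blocks received from initiator = 1675907082"    # hypothetical later reading
print(wipe_is_hitting_disk(before, after))
```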

MegaCli64 -CfgLdDel -L1 -a0

Adapter 0: Deleted Virtual Drive-1(target id-1)

MegaCli64 -PDPrpRmv -PhysDrv[252:2] -a0

Prepare for removal Success

MegaCli64 -PDinfo -PhysDrv[252:2] -a0

...
Firmware state: Unconfigured(good), Spun down

 


Troubleshooting

 

smartctl -d megaraid,16 -a /dev/bus/0

...
SMART support is:     Unavailable - device lacks SMART capability.

=== START OF READ SMART DATA SECTION ===
Current Drive Temperature:     0 C
Drive Trip Temperature:        0 C

Error Counter logging not supported

Device does not support Self Test logging

Cause: Firmware state: Unconfigured(good), Spun down

 => To keep the HDD from spinning down, an R0 has to be created on it first.
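
Before running smartctl it can help to check the firmware state reported by `MegaCli64 -PDinfo`. The Python sketch below only looks for the `Firmware state:` line shown earlier; the function names are hypothetical.

```python
import re

def firmware_state(pdinfo_text):
    """Return the text after 'Firmware state:' in MegaCli -PDinfo output, or None."""
    m = re.search(r"^Firmware state:\s*(.+)$", pdinfo_text, re.MULTILINE)
    return m.group(1).strip() if m else None

def is_spun_down(pdinfo_text):
    """True if the reported firmware state mentions 'Spun down'."""
    state = firmware_state(pdinfo_text)
    return state is not None and "spun down" in state.lower()

# Line taken from the MegaCli output shown earlier
sample = "Firmware state: Unconfigured(good), Spun down\n"
print(firmware_state(sample))
print(is_spun_down(sample))
```

If `is_spun_down` returns True, the SMART data from smartctl is likely to look like the empty output above.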

 

 


Disk Error Info

 

smartctl -d megaraid,20 -l error /dev/bus/0 | less

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:   593604873        1         0  593604874          1    1150264.607           0
write:         0        0         3         3          3     159035.025           0
verify: 2060722324        0         0  2060722324          0     212460.568           0

Non-medium error count:       57

Errors Corrected by ECC, fast (00h)

Errors corrected without substantial delay
(correction did not postpone reading of later sectors)
Two different blocks corrected during the same command are counted as two events.

Errors Corrected by ECC: delayed (01h)

"With possible delay" means the correction took longer than a sector time
so that reading/writing of subsequent sectors was delayed

Errors corrected by rereads/rewrites (02h)

This counts errors recovered, not the number of retries.
If five retries were required to recover one block of data, the counter increments by one, not five.
If an error is not recoverable while applying retries and is recovered by ECC, it isn't counted by this counter.

Correction algorithm invocations (04h)

If after five attempts a counter 02h type error is recovered, then five is added to this counter.
If three retries are required to get stable ECC syndrome before a counter 01h type error is corrected,
then those three retries are also counted here.
The number of retries applied in an unsuccessful attempt to recover an error (counter 06h type error) is also counted by this counter.
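
To make the columns of the error counter log easier to work with, a row can be split into named fields. The Python sketch below uses hypothetical field names. It also checks an observation from the sample rows above, namely that "total errors corrected" equals the sum of the first three columns; this is an observation about this sample, not a documented guarantee.

```python
FIELDS = ("ecc_fast", "ecc_delayed", "rereads_rewrites", "total_corrected",
          "algorithm_invocations", "gigabytes", "total_uncorrected")

def parse_error_row(line):
    """Split one 'read:/write:/verify:' row into (name, dict of numeric fields)."""
    name, *values = line.split()
    numbers = [float(v) if "." in v else int(v) for v in values]
    return name.rstrip(":"), dict(zip(FIELDS, numbers))

def corrected_sum_matches(row):
    """Check that the three 'corrected' columns add up to the total column."""
    parts = row["ecc_fast"] + row["ecc_delayed"] + row["rereads_rewrites"]
    return parts == row["total_corrected"]

# Rows taken from the error counter log shown above
rows = [
    "read:   593604873        1         0  593604874          1    1150264.607           0",
    "write:         0        0         3         3          3     159035.025           0",
    "verify: 2060722324        0         0  2060722324          0     212460.568           0",
]
for text in rows:
    name, row = parse_error_row(text)
    print(name, row["total_uncorrected"], corrected_sum_matches(row))
```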

 


Background Media Scan info

 

A background media scan reads the whole media; recoverable errors are acted on and unrecoverable errors are noted.

If a sector (block) is found with a recoverable error, it may be fixed with a rewrite "in place" (i.e. the error correction codes (ECC) detect a problem but contain enough redundant information to fix it).

Alternatively, the disk may decide to reassign the recovered data to another physical sector, which is assigned the same logical block address. (The original faulted sector is unmapped and placed on the grown defect list (GLIST).)

smartctl -d megaraid,20 -l background /dev/bus/0

=== START OF READ SMART DATA SECTION ===
Background scan results log
  Status: scan is active
    Accumulated power on time, hours:minutes 39686:25 [2381185 minutes]
    Number of background scans performed: 554,  scan progress: 47.28%
    Number of background medium scans performed: 554

   #  when        lba(hex)    [sk,asc,ascq]    reassign_status
   1  793:29  0000000001d44b4e  [1,17,1]   Recovered via rewrite in-place
   2 2162:12  00000000055af5de  [1,17,2]   Recovered via rewrite in-place
   3 2738:24  0000000001d4415b  [1,17,1]   Recovered via rewrite in-place
   4 2882:29  0000000002d5cba8  [1,17,1]   Recovered via rewrite in-place
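
The entries of the background scan results log can be parsed into records, for example to count how many sectors were fixed in place versus reassigned. The Python sketch below assumes the column layout shown above (`#`, `when` as hours:minutes, hexadecimal LBA, `[sk,asc,ascq]`, status); the function name and dictionary keys are hypothetical.

```python
import re
from collections import Counter

# index, hours:minutes, LBA in hex, [sk,asc,ascq], reassign status
ENTRY = re.compile(r"^\s*(\d+)\s+(\d+):(\d+)\s+([0-9a-f]+)\s+\[([^\]]+)\]\s+(.+?)\s*$")

def parse_scan_entries(text):
    """Return one dict per entry line of the background scan results log."""
    entries = []
    for line in text.splitlines():
        m = ENTRY.match(line)
        if m:
            entries.append({
                "when_minutes": int(m.group(2)) * 60 + int(m.group(3)),
                "lba": int(m.group(4), 16),
                "sense": m.group(5),
                "status": m.group(6),
            })
    return entries

# Lines taken from the log shown above
sample = """   #  when        lba(hex)    [sk,asc,ascq]    reassign_status
   1  793:29  0000000001d44b4e  [1,17,1]   Recovered via rewrite in-place
   2 2162:12  00000000055af5de  [1,17,2]   Recovered via rewrite in-place
   3 2738:24  0000000001d4415b  [1,17,1]   Recovered via rewrite in-place
   4 2882:29  0000000002d5cba8  [1,17,1]   Recovered via rewrite in-place
"""
entries = parse_scan_entries(sample)
print(Counter(e["status"] for e in entries))
```

The `when` column uses the same hours:minutes form as "Accumulated power on time" (39686:25 is 2381185 minutes), so `when_minutes` can be compared directly with that figure.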

 


Other

 

sdparm