[tech] DegradedArray event on /dev/md/1:motsugo

Sat Sep 6 16:54:09 WST 2014

So [BG3] and I set up the new SSD: all seems to be working fine.

Linux is calling it sdi, because it can, so the system raid is:
/dev/md1: sdb1, sdi1

Everything's been synced over, and there have been no obvious problems.
GRUB was installed with the usual "grub-install /dev/sdi". It should work,
but we haven't rebooted motsugo to check. The next person to reboot it may
be in for some excitement if grub is feeling grumpy.

— [SLX]

On Sat, 6 Sep 2014, Mitchell Pomery wrote:

> If someone can help me work out which SSD needs to be replaced, I can sort
> that out in about an hour.
>
> Mitch
>
> On Sat, 6 Sep 2014, Matt Johnston wrote:
>
>> I got a 128GB Samsung 850 Pro, $145 from PLE, MSY had no stock.
>> It's on the floor of the machineroom, I forgot my key.
>> I'll take coke credit.
>>
>> Matt
>>
>> On Fri, Sep 05, 2014 at 10:24:14PM +0800, Andrew Adamson wrote:
>>> Nick and I pulled the busted OCZ Vertex 2 disk out tonight - it does get
>>> recognised when plugged back in but `smartctl -a /dev/sdi' is showing lots
>>> of old-age/pre-fail errors (output txt attached).
>>>
>>> Is anyone free this weekend to go and get another of the 128G Samsungs
>>> that we've been buying lately? We've only got the one system disk at the
>>> moment so it's rather urgent.
>>>
>>> On a side note, this is the disk that we were worried about dying at the
>>> start of this year (and caused us to add another disk) - adding the extra
>>> disk seems to have paid off :-)
>>>
>>> Andrew Adamson
>>> bob at ucc.asn.au
>>>
>>> |"If you can't beat them, join them, and then beat them."                |
>>> | ---Peter's Laws                                                        |
>>>
>>> On Fri, 5 Sep 2014, Matt Johnston wrote:
>>>
>>>> Does motsugo's disk need replacing, or is something wrong
>>>> with cables etc? smartctl can't see it I don't think. It's
>>>> the system raid VG 'reliable'.
>>>>
>>>> Matt
>>>>
>>>> ----- Forwarded message from mdadm monitoring <root at ucc.gu.uwa.edu.au> -----
>>>>
>>>> Date: Fri,  5 Sep 2014 06:27:38 +0800 (WST)
>>>> From: mdadm monitoring <root at ucc.gu.uwa.edu.au>
>>>> To: root at ucc.gu.uwa.edu.au
>>>> Subject: DegradedArray event on /dev/md/1:motsugo
>>>>
>>>> This is an automatically generated mail message from mdadm
>>>> running on motsugo
>>>>
>>>> A DegradedArray event had been detected on md device /dev/md/1.
>>>>
>>>> Faithfully yours, etc.
>>>>
>>>> P.S. The /proc/mdstat file currently contains the following:
>>>>
>>>> Personalities : [raid1] [raid6] [raid5] [raid4]
>>>> md0 : active raid6 sdc1[0] sdg1[4] sdf1[3] sde1[2] sdd1[1]
>>>>       5860535808 blocks super 1.2 level 6, 512k chunk, algorithm 2 [5/5] [UUUUU]
>>>>
>>>> md1 : active raid1 sda1[2](F) sdb1[1]
>>>>       117211608 blocks super 1.2 [2/1] [_U]
>>>>
>>>> unused devices: <none>
>>>>
>>>> ----- End forwarded message -----
>>>> _______________________________________________
>>>> List Archives: http://lists.ucc.gu.uwa.edu.au/pipermail/tech
>>>>
>>>> Unsubscribe here: http://lists.ucc.gu.uwa.edu.au/mailman/options/tech/bob%40ucc.gu.uwa.edu.au
>>>>
>>
>>> root@motsugo:/var/log# tail -f /var/log/kern.log
>>> Sep  5 22:06:23 motsugo kernel: [16759166.111673] ata1: SError: { PHYRdyChg DevExch }
>>> Sep  5 22:06:23 motsugo kernel: [16759166.111705] ata1: hard resetting link
>>> Sep  5 22:06:23 motsugo kernel: [16759166.831245] ata1: SATA link down (SStatus 0 SControl 300)
>>> Sep  5 22:06:23 motsugo kernel: [16759166.831256] ata1: EH complete
>>> Sep  5 22:06:23 motsugo kernel: [16759166.831269] ata1.00: detaching (SCSI 0:0:0:0)
>>> Sep  5 22:06:23 motsugo kernel: [16759166.834140] sd 0:0:0:0: [sda] Synchronizing SCSI cache
>>> Sep  5 22:06:23 motsugo kernel: [16759166.834182] sd 0:0:0:0: [sda]  Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
>>> Sep  5 22:06:23 motsugo kernel: [16759166.834187] sd 0:0:0:0: [sda] Stopping disk
>>> Sep  5 22:06:23 motsugo kernel: [16759166.834200] sd 0:0:0:0: [sda] START_STOP FAILED
>>> Sep  5 22:06:23 motsugo kernel: [16759166.834202] sd 0:0:0:0: [sda]  Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
>>>
>>>
>>>
>>> Sep  5 22:08:54 motsugo kernel: [16759317.250330] ata1: exception Emask 0x10 SAct 0x0 SErr 0x4050002 action 0xe frozen
>>> Sep  5 22:08:54 motsugo kernel: [16759317.250377] ata1: irq_stat 0x00400040, connection status changed
>>> Sep  5 22:08:54 motsugo kernel: [16759317.250406] ata1: SError: { RecovComm PHYRdyChg CommWake DevExch }
>>> Sep  5 22:08:54 motsugo kernel: [16759317.250441] ata1: hard resetting link
>>> Sep  5 22:08:55 motsugo kernel: [16759317.969984] ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
>>> Sep  5 22:08:55 motsugo kernel: [16759318.560019] ata1.00: ATA-8: OCZ-VERTEX2, 1.27, max UDMA/133
>>> Sep  5 22:08:55 motsugo kernel: [16759318.560024] ata1.00: 234441648 sectors, multi 1: LBA48 NCQ (depth 31/32), AA
>>> Sep  5 22:08:55 motsugo kernel: [16759318.631611] ata1.00: configured for UDMA/133
>>> Sep  5 22:08:55 motsugo kernel: [16759318.631621] ata1: EH complete
>>> Sep  5 22:08:55 motsugo kernel: [16759318.631752] scsi 0:0:0:0: Direct-Access     ATA      OCZ-VERTEX2      1.27 PQ: 0 ANSI: 5
>>> Sep  5 22:08:55 motsugo kernel: [16759318.632304] sd 0:0:0:0: Attached scsi generic sg0 type 0
>>> Sep  5 22:08:55 motsugo kernel: [16759318.632307] sd 0:0:0:0: [sdi] 234441648 512-byte logical blocks: (120 GB/111 GiB)
>>> Sep  5 22:08:55 motsugo kernel: [16759318.632395] sd 0:0:0:0: [sdi] Write Protect is off
>>> Sep  5 22:08:55 motsugo kernel: [16759318.632399] sd 0:0:0:0: [sdi] Mode Sense: 00 3a 00 00
>>> Sep  5 22:08:55 motsugo kernel: [16759318.632430] sd 0:0:0:0: [sdi] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
>>> Sep  5 22:08:55 motsugo kernel: [16759318.633184]  sdi: sdi1
>>> Sep  5 22:08:55 motsugo kernel: [16759318.633491] sd 0:0:0:0: [sdi] Attached SCSI disk
>>>
>>>
>>> ^C
>>> root@motsugo:/var/log# fdisk -l /dev/sdi
>>>
>>> Disk /dev/sdi: 120.0 GB, 120034123776 bytes
>>> 81 heads, 63 sectors/track, 45941 cylinders, total 234441648 sectors
>>> Units = sectors of 1 * 512 = 512 bytes
>>> Sector size (logical/physical): 512 bytes / 512 bytes
>>> I/O size (minimum/optimal): 512 bytes / 512 bytes
>>> Disk identifier: 0x000291a1
>>>
>>>    Device Boot      Start         End      Blocks   Id  System
>>> /dev/sdi1   *        2048   234441647   117219800   fd  Linux raid autodetect
>>> root@motsugo:/var/log# smartctl -a /dev/sdi
>>> smartctl 5.41 2011-06-09 r3365 [x86_64-linux-3.2.0-4-amd64] (local build)
>>> Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net
>>>
>>> === START OF INFORMATION SECTION ===
>>> Model Family:     SandForce Driven SSDs
>>> Device Model:     OCZ-VERTEX2
>>> Serial Number:    OCZ-10O78Z46ES6Z8177
>>> LU WWN Device Id: 5 e83a97 fead4d449
>>> Firmware Version: 1.27
>>> User Capacity:    120,034,123,776 bytes [120 GB]
>>> Sector Size:      512 bytes logical/physical
>>> Device is:        In smartctl database [for details use: -P show]
>>> ATA Version is:   8
>>> ATA Standard is:  ATA-8-ACS revision 6
>>> Local Time is:    Fri Sep  5 22:09:48 2014 WST
>>> SMART support is: Available - device has SMART capability.
>>> SMART support is: Enabled
>>>
>>> === START OF READ SMART DATA SECTION ===
>>> SMART overall-health self-assessment test result: PASSED
>>>
>>> General SMART Values:
>>> Offline data collection status:  (0x00) Offline data collection activity
>>>                                         was never started.
>>>                                         Auto Offline Data Collection: Disabled.
>>> Self-test execution status:      (   0) The previous self-test routine completed
>>>                                         without error or no self-test has ever
>>>                                         been run.
>>> Total time to complete Offline
>>> data collection:                (    0) seconds.
>>> Offline data collection
>>> capabilities:                    (0x7f) SMART execute Offline immediate.
>>>                                         Auto Offline data collection on/off support.
>>>                                         Abort Offline collection upon new
>>>                                         command.
>>>                                         Offline surface scan supported.
>>>                                         Self-test supported.
>>>                                         Conveyance Self-test supported.
>>>                                         Selective Self-test supported.
>>> SMART capabilities:            (0x0003) Saves SMART data before entering
>>>                                         power-saving mode.
>>>                                         Supports SMART auto save timer.
>>> Error logging capability:        (0x01) Error logging supported.
>>>                                         General Purpose Logging supported.
>>> Short self-test routine
>>> recommended polling time:        (   1) minutes.
>>> Extended self-test routine
>>> recommended polling time:        (  48) minutes.
>>> Conveyance self-test routine
>>> recommended polling time:        (   2) minutes.
>>> SCT capabilities:              (0x003d) SCT Status supported.
>>>                                         SCT Error Recovery Control supported.
>>>                                         SCT Feature Control supported.
>>>                                         SCT Data Table supported.
>>>
>>> SMART Attributes Data Structure revision number: 10
>>> Vendor Specific SMART Attributes with Thresholds:
>>> ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
>>>   1 Raw_Read_Error_Rate     0x000f   120   120   050    Pre-fail  Always       -       0/0
>>>   5 Retired_Block_Count     0x0033   100   100   003    Pre-fail  Always       -       0
>>>   9 Power_On_Hours_and_Msec 0x0032   100   100   000    Old_age   Always       -       31432h+05m+50.310s
>>>  12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       45
>>> 171 Program_Fail_Count      0x0032   000   000   000    Old_age   Always       -       0
>>> 172 Erase_Fail_Count        0x0032   000   000   000    Old_age   Always       -       0
>>> 174 Unexpect_Power_Loss_Ct  0x0030   000   000   000    Old_age   Offline      -       28
>>> 177 Wear_Range_Delta        0x0000   000   000   000    Old_age   Offline      -       0
>>> 181 Program_Fail_Count      0x0032   000   000   000    Old_age   Always       -       0
>>> 182 Erase_Fail_Count        0x0032   000   000   000    Old_age   Always       -       0
>>> 187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
>>> 194 Temperature_Celsius     0x0022   030   129   000    Old_age   Always       -       30 (Min/Max 30/30)
>>> 195 ECC_Uncorr_Error_Count  0x001c   120   120   000    Old_age   Offline      -       0/0
>>> 196 Reallocated_Event_Count 0x0033   100   100   000    Pre-fail  Always       -       0
>>> 231 SSD_Life_Left           0x0013   100   100   010    Pre-fail  Always       -       0
>>> 233 SandForce_Internal      0x0000   000   000   000    Old_age   Offline      -       3392
>>> 234 SandForce_Internal      0x0032   000   000   000    Old_age   Always       -       3456
>>> 241 Lifetime_Writes_GiB     0x0032   000   000   000    Old_age   Always       -       3456
>>> 242 Lifetime_Reads_GiB      0x0032   000   000   000    Old_age   Always       -       13184
>>>
>>> SMART Error Log not supported
>>> SMART Self-test Log not supported
>>> SMART Selective self-test log data structure revision number 1
>>>  SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
>>>     1        0        0  Not_testing
>>>     2        0        0  Not_testing
>>>     3        0        0  Not_testing
>>>     4        0        0  Not_testing
>>>     5        0        0  Not_testing
>>> Selective self-test flags (0x0):
>>>   After scanning selected spans, do NOT read-scan remainder of disk.
>>> If Selective self-test is pending on power-up, resume after 0 minute delay.