[tech] DegradedArray event on /dev/md/1:motsugo
Mitchell Pomery
mjpomery at ucc.asn.au
Sat Sep 6 14:23:01 WST 2014
If someone can help me work out which SSD needs to be replaced, I can sort
that out in about an hour.
Mitch
On Sat, 6 Sep 2014, Matt Johnston wrote:
> I got a 128GB Samsung 850 Pro, $145 from PLE, MSY had no stock.
> It's on the floor of the machineroom, I forgot my key.
> I'll take coke credit.
>
> Matt
>
> On Fri, Sep 05, 2014 at 10:24:14PM +0800, Andrew Adamson wrote:
>> Nick and I pulled the busted OCZ Vertex 2 disk out tonight - it does get
>> recognised when plugged back in but `smartctl -a /dev/sdi' is showing lots
>> of old-age/pre-fail errors (output txt attached).
>>
>> Is anyone free this weekend to go and get another of the 128G Samsungs
>> that we've been buying lately? We've only got the one system disk at the
>> moment so it's rather urgent.
>>
>> On a side note, this is the disk that we were worried about dying at the
>> start of this year (and caused us to add another disk) - adding the extra
>> disk seems to have paid off :-)
>>
>> Andrew Adamson
>> bob at ucc.asn.au
>>
>> |"If you can't beat them, join them, and then beat them." |
>> | ---Peter's Laws |
>>
>> On Fri, 5 Sep 2014, Matt Johnston wrote:
>>
>>> Does motsugo's disk need replacing, or is something wrong
>>> with the cables etc.? I don't think smartctl can see it. It's
>>> the system RAID VG 'reliable'.
>>>
>>> Matt
>>>
>>> ----- Forwarded message from mdadm monitoring <root at ucc.gu.uwa.edu.au> -----
>>>
>>> Date: Fri, 5 Sep 2014 06:27:38 +0800 (WST)
>>> From: mdadm monitoring <root at ucc.gu.uwa.edu.au>
>>> To: root at ucc.gu.uwa.edu.au
>>> Subject: DegradedArray event on /dev/md/1:motsugo
>>>
>>> This is an automatically generated mail message from mdadm
>>> running on motsugo
>>>
>>> A DegradedArray event had been detected on md device /dev/md/1.
>>>
>>> Faithfully yours, etc.
>>>
>>> P.S. The /proc/mdstat file currently contains the following:
>>>
>>> Personalities : [raid1] [raid6] [raid5] [raid4]
>>> md0 : active raid6 sdc1[0] sdg1[4] sdf1[3] sde1[2] sdd1[1]
>>>       5860535808 blocks super 1.2 level 6, 512k chunk, algorithm 2 [5/5] [UUUUU]
>>>
>>> md1 : active raid1 sda1[2](F) sdb1[1]
>>>       117211608 blocks super 1.2 [2/1] [_U]
>>>
>>> unused devices: <none>
>>>
>>> ----- End forwarded message -----
>>> _______________________________________________
>>> List Archives: http://lists.ucc.gu.uwa.edu.au/pipermail/tech
>>>
>>> Unsubscribe here: http://lists.ucc.gu.uwa.edu.au/mailman/options/tech/bob%40ucc.gu.uwa.edu.au
>>>
>
>> root@motsugo:/var/log# tail -f /var/log/kern.log
>> Sep 5 22:06:23 motsugo kernel: [16759166.111673] ata1: SError: { PHYRdyChg DevExch }
>> Sep 5 22:06:23 motsugo kernel: [16759166.111705] ata1: hard resetting link
>> Sep 5 22:06:23 motsugo kernel: [16759166.831245] ata1: SATA link down (SStatus 0 SControl 300)
>> Sep 5 22:06:23 motsugo kernel: [16759166.831256] ata1: EH complete
>> Sep 5 22:06:23 motsugo kernel: [16759166.831269] ata1.00: detaching (SCSI 0:0:0:0)
>> Sep 5 22:06:23 motsugo kernel: [16759166.834140] sd 0:0:0:0: [sda] Synchronizing SCSI cache
>> Sep 5 22:06:23 motsugo kernel: [16759166.834182] sd 0:0:0:0: [sda] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
>> Sep 5 22:06:23 motsugo kernel: [16759166.834187] sd 0:0:0:0: [sda] Stopping disk
>> Sep 5 22:06:23 motsugo kernel: [16759166.834200] sd 0:0:0:0: [sda] START_STOP FAILED
>> Sep 5 22:06:23 motsugo kernel: [16759166.834202] sd 0:0:0:0: [sda] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
>>
>>
>>
>> Sep 5 22:08:54 motsugo kernel: [16759317.250330] ata1: exception Emask 0x10 SAct 0x0 SErr 0x4050002 action 0xe frozen
>> Sep 5 22:08:54 motsugo kernel: [16759317.250377] ata1: irq_stat 0x00400040, connection status changed
>> Sep 5 22:08:54 motsugo kernel: [16759317.250406] ata1: SError: { RecovComm PHYRdyChg CommWake DevExch }
>> Sep 5 22:08:54 motsugo kernel: [16759317.250441] ata1: hard resetting link
>> Sep 5 22:08:55 motsugo kernel: [16759317.969984] ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
>> Sep 5 22:08:55 motsugo kernel: [16759318.560019] ata1.00: ATA-8: OCZ-VERTEX2, 1.27, max UDMA/133
>> Sep 5 22:08:55 motsugo kernel: [16759318.560024] ata1.00: 234441648 sectors, multi 1: LBA48 NCQ (depth 31/32), AA
>> Sep 5 22:08:55 motsugo kernel: [16759318.631611] ata1.00: configured for UDMA/133
>> Sep 5 22:08:55 motsugo kernel: [16759318.631621] ata1: EH complete
>> Sep 5 22:08:55 motsugo kernel: [16759318.631752] scsi 0:0:0:0: Direct-Access ATA OCZ-VERTEX2 1.27 PQ: 0 ANSI: 5
>> Sep 5 22:08:55 motsugo kernel: [16759318.632304] sd 0:0:0:0: Attached scsi generic sg0 type 0
>> Sep 5 22:08:55 motsugo kernel: [16759318.632307] sd 0:0:0:0: [sdi] 234441648 512-byte logical blocks: (120 GB/111 GiB)
>> Sep 5 22:08:55 motsugo kernel: [16759318.632395] sd 0:0:0:0: [sdi] Write Protect is off
>> Sep 5 22:08:55 motsugo kernel: [16759318.632399] sd 0:0:0:0: [sdi] Mode Sense: 00 3a 00 00
>> Sep 5 22:08:55 motsugo kernel: [16759318.632430] sd 0:0:0:0: [sdi] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
>> Sep 5 22:08:55 motsugo kernel: [16759318.633184] sdi: sdi1
>> Sep 5 22:08:55 motsugo kernel: [16759318.633491] sd 0:0:0:0: [sdi] Attached SCSI disk
>>
>>
>> ^C
>> root@motsugo:/var/log# fdisk -l /dev/sdi
>>
>> Disk /dev/sdi: 120.0 GB, 120034123776 bytes
>> 81 heads, 63 sectors/track, 45941 cylinders, total 234441648 sectors
>> Units = sectors of 1 * 512 = 512 bytes
>> Sector size (logical/physical): 512 bytes / 512 bytes
>> I/O size (minimum/optimal): 512 bytes / 512 bytes
>> Disk identifier: 0x000291a1
>>
>>    Device Boot      Start         End      Blocks   Id  System
>> /dev/sdi1   *        2048   234441647   117219800   fd  Linux raid autodetect
>> root@motsugo:/var/log# smartctl -a /dev/sdi
>> smartctl 5.41 2011-06-09 r3365 [x86_64-linux-3.2.0-4-amd64] (local build)
>> Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net
>>
>> === START OF INFORMATION SECTION ===
>> Model Family: SandForce Driven SSDs
>> Device Model: OCZ-VERTEX2
>> Serial Number: OCZ-10O78Z46ES6Z8177
>> LU WWN Device Id: 5 e83a97 fead4d449
>> Firmware Version: 1.27
>> User Capacity: 120,034,123,776 bytes [120 GB]
>> Sector Size: 512 bytes logical/physical
>> Device is: In smartctl database [for details use: -P show]
>> ATA Version is: 8
>> ATA Standard is: ATA-8-ACS revision 6
>> Local Time is: Fri Sep 5 22:09:48 2014 WST
>> SMART support is: Available - device has SMART capability.
>> SMART support is: Enabled
>>
>> === START OF READ SMART DATA SECTION ===
>> SMART overall-health self-assessment test result: PASSED
>>
>> General SMART Values:
>> Offline data collection status: (0x00) Offline data collection activity
>> was never started.
>> Auto Offline Data Collection: Disabled.
>> Self-test execution status: ( 0) The previous self-test routine completed
>> without error or no self-test has ever
>> been run.
>> Total time to complete Offline
>> data collection: ( 0) seconds.
>> Offline data collection
>> capabilities: (0x7f) SMART execute Offline immediate.
>> Auto Offline data collection on/off support.
>> Abort Offline collection upon new
>> command.
>> Offline surface scan supported.
>> Self-test supported.
>> Conveyance Self-test supported.
>> Selective Self-test supported.
>> SMART capabilities: (0x0003) Saves SMART data before entering
>> power-saving mode.
>> Supports SMART auto save timer.
>> Error logging capability: (0x01) Error logging supported.
>> General Purpose Logging supported.
>> Short self-test routine
>> recommended polling time: ( 1) minutes.
>> Extended self-test routine
>> recommended polling time: ( 48) minutes.
>> Conveyance self-test routine
>> recommended polling time: ( 2) minutes.
>> SCT capabilities: (0x003d) SCT Status supported.
>> SCT Error Recovery Control supported.
>> SCT Feature Control supported.
>> SCT Data Table supported.
>>
>> SMART Attributes Data Structure revision number: 10
>> Vendor Specific SMART Attributes with Thresholds:
>> ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
>>   1 Raw_Read_Error_Rate     0x000f   120   120   050    Pre-fail  Always       -       0/0
>>   5 Retired_Block_Count     0x0033   100   100   003    Pre-fail  Always       -       0
>>   9 Power_On_Hours_and_Msec 0x0032   100   100   000    Old_age   Always       -       31432h+05m+50.310s
>>  12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       45
>> 171 Program_Fail_Count      0x0032   000   000   000    Old_age   Always       -       0
>> 172 Erase_Fail_Count        0x0032   000   000   000    Old_age   Always       -       0
>> 174 Unexpect_Power_Loss_Ct  0x0030   000   000   000    Old_age   Offline      -       28
>> 177 Wear_Range_Delta        0x0000   000   000   000    Old_age   Offline      -       0
>> 181 Program_Fail_Count      0x0032   000   000   000    Old_age   Always       -       0
>> 182 Erase_Fail_Count        0x0032   000   000   000    Old_age   Always       -       0
>> 187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
>> 194 Temperature_Celsius     0x0022   030   129   000    Old_age   Always       -       30 (Min/Max 30/30)
>> 195 ECC_Uncorr_Error_Count  0x001c   120   120   000    Old_age   Offline      -       0/0
>> 196 Reallocated_Event_Count 0x0033   100   100   000    Pre-fail  Always       -       0
>> 231 SSD_Life_Left           0x0013   100   100   010    Pre-fail  Always       -       0
>> 233 SandForce_Internal      0x0000   000   000   000    Old_age   Offline      -       3392
>> 234 SandForce_Internal      0x0032   000   000   000    Old_age   Always       -       3456
>> 241 Lifetime_Writes_GiB     0x0032   000   000   000    Old_age   Always       -       3456
>> 242 Lifetime_Reads_GiB      0x0032   000   000   000    Old_age   Always       -       13184
>>
>> SMART Error Log not supported
>> SMART Self-test Log not supported
>> SMART Selective self-test log data structure revision number 1
>>  SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
>>     1        0        0  Not_testing
>>     2        0        0  Not_testing
>>     3        0        0  Not_testing
>>     4        0        0  Not_testing
>>     5        0        0  Not_testing
>> Selective self-test flags (0x0):
>> After scanning selected spans, do NOT read-scan remainder of disk.
>> If Selective self-test is pending on power-up, resume after 0 minute delay.
>
>
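
On the question of which SSD to replace: the mdstat text in the forwarded
mdadm mail marks the failed member with (F). Below is a minimal sketch in
Python of pulling that out of the text. The function name failed_members
is hypothetical, and the sample input is copied from the quoted mail, not
read from the live machine. In this thread the flagged member is sda1,
which per the quoted kernel log appears to be the OCZ Vertex 2 on ata1
that came back as sdi after being replugged; its serial number is in the
smartctl output above.

```python
import re

# Sample text copied from the forwarded mdadm mail quoted in this thread.
MDSTAT = """\
Personalities : [raid1] [raid6] [raid5] [raid4]
md0 : active raid6 sdc1[0] sdg1[4] sdf1[3] sde1[2] sdd1[1]
      5860535808 blocks super 1.2 level 6, 512k chunk, algorithm 2 [5/5] [UUUUU]

md1 : active raid1 sda1[2](F) sdb1[1]
      117211608 blocks super 1.2 [2/1] [_U]

unused devices: <none>
"""


def failed_members(mdstat_text):
    """Return {array: [members flagged (F)]} for arrays with failed members.

    Hypothetical helper: it only looks at the "mdN : ..." summary lines.
    """
    failed = {}
    for line in mdstat_text.splitlines():
        m = re.match(r"^(md\S*)\s*:\s*(.*)$", line)
        if not m:
            continue  # skip Personalities, block counts, blank lines
        array, rest = m.groups()
        # A member looks like sda1[2]; a failed one carries a trailing (F).
        bad = re.findall(r"(\S+?)\[\d+\]\(F\)", rest)
        if bad:
            failed[array] = bad
    return failed


print(failed_members(MDSTAT))
```

On a live system the same function could be fed the contents of
/proc/mdstat instead of the pasted sample.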