[tech] Dead disk in Molmol
Andrew Adamson
bob at ucc.gu.uwa.edu.au
Fri May 10 21:35:30 AWST 2019
Hi All,
So with the help of [DBA], we unracked molmol last week and replaced the
failing SLOG/system disk with a Samsung. So molmol is out of the
woods...ish.
Unfortunately we were unable to install the Optane drive, since it turns
out the molmol mobo is an X9SRH-7F, not an X9SRH-7TF as stated on the
wiki, and so there is no spare slot in the machine.
There are a couple of things we could do here:
- get a PCIe x16 splitter/riser and put the two SAS cards on it to free
up a slot
- replace the two 8-port SAS cards with a single 16-port card. This could
be challenging as they are low-profile cards
- put the Optane card in some other machine for some other task
Thoughts/ideas?
Andrew Adamson
bob at ucc.asn.au
|"If you can't beat them, join them, and then beat them." |
| ---Peter's Laws |
On Sun, 3 Mar 2019, Bob Adamson wrote:
> Hi All,
>
> I think we should look for something a bit more enterprisey for this task,
> since it is such a critical component - this machine hosts a lot of club and
> member VM storage, as well as clubroom desktop home directories. Given the
> issues we seem to be having with speeds, I don't think we should skimp on
> disks this time around.
>
> This page, though from 2014, details how we might check the performance of
> SSDs as a Ceph journaling device, which (AIUI) uses synchronous writes
> similar to the requirements of NFS on ZFS:
> https://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/
> The results are somewhat scant, but what is apparent is the
> order-of-magnitude difference in speed between consumer and enterprise
> SSDs for this use.
>
> Despite having many bays on the front, molmol only supports 8 SAS disks, 2
> SATA3 disks, and 4 SATA2 disks. The 8 SAS ports are taken up by spinning
> disks at the moment, and the 2 SATA3 ports are used by the system/SLOG
> disks. There's really no point in using the SATA2 ports due to their speed.
> The mobo is a Supermicro X9SRH-7TF, so it has one PCIe 3.0 x16 slot and
> one PCIe 3.0 x8 slot. Given that the mobo has 10G Ethernet onboard, I think
> both of those slots should be free. The case itself is 2RU, so we could
> support 2 low profile PCIe SSD cards.
>
> Anyway, what I'm thinking is to replace the failing system disk with another
> similar SSD, then chuck a single, fast, PCIe SSD in it for the SLOG and
> L2ARC only. If it fails, AIUI we don't get a corrupt filesystem; we just
> lose the last ~5 seconds of synchronous writes, and only if the host also
> crashes before they are committed (correct me if I'm wrong here?). This is
> based on a few Google results, like
> https://utcc.utoronto.ca/~cks/space/blog/solaris/ZFSSLOGLossEffects
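For what it's worth, carving the Optane into SLOG and L2ARC would look something like the following on FreeBSD. This is a sketch only: the pool name `tank`, the device node `nvd0`, the GPT labels, and the 16G SLOG size are all assumptions, not molmol's actual names.

```shell
# Hypothetical layout: GPT-partition the Optane, label the partitions,
# then attach them to the pool. Adjust names/sizes to the real hardware.
gpart create -s gpt nvd0
gpart add -t freebsd-zfs -l optane-slog -s 16G nvd0   # SLOG needs only a few GB
gpart add -t freebsd-zfs -l optane-l2arc nvd0         # remainder as L2ARC

zpool add tank log gpt/optane-slog     # separate intent log
zpool add tank cache gpt/optane-l2arc  # L2ARC; cache vdevs are never mirrored
```

Note that losing a cache vdev is always harmless, which is part of why a single unmirrored Optane is defensible here.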
>
> $409 plus delivery for an Intel Optane 900P:
> https://www.scorptec.com.au/product/Hard-Drives-&-SSDs/SSD-2.5-&-PCI-Express/70481-SSDPED1D280GASX
>
> Plus $90 to replace the failed system disk with a 250GB 860 EVO:
> https://www.scorptec.com.au/product/Hard-Drives-&-SSDs/SSD-2.5-&-PCI-Express/71382-MZ-76E250BW
>
> $15 delivery
> $514 total
>
> Thoughts?
>
> I'm happy to order it, just approve it at a committee meeting (or outside of
> one via circular) and let me know.
>
> Thanks, Bob
>
> -----Original Message-----
> From: tech-bounces+bob=ucc.gu.uwa.edu.au at ucc.asn.au
> <tech-bounces+bob=ucc.gu.uwa.edu.au at ucc.asn.au> On Behalf Of David Adam
> Sent: Saturday, 2 March 2019 9:27 PM
> To: tech at ucc.gu.uwa.edu.au
> Subject: [tech] Dead disk in Molmol
>
> Hi all,
>
> Molmol has dropped one of its SSDs:
>
> Feb 26 14:15:10 molmol kernel: ahcich1: Timeout on slot 25 port 0
> Feb 26 14:15:10 molmol kernel: ahcich1: is 00000000 cs 02000000 ss 00000000 rs 02000000 tfd c0 serr 00000000 cmd 0004d917
> Feb 26 14:15:10 molmol kernel: (ada1:ahcich1:0:0:0): FLUSHCACHE48. ACB: ea 00 00 00 00 40 00 00 00 00 00 00
> Feb 26 14:30:34 molmol kernel: (ada1:ahcich1:0:0:0): CAM status: Command timeout
> Feb 26 14:30:34 molmol kernel: (ada1:ahcich1:0:0:0): Retrying command
> Feb 26 14:30:34 molmol kernel: ahcich1: AHCI reset: device not ready after 31000ms (tfd = 00000080)
> (etc.)
>
> It's detached from the bus and won't reattach.
>
> The device is a Samsung SSD 840 PRO Series DXM05B0Q (s/n S1ATNSAD864731A)
> - note that there are two of these in the machine! I'm not sure whether it
> is hotpluggable or not.
>
> This SSD was providing one half of the SLOG mirror [1] and a RAID partition
> for the root filesystem. The other half is provided by the other Samsung 840
> PRO:
>
> zpool status (excerpt):
>
>   NAME                     STATE     READ WRITE CKSUM
>   logs
>     mirror-4               DEGRADED     0     0     0
>       5535644740799039914  REMOVED      0     0     0  was /dev/gpt/molmol-slog
>       gpt/molmol-slog0     ONLINE       0     0     0
>
>
> Checking status of gmirror(8) devices:
> Name Status Components
> mirror/gmirror0 DEGRADED ada0p2 (ACTIVE)
>
> If one has gone, I suspect the other is not far behind (SLOG devices do a
> lot of writing), so it is probably worth replacing at least one and possibly
> both.
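Once a replacement SSD is in and partitioned, re-silvering both halves would be roughly as follows. The GUID is the one `zpool status` reports for the REMOVED device; the pool name `tank` and the partition names are assumptions for illustration.

```shell
# Swap the missing SLOG vdev for the new partition, addressing the
# removed device by the GUID shown in zpool status.
zpool replace tank 5535644740799039914 gpt/molmol-slog-new

# Re-add the root-filesystem half on the new disk to the gmirror(8)
# mirror; it will rebuild from the surviving component (ada0p2).
gmirror insert gmirror0 ada1p2
```

Because the SLOG half is a mirror, the pool keeps running degraded in the meantime; the urgency is that the surviving 840 PRO has seen the same write load.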
>
> This may be part of why performance has tanked recently (although I have no
> evidence to support this statement).
>
> They don't need to be big - we're currently using 80 GB of the 256 GB disk
> - but they do need to be reliable and fast. I have zero idea what the best
> part to pick is; any thoughts?
>
> David Adam
> zanchey@
> UCC Wheel Member
>
> [1]:
> https://pthree.org/2012/12/06/zfs-administration-part-iii-the-zfs-intent-log/
>
> _______________________________________________
> List Archives: http://lists.ucc.asn.au/pipermail/tech
>
> Unsubscribe here:
> https://lists.ucc.gu.uwa.edu.au/mailman/options/tech/bob%40ucc.gu.uwa.edu.au
>