[tech] Dead disk in Molmol
Andrew Adamson
bob at ucc.gu.uwa.edu.au
Fri May 10 21:35:30 AWST 2019
Hi All,
So with the help of [DBA], we unracked molmol last week and replaced the
failing SLOG/system disk with a Samsung. So molmol is out of the
woods...ish.
Unfortunately we were unable to install the Optane drive, since it turns
out the molmol mobo is an X9SRH-7F, not an X9SRH-7TF as stated on the
wiki, and so there is no spare slot in the machine.
There are a couple of things we could do here:
- get a PCIe x16 splitter/riser and put the two SAS cards on it to free
up a slot
- replace the two 8-port SAS cards with a single 16-port card. This could
be challenging as they are low-profile cards
- put the Optane card in some other machine for some other task
Thoughts/ideas?
Andrew Adamson
bob at ucc.asn.au
|"If you can't beat them, join them, and then beat them." |
| ---Peter's Laws |
On Sun, 3 Mar 2019, Bob Adamson wrote:
> Hi All,
>
> I think we should look for something a bit more enterprisey for this task,
> since it is such a critical component - this machine hosts a lot of club and
> member VM storage, as well as clubroom desktop home directories. Given the
> issues we seem to be having with speeds, I don't think we should skimp on
> disks this time around.
>
> This page, though from 2014, details how we might check the performance of
> SSDs as a Ceph journaling device, which (AIUI) uses synchronous writes
> similar to the requirements of NFS on ZFS:
> https://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/
> The results are somewhat scant, but what is apparent is the
> order-of-magnitude difference in speed between consumer and enterprise
> SSDs for this use.
>
> Despite having many bays on the front, molmol only supports 8 SAS disks, 2
> SATA3 disks, and 4 SATA2 disks. The 8 SAS ports are taken up by spinning
> disks at the moment, and the 2 SATA3 ports are used by the system/SLOG
> disks. There's really no point in using the SATA2 ports due to their speed.
> The mobo is a Supermicro X9SRH-7TF, so it has one PCIe 3.0 x16 slot and
> one PCIe 3.0 x8 slot. Given that the mobo has 10G Ethernet onboard, I think
> both of those slots should be free. The case itself is 2RU, so we could
> support 2 low profile PCIe SSD cards.
>
> Anyway, what I'm thinking is to replace the failing system disk with another
> similar SSD, then chuck a single, fast, PCIe SSD in it for the SLOG and
> L2ARC only. If it fails, AIUI we don't get a corrupt filesystem; we just
> lose the last ~5 seconds of synchronous writes, and only if the host also
> crashes before they are committed (correct me if I'm wrong here?). This is
> based on a few Google results, like
> https://utcc.utoronto.ca/~cks/space/blog/solaris/ZFSSLOGLossEffects
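For what it's worth, carving the Optane into SLOG and L2ARC would look something like the following on FreeBSD. This is a sketch only: the pool name `tank`, the device node `nvd0`, the GPT labels, and the 16G SLOG size are all assumptions, not molmol's actual names.

```shell
# Hypothetical layout: GPT-partition the Optane, label the partitions,
# then attach them to the pool. Adjust names/sizes to the real hardware.
gpart create -s gpt nvd0
gpart add -t freebsd-zfs -l optane-slog -s 16G nvd0   # SLOG needs only a few GB
gpart add -t freebsd-zfs -l optane-l2arc nvd0         # remainder as L2ARC

zpool add tank log gpt/optane-slog     # separate intent log
zpool add tank cache gpt/optane-l2arc  # L2ARC; cache vdevs are never mirrored
```

Note that losing a cache vdev is always harmless, which is part of why a single unmirrored Optane is defensible here.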
>
> $409 plus delivery for an Intel Optane 900P:
> https://www.scorptec.com.au/product/Hard-Drives-&-SSDs/SSD-2.5-&-PCI-Express/70481-SSDPED1D280GASX
>
> Plus $90 to replace the failed system disk with a 250GB 860 EVO:
> https://www.scorptec.com.au/product/Hard-Drives-&-SSDs/SSD-2.5-&-PCI-Express/71382-MZ-76E250BW
>
> $15 delivery
> $514 total
>
> Thoughts?
>
> I'm happy to order it, just approve it at a committee meeting (or outside of
> one via circular) and let me know.
>
> Thanks, Bob
>
> -----Original Message-----
> From: tech-bounces+bob=ucc.gu.uwa.edu.au at ucc.asn.au
> <tech-bounces+bob=ucc.gu.uwa.edu.au at ucc.asn.au> On Behalf Of David Adam
> Sent: Saturday, 2 March 2019 9:27 PM
> To: tech at ucc.gu.uwa.edu.au
> Subject: [tech] Dead disk in Molmol
>
> Hi all,
>
> Molmol has dropped one of its SSDs:
>
> Feb 26 14:15:10 molmol kernel: ahcich1: Timeout on slot 25 port 0
> Feb 26 14:15:10 molmol kernel: ahcich1: is 00000000 cs 02000000 ss 00000000 rs 02000000 tfd c0 serr 00000000 cmd 0004d917
> Feb 26 14:15:10 molmol kernel: (ada1:ahcich1:0:0:0): FLUSHCACHE48. ACB: ea 00 00 00 00 40 00 00 00 00 00 00
> Feb 26 14:30:34 molmol kernel: (ada1:ahcich1:0:0:0): CAM status: Command timeout
> Feb 26 14:30:34 molmol kernel: (ada1:ahcich1:0:0:0): Retrying command
> Feb 26 14:30:34 molmol kernel: ahcich1: AHCI reset: device not ready after 31000ms (tfd = 00000080)
> (etc.)
>
> It's detached from the bus and won't reattach.
>
> The device is a Samsung SSD 840 PRO Series DXM05B0Q (s/n S1ATNSAD864731A)
> - note that there are two of these in the machine! I'm not sure whether it
> is hotpluggable or not.
>
> This SSD was providing one half of the SLOG mirror [1] and a RAID partition
> for the root filesystem. The other half is provided by the other Samsung 840
> PRO:
>
> zpool status (excerpt):
>
>   NAME                     STATE     READ WRITE CKSUM
>   logs
>     mirror-4               DEGRADED     0     0     0
>       5535644740799039914  REMOVED      0     0     0  was /dev/gpt/molmol-slog
>       gpt/molmol-slog0     ONLINE       0     0     0
>
>
> Checking status of gmirror(8) devices:
> Name Status Components
> mirror/gmirror0 DEGRADED ada0p2 (ACTIVE)
>
> If one has gone, I suspect the other is not far behind (SLOG devices do a
> lot of writing), so it is probably worth replacing at least one and possibly
> both.
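Once a replacement SSD is in and partitioned, re-silvering both halves would be roughly as follows. The GUID is the one `zpool status` reports for the REMOVED device; the pool name `tank` and the partition names are assumptions for illustration.

```shell
# Swap the missing SLOG vdev for the new partition, addressing the
# removed device by the GUID shown in zpool status.
zpool replace tank 5535644740799039914 gpt/molmol-slog-new

# Re-add the root-filesystem half on the new disk to the gmirror(8)
# mirror; it will rebuild from the surviving component (ada0p2).
gmirror insert gmirror0 ada1p2
```

Because the SLOG half is a mirror, the pool keeps running degraded in the meantime; the urgency is that the surviving 840 PRO has seen the same write load.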
>
> This may be part of why performance has tanked recently (although I have no
> evidence to support this statement).
>
> They don't need to be big - we're currently using 80 GB of the 256 GB disk
> - but they do need to be reliable and fast. I have zero idea what the best
> part to pick is; any thoughts?
>
> David Adam
> zanchey@
> UCC Wheel Member
>
> [1]:
> https://pthree.org/2012/12/06/zfs-administration-part-iii-the-zfs-intent-log/
>
> _______________________________________________
> List Archives: http://lists.ucc.asn.au/pipermail/tech
>
> Unsubscribe here:
> https://lists.ucc.gu.uwa.edu.au/mailman/options/tech/bob%40ucc.gu.uwa.edu.au
>