[tech] Proposal to fix bad IO performance with Ceph on our cluster
Dylan Hicks
dylanh333 at ucc.gu.uwa.edu.au
Thu Oct 8 22:46:26 AWST 2020
Hi All,
As you might be aware, we've been getting abysmal IO performance with our Ceph cluster as of late.
Before I get into the meat of why, here's what I'd like to propose/request to (hopefully) fix the issue. For each of the QVO SSDs, one at a time, I/we would (rough command sketches for each step follow the list):
- Ask Ceph to move all data off it and redistribute the data among the rest of the cluster
- Remove the SSD from the Ceph cluster
- Benchmark its current read and (destructive) write performance with `dd`
- Run a whole-disk `blkdiscard` on it - to notify the SSD that these blocks are unused
- Re-run the above benchmarks, and then do another `blkdiscard` to free up the benchmarked blocks
- Partition the SSD with GPT, create a partition for Ceph, and leave 100GiB unallocated at the end, after this partition
- Add this partition back into the Ceph cluster
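For the first two steps, something like the following should do the trick when run on the host that owns the OSD (the OSD ID 12 and device /dev/sdd below are placeholders - substitute the real ones per disk - and this assumes a systemd-managed OSD; the equivalent Out/Stop/Destroy buttons in the Proxmox UI should achieve the same thing):

    ceph osd out osd.12                            # stop placing data on it; Ceph rebalances the data off it
    while ! ceph osd safe-to-destroy osd.12; do sleep 60; done   # wait until no placement groups still depend on it
    systemctl stop ceph-osd@12                     # stop the OSD daemon on the host
    ceph osd purge osd.12 --yes-i-really-mean-it   # remove it from the CRUSH map and the cluster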
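For the benchmark and discard steps, I'm picturing something along these lines (again, /dev/sdd is a placeholder, and the write test obviously destroys whatever is left on the disk - hence only doing it after the OSD has been removed):

    # Sequential read benchmark (non-destructive), bypassing the page cache
    dd if=/dev/sdd of=/dev/null bs=1M count=8192 iflag=direct status=progress
    # Sequential write benchmark (DESTRUCTIVE)
    dd if=/dev/zero of=/dev/sdd bs=1M count=8192 oflag=direct status=progress
    # Tell the SSD's controller that every block on the disk is now unused
    blkdiscard /dev/sdd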
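Then for re-partitioning and re-adding it, roughly the following (sgdisk's "-100G" end position means "stop the partition 100GiB short of the end of the disk"; whether we add it back with plain ceph-volume as below, or via the Proxmox tooling, is open to discussion):

    sgdisk --zap-all /dev/sdd                   # wipe the old partition table
    sgdisk -n 1:0:-100G /dev/sdd                # one partition, default start, leaving 100GiB unallocated at the end
    ceph-volume lvm create --data /dev/sdd1     # create a new OSD on that partition and bring it back into the cluster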
Here's the meat of why I think we should do this:
- Having a quick peek at Ceph > OSD for any given host in our cluster, the "Apply/Commit Latency" for the QVO SSDs is on the order of 100s of milliseconds, compared to <15ms for all the other SSDs (these numbers can also be pulled from the CLI - see the note after this list)
- The other SSDs are all Samsung EVO or PRO series SSDs, but more importantly, have Ceph storing its data in a partition smaller than the total size of the SSD (ie. only 420GiB of the 465GiB~1TiB actual SSD capacity, depending on host)
- The QVO SSDs, for comparison, have the whole SSD allocated to Ceph, with no space left over
- As such, for the QVOs, I suspect the bad performance we're seeing is likely symptomatic of the SSD's controller thinking that all blocks are in use
- An SSD's controller will typically think this if every block has been written to at least once, and hasn't been marked as free again through the use of the TRIM/Discard SATA/SCSI command
- In the case of Ceph, it does support TRIM/Discard commands being issued by VMs using virtio-scsi disks; however, these get passed on to the Ceph Block Device, which then just marks the blocks as available to be re-used by Ceph
- Critically, these TRIM/Discard commands/notifications do not seem to get passed on to the SSD itself
- As additional support for this, I ran `tail -c 8G /dev/sdd | hd | less` on Mudkip (sdd being one of the QVOs), and despite Ceph only reporting ~30% space utilisation of these disks in the cluster, there was plenty of leftover data in the last 8GiB of the disk
- This makes me strongly suspect that every block on this SSD has - at some point - been written to by Ceph, with it likely just leaving the data there when it no longer needs it (and overwriting it later when it needs to), rather than issuing a TRIM notification to the SSD
- For comparison, reading back the last 8GiB of /dev/sda (the PRO SSD) yields mostly zeroes, as expected, as the Ceph partition doesn't extend to the end of this SSD. These (logical) blocks have likely never been written (by the OS or Ceph), and thus would still be marked by the SSD's controller as spare
- Such spare blocks are then always on-hand and ready for when the SSD needs to do a read-modify-write cycle, ie.: read the current ~128KiB "erase block" into DRAM; write the 512B of data that has changed into this copy in DRAM; flush the modified copy to a spare erase block; point the 512B logical block the OS sees at the new 128KiB erase block; mark the old erase block for garbage collection; eventually erase it asynchronously during the next GC cycle; and then add it back to the spare pile
- As such, because of the way SSDs operate (every 512B write tends to need a "read the 128KiB erase block, change 512B of it, write it out to a spare erase block that's already been erased" cycle - see the back-of-the-envelope numbers after this list), it's imperative that the SSD has a sufficiently large pool of spare blocks it can write to on demand
- This pool of spare blocks can either be maintained by simply never writing to a bunch of (logical) blocks (from the OS's perspective) and thus never making the SSD mark them as used, or by explicitly telling the SSD that it can return a given block (or set thereof) to its spare pool by issuing a TRIM/Discard SATA/SCSI command for that block (or set thereof)
- Otherwise, without that, the SSD has to frantically (and synchronously) erase and reshuffle blocks whilst the data is being written, severely degrading performance
- Finally, doing just a simple read benchmark (using `dd if=/dev/sdd bs=1M of=/dev/null status=progress` and `dd if=/dev/sda bs=1M of=/dev/null status=progress`) of both the QVO and the PRO SSDs yields ~50MB/s and ~500MB/s read speeds respectively, so that seemingly confirms that the bottleneck is on the QVO SSD somewhere, rather than somewhere in the Ceph stack
- I don't know why the SSD controller not having any spare blocks would impact read performance as well (vs. just write), but I'm guessing it's the result of some sort of low-level fragmentation, eg. multiple non-contiguous logical blocks being crammed together in contiguous erase blocks, and the SSD not having enough breathing room to reshuffle these properly
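As referenced above, for anyone who wants to see those latency figures without clicking through the UI, they're also available from the CLI:

    ceph osd perf    # prints per-OSD commit/apply latency in milliseconds

The QVO-backed OSDs should stand out immediately in that output.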
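And to put a rough, back-of-the-envelope number on the read-modify-write point above: if a 512B random write lands in an erase block that's already full, the controller ends up rewriting ~128KiB of flash for 512B of useful data - roughly 128KiB / 512B = 256x write amplification in the worst case - and if there's no pre-erased spare block to absorb it, it also has to do the erase synchronously first, which is the really slow part.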
Anyway, would you all be happy for me to proceed with this proposal?
--
Thanks, and kind regards,
Dylan Hicks [333]