<html>Hi All,<br /><br />As you might be aware, we've been getting abysmal IO performance with our Ceph cluster as of late.<br /><br />Before I get into the meat of why, I would like to propose/request - to (hopefully) fix the issue - that I/we, for each of the QVO SSDs, one at a time:<br />- Ask Ceph to move all data off it and redistribute the data among the rest of the cluster<br />- Remove the SSD from the Ceph cluster<br />- Benchmark its current read and (destructive) write performance with `dd`<br />- Run a whole-disk `blkdiscard` on it - to notify the SSD that these blocks are unused<br />- Re-run the above benchmarks, and then do another `blkdiscard` to free up the benchmarked blocks<br />- Partition the SSD with GPT, create a partition for Ceph, <em>and leave 100GiB unallocated at the end, after this partition</em><br />- Add this partition back into the Ceph cluster<br /><br />Here's the meat of why I think we should do this:<br />- Having a quick peek at Ceph > OSD for any given host in our cluster, the "Apply/Commit Latency" for the QVO SSDs is on the order of hundreds of milliseconds, compared to under 15ms for all the other SSDs<br />- The other SSDs are all Samsung EVO or PRO series SSDs, but more importantly, <em>have Ceph storing its data in a partition smaller than the total size of the SSD</em> (i.e.
only 420GiB of the 465GiB~1TiB actual SSD capacity, depending on host)<br />- The QVO SSDs, for comparison, have the whole SSD allocated to Ceph, with no space left over<br />- As such, for the QVOs, I suspect the bad performance we're seeing is symptomatic of the SSD's controller thinking that all blocks are in use<br />- An SSD's controller would typically have such thoughts if every block has been written to at least once, <em>and</em> hasn't been marked as free again through the use of the TRIM/Discard SATA/SCSI command<br />- In the case of Ceph, <em>it does</em> support TRIM/Discard commands being issued by VMs using virtio-scsi disks, however, this gets passed on to the Ceph Block Device, which <em>in Ceph</em> then marks the block as available (to be re-used by Ceph again)<br />- Critically, these TRIM/Discard commands/notifications do not <em>seem</em> to get passed on to the SSD itself<br />- As additional support for this, I ran `tail -c 8G /dev/sdd | hd | less` on Mudkip (sdd being one of the QVOs), and despite Ceph only reporting ~30% space utilisation of these disks in the cluster, there was plenty of leftover data in the last 8GiB at the end of the disk<br />- This makes me <em>strongly suspect</em> that every block on this SSD has - at some point - been written to by Ceph, with it likely just leaving the data there when it no longer needs it (and overwriting it later when it needs to), rather than issuing a TRIM notification to the SSD<br />- For comparison, reading back the last 8GiB of /dev/sda (the PRO SSD) yields mostly zeroes, as expected, as the Ceph partition doesn't extend to the end of this SSD.
These (logical) blocks have likely <em>never</em> been written (by the OS or Ceph), and thus would still be marked by the SSD's controller as spare<br />- Such spare blocks are then always on hand and ready for when the SSD needs to do a read-modify-write cycle, i.e.: read the current ~128KiB "erase-block" into DRAM, write the 512B of data that has changed to this copy in DRAM, flush this modified copy to a spare "erase-block", point the 512B logical block the OS sees to the new 128KiB erase-block, mark the old erase-block for garbage collection, eventually erase it asynchronously during the next GC cycle, and then add it back to the spare pile<br />- As such, because of the way SSDs operate (every 512B write tends to need a "read-128KiB-erase-block, change 512B of it, write it to a spare erase-block that's been erased already" cycle), it's imperative that the SSD has a sufficiently large pool of spare blocks it can write to on demand<br />- This pool of spare blocks can either be maintained by simply never writing to a bunch of (logical) blocks (from the OS's perspective) and thus never making the SSD mark them as used, or by explicitly telling the SSD that it can return a given block (or set thereof) to its spare pool by issuing a TRIM/Discard SATA/SCSI command for that block (or set thereof)<br />- Otherwise, without that, the SSD has to frantically (and synchronously) erase and reshuffle blocks whilst the data is being written, severely degrading performance<br />- Finally, doing just a simple read benchmark (using `dd if=/dev/sdd bs=1M of=/dev/null status=progress` and `dd if=/dev/sda bs=1M of=/dev/null status=progress`) of both the QVO and the PRO SSDs yields ~50MB/s and ~500MB/s read speeds respectively, so that seemingly confirms that the bottleneck is on the QVO SSD somewhere, rather than somewhere in the Ceph stack<br />- I don't know why the SSD controller not having any spare blocks would impact <em>read</em> performance as well (vs.
just write), but I'm guessing it's the result of some sort of low-level fragmentation, e.g. multiple non-contiguous logical blocks being crammed together in contiguous erase-blocks, and the SSD not having enough breathing room to reshuffle these properly<br /><br />Anyway, would you all be happy for me to proceed with this proposal?<br /><br />--<br />Thanks, and kind regards,<br />Dylan Hicks [333]</html>