[tech] Proposal to fix bad IO performance with Ceph on our cluster
Nick Bannon
nick at ucc.gu.uwa.edu.au
Thu Dec 10 14:45:30 AWST 2020
(mailfish problems? Try #2)
Turns out: it wasn't just discard/fstrim/TRIM/UNMAP
- but we should get a cron'ed weekly fstrim(8) into the SOE (see the first
sketch after this list)
- To be certain, we partitioned the QVOs as 90% /dev/sdX1 and 10%
free space that we could blkdiscard(8)
- it did not fix things
- In the meantime, [MPT] added Ceph CRUSH rules to keep vmstore-ssd pool
data on the fast drives (see the second sketch after this list).
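
For the SOE change, a minimal sketch of what I mean - the script path is
just the usual cron.weekly convention, and the offset/device in the
blkdiscard example are purely hypothetical:

    #!/bin/sh
    # /etc/cron.weekly/fstrim : weekly TRIM of every mounted filesystem
    # that supports discard
    exec /sbin/fstrim --all --verbose

The one-off discard of the free 10% tail was along these lines (shown for
a hypothetical 1TiB QVO whose /dev/sdX1 ends at the 900GiB mark):

    # Discard everything from the end of the partition to the end of the
    # disk; double-check the offset against the partition table first
    blkdiscard --offset 900GiB /dev/sdX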
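
I don't have [MPT]'s exact rules in front of me, but the standard
device-class way of doing it looks something like this (the rule name and
osd id are just examples; vmstore-ssd is ours):

    # Tag the fast drives with a device class, if the auto-detected one
    # isn't what we want
    ceph osd crush rm-device-class osd.1
    ceph osd crush set-device-class ssd osd.1

    # Replicated rule restricted to that class, then point the pool at it
    # (this kicks off a rebalance of the pool's data)
    ceph osd crush rule create-replicated vmstore-fast default host ssd
    ceph osd pool set vmstore-ssd crush_rule vmstore-fast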
What has helped:
- I freed up magikarp's Optane and used 30% (80GB) as a block.db
- which also means block.wal, the Write Ahead Log, ended up there as well
- iostat(1) says that Ceph is hardly using it, so it could probably
be 40GB, or even just 1GB for the WAL, as long as it's fast (checks
sketched below)
- https://docs.ceph.com/en/latest/rados/configuration/bluestore-config-ref/#sizing
- http://uccmonitor.ucc.asn.au:3000/d/Fj5fAfzik123/ceph-osd-single?from=now-7d&var-osd=osd.6
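
If anyone wants to double-check the "hardly using it" claim, this is
roughly what I was looking at - assuming the Optane shows up as nvme0n1,
and noting that the ceph daemon call has to run on magikarp itself:

    # Extended per-device stats for the Optane, every 5 seconds
    iostat -x nvme0n1 5

    # BlueFS usage straight from the OSD: db_used_bytes vs db_total_bytes,
    # wal_used_bytes, and slow_used_bytes (spillover onto the main device)
    ceph daemon osd.6 perf dump bluefs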
Currently, I've kicked out mudkip's osd.4 and osd.5 to see if I can do
something similar there. There's a rebalance going on which will probably
take about 4 hours, total.
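
"Kicked out" here is just the usual mark-out-and-wait, nothing
destructive - roughly this, not an exact transcript:

    # Mark the OSDs out so their PGs backfill onto the other OSDs;
    # the daemons themselves keep running
    ceph osd out osd.4 osd.5

    # Keep an eye on the rebalance
    ceph -s
    ceph osd df tree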
On Thu, Oct 08, 2020 at 10:46:26PM +0800, Dylan Hicks wrote:
[...]
> - Having a quick peek at Ceph > OSD for any given host in our cluster, the "Apply/Commit Latency" for the QVO SSDs is in the order of 100s of milliseconds, compared to <15ms for all the other SSDs
I'm wondering if those are the latencies for whole 4MiB blocks or
similar - there are also read op/write op latencies, which are much closer
to what I'd expect for an SSD (any SSD).
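
For anyone following along without dashboard access, similar numbers are
visible from the CLI (osd.6 below is just an example, and the daemon call
needs to run on that OSD's host):

    # Per-OSD commit/apply latency in ms, as reported to the monitors
    ceph osd perf

    # Per-op read/write latencies for a single OSD: look for
    # op_r_latency / op_w_latency in the output
    ceph daemon osd.6 perf dump osd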
> - The other SSDs are all Samsung EVO or PRO series SSDs [...]
> - The QVO SSDs, for comparison, have the whole SSD allocated to Ceph, with no space left over
[...]
medico/osd.1 has been upgraded from a 500GB Samsung 850 EVO to a shiny
new 2TB Samsung PRO.
Nick.
--
Nick Bannon | "I made this letter longer than usual because
nick-sig at rcpt.to | I lack the time to make it shorter." - Pascal