[tech] UCC backup status, was Re: Cron <backups at mollitz> /backups/bin/rdiff-manager/rdiff-manager.py (fwd)
Nick Bannon
nick at ucc.gu.uwa.edu.au
Mon May 11 22:25:34 AWST 2020
On Sun, Sep 17, 2017 at 11:21:17AM +0800, David Adam wrote:
> On Wed, 9 Aug 2017, David Adam wrote:
> > Backups for molmol have been failing for the last month or so as the
> > backup server (Mollitz) does not have enough space.
Plus ça change, plus c'est la même chose. (The more things change, the
more they stay the same.)
> > I am planning to drop the backups of /away in order to maintain the
> > backups of things that actually matter, like /services.
> >
> > Mollitz has four 2 TB drives in it. Three of them are RAID-5ed to make a
> > 4 TB array using the hardware RAID and an ext4 filesystem, and the
> > fourth is a single RAID-0 which has a ZFS pool on it. (The ZFS pool uses
> > compression, which means it can store a bit more data.)
> >
> > At some stage, we could delete the array and redo it so that it has a
> > RAID-5 using all four drives, but that involves losing all the data on the
> > array so I have been avoiding it.
Thank you muchly, David, for getting that done and much more, and for
paying at least occasional attention for so long!
mollitz ran out of space again a while ago, but it was some time before
it was noticed, in the lead-up to a tech/wheel meeting.
(Next one coming soon, Saturday 2020-05-23 14:00 !)
I've spent a month, on and off, poking it back into action, but there
have been some hiccups along the way.
The general status is:
* https://wiki.ucc.asn.au/Backups
* mollitz, a Dell PowerEdge 2950 with 8GiB of RAM, boots off a 60GB
OCZ-VERTEX2 SSD
* it has 6TB (5.4TiB) of /backups space, RAID-5 over 4*2TB drives
* it can't easily fit more drives
* the PERC 5/i RAID controller can't use larger capacity drives
* it would normally use https://gitlab.ucc.asn.au/UCC/rdiff-manager to
run rdiff-backup over ssh, to fetch daily backups from some UCC hosts
(there's a sketch of the kind of command it runs just after this list)
* Not as many hosts as we might like: it prioritises the most important
data, excludes some things, and skips the scratch areas and member
VMs. Please refer to the SLA.
* Data backups that one can restore selectively from, but not the sort
of full images you could boot straight up on the Proxmox cluster or
replacement hardware
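For anyone unfamiliar with it, rdiff-manager essentially wraps per-host
commands of roughly this shape - the user, paths and excludes below are
only illustrative, not the real configuration (rdiff-backup 1.x remote
syntax, which tunnels over ssh):

  # Sketch only: pull /services from molmol over ssh into the local
  # repository; --print-statistics summarises what changed.
  rdiff-backup --print-statistics \
    --exclude '/services/some-scratch-area' \
    root@molmol.ucc.asn.au::/services /backups/molmol/services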
> I don't know why it didn't occur to me before, but I've split the backups
> instead - /away is now backed up to /backups/away using "away.ucc..." as
> the hostname, while everything else on molmol remains on /backups/molmol.
Handy! I'm thinking we should probably do that with motsugo / home, and
exclude part of /home from one of them.
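A hypothetical version of that split, in the same spirit as the
molmol/away.ucc one above (user, paths and excludes made up for
illustration):

  # Give /home its own repository under its own "hostname", drop it
  # from motsugo's main backup, and skip part of it (e.g. tmp dirs).
  rdiff-backup --exclude /home \
    root@motsugo.ucc.asn.au::/ /backups/motsugo
  rdiff-backup --exclude '/home/*/tmp' \
    root@motsugo.ucc.asn.au::/home /backups/home.ucc.asn.au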
[...]
> I've upgraded mollitz to Debian 9 (stretch), mainly because I needed to
> reboot anyway.
> [DAA]
Time for Debian 10 (buster)!
...and maybe the recently released rdiff-backup v2 at some stage, new
to Debian testing, though servers and clients both need to be upgraded
at the same time.
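Before doing that it's worth confirming what both ends are actually
running, something along these lines (hostname just an example):

  # Compare rdiff-backup versions locally and on a backed-up host
  rdiff-backup --version
  ssh motsugo.ucc.asn.au rdiff-backup --version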
Issues so far:
* backups started failing and no-one noticed for a while
* so we could use an extra volunteer or two to watch hostperson at ucc email
* but one could also add some specialised grafana alerts like the
Discord disk space alerts - very handy!
* we don't have "negative" alerts yet: a full drive might be
noticed, but a host that stops reporting might not
* https://grafana.com/about/events/grafanacon/2020/ is starting this
week, a few video sessions over a couple of weeks, starting with a
keynote address just after midnight this Wednesday night/Thursday
morning
* mollitz:/backups was _completely_ full, 100%, no reserve space,
presumably quite fragmented as it tried to squeeze in those last
few files
* To --check-destination-dir and fix/roll back a failed backup run,
temporary space is needed, possibly as much as 100GB or the largest
single file/increment encountered (see the sketch after this list).
* Even getting a "du" or rdiff-backup --list-increment-sizes takes hours:
100M to 124M inodes, mostly in molmol and motsugo, both of which had
failed their last attempted backup and needed space-hungry "regressing"
(example commands after this list).
* The RAID controller is working fine, but this is RAID-5 on
spinning rust with _lots_ of files and directories
* It turns out some busy files never get backed up anyway: some log
files and busy sqlite files, which change between the start of a copy
and the end. They get error messages such as:
UpdateError space/services/webcam/ipcamera10-old.jpg
Updated mirror temp file
/backups/molmol/space/services/webcam/rdiff-backup.tmp.38474907 does
not match source
* We don't save much space from fewer incrementals - daily changes (of
the things we're not excluding) are much less than 1% of our static
total space usage
* motsugo's backup took 24 hours to get back into sync
* molmol's backup took 221 hours to get back into sync
* mooneye's and maybe medico's backups were failing due to our DNS
issues, see UWA ServiceNow bug INC0467345
* Getting a full du(1) takes 8 hours, full stats takes >24 hours
* ...and now, the last few motsugo backups have failed
* There's an error path in rdiff-manager that was taken on Wednesday
2020-04-29: it started an rdiff-backup instance with "--exclude **",
which is probably an elegant self-correction measure, but it could
result in downloading a full backup again. It was still going 20
hours later when I interrupted it, and a backup has not completed since.
* New backup attempts start with a "regress" phase, which takes ~90 minutes
* Later, they error out with a large traceback, starting with:
Exception '[Errno 13] Permission denied' raised of class '<type 'exceptions.IOError'>': File "/usr/lib/python2.7/dist-packages/rdiff_backup/Main.py", line 304, in error_check_Main
* Full log (and an abbreviated verbose run):
mollitz:/home/tmp-20200409/unclog.i.motsugo-error.log
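For reference, the "regress" step mentioned above is rdiff-backup's
repository consistency check and rollback; run by hand it looks roughly
like this (path as an example), and it's what needs the temporary space
mentioned earlier:

  # Roll /backups/molmol back to its last consistent state after an
  # interrupted or failed run ("regressing" the repository).
  rdiff-backup --check-destination-dir /backups/molmol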
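And the slow inventory commands referred to above, again with example
paths - either one is an hours-long job with ~100M inodes on this array:

  # Space used by the mirror and each chain of increments
  rdiff-backup --list-increment-sizes /backups/molmol
  # Plain disk usage per backed-up host
  du -sh /backups/*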
Nick.
--
Nick Bannon | "I made this letter longer than usual because
nick-sig at rcpt.to | I lack the time to make it shorter." - Pascal