From bob at ucc.gu.uwa.edu.au  Tue Aug 1 09:53:36 2017
From: bob at ucc.gu.uwa.edu.au (Andrew Adamson)
Date: Tue, 1 Aug 2017 09:53:36 +0800 (AWST)
Subject: [tech] dcc on mooneye
Message-ID:

CC-ing to tech@ so that others may learn...

For those reading at home, DCC is a hash sharing system used in spam
filtering to recognise bulk email: https://www.rhyolite.com/dcc/

We were using DCC until a Debian upgrade meant zanchey had to restore a
default spamassassin config. I haven't bothered looking at it since,
because the spam filtering was working well enough for my needs.

The DCC module is loaded with a line in /etc/spamassassin/v310.pre, but
the local.cf config isn't telling spamassassin to filter traffic using
said module. If anyone is keen, the old config on mooneye is in
/etc/spamassassin/local.cf.dpkg-old and would need porting to
/etc/spamassassin/local.cf. It would be good if we could re-enable pyzor
and razor at the same time (more spam reporting and detection services).

If a non-wheel member wants to take a look at this, poke me and we'll see
what we can do.
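To give a rough idea of the moving parts, the end state would look
something like this - a sketch only; the real values should be ported
from local.cf.dpkg-old, and the scores here are placeholders rather than
recommendations:

  # /etc/spamassassin/v310.pre - the plugins must be loaded here
  loadplugin Mail::SpamAssassin::Plugin::DCC
  loadplugin Mail::SpamAssassin::Plugin::Pyzor
  loadplugin Mail::SpamAssassin::Plugin::Razor2

  # /etc/spamassassin/local.cf - turn the checks on and weight them
  use_dcc     1
  use_pyzor   1
  use_razor2  1
  score DCC_CHECK    2.0
  score PYZOR_CHECK  2.0
  score RAZOR2_CHECK 2.0

`spamassassin --lint` will complain if any of that is wrong, and spamd
needs a restart afterwards. Note that the dccifd error quoted below also
suggests the DCC client itself needs to be installed and running.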
Andrew Adamson
bob at ucc.asn.au

|"If you can't beat them, join them, and then beat them." |
|    ---Peter's Laws                                      |

On Mon, 31 Jul 2017, Matt Johnston wrote:

> I restarted postfix on mooneye after the outage, the logs have this. Is
> dccifd meant to be running, or does spamassassin's config need that
> disabled?
>
> Jul 31 22:14:47 mooneye spamd[10989]: dcc: failed to connect to local
> socket /var/dcc/dccifd


From zanchey at ucc.gu.uwa.edu.au  Sun Aug 6 21:42:51 2017
From: zanchey at ucc.gu.uwa.edu.au (David Adam)
Date: Sun, 6 Aug 2017 21:42:51 +0800 (AWST)
Subject: [tech] Bitumen supervisor module problems
Message-ID:

In short: Bitumen's second supervisor module (which runs the switch) has
a memory problem and won't POST any more. I don't know whether it's worth
replacing or just removing it; we don't really use the redundancy feature
at all. I've set it to "boot as needed" (cold standby, or RPR mode)
instead of "be running and able to take over at any time" (hot standby,
or SSO mode), which will at least stop it crashing continuously.

---

In long:

On IRC, [BOB] asked:

> Any idea what's up with bitumen that's making rancid send out so many
> emails? I see some failed.txt logs appearing on bitumen but wouldn't
> know where to even start investigating

Rancid is the switch configuration tracker:
http://cvs.ucc.asn.au/cgi-bin/viewvc.cgi/rancid/ucc/
(I mucked up the metadata, so all the old switches look like their latest
config is from today - the SVN log shows otherwise.)

The first place I looked was the event log on bitumen (`show log`), which
is synced across to Murasoi (/var/log/ucc/cisco.log). It's filled with
entries like these:

Aug 6 17:10:37 bitumen AWST: %C4K_REDUNDANCY-6-DUPLEX_MODE: The peer Supervisor has been detected
Aug 6 17:10:51 bitumen AWST: %C4K_REDUNDANCY-2-POSTFAIL: POST failure on STANDBY supervisor detected. Supervisor redundancy is **NOT** available. Check and replace failed Supervisor.
Aug 6 17:10:53 bitumen AWST: %C4K_REDUNDANCY-6-MODE: ACTIVE supervisor initializing for sso mode
Aug 6 17:10:53 bitumen AWST: %C4K_REDUNDANCY-3-COMMUNICATION: Communication with the peer Supervisor has been established
Aug 6 17:11:01 bitumen AWST: %C4K_REDUNDANCY-5-CONFIGSYNC: The bootvar has been successfully synchronized to the standby supervisor
Aug 6 17:11:01 bitumen AWST: %C4K_REDUNDANCY-5-CONFIGSYNC: The config-reg has been successfully synchronized to the standby supervisor
Aug 6 17:11:01 bitumen AWST: %C4K_REDUNDANCY-5-CONFIGSYNC: The startup-config has been successfully synchronized to the standby supervisor
Aug 6 17:11:01 bitumen AWST: %C4K_REDUNDANCY-5-CONFIGSYNC: The private-config has been successfully synchronized to the standby supervisor
Aug 6 17:11:04 bitumen AWST: %C4K_REDUNDANCY-5-CONFIGSYNC_RATELIMIT: The vlan database has been successfully synchronized to the standby supervisor

The second thing was to look at the diffs in Rancid that [BOB] was
complaining about:
http://cvs.ucc.asn.au/cgi-bin/viewvc.cgi/rancid/ucc/configs/bitumen.ucc.gu.uwa.edu.au?view=log&pathrev=387

Looks like on August 4 a whole bunch of files appeared (then disappeared,
then reappeared, etc. etc.). Revision 373 is the useful one.

bitumen#dir bootflash:
%Error opening bootflash:/ (No more memory for file record)

I tried running `squeeze bootflash:`, but that didn't work either (no
deleted files to squeeze out). The diff at revision 373 did give me a
file list to work from, though:

bitumen#delete bootflash:
Delete filename []? post-2017.05.01.09.16.10-failed.txt
Delete bootflash:post-2017.05.01.09.16.10-failed.txt? [confirm]

(repeat 7000 times to remove half of the entries - running for four hours
at last count, done via vim and a tmux paste buffer)

Meanwhile, I realised that the oldest file clogging up the flash was from
May, so in fact that was not the problem.

The documentation for the current OS is at
https://www.cisco.com/c/en/us/td/docs/switches/lan/catalyst4500/12-2/54sg/configuration/guide/config.html

In particular, the chapter on supervisor engine redundancy describes two
modes: SSO (hot spare) and RPR (cold spare). Our system was set up with
SSO redundancy, but in that mode the failed supervisor crashes
continuously. Changing the configuration to RPR mode meant that the
second supervisor engine didn't crash any more, which gave me a chance to
inspect the POST logs in slavebootflash:

---
bitumen# cd slavebootflash:
bitumen# dir
bitumen# more post-2017.08.06.03.48.54-failed.txt

Power-on-self-test for Module 2: WS-X4515
Port/Test Status: (. = Pass, F = Fail, U = Untested)
 ...
 Switch Subsystem Memory ...
  49: .  50: .  51: .  52: .  53: .  54: F  55: .

Module 2 Failed
---

So - I think the secondary supervisor module's memory is stuffed, which
means the supervisor engine won't start. I'm not sure if it's worth
trying to replace, or just removing the problem part. Setting it to cold
standby will stop it filling the disk with complaints, and shouldn't
affect the actual operation of the switch.
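For the record, the mode change itself is small - something like the
following, from memory of the Catalyst 4500 syntax rather than a capture
of the actual session:

bitumen# configure terminal
bitumen(config)# redundancy
bitumen(config-red)# mode rpr
bitumen(config-red)# end
bitumen# show redundancy states

`show redundancy states` should afterwards report the operating
redundancy mode as RPR, with the peer held in cold standby.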
[DAA]


From zanchey at ucc.gu.uwa.edu.au  Wed Aug 9 21:22:06 2017
From: zanchey at ucc.gu.uwa.edu.au (David Adam)
Date: Wed, 9 Aug 2017 21:22:06 +0800 (AWST)
Subject: [tech] Cron /backups/bin/rdiff-manager/rdiff-manager.py (fwd)
Message-ID:

Backups for molmol have been failing for the last month or so as the
backup server (Mollitz) does not have enough space. I am planning to drop
the backups of /away in order to maintain the backups of things that
actually matter, like /services.

Mollitz has four 2 TB drives in it. Three of them are RAID-5ed into a
4 TB array using the hardware RAID, with an ext4 filesystem on top, and
the fourth is a single-drive RAID-0 volume carrying a ZFS pool. (The ZFS
pool uses compression, which means it can store a bit more data.) At some
stage we could delete the array and rebuild it as a RAID-5 across all
four drives, but that means losing all the data on the array, so I have
been avoiding it.

David Adam
zanchey at ucc.gu.uwa.edu.au

---------- Forwarded message ----------
Date: Wed, 19 Jul 2017 04:01:21 +0800 (AWST)
From: Cron Daemon
Subject: Cron /backups/bin/rdiff-manager/rdiff-manager.py

----------------------------------------------------------------------------
RDIFF-MANAGER REPORT for run started Wed Jul 19 02:00:01 2017
----------------------------------------------------------------------------
SUMMARY:
Backup succeeded for motsugo
Backup succeeded for merlo
Backup succeeded for mooneye
Backup succeeded for heathred
Backup succeeded for mollitz
Backup succeeded for gitlab
Backup succeeded for medico
Backup succeeded for murasoi
Backup FAILED for molmol
Backup succeeded for mussel
----------------------------------------------------------------------------
Backup results for molmol

*-----------------------------------------------------------------------------*
| This is node Molmol at The University of WA - for authorised clients only   |
|                                                                             |
| WARNING: Misuse of computer access can attract criminal penalties, civil    |
| liability for third party loss and university disciplinary action.          |
*-----------------------------------------------------------------------------*

Exception '[Errno 28] No space left on device:
'/backups/molmol/rdiff-backup-data/rdiff-backup.tmp.0'' raised of class '':


From zanchey at ucc.gu.uwa.edu.au  Tue Aug 15 10:53:54 2017
From: zanchey at ucc.gu.uwa.edu.au (David Adam)
Date: Tue, 15 Aug 2017 10:53:54 +0800 (AWST)
Subject: [tech] Molmol FreeBSD / Samba upgrades
Message-ID:

I upgraded Molmol to FreeBSD 11.1 (from 11.0) and Samba 4.6 (from 4.4)
yesterday. Everything seems to work.

[DAA]
zanchey@
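For anyone repeating this on another box, the usual binary-update route
for a minor FreeBSD release bump plus packaged Samba looks roughly like
this - a sketch of the standard procedure, not a transcript of this
particular upgrade (check the release notes and /usr/ports/UPDATING
first):

  freebsd-update -r 11.1-RELEASE upgrade
  freebsd-update install
  shutdown -r now
  freebsd-update install   # second pass after the reboot
  pkg upgrade              # plus swapping the versioned Samba package
                           # (samba44 -> samba46) if it's installed that way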