[tech] Tech/Wheel Meeting 2021-10-10 14:00 - One hour reminder

Mark Tearle mtearle at tearle.com
Mon Oct 11 11:44:46 AWST 2021


Hi folks

Apologies for accidentally missing this yesterday. I'll have to catch up on it with the Monday night crowd tonight.

Cheers,
Mark

-- 
Mark Tearle <mtearle at tearle.com>

On Sun, 10 Oct 2021, at 1:00 PM, root wrote:
> Tech/Wheel Meeting Agenda - Sunday 2021-10-10 14:00
> ===================================================
> - VENUE: UCC Clubroom
>   - and online at https://meetings.ucc.asn.au/b/bob-yrk-uy6
>
> *Meeting opened hh:mm*
>
> ## Attendance
> - Present
> - Apologies
> - Absent
>
> ## Next meeting
> - Schedule next meeting
>   - *day 202Y-MM-ddThh:mm
>     - let's have more doing than talking? next one pre-O-Day, 2022?
>   - ACTION: [???] Set and verify reminders of next meeting: `motsugo# crontab -e`
>     - Let's try something different: https://gitlab.ucc.asn.au/ucc-systems/ansiblemonitoring/-/issues/1
>     - Promptly update agenda.next with the TIME/DATE/VENUE
>     - Check at T-7days that the notice really went out, fix for T-4days if needed
> - Everyone, pre-meeting: Curate agenda.next
>
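
For the T-7days / T-4days checks above, a minimal sketch of the date arithmetic. The function and key names are made up for illustration, and the one-hour entry just mirrors the reminder this thread came from; none of this is what the cron job or the ansiblemonitoring issue actually does.

```python
from datetime import datetime, timedelta


def reminder_checkpoints(meeting):
    """Return the checkpoint times relative to the meeting start (hypothetical helper)."""
    return {
        "check_notice_sent": meeting - timedelta(days=7),   # T-7days: did the notice go out?
        "fix_deadline": meeting - timedelta(days=4),        # T-4days: last chance to fix it
        "one_hour_reminder": meeting - timedelta(hours=1),  # the "One hour reminder" mail
    }


# Example: the meeting this agenda was written for.
for name, when in reminder_checkpoints(datetime(2021, 10, 10, 14, 0)).items():
    print(f"{name}: {when:%a %Y-%m-%d %H:%M}")
```
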
> ## Optional items - choose at the start of the meeting
> - Ethical guidelines
> - Monitoring
> - Backups
> - Password rotations
> - New members
>   - [BRD] nominated for wheel. ACTION: [TEC] to raise this with committee
> - Quick check of ChangeLog
> - Lessons learnt
>   - 2021-10-04 magikarp's SD card wasn't so secure after all...
>     - Filesystem went read-only on 2021-10-04 and [BOB] tried to run `fsck` to no avail
>       - when exactly did it fail? Noticed as greyed out in the web UI at 19:35, but VMs were still running until the reboot test at about 20:00
>     - [333] More details to be added to agenda as they come...
>       - [333]'s editor is leaving `foobar~` backup files about the place
>     - No spare SD card on hand (they're cheap!)
>       - Nearest replacement was a USB thumb key, slightly smaller so `dd` isn't a direct option
>     - with one VM host down, ceph was over-capacity and could not meet goals
>       - TODO: add NVMe to legacy hosts
>       - TODO: bring machop online
>   - 2021-10-05T0318 Power outage
>     - ssh.ucc.asn.au
>       - auth failures triggered fail2ban?
>     - samson
>       - manual, post-reboot `mount -av`
>       - manual, post-reboot `systemctl restart samba-ad-dc.service`
>       - samson RADIUS dead? -> broken wifi auth, IPSec VPN
>     - portal
>       - https://portal.ucc.asn.au/ was `403 Forbidden`-ing
>       - `uccportal# mount -av`
>         - standardise/document/expose www -> hostname mappings? DocumentRoot?
>         - Cloudflare -> F5 -> mussel/mailauesi proxy config?
>           - https://wiki.ucc.asn.au/TheCloudflarening
>         - portal, bbb, gitlab, uccmonitor, element+matrix, wiki, www ...
>     - mailfish
>       - manual, post-reboot `mount -av` (try autofs?)
>     - motsugo
>       - md0 scrubbed? or rebuilt? more than once recently, but new spare SSD /dev/sdh not yet in use
>     - mollitz
>       - some long-running and failed backups: away, motsugo
>
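
Several of the post-outage fixes above were a manual `mount -av`, so here is a rough sketch of how a check for "in fstab but not mounted" might look. It only compares fstab-format text against /proc/mounts-format text; the function names and the sample hosts and paths are invented for illustration, and it ignores details such as trailing slashes or autofs-managed mounts.

```python
def fstab_mountpoints(fstab_text):
    """Mount points listed in fstab-format text, skipping comments, swap and noauto entries."""
    points = []
    for line in fstab_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        if len(fields) < 4:
            continue  # malformed line, ignore it
        mountpoint, fstype, options = fields[1], fields[2], fields[3].split(",")
        if fstype == "swap" or mountpoint == "none" or "noauto" in options:
            continue
        points.append(mountpoint)
    return points


def mounted_points(proc_mounts_text):
    """Mount points currently mounted, from /proc/mounts-format text."""
    return {
        fields[1]
        for fields in (line.split() for line in proc_mounts_text.splitlines())
        if len(fields) >= 2
    }


def missing_mounts(fstab_text, proc_mounts_text):
    """fstab mount points that do not appear in the mounted list."""
    mounted = mounted_points(proc_mounts_text)
    return [p for p in fstab_mountpoints(fstab_text) if p not in mounted]


# Hypothetical sample data, not taken from any UCC host.
SAMPLE_FSTAB = """\
# <device> <mountpoint> <type> <options> <dump> <pass>
UUID=aaaa / ext4 errors=remount-ro 0 1
nfs.example:/home /home nfs defaults,_netdev 0 0
nfs.example:/services /services nfs defaults,_netdev 0 0
/dev/sdb1 none swap sw 0 0
"""
SAMPLE_MOUNTS = """\
/dev/sda1 / ext4 rw,relatime 0 0
proc /proc proc rw 0 0
nfs.example:/home /home nfs rw 0 0
"""
print(missing_mounts(SAMPLE_FSTAB, SAMPLE_MOUNTS))  # prints ['/services']
```

On a real host the two arguments would be the contents of `/etc/fstab` and `/proc/mounts`; a non-empty result is the cue to run `mount -av` (or to alert on it).
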
> ## Known Broken Stuff
> - [BRD] `universitycomputer.club.passwd.org` vs `*Everything.html`
> - IPv6 inbound
>   - ACTION: [TEC] to email UWA IT
> - lard
>   - Still needs a spare PSU OR replacement with something less... fatty.
>   - ACTION: [???] to send email out requesting a 1U Cisco switch to replace Lard
> - ACTION: [MTL] to update Ansible scripts for mail*
>   - ACTION: [DBA] wants to give it a shot, good reason to try out Proxmox
> - samson the https://wiki.ucc.asn.au/ActiveDirectory server has no freshly built DC friends
>   - this is risky, a single-point-of-failure, which in turn depends on the running VM cluster
>   - something to do with the current configuration is probably why mussel and mooneye still have auth problems
>     - can we upgrade or rebuild or document our way out of this?
>   - ...so making a quick clone and calling it "done" really isn't enough, continuous integration is called for?
>   - vucc testbed in https://wiki.ucc.asn.au/NewActiveDirectory
> - mollitz is missing prometheus-node-exporter since the rebuild, months ago?
>   - [NTU] anyone want a hand with a https://gitlab.ucc.asn.au/ucc-systems/ansiblemonitoring run ?
>   - can we use the DebianPkg:prometheus-node-exporter/stable where possible?
> - motsugo i801_smbus spam
>   - [BOB] I think we should change the bios battery before we go down any other rabbit holes
>     - @2021-10-08 I've rebooted the BMC, let's see if that fixes things
>       - the fact that the BMC thinks it's 2007 is rather telling
>
> ## Matters arising previously
>
> ## Extra items (rename/refile as appropriate)
> - machop: new EPYC box from Michael/Wings
>
> - Monitoring
>   - Drive health
>     - Uncorrectable errors, reallocated sectors, TBW on SSDs, temperature
>     - ACTION: [NTU] and [MTL] to work together on how best to start drive monitoring, and make it standard/SOE config via ansible
>       - proof-of-concept tested
>         - https://matrix.to/#/!zAfheZzGazlYUQqAeJ:ucc.asn.au/$HuKyvV8eVoTXKah1Ua3hwR9jWyodlIt2P1iO4upAPmE
>       - `/etc/cron.d/node_prometheus-SMART-export`: `*/5 * * * * root /usr/local/bin/smartmon.sh > /var/lib/node_exporter/smart_metrics.prom`
>       - `-rwxr-xr-x 1 nick wheel 11287 Sep 27 19:49 /usr/local/bin/smartmon.sh`
>       - http://uccmonitor.ucc.asn.au:3000/d/PkPI4xGWz/s-m-a-r-t-dashboard
>       - both the script and dashboard probably need a bit of a rewrite, so take it as inspiration?
>       - The SSDs and HDDs report temperatures in different SMART metrics, and the models are exposed in slightly different text strings... so it might be better if `smartmon.sh` did some normalisation
>         - Alternatively, can do some normalisation in the Prometheus query - so that's working
>           - was graphing: `label_replace(avg(smartmon_temperature_celsius_raw_value{ instance=~"$instance", disk=~"$disk" }) by (instance, disk, name), "instance", "$1", "instance", "([^.]+).*")`
>           - now graphing: `... smartmon_temperature_celsius_raw_value{ ... } or smartmon_airflow_temperature_cel_raw_value{ ... }`
>           - so if the first fails, it uses the second... though it would be nice if (both existed) then {use the maximum}
>       - ACTION: [???] ansible-ise and roll out more? and/or do some rewriting/tweaking
>
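
On the "if both exist, use the maximum" wish above: one possible way to normalise on the script side instead of in the query is to post-process the textfile output. This is only a sketch. The two metric names and the `disk` label come from the queries quoted above; the `smart_id` label, the sample values and the function name are assumptions, and the real `smartmon.sh` output may differ.

```python
import re

# The two temperature metrics mentioned in the dashboard query.
TEMPERATURE_METRICS = (
    "smartmon_temperature_celsius_raw_value",
    "smartmon_airflow_temperature_cel_raw_value",
)

# Matches a labelled sample line such as: metric_name{disk="/dev/sda"} 35
SAMPLE_LINE = re.compile(r'^(?P<name>\w+)\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)')
DISK_LABEL = re.compile(r'disk="(?P<disk>[^"]*)"')


def max_temperature_by_disk(prom_text):
    """Per disk, the highest value seen across the known temperature metrics."""
    hottest = {}
    for line in prom_text.splitlines():
        sample = SAMPLE_LINE.match(line)
        if not sample or sample["name"] not in TEMPERATURE_METRICS:
            continue
        disk = DISK_LABEL.search(sample["labels"])
        if not disk:
            continue
        try:
            value = float(sample["value"])
        except ValueError:
            continue  # skip anything that is not a plain number
        key = disk["disk"]
        hottest[key] = max(value, hottest.get(key, value))
    return hottest


# Hypothetical sample in Prometheus text format (values invented).
SAMPLE = """\
# HELP smartmon_temperature_celsius_raw_value SMART metric
smartmon_temperature_celsius_raw_value{disk="/dev/sda",smart_id="194"} 35
smartmon_airflow_temperature_cel_raw_value{disk="/dev/sda",smart_id="190"} 38
smartmon_airflow_temperature_cel_raw_value{disk="/dev/sdb",smart_id="190"} 41
smartmon_power_on_hours_raw_value{disk="/dev/sda",smart_id="9"} 12345
"""
print(max_temperature_by_disk(SAMPLE))  # prints {'/dev/sda': 38.0, '/dev/sdb': 41.0}
```

A disk that reports only one of the two metrics still gets a value, and a disk that reports both gets the larger one, which is the behaviour the `or` in the query cannot express on its own.
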
> - [MTL] continues looking at DNS and CI setup
>   - Not much progress
>   - Have played with coredns for a resolving server
>     - Need to do some more testing of resolving internal UWA things (to check behaviours)
>   - Working on ansible to set up a primary DNS server
>   - Done a little bit of playing with Gitlab CI
>   - Need to finalise working out the best way to do this securely
>     - split out ucc.machines from zonemake.py code
>
> - Group Policy and Ansible on Windows machines
>   - ACTION: [333] to figure out most supported way to install official SSHD build on Windows
>   - ACTION: [MTL] promises to look at this in more detail once back in the clubroom, including WinRM
>   - Best host to run playbooks from for the Windows machines?
>
> - Post-O-Day account locking
>   - cleanup accounts e.g. `getent passwd|grep zv`, primary group memberships
>   - Fall out and thoughts from account locking
>     - not bad!
>     - on typical schedule: warnings due before O-Day, lockings due after the AGM
>     - online payment options limited (bank transfer still works)
>     - time to zip/rm some old home and away directories to save space
>       - for backups: every byte removed saves 3 (more like 4,5,6!)
>       - to move more directories onto SSD
>
> - Staging storage server:
>   - [TEC] Old DELL R710 server[s] from dadams
>   - Store images or less selective backups onsite, for rapid recovery or offsite replication
>     - zfs send? btrfs send? borgbackup? expose to https://pbs.proxmox.com/ appliance?
>   - Want some extra caddies: 3.5" slots, 3.5" + 2.5" SATA drives
>     - https://discord.com/channels/264401248676085760/264401248676085760/878831917133353031
>     - https://discord.com/channels/264401248676085760/264401248676085760/879354657976229958
>     - 3D print? does [DBA] or anyone else at UWA Makers have the model?
>       - 2021-09-07 update: .STL parts here: https://discord.com/channels/264401248676085760/264411219627212801/884778548156563517
>         - ACTION: [???] print a couple?
>       - Currently just for 3.5" drives in 3.5" slots?
>         - ACTION: [???] tweak it for a 2.5" drive in a 3.5" slot?
>       - a few similar ones:
>         - https://www.thingiverse.com/search?q=dell+r710
>         - https://www.yeggi.com/q/dell+hard+drive+caddy/
>     - or ebay?
>
> - [MPT] Began (unofficial) discussions with [DBA] and CS faculty about making GPU compute accessible to students
>   - Potential for funding? No assurances yet
>   - What else would UCC need to buy/build to make it happen in our MR?
>   - Plan on 2021-07-18.txt to get moving on `loveday` upgrades - wait for this instead?
>     - or test with existing hardware?
>
> - ...so if `loveday` doesn't have upgrade quotes yet, how about `medico` -> `machops`?
>   - https://discord.com/channels/264401248676085760/264411219627212801/883522265466146869
>   - https://docs.google.com/spreadsheets/d/1mbszgk9T7FU0jGXrdTKXXLzW62vuOvqG3xZ-x9CpALE/edit?usp=sharing
>
> - Build and break a PC 2021-04-20, followup
>   - Brand new motherboard missing audio capacitor, but [DBA] will resolder it
>     - ACTION: [DBA] to resolder audio capacitor on new motherboard
>
> - ACTION: [MTL] to update Ansible scripts for mail*
>   - In response to spam campaign
>
> - Rebuild rather than upgrade `discord-irc` ?
>   - ansible driven install
>     - config files:
>       - `~discord/discord-irc-config.json`
>       - `/etc/systemd/system/discord-irc.service`
>     - this machine is a non-complicated test case?
>   - https://github.com/reactiflux/discord-irc
>     - requires Debian 11 "bullseye", DebianPkg:nodejs 12.x
>   - occasionally dies, config tweaks could help?
>     - https://github.com/reactiflux/discord-irc/issues/594
>     - `journalctl -xe -u discord-irc.service`
> ```
> Aug 31 17:45:40 discord-irc discord-irc[45040]: TypeError: Converting circular structure to JSON
> Aug 31 17:45:40 discord-irc systemd[1]: discord-irc.service: Main process exited, code=exited, status=1/FAILURE
> Aug 31 17:45:40 discord-irc systemd[1]: discord-irc.service: Start request repeated too quickly.
> ```
>
> *Meeting closed hh:mm*
>
> ----
>
> ```
> # https://demo.hedgedoc.org/Hlsapf47RsqpgIjqLVfMUw
> cd /home/wheel/docs/meetings
> HEDGEDOC_SERVER=https://demo.hedgedoc.org /home/wheel/bin/hedgedoc export --md Hlsapf47RsqpgIjqLVfMUw ./$(date +%Y-%m-%d).txt
> git commit -am "Tech meeting minutes $(date +%Y-%m-%d)"
> ```
>
> <!-- vim: tabstop=2 shiftwidth=2 expandtab
> -->
> <!-- Local Variables: -->
> <!-- tab-width: 2 -->
> <!-- End: -->
> _______________________________________________
> List Archives: http://lists.ucc.asn.au/pipermail/tech
>
> Unsubscribe here: https://lists.ucc.gu.uwa.edu.au/mailman/options/tech/mtearle%40ucc.gu.uwa.edu.au

