[tech] Tech/Wheel Meeting 2021-10-10 14:00 - One hour reminder
Mark Tearle
mtearle at tearle.com
Mon Oct 11 11:44:46 AWST 2021
Hi folks
Apologies for accidentally missing this yesterday. Will have to catch up on this with the Monday night crowd tonight.
Cheers,
Mark
--
Mark Tearle <mtearle at tearle.com>
On Sun, 10 Oct 2021, at 1:00 PM, root wrote:
> Tech/Wheel Meeting Agenda - Sunday 2021-10-10 14:00
> ===================================================
> - VENUE: UCC Clubroom
> - and online at https://meetings.ucc.asn.au/b/bob-yrk-uy6
>
> *Meeting opened hh:mm*
>
> ## Attendance
> - Present
> - Apologies
> - Absent
>
> ## Next meeting
> - Schedule next meeting
> - *day 202Y-MM-ddThh:mm
> - let's have more doing than talking? next one pre-O-Day, 2022?
> - ACTION: [???] Set and verify reminders of next meeting: `motsugo# crontab -e`
> - Let's try something different:
> https://gitlab.ucc.asn.au/ucc-systems/ansiblemonitoring/-/issues/1
> - Promptly update agenda.next with the TIME/DATE/VENUE
> - Check at T-7days that the notice really went out, fix for T-4days
> if needed
> - Everyone, pre-meeting: Curate agenda.next
>
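A minimal sketch of the T-7/T-4 check dates mentioned above, in Python. The function name `reminder_dates` and the default offsets are illustrative assumptions; this is not the existing `motsugo` crontab setup.

```python
from datetime import date, timedelta

def reminder_dates(meeting_day, offsets_days=(7, 4)):
    """Return the dates on which to verify (T-7) and, if needed, fix (T-4) the notice.

    offsets_days counts days before the meeting; the defaults mirror the agenda item.
    """
    return [meeting_day - timedelta(days=n) for n in offsets_days]

# Example: check dates for the meeting on 2021-10-10
for check_day in reminder_dates(date(2021, 10, 10)):
    print(check_day.isoformat())
```

The returned dates could feed whatever ends up sending the reminder, whether that is a crontab entry or something else.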
> ## Optional items - choose at the start of the meeting
> - Ethical guidelines
> - Monitoring
> - Backups
> - Password rotations
> - New members
> - [BRD] nominated for wheel. ACTION: [TEC] to raise this with
> committee
> - Quick check of ChangeLog
> - Lessons learnt
> - 2021-10-04 magikarp's SD card wasn't so secure after all...
> - Filesystem went readonly on 2021-10-04 and [BOB] tried to run
> `fsck` to no avail
> - when exactly did it fail? Noticed as greyed out in the web UI at 19:35, but VMs were still running until the reboot test at about 20:00
> - [333] More details to be added to agenda as they come...
> - [333]'s editor is leaving `foobar~` backup files about the place
> - No spare SD card on hand (they're cheap!)
> - Nearest replacement was a USB thumb key, slightly smaller so
> `dd` isn't a direct option
> - with one VM host down, ceph was over-capacity and could not meet
> goals
> - TODO: add NVMe to legacy hosts
> - TODO: bring machop online
> - 2021-10-05T0318 Power outage
> - ssh.ucc.asn.au
> - auth failures triggered fail2ban?
> - samson
> - manual, post-reboot `mount -av`
> - manual, post-reboot `systemctl restart samba-ad-dc.service`
> - samson RADIUS dead? -> broken wifi auth, IPSec VPN
> - portal
> - https://portal.ucc.asn.au/ was `403 Forbidden`-ing
> - `uccportal# mount -av`
> - standardise/document/expose www -> hostname mappings?
> DocumentRoot?
> - Cloudflare -> F5 -> mussel/mailauesi proxy config?
> - https://wiki.ucc.asn.au/TheCloudflarening
> - portal, bbb, gitlab, uccmonitor, element+matrix, wiki, www ...
> - mailfish
> - manual, post-reboot `mount -av` (try autofs?)
> - motsugo
> - md0 scrubbed? or rebuilt? more than once recently, but new
> spare SSD /dev/sdh not yet in use
> - mollitz
> - some long-running and failed backups: away, motsugo
>
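Several hosts above needed a manual `mount -av` after the outage. As a rough sketch (not an existing UCC tool), a post-reboot check could compare the fstab entries against the live mount table and report anything missing. The function name `missing_mounts` is hypothetical, and the parsing assumes plain whitespace-separated fstab and `/proc/mounts` lines.

```python
def missing_mounts(fstab_text, mounts_text):
    """Return fstab mount points that do not appear in the mounts table."""
    def mount_points(text, fstab_rules):
        points = []
        for line in text.splitlines():
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            fields = line.split()
            if len(fields) < 4:
                continue
            target, fstype, options = fields[1], fields[2], fields[3].split(",")
            # In fstab, skip swap and entries that are not meant to mount at boot.
            if fstab_rules and (fstype == "swap" or target == "none" or "noauto" in options):
                continue
            points.append(target)
        return points

    mounted = set(mount_points(mounts_text, fstab_rules=False))
    return [p for p in mount_points(fstab_text, fstab_rules=True) if p not in mounted]

# Possible use on a live host (paths are the usual Linux locations):
#   missing_mounts(open("/etc/fstab").read(), open("/proc/mounts").read())
```

An empty list would mean every boot-time fstab entry is mounted. Anything returned is a candidate for the manual `mount -av` step, or for the autofs idea raised under mailfish.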
> ## Known Broken Stuff
> - [BRD] `universitycomputer.club.passwd.org` vs `*Everything.html`
> - IPv6 inbound
> - ACTION: [TEC] to email UWA IT
> - lard
> - Still needs a spare PSU OR replacement with something less... fatty.
> - ACTION: [???] to send email out requesting a 1U Cisco switch to
> replace Lard
> - ACTION: [MTL] to update Ansible scripts for mail*
> - ACTION: [DBA] wants to give it a shot, good reason to try out
> Proxmox
> - samson the https://wiki.ucc.asn.au/ActiveDirectory server has no
> freshly built DC friends
> - this is risky, a single-point-of-failure, which in turn depends on
> the running VM cluster
> - something to do with the current configuration is probably why mussel and mooneye still have auth problems
> - can we upgrade or rebuild or document our way out of this?
> - ...so making a quick clone and calling it "done" really isn't
> enough, continuous integration is called for?
> - vucc testbed in https://wiki.ucc.asn.au/NewActiveDirectory
> - mollitz is missing prometheus-node-exporter since the rebuild, months
> ago?
> - [NTU] anyone want a hand with a
> https://gitlab.ucc.asn.au/ucc-systems/ansiblemonitoring run ?
> - can we use the DebianPkg:prometheus-node-exporter/stable where
> possible?
> - motsugo i801_smbus spam
> - [BOB] I think we should change the bios battery before we go down
> any other rabbit holes
> - @2021-10-08 I've rebooted the BMC, let's see if that fixes things
> - the fact that the BMC thinks it's 2007 is rather telling
>
> ## Matters arising previously
>
> ## Extra items (rename/refile as appropriate)
> - machop: new EPYC box from Michael/Wings
>
> - Monitoring
> - Drive health
> - Uncorrectable errors, reallocated sectors, TBW on SSDs,
> temperature
> - ACTION: [NTU] and [MTL] to work together on how best to start
> drive monitoring, and make it standard/SOE config via ansible
> - proof-of-concept tested
> -
> https://matrix.to/#/!zAfheZzGazlYUQqAeJ:ucc.asn.au/$HuKyvV8eVoTXKah1Ua3hwR9jWyodlIt2P1iO4upAPmE
> - `/etc/cron.d/node_prometheus-SMART-export`: `*/5 * * * * root /usr/local/bin/smartmon.sh > /var/lib/node_exporter/smart_metrics.prom`
> - `-rwxr-xr-x 1 nick wheel 11287 Sep 27 19:49 /usr/local/bin/smartmon.sh`
> -
> http://uccmonitor.ucc.asn.au:3000/d/PkPI4xGWz/s-m-a-r-t-dashboard
> - both the script and dashboard probably need a bit of a rewrite,
> so take it as inspiration?
> - The SSDs and HDDs report temperatures in different SMART metrics, and the models are exposed in slightly different text strings... so it might be better if `smartmon.sh` did some normalisation
> - Alternatively, can do some normalisation in the Prometheus
> query - so that's working
> - was graphing: `label_replace(avg(smartmon_temperature_celsius_raw_value{ instance=~"$instance", disk=~"$disk" }) by (instance, disk, name), "instance", "$1", "instance", "([^.]+).*")`
> - now graphing: `... smartmon_temperature_celsius_raw_value{ ... } or smartmon_airflow_temperature_cel_raw_value{ ... }`
> - so if the first fails, it uses the second... though it
> would be nice if (both existed) then {use the maximum}
> - ACTION: [???] ansible-ise and roll out more? and/or do some
> rewriting/tweaking
>
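One way to get the "if both exist, use the maximum" behaviour wished for above is to normalise on the exporter side. The following Python sketch reads node_exporter textfile-format lines and keeps the highest temperature per disk. The two metric names come from the queries above; the function name `max_temperature_per_disk` is hypothetical, and the simple regex assumes label values contain no `}` characters.

```python
import re

# Metric names as used in the dashboard queries above.
TEMPERATURE_METRICS = (
    "smartmon_temperature_celsius_raw_value",
    "smartmon_airflow_temperature_cel_raw_value",
)
SAMPLE = re.compile(r'^(\w+)\{([^}]*)\}\s+(\S+)$')
DISK_LABEL = re.compile(r'disk="([^"]*)"')

def max_temperature_per_disk(prom_text):
    """Map each disk label to the highest value among the known temperature metrics."""
    hottest = {}
    for line in prom_text.splitlines():
        match = SAMPLE.match(line.strip())
        if not match or match.group(1) not in TEMPERATURE_METRICS:
            continue  # comments, other metrics, unlabelled samples
        disk = DISK_LABEL.search(match.group(2))
        if not disk:
            continue
        value = float(match.group(3))
        name = disk.group(1)
        hottest[name] = max(value, hottest.get(name, value))
    return hottest
```

A disk that reports only one of the two metrics still gets a value, so this covers both the existing fallback case and the use-the-maximum case.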
> - [MTL] continues looking at DNS and CI setup
> - Not much progress
> - Have played with coredns for a resolving server
> - Need to do some more testing of resolving internal UWA things (to
> check behaviours)
> - Working on ansible to set up a primary DNS server
> - Done a little bit of playing with Gitlab CI
> - Need to finalise working out the best way to do this securely
> - split out ucc.machines from zonemake.py code
>
> - Group Policy and Ansible on Windows machines
> - ACTION: [333] to figure out most supported way to install official
> SSHD build on Windows
> - ACTION: [MTL] promises to look at this in more detail once back in
> the clubroom, including WinRM
> - Best host to run playbooks from for the Windows machines?
>
> - Post-O-Day account locking
> - clean up accounts e.g. `getent passwd|grep zv`, primary group memberships
> - Fallout and thoughts from account locking
> - not bad!
> - on typical schedule: warnings due before O-Day, lockings due after the AGM
> - online payment options limited (bank transfer still works)
> - time to zip/rm some old home and away directories to save space
> - for backups: every byte removed saves 3 (more like 4,5,6!)
> - to move more directories onto SSD
>
> - Staging storage server:
> - [TEC] Old DELL R710 server[s] from dadams
> - Store images or less selective backups onsite, for rapid recovery
> or offsite replication
> - zfs send? btrfs send? borgbackup? expose to
> https://pbs.proxmox.com/ appliance?
> - Want some extra caddies: 3.5" slots, 3.5" + 2.5" SATA drives
> -
> https://discord.com/channels/264401248676085760/264401248676085760/878831917133353031
> -
> https://discord.com/channels/264401248676085760/264401248676085760/879354657976229958
> - 3D print? does [DBA] or anyone else at UWA Makers have the model?
> - 2021-09-07 update: .STL parts here:
> https://discord.com/channels/264401248676085760/264411219627212801/884778548156563517
> - ACTION: [???] print a couple?
> - Currently just for 3.5" drives in 3.5" slots?
> - ACTION: [???] tweak it for a 2.5" drive in a 3.5" slot?
> - a few similar ones:
> - https://www.thingiverse.com/search?q=dell+r710
> - https://www.yeggi.com/q/dell+hard+drive+caddy/
> - or ebay?
>
> - [MPT] Began (unofficial) discussions with [DBA] and CS faculty about
> making GPU compute accessible to students
> - Potential for funding? No assurances yet
> - What else would UCC need to buy/build to make it happen in our MR?
> - Plan on 2021-07-18.txt to get moving on `loveday` upgrades - wait
> for this instead?
> - or test with existing hardware?
>
> - ...so if `loveday` doesn't have upgrade quotes yet, how about
> `medico` -> `machops`?
> -
> https://discord.com/channels/264401248676085760/264411219627212801/883522265466146869
> -
> https://docs.google.com/spreadsheets/d/1mbszgk9T7FU0jGXrdTKXXLzW62vuOvqG3xZ-x9CpALE/edit?usp=sharing
>
> - Build and break a PC 2021-04-20, followup
> - Brand new motherboard missing audio capacitor, but [DBA] will resolder it
> - ACTION: [DBA] to resolder audio capacitor on new motherboard
>
> - ACTION: [MTL] to update Ansible scripts for mail*
> - In response to spam campaign
>
> - Rebuild rather than upgrade `discord-irc` ?
> - ansible driven install
> - config files:
> - `~discord/discord-irc-config.json`
> - `/etc/systemd/system/discord-irc.service`
> - this machine is a non-complicated test case?
> - https://github.com/reactiflux/discord-irc
> - requires Debian 11 "bullseye", DebianPkg:nodejs 12.x
> - occasionally dies, config tweaks could help?
> - https://github.com/reactiflux/discord-irc/issues/594
> - `journalctl -xe -u discord-irc.service`
> ```
> Aug 31 17:45:40 discord-irc discord-irc[45040]: TypeError: Converting
> circular structure to JSON
> Aug 31 17:45:40 discord-irc systemd[1]: discord-irc.service: Main
> process exited, code=exited, status=1/FAILURE
> Aug 31 17:45:40 discord-irc systemd[1]: discord-irc.service: Start
> request repeated too quickly.
> ```
>
> *Meeting closed hh:mm*
>
> ----
>
> ```
> # https://demo.hedgedoc.org/Hlsapf47RsqpgIjqLVfMUw
> cd /home/wheel/docs/meetings
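> HEDGEDOC_SERVER=https://demo.hedgedoc.org /home/wheel/bin/hedgedoc \
>   export --md Hlsapf47RsqpgIjqLVfMUw ./$(date +%Y-%m-%d).txt
> # stage the new minutes file; `git commit -a` alone skips untracked files
> git add ./$(date +%Y-%m-%d).txt
> git commit -am "Tech meeting minutes $(date +%Y-%m-%d)"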
> ```
>
> <!-- vim: tabstop=2 shiftwidth=2 expandtab -->
> <!-- Local Variables: -->
> <!-- tab-width: 2 -->
> <!-- End: -->
> _______________________________________________
> List Archives: http://lists.ucc.asn.au/pipermail/tech