[tech] uccmonitor/prometheus/grafana alerting/debugging, re: mussel auth OOM
Nick Bannon
nick at ucc.gu.uwa.edu.au
Wed Apr 3 16:46:30 AWST 2024
At https://discord.com/channels/264401248676085760/1003514966512521246/1224715689688961095 ,
[BEN] reported a problem:
> zonemake is putting out members into /etc/apache2/sites-enabled/members.conf
> that mussel doesn't think exist, which causes Apache to fail to
> start. I've chopped out a chunk of the members.conf to get it going
> (it seemed to be more than a few users) which isn't ideal. Can someone
> with more clue on zonemake/mussel's auth setup take a look?
> These are the users that mussel can't see:
[all recently created accounts since 2024-03-13]
Turns out - this affects one of the few custom metrics we're monitoring:
mussel$ getent passwd|wc -l
78
...so mussel is only seeing local /etc/passwd user accounts, not the
>1000 current+locked accounts from AD. Checking Grafana, it happened
about 2024-03-12 05:35 (local Perth +0800 time):
http://uccmonitor.ucc.asn.au:3000/d/V3mRaxPZk/ucc-overview
http://uccmonitor.ucc.asn.au:3000/d/V3mRaxPZk/ucc-overview?orgId=1&from=1710192596865&to=1710193632618&viewPanel=2
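For anyone wanting to reproduce a metric like this, here is a minimal sketch of how a passwd-entry count could be published through node_exporter's textfile collector. It is not necessarily how uccmonitor collects it: the metric name `ucc_passwd_entries` and the output directory are hypothetical, and the directory must match whatever `--collector.textfile.directory` is set to.

```shell
#!/bin/sh
# Sketch only: the metric name and file name below are hypothetical.

# Format one gauge sample in the Prometheus text exposition format.
format_gauge() {
    # $1 = metric name, $2 = help text, $3 = value
    printf '# HELP %s %s\n# TYPE %s gauge\n%s %s\n' "$1" "$2" "$1" "$1" "$3"
}

# Count passwd entries as seen through NSS (local files plus winbind/AD).
count_passwd_entries() {
    getent passwd | wc -l | tr -d ' '
}

# Write to a temporary file and rename it, so node_exporter never
# scrapes a half-written file.
write_metric() {
    # $1 = textfile collector directory (supplied by the caller)
    tmp="$1/ucc_passwd_entries.prom.$$"
    format_gauge ucc_passwd_entries "Entries returned by getent passwd" \
        "$(count_passwd_entries)" > "$tmp" &&
        mv "$tmp" "$1/ucc_passwd_entries.prom"
}
```

Run from cron or a systemd timer as `write_metric /path/to/textfile/dir`. A sudden drop from ~1680 to 78 then shows up as a step on the graph.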
What else happened around then? Hmm, mussel:/var/log is a bit patchy: were
files lost? I/O errors on NFS or the VM block devices? There are NFS
outage errors in `dmesg`...
...but central logging finds the answer:
Mar 12 05:35:03 mussel winbindd[25062]: gensec_gse_unwrap: GSS UnWrap failed: Miscellaneous failure (see text): unknown mech-code 12 for mech 1 2 840 113554 1 2 2
Mar 12 05:35:33 mussel winbindd[25062]: [2024/03/12 05:35:33.358589, 0] ../source3/winbindd/winbindd_cm.c:222(fork_child_dc_connect)
Mar 12 05:35:33 mussel winbindd[25062]: fork_child_dc_connect: Could not fork: Cannot allocate memory
With that hint, back on mussel:
mussel:/var/log# bzless /var/log/messages.3.bz2
Mar 12 05:15:03 mussel out of memory [25062]
Mar 12 05:15:13 mussel last message repeated 21 times
Mar 12 05:18:01 mussel out of memory [25062]
Mar 12 05:18:01 mussel last message repeated 7 times
Mar 12 05:25:13 mussel out of memory [25062]
Mar 12 05:25:15 mussel last message repeated 5 times
Mar 12 05:26:29 mussel last message repeated 4 times
Mar 12 05:27:57 mussel last message repeated 6 times
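To get a feel for how many of these messages there were, the syslog "last message repeated N times" lines need to be added back in. A rough sketch, assuming each "repeated" line refers to the "out of memory" line before it in the same file (the function name is made up for illustration):

```shell
#!/bin/sh
# Sketch: total "out of memory" messages in syslog-format input on stdin,
# counting "last message repeated N times" lines that follow one.
count_oom_messages() {
    awk '
        /out of memory/ { total += 1; prev_oom = 1; next }
        /last message repeated [0-9]+ times/ {
            if (prev_oom) {
                # The repeat count is the field right after "repeated".
                for (i = 1; i <= NF; i++)
                    if ($i == "repeated") total += $(i + 1)
            }
            next
        }
        # Any other message ends the run of OOM repeats.
        { prev_oom = 0 }
        END { print total + 0 }
    '
}
```

For example: `bzcat /var/log/messages.3.bz2 | count_oom_messages`.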
It seems OK from a casual `systemctl status winbind.service`:
winbind.service - Samba Winbind Daemon
Loaded: loaded (/lib/systemd/system/winbind.service; enabled; vendor preset: enabled)
Active: active (running) since Mon 2023-10-09 07:02:16 AWST; 5 months 25 days ago
Main PID: 1849 (winbindd)
Memory: 2.9G
[...]
Warning: Journal has been rotated since unit was started. Log output is incomplete or unavailable.
...though that seems like a lot of memory use?
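A check along these lines might have flagged that growth earlier. This is a sketch only: the 1024 MiB limit is an arbitrary, hypothetical threshold, and it relies on systemd memory accounting being enabled for the unit (which the "Memory:" line above suggests it is).

```shell
#!/bin/sh
# Sketch: warn when a systemd unit's memory use exceeds a limit.

# Compare a byte count against a limit given in MiB.
classify_memory() {
    # $1 = bytes in use, $2 = limit in MiB
    if [ "$1" -gt $(( $2 * 1024 * 1024 )) ]; then
        echo WARNING
    else
        echo OK
    fi
}

check_unit_memory() {
    # $1 = unit name, $2 = limit in MiB
    bytes=$(systemctl show "$1" --property=MemoryCurrent | cut -d= -f2)
    case "$bytes" in
        # Empty, non-numeric ("[not set]") or the all-ones value that some
        # systemd versions report when accounting is off.
        ''|*[!0-9]*|18446744073709551615)
            echo "UNKNOWN: no memory figure for $1"
            return 3 ;;
    esac
    echo "$(classify_memory "$bytes" "$2"): $1 is using $bytes bytes"
}
```

For example, `check_unit_memory winbind.service 1024` would report WARNING at 2.9G and OK at 29.5M.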
- Anyway, after an OOM event, it probably is best to reboot the machine.
- but first, out of interest, let's try restarting the Winbind AD auth daemon. Seems OK!
- and this time, enumerating all the AD accounts works.
(takes a while the first time, though):
mussel:~# systemctl restart winbind
mussel:~# systemctl status winbind
winbind.service - Samba Winbind Daemon
Loaded: loaded (/lib/systemd/system/winbind.service; enabled; vendor preset: enabled)
Active: active (running) since Wed 2024-04-03 15:51:32 AWST; 13s ago
Status: "winbindd: ready to serve connections..."
Tasks: 4 (limit: 4915)
Memory: 29.5M
[...]
mussel:/etc/apache2/sites-available# time getent passwd|wc -l
1680
real 0m45.282s
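The same comparison can be turned into a simple alert condition: if NSS returns no more entries than the local passwd file has, the AD accounts are probably not visible. A minimal sketch, assuming winbind user enumeration is enabled (as it evidently is here) and using made-up function and status names:

```shell
#!/bin/sh
# Sketch: decide whether AD accounts are visible through NSS.
classify_nss() {
    # $1 = entries from `getent passwd`
    # $2 = entries in the local passwd file
    if [ "$1" -gt "$2" ]; then
        echo OK
    else
        echo CRITICAL
    fi
}

# Possible use (enumeration can be slow right after a winbind restart):
#   classify_nss "$(getent passwd | wc -l)" "$(wc -l < /etc/passwd)"
```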
- Lessons/observations?
  - mussel has 8GiB of RAM and no swap, not even zram
    - that should be "enough", but it might make OOMs more likely,
      and the leadup to OOM harder to spot on monitoring? it was steadily
      creeping up on the graph here:
      http://uccmonitor.ucc.asn.au:3000/d/uYiRn3BZk/node-exporter-full?orgId=1&var-job=other&var-name=mussel&var-node=mussel.ucc.asn.au&var-port=9100&from=1709222400000&to=1710193632000
    - could we make actual logged kernel OOM-killer events show up there? somewhere?
  - mussel could well be kept/upgraded/rebuilt, it's currently out-of-date running Debian "buster" 10
    - https://gitlab.ucc.asn.au/UCC/tech-todo-list/-/issues/21
    - but some of that is not just a trivial package upgrade, it will
      turn into specific upgrade tasks/issues e.g. for the wiki. What
      we're really trying to do with it is extract all its services,
      notably a new, config-managed/ansible built webserver and not hold
      up on the parts that need testing old+new side-by-side
    - We could re-consolidate later, a lot of these services/roles can share well
    - upgrade of all the AD parts https://gitlab.ucc.asn.au/UCC/tech-todo-list/-/issues/27
  - more monitoring! custom metrics and alerting on any likely/known/recurring issues!
    - temperature, disk SMART stats, ... https://gitlab.ucc.asn.au/UCC/tech-todo-list/-/issues/12
    - displayed on Cerberus and/or on extra clubroom displays https://gitlab.ucc.asn.au/UCC/tech-todo-list/-/issues/58
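On the question of making kernel OOM-killer events visible: Linux kernels from 4.13 onward keep an `oom_kill` counter in /proc/vmstat, and recent node_exporter versions should export it from the vmstat collector as `node_vmstat_oom_kill`. A minimal sketch for reading it by hand (the function name is hypothetical):

```shell
#!/bin/sh
# Sketch: print the kernel's cumulative OOM-kill count, or "unknown"
# if the counter is missing (e.g. on an older kernel).
read_oom_kill() {
    # $1 = vmstat-format file, defaults to /proc/vmstat
    awk '
        $1 == "oom_kill" { print $2; found = 1 }
        END { if (!found) print "unknown" }
    ' "${1:-/proc/vmstat}"
}
```

This only counts processes the kernel actually killed. Allocation failures that a daemon logs itself, like the winbindd "Cannot allocate memory" above, would not necessarily move it, so the memory graphs and the passwd-count metric are still needed alongside.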
Nick.
--
Nick Bannon | "I made this letter longer than usual because
nick-sig at rcpt.to | I lack the time to make it shorter." - Pascal