[tech] uccmonitor/prometheus/grafana alerting/debugging, re: mussel auth OOM

Nick Bannon nick at ucc.gu.uwa.edu.au
Wed Apr 3 16:46:30 AWST 2024


At https://discord.com/channels/264401248676085760/1003514966512521246/1224715689688961095 ,
[BEN] reported a problem:
> zonemake is putting out members into /etc/apache2/sites-enabled/members.conf
> that mussel doesn't think exist, which causes Apache to fail to
> start. I've chopped out a chunk of the members.conf to get it going
> (it seemed to be more than a few users) which isn't ideal. Can someone
> with more clue on zonemake/mussel's auth setup take a look?

> These are the users that mussel can't see:
[all recently created accounts since 2024-03-13]

Turns out - this affects one of the few custom metrics we're monitoring:
mussel$ getent passwd|wc -l
78

...so mussel is only seeing local /etc/passwd user accounts, not the
>1000 current+locked accounts from AD. Checking Grafana, it happened
about 2024-03-12 05:35 (local Perth +0800 time):
http://uccmonitor.ucc.asn.au:3000/d/V3mRaxPZk/ucc-overview
http://uccmonitor.ucc.asn.au:3000/d/V3mRaxPZk/ucc-overview?orgId=1&from=1710192596865&to=1710193632618&viewPanel=2
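
For anyone wondering how a count like that can end up on a graph, here
is a minimal sketch, not necessarily how the existing metric is
collected. It assumes node_exporter's textfile collector is in use; the
function name passwd_metric and the metric name ucc_passwd_entries are
made up for illustration.

```shell
#!/bin/sh
# Sketch only: turn a passwd listing into one Prometheus gauge.
# "ucc_passwd_entries" is a hypothetical metric name.

# Reads passwd-format lines on stdin, prints the metric on stdout.
passwd_metric() {
    awk '
        END {
            print "# HELP ucc_passwd_entries Accounts visible via getent passwd."
            print "# TYPE ucc_passwd_entries gauge"
            print "ucc_passwd_entries " NR
        }'
}
```

Usage would be something like `getent passwd | passwd_metric`, run
from cron and written to a .prom file in the directory given to
node_exporter's --collector.textfile.directory. Running it from cron
rather than at scrape time matters, since a cold enumeration of the AD
accounts can take most of a minute.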

What else happened about then? Hmm, mussel:/var/log is a bit patchy: were
files lost? I/O errors on NFS or the VM block devices? There are NFS
outage errors in `dmesg`...

...but central logging finds the answer:
Mar 12 05:35:03 mussel winbindd[25062]:   gensec_gse_unwrap: GSS UnWrap failed:  Miscellaneous failure (see text): unknown mech-code 12 for mech 1 2 840 113554 1 2 2 
Mar 12 05:35:33 mussel winbindd[25062]: [2024/03/12 05:35:33.358589,  0] ../source3/winbindd/winbindd_cm.c:222(fork_child_dc_connect) 
Mar 12 05:35:33 mussel winbindd[25062]:   fork_child_dc_connect: Could not fork: Cannot allocate memory 

With that hint, back on mussel:
mussel:/var/log# bzless /var/log/messages.3.bz2
Mar 12 05:15:03 mussel out of memory [25062]
Mar 12 05:15:13 mussel last message repeated 21 times
Mar 12 05:18:01 mussel out of memory [25062]
Mar 12 05:18:01 mussel last message repeated 7 times
Mar 12 05:25:13 mussel out of memory [25062]
Mar 12 05:25:15 mussel last message repeated 5 times
Mar 12 05:26:29 mussel last message repeated 4 times
Mar 12 05:27:57 mussel last message repeated 6 times
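
Tallying those by eye is fiddly because syslog collapses repeats. A
rough sketch of a counter follows; the function name is made up, and it
assumes the exact "out of memory [PID]" and "last message repeated N
times" wording seen above.

```shell
#!/bin/sh
# Sketch only: count "out of memory [PID]" syslog lines on stdin,
# including repeats collapsed into "last message repeated N times".
count_oom_messages() {
    awk '
        / last message repeated [0-9]+ times *$/ {
            # Only add repeats when the message being repeated was an OOM line.
            if (last_was_oom) total += $(NF - 1)
            next
        }
        {
            last_was_oom = ($0 ~ / out of memory \[[0-9]+\]/) ? 1 : 0
            total += last_was_oom
        }
        END { print total + 0 }
    '
}
```

Usage: `bzcat /var/log/messages.3.bz2 | count_oom_messages`.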

It seems OK from a casual `systemctl status winbind.service`:
winbind.service - Samba Winbind Daemon
   Loaded: loaded (/lib/systemd/system/winbind.service; enabled; vendor preset: enabled)
   Active: active (running) since Mon 2023-10-09 07:02:16 AWST; 5 months 25 days ago
 Main PID: 1849 (winbindd)
   Memory: 2.9G
[...]
Warning: Journal has been rotated since unit was started. Log output is incomplete or unavailable.

...though that seems like a lot of memory use?
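
That figure is something a script could check, too. A minimal sketch,
assuming systemd memory accounting is on for the unit (it appears to be,
given the Memory: line above). The function names and any threshold are
invented for illustration, not something currently deployed.

```shell
#!/bin/sh
# Sketch only: flag a systemd unit whose memory use exceeds a limit.

# Print the unit's current memory use in bytes, as reported by systemd.
unit_memory_bytes() {
    systemctl show -p MemoryCurrent "$1" | cut -d= -f2
}

# Exit 0 if $1 (bytes) is above $2 (MiB), 1 if not,
# 2 if $1 is not a plain number (e.g. "[not set]").
over_limit_mib() {
    case "$1" in
        ''|*[!0-9]*) return 2 ;;
    esac
    [ "$1" -gt $(( $2 * 1024 * 1024 )) ]
}
```

Usage: `over_limit_mib "$(unit_memory_bytes winbind.service)" 1024 &&
echo "winbind is over 1GiB"`. The 1024 is an arbitrary example value;
the freshly restarted daemon above sits at about 30MiB.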

- Anyway, after an OOM event, it is probably best to reboot the machine.
  - but first, out of interest, let's try restarting the Winbind AD auth daemon. Seems OK!
  - and this time, enumerating all the AD accounts works.
    (takes a while the first time, though):

mussel:~# systemctl restart winbind    
mussel:~# systemctl status winbind     
winbind.service - Samba Winbind Daemon  
   Loaded: loaded (/lib/systemd/system/winbind.service; enabled; vendor preset: enabled)
   Active: active (running) since Wed 2024-04-03 15:51:32 AWST; 13s ago
   Status: "winbindd: ready to serve connections..."
    Tasks: 4 (limit: 4915)
   Memory: 29.5M
[...]
mussel:/etc/apache2/sites-available# time getent passwd|wc -l
1680
real    0m45.282s

- Lessons/observations?
  - mussel has 8GiB of RAM and no swap, not even zram
    - that should be "enough", but having no swap might make OOMs more
      likely, and the leadup to OOM harder to spot on monitoring? Memory
      use was steadily creeping up on the graph here:
http://uccmonitor.ucc.asn.au:3000/d/uYiRn3BZk/node-exporter-full?orgId=1&var-job=other&var-name=mussel&var-node=mussel.ucc.asn.au&var-port=9100&from=1709222400000&to=1710193632000
    - could we make actual logged kernel OOM-killer events show up there? somewhere?
  - mussel could well be kept/upgraded/rebuilt; it's currently out of date, running Debian 10 "buster"
    - https://gitlab.ucc.asn.au/UCC/tech-todo-list/-/issues/21
    - but some of that is not just a trivial package upgrade: it will
      turn into specific upgrade tasks/issues, e.g. for the wiki. What
      we're really trying to do is extract all of its services, notably
      onto a new, config-managed/Ansible-built webserver, and not hold
      that up on the parts that need testing old+new side-by-side
      - We could re-consolidate later, a lot of these services/roles can share well
  - upgrade of all the AD parts https://gitlab.ucc.asn.au/UCC/tech-todo-list/-/issues/27
  - more monitoring! custom metrics and alerting on any likely/known/recurring issues!
    - temperature, disk SMART stats, ...  https://gitlab.ucc.asn.au/UCC/tech-todo-list/-/issues/12
    - displayed on Cerberus and/or on extra clubroom displays https://gitlab.ucc.asn.au/UCC/tech-todo-list/-/issues/58
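
On the question of getting kernel OOM-killer events onto a graph: Linux
4.13 and later keep an oom_kill counter in /proc/vmstat. A minimal
sketch of exporting it follows; the function and metric names are
hypothetical. node_exporter's own vmstat collector may already expose
the same counter (as node_vmstat_oom_kill, depending on version and
field filter), so that is worth checking before adding anything custom.

```shell
#!/bin/sh
# Sketch only: print the kernel's oom_kill counter as a Prometheus counter.
# "ucc_oom_kills_total" is a hypothetical metric name.
# $1: vmstat-format file to read (defaults to /proc/vmstat).
oom_kill_metric() {
    awk '
        $1 == "oom_kill" { kills = $2; found = 1 }
        END {
            if (!found) exit 1   # kernels without the field: fail, print nothing
            print "# TYPE ucc_oom_kills_total counter"
            print "ucc_oom_kills_total " kills
        }
    ' "${1:-/proc/vmstat}"
}
```

One caveat: this only counts processes killed by the kernel OOM killer.
An allocation failure inside a process, like the failed fork in the
winbindd log above, would not necessarily move it, so alerting on
memory use creeping up is probably still needed as well.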

Nick.

-- 
   Nick Bannon   | "I made this letter longer than usual because
nick-sig at rcpt.to | I lack the time to make it shorter." - Pascal

