[tech] Bitumen supervisor module problems
David Adam
zanchey at ucc.gu.uwa.edu.au
Sun Aug 6 21:42:51 AWST 2017
In short, Bitumen's second supervisor module (which runs the switch) has a
memory problem and won't POST any more.
I don't know whether it's worth replacing or just removing it; we don't
really use the redundancy feature at all. I've set it to "boot as needed"
(cold standby or RPR mode) instead of "be running and able to take over at
any time" (hot standby or SSO mode), which will at least stop it crashing
continuously.
---
In long:
On IRC, [BOB] asked:
> Any idea what's up with bitumen that's making rancid send out so many
> emails? I see some <thing>failed.txt logs appearing on bitumen but
> wouldn't know where to even start investigating
Rancid is the switch configuration tracker:
http://cvs.ucc.asn.au/cgi-bin/viewvc.cgi/rancid/ucc/
(I mucked up the metadata, so all the old switches look like their latest
config is from today - the SVN log shows otherwise.)
The first place I looked was the event log on bitumen (`show log`),
which is synced across to Murasoi (/var/log/ucc/cisco.log). It's filled
with entries like these:
Aug 6 17:10:37 bitumen AWST: %C4K_REDUNDANCY-6-DUPLEX_MODE: The peer
Supervisor has been detected
Aug 6 17:10:51 bitumen AWST: %C4K_REDUNDANCY-2-POSTFAIL: POST failure on
STANDBY supervisor detected. Supervisor redundancy is **NOT** available.
Check and replace failed Supervisor.
Aug 6 17:10:53 bitumen AWST: %C4K_REDUNDANCY-6-MODE: ACTIVE supervisor
initializing for sso mode
Aug 6 17:10:53 bitumen AWST: %C4K_REDUNDANCY-3-COMMUNICATION:
Communication with the peer Supervisor has been established
Aug 6 17:11:01 bitumen AWST: %C4K_REDUNDANCY-5-CONFIGSYNC: The bootvar
has been successfully synchronized to the standby supervisor
Aug 6 17:11:01 bitumen AWST: %C4K_REDUNDANCY-5-CONFIGSYNC: The config-reg
has been successfully synchronized to the standby supervisor
Aug 6 17:11:01 bitumen AWST: %C4K_REDUNDANCY-5-CONFIGSYNC: The
startup-config has been successfully synchronized to the standby
supervisor
Aug 6 17:11:01 bitumen AWST: %C4K_REDUNDANCY-5-CONFIGSYNC: The
private-config has been successfully synchronized to the standby
supervisor
Aug 6 17:11:04 bitumen AWST: %C4K_REDUNDANCY-5-CONFIGSYNC_RATELIMIT: The
vlan database has been successfully synchronized to the standby supervisor
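To get a feel for how often the standby is failing, the synced log can be tallied by message type. This is only a rough sketch: it assumes each message sits on a single line in cisco.log (the entries above are wrapped by the mail client), and `count_messages` is a name made up for illustration.

```python
import re
from collections import Counter

# Matches the %FACILITY-SEVERITY-MNEMONIC: tag of an IOS syslog message,
# e.g. "%C4K_REDUNDANCY-2-POSTFAIL:".
TAG = re.compile(r"%([A-Z0-9_]+)-(\d)-([A-Z0-9_]+):")


def count_messages(lines):
    """Return a Counter keyed by (facility, severity, mnemonic)."""
    counts = Counter()
    for line in lines:
        match = TAG.search(line)
        if match:
            facility, severity, mnemonic = match.groups()
            counts[(facility, int(severity), mnemonic)] += 1
    return counts


# A few of the entries quoted above, unwrapped to one message per line.
sample = [
    "Aug 6 17:10:37 bitumen AWST: %C4K_REDUNDANCY-6-DUPLEX_MODE: The peer Supervisor has been detected",
    "Aug 6 17:10:51 bitumen AWST: %C4K_REDUNDANCY-2-POSTFAIL: POST failure on STANDBY supervisor detected.",
    "Aug 6 17:11:01 bitumen AWST: %C4K_REDUNDANCY-5-CONFIGSYNC: The bootvar has been successfully synchronized to the standby supervisor",
    "Aug 6 17:11:01 bitumen AWST: %C4K_REDUNDANCY-5-CONFIGSYNC: The config-reg has been successfully synchronized to the standby supervisor",
]

counts = count_messages(sample)
for key, n in counts.most_common():
    print(n, key)
```

Against the real file, something like `count_messages(open("/var/log/ucc/cisco.log"))` should work, and the POSTFAIL count gives the number of failed boot attempts.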
The second thing was to see the diffs in Rancid that [BOB] was complaining
about:
http://cvs.ucc.asn.au/cgi-bin/viewvc.cgi/rancid/ucc/configs/bitumen.ucc.gu.uwa.edu.au?view=log&pathrev=387
Looks like on August 4, a whole bunch of files appeared (then disappeared,
then reappeared, and so on). Revision 373 is the useful one, because the
flash can no longer be listed on the switch itself:
bitumen#dir bootflash:
%Error opening bootflash:/ (No more memory for file record)
I tried running `squeeze bootflash`, but that didn't work either (no
deleted files to squeeze out).
Revision 373 did have the file list, though, so the files could be deleted
by name:
bitumen#delete bootflash:
Delete filename []? post-2017.05.01.09.16.10-failed.txt
Delete bootflash:post-2017.05.01.09.16.10-failed.txt? [confirm]
(repeat 7000 times to remove half of the entries, running for four hours
at last count, done via vim and a tmux paste buffer)
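For what it's worth, the paste buffer could also be generated by a script instead of by hand in vim. The sketch below assumes the file names can be scraped out of the Rancid revision text with a regex, and that the switch accepts a newline at the [confirm] prompt; `delete_keystrokes` is a made-up helper, not anything Rancid or IOS provides.

```python
import re

# Matches the POST failure log names seen on bootflash, e.g.
# post-2017.05.01.09.16.10-failed.txt
POST_LOG = re.compile(r"post-\d{4}(?:\.\d{2}){5}-failed\.txt")


def delete_keystrokes(listing, device="bootflash:"):
    """Build the text to paste into the console for every POST log named
    in `listing`, following the interactive dialogue shown above:
    the delete command, the file name, then Enter at [confirm]."""
    names = sorted(set(POST_LOG.findall(listing)))
    return "".join(f"delete {device}\n{name}\n\n" for name in names)


# Hypothetical fragment of a saved file listing; only the file names matter.
listing = """
post-2017.05.01.09.16.10-failed.txt
post-2017.05.01.09.20.31-failed.txt
post-2017.05.01.09.16.10-failed.txt
"""

keys = delete_keystrokes(listing)
print(keys, end="")
```

Pasting thousands of lines in one go may overrun the console, so feeding the output through in smaller chunks is probably safer.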
Meanwhile, I realised that the oldest file clogging up the flash was from
May, so the full flash was not the underlying problem: the POST failures
have been going on since then.
The documentation for the current OS is at
https://www.cisco.com/c/en/us/td/docs/switches/lan/catalyst4500/12-2/54sg/configuration/guide/config.html
In particular, the chapter on supervisor engine redundancy describes two
modes: SSO (hot spare) and RPR (cold spare). Our system was set up with
SSO redundancy, but it looks like the standby supervisor was crashing
continuously in that mode. Changing the configuration to RPR mode meant
that the second supervisor engine didn't crash any more, which gave me a
chance to inspect the POST logs in slavebootflash:
---
bitumen# cd slavebootflash:
bitumen# dir
bitumen# more post-2017.08.06.03.48.54-failed.txt
Power-on-self-test for Module 2: WS-X4515
Port/Test Status: (. = Pass, F = Fail, U = Untested)
...
Switch Subsystem Memory ...
49: . 50: . 51: . 52: . 53: . 54: F 55: .
Module 2 Failed
---
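With several thousand of these reports lying around, picking out the failing tests by script is easier than reading each one. This is a minimal sketch that assumes every report follows the layout of the excerpt above (a section title ending in "...", then "number: status" pairs); the excerpt is abridged, so real files may need more care. `failed_tests` is a name invented here.

```python
import re

# One "test number: status" pair, where status is . (pass), F (fail)
# or U (untested), as in the legend line of the report.
RESULT = re.compile(r"(\d+):\s*([.FU])")


def failed_tests(report):
    """Return (section, test_number) for every test marked F."""
    failures = []
    section = None
    for line in report.splitlines():
        text = line.strip()
        results = RESULT.findall(text)
        if results:
            failures += [(section, int(num)) for num, status in results
                         if status == "F"]
        elif text.endswith("..."):
            # Section titles end in "..."; a bare "..." is just elision.
            title = text[:-3].strip()
            if title:
                section = title
    return failures


report = """\
Power-on-self-test for Module 2: WS-X4515
Port/Test Status: (. = Pass, F = Fail, U = Untested)
...
Switch Subsystem Memory ...
49: . 50: . 51: . 52: . 53: . 54: F 55: .
Module 2 Failed
"""

failures = failed_tests(report)
print(failures)
```

Running this over all the saved reports would show whether it is always the same memory test that fails or whether the failure moves around.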
So - I think the secondary supervisor module's memory is stuffed, which
means the supervisor engine won't start. I'm not sure if it's worth trying
to replace, or just removing the problem part. Setting it to cold standby
will stop it filling the disk with complaints, and shouldn't affect the
actual operation of the switch.
[DAA]