[tech] Bitumen supervisor module problems

David Adam zanchey at ucc.gu.uwa.edu.au
Sun Aug 6 21:42:51 AWST 2017


In short, Bitumen's second supervisor module (the supervisor modules are 
what run the switch) has a memory problem and won't POST any more.

I don't know whether it's worth replacing or just removing it; we don't 
really use the redundancy feature at all. I've set it to "boot as needed" 
(cold standby or RPR mode) instead of "be running and able to take over at 
any time" (hot standby or SSO mode), which will at least stop it crashing 
continuously.

---

In long:

On IRC, [BOB] asked:

> Any idea what's up with bitumen that's making rancid send out so many 
> emails? I see some <thing>failed.txt logs appearing on bitumen but 
> wouldn't know where to even start investigating

Rancid is the switch configuration tracker:
http://cvs.ucc.asn.au/cgi-bin/viewvc.cgi/rancid/ucc/
(I mucked up the metadata, so all the old switches look like their latest 
config is from today - the SVN log shows otherwise.)

The first place I looked was the event log on bitumen (`show log`), 
which is synced across to Murasoi (/var/log/ucc/cisco.log). It's filled 
with entries like these:

Aug  6 17:10:37 bitumen AWST: %C4K_REDUNDANCY-6-DUPLEX_MODE: The peer Supervisor has been detected
Aug  6 17:10:51 bitumen AWST: %C4K_REDUNDANCY-2-POSTFAIL: POST failure on STANDBY supervisor detected. Supervisor redundancy is **NOT** available. Check and replace failed Supervisor.
Aug  6 17:10:53 bitumen AWST: %C4K_REDUNDANCY-6-MODE: ACTIVE supervisor initializing for sso mode
Aug  6 17:10:53 bitumen AWST: %C4K_REDUNDANCY-3-COMMUNICATION: Communication with the peer Supervisor has been established
Aug  6 17:11:01 bitumen AWST: %C4K_REDUNDANCY-5-CONFIGSYNC: The bootvar has been successfully synchronized to the standby supervisor
Aug  6 17:11:01 bitumen AWST: %C4K_REDUNDANCY-5-CONFIGSYNC: The config-reg has been successfully synchronized to the standby supervisor
Aug  6 17:11:01 bitumen AWST: %C4K_REDUNDANCY-5-CONFIGSYNC: The startup-config has been successfully synchronized to the standby supervisor
Aug  6 17:11:01 bitumen AWST: %C4K_REDUNDANCY-5-CONFIGSYNC: The private-config has been successfully synchronized to the standby supervisor
Aug  6 17:11:04 bitumen AWST: %C4K_REDUNDANCY-5-CONFIGSYNC_RATELIMIT: The vlan database has been successfully synchronized to the standby supervisor
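
To get a feel for how often the standby is cycling, the entries can be 
tallied by message type. This is only a rough sketch: it assumes each entry 
sits on a single line in the synced log, and the function name and the 
abridged sample lines are made up for illustration.

```python
import re
from collections import Counter

# Matches the %FACILITY-SEVERITY-MNEMONIC tag of an IOS syslog entry,
# e.g. "%C4K_REDUNDANCY-2-POSTFAIL:".
TAG = re.compile(r"%([A-Z0-9_]+)-(\d)-([A-Z0-9_]+):")

def count_events(lines):
    """Count log entries per (facility, mnemonic) pair."""
    counts = Counter()
    for line in lines:
        match = TAG.search(line)
        if match:
            counts[(match.group(1), match.group(3))] += 1
    return counts

# Abridged sample entries taken from the log excerpt above.
sample = [
    "Aug  6 17:10:37 bitumen AWST: %C4K_REDUNDANCY-6-DUPLEX_MODE: The peer Supervisor has been detected",
    "Aug  6 17:10:51 bitumen AWST: %C4K_REDUNDANCY-2-POSTFAIL: POST failure on STANDBY supervisor detected.",
    "Aug  6 17:11:01 bitumen AWST: %C4K_REDUNDANCY-5-CONFIGSYNC: The bootvar has been successfully synchronized",
    "Aug  6 17:11:01 bitumen AWST: %C4K_REDUNDANCY-5-CONFIGSYNC: The config-reg has been successfully synchronized",
]

counts = count_events(sample)
for (facility, mnemonic), n in sorted(counts.items()):
    print(f"{facility} {mnemonic}: {n}")
```

Passing an open handle on /var/log/ucc/cisco.log instead of `sample` should 
work the same way, since the function only iterates over lines.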

The second thing was to see the diffs in Rancid that [BOB] was complaining 
about:
http://cvs.ucc.asn.au/cgi-bin/viewvc.cgi/rancid/ucc/configs/bitumen.ucc.gu.uwa.edu.au?view=log&pathrev=387

Looks like on August 4, a whole bunch of files appeared (then disappeared, 
then reappeared, and so on). Revision 373 is the useful one.

bitumen#dir bootflash:
%Error opening bootflash:/ (No more memory for file record)

I tried running `squeeze bootflash`, but that didn't work either (no 
deleted files to squeeze out).

Revision 373 did include a list of the files, though, so I could delete 
them by name:
bitumen#delete bootflash:
Delete filename []? post-2017.05.01.09.16.10-failed.txt
Delete bootflash:post-2017.05.01.09.16.10-failed.txt? [confirm]
(repeat 7000 times to remove half of the entries, running for four hours 
at last count, done via vim and a tmux paste buffer)
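
For anyone repeating this, the paste buffer can be generated rather than 
built by hand. A minimal sketch, assuming the Rancid revision has been saved 
as text and that the failed-POST files all follow the naming pattern seen 
above; the function name and the sample listing are hypothetical, and the 
three-line sequence simply mirrors the prompts in the transcript.

```python
import re

# File names like post-2017.05.01.09.16.10-failed.txt
POST_LOG = re.compile(r"post-\d{4}(?:\.\d{2}){5}-failed\.txt")

def paste_buffer(listing, device="bootflash:"):
    """Build the text to paste into the switch console.

    For each file: the delete command, the file name (answering the
    "Delete filename []?" prompt), and an empty line (answering the
    "[confirm]" prompt).
    """
    names = sorted(set(POST_LOG.findall(listing)))
    return "".join(f"delete {device}\n{name}\n\n" for name in names)

# Hypothetical excerpt of a saved file list; the real Rancid layout may differ.
listing = """
 !Flash: bootflash: post-2017.05.01.09.16.10-failed.txt
 !Flash: bootflash: post-2017.05.01.09.21.42-failed.txt
 !Flash: bootflash: some-other-file.bin
"""

buffer = paste_buffer(listing)
print(buffer, end="")
```

Only names matching the pattern end up in the buffer, so anything else in 
the listing is left alone.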

Meanwhile, I realised that the oldest file clogging up the flash was from 
May, so the full flash was a symptom and not the actual problem.

The documentation for the current OS is at 
https://www.cisco.com/c/en/us/td/docs/switches/lan/catalyst4500/12-2/54sg/configuration/guide/config.html

In particular, the chapter on supervisor engine redundancy describes two 
modes: SSO (hot spare) and RPR (cold spare). Our system was set up with 
SSO redundancy, but it looks like the standby supervisor was crashing 
continuously. Changing the configuration to RPR mode meant that the second 
supervisor engine didn't crash any more, which gave me a chance to inspect 
the POST logs in slavebootflash:

---
bitumen# cd slavebootflash:
bitumen# dir
bitumen# more post-2017.08.06.03.48.54-failed.txt
Power-on-self-test for Module 2:  WS-X4515
 Port/Test Status: (. = Pass, F = Fail, U = Untested)
...
Switch Subsystem Memory ...
49: . 50: . 51: . 52: . 53: . 54: F 55: .

Module 2 Failed
---
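
With thousands of these files around, picking out the failing test numbers 
by eye gets tedious. Here is a small sketch that does it, assuming results 
always appear as "number: status" pairs as in the excerpt above; the 
function name is made up for illustration.

```python
import re

# A test result looks like "54: F", where the status is one of
# "." (pass), "F" (fail) or "U" (untested).
RESULT = re.compile(r"(\d+):\s+([.FU])(?=\s|$)")

def failed_tests(post_log):
    """Return the numbers of all tests marked F in a POST log."""
    return [int(num) for num, status in RESULT.findall(post_log) if status == "F"]

# The excerpt from the POST log quoted above.
sample = """Power-on-self-test for Module 2:  WS-X4515
 Port/Test Status: (. = Pass, F = Fail, U = Untested)
...
Switch Subsystem Memory ...
49: . 50: . 51: . 52: . 53: . 54: F 55: .

Module 2 Failed
"""

print(failed_tests(sample))
```

Run over each saved log in turn, this would show whether it is always the 
same memory test that fails.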

So - I think the secondary supervisor module's memory is stuffed, which 
means the supervisor engine won't start. I'm not sure if it's worth trying 
to replace, or just removing the problem part. Setting it to cold standby 
will stop it filling the disk with complaints, and shouldn't affect the 
actual operation of the switch.

[DAA]

