Hector Martin (@marcan@treehouse.systems)
I think I never told this story here... how I fixed a server with a very precisely placed piece of tape.
So at Euskal Encounter we got shiny new servers a few years back, and they worked great except one of them developed a peculiar problem. It would not shut down.
When told to shut down, it would either hang, or boot back up, or power back up but then fail to boot. This was a problem, because we normally relied on servers shutting down and staying down during our shutdown procedure. Having to have someone babysitting the machine to yank the power is not great. Plus it meant that if we ever got into that state, we couldn't fix it remotely (and some events are run remotely). Once the problem happened, no amount of shutdown/power up/reboot commands to the BMC would fix it (eventually it would start logging power control errors).
So we pulled the server out after an event, and sent it for RMA. It came back saying the techs couldn't reproduce the problem. And indeed, we powered it up on the bench, and it seemed fine.
Stuck it back in the rack at the next event, and it stopped working again.
At this point I was thinking this must be some kind of electrical issue caused by mechanical stress, so we tore it apart and jiggled all the cables and made sure all the connections were tight.
No dice.
This whole thing took several years, since we could only really work on the machine during events (and I kind of live halfway across the world). It just kept limping along with that bug since we couldn't find time to dive deep into the issue.
At one point I started thinking... What's the difference between the server being in the rack and not? That all the cables are plugged in, particularly USB and Ethernet cables.
Could it be Wake-on-LAN? So I checked the WoL settings, but WoL was switched off on all the Ethernet interfaces. And besides, we had two identical servers and only one had the issue. I sniffed the network looking for stuff that might pass as a WoL magic packet, but came up empty.
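For reference, a WoL magic packet is six 0xFF bytes followed by the target MAC address repeated 16 times, and it can sit anywhere inside a frame's payload. A minimal sketch of what scanning captured payloads for that pattern might look like follows. This is not the tool used in the story; the function name `find_magic_packet_targets` is hypothetical, and it assumes you already have the raw payload bytes from a capture.

```python
def find_magic_packet_targets(payload: bytes) -> list[str]:
    """Return the MAC addresses targeted by any magic packet pattern in payload.

    Pattern: 6 bytes of 0xFF, then the same 6-byte MAC repeated 16 times.
    """
    sync = b"\xff" * 6
    targets = []
    start = payload.find(sync)
    while start != -1:
        mac = payload[start + 6 : start + 12]
        body = payload[start + 6 : start + 6 + 16 * 6]
        # A truncated payload yields a short body, which never matches.
        if len(mac) == 6 and body == mac * 16:
            targets.append(mac.hex(":"))
        # Keep scanning in case the sync bytes appear again later.
        start = payload.find(sync, start + 1)
    return targets


if __name__ == "__main__":
    # Example MAC chosen for illustration only.
    mac = bytes.fromhex("001122334455")
    frame = b"header" + b"\xff" * 6 + mac * 16 + b"trailer"
    print(find_magic_packet_targets(frame))
```

An empty result for every captured payload would match the "came up empty" outcome above: nothing on the wire looked like a wake request.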
Still, I couldn't find another explanation, so I did the logical test. Unplugged the Ethernet cables, and tested it. It worked fine. Plugged the cables in. The problem reappeared.
Oooookay.
In particular, it was the 4 cables connected to the add-on PCIe network card.
So I swapped the cards on both servers and guess which other server started having buggy shutdowns!
Just in case, I tried upgrading the firmware on the card, but that didn't help.
At this point I'm starting to think about RMAing the card, but that would take time and it'd be hard to explain what the problem is. Buying another card would be an extra expense, and cause us to have different configurations on both servers (which is less desirable).
And then I thought... I'm never going to use this feature, ever. These are servers with BMCs, we can turn them on over IPMI. So this Ethernet card is sending broken/random wake signals to the PCIe slot when it has an Ethernet link? Okay.
I asked for some tape and scissors, pulled the server out again, took the card out, carefully cut out a small sliver of tape, and placed it over the WAKE# pin on the PCIe edge connector. Put it all back together and tested it again.
Problem fixed.