I’ve had several issues with PSODs on ESX hosts during SRM tests. This post has nothing to do with the PSOD or SRM at all, but with my attempts to deal with VMware support.
I attempted to run the `vm-support` script to gather logs while opening a support case with VMware. The script never finished: it would create the directory and then hang after collecting 28k of data. `ps` output showed two instances of the support script running, both in an uninterruptible sleep state (STAT code D). Also in a zombie state were ‘modprobe usb-storage’ and ‘find /proc -type f’.
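For anyone wanting to check their own hosts for the same symptom, here is a minimal sketch of how to pick the stuck processes out of `ps` output. The helper name `filter_d_state` is my own invention, and I'm assuming the `ps` on your service console accepts the `axo pid,stat,comm` form; adjust the columns if yours differs.

```shell
#!/bin/sh
# filter_d_state is a hypothetical helper name, not part of vm-support.
# Reads "PID STAT COMMAND" lines on stdin, skips the header row, and
# prints only processes whose STAT begins with D (uninterruptible sleep).
filter_d_state() {
    awk 'NR > 1 && $2 ~ /^D/ { print $1, $2, $3 }'
}

# Assumes this ps supports user-defined output columns via "o".
ps axo pid,stat,comm | filter_d_state
```

Any `vm-support`, `find`, or `modprobe` entries that show up here and never go away are worth a closer look.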
Hardware is IBM HS21XM Blades. I have quite a few of these blades, of slightly different hardware models, in use. I found that every single one of the blades in one particular chassis exhibited the same behavior. Fortunately, this chassis is not in my production data center. I updated firmware and BIOS on all of the blades with no change. I also disabled the onboard USB storage device, again with no change.
Going back to software, I started looking at the collection of zombie processes for clues. The modprobe process was zombied on all of my hosts, and the find process consistently hung whenever I ran `vm-support`. Reading the code of the support script, I could see where it was collecting files in /proc, so this makes perfect sense. Turning to USB, I chased down the usb-storage module and attempted to blacklist it by creating /etc/modprobe.d/blacklist and adding a blacklist entry. After a reboot I found this is no longer a valid configuration, at least not in ESX. Some research turned up the gem that allowed me to keep the usb-storage module from loading: add the following at the end of /etc/modules.conf:
install usb-storage /bin/true
A reboot proved that this successfully blocked loading of the module. At this point I manually ran a `find` command in /proc and it did not hang. Running `vm-support` then collected the logs as expected.
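If you want to confirm the module really stayed out after the reboot, something like the following should do it. This is only a sketch: `module_loaded` is a name I made up, and I'm assuming the loaded modules are listed in /proc/modules with the module name in the first column. Kernels differ on whether they report the name with a hyphen or an underscore, so the check treats the two alike.

```shell
#!/bin/sh
# module_loaded is a hypothetical helper. It returns 0 if the named module
# appears in the first column of the given listing (default /proc/modules).
# Hyphens and underscores are normalized before comparing.
module_loaded() {
    want=$(printf '%s' "$1" | sed 's/-/_/g')
    file=${2:-/proc/modules}
    awk -v want="$want" '
        { name = $1; gsub(/-/, "_", name); if (name == want) found = 1 }
        END { exit !found }
    ' "$file"
}

if module_loaded usb-storage; then
    echo "usb-storage is still loaded"
else
    echo "usb-storage is not loaded"
fi
```

Once it reports the module as not loaded, re-running the `find` in /proc by hand is a cheap test before kicking off `vm-support` again.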
Now, troubleshooting the PSOD remains but at least I have logs.
Initially VMware suggested this might be due to IRQ sharing, as noted in a KB article. I saw no large difference in interrupt counts between cpu0 and any of the other CPUs. I did not remove the reference to usb-uhci and am not experiencing any further issues.
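For the interrupt comparison, a rough way to eyeball the per-CPU totals is to add up the columns in /proc/interrupts. The sketch below assumes the usual layout (a header row of CPU names, then one row per IRQ with one count per CPU); `sum_irqs_per_cpu` is a name I picked for illustration. What counts as a "large difference" is a judgment call that this script does not make for you.

```shell
#!/bin/sh
# sum_irqs_per_cpu is a hypothetical helper. It totals the per-CPU interrupt
# counts from a /proc/interrupts style file (default /proc/interrupts).
sum_irqs_per_cpu() {
    awk '
        NR == 1 { ncpu = NF; next }          # header row: one field per CPU
        {
            # Fields 2..ncpu+1 hold the counts; skip anything non-numeric.
            for (i = 2; i <= ncpu + 1 && i <= NF; i++)
                if ($i ~ /^[0-9]+$/) total[i - 2] += $i
        }
        END { for (c = 0; c < ncpu; c++) printf "CPU%d %d\n", c, total[c] }
    ' "${1:-/proc/interrupts}"
}

# Only run against the live file where it exists.
[ -r /proc/interrupts ] && sum_irqs_per_cpu
```

If cpu0 were carrying far more interrupts than the rest, the IRQ-sharing theory would look more plausible; on my hosts the counts were comparable.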