Zombied modprobe and more on IBM HS21xm blades with internal flash drive

Looks like I’ve found yet another issue with the IBM HS21 blade, specifically the xm with the embedded USB flash module and ESX 3.5.  I don’t believe I had any issue with 3.0, and I think this all might have started when I applied firmware in 2008.  I’ll try to get the exact firmware revision and update this post appropriately.

The first symptom presented as a failure to patch via VMware Update Manager.  The update would fail in the midst of, or just after, applying a kernel module, and the real damage showed up on reboot because the initrd image had apparently not been rebuilt.  I attempted to hack my way back in via Grub but decided I wouldn’t be comfortable with the state of ESX after that and simply rebuilt.  I was able to reproduce this consistently on 3.5, but the issue never occurred on ESXi… obviously there’s no initrd in the non-existent service console.
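
For anyone hitting something similar, here’s a rough sketch of the kind of sanity check you can run from the service console after a failed patch (paths are the typical ESX 3.x console locations; adjust for your build):

    # Compare the initrd timestamp against the kernel the patch installed --
    # if the initrd predates the new kernel and modules, it was never rebuilt.
    ls -l /boot/vmlinuz* /boot/initrd*

    # Confirm which kernel/initrd pair Grub will actually boot.
    grep -A4 "^title" /boot/grub/grub.conf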

Life was busy, so patching took a low priority as I moved data centers and started a project deploying RecoverPoint and SRM.  During some testing with SRM I started experiencing PSODs during the tests.  Twice I was using my current TEST cluster as the recovery target, which definitely caused some issues when I crashed the host.

During a call with support I attempted to get vm-support scripts to run and they would consistently hang immediately.  After many hours of work we found that I had zombied modprobe processes upon boot of an ESX box and also zombied find processes after executing vm-support.  We noted that the find process appeared to be failing on /proc/usb, and a manual execution proved that to be the case.  I attempted to disable the USB modules in the BIOS and applied the latest firmware, with no effect.  I finally blacklisted the usb-storage module on (exactly one) host from loading on boot by inserting “install usb-storage /bin/true“ into /etc/modules.conf.  This resolved the issue with executing the support script, and research continued on my PSODs.  A patch was released shortly after my server crashes that resolved a storage issue during rescans.
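
For reference, the blacklist line is exactly what I dropped into /etc/modules.conf; the diagnostic commands are just the generic service console equivalents of what we were poking at (a sketch, not a transcript of the support call):

    # Spot the defunct (zombie) modprobe/find processes after boot or after vm-support.
    ps -ef | grep -i defunct

    # Reproduce the hang by hand -- find never comes back once it hits /proc/usb.
    find /proc/usb

    # Keep usb-storage from loading at boot by mapping its install to /bin/true.
    echo 'install usb-storage /bin/true' >> /etc/modules.conf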

The fact that a patch existed that engineering said would fix my issue was all well and good, but attempting to patch destroyed my hosts.  Open VMware support case number 2.  Again, many hours of research time were spent and the root problem proved to be elusive.  I forget exactly what led me to the action (which makes this whole post somewhat silly), but I offered to unload and blacklist the usb-storage module to assist in some troubleshooting.  Turns out this led to a complete resolution of the patching issue as well.  In retrospect it makes perfect sense, especially after tying the patch application failure to kernel modules.  Applying the patches from 6/30/2009 resolved my host crash issue during SRM tests.
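
Roughly, the runtime side of that workaround looked something like this on each host before kicking off the patch run (a sketch from memory, not the exact sequence; note the module shows up with an underscore in lsmod):

    # Make sure usb-storage isn't resident before patching.
    lsmod | grep usb_storage

    # Unload it if it is, then apply the patches and let the initrd rebuild cleanly.
    rmmod usb-storage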

After that I ripped out all of the embedded flash modules and life was good.

The point of the whole story: there appears to be a perfect-storm pairing of HS21 hardware, firmware revision, flash module, and ESX version that totally breaks the usb-storage module.  The worst part is that this was the first encounter VMware had with the issue.  Also, I deployed some newer blades of the same model with current firmware and the issue does not exist there.