Tuesday, July 2, 2013

ESXi 5.1 - network port lost connection without notification - Dell PowerEdge

I ran into a very serious bug the other day which caused all our production virtual servers on a specific ESXi host to lose network connectivity without any notifications or alarms from VMware vCenter Server.  In fact, VMware was completely unaware of the problem and just continued working as normal with all our servers offline.

Debugging the problem:

I reviewed all the physical ports and found that a particular NIC had lost its VLANs. VMware did not recognize this, so all VMs using that NIC were left in a disconnected state.  No network errors were detected on the Cisco physical switches or by VMware, so no failover took place.



Quick fix:

Went to Teaming and Failover within each port group where VMs were affected and removed the problem NIC from the active uplinks.

VMware support case opened:

Dell PowerEdge R720
BIOS Version 1.3.6
Firmware Version 1.23.23 (Build 01)

Broadcom Gigabit Ethernet BCM5719
Family Firmware Version 7.2.20

ESXi 5.1 799733

The problem appears to be with the Broadcom Gigabit network cards and the current firmware version.

Resolution:

Patch the Broadcom tg3 driver with the latest patch, ID:  tg3-3.129d.v50.1
I upgraded the ESXi hosts to the latest 5.1.0 build, 1117900, using Update Manager. I created a baseline group that includes the critical and non-critical host patches as well as the Broadcom tg3 baseline.

or

To work around this issue, disable the NetQueue feature.
If there are 10 Gb NICs on the host in addition to the tg3 NICs, then only disable NetQueue for the tg3 driver.
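
To work out which other hosts still need the patch or the workaround, a small script can help. The following is only a rough sketch in Python, not something that came out of the support case. The host names and the inventory list are made up for illustration, and using the build number 1117900 as a cut-off is my own simplification: what really matters is the tg3 driver version installed on the host.

```python
from dataclasses import dataclass

# Values taken from this post. Using the build number as a threshold is an
# assumption; the installed tg3 driver version is what actually matters.
PATCHED_ESXI_BUILD = 1117900
AFFECTED_DRIVER = "tg3"


@dataclass
class Host:
    name: str           # hypothetical host name
    esxi_build: int     # ESXi build number reported by the host
    nic_drivers: tuple  # driver names of the physical NICs


def needs_attention(host: Host) -> bool:
    """Return True if the host has tg3 NICs and runs a build older than the patched one."""
    return AFFECTED_DRIVER in host.nic_drivers and host.esxi_build < PATCHED_ESXI_BUILD


# Hypothetical inventory, entered by hand for this example.
hosts = [
    Host("esx01", 799733, ("tg3", "tg3")),
    Host("esx02", 1117900, ("tg3",)),
    Host("esx03", 799733, ("ixgbe",)),
]

flagged = [h.name for h in hosts if needs_attention(h)]
print(flagged)
```

With the sample inventory above, only the first host is flagged: it has tg3 NICs and is still on build 799733. The other two are skipped, one because it is already on the patched build and one because it has no tg3 NICs.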

Links:

http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2035701