Wednesday, April 13, 2016

ESXi 6: cannot synchronize host

Today an error popped up on the vCenter server showing the vpxa service in a yellow state. The ESXi host was also reporting "Cannot synchronize host" as well as "Quick stats on <host> is not up-to-date".

Troubleshooting:

After some investigation I found that one of the new hosts did not have a DNS entry created.


Fix:
This should be a given by now, but always test forward and reverse DNS lookups for an ESXi host before adding it to vCenter Server!
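
A quick way to run that forward and reverse test from any machine with Python is sketched below. This is only a minimal sketch: the function name `check_dns` is made up for this post, and it assumes you pass the same FQDN you intend to use when adding the host to vCenter. The two resolver arguments default to the standard `socket` calls and exist only so the check can be exercised without a live DNS server.

```python
import socket


def check_dns(hostname, forward=socket.gethostbyname, reverse=socket.gethostbyaddr):
    """Return forward (A) and reverse (PTR) lookup results for hostname."""
    result = {"hostname": hostname, "ip": None, "ptr": None, "ok": False}
    try:
        # Forward lookup: name -> IP address
        result["ip"] = forward(hostname)
        # Reverse lookup: IP address -> primary name (first tuple element)
        result["ptr"] = reverse(result["ip"])[0]
    except OSError:
        # socket.gaierror and socket.herror are both subclasses of OSError
        return result
    # The PTR name should match the name being added, ignoring case and a trailing dot
    result["ok"] = result["ptr"].rstrip(".").lower() == hostname.rstrip(".").lower()
    return result
```

Calling `check_dns("esxi01.example.com")` (a placeholder name) returns a dict whose `ok` field is only true when both lookups succeed and agree with each other.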

vCenter Server 6.0U2 errors - "lost access to volume"

Recently upgraded an environment from vCenter Server 5.5 to 6.0U2.
Hardware consists of Cisco UCS with B200M3 blades and XtremIO storage.

After the upgrade, users complained about slow and dropped connections to their VMs.

Troubleshooting:

Installed a host with vCenter Server 6.0U1 and did not get the error message, which was very strange. So what changed between 6.0U1 and U2?
After reviewing the logs, found "lost access to volume" errors recurring roughly every 30 minutes.
Further digging through the logs revealed that this only happens on the XtremIO datastores.

The vmkernel log file on the ESXi host also contained the following warning message:

WARNING: NMP: nmp_PathDetermineFailure:2872: Cmd (0x85) PDL error (0x5/0x25/0x0) - path vmhba4:C0:T0:L10 device naa.514f0c514ba0000e - triggering path evaluation
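
To see which paths and devices are affected across a large vmkernel log, the fields in that warning can be pulled out with a small script. The sketch below is based only on the single line shown above, so the regular expression is an assumption about the format and may need adjusting for other builds; `parse_pdl_warning` is a name made up for this post. For reference, in the SCSI standards sense key 0x5 is ILLEGAL REQUEST, ASC/ASCQ 0x25/0x00 is LOGICAL UNIT NOT SUPPORTED, and opcode 0x85 is ATA PASS-THROUGH (16).

```python
import re

# Pattern modeled on the one warning line quoted in this post (an assumption,
# not an official log format definition).
PDL_WARNING = re.compile(
    r"Cmd \((?P<cmd>0x[0-9a-fA-F]+)\) PDL error "
    r"\((?P<key>0x[0-9a-fA-F]+)/(?P<asc>0x[0-9a-fA-F]+)/(?P<ascq>0x[0-9a-fA-F]+)\)"
    r" - path (?P<path>\S+) device (?P<device>\S+)"
)


def parse_pdl_warning(line):
    """Return the SCSI details from an NMP PDL warning, or None if no match."""
    m = PDL_WARNING.search(line)
    if m is None:
        return None
    return {
        "cmd": int(m["cmd"], 16),        # SCSI opcode that triggered the error
        "sense_key": int(m["key"], 16),  # 0x5 = ILLEGAL REQUEST
        "asc": int(m["asc"], 16),        # additional sense code
        "ascq": int(m["ascq"], 16),      # additional sense code qualifier
        "path": m["path"],               # e.g. vmhba4:C0:T0:L10
        "device": m["device"],           # naa identifier of the LUN
    }
```

Running this over every line of the log and counting the `device` values makes it easy to confirm that only the XtremIO LUNs show up.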


Found the following KB articles from EMC and VMware which relate to this issue:

https://support.emc.com/kb/467750  (need login to view)

vSphere 6 added the new VMCP (VM Component Protection) feature, which makes a clear distinction between PDL (Permanent Device Loss) and APD (All Paths Down) conditions.
Good KB from VMware:

This issue relates to XtremIO firmware earlier than 4.0.1, which returns an "illegal request" response to the SMART data request from the vSphere 6.0 host. That response triggers path evaluation for a PDL condition.


Fix:

Upgrade the XtremIO firmware to 4.0.1 or later; the latest release is recommended.
This issue could also affect other storage arrays, so please make sure to check with VMware on this and keep the VMware KB as a live bookmark.
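
If you keep an inventory of array firmware versions, the 4.0.1 threshold from the fix is easy to check in a script. The sketch below assumes the version is a dotted numeric string, optionally followed by a build suffix after a hyphen; that suffix format and the names `parse_version` and `needs_upgrade` are assumptions for illustration, so check how your array actually reports its version.

```python
FIXED_IN = (4, 0, 1)  # first XtremIO firmware version that contains the fix


def parse_version(text):
    """Turn '4.0.1' (or an assumed '4.0.1-41' style string) into (4, 0, 1)."""
    numeric = text.strip().split("-", 1)[0]  # drop any build suffix
    return tuple(int(part) for part in numeric.split("."))


def needs_upgrade(version):
    """True when the firmware is older than the fixed release."""
    return parse_version(version) < FIXED_IN
```

Comparing tuples of integers avoids the usual string-comparison trap where "4.0.10" would sort before "4.0.2".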

At the end of the day, make sure to check the VMware Compatibility Guide.

vCenter Server 6.0U1 & 6.0U2 - ESXi hosts disappear with installation of NetApp VSC plugin 6.1 or 6.2

Recently performed a vCenter Server upgrade from 5.1 to 6.0U2.
The NetApp VSC vCenter plugin was also in use for backups, run through the plugin in the VIC.

The greenfield installation of the vCenter Server 6 environment went without a hitch, and the hosts were migrated over.
After I upgraded the hosts to ESXi 6.0U2, I had to upgrade the NetApp VSC plugin.

The plugin upgrade process was pretty straightforward: move the plugin to a new server and point it at the new vCenter server.
NetApp has a pretty good KB on how to preserve existing repository information when migrating the app to a new server - https://kb.netapp.com/support/index?page=content&id=1011871

Problem:

After the VSC plugin upgrade I ran into a strange issue where ESXi hosts kept randomly disappearing from the vCenter Web Client, while nothing else was affected.

Troubleshooting:

Restarted the vCenter server and tested with different browsers, with no effect.
Disabled the plugin and restarted the browser, which resolved the issue.
Uninstalled 6.2 and installed 6.1, which seems to give a better experience: hosts do not disappear as frequently.
The NetApp community has a couple of posts related to the issue, with no real positive response from NetApp.

Fix:

Opened a case with NetApp and was provided BUG ID 986313 for the issue, with no estimate on a fix.
The bug can be viewed on NetApp support -> Tools -> Bug Tools.

Workaround:

The workaround at this time seems to be downgrading to (or installing) VSC 6.0, which has the fewest problems.