Thursday, July 21, 2016

VSAN upgrade - Dell PowerEdge servers

I have been meaning to write up a VSAN upgrade I recently completed at a customer on Dell R730xd servers with PERC H730 controllers.  This is not going to be a lengthy discussion of the topic; I primarily want to provide some information on the tasks I had to perform for the upgrade to VSAN 6.2.

  1. The VSAN on-disk metadata upgrade is equivalent to doing a SAN array firmware upgrade and therefore requires a good backup and recovery strategy to be in place before you proceed.
  2. Migrate VMs off the host.
  3. Place host into maintenance mode.
    1. Use whatever method updates the firmware quickest, for VSAN's sake. Normally that is the Dell FTP update, if the network is available to configure it.
    2. When you put a host into maintenance mode and choose the "ensure accessibility" option, it doesn't migrate all the components off the host, only enough to keep the objects accessible, so storage policies may be in violation in the meantime.  A timer starts when you power the host off, and if the host isn't back in the VSAN cluster after 60 minutes, VSAN begins to rebuild that host's data elsewhere in the cluster.  If you know the work will take longer than 60 minutes, or wherever possible, select full data migration.
    3. You can view the resync using the RVC command "vsan.resync_dashboard <cluster/host>"
  1. Change advanced settings required for PERC H730
    1. https://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2144936
    2. esxcfg-advcfg -s 100000 /LSOM/diskIoTimeout
    3. esxcfg-advcfg -s 4 /LSOM/diskIoRetryFactor
  2. Upgrade the lsi_mr3 driver. Using VUM makes this easy!
  3. Log in to the DRAC and perform the firmware upgrades:
  4. Upgrade Backplane expander (BP13G+EXP 0:1)
    1. Firmware version 1.09 ->  3.03
  5. Upgrade PERC H730 firmware
    1. 25.3.0.0016 ->  25.4.0.0017
  1. Log in to the Lifecycle Controller and set/verify the BIOS configuration settings for the controller
    1. https://elgwhoppo.com/2015/08/27/how-to-configure-perc-h730-raid-cards-for-vmware-vsan/
    2. Disk cache for non-RAID = disabled
    3. BIOS mode = pause on errors
    4. Controller mode = HBA (non-RAID)
  2. After all hosts are upgraded, verify VSAN cluster functionality and other prerequisites:
    1. Verify there are no stranded objects on the VSAN datastores by running the python script on each host.
    2. Verify persistent log storage for the VSAN trace files.
    3. Verify the PERC H730 advanced settings (diskIoTimeout and diskIoRetryFactor) are still set!
  3. Place each host into maintenance mode again.
  4. Upgrade ESXi host to 6.0U2.
  5. Upgrade the on-disk format to V3.
    1. This task runs for a very long time and has a lot of sub-steps which take place in the background.  It also migrates the data off each disk group so the disk group can be recreated as V3.  This has no impact on the VMs.
    2. This process is repeated for all disk groups.
  6. Verify all disk groups were upgraded to V3.
  7. Completed
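
Two of the checks in the list above lend themselves to a small script: picking the maintenance mode data option from the expected downtime, and confirming the PERC H730 advanced settings are still in place. Below is a minimal sketch in POSIX shell, not a tested tool. The function names are made up for illustration, the 60 minute threshold is the rebuild timer described above, and the parser assumes that `esxcfg-advcfg -g <option>` prints a line of the form "Value of diskIoTimeout is 100000". Check both assumptions against your own hosts.

```shell
#!/bin/sh
# Hypothetical helpers; the names are illustrative, not part of any VMware tool.

# Minutes before VSAN starts rebuilding an absent host's data (from the text).
REPAIR_DELAY_MIN=60

# Pick a maintenance-mode data option from the expected downtime in minutes.
choose_vsan_mode() {
    if [ "$1" -lt "$REPAIR_DELAY_MIN" ]; then
        echo "ensureObjectAccessibility"
    else
        echo "evacuateAllData"
    fi
}

# Extract the value from a line such as "Value of diskIoTimeout is 100000".
# ASSUMPTION: this is the output format of `esxcfg-advcfg -g <option>`.
advcfg_value() {
    echo "$1" | awk '{ print $NF }'
}

# Compare a reported line against the expected value; prints OK or MISMATCH.
check_setting() {
    name=$1; expected=$2; line=$3
    actual=$(advcfg_value "$line")
    if [ "$actual" = "$expected" ]; then
        echo "OK $name=$actual"
    else
        echo "MISMATCH $name=$actual (expected $expected)"
        return 1
    fi
}
```

On a host this could be used as `check_setting diskIoTimeout 100000 "$(esxcfg-advcfg -g /LSOM/diskIoTimeout)"`, and the same for diskIoRetryFactor with the value 4. The mode names returned by `choose_vsan_mode` are the ones I would expect to pass to `esxcli system maintenanceMode set --enable true --vsanmode=<mode>`; verify them on your build before relying on this.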

Ran into some serious trouble and had a resync task that ran for over a week due to the VSAN 6.0 issue described in KB 2141386, which appears under heavy storage utilization.  The only way to fix this was to put the host into maintenance mode with full data migration, then destroy and recreate the disk group.

Also ALWAYS check the VMware HCL to make sure your firmware is compatible. I can never say this enough since it is super important.

This particular VSAN 6.0 was running with outdated firmware for both the backplane and the PERC H730. I also found that the controller was set to RAID for the disks instead of non-RAID (passthrough or HBA mode).
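
Spotting outdated firmware like this comes down to comparing the installed version against a target version. Here is a rough sketch in POSIX shell, using the versions from this upgrade as the targets. The function names are hypothetical, the comparison assumes purely numeric dot-separated fields, and the targets are only what I upgraded to here, not a substitute for checking the HCL.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if dotted version $1 is older than $2.
# ASSUMPTION: every field is purely numeric (e.g. 25.3.0.0016).
version_lt() {
    a=$1; b=$2
    while [ -n "$a" ] || [ -n "$b" ]; do
        # Take the leading field of each version; missing fields count as 0.
        fa=${a%%.*}; fb=${b%%.*}
        [ -n "$fa" ] || fa=0
        [ -n "$fb" ] || fb=0
        if [ "$fa" -lt "$fb" ]; then return 0; fi
        if [ "$fa" -gt "$fb" ]; then return 1; fi
        # Drop the field just compared (and its dot, if any).
        case $a in *.*) a=${a#*.} ;; *) a= ;; esac
        case $b in *.*) b=${b#*.} ;; *) b= ;; esac
    done
    return 1   # versions are equal
}

# Print whether an installed firmware version meets a target version.
firmware_status() {
    name=$1; installed=$2; target=$3
    if version_lt "$installed" "$target"; then
        echo "$name: $installed is older than $target, upgrade needed"
    else
        echo "$name: $installed meets $target"
    fi
}

# Versions found on this cluster versus the versions it was upgraded to.
firmware_status "Backplane expander" 1.09 3.03
firmware_status "PERC H730" 25.3.0.0016 25.4.0.0017
```

The installed versions still have to be read from the DRAC or Lifecycle Controller by hand; the script only does the comparison.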


Links:

VMware has a kick@ass KB on best practices for Dell PERC H730 for VSAN implementation. Links provided below.

https://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2109665

https://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2144614

https://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2144936


https://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2141386