vSphere Soft Affinity Rules

While working with a customer to set up a stretched metro vSphere cluster, we found a bug in VMware’s soft affinity rules. The customer environment consisted of two data centers, each with three ESXi hosts and 3PAR arrays set up with Peer Persistence. The idea was to create one DRS group for data center A and one for data center B. The VMs would be added to the DRS group for data center A with a "should" (soft affinity) rule. The customer would also have a domain controller (DC) and DNS server on the data center B side, so that during a complete data center A outage, HA could power the VMs back on in data center B with AD and DNS services already running there to serve them.

 

When we did the actual HA testing, we noticed that the soft affinity rules were not being honored and VMs were being powered back on in the wrong data center. This might not be a big issue for many customers, as most environments are set up with DRS set to Fully Automated, which would have migrated the VM after HA powered it on. This customer, however, has an application that is extremely sensitive to latency, and because of that they keep DRS set to either Manual or Partially Automated. After a lot of troubleshooting, we opened a support case with VMware, and support discovered that this bug exists in vSphere 6.0 and vSphere 6.5. The only workaround is to add the VM to the DRS rules while the VM is powered off. If the VM is powered on when it is added to the soft affinity rule, the rule is ignored. The support engineer said that this will be fixed in the vSphere 6.7 release, but had no timeline and could not promise that the bug would be addressed in 6.0 or 6.5 builds.
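
To make the workaround concrete, here is a minimal Python sketch of how one might plan it: given a list of VMs and their power states, it separates the VMs that can be added to the DRS group right away (already powered off) from those that would need a shutdown first. The `VmState` class, the `plan_group_additions` function, and the VM names are all hypothetical and only illustrate the logic. Nothing here calls vCenter, and in practice the power states would come from your own inventory tooling.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class VmState:
    """Hypothetical inventory record: a VM name and its current power state."""
    name: str
    powered_on: bool


def plan_group_additions(vms: Iterable[VmState]) -> Tuple[List[str], List[str]]:
    """Split VMs by whether they can be added to the DRS group right now.

    Per the workaround described above, a VM has to be powered off when it
    is added to the soft affinity rule, otherwise the rule is ignored.
    """
    vms = list(vms)
    add_now = [vm.name for vm in vms if not vm.powered_on]
    power_off_first = [vm.name for vm in vms if vm.powered_on]
    return add_now, power_off_first


# Example inventory (made-up names).
inventory = [
    VmState("app01", powered_on=True),
    VmState("app02", powered_on=False),
    VmState("db01", powered_on=True),
]

add_now, power_off_first = plan_group_additions(inventory)
print("Safe to add now:", add_now)
print("Power off before adding:", power_off_first)
```

The second list is effectively the maintenance-window list: those VMs need a planned shutdown before their group membership will be honored during an HA restart.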