Tag Archives: esxi

Storage vMotion operations timing out

I recently ran into an issue during an ESXi cluster/SAN migration: we were down to a handful of VMs that kept failing when we tried to move them to the new cluster/SAN (using simultaneous compute/storage vMotion operations). This was on vCenter 6.7 and ESXi 6.5.

The errors were:

  • Timed out waiting for migration data. The source detected that the destination failed to resume.
  • Operation timed out.
  • Timed out waiting for migration data. vMotion migration {#######################} failed to read stream keepalive: connection closed by remote host, possibly due to timeout.

I checked all the standard issues (storage issues, vMotion connectivity issues, etc.). The only thing that set these VMs apart from the others in the cluster was the number of virtual disks attached to them. All four of the VMs were SQL Server Availability Group members and had a larger number of disks (5+).

When looking into timeouts related to the number of disks, I came across this VMware article: Using Storage vMotion to migrate a virtual machine with many disks timeout (1010045). The errors in the article were not the same, but it aligned with my suspicion about the number of disks. I couldn’t look at the kernel vpxd logs because they had already rolled over, but I decided to give it a shot. I shut down the problem VMs, set the fsr.maxSwitchoverSeconds configuration parameter to 900 for each one, powered them on, and retried the compute/storage vMotion operations. All vMotion operations completed successfully after this change.
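
For anyone who would rather script the change than click through each VM, here is a minimal sketch of the same idea in Python. It assumes the parameter is written directly into the VM's .vmx file as a key = "value" line while the VM is powered off; setting it through the VM's advanced configuration parameters should have the same effect. The function name set_vmx_option and the sample .vmx contents are made up for illustration, and the sketch only transforms text, so reading and writing the real file (and keeping a backup) is left to you.

```python
import re


def set_vmx_option(vmx_text: str, key: str, value: str) -> str:
    """Return vmx_text with `key = "value"` set exactly once.

    An existing line for the key is replaced in place; if the key is
    missing, the new line is appended at the end.
    """
    new_line = f'{key} = "{value}"'
    # Match the key at the start of a line, ignoring case and spacing.
    pattern = re.compile(rf"^\s*{re.escape(key)}\s*=", re.IGNORECASE)

    out = []
    replaced = False
    for line in vmx_text.splitlines():
        if pattern.match(line):
            if not replaced:
                out.append(new_line)
                replaced = True
            # Any further duplicate lines for the same key are dropped.
        else:
            out.append(line)

    if not replaced:
        out.append(new_line)
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    # Hypothetical .vmx fragment, not taken from a real VM.
    sample = 'displayName = "sql01"\nnumvcpus = "4"\n'
    print(set_vmx_option(sample, "fsr.maxSwitchoverSeconds", "900"))
```

The same helper would work for any other key = "value" option in the file.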

There is also a separate configuration parameter, vmotion.maxSwitchoverSeconds, which controls the compute side of things. It is worth adjusting as well if you are running into vMotion timeouts.

OVA deployment issues in vCenter 6.5+

A team member was recently tasked with deploying a number of OVA templates provided by a vendor. The deployments kept failing after sitting on “Validating” for a long time, usually right after selecting a compute resource in vCenter. The vendor stated they have seen this numerous times with vCenter 6.5 clients, and advised removing a host from the cluster and deploying directly to that host. Being a person who cannot accept hacky workarounds, I decided to dive into it. We are currently on vCenter 6.7 U1 with ESXi 6.5 U3 hosts.

I extracted the OVA and started looking into the OVF XML. Everything looked to be formatted correctly, but I still felt vCenter wasn’t liking something in the XML. I began troubleshooting by commenting out entire <ProductSection> elements of the XML. Commenting out the first set of options did not work, but the second did. A closer look at the second showed a very long ‘ValueMap’ string for the time zone selector in the ovf:qualifiers attribute. The most likely explanation was that its length and complexity were causing the issue.

I decided to clear out the entire ovf:qualifiers attribute (empty quotes) and hard code the value to be ‘America/New_York’. I then saved the OVF, initiated a new deployment (selecting all VMDKs and the OVF, but excluding the MF file as that would cause a checksum error), and hit Next… VOILA! I was able to successfully deploy this OVF without any errors. I also performed the same action for all of the other vendor templates.
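
With several templates to fix, the edit described above can be sketched as a small Python helper. This is only an illustration under a few assumptions: the property key "timezone" is hypothetical (the vendor's real key is not shown here), the function name pin_ovf_property is made up, and the OVF is assumed to use the standard OVF 1.x envelope namespace with the default value stored in the ovf:value attribute. Note that ElementTree rewrites namespace prefixes and drops XML comments when it serializes, so the output is equivalent XML but will not be a byte-for-byte match of the vendor's file.

```python
import xml.etree.ElementTree as ET

# Standard OVF 1.x envelope namespace (assumed to be what the template uses).
OVF_NS = "http://schemas.dmtf.org/ovf/envelope/1"


def pin_ovf_property(ovf_xml: str, key: str, value: str) -> str:
    """Blank a Property's ovf:qualifiers and hard code its ovf:value.

    Raises KeyError if no Property with the given ovf:key is found.
    """
    # Keep the familiar "ovf:" prefix in the serialized output.
    ET.register_namespace("ovf", OVF_NS)
    root = ET.fromstring(ovf_xml)

    changed = 0
    for prop in root.iter(f"{{{OVF_NS}}}Property"):
        if prop.get(f"{{{OVF_NS}}}key") == key:
            # Empty quotes, as in the manual edit: no ValueMap to validate.
            prop.set(f"{{{OVF_NS}}}qualifiers", "")
            prop.set(f"{{{OVF_NS}}}value", value)
            changed += 1

    if changed == 0:
        raise KeyError(f"no Property with ovf:key={key!r}")
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Tiny made-up OVF fragment; a real template has far more in it.
    sample = (
        f'<Envelope xmlns="{OVF_NS}" xmlns:ovf="{OVF_NS}">'
        "<VirtualSystem ovf:id=\"vm\"><ProductSection>"
        "<Property ovf:key=\"timezone\" ovf:type=\"string\" "
        "ovf:qualifiers='ValueMap{\"America/New_York\",\"Europe/London\"}' "
        "ovf:value=\"Europe/London\"/>"
        "</ProductSection></VirtualSystem></Envelope>"
    )
    print(pin_ovf_property(sample, "timezone", "America/New_York"))
```

As with the manual edit, the modified OVF no longer matches the checksum in the MF file, so the manifest still has to be left out of the deployment.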

Original time zone property:

Modified time zone property:

I didn’t dig further, but I imagine the vendor’s standalone host hack worked because the web GUI on the host runs different code than vCenter (and perhaps doesn’t have this bug). I’d also like to note that this could be accomplished with the Import-VApp PowerCLI cmdlet (without modifying any files), but you’d also have to create an OvfConfiguration hashtable object to pass as a parameter, which may be more work than it is worth.