vsphere

I recently did a complete overhaul of my home ESXi home lab environment. With the new capacity/reliability came the desire to move as much onto it as possible. One of these items was my Home Assistant Hass.io instance which was running on a Raspberry Pi 3 (and originally a Raspberry Pi 1 B). Running it on the Pi has always come with painfully slow reboot and update times. With VM real estate available I see no reason to rely on mini computers to run various workloads around my home. I can re-purpose these boards for other projects. ESXi will also bring the ability to do machine-level snapshots which will be more complete and easier to revert to than the snapshot mechanism within Hass.io.

The main issue I ran into was with the VMDK. The way the VMDK was created it split into multiple files and I couldn’t consolidate or delete snapshots. The VMDK was also getting locked preventing vMotions. To get around this I cloned the VMDK in shell using vmkfstools. I also had to use the proper storage controller, network adapter, and firmware settings. I’ve listed all steps below:

Create a new VM with the following parameters:
- Guest OS – CentOS 7 (64-bit) (The VM will adjust this automatically later)
- 1 vCPU
- 2GB RAM
- 1x E1000 NIC (NIC will not be usable as VMXNET 3)
- Remove any other devices like CD-ROM, hard disk, SCSI controller, etc.

Download the latest stable HassOS VMDK from https://github.com/home-assistant/hassos/releases
Copy this VMDK up to the VM directory in ESXi/vSphere
Open an SSH session to the ESXi host and change directory to the location of the VMDK you just copied (ex. cd /vmfs/volumes/datastore1/HASSOSVM)
Clone the VMDK using the following command: vmkfstools -i “hassos_ova-2.8.vmdk” “hassos_ova-2.8_new.vmdk” (This creates a thick copy of the disk and avoids locking/snapshot issues with the virtual disk)
Delete the original VMDK from the datastore as it is no longer needed
Edit the VM and add an existing hard drive selecting the VMDK you just cloned
Change the controller for the disk from “New SCSI controller” to “IDE 0”
Remove the newly added SCSI controller as it is not needed for the IDE virtual disk

Go to “VM Options” and change the Firmware from “BIOS” to “EFI”

After all this has been completed you just need to power on the VM. Assuming DHCP is configured properly on the network Hass.io is using it will pick up an IP and start configuring. With Hass.io running on an older Xeon-powered host I have never seen the VM get over 50-60 percent CPU utilization and even then I’ve only seen those spikes when running an update. Updates of HassOS and Hass.io take a minute or two when they would sometimes take up to 10-15 minutes when running on a Pi.

I recently ran across an issue during a ESXi cluster/SAN migration where we were down to a handful of VMs that were failing when trying to move them to the new cluster/SAN (using simultaneous compute/storage vMotion operations). I’d like to note that this was on vCenter 6.7 and ESXi 6.5.

The errors were:

Timed out waiting for migration data. The source detected that the destination failed to resume.
Operation timed out.
Timed out waiting for migration data. vMotion migration {#######################} failed to read stream keepalive: connection closed by remote host, possibly due to timeout.

I looked at all the standard issues (storage issues, vMotion connectivity issues, etc.) When looking at the VMs the only thing that made them different compared others in the cluster was the number of virtual disks attached to them. All four of the VMs were SQL Server Availability Group members and had a larger number of disks (5+). When looking into timeouts related to the number of disks I came across this VMware article: Using Storage vMotion to migrate a virtual machine with many disks timeout (1010045). The errors in the article were not the same, but it it aligned with my suspicion about the number of disks. I couldn’t look at the kernel vpxd logs because they had already rolled over, but I decided to give it a shot. I shutdown the problem VMs, set the fsr.maxSwitchoverSeconds configuration parameter to 900 for each one, powered them on, and retried the compute/storage vMotion operations. All vMotion operations completed successfully after this change.

I would like to note that there is a separate configuration parameter called vmotion.maxSwitchoverSeconds which controls the compute side of things. You can try adjusting this as well when having vMotion timeout issues.

Maniacal Methods

A repository of my various technical projects and findings

Tag Archives: vsphere

Deploying Home Assistant Hass.io to ESXi 6.5

Storage vMotion operations timing out