Category Archives: VMware

Citrix NetScaler – Fatal trap 9: general protection fault while in kernel mode

The other day one of our NetScaler appliances was unable to boot up after a power down. It was getting stuck during the FreeBSD bootup phase (before the NetScaler software actually loads) with the error:

Fatal trap 9: general protection fault while in kernel mode

The only information I could find on this specific issue was here: https://support.citrix.com/article/CTX238252, but this was not relevant to us. I could not find anything else online talking about receiving this error on a NetScaler appliance. Restoring to previous snapshots of the appliance didn’t resolve the issue. After some digging I found that this VM was set to the highest VM compatibility level. At some point someone had scheduled the compatibility level of the VM to be upgraded to version 15, but this didn’t take effect until the VM was actually powered down (it had been rebooted many times since without issues).

To remediate this issue I did the following:

  • Removed the VM from inventory
  • Manually edited the ‘virtualHW.version’ line in the VMX file to read virtualHW.version = "4". I chose a lower version so that I could use the GUI to upgrade the version later. The file can be downloaded and edited using WinSCP or something similar
  • Added VM back to inventory
  • Upgraded VM compatibility to version 7 in vCenter to let the system actually run through the VMX and check settings

After doing all of the above I was able to successfully boot up the NetScaler appliance. The main takeaway here is that the ‘fatal trap’ error was directly related to the VM compatibility setting in ESXi in this particular case.
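The VMX edit in the steps above can also be scripted. Below is a minimal Python sketch, assuming the VMX file has already been downloaded and read as text. The function name `set_hw_version` and the sample file contents are made up for illustration, and the VM should still be removed from inventory before the file is changed.

```python
import re


def set_hw_version(vmx_text: str, version: int) -> str:
    """Return vmx_text with the virtualHW.version line set to `version`."""
    pattern = re.compile(
        r'^([ \t]*virtualHW\.version[ \t]*=[ \t]*)"[^"]*"',
        re.MULTILINE | re.IGNORECASE,
    )
    # A function is used as the replacement so the digits of the version
    # are never misread as part of a backreference.
    new_text, count = pattern.subn(lambda m: f'{m.group(1)}"{version}"', vmx_text)
    if count != 1:
        raise ValueError(f"expected one virtualHW.version line, found {count}")
    return new_text


# Hypothetical VMX contents; only the virtualHW.version line matters here.
sample_vmx = (
    '.encoding = "UTF-8"\n'
    'virtualHW.version = "15"\n'
    'displayName = "netscaler01"\n'
)

print(set_hw_version(sample_vmx, 4))
```

Raising an error when the line is missing or duplicated is a deliberate choice: a silent no-op would leave the VM at the version that caused the boot failure.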

Deploying Home Assistant Hass.io to ESXi 6.5

I recently did a complete overhaul of my ESXi home lab environment. With the new capacity/reliability came the desire to move as much onto it as possible. One of these items was my Home Assistant Hass.io instance, which was running on a Raspberry Pi 3 (and originally a Raspberry Pi 1 B). Running it on the Pi has always come with painfully slow reboot and update times. With VM real estate available I see no reason to rely on mini computers to run various workloads around my home, and I can re-purpose these boards for other projects. ESXi will also bring the ability to do machine-level snapshots, which will be more complete and easier to revert to than the snapshot mechanism within Hass.io.

The main issue I ran into was with the VMDK. The way the VMDK was created it split into multiple files and I couldn’t consolidate or delete snapshots. The VMDK was also getting locked preventing vMotions. To get around this I cloned the VMDK in shell using vmkfstools. I also had to use the proper storage controller, network adapter, and firmware settings. I’ve listed all steps below:

  • Create a new VM with the following parameters:
    • Guest OS – CentOS 7 (64-bit) (The VM will adjust this automatically later)
    • 1 vCPU
    • 2GB RAM
    • 1x E1000 NIC (NIC will not be usable as VMXNET 3)
    • Remove any other devices like CD-ROM, hard disk, SCSI controller, etc.
  • Download the latest stable HassOS VMDK from https://github.com/home-assistant/hassos/releases
  • Copy this VMDK up to the VM directory in ESXi/vSphere
  • Open an SSH session to the ESXi host and change directory to the location of the VMDK you just copied (ex. cd /vmfs/volumes/datastore1/HASSOSVM)
  • Clone the VMDK using the following command: vmkfstools -i "hassos_ova-2.8.vmdk" "hassos_ova-2.8_new.vmdk" (This creates a thick copy of the disk and avoids locking/snapshot issues with the virtual disk)
  • Delete the original VMDK from the datastore as it is no longer needed
  • Edit the VM and add an existing hard drive selecting the VMDK you just cloned
  • Change the controller for the disk from “New SCSI controller” to “IDE 0”
  • Remove the newly added SCSI controller as it is not needed for the IDE virtual disk
  • Go to “VM Options” and change the Firmware from “BIOS” to “EFI”

After all this has been completed you just need to power on the VM. Assuming DHCP is configured properly on the network Hass.io is using, it will pick up an IP and start configuring. With Hass.io running on an older Xeon-powered host I have never seen the VM get over 50-60 percent CPU utilization, and even then I’ve only seen those spikes when running an update. Updates of HassOS and Hass.io now take a minute or two, whereas they would sometimes take up to 10-15 minutes on a Pi.
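For repeat deployments, the clone step in the list above follows a simple naming pattern: the copy gets a `_new` suffix before the `.vmdk` extension. The Python sketch below only builds the command line for a given download. The helper name `clone_command` is an assumption for illustration, and the command itself still has to be run in the ESXi shell.

```python
import shlex
from pathlib import PurePosixPath


def clone_command(source_vmdk: str) -> list[str]:
    """Build the vmkfstools clone command for a downloaded HassOS VMDK."""
    src = PurePosixPath(source_vmdk)
    # hassos_ova-2.8.vmdk -> hassos_ova-2.8_new.vmdk
    target = src.with_name(f"{src.stem}_new{src.suffix}")
    return ["vmkfstools", "-i", str(src), str(target)]


# shlex.join quotes arguments only when they need it, e.g. names with spaces.
print(shlex.join(clone_command("hassos_ova-2.8.vmdk")))
```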

A NIC device is tied to a disallowed network

I recently received a call from a former colleague who was unable to update a machine catalog. They stated nothing had changed in vCenter, Citrix, or the master image. The error they were receiving was:

Error Id: XDDS:919D761E

Exception: Citrix.Console.Models.Exceptions.ProvisioningTaskException Create Catalog failed with an unknown reason, see terminating error for more details.
   at Citrix.Console.PowerShellSdk.ProvisioningSchemeService.BackgroundTasks.ProvisioningSchemeTask.CheckForTerminatingError(SdkProvisioningSchemeAction sdkProvisioningSchemeAction)
   at Citrix.Console.PowerShellSdk.ProvisioningSchemeService.BackgroundTasks.ProvisioningSchemeTask.WaitForProvisioningSchemeActionCompletion(Guid taskId, Action`1 actionResultsObtained)
   at Citrix.Console.PowerShellSdk.ProvisioningSchemeService.BackgroundTasks.ProvisioningSchemeCreationTask.StartProvisioningAction()
   at Citrix.Console.PowerShellSdk.ProvisioningSchemeService.BackgroundTasks.ProvisioningSchemeCreationTask.RunTask()
   at Citrix.Console.PowerShellSdk.BackgroundTaskService.BackgroundTask.Task.Run()


DesktopStudio_ErrorId : ProvisioningTaskError
ErrorCategory : NotSpecified
ErrorID : NetworkNotPermitted
TaskErrorInformation : Terminated
InternalErrorMessage : A NIC device is tied to a disallowed network.
DesktopStudio_PowerShellHistory : Create Machine Catalog 'XenApp - WSRV12 - DAPPS - DR'
11/25/2018 7:09:42 AM

The key error here is ‘A NIC device is tied to a disallowed network’. If you do a quick search you will find an article referencing this error: CTX139460. This points to a change in the vCenter networking config, but supposedly there weren’t any changes. Time to do some digging. I asked them to get networking info from both vCenter and Citrix using PowerShell.

To get the hypervisor networking I asked them to log in to one of the delivery controllers, launch PowerShell as administrator, and run the following:

Add-PSSnapin Citrix*
dir XDHyp:\HostingUnits | Select PSPath,HostingUnit*,*Network* | Format-List

The output of this was:

PSPath : Citrix.Host.Admin.V1\Citrix.Hypervisor::XDHyp:\HostingUnits\DR_Cluster-vm_dr
HostingUnitName : DR_Cluster-vm_dr
HostingUnitUid : bddd641a-a55c-4f0e-bd62-9331502fd908
NetworkId : Network:network-641
NetworkPath : XDHyp:\Connections\PRDVCENTER01\DR.datacenter\DR Cluster.cluster\VM Network 201.network
...

The thing to take note of is the ‘NetworkId’ for the DR hosting connection. This ID is the vCenter MoRef (Managed Object Reference) for the VM network. I then had them pull the VM networks from vCenter using PowerCLI.

To get the VM networks (with MoRef) from vCenter I asked them to launch VMware PowerCLI as administrator and run the following:

Connect-VIServer prdvcenter01.domain.com
Get-View -ViewType Network | Select Name,MoRef

The output of this was:

Name             MoRef
----             -----
VM Network 201   Network-network-4790

...

The MoRef was network-641 in MCS, but network-4790 in vCenter even though the VM network names were the same. From this it was clear there was a networking change performed on the vCenter side at some point. After stating this, it was revealed that port groups were deleted and recreated (which generated new MoRef ids) in this DR cluster. At this point we have to reconfigure the hosting connection networking with the new MoRef and this cannot be done in Citrix Studio. To do this we have to reconfigure the ‘NetworkPath‘ in the hosting connection, but use the same ‘NetworkPath‘ since the network name did not change. Running this will force the network MoRef to be queried and updated in the MCS hosting connection.
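The comparison described above (same network name, different MoRef) is easy to automate once both outputs are in hand. The Python sketch below assumes the two command outputs have already been collected into plain dictionaries. The function names and the data layout are illustrative only, and the values are the ones shown earlier in this post.

```python
def moref_id(value: str) -> str:
    """Strip the 'Network:' / 'Network-' type prefix, leaving e.g. 'network-641'."""
    for prefix in ("Network:", "Network-"):
        if value.startswith(prefix):
            return value[len(prefix):]
    return value


def stale_hosting_units(hosting_units, vcenter_networks):
    """Return (unit name, MoRef in MCS, MoRef in vCenter) for every mismatch."""
    stale = []
    for unit in hosting_units:
        # NetworkPath ends in '<network name>.network'
        name = unit["NetworkPath"].rsplit("\\", 1)[-1].removesuffix(".network")
        current = vcenter_networks.get(name)
        if current is None:
            continue  # network name not found in vCenter; handle separately
        if moref_id(current) != moref_id(unit["NetworkId"]):
            stale.append(
                (unit["HostingUnitName"], moref_id(unit["NetworkId"]), moref_id(current))
            )
    return stale


hosting_units = [
    {
        "HostingUnitName": "DR_Cluster-vm_dr",
        "NetworkId": "Network:network-641",
        "NetworkPath": r"XDHyp:\Connections\PRDVCENTER01\DR.datacenter"
                       r"\DR Cluster.cluster\VM Network 201.network",
    }
]
vcenter_networks = {"VM Network 201": "Network-network-4790"}

for unit, mcs_ref, vc_ref in stale_hosting_units(hosting_units, vcenter_networks):
    print(f"{unit}: MCS has {mcs_ref}, vCenter has {vc_ref}")
```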

To reset (or change if needed) the ‘NetworkPath‘ in the hosting connection you take the ‘PSPath‘ from the first command and copy everything starting with ‘XDHyp‘. I took that path and provided them with this command to run:

Set-Item -Path 'XDHyp:\HostingUnits\DR_Cluster-vm_dr' -NetworkPath 'XDHyp:\Connections\PRDVCENTER01\DR.datacenter\DR Cluster.cluster\VM Network 201.network'

Finally, I asked them to re-run the first ‘dir’ command to verify the network MoRef had updated. After doing this they were able to successfully update the machine catalog.
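The path handling for the Set-Item command (take the ‘PSPath’, keep everything from ‘XDHyp’ onward) can be expressed as a small helper. This Python sketch only assembles the command text for review before it is pasted into PowerShell. The function names are assumptions for illustration.

```python
def hosting_unit_path(ps_path: str) -> str:
    """Drop the provider prefix from a PSPath, keeping everything from 'XDHyp:' on."""
    index = ps_path.find("XDHyp:")
    if index == -1:
        raise ValueError("PSPath does not contain an XDHyp: path")
    return ps_path[index:]


def ps_quote(text: str) -> str:
    """Single-quote a string for PowerShell, doubling embedded single quotes."""
    return "'" + text.replace("'", "''") + "'"


def set_item_command(ps_path: str, network_path: str) -> str:
    """Assemble the Set-Item command that re-applies the NetworkPath."""
    return (
        f"Set-Item -Path {ps_quote(hosting_unit_path(ps_path))} "
        f"-NetworkPath {ps_quote(network_path)}"
    )


ps_path = r"Citrix.Host.Admin.V1\Citrix.Hypervisor::XDHyp:\HostingUnits\DR_Cluster-vm_dr"
network_path = (
    r"XDHyp:\Connections\PRDVCENTER01\DR.datacenter"
    r"\DR Cluster.cluster\VM Network 201.network"
)
print(set_item_command(ps_path, network_path))
```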

Storage vMotion operations timing out

I recently ran across an issue during an ESXi cluster/SAN migration where we were down to a handful of VMs that were failing when trying to move them to the new cluster/SAN (using simultaneous compute/storage vMotion operations). I’d like to note that this was on vCenter 6.7 and ESXi 6.5.

The errors were:

  • Timed out waiting for migration data. The source detected that the destination failed to resume.
  • Operation timed out.
  • Timed out waiting for migration data. vMotion migration {#######################} failed to read stream keepalive: connection closed by remote host, possibly due to timeout.

I looked at all the standard issues (storage issues, vMotion connectivity issues, etc.). When looking at the VMs, the only thing that made them different compared to others in the cluster was the number of virtual disks attached to them. All four of the VMs were SQL Server Availability Group members and had a larger number of disks (5+). When looking into timeouts related to the number of disks I came across this VMware article: Using Storage vMotion to migrate a virtual machine with many disks timeout (1010045). The errors in the article were not the same, but it aligned with my suspicion about the number of disks. I couldn’t look at the kernel vpxd logs because they had already rolled over, but I decided to give the article’s fix a shot. I shut down the problem VMs, set the fsr.maxSwitchoverSeconds configuration parameter to 900 for each one, powered them on, and retried the compute/storage vMotion operations. All vMotion operations completed successfully after this change.

I would like to note that there is a separate configuration parameter called vmotion.maxSwitchoverSeconds which controls the compute side of things. You can try adjusting this as well when having vMotion timeout issues.
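As a rough illustration of the change, here is a Python sketch that adds or updates one of these parameters in the text of a VMX file. It assumes the parameter is stored as a `key = "value"` line and that the VM is powered off. The helper name `upsert_vmx_option` and the sample contents are hypothetical, and the same setting can be made through the VM's advanced configuration parameters instead.

```python
import re


def upsert_vmx_option(vmx_text: str, key: str, value: str) -> str:
    """Set key = "value" in vmx_text, appending the line if the key is absent."""
    line = f'{key} = "{value}"'
    pattern = re.compile(
        rf"^[ \t]*{re.escape(key)}[ \t]*=.*$", re.MULTILINE | re.IGNORECASE
    )
    if pattern.search(vmx_text):
        # Key already present: rewrite the existing line in place.
        return pattern.sub(lambda m: line, vmx_text)
    separator = "" if (not vmx_text or vmx_text.endswith("\n")) else "\n"
    return f"{vmx_text}{separator}{line}\n"


# Hypothetical VMX contents for one of the SQL Server VMs.
vmx = 'displayName = "sql01"\n'
vmx = upsert_vmx_option(vmx, "fsr.maxSwitchoverSeconds", "900")
print(vmx)
```

The same call with `vmotion.maxSwitchoverSeconds` would cover the compute-side parameter mentioned above.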

Google Voice + Incredible PBX + OBi 110


UPDATE #3 (6/21/2019):
I did some digging and found the core issue (which has been talked about on other forums). In March Google disabled Google Voice access via the ‘https://www.googleapis.com/auth/googletalk’ API. They are only allowing access via a private ObiTalk API now. You basically have to get the OAuth client ID, client secret, and a refresh token for this ObiTalk app. Once you have these three things they can be used in the procedure below. After this you will be up and running again. Thank you again, naf!

UPDATE #2: As of 3/25/2019 my GVSIP trunk is showing a rejected status again. This time it seems the actual server is rejecting authentication. This is the error I am receiving in the Asterisk logs:

[2019-03-25 11:26:42] DEBUG[7278] res_pjsip_outbound_registration.c: Registration re-using transport 0x7ff23001f748
[2019-03-25 11:26:42] WARNING[7278] res_pjsip_outbound_registration.c: Fatal response '403' received from 'sip:obihai.sip.google.com' on registration attempt to 'sip:gv1143131xxxxxxxxxx@obihai.sip.google.com', stopping outbound registration

UPDATE: As of 3/19/2019 my GVSIP trunk was showing a rejected status. It seems that this version of IncrediblePBX was using a static IP for the outbound proxy which just stopped working. After looking into it more I realized the ‘nafkeys-update’ script updates this for you, but I didn’t use that script since it can no longer download the certificates. I have updated the steps (step 7) below to show how to resolve this. This was the error I was receiving in the Asterisk logs:

[2019-03-19 08:32:52] WARNING[3715] pjproject:                        SSL SSL_ERROR_SSL (Handshake): Level: 0 err: <336151568>  len: 0
[2019-03-19 08:32:52] WARNING[3716] res_pjsip_outbound_registration.c: No response received from 'sip:obihai.sip.google.com' on registration attempt to 'sip:gv1143131xxxxxxxxxx@obihai.sip.google.com', retrying in '60'

If you were already running this setup and you are now having this issue run the following commands to replace the static IP with the proper address:

  • sed -i 's|64.9.242.172|obihai.telephony.goog|' /etc/asterisk/pjsip_custom.conf
  • amportal restart

 

I have been using a cordless house phone in conjunction with Google Voice ever since purchasing a home a number of years ago. This was accomplished using an OBi 110 device, but it stopped working last year because Google updated certificates on their end and this model did not trust the new certificate chain. OBi stated the box was end of life and they would not offer an update. At that time I set up a FreePBX instance between my OBi 110 and GV, but Google recently ended XMPP support, which nuked my new configuration. OBi now has a relationship with Google and conveniently offers a new model that works with Google Voice’s new SIP protocol (GVSIP). I have already had to replace my OBi once due to a hardware issue, and they did not offer a simple update to fix the certificate issue last time, so I am not willing to shell out more money to get this working. I searched the internet for options (surely if Google Voice is using SIP now there is a way to connect to it). A contributor to the Asterisk project (the core of FreePBX/Incredible PBX) named naf has already developed a way to connect Asterisk to GVSIP. The problem is that forums are removing references to files and instructions because of Google ToS violations. I spent some time figuring all this out and was able to hook my OBi 110 device up to an Incredible PBX VM. The VM has a GVSIP trunk and everything is working perfectly (hopefully for a long time). The steps are basically installing CentOS, installing Incredible PBX, obtaining an OBi certificate/private key pair (Google is using it to whitelist SIP connections), configuring the PBX trunks/routes, and configuring the OBi to talk to the PBX over SIP. Here are the steps I took to get this running:

  1. Download/install CentOS 6.10 minimal and Incredible PBX 16-15 (http://nerdvittles.com/?p=27089). As I said before I deployed this to my ESXi cluster at home, but this could easily run on a local VM. I’m looking into translating this setup into a Raspberry Pi deployment
  2. Obtain Obi client certificate/private key pair
    • All references to the certificates have been removed. You can actually pull the certificate and private key off of an OBi 110 and use this on Incredible PBX. This requires modified firmware (using a firmware patch). This patch process involves using bsdiff to patch the original firmware file with the aforementioned patch file, flashing the device with the patched firmware file, setting up a syslog server, configuring the syslog server on the OBi, initiating a device backup (which will send the cert/key in binary DER format to the syslog server), and converting cert/key pair to PEM
    • I will save you some time and provide you with the already patched firmware here, but I cannot publicly provide the certs for legal reasons (you can try dropping me a message). OBi 110 devices are available extremely cheap on eBay as they are now defunct. An OBi type device will be necessary for connecting GVSIP to a traditional phone anyway
  3. Copy the certificate pair to Incredible PBX (obihainaf.crt + obihainaf.key) under /etc/asterisk/keys
  4. Create a Google Voice refresh_token for your account. You will use the Incredible PBX OAuth Client ID and OAuth Secret found at /etc/asterisk/pjsip_custom.conf
  5. Obtain an ObiTalk-based OAuth client ID, client secret, and refresh token. If you have access to a newer OBi device you can possibly install custom firmware that will allow you to SSH to it and pull these values for reuse. You can technically recreate the ObiTalk OAuth exchange and get a refresh token that way, but that is something that can’t be publicly posted. I will say that the ObiTalk OAuth client ID is: 565436231175-v2ed4f21arki7dt43pks0jlsujatiq2i.apps.googleusercontent.com
  6. Run the following commands to update the install-gvsip script with the ObiTalk OAuth client ID and client secret from step 5 (you can manually update the file if you don’t want to run the ‘sed’ commands):
    • sed -i 's|466295438629-prpknsovs0b8gjfcrs0sn04s9hgn8j3d.apps.googleusercontent.com|<CLIENT ID YOU OBTAINED IN STEP 5>|' /root/gvsip-naf/install-gvsip
    • sed -i 's|4ewzJaCx275clcT4i4Hfxqo2|<CLIENT SECRET YOU OBTAINED IN STEP 5>|' /root/gvsip-naf/install-gvsip
  7. Run the following command to update the install-gvsip script with the proper proxy address (you can manually update the file if you don’t want to run the ‘sed’ command):
    • sed -i 's|64.9.242.172|obihai.telephony.goog|' /root/gvsip-naf/install-gvsip
  8. Create your GVSIP trunk(s). You will be prompted for the refresh token you retrieved in step 5
  9. Verify trunks registered and working
    • cd /root/gvsip-naf/
    • ./show-trunks
  10. Set up the OBi 110 as an extension in Incredible PBX and set the OBi 110 to connect to Incredible PBX
  11. Configure inbound and outbound routes to the OBi 110 extension
  12. Configure SIP local networks in CIDR format under Settings -> General SIP Settings -> NAT Settings -> Local Networks. I had issues with my SIP services shutting down right after boot up in Incredible PBX and traced it back to this in the logs. I added my home subnet (192.168.0.0/24) and it stopped going down
  13. Test inbound and outbound calls
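Since the Local Networks setting in step 12 expects CIDR notation, it can help to sanity-check the value before entering it. The Python sketch below uses the standard `ipaddress` module. The helper name and the OBi address 192.168.0.50 are hypothetical, and only the 192.168.0.0/24 subnet comes from my setup.

```python
import ipaddress


def in_local_networks(address: str, networks: list[str]) -> bool:
    """True if address falls inside any of the CIDR networks.

    strict=True makes ip_network reject entries with host bits set,
    such as '192.168.0.1/24', by raising ValueError.
    """
    ip = ipaddress.ip_address(address)
    return any(ip in ipaddress.ip_network(net, strict=True) for net in networks)


local_networks = ["192.168.0.0/24"]

# 192.168.0.50 stands in for the OBi 110's address on the home subnet.
print(in_local_networks("192.168.0.50", local_networks))
print(in_local_networks("10.0.0.5", local_networks))
```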

Note: This could also be deployed to a Raspberry Pi 3. I just didn’t have an unused Pi 3 handy, so I went with a VM. I was running my previous FreePBX instance on an older Pi B+ for over a year.

OVA deployment issues in vCenter 6.5+

A team member was recently tasked with deploying a number of OVA templates provided by a vendor. The OVA deployment kept failing after sitting on “Validating” for a long time. This would usually happen after selecting a compute resource in vCenter. The vendor stated they have seen this numerous times with vCenter 6.5 clients. They advised removing a host from the cluster and deploying directly to that host. Being a person that cannot accept hacky workarounds, I decided to dive into it. We are currently on vCenter 6.7 U1 with ESXi 6.5 U3 hosts. I extracted the OVA and started looking into the OVF XML. Everything looked to be formatted correctly, but I still felt vCenter wasn’t liking something in the XML. I began troubleshooting by commenting out entire <ProductSection> elements of the XML. Commenting out the first set of options did not work, but the second did. Looking closer at the second showed a very long ‘ValueMap’ string for the time zone selector in the ovf:qualifiers attribute. The most likely scenario was that this was causing the issue with its length and complexity. I decided to clear out the entire ovf:qualifiers attribute (empty quotes) and hard code the value to be ‘America/New_York’. I then saved the OVF, initiated a new deployment (selecting all VMDKs and the OVF, but excluding the MF file as that would cause a checksum error), and hit next… VOILA! I was able to successfully deploy this OVF without any errors. I also performed the same action for all of the other vendor templates.

Original time zone property:

Modified time zone property:

I didn’t dig further, but I imagine the vendor’s standalone host hack worked because the web GUI on the host has different code (maybe missing a bug) than vCenter. I’d also like to note that this could be accomplished by using the Import-VApp PowerCLI cmdlet (without modifying any files), but you’d also have to create an OvfConfiguration hashtable object to pass as a parameter, which may be more work than it is worth.
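With several templates to fix, the OVF edit described above (empty the ovf:qualifiers attribute and hard code the value) could be scripted. The Python sketch below uses the standard library's ElementTree and makes some assumptions: the property key `timezone` and the sample XML are made up, since the vendor's actual key is not shown here. ElementTree also rewrites namespace prefixes and drops comments when it serializes, so the output is not byte-identical to a hand edit and has not been tested against vCenter.

```python
import xml.etree.ElementTree as ET

OVF_NS = "http://schemas.dmtf.org/ovf/envelope/1"


def pin_property(ovf_xml: str, key: str, value: str) -> str:
    """Empty ovf:qualifiers and hard code ovf:value on the matching Property."""
    ET.register_namespace("ovf", OVF_NS)
    root = ET.fromstring(ovf_xml)
    changed = 0
    for prop in root.iter(f"{{{OVF_NS}}}Property"):
        if prop.get(f"{{{OVF_NS}}}key") == key:
            prop.set(f"{{{OVF_NS}}}qualifiers", "")
            prop.set(f"{{{OVF_NS}}}value", value)
            changed += 1
    if changed == 0:
        raise KeyError(f"no Property with ovf:key={key!r}")
    return ET.tostring(root, encoding="unicode")


# Cut-down, hypothetical OVF with a short ValueMap standing in for the long one.
sample_ovf = (
    '<Envelope xmlns="http://schemas.dmtf.org/ovf/envelope/1" '
    'xmlns:ovf="http://schemas.dmtf.org/ovf/envelope/1">'
    '<VirtualSystem ovf:id="vm"><ProductSection>'
    '<Property ovf:key="timezone" ovf:type="string" '
    'ovf:qualifiers="ValueMap{&quot;UTC&quot;,&quot;America/New_York&quot;}" '
    'ovf:value="UTC"/>'
    '</ProductSection></VirtualSystem></Envelope>'
)

print(pin_property(sample_ovf, "timezone", "America/New_York"))
```

As with the manual edit, the manifest (MF) file no longer matches the changed OVF, so it still has to be left out of the deployment.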