Friday, March 6, 2009

Calculation of Max LUN Supported in ESX Server

I found that my ESX servers could not see the 65th LUN that I tried to present to them. I logged a support call and am still waiting for a reply from VMware. In the meantime, I found an interesting article with the details below.

Article Copied from VMware

In Multipathing Configurations, the Number of Paths Per LUN Is Inconsistent
The hpsa driver in ESX Server might reduce the number of supportable LUNs below the expected maximum limit of 256 when the controller is used in multipath configurations. In multipath configurations, if all four paths are configured, the total number of supportable LUNs is reduced to 64. In certain multipath configurations, because each target path consumes an available LUN slot, the total number of supportable LUNs might be reduced to 60.

Workaround
Reduce the number of LUNs on a server until the product of LUNs and paths is less than 256 (LUNs * Number of paths < 256), and if necessary, reduce the LUN count depending on use of multipath until each LUN has the expected number of paths.
The following example shows a configuration with the maximum supportable LUNs presented to an ESX Server installation on four paths, providing all LUNs with the expected number of usable paths:
Path 1: 63 LUNs seen through this path; Total LUN count (63 + 1 path) is less than 256
Path 2: 63 LUNs seen through this path; Total LUN count (63 + 63 + 2 paths) is less than 256
Path 3: 63 LUNs seen through this path; Total LUN count (63 + 63 + 63 + 3 paths) is less than 256
Path 4: 63 LUNs seen through this path; Total LUN count (63 + 63 + 63 + 63 + 4 paths) = 256
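
To see how this plays out on a real host, you can check the LUN and path counts from the service console. The following is a rough sketch; the esxcfg-mpath output format varies between ESX versions:

    # List every LUN together with its paths:
    esxcfg-mpath -l
    # Each LUN prints one "... has N paths ..." summary line, so counting
    # those lines gives a rough LUN count:
    esxcfg-mpath -l | grep -c "paths"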

If I use the formula above to calculate my environment, then yes, I am at the full limit of 256 LUN slots. I have 2 ESX servers with only 2 HBA connections each, and they had no problem presenting more than 67 physical LUNs until now. What I have done now is remove 2 HBA connections from each of my ESX servers and run a rescan, and I found the LUNs presented as I expected. Again, I have not confirmed this solution yet and will do another round of verification with the VMware engineer.
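
Working the numbers through the formula makes the behaviour clearer. This assumes the failing hosts were seeing four paths per LUN, which is consistent with the rescan stopping at the 65th LUN:

    64 LUNs x 4 paths = 256 slots  -> the 65th LUN pushes past the limit
    67 LUNs x 2 paths = 134 slots  -> comfortably under 256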

Manually Committing Snapshot Delta Files to the VMDK Flat File

I had a tough time this week dealing with a snapshot issue on one of my VMs. The VM contained an important snapshot previously taken for system restoration. When I browsed the Snapshot Manager from vCenter, the system showed my VM running without any snapshots. That was the kick-start of my problem, and of an exciting journey, until I managed to recover it this morning.

I SSHed to the ESX host and browsed to the datastore, and found that the snapshot files ending with the .vmsn extension were present in the correct location. But no matter how many times I retried and rebooted my vCenter Server, the snapshots were still not visible in the Snapshot Manager.
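
For reference, this is roughly what I was checking from the service console (the datastore and VM names here are hypothetical):

    cd /vmfs/volumes/datastore1/VMxxxx/
    # The .vmsn snapshot state files and the vmdk files were all present:
    ls -lh *.vmsn *.vmdk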

I read through some articles and forums that suggested cloning the snapshot using the vmkfstools -i option, but that did not succeed in my case. I continued my research and found a useful blog post from Oliver O'Boyle, a blogger who had experienced a similar issue previously.
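
The clone approach those forums suggested looks roughly like this (a sketch; the paths and file names are hypothetical):

    # Clone from the topmost snapshot descriptor into a new consolidated
    # disk; vmkfstools walks the parent chain and folds the deltas in.
    vmkfstools -i /vmfs/volumes/datastore1/VMxxxx/VMxxxx-000001.vmdk \
               /vmfs/volumes/datastore1/VMxxxx/VMxxxx-clone.vmdk

This only works when the CID/parent CID chain is intact, which is presumably why it failed in my case.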

His article, which explained the chain between the CID and parent CID values, helped me resolve my issue. I found that the root cause on my VM was a combination of a snapshot problem and vmdk descriptor file corruption. For the snapshot problem, we can create a new snapshot and then select delete all snapshots afterward, which should force the delta files to be committed into the vmdk flat files. On one of the virtual hard disks we ran into difficulty, as the ESX server forced the virtual disk to be detached from the VM. The root cause of that was a missing parent descriptor file, which should have been VMxxxx.vmdk.
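
To illustrate the chain his article describes, here is roughly what a snapshot descriptor looks like (a sketch; the CID values, sizes, and file names are made up):

    # VMxxxx-000001.vmdk -- the snapshot (delta) descriptor
    # Disk DescriptorFile
    version=1
    CID=fffffffe
    # parentCID must equal the CID line inside the parent VMxxxx.vmdk:
    parentCID=0f5c9ab2
    createType="vmfsSparse"
    parentFileNameHint="VMxxxx.vmdk"

    # Extent description
    RW 83886080 VMFSSPARSE "VMxxxx-000001-delta.vmdk"

If the parentCID does not match the parent's CID, or the parent descriptor file is missing entirely (as it was here), the chain is broken and ESX refuses to open the disk.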

During this troubleshooting, you should ensure that the delta files and flat files are always retained and never overwritten. Each snapshot leaves two files per disk: a descriptor ending in VMxxxxx-000001.vmdk and the delta data ending in VMxxxxx-000001-delta.vmdk. Your flat file ends with VMxxxxx-flat.vmdk. The first thing I did was make sure the virtual disk could be re-attached to the VM. I manually created a new vmdk descriptor file following Oliver O'Boyle's guide, copying in the parent CID and the required virtual disk size values, and manually configured the link between the .vmdk descriptor and the flat file. After that, I was able to attach the virtual disk back to the VM from vCenter. Take note that vCenter will not show flat files as attachable virtual disks, as it recognizes a virtual disk based on the location of its .vmdk descriptor. I recommend keeping the .vmdk and flat file in the same datastore, although you can also relocate the vmdk files to a different datastore if you wish.
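
Following the same guide, a minimal base descriptor looks roughly like this (a sketch for a hypothetical 40 GB disk; the CID, geometry, and names are made up and must match your own flat file):

    # VMxxxx.vmdk -- the recreated base descriptor
    # Disk DescriptorFile
    version=1
    CID=0f5c9ab2
    parentCID=ffffffff
    createType="vmfs"

    # Extent description
    # 83886080 sectors x 512 bytes = 40 GB; must match the flat file size
    RW 83886080 VMFS "VMxxxx-flat.vmdk"

    # The Disk Data Base
    ddb.virtualHWVersion = "4"
    ddb.geometry.cylinders = "5221"
    ddb.geometry.heads = "255"
    ddb.geometry.sectors = "63"
    ddb.adapterType = "lsilogic"

An easy way to get valid values is to create a temporary disk of the same size with vmkfstools -c and adapt the descriptor it generates.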

Once the virtual disk had been attached to the VM, boot the VM up immediately. Log in to the system and make sure everything is normal and functioning correctly. The data I had at this point was not the latest data I needed, because the missing snapshot had never been committed by the system. I then took a new snapshot of the entire VM. Once I had done that, the datastore, viewed over SSH, showed plenty of delta files and newly created vmdk files ending in VMxxxxxxx-000003.vmdk and so on.
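
By that point, the directory listing looked roughly like this (hypothetical names again):

    ls /vmfs/volumes/datastore1/VMxxxx/
    # VMxxxx.vmdk          VMxxxx-flat.vmdk
    # VMxxxx-000001.vmdk   VMxxxx-000001-delta.vmdk   <- the retained originals
    # VMxxxx-000003.vmdk   VMxxxx-000003-delta.vmdk   <- from the new snapshot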

Here are the steps taken to commit the snapshots manually:

  1. Power off the VM.
  2. Right-click the VM in vCenter, select Edit Settings, and select the virtual disk that you are trying to recover. The system will show which vmdk file this virtual disk is pointing to.
  3. Copy down the file names and go back to your SSH session.
  4. Replace the vmdk descriptor and delta files with the ones you previously retained from the original snapshot you are recovering, renamed to the FILE NAMES you copied in step 2.
  5. Open the Snapshot Manager for the VM and select the delete all snapshots option. This process takes time, depending on the size of the delta files that need to be committed.
  6. It may appear stuck at 95% or time out, but the system will still continue to commit the delta files back to the flat files; you can monitor this from the console, as shown in the sketch after this list. In my case, it took more than 2 hours to delete the snapshots.
  7. I noticed the ESX server load and disk activity increase on the performance charts.
  8. Once it completes, all the delta files are deleted and everything should be back to normal.
  9. Power on the VM and double-check all the data and mount points; I found the system was back to normal.
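
The sketch mentioned in step 6: a simple way to confirm the commit is still running is to watch the snapshot files from a second SSH session (the path is hypothetical; the file names will match your VM):

    # Re-run this while delete all appears stuck; the delta files
    # disappear as each disk's snapshot is committed into the flat file:
    ls -lh /vmfs/volumes/datastore1/VMxxxx/ | grep -i delta

When no -delta.vmdk files remain in the listing, the commit has finished.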
 