Initial ESXi install success then subsequent installs failed
I just wanted to post an interesting story of setting up an ESXi 4.0 whitebox. The initial install was flawless but when I tried to reinstall ESXi, the server would hang when starting ESXi. I chose components that were listed on the Whitebox HCL.
For the record, I am using the following:
* Intel DP45SG motherboard
* Adaptec 2410SA SATA RAID
I can confirm that the Intel on-board NIC (82567LF) is supported in ESXi 4, as is the on-board SATA (AHCI mode works, as does compatibility/IDE mode). It's using the ICH10 chipset.
I set up a RAID 5 array on 4 x 500GB SATA disks on the Adaptec card. When building this array I opted for the FAST INIT option, thinking that it would slowly finish building in the background over time. I installed ESXi and all was well until I realised I could not get the array out of FAST INIT. To cut a long story short, there was no way I could get out of this mode. I decided that I would destroy the array, rebuild it using the proper build mode, and then reinstall ESXi.
All was fine as the array started to build itself, allowing me to continue the reinstall. I reinstalled ESXi, but after it was installed and rebooted, the server would not get past "nfsclient loaded successfully". I must have reinstalled about 5 times, deleting the array, adding it again, removing any extra drives I had connected to the on-board SATA, and removing the extra network cards I had installed.
Google turned up only a single page describing my exact problem: http://communities.vmware.com/thread/223943#223943. In that instance, they had a Dell server that was installed with ESXi but had to be reinstalled. The reinstall did not work, and there are posts from others in that topic with the same problem. Unfortunately no one came up with a solution.
When I figured out how to see the kernel messages (F12 console), esxcfg-init -s was the last command executed in the init process before I received repeating lines of "SCSILinuxAbortCommands Failed" and "aacraid host adapter abort request".
For some reason I decided to try building the array using the CLEAN method, where the array is not available until it is built, zeroing out all the disks. After I did this and installed ESXi, it worked.
Now I am trying to come up with reasons why this worked. My best answer is that ESXi was still detecting data from the old broken arrays and this was breaking the boot. It is also possible that there are some issues with the aacraid driver while the array is being built, and with how the array is presented to ESXi during that time. I dismissed this theory because I tried initialising the array as FAST INIT a couple of times in my reinstall attempts, but I still could not get ESXi to boot like it did initially on FAST INIT.
So my "recommendation" to anyone who has had this problem is to wipe the entire disk before a reinstall. If your RAID utility does not let you zero the disks out, use something like DBAN to zero them before you reinstall.
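If you want to double-check that a wipe really left nothing behind, something like the following rough Python sketch could be run from a Linux live CD against the disk. This is not something I used at the time; the function name and the device path in the usage note are made up for illustration, and it only reads from the disk, it never writes.

```python
def first_nonzero_offset(path, chunk_size=1024 * 1024, limit=None):
    """Return the byte offset of the first non-zero byte in `path`.

    Returns None if every byte checked is zero. `limit` caps how many
    bytes are checked (None means read until the end of the device/file).
    """
    offset = 0
    with open(path, "rb") as dev:
        while limit is None or offset < limit:
            want = chunk_size if limit is None else min(chunk_size, limit - offset)
            chunk = dev.read(want)
            if not chunk:
                break  # reached the end of the device or file
            # Stripping leading NUL bytes leaves whatever follows the zero run.
            rest = chunk.lstrip(b"\x00")
            if rest:
                return offset + len(chunk) - len(rest)
            offset += len(chunk)
    return None
```

For example, first_nonzero_offset("/dev/sdX") (placeholder device name, needs root) should return None on a fully zeroed disk. Reading a whole 500GB disk this way will take a while, so passing a limit to check only the first part of the disk may be enough for a quick sanity check.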
Maybe this is only relevant to this particular RAID controller, maybe to all Adaptec controllers, or maybe it is a problem in general for non-HCL ESXi devices.
I'll be interested to know if anyone else has had a problem along these lines.