Thursday, January 17, 2013

linux servers that do not boot up on /dev/sda make me grumpy!

We are working on a set of upgrades to our environment and replacing a set of very stable but now old IBM x3650 servers ( currently running 5.7 ) with a set of new Dell R710 servers.

New Dell servers on Oracle Linux 6.2 ... using red hat compatible kernel aka:
2.6.32-220.el6.x86_64
The new Dell boxes have an internal RAID controller ( PERC H700 ? ) and are connected to EMC direct attached storage using Emulex HBAs.  The operating system and all Linux software are installed on internal disks ( mirrored ) ... all database stuff is going to be on EMC storage.

Our new servers had a very strange set of behaviors when booting up from internal disks.  Most of the time they would boot up and see the first internal raid drive as /dev/sda ( so /boot partition is on /dev/sda1 ) ... but at other times they would see /boot on a different device ( for example /boot on /dev/sdi1 ).

The entries in /etc/fstab for 6.x systems now apparently use UUID entries ... ( for example ):
UUID=e6964e7e-62a9-450c-a66e-a411b40a4ed9 / ext4 defaults 1 1
So when the servers came up on a different boot drive they would run ok ... looking strange ... but we ran into a different problem with the Linux backup imaging product we were using ( Acronis ... still trying to use it ... don't get me started ).  It just did not understand at all how to back up or restore a system that was not running from /dev/sda.
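A quick way to see which device a UUID from /etc/fstab currently maps to is sketched below. This is only a rough helper of my own, not something Oracle support gave us: the function name fstab_uuid_for is made up, and it assumes fstab lines in the UUID= form shown above.

```shell
# Hypothetical helper: print the UUID that fstab-style text assigns to a mount point.
# Reads fstab text on stdin; $1 is the mount point (for example / or /boot).
fstab_uuid_for() {
    awk -v mp="$1" '
        # skip comment lines, keep only UUID= entries for the requested mount point
        $1 !~ /^#/ && $2 == mp && $1 ~ /^UUID=/ {
            sub(/^UUID=/, "", $1)
            print $1
        }'
}

# Example on a live system (blkid -U resolves a UUID to its current device node):
#   uuid=$(fstab_uuid_for /boot < /etc/fstab)
#   blkid -U "$uuid"
```

If the blkid answer changes between reboots ( /dev/sda1 one time, /dev/sdi1 the next ), you are seeing the same drive reordering we were.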

Logically it seemed pretty straightforward.  Somehow force the first internal drive to always be /dev/sda.

We pay Oracle for Linux support so we opened a ticket with them.  We now have a solution but it took a very very long time for Oracle Linux support to come up with it.  Might be a byproduct of working with a junior level person ... might be from a strange new problem.  We initially tried all sorts of stuff with udev rules ... nope, none of this worked at all.

Eventually the solution that is now deployed and working involved removing the lpfc ( emulex ? HBA support ? ) modules from the initramfs image that is invoked on first boot up.  Of course we run stuff on EMC storage and yes, once the system has booted our HBAs are working just fine.

Anyway here is what we had to do to get this working in our 6.2 redhat compatible kernel environment.  It is some low level pretty esoteric linux stuff and well beyond what I wanted to have to deal with ... but it is working nicely.

Step 1: Get the latest available dracut rpms and stick them into a directory for updating:
dracut]# ls -ltr | more
total 140
-rw-r--r-- 1 root root 114884 Jan 11 13:29 dracut-004-284.0.1.el6_3.1.noarch.rpm
-rw-r--r-- 1 root root  21524 Jan 11 13:29 dracut-kernel-004-284.0.1.el6_3.1.noarch.rpm

Step 2: Update to the latest rpms ... ( not sure why the 100% 50% 100% stuff is gone from below )
rpm -Uvh dracut*.rpm | more
warning: dracut-004-284.0.1.el6_3.1.noarch.rpm: Header V3 RSA/SHA256 Signature,Y
Preparing...                ##################################################
dracut                      ##################################################
dracut-kernel               ##################################################

Step 3: Verify installation of new dracut rpms
# rpm -qa | grep dracut
dracut-kernel-004-284.0.1.el6_3.1.noarch
dracut-004-284.0.1.el6_3.1.noarch

Step 4: Now change to the /boot directory and create a new initramfs image file.
Use this command: dracut --omit-drivers lpfc initramfs-$(uname -r)-no-lpfc.img
# dracut --omit-drivers lpfc initramfs-$(uname -r)-no-lpfc.img

Step 5: Check img file created ...
# ls -ltr *.img | more
 -rw-r--r--  1 root root 15875365 Jan 11 13:39 initramfs-2.6.32-220.el6.x86_64-no-lpfc.img

Step 6: Verify that no lpfc modules are in the new initramfs image file
# zcat *no-lpfc.img | cpio -t | grep lpfc | more
87575 blocks

Above output is correct ... if you see something like this ... lpfc is still in the img file:

lib/modules/2.6.32-220.el6.x86_64/kernel/drivers/scsi/lpfc
lib/modules/2.6.32-220.el6.x86_64/kernel/drivers/scsi/lpfc/lpfc.ko
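The "87575 blocks" line is just cpio's summary on stderr, which is why it shows up even when grep matches nothing.  If you want a yes/no answer instead of eyeballing the output, something like the sketch below should do it.  The function name initramfs_has_no_lpfc is my own invention, and it assumes the image is gzip compressed like the one above.

```shell
# Hypothetical helper: succeed only if a cpio listing on stdin has no lpfc entries.
initramfs_has_no_lpfc() {
    if grep -q 'lpfc'; then
        echo "lpfc still present"
        return 1
    fi
    echo "no lpfc entries"
}

# Example on a live system, run from /boot (2>/dev/null hides the "blocks" summary):
#   zcat initramfs-$(uname -r)-no-lpfc.img | cpio -t 2>/dev/null | initramfs_has_no_lpfc
```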

Final step ... create an entry in /etc/grub.conf to point to the new initramfs img file.

First copy the current /etc/grub.conf to a backup file.

Change the default= value to point to the new lines at the end of the /etc/grub.conf file ( grub counts title entries starting from 0 ).  My change was from default=1 to default=2.

Add in new lines at the end of grub.conf ... my entries looked like this ( this is just part of my grub.conf file ).

title Oracle Linux Server (2.6.32-220.el6.x86_64)
root (hd0,0)
kernel /vmlinuz-2.6.32-220.el6.x86_64 ro root=UUID=e6964e7e-62a9-450c-a66e-a411b40a4ed9 rd_NO_LUKS LANG=en_US.UTF-8 rd_NO_MD quiet SYSFONT=latarcyrheb-sun16 rhgb crashkernel=auto KEYBOARDTYPE=pc KEYTABLE=us rd_NO_DM
initrd /initramfs-2.6.32-220.el6.x86_64-no-lpfc.img
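Since grub counts title entries from 0, the default= value for an entry added at the end of the file is the number of title lines minus one.  A small sketch to work that out rather than counting by hand ( the function name last_grub_entry_index is made up, and it assumes one "title" line per entry as in the file above ):

```shell
# Hypothetical helper: print the zero-based index of the last "title" entry
# found in grub.conf text read from stdin.
last_grub_entry_index() {
    awk '$1 == "title" { n++ } END { if (n > 0) print n - 1 }'
}

# Example on a live system:
#   last_grub_entry_index < /etc/grub.conf
```

With three title entries this prints 2, which matches the default=2 I ended up with.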

***

At this point the change should be complete ... start rebooting and test ... do we always come up on /dev/sda?
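If you are rebooting a lot of times to test, a small check like the one below saves some squinting.  It is only a sketch: the function name boot_is_on_sda is my own, and it assumes the POSIX format of df -P, where the second line starts with the device name.

```shell
# Hypothetical helper: read `df -P /boot` output on stdin and report whether
# /boot lives on a partition of /dev/sda.
boot_is_on_sda() {
    dev=$(awk 'NR == 2 { print $1 }')
    case "$dev" in
        /dev/sda[0-9]*) echo "ok: $dev" ;;
        *) echo "unexpected boot device: $dev"; return 1 ;;
    esac
}

# Example on a live system:
#   df -P /boot | boot_is_on_sda
```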

For me yes this finally fixed the problem.

My guess is that I will have to revisit all of this when doing the next OL update.  Probably going to sit out 6.3 and eventually move from 6.2 up to 6.4 ... will probably have to build a new initramfs image and of course test.
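For that rebuild, dracut can be given the kernel version as a second argument, so the image can be built for a newly installed kernel before booting into it.  The sketch below only prints the command so it can be reviewed first.  The function name dracut_no_lpfc_cmd is made up, and the /boot path and file naming just follow what I did above.

```shell
# Hypothetical helper: print (not run) the dracut command that would build a
# no-lpfc initramfs for the kernel version given as $1.
dracut_no_lpfc_cmd() {
    kver="$1"
    echo "dracut --omit-drivers lpfc /boot/initramfs-${kver}-no-lpfc.img ${kver}"
}

# Example: review the command for the running kernel, then run it by hand
#   dracut_no_lpfc_cmd "$(uname -r)"
```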

I hope this saves some other poor geek time ... it sure took us and oracle support a long time to get this working correctly!

