Xen and HP conflict

We tried to configure two recently purchased HP ProLiant DL165 G7 servers for Linux Xen virtualization and ran into problems with software RAID when running in dom0. It all started when we tried to join two hard disks into a software RAID array using the mdadm (v3.1.4) tool. Our goal was to build a software RAID like this:

Personalities : [raid1]

md0 : active raid1 sda1[4] sdb1[5]

41941944 blocks super 1.2 [2/2] [UU]

unused devices: <none>

When we added a disk in order to get the rebuilding process started, everything seemed to be fine:

md0 : active raid1 sda1[4] sdb1[5]

101594944 blocks [2/1] [_U]

[====>...............] recovery = 20.2% (20581248/101594944) finish=69.4min speed=19430K/sec
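To record exactly where the rebuild stops, the recovery status line can be parsed into numbers and logged alongside the kernel messages. The following is only a minimal sketch: the function name `parse_recovery` and the field names are our own, and the sample line is based on the output quoted above.

```python
import re

# Sample status line based on the /proc/mdstat output quoted above.
SAMPLE = ("[====>...............] recovery = 20.2% (20581248/101594944) "
          "finish=69.4min speed=19430K/sec")

# Matches the "recovery = ..." (or "resync = ...") part of a status line.
RECOVERY_RE = re.compile(
    r"(?P<action>recovery|resync)\s*=\s*(?P<percent>[\d.]+)%\s*"
    r"\((?P<done>\d+)/(?P<total>\d+)\)\s*"
    r"finish=(?P<finish>[\d.]+)min\s*speed=(?P<speed>\d+)K/sec"
)


def parse_recovery(line):
    """Return the recovery progress as a dict, or None if the line has none."""
    match = RECOVERY_RE.search(line)
    if match is None:
        return None
    return {
        "action": match.group("action"),
        "percent": float(match.group("percent")),
        "done_blocks": int(match.group("done")),
        "total_blocks": int(match.group("total")),
        "finish_min": float(match.group("finish")),
        "speed_k_per_sec": int(match.group("speed")),
    }


progress = parse_recovery(SAMPLE)
print(progress)
```

Calling `parse_recovery` on each line of /proc/mdstat at a fixed interval, and storing the last non-None result, would show how far the rebuild got before the I/O errors started.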

Then suddenly, at a random point (sometimes at 13%, sometimes even at 90% of the recovery process), the OS logs filled up with strange I/O errors (end_request: I/O error dev sda).

dmesg shows that the disks were properly detected:

[ 0.000000] Xen: 00000000bfeb0000 - 00000000bfec0000 (ACPI data)

[ 0.000000] #0 [0000000000 - 0000001000] BIOS data page ==> [0000000000 - 0000001000]

[ 0.000000] Memory: 1957028k/2097152k available (3149k kernel code, 396k absent, 139728k reserved, 1906k data, 604k init)

[ 0.077837] _OSC request data:1 7

[ 0.374599] Write protecting the kernel read-only data: 4332k

[ 0.700417] libata version 3.00 loaded.

[ 0.716706] pata_atiixp 0000:00:14.1: PCI INT A -> GSI 16 (level, low) -> IRQ 16

[ 0.716742] pata_atiixp 0000:00:14.1: setting latency timer to 64

[ 0.716871] scsi0 : pata_atiixp

[ 0.717019] scsi1 : pata_atiixp

[ 0.718761] ata1: PATA max UDMA/100 cmd 0x1f0 ctl 0x3f6 bmdma 0xff00 irq 14

[ 0.718765] ata2: PATA max UDMA/100 cmd 0x170 ctl 0x376 bmdma 0xff08 irq 15

[ 0.829720] ata3: SATA max UDMA/133 irq_stat 0x00400000, PHY RDY changed

[ 0.829724] ata4: SATA max UDMA/133 abar m1024@0xfe9fe400 port 0xfe9fe580 irq 22

[ 0.829728] ata5: SATA max UDMA/133 abar m1024@0xfe9fe400 port 0xfe9fe600 irq 22

[ 0.829731] ata6: DUMMY

[ 0.893396] ata1.01: ATAPI: hp DVD RAM UJ892, 1.23, max UDMA/100

[ 0.909406] ata1.01: configured for UDMA/100

[ 1.317050] ata5: SATA link up 3.0 Gbps (SStatus 123 SControl 300)

[ 1.317075] ata4: SATA link up 3.0 Gbps (SStatus 123 SControl 300)

[ 1.323275] ata4.00: ATA-8: SAMSUNG HD204UI, 1AQ10001, max UDMA/133

[ 1.323279] ata4.00: 3907029168 sectors, multi 0: LBA48 NCQ (depth 31/32), AA

[ 1.323305] ata5.00: ATA-8: SAMSUNG HD204UI, 1AQ10001, max UDMA/133

[ 1.323308] ata5.00: 3907029168 sectors, multi 0: LBA48 NCQ (depth 31/32), AA

[ 1.329588] ata4.00: configured for UDMA/133

[ 1.329613] ata5.00: configured for UDMA/133

[ 1.724056] ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300)

[ 1.730238] ata3.00: ATA-8: SAMSUNG HD204UI, 1AQ10001, max UDMA/133

[ 1.730243] ata3.00: 3907029168 sectors, multi 0: LBA48 NCQ (depth 31/32), AA

[ 1.736505] ata3.00: configured for UDMA/133

[ 2.858410] EXT4-fs (md0): mounted filesystem with ordered data mode

Below are the very first errors we were able to catch:

[ 0.000000] ERROR: Unable to locate IOAPIC for GSI 2

[ 0.000000] ERROR: Unable to locate IOAPIC for GSI 9

[ 0.074412] ERROR: Unable to locate IOAPIC for GSI 9

[ 0.078880] ACPI Error (psargs-0359): [ECEN] Namespace lookup failure, AE_NOT_FOUND

[ 0.078888] ACPI Error (psparse-0537): Method parse/execution failed [\] (Node ffffffff816888e0), AE_NOT_FOUND

[ 5.623274] Error: Driver 'pcspkr' is already registered, aborting...

[ 1130.801249] ata4: SError: { HostInt }

[ 1130.803209] ata5: SError: { HostInt }

[ 1136.288030] ata4.00: failed to IDENTIFY (I/O error, err_mask=0x4)

[ 1136.292082] ata5.00: failed to IDENTIFY (I/O error, err_mask=0x4)

[ 1137.284026] ata3.00: failed to IDENTIFY (I/O error, err_mask=0x4)

[ 1146.772031] ata4.00: failed to IDENTIFY (I/O error, err_mask=0x4)

[ 1146.776093] ata5.00: failed to IDENTIFY (I/O error, err_mask=0x4)

[ 1147.768030] ata3.00: failed to IDENTIFY (I/O error, err_mask=0x4)
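Since the failures hit several ports within seconds of each other, it helps to summarize them per port when comparing runs. The following is a rough sketch, not a complete log parser: the function name `summarize_ata_errors` and the list of error markers are our own choices, and the sample lines are taken from the log above.

```python
import re
from collections import defaultdict

# Error lines copied from the dmesg output above.
SAMPLE_LINES = [
    "[ 1130.801249] ata4: SError: { HostInt }",
    "[ 1130.803209] ata5: SError: { HostInt }",
    "[ 1136.288030] ata4.00: failed to IDENTIFY (I/O error, err_mask=0x4)",
    "[ 1136.292082] ata5.00: failed to IDENTIFY (I/O error, err_mask=0x4)",
    "[ 1137.284026] ata3.00: failed to IDENTIFY (I/O error, err_mask=0x4)",
    "[ 1146.772031] ata4.00: failed to IDENTIFY (I/O error, err_mask=0x4)",
    "[ 1146.776093] ata5.00: failed to IDENTIFY (I/O error, err_mask=0x4)",
    "[ 1147.768030] ata3.00: failed to IDENTIFY (I/O error, err_mask=0x4)",
]

# "[ timestamp] ataN: message" or "[ timestamp] ataN.MM: message"
LINE_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+(?P<port>ata\d+)(?:\.\d+)?:\s+(?P<msg>.*)"
)

# Substrings treated as errors here (an assumption, extend as needed).
ERROR_MARKERS = ("SError", "failed to IDENTIFY", "I/O error")


def summarize_ata_errors(lines):
    """Return {port: {"first": first error timestamp, "count": error lines}}."""
    summary = defaultdict(lambda: {"first": None, "count": 0})
    for line in lines:
        match = LINE_RE.search(line)
        if match is None:
            continue
        if not any(marker in match.group("msg") for marker in ERROR_MARKERS):
            continue
        entry = summary[match.group("port")]
        if entry["first"] is None:
            entry["first"] = float(match.group("ts"))
        entry["count"] += 1
    return dict(summary)


summary = summarize_ata_errors(SAMPLE_LINES)
for port in sorted(summary):
    print(port, summary[port])
```

On the sample lines this shows ata4 and ata5 failing first, about two milliseconds apart, with ata3 following a few seconds later, which points away from a single faulty disk.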

*It is definitely not a failure of any single disk. We have already tried four Samsung HD204UI disks, one WD 5003ABYX and one Seagate ST3500514NS, and the error still occurred.

*The error ALWAYS occurred when we were running dom0 on the Xen kernel, as shown below:

root@our_machine:~# xm info

host : xs9

release : 2.6.32-5-xen-amd64

version : #1 SMP Tue Jun 14 12:46:30 UTC 2011

machine : x86_64

*We had no issues with the same disks and software RAID when we assembled the array on a kernel without Xen.
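When repeating the comparison between the Xen and non-Xen kernels, each test run can be labelled with the environment it ran in. The following is a small sketch under one assumption: that the kernel exposes /proc/xen/capabilities and that the file contains "control_d" in dom0, as Xen-enabled Linux kernels commonly do. The function names are our own.

```python
def read_capabilities(path="/proc/xen/capabilities"):
    """Return the file content, or None if the file cannot be read."""
    try:
        with open(path) as handle:
            return handle.read()
    except OSError:
        return None


def classify_xen_role(capabilities_text):
    """Classify the environment from the capabilities text (assumed format)."""
    if capabilities_text is None:
        # No /proc/xen/capabilities: plain kernel, or xenfs is not mounted.
        return "no-xen"
    if "control_d" in capabilities_text:
        return "dom0"
    return "domU"


print(classify_xen_role(read_capabilities()))
```

Writing this label next to each rebuild attempt makes it easy to confirm afterwards which kernel a failing or successful run actually used.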

So what can be causing the problem? We think it may be either the chipset or the SATA controller (or the CPU or mainboard)... We need help, as we are out of options and using Xen is crucial for our infrastructure. Right now we are waiting for responses from both Xen and HP. Have you ever faced a similar problem? Share it with us!
