Unrecovered read errors on Linux RAID10 device


I have an HP DL380p Gen8 running Ubuntu 14.04, and it has apparently been having trouble with its RAID10 filesystem for almost a month, even though everything else seems to be okay. I’m seeing a lot of these messages in dmesg, syslog, and so on, though the hex values in the Read lines vary a bit.

Nov 18 08:09:25 server03 kernel: sd 2:0:0:1: [sdb]  
Nov 18 08:09:25 server03 kernel: Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Nov 18 08:09:25 server03 kernel: sd 2:0:0:1: [sdb]  
Nov 18 08:09:25 server03 kernel: Sense Key : Medium Error [current] 
Nov 18 08:09:25 server03 kernel: sd 2:0:0:1: [sdb]  
Nov 18 08:09:25 server03 kernel: Add. Sense: Unrecovered read error
Nov 18 08:09:25 server03 kernel: sd 2:0:0:1: [sdb] CDB: 
Nov 18 08:09:25 server03 kernel: Read(16): 88 00 00 00 00 03 f8 48 f5 38 00 00 00 80 00 00
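
For anyone reading along, here is a minimal sketch of how the Read(16) line can be decoded. It assumes the standard SCSI READ(16) layout: opcode 0x88 in byte 0, an 8-byte big-endian logical block address in bytes 2-9, and a 4-byte transfer length in bytes 10-13. The function name `decode_read16` is made up for illustration.

```python
import struct


def decode_read16(cdb_hex):
    """Return (lba, transfer_length) from a READ(16) CDB given as hex bytes."""
    cdb = bytes.fromhex(cdb_hex)  # whitespace between byte pairs is ignored
    if len(cdb) != 16 or cdb[0] != 0x88:
        raise ValueError("not a READ(16) CDB")
    # Bytes 2-9: 64-bit big-endian LBA; bytes 10-13: 32-bit transfer length.
    lba, length = struct.unpack(">QI", cdb[2:14])
    return lba, length


# The CDB from the log excerpt above.
lba, length = decode_read16("88 00 00 00 00 03 f8 48 f5 38 00 00 00 80 00 00")
print(hex(lba), length)
```

For the line above this gives LBA 0x3f848f538 and a transfer length of 128 blocks. Decoding several of the logged CDBs this way may show whether the failing reads cluster in one region of the logical drive.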

The iLO and hpssacli both report all disks are fine and the filesystem is not read-only. The /dev/sdb device is a RAID10 using the server’s RAID controller, consisting of 20 x 900 GB disks.

This is a production server and while I’ve rebooted it once to try to clear this up, I’m reluctant to try an fsck without trying to determine what these messages mean when there are no other apparent issues.

So, any thoughts on what might be wrong here?

Solution:

Okay, I’ll answer with the normal troubleshooting techniques, but here’s my disclaimer:

  • I really don’t advocate running Ubuntu on bare metal hardware; especially HP ProLiant systems.
  • The support ecosystem is just not there for Ubuntu when it comes to HP systems, drivers, monitoring and value-add software.
  • The HP firmware packages are not built for Ubuntu, so god knows what firmware revisions you’re running on.
  • Ubuntu tends to introduce some quirky bugs that I never see with more commercial Linux distributions.

Please provide the following in your question or a separate pastebin.

  • I’d like the output of hpssacli ctrl all show config.
  • I’d like the output of hpssacli ctrl all show config detail.
  • Please give the output of df -h and fdisk -l.
  • Please post the output of lsscsi.
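
If it helps, the outputs requested above could be gathered in one pass with something like the sketch below. It assumes Python 3.7 or later and that the commands are run as root; the `COMMANDS` table and the `collect` helper are hypothetical names, and only the command lines themselves come from the list above.

```python
import shutil
import subprocess

# The commands requested above; most need root to return anything useful.
COMMANDS = {
    "config": ["hpssacli", "ctrl", "all", "show", "config"],
    "config_detail": ["hpssacli", "ctrl", "all", "show", "config", "detail"],
    "df": ["df", "-h"],
    "fdisk": ["fdisk", "-l"],
    "lsscsi": ["lsscsi"],
}


def collect(commands, timeout=60):
    """Run each command and return a dict mapping its label to its output."""
    results = {}
    for name, argv in commands.items():
        # Skip tools that are not installed instead of raising.
        if shutil.which(argv[0]) is None:
            results[name] = "{}: not found".format(argv[0])
            continue
        try:
            proc = subprocess.run(
                argv, capture_output=True, text=True, timeout=timeout
            )
            # Keep stderr too, since permission errors end up there.
            results[name] = proc.stdout + proc.stderr
        except subprocess.TimeoutExpired:
            results[name] = "{}: timed out after {}s".format(argv[0], timeout)
    return results
```

Calling `collect(COMMANDS)` and printing each entry under its label gives a single block of text that can go straight into the question or a pastebin.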

Since you’re on Ubuntu, you probably don’t have the HP Management Agents installed. While hpssacli can provide a spot check of the array health, the hp-snmp-agents package is what provides actual continuous monitoring.

If you do have some of the HP Health Agents installed, please run hplog -v to extract the IML log.

My guess is that you’re running an HP ProLiant DL380p Gen8 25-bay SFF server. Unpatched, many of those units suffered from Smart Array controller and controller cache failures. There are also some critical expander backplane updates that need to be run on that platform.

Update from the original poster:

I ended up fixing this by unmounting and recreating the filesystem, and I’ve not seen any error messages since re-enabling the database application on the server, even after it recreated nearly 4 TB of data from other cluster nodes. (I’m wondering if the past disk replacements in this server somehow contributed to the filesystem getting corrupted.)
