Monitoring HDD health on Linux (smartmontools)



S.M.A.R.T. ("Self-Monitoring, Analysis, and Reporting Technology") is a system built into modern hard drives that reports conditions which may indicate impending failure. smartmontools is a free software package that can monitor S.M.A.R.T. attributes and run hard drive self-tests. Although smartmontools runs on a number of platforms, we will only cover installing and configuring it on Linux.

Attributes, Values, Thresholds and S.M.A.R.T. examples

Each attribute describes one measured aspect of the hard drive controller's operation.

Each attribute has four values: current, worst, threshold and raw. The first three are normalized to a vendor-specific scale, whose upper end is typically 100, 200 or 253.

Higher normalized values are usually better than lower ones. The threshold marks the value at which the hard drive could fail. The worst value is the lowest value recorded so far for this attribute on this drive.

The raw value is a vendor-encoded count; decoding it gives the normalized values.

S.M.A.R.T. Interpretation

First, some important facts about threshold values. If the threshold is 0, the attribute is purely informational. If the threshold is 253, the attribute is used for testing only. A typical attribute set could be:

Attribute name: "Read Error Rate"
Current: 253
Worst: 253
Threshold: 63
Raw: 0

This is a healthy set: nothing is wrong with this attribute. Only if the current value dropped to the threshold of 63 would we have to replace the hard drive.

Let's look at an attribute with a warning status:

Attribute name: "Read Error Rate"
Current: 113
Worst: 85
Threshold: 63
Raw: 1234567

The hard drive has had sector read errors in the past, but it works fine for now and will (perhaps) keep working fine in the near future. Even so, we would now start making backups more often and begin planning a hard drive replacement.
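The rules above can be sketched as a small shell function. This is only an illustration: classify_attr is a hypothetical helper, not part of smartmontools, and the "failed in the past" branch is our assumption, borrowed from the way smartctl flags attributes whose worst value has reached the threshold.

```shell
#!/bin/sh
# classify_attr CURRENT WORST THRESHOLD
# Hypothetical helper (not part of smartmontools) that applies the
# rules described above to one attribute's normalized values.
classify_attr() {
    current=$1 worst=$2 threshold=$3
    if [ "$threshold" -eq 0 ]; then
        echo "informational (threshold is 0)"
    elif [ "$threshold" -eq 253 ]; then
        echo "test-only (threshold is 253)"
    elif [ "$current" -le "$threshold" ]; then
        echo "FAILING: current value has reached the threshold"
    elif [ "$worst" -le "$threshold" ]; then
        # Assumption: treat a worst value at the threshold as a past failure.
        echo "failed in the past: worst value has reached the threshold"
    else
        echo "ok (margin $((current - threshold)), worst margin $((worst - threshold)))"
    fi
}

classify_attr 253 253 63   # the healthy example
classify_attr 113 85 63    # the warning example
```

For the warning example the function reports a margin of 50 for the current value but only 22 for the worst value, which is the kind of gap that justifies more frequent backups.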

It is difficult to give general rules for interpretation, because different vendors normalize values in different ways. If you are unsure, we recommend asking in the vendor's forum.

Attribute Hit List

Here is a list of important attributes. We highly recommend looking at these S.M.A.R.T. attributes first.

  • Read Error Rate: stores data related to the rate of hardware read errors that occurred when reading data from a disk surface.
  • Reallocated Sector Count: when the hard drive finds a read/write/verification error, it marks the sector as "reallocated" and transfers the data to a special reserved area (spare area).
  • Spin Retry Count: stores the total count of spin start attempts needed to reach the fully operational speed.
  • End to End Error: counts the cases where, after a transfer through the cache RAM data buffer, the parity data between the host and the hard drive did not match.
  • Command Timeout: the count of aborted operations due to HDD timeout.
  • Reallocation Event Count: count of sector remap operations.
  • Current Pending Sector Count: count of "unstable" sectors (waiting to be remapped, because of read errors).
  • Uncorrectable Sector Count: the total count of uncorrectable errors when reading/writing a sector.
  • Soft Read Error Rate: count of off-track errors.

You can see that most of these attributes count errors when reading or writing sectors on the hard drive surface. If the raw counts keep rising or a normalized value reaches its threshold, make a backup and replace the hard drive.
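To pull just these attributes out of the attribute table printed by smartctl -A, a filter along the following lines can help. It is a sketch under assumptions: watch_attrs is a hypothetical name, the ID numbers (1, 5, 10, 184, 188, 196, 197, 198, 201) are the ones commonly used for the attributes listed above and may differ between vendors, and the sample table is made up for illustration.

```shell
#!/bin/sh
# watch_attrs: read "smartctl -A" output on stdin and print the
# attributes from the list above, selected by their usual ID numbers.
# Columns assumed:
#   ID# NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
watch_attrs() {
    awk '
        BEGIN { n = split("1 5 10 184 188 196 197 198 201", ids, " ")
                for (i = 1; i <= n; i++) want[ids[i]] = 1 }
        ($1 in want) && NF >= 10 {
            printf "%-24s value=%s worst=%s thresh=%s raw=%s\n", $2, $4, $5, $6, $10
        }'
}

# Made-up sample lines for illustration only; on a real system use:
#   sudo smartctl -A /dev/sda | watch_attrs
watch_attrs <<'EOF'
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   100   100   051    Pre-fail  Always       -       0
  5 Reallocated_Sector_Ct   0x0033   253   253   010    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   096   096   000    Old_age   Always       -       21472
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
EOF
```

Selecting by ID rather than by name avoids depending on the exact spelling smartctl uses for a given drive.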

Installation

On Debian or Ubuntu systems:

$ sudo apt-get install smartmontools

On Fedora (recent releases use dnf in place of yum):

$ sudo yum install smartmontools

Capabilities and Initial Tests

smartmontools comes with two programs: smartctl, which is meant for interactive use, and smartd, which continuously monitors S.M.A.R.T. data. Let’s look at smartctl first:

$ sudo smartctl -i /dev/sda

Replace /dev/sda with your hard drive’s device file in this command and all subsequent commands. If there’s only one hard drive in the system, it should be /dev/sda or /dev/hda. If this command fails, you may need to let smartctl know what type of hard drive interface you’re using:

$ sudo smartctl -d TYPE -i /dev/sda

Where TYPE is usually one of ata, scsi, or sat (for Serial ATA). See the smartctl man page for more information. Note that if you need -d here, you will need to add it to all smartctl commands. The command should print information similar to:

=== START OF INFORMATION SECTION ===
Model Family:     SAMSUNG SpinPoint T133 series
Device Model:     SAMSUNG HD300LJ
Serial Number:    S0D7J1UL303628
Firmware Version: ZT100-12
User Capacity:    300,067,970,560 bytes
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   7
ATA Standard is:  ATA/ATAPI-7 T13 1532D revision 4a
Local Time is:    Fri Jan  2 03:08:20 2009 EST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
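The last two lines of this output are the ones worth checking in a script. A minimal sketch, assuming the "SMART support is:" wording shown above (smart_support_state is a hypothetical helper name):

```shell
#!/bin/sh
# smart_support_state: read "smartctl -i" output on stdin and report
# whether SMART is available and enabled (hypothetical helper).
smart_support_state() {
    info=$(cat)
    if ! printf '%s\n' "$info" | grep -q '^SMART support is: *Available'; then
        echo "unavailable"
    elif printf '%s\n' "$info" | grep -q '^SMART support is: *Enabled'; then
        echo "enabled"
    else
        echo "disabled"
    fi
}

# Lines copied from the sample output above; on a real system use:
#   sudo smartctl -i /dev/sda | smart_support_state
smart_support_state <<'EOF'
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
EOF
```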

Now that smartctl can access the drive, let’s turn on some features. Run the following command:

$ sudo smartctl -s on -o on -S on /dev/sda

  • -s on: This turns on S.M.A.R.T. support or does nothing if it’s already enabled.
  • -o on: This turns on offline data collection. Offline data collection periodically updates certain S.M.A.R.T. attributes. Theoretically this could have a performance impact. However, from the smartctl man page: “Normally, the disk will suspend offline testing while disk accesses are taking place, and then automatically resume it when the disk would otherwise be idle, so in practice it has little effect.”
  • -S on: This enables “autosave of device vendor-specific Attributes”.

The command should return:

=== START OF ENABLE/DISABLE COMMANDS SECTION ===
SMART Enabled.
SMART Attribute Autosave Enabled.
SMART Automatic Offline Testing Enabled every four hours.

Next, let’s check the overall health:

$ sudo smartctl -H /dev/sda

This command should return:

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
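When smartctl is called from a script, its exit status is easier to check than the printed text. The smartctl man page documents the exit status as a bit mask; the sketch below decodes it, with the messages paraphrased from that man page. explain_smartctl_status is a hypothetical helper name.

```shell
#!/bin/sh
# explain_smartctl_status STATUS
# Decode the bit mask that smartctl returns as its exit status.
explain_smartctl_status() {
    status=$1
    [ "$status" -eq 0 ] && { echo "no problems reported"; return 0; }
    bit=0
    for msg in \
        "command line did not parse" \
        "device could not be opened" \
        "a SMART command failed or returned bad data" \
        "SMART status check returned DISK FAILING" \
        "a prefail attribute is at or below its threshold" \
        "an attribute was at or below its threshold in the past" \
        "the device error log contains errors" \
        "the self-test log contains errors"
    do
        # Test bit number $bit of the status value.
        if [ $(( (status >> bit) & 1 )) -eq 1 ]; then
            echo "bit $bit: $msg"
        fi
        bit=$((bit + 1))
    done
}

# Example with a status of 8 (only bit 3 set); on a real system use:
#   sudo smartctl -H /dev/sda; explain_smartctl_status $?
explain_smartctl_status 8
```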

If smartctl -H doesn’t return PASSED, you should immediately back up all your data. Your hard drive is probably failing.

Next, let’s make sure that the drive supports self-tests. We have yet to see a drive that doesn’t, but the following command also gives time estimates for each test:

$ sudo smartctl -c /dev/sda

We won’t list the complete output because it’s somewhat lengthy. Make sure “Self-test supported” appears in the “Offline data collection capabilities” section. Also, look for output similar to:

Short self-test routine
recommended polling time: 	 (   2) minutes.
Extended self-test routine
recommended polling time: 	 ( 127) minutes.

These are rough estimates of how long the short and long self-tests will take, respectively. Let’s run the short test:

$ sudo smartctl -t short /dev/sda

On our drive, this test should take 2 minutes, but this obviously varies. To check the results, run:

$ sudo smartctl -l selftest /dev/sda

If the result is not there yet, rerun the command after a few minutes. The “Self-test execution status” entry in the smartctl -c output typically shows how much of a running test remains. A successful run will look like:

=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%     21472         -
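For scripted checks, the newest entry of this log can be extracted with a short awk filter. This is a sketch that assumes the column layout shown above; latest_selftest is a hypothetical helper name.

```shell
#!/bin/sh
# latest_selftest: read "smartctl -l selftest" output on stdin and
# print the type and status of the most recent entry (the "# 1" row).
latest_selftest() {
    awk '
        /^# *1 / {
            sub(/^# *1 +/, "")            # drop the entry number
            sub(/ +[0-9]+% .*$/, "")      # drop "Remaining" and what follows
            gsub(/  +/, ": ")             # join description and status
            print
            found = 1
            exit
        }
        END { if (!found) print "no self-test recorded" }'
}

# Row copied from the sample output above; on a real system use:
#   sudo smartctl -l selftest /dev/sda | latest_selftest
latest_selftest <<'EOF'
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%     21472         -
EOF
```

For the sample row this prints "Short offline: Completed without error".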

Now, do the same for the long self-test:

$ sudo smartctl -t long /dev/sda

The long test can take a significant amount of time. You might want to run it overnight and check the results in the morning. If either test fails, you should immediately back up all your data and read the last section of this guide.

Configuring smartd

We have now enabled some features and run the basic tests. Instead of repeating the previous section daily, we can set up smartd to do it all automatically.

If your system has an /etc/smartd.conf file, check for a line that begins with DEVICESCAN. If you find one, comment it out by adding a ‘#’ to the beginning of the line. DEVICESCAN doesn’t work on our test system, and specifying a device file is easy. Add the following line to /etc/smartd.conf:

/dev/sda -a -d sat -o on -S on -s (S/../.././02|L/../../6/03) -m root -M exec /usr/share/smartmontools/smartd-runner

Here’s what each option does:

  • /dev/sda: Replace this with the device file you’ve been using in smartctl commands.
  • -a: This enables some common options. You almost certainly want to use it.
  • -d sat: On our system, smartctl correctly guesses that we have a Serial ATA drive. smartd, on the other hand, does not. If you had to add a “-d TYPE” parameter to the smartctl commands, you’ll almost certainly have to do the same here. If you didn’t, try leaving it out initially. You can add it later if smartd fails to start.
  • -o on, -S on: These have the same meaning as the smartctl equivalents.
  • -s (S/../.././02|L/../../6/03): This schedules the short and long self-tests. In this example, the short self-test will run daily at 2:00 A.M. The long test will run on Saturdays at 3:00 A.M. For more information, see the smartd.conf man page.
  • -m root: If any errors occur, smartd will send email to root. On our system, mail for root is forwarded to a normal email account. If you don’t have a similar setup, replace root with your normal email address. This option also requires a working email setup. Most Linux distributions automatically have working outbound email.
  • -M exec /usr/share/smartmontools/smartd-runner: This last part may be specific to the Debian and Ubuntu smartmontools packages. Check if your system has /usr/share/smartmontools/smartd-runner. If it doesn’t, remove this option. Instead of sending email directly, “-M exec” makes smartd run a different command when errors occur. On Debian, smartd-runner will run each script in /etc/smartmontools/run.d/, one of which emails the user specified by the “-m” option.
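The -s schedule is easier to reason about with a quick experiment. According to the smartd.conf man page, smartd builds an 11-character string of the form T/MM/DD/d/HH (test type, month, day, weekday from 1 for Monday to 7 for Sunday, hour) and runs a test when the regular expression matches it. The sketch below imitates that check with grep; schedule_matches is a hypothetical helper, and matching against the whole string is our assumption about how smartd applies the expression.

```shell
#!/bin/sh
# schedule_matches REGEX STAMP
# Succeeds when STAMP (an 11-character T/MM/DD/d/HH string) matches the
# -s regular expression as a whole. d is the weekday, 1 (Monday) to 7 (Sunday).
schedule_matches() {
    printf '%s\n' "$2" | grep -Eqx -- "$1"
}

regex='(S/../.././02|L/../../6/03)'

# Short test at 02:00 on a Thursday, long test at 03:00 on a Saturday,
# and a long test at 03:00 on a Thursday (which should not run).
for stamp in S/01/15/4/02 L/01/17/6/03 L/01/15/4/03; do
    if schedule_matches "$regex" "$stamp"; then
        echo "$stamp: test would run"
    else
        echo "$stamp: no test"
    fi
done

# Stamp for a short test in the current hour, built with date(1):
echo "S/$(date +%m/%d/%u/%H)"
```

Trying a few stamps like this before restarting smartd is a cheap way to catch a schedule that never fires.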

If you have more than one hard drive in your system, add a line for each one replacing /dev/sda with a different device file.
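With several drives, the lines can be generated instead of typed. The sketch below only prints them so they can be reviewed before being added to /etc/smartd.conf. smartd_lines is a hypothetical helper, /dev/sdb is an assumed second drive, and the options are simply those of the example line above.

```shell
#!/bin/sh
# smartd_lines DEVICE...
# Print one smartd.conf line per device, reusing the options from the
# example above. Adjust or drop "-d sat" and "-M exec ..." as discussed.
smartd_lines() {
    for dev in "$@"; do
        printf '%s -a -d sat -o on -S on -s (S/../.././02|L/../../6/03) -m root -M exec /usr/share/smartmontools/smartd-runner\n' "$dev"
    done
}

# /dev/sdb is a hypothetical second drive.
smartd_lines /dev/sda /dev/sdb
```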

If your system has the file /etc/default/smartmontools, uncomment the “#start_smartd=yes” line by removing the “#”.

Finally, restart smartd (on systemd-based systems, the equivalent is usually sudo systemctl restart smartd):

$ sudo /etc/init.d/smartmontools restart

If this command fails, the end of /var/log/daemon.log should have some diagnostic information. If smartd started fine, we should still test that email notifications are working.

Add “-M test” to the end of the configuration line in /etc/smartd.conf. This will make smartd send out a test notification when it’s next started.

Once again, restart smartd:

$ sudo /etc/init.d/smartmontools restart

You should receive an email similar to:

This email was generated by the smartd daemon running on:

  host name: polar
  DNS domain: shadypixel.com
  NIS domain: (none)

The following warning/error was logged by the smartd daemon:

TEST EMAIL from smartd for device: /dev/sda

For details see host's SYSLOG (default: /var/log/syslog).

Afterward, you can delete “-M test”.

What To Do If smartd Detects Problems

First, immediately back up everything. Depending on the error, your drive might be close to death or it may still have a long life ahead of it.