Fault Tolerance

Welcome to our blog! Our aim with these posts is to improve your understanding of technology in the IT industry and increase your efficiency in using it.

Our last blog introduced the concept of Disaster Recovery and Fault Tolerance (DR/FT), one of the most important parts of what we do as your one-stop shop for IT management. You can read that article for basic definitions of Disaster Recovery and Fault Tolerance.

Fault tolerance is our first step in guiding you (company or individual) toward a good DR/FT plan. Our rationale is simple: when your IT is down, your business is more than likely down…not good. With this post, we want to give you some insight into how you can achieve fault-tolerant computing. Remember, fault tolerance is about reducing unscheduled downtime (maintaining uptime).
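To put a number on uptime, here is a minimal sketch in Python. It treats availability as the share of scheduled hours during which the system was actually up. The function name `availability_percent` and the figures in the example are our own illustrative assumptions, not measurements from a real system.

```python
def availability_percent(scheduled_hours: float, downtime_hours: float) -> float:
    """Return uptime as a percentage of the hours the system was supposed to run."""
    if scheduled_hours <= 0:
        raise ValueError("scheduled_hours must be positive")
    if not 0 <= downtime_hours <= scheduled_hours:
        raise ValueError("downtime_hours must be between 0 and scheduled_hours")
    return 100.0 * (scheduled_hours - downtime_hours) / scheduled_hours


# Hypothetical example: 8 hours of unscheduled downtime in a 30-day month (720 hours).
print(round(availability_percent(720, 8), 2))  # prints 98.89
```

Even one lost working day in a month drops availability to under 99 percent, which is why small preventive steps are worth the effort.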

Maintaining uptime means keeping your computer running and working well. That requires looking at the sources of failure for computers. This can be taken to the “Nth” degree, so let’s keep it simple for our introduction to FT. Let’s look at a few sources of system failure (from a hardware perspective) and examine some action steps you can take to reduce downtime.

In our experience, the power supply is usually the first thing we check when a system fails. A power supply does exactly what the name indicates: it supplies power to the different components of the computer (converting AC power to DC). The power supply can fail for many reasons, but the most likely culprit is dust and dirt. The fan that cools the power supply can get gunked up and stop moving enough air to keep it cool, hence the failure. Easy fix for that…clean out your system (including your power supply) with a can of compressed air. You can also make sure your computer is in an environment that is as free of dust and dirt as possible. Taking these simple steps will extend the life of your power supply and lessen the chance of failure.

Another source of unscheduled downtime is a failed processor fan (sometimes referred to as a CPU fan). The proc fan (that’s what the techs call it) works in tandem with a heatsink to carry heat away from the central processor, the computer’s brain. If the CPU becomes too hot, your computer will shut down to protect itself. These fans are also assaulted by dust and dirt, which gum up the fan’s bearings. As the fan becomes less efficient or fails, heat increases and BAM! You have a bit of unscheduled downtime…boo!! You definitely want to avoid this, and you can, with the same precautions that reduce power supply failure. Carefully clean out the CPU fan with compressed air and place the computer in a clean location that is as free of dust and dirt as possible. If you are in a very dusty, dirty environment, stay on top of the cleaning!
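The protective shutdown described above boils down to comparing a temperature reading against limits. The Python sketch below shows that decision only. The function name `thermal_action` and the thresholds of 80 °C and 100 °C are assumptions chosen for illustration; real limits vary by processor and are enforced by the hardware and firmware, not by a script like this.

```python
def thermal_action(temp_c: float, warn_c: float = 80.0, shutdown_c: float = 100.0) -> str:
    """Classify a CPU temperature reading (thresholds are illustrative assumptions)."""
    if temp_c >= shutdown_c:
        return "shutdown"  # too hot: the system powers off to protect itself
    if temp_c >= warn_c:
        return "warn"      # running hot: time to check the fan and clean out dust
    return "ok"


# Hypothetical readings in degrees Celsius.
for reading in (45.0, 82.5, 101.0):
    print(reading, thermal_action(reading))
```

A reading that sits in the "warn" range for long stretches is a good hint that the fan or heatsink needs cleaning before the shutdown limit is ever reached.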

The last common source of hardware failure that needs to be addressed for FT computing is the hard drive. There are a number of things that can cause a hard drive to quit working. Did you know it has moving parts? It does, so a shock (a physical blow) can cause problems, as can electrical issues, moisture/water, etc. Some of these challenges you can address, for example by placing the computer on a UPS (uninterruptible power supply, an external device and not an internal computer component) and a surge suppressor/voltage regulator. This will help with some of the electrical issues that can arise.
Some other hard drive failures can’t be avoided, and there are a few ways to deal with these.

One method of FT is to have your computer built with a RAID. RAID stands for Redundant Array of Independent Disks, a disk subsystem that increases performance, provides fault tolerance, or both. RAID uses two or more regular hard drives and a RAID controller, which is added as a plug-in card on motherboards that do not have RAID circuitry. Today, most motherboards have built-in RAID, but not necessarily every RAID configuration. In a fault-tolerant configuration, two or more hard drives are set up to duplicate data across the different drives. There are a number of configurations, and your IT professional will be able to give you a “best fit” for your situation.
That way, one hard drive can fail and you still stay “up” on your computer. No downtime…yeah!!
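To make the idea of duplicating data across drives concrete, here is a toy Python model of mirroring (the approach commonly called RAID 1). It is a sketch only: the class `MirroredVolume` and its methods are hypothetical names, the "disks" are plain dictionaries in memory, and a real RAID controller works at a much lower level than this.

```python
class MirroredVolume:
    """Toy model of disk mirroring: every write goes to all healthy disks."""

    def __init__(self, disk_count: int = 2):
        if disk_count < 2:
            raise ValueError("mirroring needs at least two disks")
        self.disks = [{} for _ in range(disk_count)]  # each dict stands in for one drive
        self.healthy = [True] * disk_count

    def write(self, block: int, data: bytes) -> None:
        """Store the same data on every disk that is still working."""
        if not any(self.healthy):
            raise IOError("all disks have failed")
        for index, disk in enumerate(self.disks):
            if self.healthy[index]:
                disk[block] = data

    def fail_disk(self, index: int) -> None:
        """Simulate a drive failure: the disk and its contents become unavailable."""
        self.healthy[index] = False
        self.disks[index].clear()

    def read(self, block: int) -> bytes:
        """Read from the first healthy disk that holds the block."""
        for index, disk in enumerate(self.disks):
            if self.healthy[index]:
                return disk[block]
        raise IOError("all disks have failed")


volume = MirroredVolume()
volume.write(0, b"invoice data")
volume.fail_disk(0)    # one drive dies
print(volume.read(0))  # prints b'invoice data' (served by the surviving drive)
```

The data survives the first failure because a full copy lives on the second drive. If both drives fail, the data is gone, which is why a mirror complements backups and does not replace them.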

These are some basic considerations when it comes to fault tolerance. Much of what needs to be done for fault-tolerant computing varies with each situation. Google, for example, cannot afford downtime, so its fault tolerance costs millions of dollars to operate. A small business more than likely will not need that level of fault tolerance. Therefore it behooves you to connect with a high-quality IT professional to discuss your FT needs, someone like…CSI Onsite! They don’t call us “Techs you can trust” for nothing. Give us a call and we will help you out.


