While we do have a formula that we use for designing and building our servers, each server is custom-made for each of our clients' needs. Here are some principles that we've created or subscribed to when designing the hardware and software setup of our servers:
Right now, storage is at an interesting point. It used to be there were hard drives, and, well, other hard drives. But today, we have solid-state drives. These drives are very fast, which is great for running operating systems and other applications, but that speed does little for a file server (because the network, not the storage, is the bottleneck). They're also very expensive. So if you have a large quantity of data that's just going to sit there until you access it, putting it on SSDs is not a good use of money. At some point in the future, SSDs may reach price parity with HDDs. But for now, we need to find the best mix of performance and economy.
This is why we use a mix of SSDs and HDDs in most of our servers. For instance, our standard file server will include an SSD array for the OS and applications, and an HDD array for file data storage. An application server, on the other hand, which stores a lot less data but needs to access and process it very quickly, might contain only the SSD array.
It's been many years since we've recommended building a server on "bare metal", that is, loading the server's operating system directly onto the physical hardware. Except for niche cases, we virtualize all of our servers. We do this by creating a base installation of Linux, then installing KVM (Kernel-based Virtual Machine) on top of that. From there, we can create the Linux or Windows server(s) for use on the network.
Why virtualize? Virtualization abstracts the hardware from the server, which gives us a couple of advantages:
- A single physical server can run multiple virtual servers, which saves on costs because there is less hardware to purchase and maintain. New servers can also be created dynamically, often without having to purchase new hardware.
- Because the virtual server is not aware of the underlying hardware, it can be moved to a different physical server to reduce downtime when a hardware issue will take an extended period of time to repair.
Redundancy comes in many forms, and it's all about maintaining the highest level of uptime possible, but within budget! Some businesses rely very little on real-time access to the network, and could afford to be down for an hour every few months. Others are much more reliant on full uptime, and for them a few minutes of downtime per year can be very costly. There are solutions available to fit all of these needs, and the selected solution should fit the need. It's costly to pay for more guaranteed uptime than you need. But it's also costly to not have it in place when you do need it.
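To put the two ends of that range side by side, it can help to convert a downtime budget into an availability percentage. The following is a minimal sketch; the function names and the example figures (one hour per quarter, five minutes per year) are illustrative assumptions, not measurements from any particular client.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def availability_percent(downtime_hours_per_year: float) -> float:
    """Return uptime as a percentage of the year."""
    return 100.0 * (1 - downtime_hours_per_year / HOURS_PER_YEAR)


def allowed_downtime_minutes(availability_pct: float) -> float:
    """Return the yearly downtime budget, in minutes, for an availability target."""
    return (1 - availability_pct / 100.0) * HOURS_PER_YEAR * 60


# Assumed example: one hour of downtime every three months is four hours per year.
relaxed = availability_percent(4)
# Assumed example: five minutes of downtime per year.
strict = availability_percent(5 / 60)

print(f"relaxed business: {relaxed:.3f}% uptime")
print(f"strict business:  {strict:.4f}% uptime")
```

The gap between the two results looks small as a percentage, which is why it is usually easier to discuss the budget in hours or minutes of downtime per year.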
At a very minimum, all of our servers use RAID (Redundant Array of Independent Disks) to store data and run the OS. RAID comes in a variety of levels. Some of the most common ones are:
- RAID 1: A pair of drives that are mirrored in real time. One drive can fail, and the other drive continues to function and contains all of the data. RAID 1 is good for arrays where you're virtually certain you won't need more space before the server reaches end-of-life, because these arrays can be time-consuming to upgrade (both drives in the array need to be replaced). This works well for the array where the operating system is stored.
- RAID 5: A minimum of three drives where the data is spread across the drives in real time. Data is broken into "chunks". One chunk is written to one drive, then another chunk to the second drive, then a parity value is calculated from chunks 1 and 2, and that result is written to the third drive. If any one drive fails, the remaining drives can recalculate the lost values, and no data is lost. The total space in a RAID 5 array is the size of the smallest drive multiplied by one less than the total number of drives. These arrays are simple to upgrade by adding more drives. However, they are dangerous with large drives. When a drive fails, it needs to be replaced, and rebuilding the array onto the new drive can take days. During this time, your array is vulnerable to data loss if a second drive fails.
- RAID 6: This works like RAID 5, except it requires a minimum of four drives, and can sustain the loss of two drives without any data loss. This makes it much more resilient to drive failures. The total space available on a RAID 6 array is the size of the smallest drive multiplied by two less than the total number of drives.
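The "calculation" in RAID 5 is typically a bitwise XOR of the data chunks. Here is a minimal sketch of the idea with two data chunks and one parity chunk. The chunk contents and the helper name `xor_bytes` are made up for illustration; real arrays work on fixed-size stripes and rotate the parity across all drives.

```python
def xor_bytes(a: bytes, b: bytes) -> bytes:
    """XOR two equal-length byte strings, byte by byte."""
    if len(a) != len(b):
        raise ValueError("chunks must be the same length")
    return bytes(x ^ y for x, y in zip(a, b))


# Two data chunks, as they would be written to drives 1 and 2.
chunk1 = b"INVOICES"
chunk2 = b"PAYROLL!"

# The parity chunk, as it would be written to drive 3.
parity = xor_bytes(chunk1, chunk2)

# Simulate losing drive 2: rebuild its chunk from the two survivors.
recovered = xor_bytes(chunk1, parity)
print(recovered == chunk2)
```

Because XOR is its own inverse, the same function rebuilds whichever single chunk is missing, which is why a RAID 5 array survives the loss of any one drive but not two.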
RAID arrays have been around for a long time and are a solid technology. One of their downsides is upgradability. With RAID 5 and 6, you can upgrade by adding more drives. But you don't benefit by adding larger drives. For instance, let's say you have a RAID 5 array with three 1TB drives. If you add a 2TB drive, half of that new drive will go to waste. If you want to use larger drives, you have to replace every drive in the array, and this is very time-consuming.
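The capacity rules above are easy to check with a few lines of arithmetic. This sketch applies the "smallest drive times (drives minus parity)" formulas to the example of adding a 2TB drive to three 1TB drives; the function names are our own for illustration.

```python
def usable_capacity_tb(level: int, drive_sizes_tb: list[float]) -> float:
    """Usable space for RAID 1, 5, or 6, based on the smallest drive in the array."""
    n = len(drive_sizes_tb)
    minimum_drives = {1: 2, 5: 3, 6: 4}
    if level not in minimum_drives:
        raise ValueError("only RAID 1, 5, and 6 are covered here")
    if n < minimum_drives[level]:
        raise ValueError(f"RAID {level} needs at least {minimum_drives[level]} drives")
    if level == 1 and n != 2:
        raise ValueError("RAID 1 is treated here as a mirrored pair")
    smallest = min(drive_sizes_tb)
    redundancy_drives = {1: 1, 5: 1, 6: 2}[level]
    return smallest * (n - redundancy_drives)


def wasted_capacity_tb(drive_sizes_tb: list[float]) -> float:
    """Space on larger drives that the array cannot use."""
    smallest = min(drive_sizes_tb)
    return sum(size - smallest for size in drive_sizes_tb)


before = [1, 1, 1]      # three 1TB drives in RAID 5
after = [1, 1, 1, 2]    # the same array with a 2TB drive added

print(usable_capacity_tb(5, before), usable_capacity_tb(5, after), wasted_capacity_tb(after))
```

The added 2TB drive contributes only 1TB to the array, and the other 1TB shows up as wasted capacity.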
Additionally, it should be noted that RAID IS NOT BACKUP! There are many ways to incur data loss, such as drive failure, malware infection, data corruption, theft, and user error. RAID protects you only from drive failure. RAID should not be counted as a backup method, and it should not take the place of proper data backups. RAID is intended as a downtime prevention method. With a single drive, if that drive fails, you are down until the system can be rebuilt and restored from backup. With a RAID setup, a drive failure will not usually have that impact, and the failed drive can be scheduled for replacement outside of normal business hours.
A distributed filesystem is one in which a group of drives spanning multiple servers appears logically as one large filesystem where data can be stored. Our distributed filesystem of choice is LizardFS, because:
- It is open source, and thus the source code is open to a wide range of scrutiny to make sure it's solid.
- It is highly flexible.
- It is simple to maintain.
- It is widely supported.
LizardFS allows us to distribute data over as many servers as we want, and each server can have a different number of drives of all different sizes. It simply adds all of the drives together, and that's how much raw space is available. For protection, we can specify a "goal", which is the number of copies we want kept, on different servers, of the "chunks" that make up any given file or directory. This creates the redundancy. In a properly functioning LizardFS setup, you can unplug the power cable from one of the servers, and nothing goes down.
For applications where preventing downtime is a primary concern, a distributed filesystem combined with server virtualization is the ultimate solution.
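As a rough illustration of how pooled space and a goal interact, here is a simplified model. It is not LizardFS's API or its actual placement algorithm; the server names, drive sizes, and function names are hypothetical, and the usable figure is a best-case upper bound that assumes copies can be spread evenly.

```python
def raw_capacity_tb(servers: dict[str, list[float]]) -> float:
    """Add up every drive on every server."""
    return sum(sum(drives) for drives in servers.values())


def best_case_usable_tb(servers: dict[str, list[float]], goal: int) -> float:
    """Upper bound on usable space when every chunk is stored `goal` times."""
    if goal < 1:
        raise ValueError("goal must be at least 1")
    if goal > len(servers):
        raise ValueError("goal cannot exceed the number of servers")
    return raw_capacity_tb(servers) / goal


def server_failures_tolerated(goal: int) -> int:
    """With copies on different servers, goal - 1 servers can be lost."""
    return goal - 1


# Hypothetical cluster: uneven drive counts and sizes (in TB).
cluster = {
    "node-a": [4, 4, 2],
    "node-b": [8, 1],
    "node-c": [6, 3, 2],
}

print(raw_capacity_tb(cluster), best_case_usable_tb(cluster, goal=2), server_failures_tolerated(2))
```

In this model a goal of 2 halves the raw space in exchange for surviving the loss of one server, which matches the "unplug the power cable" scenario described above.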
Windows or Linux
We're big fans of using the right tool for the job, whether that's Windows or Linux. Someone looking at us from the outside might say that we're slanted towards Linux. But it's not our fault that Linux just so happens to usually be the right tool for the job. Linux has many advantages over Windows:
- Linux is open source, which means bugs tend to be found and fixed more quickly, and the code is frequently audited by third parties. Windows, being closed-source, cannot have its source code verified in the same way.
- It's infinitely expandable without having to cope with ever-changing licensing schemes.
- It has a huge amount of flexibility, because it's written as many tiny stackable components that can be organized to fit the situation rather than as a monolithic "take it or leave it" approach.
- It is far more secure.
- It's simpler to use in the creation of a high-availability (low-downtime) structure.
There are some cases where Windows is required. For instance, some applications are Windows-only, and require a Windows server in order to run properly. In this case, we can use virtualization to create a Windows server specifically for that purpose, while relegating all non-Windows-only tasks to one or more Linux servers.
All of your critical back-office equipment should be on battery backup, and servers are no exception. The big issue is not powering your server for hours when the power goes out for an extended period of time, but rather preventing your server's operation from being interrupted because the power flickered out for half a second. This is enough time to turn your server off as if someone had pushed the power button, which means you're not just down for that half a second, but for the amount of time it takes to bring the server back up. And shutting down a system uncleanly like this can sometimes mean that starting back up is not a simple process.
A proper battery backup solution will:
- Keep your server running during brown-outs (when the power dips or briefly flickers out).
- Protect your server from spikes, prolonging the life of the hardware.
- Tell your server when the power is out for an extended period of time, and initiate an orderly shutdown, and then bring the server back up normally once the power is restored.
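The decision a battery backup helps the server make can be sketched as a small rule: ride through short interruptions, and shut down cleanly only when the outage drags on or the battery runs low. The thresholds, names, and structure below are hypothetical and chosen only for illustration; actual UPS monitoring software has its own configuration.

```python
from enum import Enum


class Action(Enum):
    KEEP_RUNNING = "keep running"
    SHUT_DOWN = "orderly shutdown"


def ups_action(
    on_battery: bool,
    seconds_on_battery: float,
    battery_percent: float,
    max_seconds_on_battery: float = 300,  # assumed threshold: 5 minutes
    min_battery_percent: float = 30,      # assumed threshold: 30% charge
) -> Action:
    """Decide what the server should do based on the UPS status."""
    if not on_battery:
        return Action.KEEP_RUNNING
    outage_too_long = seconds_on_battery >= max_seconds_on_battery
    battery_too_low = battery_percent <= min_battery_percent
    if outage_too_long or battery_too_low:
        return Action.SHUT_DOWN
    return Action.KEEP_RUNNING


# A half-second flicker: the battery covers it and the server stays up.
print(ups_action(on_battery=True, seconds_on_battery=0.5, battery_percent=100).value)
# A ten-minute outage: time for a clean shutdown before the battery runs out.
print(ups_action(on_battery=True, seconds_on_battery=600, battery_percent=55).value)
```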
We recommend and implement the 3-2-1 Backup Strategy. This means:
- 3 Copies of your data.
- 2 Different media.
- 1 Offsite.
Using this strategy, it would be almost impossible to suffer a major data loss. Given the often-cited claim that 60% of companies shut down within 6 months of a catastrophic data loss, this is important! Here's how we implement this for you.
First, we set up an onsite backup. This is usually an external drive connected to your server. It is scheduled to run a backup every few hours so that it runs several times per day. Second, we configure an offsite backup, which runs overnight daily and copies all of your data over an encrypted connection to an offsite server. Using this setup, your network then satisfies the 3-2-1 Backup Strategy:
- 3 Copies of your data: The live copy, the local backup, and the remote backup.
- 2 Different media: The server's internal drives and the external backup drive.
- 1 Offsite: The daily automated backup.
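The rule is simple enough to check mechanically. The sketch below describes each copy of the data and tests a plan against the three conditions; the `DataCopy` class, the medium labels, and the function name are assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataCopy:
    name: str
    medium: str    # what the copy is stored on
    offsite: bool  # True if the copy lives away from the office


def satisfies_3_2_1(copies: list[DataCopy]) -> bool:
    """Check for 3 copies, 2 different media, and at least 1 offsite copy."""
    distinct_media = {copy.medium for copy in copies}
    has_offsite = any(copy.offsite for copy in copies)
    return len(copies) >= 3 and len(distinct_media) >= 2 and has_offsite


plan = [
    DataCopy("live copy", medium="server RAID array", offsite=False),
    DataCopy("local backup", medium="external drive", offsite=False),
    DataCopy("remote backup", medium="offsite server", offsite=True),
]

print(satisfies_3_2_1(plan))
```

Dropping the remote backup from the plan fails the check twice over: only two copies remain, and none of them is offsite.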