Draft Warning
You’ve reached a draft 🤷♂️ and unfortunately, it’s a work in progress.
Introduction
Bare-metal has become increasingly available and affordable as an option for hosting one or more WordPress sites, similar to a VPS instance. There are many pros and cons to using bare-metal.
The pros are dedicated hardware that isn’t shared or virtualized, so it doesn’t suffer from overcommitted resources and CPU steal, and a choice of processor and other hardware components. The cons are a slower time to deploy, hardware issues that may result in extended downtime or time-consuming troubleshooting, less automation, and no snapshot mechanism.
This article aims to lay it all on the table and covers the most concerning topics about using bare-metal with WordPress.
Terminology: Bare-metal vs Dedicated Servers
The term bare-metal is essentially the same as dedicated server; either way, you’re renting hardware. However, LeaseWeb does have a good explanation of the differences.
Bare Metal Cloud has flexible billing options, allowing you to pay per-hour, monthly, or reserve resources in advance. Dedicated servers, however, require signing monthly or yearly contracts. Although both represent single-tenant environments, Bare Metal Cloud provides easier access to the latest generation hardware.
https://www.leaseweb.com/dedicated-servers
LeaseWeb says Bare Metal Cloud, so again, we get into this tussle with terminology. Bare Metal Cloud is a product offering from LeaseWeb, and similar offerings are available from other service providers. Instead of renting month-to-month, you can rent bare-metal on demand, which will most likely cost more than month-to-month.
I can only theorize that bare-metal became popular due to the cloud boom and the ability to pay per hour or per month, on demand. I would love to be corrected here, so feel free to leave comments.
1. Bare-Metal vs Virtualization (VPS and Cloud) Performance
Bare-metal and virtualization (VPS and cloud) platforms differ in performance. While virtualization is commonly used to power most VPS and cloud platforms, bare-metal offers superior performance. Virtualization allows on-demand instances to be scaled up or down to meet customers’ needs. In contrast, bare-metal servers are powerful physical servers that lack the virtualization layer and provide direct access to the underlying hardware. There is no sharing of resources with bare-metal, and you’re guaranteed 100% of the resources available. Virtualization can be configured to give the same guarantee, but VPS and Cloud providers typically don’t offer it unless you pay for dedicated resources.
Bare-Metal Performance
One of the benefits of bare-metal is its performance compared to virtualization, which powers most VPS and Cloud platforms. These platforms use virtualization to provide their customers with on-demand instances that can be scaled up and down. A node is a powerful bare-metal server that runs the virtualization layer, with the appropriate hardware and sometimes software redundancies to keep operating if a component fails.
A virtual instance or guest runs on the node, and the node has a hypervisor that works to provide resources and coordinate the virtualization aspect of the node.
A hypervisor, also known as a virtual machine monitor or VMM, is software that creates and runs virtual machines (VMs). A hypervisor allows one host computer to support multiple guest VMs by virtually sharing its resources, such as memory and processing.
https://www.vmware.com/topics/glossary/content/hypervisor.html
Virtualization Performance
Virtualized servers rely on the hypervisor layer to allocate resources and manage virtual machines, which can introduce overhead and impact performance. One of the main reasons for adopting virtualization was to eliminate the need for multiple bare-metal servers running workloads that only peak 10% of the time. By virtualizing a single bare-metal server, multiple workloads can run on it efficiently.
Virtualization enables overcommitting of resources: allocating more resources to virtual guest instances than are available on the physical node. An ideal overcommit ratio would ensure that all virtual guest instances can efficiently utilize the physical node’s resources. With some guest instances using the CPU infrequently, a 1:1 ratio would leave the physical node’s processor idle at times. Increasing the overcommit ratio allows more guest instances to be created on the physical node, so that otherwise idle processor time is used. However, during periods of peak demand, resource constraints can occur, resulting in reduced performance and CPU steal.
OpenStack allows you to overcommit CPU and RAM on compute nodes. This allows you to increase the number of instances running on your cloud at the cost of reducing the performance of the instances.
The default CPU allocation ratio of 16:1 means that the scheduler allocates up to 16 virtual cores per physical core. For example, if a physical node has 12 cores, the scheduler sees 192 available virtual cores. With typical flavor definitions of 4 virtual cores per instance, this ratio would provide 48 instances on a physical node.
https://docs.openstack.org/arch-design/design-compute/design-compute-overcommit.html
This quote is from the OpenStack documentation, but overcommitting isn’t specific to OpenStack; it’s available in multiple virtualization stacks.
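The arithmetic in the OpenStack quote can be reproduced with a few lines of Python. This is only a sketch; the function name and parameters are made up for illustration and are not part of any OpenStack API.

```python
def overcommit_capacity(physical_cores: int, allocation_ratio: float,
                        vcpus_per_instance: int) -> tuple[int, int]:
    """Return (schedulable virtual cores, instances that fit on the node)."""
    # The scheduler treats each physical core as `allocation_ratio` virtual cores.
    virtual_cores = int(physical_cores * allocation_ratio)
    # Each instance consumes a fixed number of virtual cores (its flavor).
    instances = virtual_cores // vcpus_per_instance
    return virtual_cores, instances


# The numbers from the quote: 12 physical cores, 16:1 ratio, 4 vCPUs per instance.
print(overcommit_capacity(12, 16, 4))  # prints: (192, 48)
```

With a 1:1 ratio the same node would only fit 3 such instances, which shows how much density a provider gains by raising the ratio.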
VPS and Cloud Providers Overcommit Ratio Abuse
Although overcommitting is an excellent feature, VPS and Cloud providers sometimes misuse the overcommit ratio. They will increase the ratio to stretch how many guest instances their bare-metal nodes can host, effectively making more money per node. This leads to resource contention if every guest instance on a bare-metal node asks for all the CPU resources it has been allotted. When this occurs, every guest ends up waiting for its requests to be serviced. This might not be an issue for some guest applications, but others might see direct slowdowns. WordPress suffers from this fate for many reasons that are outside of this article’s scope.
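If you suspect a provider is overcommitting, you can measure CPU steal from inside a Linux guest. The sketch below assumes the standard /proc/stat layout, where the eighth counter on the aggregate cpu line is steal time; the function names are illustrative.

```python
import time


def parse_cpu_line(line: str) -> list[int]:
    """Parse the aggregate 'cpu' line of /proc/stat into tick counters."""
    fields = line.split()
    if not fields or fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    return [int(value) for value in fields[1:]]


def steal_percent(before: list[int], after: list[int]) -> float:
    """Percentage of CPU time stolen by the hypervisor between two samples."""
    # Counter order: user nice system idle iowait irq softirq steal [guest ...]
    # Guest time is already included in user time, so only sum the first eight.
    deltas = [b - a for a, b in zip(before[:8], after[:8])]
    total = sum(deltas)
    if len(deltas) < 8 or total <= 0:
        return 0.0
    return 100.0 * deltas[7] / total


def sample_steal(interval: float = 1.0, path: str = "/proc/stat") -> float:
    """Take two samples `interval` seconds apart (Linux only)."""
    with open(path) as stat:
        first = parse_cpu_line(stat.readline())
    time.sleep(interval)
    with open(path) as stat:
        second = parse_cpu_line(stat.readline())
    return steal_percent(first, second)
```

Calling sample_steal() on a guest returns the steal percentage over the interval. A value that stays well above zero under load suggests the node is contended.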
2. Bare-Metal Hardware
With bare metal, you rent hardware, and the service provider supports the hardware. If a piece of hardware malfunctions, they will investigate and replace the affected hardware. This isn’t proactive; typically, the onus of monitoring and reporting is on the customer renting the hardware unless you purchase additional support and monitoring through the same service provider or third party.
Bare-metal versus Virtualization (VPS and Cloud) Hardware
To ensure high availability and resilience, some bare-metal servers have redundant hardware and software components that allow them to continue operating seamlessly even if a component fails. On the other hand, virtualized servers rely on the hypervisor layer to allocate resources and manage virtual machines, which can introduce overhead and impact performance. As a result, bare-metal servers are often favoured for workloads that require high performance and low latency, while virtualization is preferred for its flexibility and ease of management.
Choosing a Processor
In one of my responses, I mentioned looking out for Ryzen processors if they were available because the price per performance was great. I then referenced this AirTable I made.
https://wpmoarspeed.com/bare-metal-cpu-performance-by-provider-airtable/
It allows you to compare commonly offered bare-metal processors using single-core and multi-core benchmark results from cpubenchmark.net, which are based on online submissions. WordPress runs on PHP, which handles each request on a single thread, so single-core metrics are essential.
Processor clock speeds and the processor’s age must also be considered when choosing a processor.
A CPU’s clock speed represents how many cycles per second it can execute. Clock speed is also referred to as clock rate, PC frequency and CPU frequency. This is measured in gigahertz, which refers to billions of pulses per second and is abbreviated as GHz.
https://www.tomshardware.com/news/clock-speed-definition,37657.html
A processor’s overall performance isn’t defined by its clock speed. A processor from 10 years ago can have the same clock speed as a modern processor today but not perform at the same level. Why?
As processors advance, they execute more efficiently during a clock cycle, effectively allowing more work to be done per clock. This article lays out the improvements made over the years to improve processors’ efficiency.
https://www.howtogeek.com/177790/why-you-cant-use-cpu-clock-speed-to-compare-computer-performance/
This is why I created the AirTable, and why benchmarks are important for identifying how well a processor performs, especially for single-core and multi-core operations.
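The point about clock speed versus work done per clock can be sketched as a simple calculation. The instructions-per-clock (IPC) figures below are hypothetical and chosen only to illustrate the idea; real IPC varies by workload and isn’t a single fixed number per processor.

```python
def instructions_per_second(clock_ghz: float, ipc: float) -> float:
    """Rough throughput: cycles per second times instructions per cycle."""
    return clock_ghz * 1e9 * ipc


# Two hypothetical processors with the same 3.5 GHz clock speed.
older_cpu = instructions_per_second(clock_ghz=3.5, ipc=1.0)
newer_cpu = instructions_per_second(clock_ghz=3.5, ipc=2.0)

# Same clock, but the newer design does twice the work per cycle.
print(newer_cpu / older_cpu)  # prints: 2.0
```

This is why a benchmark score tells you more than the GHz figure on a spec sheet.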
Choosing Memory
ECC Memory
ECC memory isn’t a hard requirement, but it’s something that I try to consider when choosing a bare-metal server from a service provider. Not all processors support ECC memory; for instance, Intel’s desktop CPUs have had ECC memory disabled by default. However, they’re reversing this decision on their 12th gen Alder Lake CPUs: https://www.tomshardware.com/news/intel-enables-ecc-on-12th-gen-core-cpus
You might even find Intel workstation processors that support ECC memory, but the service provider doesn’t provide ECC memory by default or as an add-on: https://ark.intel.com/content/www/us/en/ark/search/featurefilter.html?productType=873&0_StatusCodeText1=3,4&0_ECCMemory=True
I still consider workstation processors as commodity processors simply because they may or may not support ECC memory.
Ryzen 7/9 CPUs support ECC memory, provided the motherboard also supports it, so you generally don’t have to worry about these processors.
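Once you have a server, you can check whether ECC is actually active rather than trusting the spec sheet. The sketch below parses the text output of dmidecode -t memory (run as root on Linux) and looks for the Error Correction Type field. The function names are illustrative, and the sample output in the usage note is abbreviated.

```python
def error_correction_types(dmidecode_output: str) -> list[str]:
    """Collect every 'Error Correction Type' value from dmidecode output."""
    found = []
    for line in dmidecode_output.splitlines():
        key, separator, value = line.strip().partition(":")
        if separator and key == "Error Correction Type":
            found.append(value.strip())
    return found


def has_ecc(dmidecode_output: str) -> bool:
    """True if any memory array reports an ECC error correction type."""
    return any("ECC" in kind for kind in error_correction_types(dmidecode_output))


# Abbreviated sample of what a memory array section can look like.
sample = """Physical Memory Array
\tLocation: System Board Or Motherboard
\tUse: System Memory
\tError Correction Type: Multi-bit ECC
"""
print(has_ecc(sample))  # prints: True
```

A value of None in that field means the platform is running without ECC, even if the processor supports it.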
DDR5 Memory
- This video explains DDR5 and on-die ECC https://www.youtube.com/watch?v=q61vJJ48rkI
Choosing Hard Drives
Hard drives were once the most common cause of server failures; with SSD and NVMe drives becoming more accessible and affordable, this has decreased. Since SSD and NVMe drives are not mechanical, a mechanical failure is unlikely to occur. There is still the possibility of an SSD or NVMe drive failing outright, or reaching its end of life through usage (wear) rather than failure.
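For NVMe drives, wear is reported in the SMART health log as a Percentage Used value, which smartctl -a prints for the device. The sketch below extracts that value from the text output; the function name is illustrative, and the sample text is abbreviated.

```python
import re
from typing import Optional


def nvme_percentage_used(smartctl_output: str) -> Optional[int]:
    """Return the NVMe 'Percentage Used' wear value, or None if absent."""
    match = re.search(r"^Percentage Used:\s*(\d+)%", smartctl_output, re.MULTILINE)
    return int(match.group(1)) if match else None


# Abbreviated sample of the health section printed for an NVMe drive.
sample = """Critical Warning:                   0x00
Available Spare:                    100%
Percentage Used:                    3%
"""
print(nvme_percentage_used(sample))  # prints: 3
```

This is worth checking on a newly delivered server, since a high value suggests the drive has already seen heavy use.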
Commodity versus Server/Enterprise Grade Hardware
The offering of commodity hardware for bare-metal is growing, primarily due to online gaming. There was a need for a higher-frequency processor for specific games where you could run a server for said game. This is where Intel’s Core i7 was sought after due to its high processor frequencies. This was several years ago, and now Ryzen 7/9 is an option for those needing high-frequency servers for gaming and WordPress.
There’s a general old-guard stance that if you need a server to run an application, it should be on server or enterprise-grade hardware, which I agree with, depending on multiple variables. I see Ryzen 7/9 and Intel i7/i9 processors and their platforms as suitable for server and application workloads, as long as the following is true: there’s IPMI access, ECC memory, and workstation-grade or higher hardware.
IPMI and Out of Band Access
Let’s start with IPMI access; for me, it’s a hard requirement and a deal breaker. Why? It’s the only way you can access your bare-metal server directly and instantly without requiring remote hands. It allows you to re-install the operating system, power cycle the server, and check hardware sensors and BIOS logs. It’s extremely powerful and provides a ton of valuable features.
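As a rough illustration of what IPMI access gives you, the sketch below builds command lines for the ipmitool client. It assumes ipmitool is installed and the provider exposes the BMC over the network. The host and user names are placeholders. The -E flag tells ipmitool to read the password from the IPMI_PASSWORD environment variable, so it stays out of the process list.

```python
def ipmitool_command(host: str, user: str, *subcommand: str) -> list[str]:
    """Build an ipmitool argument list for a remote BMC over IPMI v2 (lanplus)."""
    return ["ipmitool", "-I", "lanplus", "-H", host, "-U", user, "-E", *subcommand]


# Placeholder BMC address and user; substitute the details from your provider.
bmc_host, bmc_user = "bmc.example.com", "admin"

power_status = ipmitool_command(bmc_host, bmc_user, "chassis", "power", "status")
power_cycle = ipmitool_command(bmc_host, bmc_user, "chassis", "power", "cycle")
sensor_readings = ipmitool_command(bmc_host, bmc_user, "sdr", "list")
event_log = ipmitool_command(bmc_host, bmc_user, "sel", "list")

# To run one of them: subprocess.run(power_status, check=True)
print(" ".join(power_status))
```

Many providers put the BMC behind a VPN or a web console instead of exposing it directly, so how you reach it will vary.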
3. Bare-Metal Downtime
Once a hardware issue is reported, the service provider will investigate and address the hardware problem or potentially replace your server’s hardware entirely. It’s quite possible that your data will not be available as the drives in your bare-metal instance will also be replaced or a new server will be provided, and you’re responsible for restoring your data.
This differs slightly from using a VPS or Cloud since your instance is on a sizeable bare-metal server with more redundancy or mechanisms within the service provider’s infrastructure to be more resilient to hardware failures. You might have never had to deal with a VPS node failure before. The typical node failure means a reboot to replace hardware, a maintenance window to migrate guests or replace hardware, or an email stating that you need to restore your VPS instance from backup or a snapshot if you have either.
Hardware failures occur more frequently as a server’s run time increases, and with bare-metal, you’re sometimes using hardware previously used by another customer of the service provider, as bare-metal hardware gets reused. You rarely receive a never-before-used bare-metal server unless it’s a modern processor, you happen to get lucky, or you’re paying for a new server in a rent-to-own or lease-to-own fashion.
Downtime will occur, and you must be prepared regardless of the platform. Outages can occur on bare-metal, VPS and Cloud platforms. Downtime is what you observe, whereas failures might go unobserved when software or hardware redundancy allows a failure to occur without interrupting service. For instance, hardware redundancy allows a redundant component to take over from a failed one, and hot-swappable parts may allow the replacement without a service outage.
Similarly, software can operate in the same way, identifying an issue and mitigating the effects by utilizing techniques to ensure a workload continues to operate with minimal or zero loss of service. This might mean that workloads are moved or restarted to provide mitigation.
Hardware Failure Statistics
There’s lots of information and data online about hardware failures in servers. Unfortunately, I couldn’t find the holy grail of data and research. A gold standard does exist for hard drives: Backblaze’s statistics on hard drive failures in their data centers are superb. Bare-metal isn’t just about hard drives, though; power supplies, motherboards, memory and other supporting components can also fail. Here are some articles that I consider alright on the topic.
- https://www.synergy-technical.com/blogs/post/why-servers-fail
- https://singlesource-it.com/2021/08/23/server-failure-frequency-vs-age-the-stats-are-in/
- https://www.datacenterfrontier.com/voices-of-the-industry/article/11429014/the-bathtub-curve-and-data-center-equipment-reliability
The Bathtub Curve
The bathtub curve is new to me; the term is new, not the theory.
When digging into reliability engineering theories, you will quickly find the widely used Bathtub Curve. According to this theory, when a product is new to the market, there are substantial rates of early failures – which commonly result from an error with handling or installation. As the end of product life approaches, the rate increases due to a second and final wave of wear-out failures. Although the Bathtub Curve, pictured below, accurately reflects the failure behavior of many products, we have found it does not universally apply to data center equipment.
https://www.datacenterfrontier.com/voices-of-the-industry/article/11429014/the-bathtub-curve-and-data-center-equipment-reliability
The article that contains this quote includes a picture of the bathtub curve: high early failure rates, then a lower failure rate during useful life, and finally a slowly climbing failure rate after useful life.
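The shape of the bathtub curve can be sketched by adding three failure-rate terms: a decreasing early-failure term, a constant random-failure term, and an increasing wear-out term. The Weibull parameters below are invented purely to produce the shape and are not fitted to any real server data. As the quoted article notes, the curve doesn’t universally apply to data center equipment.

```python
def weibull_hazard(t: float, shape: float, scale: float) -> float:
    """Weibull failure rate at time t (t must be greater than zero)."""
    return (shape / scale) * (t / scale) ** (shape - 1)


def bathtub_hazard(t_years: float) -> float:
    """Illustrative failure rate per year at a given server age."""
    early_failures = weibull_hazard(t_years, shape=0.5, scale=10.0)  # falls over time
    random_failures = 0.02                                           # constant
    wear_out = weibull_hazard(t_years, shape=5.0, scale=8.0)         # rises over time
    return early_failures + random_failures + wear_out


# Failure rate is high when new, lowest mid-life, and climbs again with age.
for age in (0.1, 3.0, 8.0):
    print(f"{age:>4} years: {bathtub_hazard(age):.3f}")
```

A shape parameter below 1 gives a falling failure rate and one above 1 gives a rising rate, which is what creates the two sides of the bathtub.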
4. Bare-metal Disaster Recovery
Downtime and hardware failures of bare-metal aren’t necessarily higher than VPS or Cloud, as both are powered by bare-metal. There are failures of VPS and Cloud nodes (bare-metal hosts deployed by service providers) to this day, and there are different outcomes. You might need to reboot your instance, your instance might have already been migrated to a new node, your instance was destroyed and restored from backup by the service provider, or your instance was destroyed, and you have to restore from your own backups.
Putting it into perspective, Vultr deploys a bare metal server, configures it and makes it available for customers to create instances on it. They use bare metal with built-in redundancies like dual power supplies, dual network cards, and other such redundancies. They could still have a motherboard failure, memory failure and so on.