Monitoring CPU Steal using Monit

Why is monitoring CPU Steal important?

What is CPU Steal?

CPU steal time refers to the proportion of time that a virtual CPU on a cloud server is forced to wait for a physical CPU to become available for processing. This metric is significant in understanding the performance of virtual environments.
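To make the metric concrete, here is a minimal sketch of how steal can be read on Linux. It assumes the standard /proc/stat layout, where the values after the "cpu" label are user, nice, system, idle, iowait, irq, softirq, steal, guest and guest_nice. The function name steal_pct is my own and is not part of Monit or any other tool.

```shell
#!/bin/bash
# steal_pct EARLIER_LINE LATER_LINE
# Prints the percentage of CPU time that was "stolen" between two samples of
# the aggregate "cpu" line from /proc/stat. (Hypothetical helper name.)
steal_pct() {
    printf '%s\n%s\n' "$1" "$2" | awk '
        # Fields 2..9 are user..steal; guest time is already counted in user.
        NR == 1 { for (i = 2; i <= 9; i++) a_total += $i; a_steal = $9 }
        NR == 2 { for (i = 2; i <= 9; i++) b_total += $i; b_steal = $9 }
        END {
            delta = b_total - a_total
            if (delta <= 0) { print "0.00"; exit }
            printf "%.2f\n", 100 * (b_steal - a_steal) / delta
        }'
}

# Take two samples one second apart, if /proc/stat is available.
if [ -r /proc/stat ]; then
    first=$(grep '^cpu ' /proc/stat)
    sleep 1
    second=$(grep '^cpu ' /proc/stat)
    echo "CPU steal over the last second: $(steal_pct "$first" "$second")%"
fi
```

A one second window is short, so expect the number to jump around; longer windows give a steadier picture.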

CPU Steal occurs in two situations: when the overcommit ratio is unbalanced, and when you have a noisy neighbour.

What is an overcommit ratio?

The “overcommit ratio” is a concept that refers to the practice of allocating more virtual resources than the actual physical resources available. This concept is based on the understanding that not all virtual machines (VMs) or processes will use their maximum allocated resources simultaneously. The overcommit ratio is expressed as a ratio or percentage. For example, a 2:1 CPU overcommit ratio means that two virtual CPU cores are allocated for every physical CPU core.
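The arithmetic is simple enough to sketch. The numbers below are made up, and in practice only your provider knows the real ones, since they are typically not visible from inside a VPS. The function name overcommit_ratio is hypothetical.

```shell
#!/bin/bash
# overcommit_ratio ALLOCATED_VCPUS PHYSICAL_CORES
# Prints the ratio in "N:1" form. (Hypothetical helper for illustration only.)
overcommit_ratio() {
    awk -v v="$1" -v p="$2" 'BEGIN {
        if (p <= 0) { print "invalid"; exit 1 }
        printf "%.2f:1\n", v / p
    }'
}

# Example: a host with 32 physical cores that has 64 vCPUs allocated to guests.
overcommit_ratio 64 32
```

With those example numbers the result is a 2:1 ratio, matching the example above.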

What are noisy neighbours?

The term “noisy neighbours” refers to a situation where multiple users or applications share the same physical resources (like CPU, memory, storage, or network bandwidth) and one or more of these users or applications consumes a disproportionate amount of resources. This excessive usage can negatively impact the performance of others sharing the same infrastructure.

Monitoring CPU Steal with Monit

You can monitor CPU Steal with a simple Monit check. Below I will show you two examples of how to set up the check: one for a default Monit installation and one for GridPane.

Choosing a CPU Steal Percentage

The examples below use a threshold of 0.1%, which will catch all CPU Steal. That is fine if your server never has any CPU Steal, but that is rarely the case, even with a 1:1 overcommit ratio, as noisy neighbours will always be an issue. I’ve deliberately set the threshold to 0.1% in the examples so that you read this section and then update the value, or put in the effort to find the proper CPU Steal percentage for your VPS instances.

The suggestion is to start monitoring at 0.1% and increase the threshold as needed. I’ve seen some articles state that 1-10% is fine and that anything higher will cause issues.
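If you want a baseline before raising the threshold, one approach is to collect steal readings for a while and look at the average and the peak. The sketch below assumes you already have a list of steal percentages, one per line, however you gathered them. The function name summarise_steal is my own.

```shell
#!/bin/bash
# summarise_steal
# Reads steal percentages from stdin (one per line) and prints the number of
# samples, the average and the maximum. (Hypothetical helper name.)
summarise_steal() {
    awk '
        NF {
            sum += $1
            n++
            if (n == 1 || $1 > max) max = $1
        }
        END {
            if (n == 0) { print "no samples"; exit 1 }
            printf "samples=%d avg=%.2f max=%.2f\n", n, sum / n, max
        }'
}

# Example with three made-up readings.
printf '0.0\n0.4\n2.6\n' | summarise_steal
```

A threshold a little above the normal peak avoids constant alerts while still flagging unusual spikes.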

CPU Steal on Default Monit Installation

If you have a standalone server that isn’t managed by a control panel, or your control panel doesn’t use Monit, follow the steps below.

1 – Install Monit

The “cpu steal” test requires Monit 5.27 or greater.
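A quick way to check whether your installed version is new enough is sketched below. It assumes GNU sort (for the -V version sort) and assumes that the first version-like number printed by `monit -V` is the Monit version. The function name version_ge is my own.

```shell
#!/bin/bash
# version_ge A B
# Succeeds when version A is greater than or equal to version B.
# Relies on GNU sort -V. (Hypothetical helper name.)
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Only run the check if Monit is installed.
if command -v monit >/dev/null 2>&1; then
    # Assumption: the first dotted number in the output is the version.
    installed=$(monit -V | grep -oE '[0-9]+\.[0-9]+(\.[0-9]+)?' | head -n 1)
    if version_ge "$installed" "5.27"; then
        echo "Monit $installed supports the cpu steal test"
    else
        echo "Monit $installed is too old; 5.27 or greater is required"
    fi
fi
```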

2 – Create Monit Check File (/etc/monit/conf.d/monit-cpusteal)

Create a new file called /etc/monit/conf.d/monit-cpusteal and add the following to the file.

check system $HOST-steal
    if cpu (steal) > 0.1% for 1 cycles
        then alert
        AND repeat every 10 cycles

The above will alert when CPU Steal exceeds 0.1%. As explained in the section on choosing a percentage, you will want to decide on your own threshold.
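After adding the file, it is worth checking the syntax before reloading, so that a typo doesn’t stop Monit from loading its configuration. The sketch below wraps Monit’s own `monit -t` (syntax check) and `monit reload` commands. The function name reload_monit_if_valid is my own, and you would run it as root.

```shell
#!/bin/bash
# reload_monit_if_valid
# Reloads Monit only when the control files pass the syntax check.
# (Hypothetical helper name; "monit -t" and "monit reload" are Monit commands.)
reload_monit_if_valid() {
    if monit -t; then
        monit reload
    else
        echo "Monit configuration has errors; not reloading." >&2
        return 1
    fi
}

# Usage (as root):
#   reload_monit_if_valid
```

If the syntax check complains about the new file, fix that first; Monit keeps running with its previous configuration in the meantime.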

3 – Monit Alerting

You will need to ensure that you have set up alerting for Monit, which is outside the scope of this guide. You can read more about Monit alerting at https://mmonit.com/monit/documentation/monit.html#ALERT-MESSAGES

CPU Steal Monitoring Setup on GridPane

If you’re using GridPane and running an older version of Ubuntu, follow https://gridpane.com/kb/how-to-upgrade-monit-version/ and use version 5.27 or greater in place of Monit 5.26.

1 – Create Monit Check (/etc/monit/conf.d/monit-cpusteal) with Custom GridPane Slack Notification Alert

You will need to use the following code if you’re setting up this check on a GridPane server. Create a new file called /etc/monit/conf.d/monit-cpusteal and add the following to the file.

check system $HOST-steal
    if cpu (steal) > 0.1% for 1 cycles
        then exec "/root/bin/gp-slack-custom.sh 'WARNING: CPU Steal'"
        AND repeat every 10 cycles

You’ll notice that the only difference between the generic and GridPane Monit checks is the “then” line. The generic check uses Monit’s built-in alert action, while the GridPane check uses exec to run a script called /root/bin/gp-slack-custom.sh

2 – Monit Alerting with Custom GridPane Slack Notification Script (/root/bin/gp-slack-custom.sh)

I’ve created a GridPane wrapper that uses the GridPane Slack functions to send a custom Slack message. It has not been tested on Ubuntu 22 yet; assume it works unless this article is updated to say otherwise.

First you will need to make sure that the folder /root/bin exists; if it doesn’t, you will need to create it. You will also have to set the execute bit on the file so that Monit can run it. The following commands will create the directory if it is missing, create a blank gp-slack-custom.sh file and then give it execute permissions.

mkdir -p /root/bin
touch /root/bin/gp-slack-custom.sh
chmod u+x /root/bin/gp-slack-custom.sh

Now open up the /root/bin/gp-slack-custom.sh file with your favorite editor and add the following code.

#!/bin/bash
#
# Usage: gp-slack-custom.sh <slack message title>
#

# - source GridPane
source "/usr/local/bin/lib/gridpane.sh" # - Contains lots of GridPane functions
gridpane::setglobals # - Needed for gridpane::notify::slack

# - Slack message
slack_type="error" # can be warning or error and maybe success?
title="$1" # slack title
details="Server Name: ${host}{{newline}}Server IP: ${serverIP}{{newline}} $MONIT_EVENT - $MONIT_DESCRIPTION" # full details for slack message
event_type="sys_load_avg" # Used because the API won't accept anything but specific event_types
slack_details="${details//{{newline\}\}/ \\n}" # GP Did this, for some reason

# - send slack
preemptive_support="false"
gridpane::notify::slack \
    "${slack_type}" \
    "${title}" \
    "${slack_details}" \
    "${event_type}" \
    "${preemptive_support}"

3 – Testing gp-slack-custom.sh

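You can test the gp-slack-custom.sh script by running the following command. Note that Monit sets $MONIT_EVENT and $MONIT_DESCRIPTION when it runs the script, so they will be empty in the message when you run it by hand.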

/root/bin/gp-slack-custom.sh 'WARNING: CPU Steal'
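For a test that looks closer to a real alert, you can set those two variables yourself. The sketch below is a small wrapper that does this. The function name simulate_monit_exec is my own, and the sample event and description strings are made up for illustration rather than copied from real Monit output.

```shell
#!/bin/bash
# simulate_monit_exec SCRIPT [ARGS...]
# Runs SCRIPT with sample values for the MONIT_EVENT and MONIT_DESCRIPTION
# environment variables, roughly as Monit would when it executes the script.
# (Hypothetical helper; the sample strings below are illustrative only.)
simulate_monit_exec() {
    local script="$1"
    shift
    MONIT_EVENT="Resource limit matched" \
    MONIT_DESCRIPTION="cpu steal usage of 0.5% matches resource limit [cpu steal usage > 0.1%]" \
        "$script" "$@"
}

# Usage (as root on the GridPane server):
#   simulate_monit_exec /root/bin/gp-slack-custom.sh 'WARNING: CPU Steal'
```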

Code Updates and Support

At this time this code is private and is being released as a gesture of goodwill to the WordPress community. Eventually this and other code will be offered for a price that has yet to be determined.

This will come with code updates to address any changes in related services, infrastructure and providers. There will also be basic support and paid implementation services. If this is something that you’re interested in, click below for more information.

Changelog

  • 01-11-2024 – Better overall formatting and content structure.
