Ok. So what is this %RDY and what good is it to me?
%RDY or CPU Ready is a critical virtual machine performance statistic that is usually overlooked and often misunderstood. So what is CPU Ready?
The official VMware definition calls it:
The % of time a “world” is ready to run and awaiting the CPU Scheduler for approval.
A ‘world’ in VMware speak is an entity that the VMkernel CPU Scheduler can schedule for processing on a physical CPU, akin to a process or thread in an operating system. So CPU Ready or %RDY is the time a ‘world’ waits in a queue to be scheduled for execution.
A thing to note here is that %RDY is the sum of all vCPU %RDY values for that virtual machine. For example, the max possible %RDY value for a virtual machine with 1 vCPU will be 100%, however in the case of a 2 vCPU virtual machine, the max possible %RDY is 200%.
In short, the more CPU Ready you see on your VMware infrastructure, the worse off it is, leading to performance degradation on the virtual guests and a bad end user experience.
The generally accepted industry best practice based on VMware’s guidelines is that %RDY values up to 5% (per vCPU of course!) fall within acceptable parameters. At the end of the day, user perception is the best judge of the severity of the effect of high %RDY values.
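Because the VM-level %RDY is a sum across vCPUs while the 5% guideline is per vCPU, it helps to normalize before comparing. Here is a minimal sketch of that arithmetic; the function names and sample figures are made up for illustration and are not part of any VMware tool.

```python
def ready_per_vcpu(total_rdy_percent, vcpus):
    """Average %RDY per vCPU, given the VM-level (summed) %RDY value."""
    if vcpus < 1:
        raise ValueError("a VM needs at least one vCPU")
    return total_rdy_percent / vcpus


def exceeds_guideline(total_rdy_percent, vcpus, threshold=5.0):
    """True if the per-vCPU %RDY is above the rough 5% guideline."""
    return ready_per_vcpu(total_rdy_percent, vcpus) > threshold


# The same 12% reading means different things on different VM sizes.
print(ready_per_vcpu(12.0, 4))     # 3.0 per vCPU
print(exceeds_guideline(12.0, 4))  # False
print(exceeds_guideline(12.0, 2))  # True (6.0 per vCPU)
```

An average hides the case where one vCPU is doing all the waiting, so treat this as a first-pass filter rather than a verdict.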
One of the common misconceptions about CPU Ready is that with a large amount of available pCPU GHz on the ESXi hosts, the infrastructure should operate with relatively low levels of CPU Ready. Active usage figures do not show how many cores are occupied by virtual guests at any point in time, and it is those occupied cores that prevent other virtual guests from being scheduled by the VMkernel scheduler. An important fact to note is that a virtual machine's CPU Usage and CPU Ready values are not directly related to each other. A virtual guest can very easily have extremely high CPU utilization but low CPU Ready values in an environment with low consolidation ratios, or vice versa.
Another common assumption a lot of people make is that DRS will help alleviate CPU Ready issues. In reality, DRS is no help. DRS only takes vCPU and memory utilization figures into account when deciding whether to vMotion virtual machines. It does not take CPU scheduling into account.
So what leads to elevated CPU Ready values, you ask?
While it is quite straightforward to determine the cause of high CPU utilization, finding the root cause of high CPU Ready values can be a little trickier. Some of the contributing factors are:
- CPU Over-subscription is by far one of the most common causes of high %RDY. Oversubscribing the pCPUs on the host with too many vCPUs makes it more difficult for the VMkernel scheduler to queue up processes for execution without affecting performance.
There is no prescribed number that determines what constitutes too much oversubscription. This is purely an “it depends” answer. Some rough and ready guidelines suggest that in most cases consolidation ratios of 1:1 to 1:3 (pCPU to vCPU) do not cause any major issues. Consolidating at a ratio of any more than 1:3 may begin to cause performance degradation.
- CPU Limits: Putting limits on virtual machine CPU goes against VMware best practice and should be used sparingly and only under special circumstances. Placing limits on CPU allocations on a virtual machine will cause an increase in %RDY. esxtop exposes a metric %MLMTD that describes the percentage of time the VM is ready to run but is not scheduled as it would violate the CPU limits imposed.
%MLMTD is added to %RDY time, and can lead to increased %RDY values being reported.
- CPU Affinity: implementing CPU Affinity takes control away from the VMkernel scheduler in determining where processes should be executed across all available pCPUs. In the event that multiple virtual machines are locked to a pCPU using CPU Affinity, a situation similar to significant pCPU over-subscription is created, leading to increased %RDY times.
- Fault Tolerance (FT): In some extreme scenarios, if the FT network between the primary and protected virtual machines cannot keep up with the volume of changes, the primary virtual machine is throttled back. This can be seen in esxtop as increased %MLMTD values and consequently increased %RDY values.
- Oversized VMs: Deploying oversized VMs that span multiple NUMA nodes can lead to instances where a vCPU needs to retrieve memory blocks stored on another NUMA node. This will cause a CPU WAIT to be raised while the memory block is retrieved, which increases latency and degrades performance on the virtual machine in question. It also has a flow-on effect: other VMs on the same physical host will experience elevated %RDY values due to the increased number of CPU WAIT requests being raised.
As far as possible, try and size the virtual machine such that it fits within its NUMA node, both for vCPUs and Memory allocation.
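The over-subscription guideline from the list above is easy to check with a few lines of arithmetic. The sketch below assumes you already have the vCPU count of each VM on a host and the host's physical core count; the function names and sample inventory are hypothetical, and the 1:3 cut-off is only the rough guideline quoted above, not a hard limit.

```python
from typing import Sequence


def vcpus_per_pcpu(vm_vcpu_counts: Sequence[int], physical_cores: int) -> float:
    """Number of allocated vCPUs for every physical core on the host."""
    if physical_cores < 1:
        raise ValueError("a host needs at least one physical core")
    return sum(vm_vcpu_counts) / physical_cores


def over_rough_guideline(ratio: float, max_ratio: float = 3.0) -> bool:
    """True if the host is consolidated beyond the rough 1:3 guideline."""
    return ratio > max_ratio


# Hypothetical host: 16 physical cores running six VMs.
vm_vcpu_counts = [2, 4, 4, 8, 2, 4]  # 24 vCPUs in total
ratio = vcpus_per_pcpu(vm_vcpu_counts, physical_cores=16)

print(ratio)                        # 1.5, i.e. 1:1.5 pCPU to vCPU
print(over_rough_guideline(ratio))  # False
```

A ratio under the guideline does not guarantee low %RDY, since limits, affinity and a few very wide VMs can still cause queuing, so use it alongside the esxtop figures rather than instead of them.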
How do I fix high CPU Ready issues?
There are a few things that can be looked at to reduce CPU Ready values:
- Pro-actively “right size” your virtual guests and ESXi hosts: I cannot stress enough the importance of right sizing the virtual machines running on your infrastructure. Yes, you can scale up to 32 vCPUs and 2 TB of memory with vSphere 5.1, but bigger does not always mean better! Start small and be proactive with monitoring the vCPU and vMemory utilization on both the virtual guests and the underlying physical infrastructure. Increase (or decrease) resources allocated to the guests based on past trends. While sizing the virtual guests, try and stay within NUMA boundaries to reduce latency and CPU WAIT times.
- Pro-actively manage over-subscription: Monitor the current pCPU to vCPU over-subscription values. Decide what over-subscription the organization is comfortable with and look to deploy new hosts or clusters on an “as needed” basis. Develop strategies to protect mission critical systems by segregating workloads on to clusters with lower subscription ratios.
- Avoid using CPU Limits: not only does this go against VMware best practice, it also ends up becoming a nightmare to manage. Use resource pools instead.
- Avoid using CPU Affinity: let the VMkernel scheduler decide what’s best. That’s what it is built to do. Leverage the flexibility that vSphere DRS offers in managing workloads.
- Understand what vSphere DRS can and cannot do. Monitor vSphere DRS initiated vMotions to ensure DRS does not cause CPU Ready issues.
- Use VMware best practice storage and network design principles so VMs do not spend too much time in the CPU WAIT state due to storage/network latency issues.
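The advice to stay within NUMA boundaries when right sizing can be turned into a simple pre-deployment check. This is only a sketch under stated assumptions: the `NumaNode` class and the host figures are hypothetical, and the comparison looks purely at raw core and memory counts, ignoring anything else the scheduler may weigh when placing a VM.

```python
from dataclasses import dataclass


@dataclass
class NumaNode:
    """Hypothetical description of one NUMA node on a host."""
    cores: int
    memory_gb: int


def fits_in_numa_node(vm_vcpus: int, vm_memory_gb: int, node: NumaNode) -> bool:
    """True if both the vCPUs and the memory of the VM fit in a single node."""
    return vm_vcpus <= node.cores and vm_memory_gb <= node.memory_gb


# Hypothetical two-socket host: each node has 8 cores and 128 GB of memory.
node = NumaNode(cores=8, memory_gb=128)

print(fits_in_numa_node(4, 64, node))    # True: fits on one node
print(fits_in_numa_node(12, 64, node))   # False: vCPUs span two nodes
print(fits_in_numa_node(8, 192, node))   # False: memory spans two nodes
```

A VM that fails the check is not necessarily wrong, but it is a prompt to ask whether the workload really needs that many vCPUs or that much memory.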