SQL SERVER CPU Starvation: When 40% CPU Behaves Like 100%

SQL SERVER. A deep analysis on CPU Starvation

Why a server at 40% CPU can still behave like a completely saturated system 💣

Deep dive into SOS Scheduler fairness, Hekaton garbage collection, hash index scans, and why SQL Server CU5 matters far more than most DBAs think.

SQL Server CPU Starvation

Before we dive into today's topic, if you missed my previous post you can take a look at SQL SERVER CU5: Fixes CPU starvation issues

👉 If you found this deep-dive helpful, feel free to check out the ads—your support helps me keep creating high-quality SQL Server content for the community.

🧠 TL;DR

✔️ SQL Server uses cooperative scheduling through SOS Scheduler and workers must voluntarily yield CPU

✔️ SQL Server CU5 improves scheduler fairness during In-Memory OLTP hash index garbage collection scans 🚀

✔️ CPU starvation can occur even when total CPU usage looks moderate 💣

✔️ Bad hash index bucket sizing and large version chains can dramatically increase GC traversal costs ⚡

✔️ Persistent runnable_tasks_count growth is often more dangerous than raw CPU percentage 🔍

🚀 The Hook

In this post, I’ll show you why CPU starvation is one of the most misunderstood performance problems in SQL Server.

⚡ More importantly, you will understand why SQL Server CU5 quietly fixes a potentially catastrophic scheduler fairness problem inside In-Memory OLTP garbage collection scans.

💣 CPU starvation is one of the few SQL Server problems where average CPU usage can completely hide catastrophic scheduler unfairness.

⚡ A server at 40% CPU can still behave like a server at 100% CPU when scheduler fairness collapses.

Hi SQL SERVER Guys and Gals,

Many DBAs monitor CPU percentage.

Very few monitor scheduler fairness.

And that is exactly where some of the nastiest SQL Server production problems hide.

The CU5 fix around In-Memory OLTP garbage collection is important because it touches the heart of SQL Server execution internals:

  • ⚡ SOS Scheduler
  • ⚡ Cooperative scheduling
  • ⚡ Worker yielding
  • ⚡ Runnable queues
  • ⚡ Scheduler monopolization

🧠 What Cooperative Scheduling Really Means

SQL Server does not rely on Windows preemptive scheduling alone.

Internally, SQL Server uses cooperative scheduling through SOS Scheduler.

Workers are expected to voluntarily yield execution periodically.

If an internal task runs too long without yielding, runnable queues grow and other workers experience starvation.

⚡ Key concepts every serious DBA should understand:

  • SOS Scheduler → SQL Server internal scheduler abstraction
  • Worker → execution entity assigned to tasks
  • Quantum → execution time slice
  • Yield → voluntary CPU release
  • Runnable Queue → waiting workers ready to execute
  • Scheduler Fairness → balanced CPU access across workers

💣 The problem starts when some workers fail to yield frequently enough.

💣 Why Hash Indexes Can Make Things Worse

In-Memory OLTP hash indexes are extremely fast for point lookups.

But internally they rely on buckets and hash chains.

Under bad conditions, these chains become expensive to traverse.

  • ⚡ Incorrect bucket_count
  • ⚡ Heavy delete workloads
  • ⚡ Skewed key distribution
  • ⚡ High collision rates
  • ⚡ Massive row version churn

Large hash chains increase garbage collection traversal costs.

Under high churn workloads, GC workers may spend excessive time scanning version chains and hash buckets.

⚡ That is where scheduler monopolization begins.

🧠 Garbage Collection Internals

This is where things become very interesting under the hood.

In-Memory OLTP uses:

  • ⚡ Row versioning
  • ⚡ Timestamp visibility
  • ⚡ Version chains
  • ⚡ Epoch-based garbage collection
  • ⚡ Memory residency

In-Memory OLTP does not immediately remove deleted rows.

Older row versions remain visible for active transactions until epoch advancement allows garbage collection cleanup.

💣 This means garbage collection is not optional.

It is a core architectural requirement of Hekaton.

And under heavy workloads, garbage collection can become extremely CPU-intensive.

🔍 DIAGNOSIS

Start by checking scheduler pressure.


-- 🔍 Scheduler Pressure Diagnostic Query

SELECT 
    scheduler_id,
    cpu_id,
    current_tasks_count,
    runnable_tasks_count,
    active_workers_count,
    pending_disk_io_count,
    load_factor
FROM sys.dm_os_schedulers
WHERE status = 'VISIBLE ONLINE';

⚡ Persistent runnable_tasks_count greater than zero is suspicious.

💣 Especially on systems using Hekaton workloads.

🧪 QUERY

Now inspect SOS_SCHEDULER_YIELD waits.


-- 🔍 Top Scheduler Waits

SELECT 
    wait_type,
    waiting_tasks_count,
    wait_time_ms
FROM sys.dm_os_wait_stats
WHERE wait_type = 'SOS_SCHEDULER_YIELD';

⚡ Important:

  • Healthy yielding is NORMAL
  • SOS_SCHEDULER_YIELD alone is NOT automatically bad
  • Runnable queue growth is the real danger signal

💣 The issue appears when workers fail to yield fairly while runnable queues continue growing.

🧪 Hash Index Statistics (THIS DMV IS GOLD)


-- 🧪 Hash Index Statistics

SELECT 
    OBJECT_NAME(object_id) AS table_name,
    name,
    total_bucket_count,
    empty_bucket_count,
    avg_chain_length,
    max_chain_length
FROM sys.dm_db_xtp_hash_index_stats;

⚡ This DMV can reveal:

  • Bad bucket sizing
  • Excessive collisions
  • Long traversal chains
  • Potential GC traversal amplification

🧠 CPU Saturation vs CPU Starvation

CPU Saturation CPU Starvation
CPU genuinely full Scheduler fairness collapse
High query CPU cost Worker monopolization
Parallelism pressure Yield imbalance
System busy System unfair

💣 Real Symptoms DBAs Actually See

  • ⚡ CPU not at 100% but system extremely slow
  • ⚡ Random latency spikes
  • ⚡ Login delays
  • ⚡ AG synchronization lag
  • ⚡ THREADPOOL waits
  • ⚡ Query timeouts
  • ⚡ Runnable queue spikes
  • ⚡ Intermittent stalls

💣 Many DBAs misdiagnose these as generic CPU issues.

In reality, the problem may be scheduler fairness collapse.

⚡ NUMA and Scheduler Imbalance

Starvation often does NOT affect all schedulers equally.

This becomes especially visible on:

  • NUMA systems
  • Soft-NUMA environments
  • Large core-count servers

⚡ Some schedulers may become “hot” while others remain underutilized.

This creates:

  • Uneven worker distribution
  • Localized scheduler pressure
  • Artificial latency spikes

🚀 My REAL Strategy

In real production systems, I NEVER evaluate only raw CPU percentage.

⚡ I correlate:

  • SOS_SCHEDULER_YIELD waits
  • Runnable queue growth
  • THREADPOOL waits
  • Hekaton workload intensity
  • Hash index chain statistics
  • NUMA imbalance
  • Latency spikes

💣 If runnable queues stay elevated for sustained periods, I immediately suspect scheduler unfairness.

🚀 My recommendations:

  • Apply CU5 if you use In-Memory OLTP heavily
  • Continuously monitor runnable_tasks_count
  • Validate hash index bucket sizing
  • Avoid oversized bucket_count configurations
  • Monitor version chain growth
  • Watch for scheduler imbalance across NUMA nodes

📊 TAKEAWAY

SQL Server CU5 is not “just another cumulative update.”

⚡ It directly improves scheduler fairness inside one of the most CPU-sensitive areas of the SQL Server Engine.

For latency-sensitive systems such as:

  • Trading platforms
  • Telemetry ingestion
  • IoT pipelines
  • Real-time APIs
  • Microservices backends

💣 Even small starvation bursts can destroy P99 latency.

And that is exactly why understanding SOS Scheduler internals is no longer optional for serious SQL Server performance troubleshooting.

📚 Official References

📢 Support the Blog: Did you find this deep-dive helpful? The ads you see here are selected to reflect your interests. If a partner's offer catches your eye, give it a look! Your engagement helps me continue publishing premium SQL Server content for the community.

Biondi Luca @2026 - Sharing over 25 years of Gained Knowledge for Passion. Share if you like my posts!

Comments

I Post più popolari

Speaking to Sql Server, sniffing the TDS protocol

SQL Server, find text in a Trigger, Stored Procedures, View and Function. Two ways and what ways is better

SQL Server, execution plan and the lazy spool (clearly explained)