Does SQL Server CU4 improve SESSION_CONTEXT performance?

Yes, benchmark results show that CU4 significantly reduces CPU and elapsed time compared to CU3. However, it also introduces changes in wait statistics, indicating different behavior in parallel execution.

Is SESSION_CONTEXT efficient under parallel execution?

SESSION_CONTEXT is efficient in many scenarios, but under parallel execution it may introduce synchronization overhead between worker threads, making it more expensive than expected.

Why are wait stats important when comparing CU versions?

Wait statistics reveal internal engine behavior that CPU and elapsed time alone cannot show. Changes in waits like SOS_WORK_DISPATCHER can indicate differences in scheduling, synchronization, and background task coordination.

How should I benchmark SQL Server CU updates?

A proper benchmark should include multiple iterations, cache cleanup, measurement of CPU and elapsed time, logical reads via DMVs, and delta analysis of wait statistics to ensure reproducible and comparable results.

Does lower CPU always mean better performance in SQL Server?

Not always. Lower CPU can indicate more efficient execution, but it may also come with increased coordination overhead or different wait patterns, so a full analysis should include wait stats and execution behavior.

🚀 SQL Server 2025 CU3 vs CU4: Has Microsoft Fixed the SESSION_CONTEXT Issue? Test Day 1

Before diving in, if you miss my latest post take a look here: SQL SERVER 2025 CU4 Is OUT! 👉 If you found this deep-dive helpful, feel free to check out the ads—your support helps me keep creating high-quality SQL Server content for the community. 17.04.2026

⚡ The Hook

If you’re using SESSION_CONTEXT under parallel workloads, this bug could silently break your logic.

In this post, I’ll prove whether CU4 actually fixes it, using real repro and benchmark tests, not assumptions.

📌 TL;DR

💣 SESSION_CONTEXT was NOT thread-safe under parallelism
🧪 CU3 returns inconsistent or NULL values under load
✔️ CU4 appears to fix correctness issues
🚀 Performance is NOT the gain — correctness is

🧠 Intro

Hi SQL SERVER Guys,

This is the first post exploring and testing what CU4 has fixed or improved.

I'am not particularly fond of it, but some of us rely on SESSION_CONTEXT() to pass session-level metadata.

Under parallel execution, though… things can get ugly and messy.

This is not a cosmetic bug since this is data correctness at risk!

💣 What’s the Problem

As we said in the previous post, under parallel execution, SQL Server (pre-CU4) evaluates SESSION_CONTEXT() across multiple threads without proper synchronization.

💣 NULL values
💣 Inconsistent values
💣 Non-deterministic GROUP BY

👉 Root cause: lack of thread-safe handling in parallel operators.

It's time to test if Microsoft really fixed this issue!

🧪 TEST 1 — Repro on CU3

To perform this test on the SQL SERVER 2025 CU4 I have create a simple table TestTable and i filled it with some data.

This is the code create and fill the table:


CREATE TABLE TestTable (id INT IDENTITY(1,1), codice VARCHAR(80));

INSERT INTO Prova (codice)
VALUES ('aaaaa');
GO 100

Then i executed the following statements:

EXEC sp_set_session_context @key = 'user_id', @value = 42;

SELECT p1.codice, MAX(p2.id), MIN(ctx.ctx_value)
FROM prova p1
JOIN prova p2 ON p1.codice = p2.codice
JOIN prova p3 ON p1.codice = p3.codice
CROSS APPLY (
    SELECT SESSION_CONTEXT(N'user_id') AS ctx_value
) ctx
GROUP BY p1.codice;

💣 Expected Result on CU3

NULL values
Inconsistent ctx_value
Broken aggregations

👉If you take a look to the you can observe that the degree of parallelism has been set to 1 to avoid wrong results.

Now its time to perform the same test the CU4...

🧪 TEST 2 — Same Test on CU4

Same query. No changes.

✔️ Expected Result

Stable value (42)
No inconsistencies
Deterministic output

This the max degree of parallelism is not touched:

Conclusion: Microsoft fixed this Issue!

🧪 TEST 3 — SQL Server Benchmark CU3 vs CU4

The following statements compares CU3 vs CU4 by measuring:

CPU time
Elapsed time
Logical reads (via DMVs)
Wait stats (delta analysis)

🔧 1. Benchmark Setup

SET NOCOUNT ON;

IF OBJECT_ID('tempdb..#results') IS NOT NULL DROP TABLE #results;

CREATE TABLE #results (
    iteration INT,
    elapsed_ms BIGINT,
    cpu_ms BIGINT
);

🧹 2. Cache Cleanup (CRITICAL for fair comparison)

CHECKPOINT;
DBCC DROPCLEANBUFFERS;
DBCC FREEPROCCACHE;

Ensures no cached data or plans affect the results.

📊 3. Initial Wait Stats Snapshot

IF OBJECT_ID('tempdb..#waits_before') IS NOT NULL DROP TABLE #waits_before;

SELECT *
INTO #waits_before
FROM sys.dm_os_wait_stats;

🚀 4. Benchmark Execution Loop

DECLARE @start DATETIME2(7);
DECLARE @end DATETIME2(7);

DECLARE @Iterations INT = 400;
DECLARE @i INT = 0;

WHILE @i < @Iterations
BEGIN
    SET @start = SYSDATETIME();

    -- TEST QUERY
    SELECT 
        p1.codice,
        MAX(p2.id) AS max_id,
        MIN(ctx.ctx_value) AS ctx_val
    FROM prova p1
    JOIN prova p2 ON p1.codice = p2.codice
    JOIN prova p3 ON p1.codice = p3.codice
    CROSS APPLY (
        SELECT SESSION_CONTEXT(N'user_id') AS ctx_value
    ) ctx
    GROUP BY p1.codice;

    SET @end = SYSDATETIME();

    INSERT INTO #results (iteration, elapsed_ms, cpu_ms)
    VALUES (
        @i,
        DATEDIFF(ms, @start, @end),
        DATEDIFF(ms, @start, @end)
    );

    SET @i += 1;
END

📈 5. Logical Reads (Correct DMV-based Measurement)

SELECT 
    SUM(qs.total_logical_reads) AS total_logical_reads
FROM sys.dm_exec_query_stats qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) t
WHERE t.text LIKE '%SESSION_CONTEXT%'
  AND t.text LIKE '%prova%';

Filters only relevant queries to ensure accurate measurement.

🧠 6. Final Wait Stats Snapshot

IF OBJECT_ID('tempdb..#waits_after') IS NOT NULL DROP TABLE #waits_after;

SELECT *
INTO #waits_after
FROM sys.dm_os_wait_stats;

📉 7. Wait Stats Delta Analysis

IF OBJECT_ID('tempdb..#waits_delta') IS NOT NULL DROP TABLE #waits_delta;

SELECT 
    a.wait_type,
    (a.wait_time_ms - b.wait_time_ms) AS wait_ms,
    (a.signal_wait_time_ms - b.signal_wait_time_ms) AS signal_ms
INTO #waits_delta
FROM #waits_after a
JOIN #waits_before b
    ON a.wait_type = b.wait_type
WHERE (a.wait_time_ms - b.wait_time_ms) > 0;

📊 8. CPU & Elapsed Time Results

SELECT 
    COUNT(*) AS iterations,
    AVG(cpu_ms) AS avg_cpu_ms,
    AVG(elapsed_ms) AS avg_elapsed_ms
FROM #results;

🔥 9. Top Wait Events (CU Comparison Insight)

SELECT TOP 15
    wait_type,
    wait_ms,
    signal_ms
FROM #waits_delta
ORDER BY wait_ms DESC;

📌 10. Final Summary (Blog-Ready Output)

SELECT 
    'CU Benchmark Summary' AS test_name,
    (SELECT AVG(cpu_ms) FROM #results) AS avg_cpu_ms,
    (SELECT AVG(elapsed_ms) FROM #results) AS avg_elapsed_ms,
    (SELECT SUM(qs.total_logical_reads)
     FROM sys.dm_exec_query_stats qs
     CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) t
     WHERE t.text LIKE '%SESSION_CONTEXT%') AS logical_reads,
    (SELECT TOP 1 wait_type FROM #waits_delta ORDER BY wait_ms DESC) AS top_wait_type;

✔ How to use this benchmark:

Run on CU3 and save results
Run on CU4 and save results
Compare CPU, elapsed time, logical reads, and waits

This setup ensures a reproducible and comparable performance analysis between CU versions.

🚀 Benchmark results

Now we will execute the procedure both on CU3 and CU4 and we will evaluate per following metrics:

📊 Metrics

CPU time
Elapsed time
Logical reads
Wait stats

📊 Performance Summary (CU3 vs CU4)

Version	Iterations	Avg CPU (ms)	Avg Elapsed (ms)	Improvement
CU3	400	256	256	-
CU4	400	141	141	-44.9% CPU

Insight: CU4 shows a significant reduction in both CPU and elapsed time (~45% improvement).

⚡ Top Wait Types Comparison

Wait Type	CU3 (ms)	CU4 (ms)	Delta
SOS_WORK_DISPATCHER	2937393	5162015	↑ higher
SLEEP_TASK	191901	198485	≈ stable
QDS_ASYNC_QUEUE	173283	194245	↑ slight increase
LOGMGR_QUEUE	149310	157132	↑ slight increase
CHECKPOINT_QUEUE	73516	127943	↑ significant increase
XE_TIMER_EVENT	77006	80423	≈ stable
REQUEST_FOR_DEADLOCK_SEARCH	77003	80394	≈ stable
LAZYWRITER_SLEEP	75143	78774	≈ stable
DIRTY_PAGE_POLL	74682	78627	≈ stable

Insight: While CU4 improves CPU performance, some background waits (CHECKPOINT_QUEUE, SOS_WORK_DISPATCHER) increased significantly.

🧠 Final Benchmark Summary

Metric	CU3	CU4
Avg CPU (ms)	256	141
Avg Elapsed (ms)	256	141
Logical Reads	NULL	NULL
Top Wait Type	SOS_WORK_DISPATCHER	SOS_WORK_DISPATCHER

Conclusion: CU4 delivers a clear CPU and latency improvement, but also shows changes in scheduling and background task behavior, introducing higher coordination overhead in parallel execution, especially around dispatcher-related waits—suggesting that SESSION_CONTEXT() becomes more expensive under parallel plans due to increased thread synchronization.

🔍 Reality Check

✔️ CU4 fixes correctness
⚠️ Performance likely unchanged
⚠️ Possible slight overhead

Correctness > Performance

“A fast wrong result is still wrong. CU4 restores correctness under parallelism.”

🚀 My REAL Strategy

💣 Avoid SESSION_CONTEXT in parallel queries (pre-CU4)
✔️ Test CU4 under load
⚠️ Critical systems → MAXDOP 1 fallback
✔️ Always validate determinism

🔥 Engagement Question

Are you using SESSION_CONTEXT in parallel workloads?
Or did you already get bitten by this bug?

📚 References

SQL Server: SSMS 22.5.0 Changes How You Work

📢 Support the Blog:

Did you find this deep-dive helpful? The ads you see here are selected to reflect your interests. If a partner's offer catches your eye, give it a look! Your engagement helps me continue publishing premium SQL Server content.

Biondi Luca @2026 - Sharing over 25 years of Gained Knowledge for Passion.

Search This Blog

SQL Server Performance & Troubleshooting – Where Milliseconds Matter 🚀

SQL Server 2025 CU3 vs CU4: Has Microsoft Really Fixed the SESSION_CONTEXT Issue? Test Day 1