The dark side of SQL Server… When good queries go bad


It was said somewhere in The Lord of the Rings that wizards ‘are subtle and quick to anger’, and substituting ‘SQL Server’ into that phrase proves equally true.
Over the last couple of years I have had a number of issues with SQL Server which specifically involve self-referencing updates.
A self-referencing update uses the destination table in the query associated with the update.
These problems occasionally cause catastrophic, unpredictable degradation in query performance, which in turn has caused CM syncs to slow down suddenly and, in some cases, never complete.
It is becoming evident that this is a genuine quirk in SQL Server, and it appears in every version I have tested so far. The problem is fiendish in that it is sudden, catastrophic, and sensitive both to the data AND to the environment in which you run the query. This means you may see it on one machine, move the exact same database elsewhere, and find the problem vanishes.
In the most recent example, the fairly simple SQL below exhibits the problem. Note that although this query uses a temp table, the problem has been seen even when the queries involve only permanent tables.
UPDATE ApplicationInstMachine SET aim_installdate_utc = t.installdate
FROM ApplicationInstMachine s
JOIN dbo.#tempApplications t
  ON t.uapp_guid = s.aim_uapp_guid
 AND t.uslm_smsguid = s.aim_machine_guid
 AND (t.installdate IS NOT NULL AND s.aim_installdate_utc IS NULL)
This is one of those issues where a picture is worth a thousand words. Here’s what happened when this query was run. 24 hours later we were still wedged in this state!
[Image: The Dark Side of SQL Server]
In the example above, the cardinality of the two tables is approximately 13 million rows, and no rows match the join clause (i.e. the data was already updated on a previous sync run and is therefore not updated this time).
We can verify this by modifying the query into a simple count:
SELECT COUNT(*)
FROM ApplicationInstMachine s
JOIN dbo.#tempApplications t
  ON t.uapp_guid = s.aim_uapp_guid
 AND t.uslm_smsguid = s.aim_machine_guid
 AND (t.installdate IS NOT NULL AND s.aim_installdate_utc IS NULL)
This returns a count of 0 in approximately 9 seconds on the test environment.
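For anyone reproducing this, the timing can be captured with SET STATISTICS TIME, which makes SQL Server report CPU and elapsed time per statement; a minimal sketch wrapping the count above:
SET STATISTICS TIME ON;

SELECT COUNT(*)
FROM ApplicationInstMachine s
JOIN dbo.#tempApplications t
  ON t.uapp_guid = s.aim_uapp_guid
 AND t.uslm_smsguid = s.aim_machine_guid
 AND (t.installdate IS NOT NULL AND s.aim_installdate_utc IS NULL);

SET STATISTICS TIME OFF;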
The problem is sensitive both to the data and the environment in which it is run. I could take this same database and bring it up on a different guest, possibly even with the exact same core count and memory, and see completely different behaviour.
Normally this query should complete in just a few seconds or maybe a minute or two at most.
If we refactor the query to first capture the candidate rows into a fresh temp table, like this, the problem goes away:
1. Run the join from the previous update query, but only to capture the candidate rows into a fresh temp table (here it will be empty, because no rows match the criteria).
SELECT t.uapp_guid, t.uslm_smsguid, t.installdate
INTO dbo.#tempNewInstallDates
FROM ApplicationInstMachine s
JOIN dbo.#tempApplications t
  ON t.uapp_guid = s.aim_uapp_guid
 AND t.uslm_smsguid = s.aim_machine_guid
 AND (t.installdate IS NOT NULL AND s.aim_installdate_utc IS NULL)
2. Now update by joining back to the temp table but without the more complex join clause, because the temp table now has only the applicable data. This query completes in just a few seconds.
UPDATE ApplicationInstMachine SET aim_installdate_utc = t.installdate
FROM ApplicationInstMachine s
JOIN dbo.#tempNewInstallDates t
  ON t.uapp_guid = s.aim_uapp_guid
 AND t.uslm_smsguid = s.aim_machine_guid
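For reference, the whole refactor might be packaged as a single batch like the sketch below. The alias-target UPDATE form, the WHERE clause restructuring and the temp-table cleanup are my additions, not part of the original code:
-- Step 1: capture only the candidate rows (empty here, since nothing needs updating)
SELECT t.uapp_guid, t.uslm_smsguid, t.installdate
INTO #tempNewInstallDates
FROM ApplicationInstMachine s
JOIN dbo.#tempApplications t
  ON t.uapp_guid = s.aim_uapp_guid
 AND t.uslm_smsguid = s.aim_machine_guid
WHERE t.installdate IS NOT NULL
  AND s.aim_installdate_utc IS NULL;

-- Step 2: apply the update from the pre-filtered temp table via a simple join
UPDATE s
SET s.aim_installdate_utc = t.installdate
FROM ApplicationInstMachine s
JOIN #tempNewInstallDates t
  ON t.uapp_guid = s.aim_uapp_guid
 AND t.uslm_smsguid = s.aim_machine_guid;

-- Clean up, in case this runs repeatedly on the same connection
DROP TABLE #tempNewInstallDates;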
In some cases, the problem can be mitigated by ensuring that SQL Server's statistics are up to date (so that the optimizer has an accurate picture of table cardinality). In theory these statistics are updated automatically; in practice, SQL Server sometimes never seems to get round to doing it. Statistics can be safely (and quickly) updated at any time by running
EXEC sp_updatestats;
Doing this on the target database which exhibited the problem caused the issue to vanish, though it is likely to re-emerge if statistics get out of date again. The refactored query is probably not vulnerable to this issue.
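If running sp_updatestats across the whole database is heavier than you want, statistics can instead be refreshed for just the tables involved; a minimal sketch using the table from the example:
-- Refresh all statistics on the target table using the default sample
UPDATE STATISTICS dbo.ApplicationInstMachine;

-- Or scan every row for maximum accuracy (slower on large tables)
UPDATE STATISTICS dbo.ApplicationInstMachine WITH FULLSCAN;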
Obviously a temp table may have an incorrect estimated cardinality at the time it is used, but as far as I know SQL Server will in that case assume the table has no rows. It is not clear how that would cause the problem above, though.
I believe that SQL Server chooses an entirely inappropriate query plan, one which (I suspect) in some way involves calculating a huge Cartesian product in memory, if:
(a) the query is a self-referencing update involving a JOIN to one or more other tables;
(b) the JOIN clause is non-trivial (though the example above is pretty trivial, so what constitutes ‘trivial’ is still a little unclear); and
(c) statistics are out of date.
Under these circumstances, the query may simply consume enormous amounts of CPU and possibly never complete. Ironically, of course, that makes it impossible to find out which query plan was actually used!
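One partial workaround is to ask for the estimated plan instead, which SQL Server produces without executing the statement; a sketch (the temp table must already exist in the session, and the estimated plan may not match what a real run would have chosen, especially given how environment-sensitive the problem is):
-- #tempApplications must already exist in this session before SHOWPLAN is enabled
SET SHOWPLAN_XML ON;
GO

UPDATE ApplicationInstMachine SET aim_installdate_utc = t.installdate
FROM ApplicationInstMachine s
JOIN dbo.#tempApplications t
  ON t.uapp_guid = s.aim_uapp_guid
 AND t.uslm_smsguid = s.aim_machine_guid
 AND (t.installdate IS NOT NULL AND s.aim_installdate_utc IS NULL);
GO

SET SHOWPLAN_XML OFF;
GO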
Normally I recommend that any scheduled maintenance plan for SQL Server run sp_updatestats, since it mitigates this issue and is a generally benign and safe thing to do. Beyond that, as we find queries sensitive to this problem, they are progressively refactored to be resilient to the scenario. Hopefully this will turn out to be the last one!
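To check whether statistics on a given table are actually stale before deciding, sys.dm_db_stats_properties (available in SQL Server 2008 R2 SP2 / 2012 SP1 and later) reports when each statistic was last updated and how many modifications have accumulated since; a sketch against the table from the example:
-- List each statistic on the table with its age and modification count
SELECT st.name AS stats_name,
       sp.last_updated,
       sp.rows,
       sp.modification_counter
FROM sys.stats AS st
CROSS APPLY sys.dm_db_stats_properties(st.object_id, st.stats_id) AS sp
WHERE st.object_id = OBJECT_ID('dbo.ApplicationInstMachine');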
