Correlating Data in SQL Trace, Part 1

I’ve been chasing down a series of issues with a system. One of the symptoms we’ve been getting is massive lock timeouts, so I’ve been taking (server side) profiler traces to see what code is requesting the locks that are being timed out. Today I landed on a lovely solution for correlating the lock timeouts I see in the trace file with the associated code. Let’s talk setup, and then we’ll walk through the thinking that led to our solution.

Setup

Take a server side trace, be sure to get “existing connections”, RPC:Starting, RPC:Completed, and Lock Timeout (Timeout > 0) (see the previous blog “Lock Timeout Revelation on Twitter”). For the RPCs and Timeouts, get your basic measures, plus ObjectName and DatabaseID.

Once you’ve captured enough of a trace, save it to a table, preferably on your local box.

Depending on how much time you’re going to spend digging through this data, you may want to index your trace table. Here are the indexes I developed for this query:

CREATE INDEX ix_trace ON trace (eventclass, spid, rownumber, databaseID, objectName) INCLUDE (starttime, endTime)

CREATE INDEX ix_trace2 ON trace (spid, eventclass, rownumber, databaseID, objectName)

Developing

Once you save the trace into a table, you'll need to know the event id (event class names aren't displayed). (*See footnote about trace in tables…)

Trace event 189 is Lock Timeout (timeout > 0). So when you find a 189, it’s easy to go back up the trace rows until you find the closest one with a corresponding SPID, which will have an eventclass of 11: RPC:Started. Rough pseudocode:

1. Find a timeout eventclass = 189

2. Find the closest row with rownumber < timeout.rownumber and spid = timeout.spid

Part 1 is easy:

SELECT TOP 10

[Timeout].RowNumber ,

[Timeout].EventClass ,

[Timeout].DatabaseID ,

[Timeout].spid ,

[Timeout].starttime ,

[Timeout].endtime ,

FROM Trace AS [Timeout]

WHERE [Timeout].EventClass = 189

Note that I’m limiting the query to the TOP 10 for now, just so I don’t have to look through 10,000 rows at once.

So Part 2: for each timeout, find the closest previous RPC Started with the same spid. On first look, I started falling into the cursor trap. You say it to yourself procedurally: “for each timeout” is actually direct pseudocode for a cursor. But don’t go there….just don’t…

What do we know about the timeout and its associated code call?

They have the same spid
They’re on the same database
The statement will have an RPC start (eventclass = 11), but not necessarily an RPC complete (eventclass = 10)
The statement will precede the timeout, and therefore have an earlier row number

So we could say that for a timeout, the corresponding statement is

the MAX rownum less than the timeout rownum
where the spids and databases match,
and eventclass=11.

Well partner, that’s a correlated subquery right there! Pretend for a moment that we only have one lock timeout, SPID = 52, rownumber = 11,304. So to find the code that goes with it:

SELECT TOP 1 rownumber

FROM trace

WHERE spid = 52 –Timeout spid

AND rownumber < 11304 -–Timeout rownumber

AND eventclass = 11

ORDER BY rownumber DESC

That’s our subquery. Let’s plug it into the part 1 query, and correlate the thing!

SELECT TOP 10

[Timeout].RowNumber ,

[Timeout].EventClass ,

[Timeout].DatabaseID ,

[Timeout].spid ,

[Timeout].starttime ,

[Timeout].endtime ,

Statement.*

FROM Trace AS [Timeout]

INNER JOIN Trace AS [Statement] ON Statement.spid = [Timeout].spid

AND Statement.spid = [Timeout].spid

AND Statement.DatabaseID = [Timeout].DatabaseID

AND statement.rownumber = ( SELECT TOP 1 rownumber

FROM trace

WHERE spid = [Timeout].spid

AND rownumber < [Timeout].rownumber

AND eventclass = 11

ORDER BY rownumber DESC

)

WHERE [Timeout].EventClass = 189

And finally, let’s do better with our select than just Statement.*:

SELECT TOP 10

[Timeout].RowNumber ,

[Timeout].EventClass ,

[Timeout].DatabaseID ,

[Timeout].spid ,

[Timeout].starttime ,

[Timeout].endtime ,

'STATEMENT: ' AS spacer,

Statement.rownumber ,

Statement.objectname ,

Statement.textdata,

Statement.loginname ,

Statement.spid ,

Statement.starttime ,

Statement.endtime

FROM Trace AS [Timeout]

INNER JOIN Trace AS [Statement] ON Statement.spid = [Timeout].spid

AND Statement.spid = [Timeout].spid

AND Statement.DatabaseID = [Timeout].DatabaseID

AND statement.rownumber = ( SELECT TOP 1 rownumber

FROM trace

WHERE spid = [Timeout].spid

AND rownumber < [Timeout].rownumber

AND eventclass = 11

ORDER BY rownumber DESC

)

WHERE [Timeout].EventClass = 189

I like the “STATEMENT” spacer; it’s a nice break between your timeout columns, and your statement columns.

I found another great application for this, but we’ve gone a bit long here. So, come back tomorrow for Correlating Data in SQL Trace, Part 2!

-Jen McCown

http://www.MidnightDBA.com

*Footnote: Again, I'll have to come back later with a link, but I also learned today that when you port your trace results into a table, duration is stored in microseconds (1/1,000,000 s), instead of milliseconds (1/1,000). This seems like it would be pretty good info to have. Edit: Linky! Here's the Brian Moran article "Duration and CPU Values" that mentions this.