PASS Day 3: Keynote Live Blog

 I’m live blogging the PASS keynote…it’s 10:01 local time.   

  • Me: Tuesday is #SQLSat shirt day, Wednesday is #sqlkilt day, Thursday is Show Up day. #sqlpass

Here’s the PASS website description of the day 3 keynote…this is the one everyone’s been talking about, the DeWitt keynote. Remember to follow along with the livestream: http://www.sqlpass.org/summit/na2010/livekeynotes.aspx

Rick Heiges:

Heh, we just applauded the internet.

PASS board of directors is a volunteer board.  They keep saying “best ___ EVER”, and I always hear “Best ___ EVAR!!!!”  Board Directors are now Douglas McDowell, Andy Warren, and Allen Kinsel.

  • RT @wendy_dance: Well deserved kudos for outgoing board member @lyndarab #sqlpass This #passwit is an inspiration!

New event in 2011: The PASS SQLRally, with precons in BI, DBA, Dev, and Prof Dev tracks (including Grant Fritchey [http://scarydba.wordpress.com], Kevin Kline, Patrick LeBlanc, Devin Knight, and more). #SQLRally is now open for you to register. Precon $299, conference $199, or both together for $449 before Dec 1. See www.SQLRally.com

  • SQLChicken: Preview Plus package: Summit + 2 precon seminars = $1295, good until Dec. 31st #sqlpass

Wow, next year Summit is October 11-14, with two precon days instead of one precon and one postcon day. Earlybird reg is under $1k. Cool, maybe we’ll be able to hit SQL Connections in November for the first time ever…maybe…

  • @AndyLeonard: #sqlpass DVDs available for non-attendees for $395
  • @SQLFool: Seriously, folks, you should get the dvd. Just think of all the awesome sessions you missed. #sqlpass

Send questions for today’s keynote to ASKDrDeWitt@SQLPASS.org #sqlpass Kids at home, play along!

 

Dr David DeWitt: Technical Fellow, Data and Storage Platform Division

(Note: Get David DeWitt’s slides on Query Optimization here: http://on.fb.me/94GszE)

  • Me: David DeWitt is running out of things to talk about. #sqlpass
  • Me: Hah, chart: The Impress Index descends to zero from day 1 to day 3 of #sqlpass

We’re getting art by the SQL Server query optimizer. Every color in the pic represents different parameters in the TPC-H benchmark.  Today we’re talking about  SQL query optimization (another power lecture).  Fundamentals, leaving with an understanding of why optimizers sometimes produce bad plans. Query optimization is harder than rocket science.

100,000 ft view has “magic happens” as the key component.

  • KimberlyLTripp: Awesome slide #sqlpass SQL goes in -> magic happens -> awesome query plan comes out! LOL
  • MarkGStacey: “Query Optimisation is not rocket science. When you flunk out of Query Optimisation, we make you do rocket science” #SQLPASS

For Query 8 of the TPC-H benchmark, there are 22 million alternative ways of executing the query. Cost-based query optimization was invented by Pat Selinger in the 70s, and this remains the hardest part of building a DBMS. QO transforms queries into an efficient execution plan: parse [into a logical operator tree], optimize [pick physical operators], execute. Physical operators = how they do it, e.g. nested loop join, etc.
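
To get a feel for where a number like 22 million comes from, here’s some back-of-the-envelope counting (my arithmetic, not a slide) of join orderings for an n-table query – Q8 joins 8 tables, and the counts land in the same tens-of-millions ballpark:

```python
# Counting join orders for an n-table query (a sketch, not DeWitt's math).
from math import factorial

def left_deep_orders(n: int) -> int:
    # One left-deep plan per permutation of the n tables.
    return factorial(n)

def bushy_trees(n: int) -> int:
    # All binary join trees on n tables: n! * Catalan(n-1) = (2n-2)!/(n-1)!
    return factorial(2 * (n - 1)) // factorial(n - 1)

for n in (4, 6, 8):
    print(f"{n} tables: {left_deep_orders(n):>8,} left-deep, {bushy_trees(n):>12,} bushy")
```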

Example…  shows that the best plan depends largely on how many rows will meet the query criteria (in this case, WHERE id = 287).

Equivalence rules: Select and join operators commute with each other – doesn’t matter which comes first in select or which is the lefthand join table. Other equiv rules: associate, distribute, cascade.  Examples given of equivalent logical plans.
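
A toy illustration (mine, not from the slides) of the commute rule: pushing the selection below the join returns exactly the same rows, which is what gives the QO its menu of alternatives.

```python
# Toy tables as lists of dicts (my stand-ins for Movies and Reviews).
movies  = [{"MID": 1, "title": "Up"}, {"MID": 2, "title": "Heat"}]
reviews = [{"MID": 1, "stars": 5}, {"MID": 2, "stars": 3}, {"MID": 1, "stars": 4}]

def join(r, m):
    # naive equality join on MID
    return [{**x, **y} for x in r for y in m if x["MID"] == y["MID"]]

plan_a = [row for row in join(reviews, movies) if row["stars"] >= 4]    # join, then select
plan_b = join([r for r in reviews if r["stars"] >= 4], movies)          # select, then join

assert plan_a == plan_b   # logically equivalent plans, but very different costs at scale
```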

This is really hard to blog and track – I’m essentially taking notes on a university lecture now. I’m not complaining, but the blog isn’t going to be terribly helpful if you’re not listening at home.

  • @KimberlyLTripp: #sqlpass statistics are the key to plan choice and the basis of the tipping point (is a query “selective enough”)
  • @erinstellato: Dr. DeWitt “The query optimizer estimates the cost based on the statistics it has.”
  • @Kendra_Little: DrDewitt’s Lab’s Facebook page (where slides will be posted) is here: http://on.fb.me/dkwbvU #sqlpass
  • @unclebiguns: #sqlpass I feel like @SQLChicken. Retweeting everything

“I could’ve gotten much more out of this, but I got bored doing the PowerPoint.”

  • @merv: optimizer’s equivalence rules among operators. develops logically equivalent alternatives
  • SQLRockStar: next year we should have Dr. DeWitt speak BEFORE #sqlkaraoke night #sqlpass

Selectivity Estimation: The task of estimating how many rows will satisfy a predicate such as Movies.MID=932. Plan quality is highly dependent on the quality of the estimates that the QO makes.
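
For context, the crudest estimator just assumes every value is equally common – estimated rows = table size / distinct values. A minimal sketch (made-up toy data, my numbers) of how badly that goes when the data is skewed:

```python
# Uniformity assumption: estimated rows = |R| / number of distinct values.
reviews_mid = [932] * 400 + list(range(1000, 1600))   # MID 932 is a hot value

n_rows = len(reviews_mid)                 # 1,000 rows
ndv = len(set(reviews_mid))               # 601 distinct MIDs
uniform_estimate = n_rows / ndv           # ~1.7 rows for *any* MID
actual = reviews_mid.count(932)           # 400 rows

print(f"uniform estimate {uniform_estimate:.1f} vs actual {actual} rows")
```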

  • Me: Notice that the keynote room is nearly silent: This guy is keeping everyone’s attention. Incredible speaker – easygoing, engaging. #sqlpass
  • @joewebb: Really smart people can make complex topics understandable. Dr. DeWitt is a pro at this! #sqlpass
  • Next slide, “Histogram Motivation”…sounds like a great name for a Big Bang Theory episode. You listening, @BillPrady? #sqlpass
  • @AndyLeonard #sqlpass Microsoft presenters: Thank you for your presentations. Please remove the marketing and present like Dr. Dewitt.

If customer 9 has submitted 55 of 939 reviews, the percentage of reviews for the predicate customerID=9 is about 5%. We can’t store that level of detail in statistics, so we need to be able to estimate how many rows satisfy the predicate. The answer is histograms. All buckets cover roughly the same key range (one bucket is high, about 160…another bucket [several customers] is lower, around 90). Most of those buckets are pretty close in percentage to the individual customers they represent, about 5% of the total reviews each. But there is one bucket that’s way higher than all the others, so it’s inaccurate. That was an equal-width histogram, 4 customers per bucket.

Here is an equi-height histogram – divide the ranges so that all buckets contain roughly the same number of values. If the optimizer uses equi-height histograms, we do much better, but there’s still an error. Unless the data’s exactly uniformly distributed, we’re going to get these errors in estimating selectivity…

Summary: histograms are a critical tool for estimating selectivity factors for selection predicates, but errors still occur.
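
Here’s a minimal sketch of the equi-height idea (my own toy data and bucket logic, not the slide’s), including the leftover error the summary is talking about:

```python
# Equi-height histogram: bucket boundaries chosen so every bucket holds
# roughly the same number of rows; estimate an equality predicate by
# assuming uniformity *within* each overlapping bucket.
values = sorted([1]*30 + [2]*25 + [3]*20 + [4]*35 + [5]*22 +
                [6]*28 + [7]*31 + [8]*19 + [9]*160)   # skewed toward customer 9

def equi_height(vals, n_buckets):
    per = len(vals) // n_buckets
    chunks = [vals[i*per : (i+1)*per if i < n_buckets-1 else len(vals)]
              for i in range(n_buckets)]
    # store (low key, high key, row count, distinct keys) per bucket
    return [(c[0], c[-1], len(c), len(set(c))) for c in chunks]

hist = equi_height(values, 4)
estimate = sum(cnt / distinct for lo, hi, cnt, distinct in hist if lo <= 9 <= hi)
print(hist)
print(f"estimate {estimate:.0f} vs actual {values.count(9)} rows")  # error remains
```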

  •  @Datachix2: (Shatner voice) Can’t… Absorb… Fast… Enough! #sqlpass
  • tjaybelt: DeWitt + Itzik in the same day? in the same 4 hour period? you are an animal. my brain hurts thinking about it. #sqlpass

Two key costs that the optimizer considers – I/O time and CPU time. Actual values are highly dependent on the CPU and I/O subsystems. And in some cases (missed what he said specifically), network traffic.

Given two plans, which is cheaper? We’ll assume for now that the optimizer gets it exactly right, that the query will be satisfied by 100 rows.

  • Tim_Mitchell: Trying to live blog this material is like trying to document a human DNA chain on a post-it note. #sqlpass
  • Me: Blog entry for this time slot: *database science happens here* #sqlpass
  • KimberlyLTripp: @sqlfool @erinstellato I’m glad that none of us were given the “math is hard” barbie! LOL #sqlpass

But what if the estimate is wrong? The non-clustered index plan is better at fewer rows, and the sequential scan is better for 10,000 rows.
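
To make that crossover concrete, here’s a toy cost model (my constants, nothing like SQL Server’s real costing):

```python
# A toy cost model: a sequential scan reads every page; a nonclustered index
# does roughly one random page fetch per qualifying row.
TABLE_PAGES = 10_000

def seq_scan_cost(rows):
    return TABLE_PAGES                 # same price no matter how many rows qualify

def index_plan_cost(rows):
    return 3 + rows                    # a few index pages + one fetch per row

for rows in (100, 2_000, 10_000, 50_000):
    winner = "index" if index_plan_cost(rows) < seq_scan_cost(rows) else "seq scan"
    print(f"{rows:>6} qualifying rows -> {winner}")
```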

Estimating join costs – three basic join methods: nested-loops join, sort-merge join, hash join. Very different performance characteristics. “This is the place where the QO very frequently makes a mistake.”

Sort-merge join algorithm: sort R and M on the join column (MID), then scan them to do a “merge” (on the join column), and output result tuples. WHAT DID I JUST TYPE??? Hah, this is the slide that was online as a preview to this keynote… Reviews (R) and Movies (M) sorted first, then joined together.

  • KimberlyLTripp: Merge best when leveraging suitably sorted sets (i.e. indexes w/same leading keys) there are good ways to strat for this. #sqlpass

Nested-loops join – scan table R and for each tuple in R probe tuples in M (by scanning it), output result tuples. So there’s no sort involved here, just a scan and join. IO cost = scanning R (R IOs), plus R*M (because we’re scanning the M table R times). Cool.

Index nested-loops join: Scan R, and for each tuple in R probe tuples in M (by probing its index). …
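
Since this is “the place where the QO very frequently makes a mistake,” here are toy implementations (mine, just to pin down the mechanics – table names borrowed from the example) of the three methods:

```python
# Toy implementations of the three join methods on (MID, ...) tuples.
from collections import defaultdict

def nested_loops_join(R, M):
    # Scan R; for each tuple in R, scan all of M.
    # IO intuition from the slide: |R| + |R| * |M| reads, no sort needed.
    return [(r, m) for r in R for m in M if r[0] == m[0]]

def sort_merge_join(R, M):
    # Sort both inputs on the join column, then merge like a zipper.
    R, M = sorted(R), sorted(M)
    out, i, j = [], 0, 0
    while i < len(R) and j < len(M):
        if R[i][0] < M[j][0]:
            i += 1
        elif R[i][0] > M[j][0]:
            j += 1
        else:
            k = j
            while k < len(M) and M[k][0] == R[i][0]:   # emit the matching run
                out.append((R[i], M[k]))
                k += 1
            i += 1
    return out

def hash_join(R, M):
    # Build a hash table on one input, probe it with the other.
    table = defaultdict(list)
    for m in M:
        table[m[0]].append(m)
    return [(r, m) for r in R for m in table[r[0]]]

R = [(1, "review-a"), (2, "review-b"), (1, "review-c")]   # Reviews(MID, ...)
M = [(1, "Up"), (2, "Heat")]                              # Movies(MID, title)
assert sorted(nested_loops_join(R, M)) == sorted(sort_merge_join(R, M)) == sorted(hash_join(R, M))
```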

  • Me: I notice no one’s giving Dr DeWitt any trouble for his using SELECT *. #respect #sqlpass
  • GlennAlanBerry: He did not schema qualify his table either
  • kekline: Dr David DeWitt =WANTED= to talk about #MapReduce here at #sqlpass. Read more of his thoughts here: http://bit.ly/dqu0Z9
  • @sqlpass: Standing room only and dead quiet in the #sqlpass keynote room as we all pay attention #lovingdaviddewitt
  • AndyLeonard: Q: How do you make a bunch database geeks all giggly? A: Information. Delivered professionally. Communicated passionately. #sqlpass
  • peschkaJ: Want to know how the SQL Server Cost Based Optimizer works? http://is.gd/gWsbI #sqlpass
  • Toshana: I hope he tells us about the day when he fell off the toilet, hit his head, and thought up the flux capacitor #sqlpass

Estimating result cardinalities: Consider SELECT * FROM Reviews WHERE

  • @wendy_dance: Last year I came away from #sqlpass with the fire to build my network. This year, I’m inspired to get out and TEACH. Thanks

Two common query shapes: star join queries (all join to a central table), chain join queries (each table joins to the next in a line).

  • Martinpoon: we use only left deep query plans to reduce the search space, & we ignore bushy plans #SQLPASS
  • Me: RT @way0utwest: let MS know we love Dr. Dewitt /want to see more of him instead of product demos. #sqlpass < Cut testimonials, not demos!

Bottom-up QO using dynamic programming: Optimization is performed in N passes (if N relations are joined). …Sigh, I have no way to summarize this. This approach is still exponential in the number of tables.
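
The best I can manage is a stripped-down sketch of the pass structure (fake cardinalities and a fake cost formula, mine; left-deep plans only, per the tweet above):

```python
# Selinger-style bottom-up optimization: pass 1 costs single tables,
# pass k keeps only the cheapest plan for each k-table subset.
from itertools import combinations

CARD = {"R": 1000, "M": 500, "C": 200, "A": 50}   # fake table cardinalities

def join_cost(left_card, right_card):
    # toy nested-loops cost: scan left once, scan right per left row
    return left_card + left_card * right_card

best = {}                                  # frozenset of tables -> (cost, card, plan)
for t, c in CARD.items():                  # pass 1: single relations
    best[frozenset([t])] = (c, c, t)

tables = list(CARD)
for k in range(2, len(tables) + 1):        # pass k: best plan per k-table subset
    for subset in map(frozenset, combinations(tables, k)):
        candidates = []
        for t in subset:                   # left-deep: the inner input is one table
            lcost, lcard, lplan = best[subset - {t}]
            cost = lcost + join_cost(lcard, CARD[t])
            card = max(1, lcard * CARD[t] // 10)   # fake join selectivity of 0.1
            candidates.append((cost, card, f"({lplan} JOIN {t})"))
        best[subset] = min(candidates)     # prune everything but the cheapest

print(best[frozenset(tables)])
```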

  • SQLChicken: LOLDewitt: Im in ur keynotez, meltin ur brainz #sqlpass

Optimizers still pick bad plans too frequently, for several reasons:

  • bad stats
  • cardinality estimates assume uniformly distributed values
  • attribute values are correlated with one another (see the toy demo after this list)
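
On that last bullet, a toy demo (my data, using the classic city/state example rather than anything from the talk) of why multiplying per-predicate selectivities falls apart when columns are correlated:

```python
# Independence assumption: estimate(sel_A AND sel_B) = sel_A * sel_B.
# With correlated columns (every Seattle row is in WA), that collapses.
rows = [("Seattle", "WA")] * 90 + [("Portland", "OR")] * 910

sel_city  = sum(r[0] == "Seattle" for r in rows) / len(rows)   # 0.09
sel_state = sum(r[1] == "WA"      for r in rows) / len(rows)   # 0.09
independent_estimate = sel_city * sel_state * len(rows)        # ~8 rows
actual = sum(r == ("Seattle", "WA") for r in rows)             # 90 rows

print(f"estimate {independent_estimate:.0f} vs actual {actual}")
```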

Better understanding of QO behavior: Bing “Picasso Haritsa” to find the project’s website (the Picasso Project). Simple but powerful idea: for a given query (say, with 2 predicates), you vary the predicates, and the software gets back from the DB system what the QO thinks the cost is for each combination of input parameters, then plots the results…a visual representation of a plan space. Every different color represents a different plan that SQL produced – 90k different queries submitted (the same query with varying parameters), 204 different plans.

Here is my WHOA! moment: this visual has a big impact…that one query can have such a massively different footprint depending solely on the parameters given really shows us how much variance there is in what the QO can return.
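
To make the idea concrete, here’s a crude imitation (mine – nothing to do with the real Picasso tool’s internals) of how a plan diagram gets built: vary two predicate selectivities over a grid, ask a toy cost model which plan wins at each point, and count the distinct plans that tile the space.

```python
# Toy "plan diagram": which plan is cheapest at each (sel1, sel2) point?
def cheapest_plan(sel1, sel2, table_pages=10_000):
    plans = {
        "seq_scan":    table_pages,                  # flat cost
        "index_on_p1": 3 + sel1 * table_pages * 100, # cheap only when p1 is selective
        "index_on_p2": 3 + sel2 * table_pages * 100, # cheap only when p2 is selective
    }
    return min(plans, key=plans.get)

grid = [[cheapest_plan(i / 100, j / 100) for j in range(100)] for i in range(100)]
print(len({p for row in grid for p in row}), "distinct plans on a 100x100 grid")
```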

Query plans should be more stable…

“Reduced” Plan Diagrams – robustness is somehow tied to the number of plans. This graphic is looking at slightly slower (2% slower on average) plans, and fewer of them. Since each plan covers a larger area, it’ll be less sensitive to changes in parameters.

  • @MishDBA: #sqlpass I have a sudden urge to update statistics

How might we do better? At QO time, have the QO annotate compiled query plans with statistics and check operators – helping the optimizer learn. Optimization of subsequent queries benefits from the observed statistics. That would be mega-cool.

Another idea: Dynamic reoptimization.
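
Pure speculation on my part about what a “check operator” might look like: a pass-through that counts rows and flags the plan for reoptimization when reality strays too far from the estimate.

```python
def check_operator(child_rows, estimated, factor=10):
    # Pass rows through untouched, then compare actual vs estimated
    # cardinality (a hypothetical sketch, not SQL Server's design).
    seen = 0
    for row in child_rows:
        seen += 1
        yield row
    if seen > estimated * factor or seen * factor < estimated:
        print(f"estimated {estimated}, saw {seen}: flag plan for reoptimization")

consumed = list(check_operator(iter(range(500)), estimated=10))  # triggers the flag
```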

Finish. Wow.

  • ?: *Clap from my desk* #sqlpass
  • GFritchey: GIGGITY! This was great! #sqlpass
  • SQLRockstar: keynote? i want Dr. DeWitt to do a pre-con at the next #sqlpass
  • Erinstellato: Thank God this session will be on the DVDs. #sqlpass
  • Me: Sean: “I got a question: What?” #sqlpass
  • sqlpass: We want your suggestions for David DeWitt’s keynote at Summit 2011! #sqlpass
  • Me: The problem with the optimizer team is that dr DeWitt is not on the optimizer team. #sqlpass [S]

“I want hints to go away.” – YES!!!

  •  @sql_r: Please invite Dr. DeWitt back to #sqlpass 2011, topics: MapReduce, query optimization, NoSQL, the phonebook, shit, anything!