PASS Summit 2011 Day 3 – Keynote Live Blog with Dr. DeWitt!

Today’s keynote with Dr. DeWitt starts at 8:15am EDT. Click here to watch today’s keynote live. For information on today’s keynote, titled “Big Data – What is the Big Deal?”, see the SQLPASS.org page on Keynotes.

Announcement: bradmcgehee: Free download of SQL Server MVP Deep Dives Special “PASS Edition” at: www.manning.com/passbook/  

8:12 A whoooole lot of us had late nights last night, and we are up bright and early to fill our mushy brains with Dr DeWitt-ness.  PASS is playing good rock music over the loudspeaker, and it’s helping.

8:15 sharp, the lights go down. More videos! “Connect-share-learn. This is community!” THAT WAS PHENOMENAL!

Me: That was WAY better than Tina Turner.

Andy: “FREEBIRD!!”

Kevin Kline’s  Tribute to Wayne Snyder

8:33 I’m going to have to blog from my phone. I still can’t get consistent internet here.

Dr. DeWitt is onstage! We apparently have his wife to thank for his return to the keynotes this year, and for the new glasses.

“This guy doesn’t seem like the marketing type…” [massive cheers]

8:39 Explosion of interest in big data because of the sudden increase in mass-generated data that’s too valuable to discard. (Twitter, web clicks, sensors…)

@Kendra_Little: Takeaway: Invest in hard drive manufacturers. #sqlpass

Why NoSQL? More data model flexibility (no “schema first” requirement), relaxed consistency models such as eventual consistency (willing to trade consistency for availability). And maybe they’re not smart enough to learn SQL. <cheers>

We do a lot of loading/processing, whereas NoSQL has no cleansing, no ETL, no load, and analyzes data where it lands.
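
To make that contrast concrete, here’s a tiny Python sketch of my own (the field layout and helper names are made up, not from the talk): the RDBMS path cleanses and loads into a declared structure before you can query, while the NoSQL/Hadoop path reads the raw files where they landed and decides what the bytes mean at analysis time (“schema-on-read”).

```python
def parse(line):
    """Hypothetical parser: split a tab-delimited click record into named fields."""
    user, page, clicks = line.rstrip("\n").split("\t")
    return {"user": user, "page": page, "clicks": int(clicks)}

# RDBMS-style: cleanse and load into a declared structure first, then query it.
def load_then_query(raw_lines):
    table = [parse(line) for line in raw_lines]          # the "ETL + load" step
    return [row for row in table if row["clicks"] > 10]  # query the loaded table

# NoSQL/Hadoop-style: no load step -- read the raw file where it landed and
# decide what each field means at analysis time.
def analyze_in_place(raw_lines):
    for line in raw_lines:
        fields = line.rstrip("\n").split("\t")
        if int(fields[2]) > 10:   # assumed: field 2 holds a click count
            yield fields
```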

Re: sharding, “I don’t know why they felt the need to rename something we’ve had around since the ’80s.” Oh, like renaming the internet “The Cloud”?

8:49 Two universes are the new reality: structured and unstructured. I don’t understand how no ACID/transactions/SQL/ETL works…

@mikehillwig: Why does the phrase ‘eventual consistency’ scare the hell out of me? #sqlpass

Why this talk? Summarized: We can’t just shout “Get offa mah lawn” anymore. Learn and grow. RDBMSs aren’t going away, but Hadoop (etc.) is the future. “Many businesses will end up with data in both universes.”

8:53 It all started at Google: massive amounts of click-stream data that had to be stored and analyzed. Hadoop’s HDFS and MapReduce (store and process) are the counterparts to the Google components designed to handle this new kind of workload.

Hadoop and MapReduce offer scalability and a high degree of fault tolerance, low up-front software/hardware cost, etc., etc.

The Hadoop ecosystem looks like a Tetris game.

@jdanton: “Hive and Pig sounds like a great bar name #sqlpass”

8:57 File splits. A large file is broken into big blocks (64 MB), each stored as a separate file in NTFS.

@SQLDBA: “DeWitt mentioning Windows and NTFS but the reality is almost all Hadoop installs are running on Linux. #sqlpass”

Those blocks are distributed around the nodes of the cluster. Triple replication is used, so you can survive up to two failures.
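
Here’s a rough Python sketch of the idea as I understood it: chop one big file into fixed-size blocks and scatter three copies of each across the cluster. The block size matches the 64 MB he mentioned; the node names, round-robin placement, and function names are my own illustration, not how HDFS actually does it.

```python
import itertools

BLOCK_SIZE = 64 * 1024 * 1024   # 64 MB blocks, as in the talk
REPLICAS = 3                    # triple replication: survives up to two node failures

def split_into_blocks(path, block_size=BLOCK_SIZE):
    """Yield fixed-size chunks of one large file (each block is stored as its own file)."""
    with open(path, "rb") as f:
        while True:
            chunk = f.read(block_size)
            if not chunk:
                break
            yield chunk

def place_blocks(path, data_nodes):
    """Assign each block to REPLICAS distinct nodes, round-robin style (illustrative only)."""
    assert len(data_nodes) >= REPLICAS, "need at least as many nodes as replicas"
    node_cycle = itertools.cycle(data_nodes)
    placement = {}
    for block_id, _block in enumerate(split_into_blocks(path)):
        placement[block_id] = sorted({next(node_cycle) for _ in range(REPLICAS)})
    return placement

# place_blocks("clickstream.log", ["node1", "node2", "node3", "node4", "node5"])
# -> {0: ['node1', 'node2', 'node3'], 1: ['node1', 'node4', 'node5'], ...}
```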

@DBArgenis: “I think of this as a really weird RAID implementation #sqlpass”

@peschkaj: “Dr DeWitt’s explanation of HDFS is much better than reading any of the original research/documentation. #sqlpass”

@Kendra_Little: “I see software RAID in them there Hadoopery file system! #sqlpass”

@GFritchey: “Just pointing this out, no ‘uhms’ or ‘ahs.’ Smooth… #sqlpass”

The NameNode is always checking the state of the data nodes (heartbeat, balancing, replication, etc.). There’s also a BackupNode.

Fault Tolerance (DataNode Failure)

If a data node fails, the remaining nodes will copy over their copies of the “lost” data so there’s still 3-way redundancy. If the name node fails… (I missed that part). When a new data node comes online, the name node automatically detects that it’s available and redistributes/copies data blocks for optimum redundancy.
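
My hedged sketch of that heartbeat/re-replication loop, with invented names and timeouts: the NameNode notices a DataNode has gone quiet, drops its replicas from the block map, and tops each under-replicated block back up to three copies on live nodes.

```python
import time

HEARTBEAT_TIMEOUT = 30   # seconds of silence before a DataNode is presumed dead (made up)
REPLICAS = 3

def find_dead_nodes(last_heartbeat, now=None):
    """last_heartbeat maps node -> timestamp of its most recent heartbeat."""
    now = time.time() if now is None else now
    return {node for node, seen in last_heartbeat.items() if now - seen > HEARTBEAT_TIMEOUT}

def re_replicate(block_map, dead_nodes, live_nodes):
    """block_map maps block_id -> set of nodes holding a replica.
    Drop replicas on dead nodes, then top each block back up to REPLICAS copies."""
    for block_id, holders in block_map.items():
        holders -= dead_nodes
        for node in live_nodes:
            if len(holders) >= REPLICAS:
                break
            if node not in holders:
                holders.add(node)   # a real system would also stream the block's bytes here
    return block_map
```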

@AirbornGeek: “Don’t really like that no auto-failover of the name node #sqlpass”

It’s designed for scanning large amounts of data, not OLTP.

9:08 No use of mirroring or RAID, all block replication… why? It reduces cost, and there’s one mechanism (triply replicated blocks) to deal with a wide variety of failure types. Downside: you don’t know where your data is sitting, so certain optimization options aren’t available to you.

@Kendra_Little: But how can I tell when my data is near a frozen yogurt shop? #sqlpass
@BrentO: Because you’ll see the pedabytes.

Mid-keynote Editorial (by me):

Just had a brief conversation with Sean, and I think we’ve more or less reached a consensus. It seems pretty clear that they’re trying to get us excited about Hadoop and big data, and Dr. DeWitt is doing it right.  You have a room full of technical people: to get them excited about a product, show what it’s for, and then how it works at a good, detailed, technical level.  Notice by the way that his entire presentation is lecture and PowerPoint, and we still don’t mind it…he’s not wowing us with demos because frankly the talk is more technical than a “look I ran a command and made a thing happen” demo would be, in this instance.

In short: Good message, great presenter, great strategy. Hadoop isn’t likely to be relevant to me in the coming 12 months, but (1) I now have a very solid idea of what it is, what it’s for, and how it works, and (2) there are people in the room for whom this is (or for whom it just became AS OF THIS KEYNOTE) relevant, and this is the talk they need to hear.

Okay, back to the show…

9:26 MapReduce summary… Pros: highly fault tolerant; relatively easy to write “arbitrary” distributed computations over very large amounts of data; the MR framework removes the burden of dealing with failures from the programmer. Cons: schema embedded in application code. …

@andrewbrust: DeWitt comments that the lack of a declarative query language is a downside of MapReduce. That puts it politely. 

@dirkMyers: Dr. DeWitt makes the case for SQL… 4 pages 5pt. font versus 1/2 page 20pt. font. #sqlpass #ymmv
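
To make the verbosity and “schema lives in the code” points concrete, here’s a minimal, purely illustrative map/reduce pass in plain Python (no Hadoop API), counting orders per customer from raw delimited lines. The file format and field positions are my assumptions for the example, not anything from the talk.

```python
from collections import defaultdict

def map_phase(lines):
    """Map: parse each raw line and emit (customer_id, 1).
    Note that the 'schema' -- which field means what -- lives here in application code."""
    for line in lines:
        fields = line.rstrip("\n").split("|")   # assumed pipe-delimited order records
        customer_id = fields[1]                 # assumed: field 1 is the customer id
        yield customer_id, 1

def reduce_phase(pairs):
    """Reduce: sum the counts per key.
    SQL equivalent: SELECT customer_id, COUNT(*) FROM orders GROUP BY customer_id"""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

# counts = reduce_phase(map_phase(open("orders.txt")))
```

The real framework adds the shuffle/sort between the two phases and, as he noted, takes the burden of handling node failures off the programmer; in SQL the whole thing is a one-line GROUP BY.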

9:31 Benchmarks!!! “I’m a performance kind of guy…” He outlined his benchmark setup: 9 servers, 8 cores, memory/SAS drives, etc. 140 GB, “a billion rows or so of orders.” Showed the breakdown of Hive vs. PDW. Spoilers: PDW won.

Most of us at the blogger tables are watching intently, not typing most of the time. The exception is Mr. Grant Fritchey, on my right, live blogging like mad to try to keep up. At this point (after the fact), it would be a good idea to watch the keynote recording and use my and Grant’s blogs to follow along. You’re lucky…you have the ability to pause and replay!

@AndrewBrust: DeWitt’s team will try to build such a system in their lab. They need a name. AmbiSQL? #SQLPASS

@SQL_Kiwi: Multiverse Data Manager #sqlpass

Me: Hey Dr. DeWitt: Why not call the Enterprise Data Manager a TARDIS? Because it can go between universes. #sqlpass

9:47 “RDBMS-only or Hadoop-only is NOT going to be the default.” Are “Enterprise Data Managers” the answer?

@sqlpass_de: RDBMS or Hadoop is NOT the question. Both have their place. #sqlpass #sqlpass_de

So, what should Dr. DeWitt talk about next year? Send your ideas to dewitt@microsoft.com !

@StrateSQL: I vote for Dewitt breaking down technology for everyone to consume #sqlpass

@AndrewBrust: This is a manifesto on the grand unification of database paradigms. Makes sense to me. Why fight about it? #SQLPASS

Thanks to Dr. DeWitt and his team for this most excellent talk…once again, the highlight of the PASS Summit!!

@SQLPASS announced: You can download Dr. DeWitt’s slides from PASS Summit 2011 and previous years here: http://t.co/o9zKgYDB

-Jen McCown, http://www.MidnightDBA.com/Jen

A few more tweets:

@adam_jorgensen: Check out how the Presidential Election is using BI and BigData #SQLServer #SQLPASS http://ow.ly/6SzSJ