Tuning a large, legacy SQL Server DB – A Journal

The Backstory

As noted in the About page, I’ve worked at EnergyCAP for over 12 years (as of this post).  We’re a leader in our industry… and have been for more than 35 years.  But with any long-term history comes long-term technical debt, especially when not handled well team-to-team, year-to-year.  For us one of the major pieces of technical debt comes in the form of our SQL Server database.

About 21 years ago Enron acquired the original company, Omnicomp, and infused new life into the product and technology that the original application was built on.  SQL Server 6.5 was chosen as the storage engine and a new staff of DBAs was tasked with designing the schema and DB logic to support the application.  At the time it was a C++ installed Windows application.

After Enron collapsed, our CEO, Steve Heinz,  was able to buy back the IP out of bankruptcy and start a new company to continue serving loyal customers.  Enron gave us a great foundation, but unfortunately most of the talent was gone and the company was bootstrapped from the ground up to get to where we are today.

It’s the “where we are today” part that has become a bit problematic at times.  Don’t get me wrong, I still believe we’ve held to many good design tenets over the years.  The folks in Houston, and those that have have stayed with us since, have continued to hold fast on many good choices even when others advocated for “faster” ways.  But EnergyCAP is still a complex database that hasn’t had a thorough review in about 15 years.  We’ve known instinctively for years that there has to be room for improvement, but resources and time just always seem to slip away from digging in and doing the hard work.  And on man, it’s hard work… eh?

Sizing up the Data

To give you an idea of what we’re dealing with, here are some basic stats about the database that backs our core application.

  • There are currently 351 tables in an unaltered DB schema
  • There are 1367 indexes (yeah… do that math…)
  • There are 27 tables that have 8 or more indexes… the top 5 of which account for 116 of those indexes.  Obviously most of these are foreign key indexes and, as it turns out, are never, ever used. 🙁
  • This translates into ~15 tables using 75% of the reserved data, although the features a client uses can skew this some
  • Batch requests/sec on our busiest servers only run at about 50 for 95th percentile week over week. Again, you’ll see later why this doesn’t jive with user experience
  • Although the database is generally in 3NF, has solid FK constraints and other design principles, all of the constraints are disabled and integrity is enforced with triggers. (the astute among you will immediately see the problem here… more on this later)
  • Reporting on data is the #1 thing our clients use EnergyCAP for. We have nearly 400 “canned” reports that ship with the software, each of which has at least 20 different filters that can be set.  Trying to do OLAP well while other users expect OLTP success  in day-to-day work has proven to be difficult for us.
  • The majority of clients have their own copy of the database, although we have modified the database over the years to allow multi-tenant deployments. Most multi-tenant deployments have less than 100 clients per database.

The Problem

So why am I spending time to document my research and journey as I attempt to improve something that’s generally been solid for thousands of clients for 15+ years?  It’s simple.

In all of my time at EnergyCAP, holding numerous positions and heading the development team for 8+ years, I’ve never been satisfied with some of the answers tell ourselves (and clients) when things aren’t performing as well as we think they should.  I can say in good faith that we’ve never lied to a client about the issues.  What I’ve come to learn is that we simply didn’t understand some parts of SQL Server as well as we thought we did, and even after further research came away with the wrong answer.

Admittedly the problems are not always the fault of the database.  Particularly when a client hosts their own application we tend to see more issues than when it’s in our environment.  This is usually because the clients have no dedicated technical resources to maintaining the database and it’s simply one among many.

But still, there are far too many times we see error logs come through and we know that some of the core features of the application just aren’t performing as well as they should be under the given load.

In short, our servers often appear to be underutilized, but our client experience at some of the busiest parts of the day “feel” otherwise… and I’ve never been satisfied with it!

The Catalyst

And maybe more importantly… why now?  If I’ve been with the company for so long, worked with most of our clients in some form or fashion, and I’ve had a leading role in forming the product… what’s changed to make me dive in now?

In January 2016 I attended my second CodeMash and was able to make it for the Pre-Compiler sessions.  One of the all-day training sessions I sat in on was about SQL Server.  Over the course of six hours, ideas and concepts that ranged from basic server setup to queries that revealed information on server internals totally transformed the way I thought about SQL Server.  I left that session unwilling to look at our problems and give another status quo answer moving forward.

In fact, I realized that I had been lying to myself for years about how much I really knew about the core technology of our application.  I had relied too long on my experience and natural understanding of how the server worked.  I was hooked on digging in more than I ever had.

And the most helpful part of this new journey is that we host hundreds of clients in a SaaS environment which has allowed us to see first hand how some of the challenges above can lead to less-than-desirable performance and user experience.  We have a direct proving ground for any changes.  Sweet!

The Journey Begins Now

And now it’s time to dive in.  Over the last 15 months I’ve taken whatever time I could get to dig into the database side of EnergyCAP.  I’ve learned a ton, attended more classes, read hundreds of blog posts, watched Pluralsight courses, went to my first SQL Saturday event, started to put my voice out there on the #sqlhelp thread on Twitter… determined that I’d do my best to contribute back to the community that has done such an amazing job pushing me forward.  At the same time I’m pretty certain that I’m still just trying to pass SQL Kindergarten at this point.  Nothing I say here will be earth shattering… and there’s a really good chance some of it will still be plain wrong.

But hopefully, as I document this first project and others that are spurred on from it, someone will gain insight from my journey, improve upon it, and pass it on.

So grab your comfy hiking boots and some snacks… because this is going to take a while.

Topics I plan to cover include the following.  As the posts are made active, I’ll update the links below as necessary.

  • Performance monitoring, what we’ve tried, and how we’ve landed on using Splunk for now
  • Examining Indexes to simplify the database and improve IO
  • The tooling I’ve found helpful
  • Service Broker and how it totally transformed how legacy processes are run
  • How we’ve tested the changes over time and how we’ve measured “success”