This post will detail the SQL database/infrastructure tests we have configured on our data warehouse environment using dbachecks, the open source PowerShell module.
I will just say I think dbachecks is awesome. I’m pretty grounded and not known to hype things up but if anybody is questioning if they should be using the module I would strongly encourage them to do so. There is a bit of a learning curve if you aren’t used to PowerShell but it’s totally worth the effort and it is possible to be so much more proactive in terms of checking some of the basic things that can go wrong.
Anyway, some background to start with. I am a database administrator and developer, I split my time roughly 50/50 between the two disciplines. Our data warehouse environment is on-premises SQL Server 2016 and we do a full ETL load every night which takes around 4 hours to process. This includes restoring copies of transactional databases onto the DW server, running staging ETL, DW ETL, processing SSAS cubes etc, so it’s quite an intensive workload particularly in terms of IO and CPU.
What made us decide to implement dbachecks? We have considered using it for several months but various high profile and short turnaround projects have meant that lately we haven’t dedicated much time for this kind of thing. About a month ago we had an ETL failure which dbachecks would’ve given us ample warning of had we already been using it: As part of our staging ETL we have a couple of IDENTITY columns which we seed from 500 million to create uniqueness from a legacy system which we are still required to derive history and report from. One of these IDENTITY columns hit it’s maximum value which caused the ETL to fail partway through. This wasn’t a problem for the source column within the transactional system which is seeded from 1, and isn’t likely to be a problem for a few years.
Moving onto the exciting stuff, here are some details about the checks we are currently running on a daily basis:
The Ola Hallengren maintenance tests; we are using all apart from log backups because all databases on our DW server are in simple recovery mode. We do take backups of some databases but not all (because some are restored from backups taken from transactional systems). So we’ve had to customise our implementation so as not to receive failure notifications about database backups which we know we aren’t taking. I will add some detail about how we did this in part 2.
We are only running two of the Agent checks at the moment (AgentServiceAccount and ValidJobOwner) but we will add others later, maybe as infrequent checks. We probably wouldn’t configure FailedJob because some of our jobs routinely fail, we are notified separately and we don’t want these failures appearing on any reports. When we implement on transactional systems it is likely that we will configure FailedJob because we expect all jobs to succeed on OLTP
Here are a list of Database checks along with a description of why we chose to implement:
|DatabaseStatus||We have around 30 databases on our DW environment and we want to know if any of them are not accessible or having some kind of problem.|
|IdentityUsage||As previously described, this is the test which would’ve avoided the ETL failure caused by IDENTITY column hitting the maximum for the data type.|
|LastDiffBackup||We take nightly diff backups of some (but not all) of our databases on DW, and for those that we do we want to know if the last diff backup was within the last last 24 hours (except weekends when we do full backups).|
|LastFullBackup||For databases which are being backed up, full backups are taken every weekend. We want to make sure that the full backups are never older than 1 week.|
|SuspectPage||We do not want any suspect pages in any of our databases|
Here is a selection of the Instance checks we are running. There are so many that I haven’t had time to list and provide comment on them all, but I may edit this post to provide the full list when there is time:
|ADUser||We want to check that our AD user accounts are mapped to users which exist and aren’t locked etc. This may give a useful indication of whether domain service accounts are locked out (which happens occasionally)|
|BackupPathAccess||As I found out last week you never know when somebody will unshare or remove permissions from a folder dedicated to storing backups, so this is a handy check to perform.|
|NetworkLatency||We don’t want query or ETL load performance to be constrained by unacceptably high network latency|
|OLEAutomation||We want to make sure that OLE automation is disabled for security reasons, it is not required for any of the databases on DW server|
|OrphanedFile||Implemented to keep things tidy, and to avoid having large orphaned database files taking up space unnecessarily|
|ErrorLog||Handy to know if we have any severity level 17-24 errors which need investigation|
|LinkedServerConnection||To keep our long list of linked server connections tidy, as we aren’t always informed when another server has been renamed or decommissioned.|
Finally here are some of the Server checks we have configured:
|DiskCapacity||Our data centre team are already monitoring disk capacity but we want to have some visibility for ourselves, to track capacity over time and compare with growth of individual databases (which we are capturing separately).|
|InstanceConnection||We want to know that superficial health of the instance is okay and that the server is responding to ping requests.|
This was just an overview of what we decided to test with dbachecks on a daily basis. We are currently looking at what checks we may want to run more infrequently (maybe once or twice per month) and again I may come back and edit this post at some point.
Part 2 will focus on how we implemented dbachecks on our DW server, including some things which tripped us up and that we need to remember for implementations on other environments.
Hope this is helpful, if further information is required please feel free to contact me.