Pure Storage FlashArray

A Peek Inside a SQL Server Backup File

Confessions First

In all of my years working with SQL Server, I never really thought about the actual contents of a SQL Server backup file. Sure, it contains data from a given database, but despite my love of storage engine internals, backup file internals is not something I’ve ever had any interest in looking into.

Until now…

The Premise

This first came up during my onboarding with Pure Storage. Anthony Nocentino (b|t) taught me that a SQL Server backup file is a byte-for-byte copy of your data, as materialized in SQL Server MDF files (assuming no backup compression or backup encryption). And that would make sense – how else would SQL Server store a copy of your data in a backup file? It does not make sense for SQL Server to alter your data when it writes it down to a backup file (again, with NO backup compression/encryption) – that’s a waste of compute and effort.

Well, I had a conversation with someone who was unclear about that assertion. I tried some Google-fu to present some supporting materials, but could not actually find any documentation, official or otherwise, to back it up. So here we are.

Why Do You Even Care?

There’s a bit of Pure Storage related background here, so feel free to skip this section if you don’t care about why I’m writing this.

On FlashArray, we de-duplicate your data behind the scenes across the entire array. So if you had three SQL Servers (Prod, QA, Dev) all attached to a given FlashArray, and each instance had an identical copy of AdventureWorks, it would almost completely dedupe down to one copy on FlashArray.

Along those lines, a single database will have many places where deduplication can also occur within it. Think about how much repetition occurs within a stereotypical dataset. Things like dates, product IDs, product category IDs, etc. are all duplicated throughout a typical dataset, thus ripe for FlashArray to dedupe within your data file too.

But much like the data that resides in each of our databases, there’s a great degree of variability too. You may have a database where practically everything is unique. You may have a database that stores binary data. The list goes on and on. So while we see a certain average deduplication ratio with SQL Server databases, that’s AVERAGE. And often our customers want to know what THEIR database will yield.

And this is where a backup file comes into play.

One trick that Anthony taught me is to provision an empty volume on FlashArray and take a single uncompressed, unencrypted backup of your database and stick the file there. Because the backup file contains a byte-for-byte copy of your data, as materialized in your MDF/NDF files, its dedupe yield will be very close to that of your MDF/NDF files.

Great way to test, huh? Unfortunately the individual I was speaking with was not confident about the underlying byte-for-byte composition of a backup file. So I decided to test, validate, and document it!

Testing Setup

Using SQL Server 2017, I created a simple database with a single table and inserted some data.



Then I created an uncompressed, unencrypted backup file. Finally, I shut down SQL Server’s service and copied the MDF and BAK to another location to begin analysis.

All code I used can be found here: https://github.com/SQLBek/PureStorage/tree/main/backup_test

So What’s Inside?

To quickly analyze differences, I found a cool piece of software called Beyond Compare that has a “Hex Compare” feature – perfect for binary file comparison!


Click for high-res


To give you a quick overview, the left sidebar shows an overview of the two files, with red lines/blocks to designate some kind of difference in the file. In the example screenshot, the left is the MDF file and the right panel is the backup file. This is the beginning of each file, so you can see that there are some differences present here.

Why Is More Than Half Red?!

However, look closer at the sidebar. The first half has very few differences. But what about that second half that’s ALL RED?


Click for high-res


At least that answer is easy. All of those 00’s is simply extra empty space that has been padded at the end of the MDF file. And because it has nothing, it has been omitted from the backup file. I could have truncated the data file first, but I kept this here to illustrate that one’s data file may be larger than the backup file due to this nuance.

Okay, Let’s See the Data


Click for high-res


As for the data itself, that’s present in the 2nd quarter of the MDF file or final 3rd of the backup file. And you can see from this screenshot that the backup file is in fact a byte-for-byte copy of the MDF file!

Takeaways

First, I hope that this is enough to prove that data in a database are re-materlized byte-for-byte in a backup file. Sure, there’s some differences in other metadata, but what I care about in this exercise is whether the data itself is identical, which it is.

Second, if you are still in doubt, I’ve published everything to my github here. If you look inside backup_test.sql, you’ll find some extra code in the form of DBCC IND and DBCC PAGE commands. Instead of searching for data, try using DBCC IND and find a different data structure like an IAM page. Then use DBCC PAGE to look at the raw contents and use the hex editor to search for the matching binary data in both the MDF and backup file. I did that myself and found it cool that those underlying supporting pages are also materialized identically.

Third, if you see a hole or gap with this analysis, please let me know in the comments. I did this to learn and validate things for myself, and I definitely want to know if I made a goof somewhere!

Finally, I hope you enjoyed this and stay curious.

Thanks for reading!

Advertisement

T-SQL Tuesday #150: My First Tech Job

Welcome to another edition of T-SQL Tuesday! This month’s host is Kenneth Fisher and Kenneth asks for bloggers to write about “Your first technical job”

Something A Little Different

I suspect that most will write about their first jobs after college. But for me, I’m going to go further back to senior year of high school.

Took Economics as an elective and it was a really awesome class. Got along really well with my teacher, who knew I was a computer nerd. At some point, she told me that her (adult) daughter needed a copy of WordPerfect and a CD-Rom drive installed in her home desktop and asked if I’d be willing to do the job for her. Sure, why not? I was already acing that class, so it’s not like I needed extra credit or brownie points. She said she’d pay me but never said how much. I didn’t really care; I was eager to help.

Drove over to her daughter’s apartment, installed the CD-Rom drive & WordPerfect, gave them a tour of the software, chatted about Sierra adventure video games with her boyfriend/husband, and left. They handed me an envelope and it had $150 in it! To high schooler Andy, that was a TON of money for an hour of my time. And I I suppose that makes that my “first technical job.”

What’s So Significant About That?

Why am I writing about that story? Because that first job was something I enjoyed and I was eager to help. I knew I was getting paid, but I didn’t really care how much.

Flashing forward more years than I care to admit, and I can say that I’ve been blessed with a career where I’ve been able to continue to work with technology that I love, and help people along the way, all while getting paid to do it. It’s taken a great deal of work and there’s certainly been some rough patches, but I appreciate and am grateful that I continue to be able to help people day in and day out.

What About You?

If you’re reading this, then I would challenge you to think over your career. Do you have an earlier story that helped to shape or foretold who you have become today?

Thanks for reading.