A Peek Inside a SQL Server Backup File

Confessions First

In all of my years working with SQL Server, I never really thought about the actual contents of a SQL Server backup file. Sure, it contains data from a given database, but despite my love of storage engine internals, backup file internals is not something I’ve ever had any interest in looking into.

Until now…

The Premise

This first came up during my onboarding with Pure Storage. Anthony Nocentino (b|t) taught me that a SQL Server backup file is a byte-for-byte copy of your data, as materialized in SQL Server MDF files (assuming no backup compression or backup encryption). And that would make sense – how else would SQL Server store a copy of your data in a backup file? It does not make sense for SQL Server to alter your data when it writes it down to a backup file (again, with NO backup compression/encryption) – that’s a waste of compute and effort.

Well, I had a conversation with someone who was unclear about that assertion. I tried some Google-fu to present some supporting materials, but could not actually find any documentation, official or otherwise, to back it up. So here we are.

Why Do You Even Care?

There’s a bit of Pure Storage related background here, so feel free to skip this section if you don’t care about why I’m writing this.

On FlashArray, we de-duplicate your data behind the scenes across the entire array. So if you had three SQL Servers (Prod, QA, Dev) all attached to a given FlashArray, and each instance had an identical copy of AdventureWorks, it would almost completely dedupe down to one copy on FlashArray.

Along those lines, a single database will have many places where deduplication can also occur within it. Think about how much repetition occurs within a stereotypical dataset. Things like dates, product IDs, product category IDs, etc. are all duplicated throughout a typical dataset, thus ripe for FlashArray to dedupe within your data file too.

But much like the data that resides in each of our databases, there’s a great degree of variability too. You may have a database where practically everything is unique. You may have a database that stores binary data. The list goes on and on. So while we see a certain average deduplication ratio with SQL Server databases, that’s AVERAGE. And often our customers want to know what THEIR database will yield.

And this is where a backup file comes into play.

One trick that Anthony taught me is to provision an empty volume on FlashArray and take a single uncompressed, unencrypted backup of your database and stick the file there. Because the backup file contains a byte-for-byte copy of your data, as materialized in your MDF/NDF files, its dedupe yield will be very close to that of your MDF/NDF files.
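If you want to try this yourself, the backup command is nothing special – just a plain full backup with compression explicitly off, something along these lines (the drive letter and path are placeholders for the empty volume, not anything from the original script):

/* Hypothetical example - drive letter and path are placeholders for the empty volume */
BACKUP DATABASE YourDatabase
TO DISK = N'P:\DedupeTest\YourDatabase_FULL.bak'
WITH NO_COMPRESSION,   -- write the data pages into the backup file unaltered
     INIT,             -- overwrite any prior backup in that file
     STATS = 10;

Once it completes, that volume's data reduction numbers on the array give you a good preview of what your MDF/NDF files would yield.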

Great way to test, huh? Unfortunately the individual I was speaking with was not confident about the underlying byte-for-byte composition of a backup file. So I decided to test, validate, and document it!

Testing Setup

Using SQL Server 2017, I created a simple database with a single table and inserted some data.



Then I created an uncompressed, unencrypted backup file. Finally, I shut down SQL Server’s service and copied the MDF and BAK to another location to begin analysis.
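For reference, the setup was along these lines – a simplified, illustrative sketch rather than the exact script (the database name, table, and file path below are made up):

/* Simplified, illustrative sketch - the actual script is in the GitHub repo linked below */
CREATE DATABASE BackupPeek;
GO
USE BackupPeek;
GO
CREATE TABLE dbo.SampleData (
   RecID     INT IDENTITY(1,1) PRIMARY KEY,
   SomeValue VARCHAR(100) NOT NULL
);
GO
-- Insert easily recognizable values to search for later in the hex editor
INSERT INTO dbo.SampleData (SomeValue)
SELECT TOP (10000) 'FindMeInTheHexEditor_' + CAST(ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS VARCHAR(10))
FROM sys.all_columns AS ac1
CROSS JOIN sys.all_columns AS ac2;
GO
-- Uncompressed, unencrypted full backup (same form as the earlier BACKUP example)
BACKUP DATABASE BackupPeek
TO DISK = N'C:\Temp\BackupPeek_FULL.bak'
WITH NO_COMPRESSION, INIT;
GO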

All code I used can be found here: https://github.com/SQLBek/PureStorage/tree/main/backup_test

So What’s Inside?

To quickly analyze differences, I found a cool piece of software called Beyond Compare that has a “Hex Compare” feature – perfect for binary file comparison!




To give you a quick overview, the left sidebar shows a map of the two files, with red lines/blocks designating differences. In the example screenshot, the left panel is the MDF file and the right panel is the backup file. This is the beginning of each file, so you can see that there are some differences present here.

Why Is More Than Half Red?!

However, look closer at the sidebar. The first half has very few differences. But what about that second half that’s ALL RED?




At least that answer is easy. All of those 00s are simply extra empty space that has been padded at the end of the MDF file. And because that space contains nothing, it has been omitted from the backup file. I could have truncated the data file first, but I kept it here to illustrate that one's data file may be larger than the backup file due to this nuance.

Okay, Let’s See the Data




As for the data itself, it's present in the second quarter of the MDF file and the final third of the backup file. And you can see from this screenshot that the backup file is in fact a byte-for-byte copy of the MDF file!

Takeaways

First, I hope that this is enough to prove that the data in a database is materialized byte-for-byte in a backup file. Sure, there are some differences in other metadata, but what I care about in this exercise is whether the data itself is identical, and it is.

Second, if you are still in doubt, I've published everything to my GitHub here. If you look inside backup_test.sql, you'll find some extra code in the form of DBCC IND and DBCC PAGE commands. Instead of searching for data, try using DBCC IND to find a different data structure, like an IAM page. Then use DBCC PAGE to look at its raw contents, and use the hex editor to search for the matching binary data in both the MDF and backup file. I did that myself and found it cool that those underlying supporting pages are also materialized identically.
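If you want to follow along, the pattern looks roughly like this (the database and table names are the illustrative ones from my sketch above, not the repo's; the file id and page id are examples you'd pull from the DBCC IND output, and DBCC PAGE is undocumented, so stick to a test instance):

DBCC TRACEON(3604);   -- route DBCC PAGE output to the client session instead of the error log

-- List every page belonging to the table; look for PageType = 10 (an IAM page)
DBCC IND ('BackupPeek', 'dbo.SampleData', -1);

-- Dump one page's raw contents: database, file id, page id, print option (2 = full hex dump)
DBCC PAGE ('BackupPeek', 1, 312, 2);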

Third, if you see a hole or gap with this analysis, please let me know in the comments. I did this to learn and validate things for myself, and I definitely want to know if I made a goof somewhere!

Finally, I hope you enjoyed this and stay curious.

Thanks for reading!

T-SQL Tuesday #150: My First Tech Job

Welcome to another edition of T-SQL Tuesday! This month's host is Kenneth Fisher, and Kenneth asks bloggers to write about "Your first technical job."

Something A Little Different

I suspect that most will write about their first jobs after college. But for me, I’m going to go further back to senior year of high school.

Took Economics as an elective and it was a really awesome class. Got along really well with my teacher, who knew I was a computer nerd. At some point, she told me that her (adult) daughter needed a copy of WordPerfect and a CD-ROM drive installed in her home desktop and asked if I'd be willing to do the job for her. Sure, why not? I was already acing that class, so it's not like I needed extra credit or brownie points. She said she'd pay me but never said how much. I didn't really care; I was eager to help.

Drove over to her daughter's apartment, installed the CD-ROM drive & WordPerfect, gave them a tour of the software, chatted about Sierra adventure video games with her boyfriend/husband, and left. They handed me an envelope and it had $150 in it! To high schooler Andy, that was a TON of money for an hour of my time. And I suppose that makes that my "first technical job."

What’s So Significant About That?

Why am I writing about that story? Because that first job was something I enjoyed and I was eager to help. I knew I was getting paid, but I didn’t really care how much.

Flashing forward more years than I care to admit, and I can say that I’ve been blessed with a career where I’ve been able to continue to work with technology that I love, and help people along the way, all while getting paid to do it. It’s taken a great deal of work and there’s certainly been some rough patches, but I appreciate and am grateful that I continue to be able to help people day in and day out.

What About You?

If you’re reading this, then I would challenge you to think over your career. Do you have an earlier story that helped to shape or foretold who you have become today?

Thanks for reading.

T-SQL Tuesday #149: Less Is More… Or Is It?

Welcome to another edition of T-SQL Tuesday! This month's host is Camila Henrique, and Camila asks bloggers to write about "T-SQL advice you'd give to your younger self."

Less Is More

I studied Computer Science in college, then in my first job out of school I was a combination internet developer and sysadmin. I was taught "less is more" when it came to writing code, as many of us were.

So when I started working with T-SQL, I began crafting gigantic, do-it-all T-SQL statements. While the statement/query itself might have been many lines of code, it was still just one single discrete command. Less is more, right? Better to send over one query rather than a bunch, right?

… Or Is It?

The "less is more" bias was turned around when I started to learn about storage engine and query optimizer internals. As I talk about in many of my presentations, I'd rather send a bunch of simple queries to the optimizer than one single gigantic one (generally speaking).

There are many reasons for this, but a key one is that the query optimizer's end goal is NOT to come up with the best execution plan possible for a given query, but to create one that is good enough, as quickly as possible. The optimizer also has a limited amount of time before it stops exploring permutations and simply goes with the best "draft" it has at that point.

When It Comes to T-SQL

My TL;DR advice: "Less is more… work for the query optimizer." Keep your queries simple, even if that means sending more discrete queries to the optimizer, and break the work down into intermediate steps. It may run counter to other programming languages, but in the world of T-SQL, less is NOT more.
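Here's a rough illustration of what I mean (the tables and columns are invented purely for the example): rather than one monolithic statement, stage the work in simple steps.

/* Invented tables/columns, purely to illustrate the approach */
-- Step 1: isolate the rows you actually care about
SELECT c.CustomerID, c.Region
INTO #ActiveCustomers
FROM dbo.Customer AS c
WHERE c.IsActive = 1;

-- Step 2: do the heavy aggregation on its own
SELECT o.CustomerID, SUM(o.OrderTotal) AS TotalSales
INTO #SalesByCustomer
FROM dbo.OrderHeader AS o
WHERE o.OrderDate >= '20210101'
GROUP BY o.CustomerID;

-- Step 3: a simple final join - three easy plans for the optimizer instead of one giant one
SELECT ac.CustomerID, ac.Region, sbc.TotalSales
FROM #ActiveCustomers AS ac
INNER JOIN #SalesByCustomer AS sbc
   ON sbc.CustomerID = ac.CustomerID;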

T-SQL Tuesday #148 – Building Community

This month, Rie Merritt is asking us to share advice about User Groups. I had the honor of running the Chicago Suburban User Group for over 6 years.

Build Community

My advice is something I stumbled upon by complete accident. Prior to our first meeting, we sent out a SurveyMonkey asking attendees what they wanted to get out of these meetings. Networking was a top answer. And on the drive over to our first meeting, I came up with an idea.

When we kicked off, I explained that networking had topped the survey results, so I simply asked everyone to introduce themselves to whomever they were seated by. I said to just share your name, your role, and what you do. I expected this to last maybe 2-3 minutes tops. Instead, it turned into 15-20 minutes of chatter!

As our group evolved, we offered food but had to relocate to a common area to eat (food wasn't allowed in our classroom). So I'd repeat the "introduce yourself" bit, remind people that this is their time to chat, etc. I always made the joke that "if you're an introvert, you're in luck… almost all of us in the room are!" And I said that if you cannot think of anything to talk about, just vent about something you've been struggling with at work lately – we ALL have work headaches and challenges.

And it worked. It worked really… really… well! I’m very proud of the community that we built with the Chicago Suburban User Group.

T-SQL Tuesday #146 Round-up! Upending Preconceived Notions

Hello everyone!

First of all, I want to thank everyone who took some time to participate in this month’s T-SQL Tuesday: Upending Preconceived Notions. There’s a fantastic line-up of contributions! I was tickled that many of the topics chosen were things I once held one viewpoint on, later learned more, and evolved that viewpoint.


That was a lot of fun. And I’m happy to say that I learned some stuff.

Thank you again to everyone who took some time to contribute a blog post!

Dear readers, I hope you take the time to read each and every contribution, and come away learning something new too!

T-SQL Tuesday #146: Disk Storage – More Than Meets The Eye

Welcome to another edition of T-SQL Tuesday. This month, I’m the host and I’ve asked participants to blog about “Upending Preconceived Notions.”

What's Your Preconceived Notion, Andy?

Just over 3 months ago, I joined Pure Storage, which is fundamentally a storage infrastructure company. Confession: I’ve never been an infrastructure OR storage guru. I’ve spent the vast majority of my career focused on slinging code, and not the underlying hardware. But like many things in life, these things can be taught and I’ve been learning a LOT these past few months!

What I Used To Think About Disk Storage

For the vast majority of my career, I just thought of storage as a place to read and write your data from. Storage was merely a receptacle… a place for your data to live. Of course, as data professionals, we want to move tons of data quickly, so things like latency, throughput, and capacity still mattered to me. But beyond that, disk was disk.

So What Changed?

Frankly, I learned about how Pure Storage’s FlashArray worked underneath the covers. (DISCLAIMER: I may slightly butcher what I explain next, as I’m still getting to the point of being able to eloquently articulate the stuff I’ve been learning.) When writing data blocks, FlashArray essentially maintains “metadata” about what data tidbits live where in the storage array. I’m going to over-simplify and say let’s just think of this metadata as a glorified index or table of contents, okay?

So why is that special? FlashArray uses this metadata not only to find where your data lives but to MANAGE it as well. We do things like make copies of our data, and I know we can ALL relate to staring at progress meters moving at a glacial pace. Well, what if all we had to do was make an adjustment to the metadata instead? I'd rather copy a small "table of contents" than an entire book of information, and it's a heck of a lot FASTER!

If you’re curious to learn more about this, here’s a cool blog about ODX and FlashArray. https://blog.purestorage.com/purely-technical/odx-flasharray-engineers-perspective/

Why does all of this matter?

As data professionals, we have to move data around a lot. And it is often a hassle. Refresh dev/QA/staging with a copy of Prod. Reseed an Availability Group. What has blown my mind is that the clever decisions Pure has made about how data is stored in FlashArray now enable data professionals to attack those copy/data-movement headaches in ways they could not before! I'm amazed at some of the things that can be done almost instantaneously that would take hours on another vendor's storage array. And that is extremely exciting to me!

Revelations

As I’m writing this very blog post, I just had a personal revelation.

Years ago, I attended SQLskills' IE1 class. One of the first topics was SQL Server storage engine internals, and it was another one of those eye-opening moments in my career journey. And I just realized that my new understanding of FlashArray and how it works underneath the covers has really opened my eyes and changed how I view storage SOLUTIONS! This bit is totally just a personal notation, but I figured I'd write and share it here anyway!

Think Back

Anyway, I hope you all enjoyed this story. Sorry if it came across as a Pure Storage commercial – it wasn’t meant to be, but damn, they’re doing some really wicked awesome stuff underneath the covers!

Regardless, if you’re reading now, think back over the course of your career and ask yourself if there’s ever been a time when you’ve learned something that’s totally changed your outlook or view on something bigger.

Thanks for reading!

T-SQL Tuesday #146: Upending Preconceived Notions

Welcome back to another edition of T-SQL Tuesday! I’m honored to be your host once again!

Theme to Kick off 2022

This month, I'd like to ask everyone to think about something you've learned that subsequently changed your opinion/viewpoint/etc. on something. Maybe you've had a certain opinion, belief, or bias? Perhaps you've always thought something worked a certain way? Or you've always thought that a certain something (called "X") was only good for "A", only to later learn that it can help with "B" and "C" as well. Regardless, you learned something and it totally upended and changed that preconceived notion you held previously.

When has this happened to you, Andy?

Let me share an example. In my past as a T-SQL developer, I remember when I first learned about CTEs. I thought the world of them and started using them everywhere! However, there was one slight problem. I was under the mistaken impression that they pre-materialize each sub-query. And they do… in OTHER RDBMSs. Whoops! After a few years, I learned that they don't behave that way in SQL Server. Instead, the query optimizer inlines the query in the CTE, making it functionally no different than a subquery. And well, let's just say that made me regret some of the coding decisions I'd made during my "CTEs-are-awesome" phase.
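To make that concrete, here's a tiny invented example (the table and columns are made up): reference the CTE twice and SQL Server expands it twice, exactly as if you had pasted the subquery inline in both places.

/* Invented example - the CTE below is NOT materialized once and reused */
WITH OrderTotals AS (
   SELECT CustomerID, SUM(OrderTotal) AS TotalSales
   FROM dbo.OrderHeader
   GROUP BY CustomerID
)
SELECT cur.CustomerID, cur.TotalSales
FROM OrderTotals AS cur
INNER JOIN OrderTotals AS peer   -- second reference: the aggregation runs again
   ON peer.CustomerID = cur.CustomerID;
-- The execution plan scans and aggregates dbo.OrderHeader twice, just like two inline subqueries.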

Rule Refresher

To remind you of the rules, please write a blog post corresponding to this month’s theme, include the T-SQL Tuesday logo, link back to this announcement post for the trackback, and leave a comment here with a link back to your contribution. Deadline is end of day Jan 11, 2022 (don’t worry about time zones, I’m not picky about late entries).

The other thing to remember is, your blog is your blog. If you want to gently twist the theme, to write a blog that better suits you, by all means do so! Just want to encourage you all to participate!

Go Forth and Blog!

So take some time to think about something you’ve learned that subsequently upended a preconceived notion you held before! I look forward to reading your contributions.

Happy Blogging!

The Future Is Orange

A month ago, I blogged about my last day with SentryOne.

And today, I’m thrilled to share that I’ve had my first day at Pure Storage! I’m joining Marsha Pierce’s team of Field Solutions Architects, where I’ll be doing all sorts of awesome stuff with SQL Server and Pure Storage technology. I know I have a lot to learn but am extremely excited to grow and learn once again.

Burnout and stagnation suck. It's okay to take time off, and it's okay to shake things up, move out of your comfort zone, and challenge yourself again.

T-SQL Tuesday #143 – Random Fun

This month’s T-SQL Tuesday is hosted by John McCormack, who asked participants to blog about Short Code Examples.

A few years ago, I wrote a similar blog post about T-SQL tools that I could not live without. Those were my snippets files, which I maintain in Google Drive. Not much has changed over the 4 years I spent working for SentryOne, but I still maintain the entire toolset and used it periodically.

For today's post, I'm going to dive into my 1_Scripts.sql file and share some of my favorites, all involving "random" values!

Random Fun

First, there are times where you need multiple random numbers. Simply using multiple iterations of RAND() in a single statement won’t cut it, as you need to vary the seed. So I keep this snippet handy for when I need a bunch of random values in a single statement:

/* Random number between 0-99: seed RAND() with NEWID() so each call returns a different value */
SELECT
   CAST((RAND(CAST(NEWID() AS VARBINARY)) * 100) AS INT),
   CAST((RAND(CAST(NEWID() AS VARBINARY)) * 100) AS INT);

/* Random float between 0 and 1 - two different ways to seed RAND() per call */
SELECT
   RAND(CHECKSUM(NEWID())),
   RAND(CAST(NEWID() AS VARBINARY));

Wait For Me Randomly

During my time at SentryOne, I found that I had to create mock workloads on a regular basis. I wanted to introduce some unpredictability to make demos more interesting, so I wrote this chunk of code to introduce random WAITFOR DELAY intervals.

/* Shell Agent Job with randomized runtime */

DECLARE
    @DelayMin INT = 5,
    @DelayLength CHAR(8) = '00:00:00'; -- hh:mm:ss
SELECT
    -- Create a randomized 5-15 MINUTE WAITFOR DELAY (% 11 yields 0-10, added to @DelayMin = 5)
   @DelayLength = '00:'
   + RIGHT('0' + CAST((@DelayMin + CAST((RAND(CAST(NEWID() AS VARBINARY)) * 100) AS INT) % 11) AS VARCHAR(2)), 2)
   + ':00';
   -- Or create a randomized 5-15 SECOND WAITFOR DELAY instead
   --@DelayLength = '00:00:'
   --+ RIGHT('0' + CAST((@DelayMin + CAST((RAND(CAST(NEWID() AS VARBINARY)) * 100) AS INT) % 11) AS VARCHAR(2)), 2);

SELECT @DelayLength AS '@DelayLength';

PRINT 'START - Do Something';

-- DO SOMETHING HERE;
WAITFOR DELAY @DelayLength;

PRINT 'END - All Done With Something';

For My Final Trick, Random Strings

Much of my AutoDealershipDemo database contains random strings for various things, so I needed code to generate them. I'm using a Tally Table approach, which combines CHAR() with modulus math to generate lowercase alphabetic characters. But this can easily be adapted for other character values available via CHAR().

; WITH Ten_CTE (N) AS (
   SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
   SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1
), Thousand_CTE (N) AS (
   SELECT 1 FROM Ten_CTE a CROSS JOIN Ten_CTE b CROSS JOIN Ten_CTE c
), Tally_CTE (N) AS (
   -- This provides the "base" CTE and limits the number of rows right up front
   -- for both a performance gain and prevention of accidental "overruns"
   SELECT
      ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
   FROM Thousand_CTE
   -- CROSS APPLY Thousand_CTE t2 -- UNCOMMENT IF YOU WANT 1,000,000 results instead of 1,000
)
SELECT
   ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS RecID,
   -- Increase concat below to increase length of string
   CHAR(ABS(CHECKSUM(NEWID())) % 26 + 97)
   + CHAR(ABS(CHECKSUM(NEWID())) % 26 + 97)
   + CHAR(ABS(CHECKSUM(NEWID())) % 26 + 97)
   + CHAR(ABS(CHECKSUM(NEWID())) % 26 + 97)
   + CHAR(ABS(CHECKSUM(NEWID())) % 26 + 97)
   + CHAR(ABS(CHECKSUM(NEWID())) % 26 + 97)
   AS RandomString
FROM Tally_CTE;


Hope you’ve enjoyed reading. These will most likely not be useful to the vast majority of you. But if you do ever have to write some mock workload, demo scripts, etc., then hopefully you’ll remember this blog post and make use of the above!

Standing Desk: V3

It seems crazy to me that it’s been FIVE YEARS since I last blogged about my Standing Desk setup, and boy a LOT has changed!! During those years, I moved several times, but my Ikea standing desk stuck with me.

Out With The Old

This month marks a big career change for me so I decided it was finally time to retire my Ikea setup. It is sturdy and solid enough but its biggest drawback is that it is not really convertible between sitting & standing. Technically you could change it up with the pegs and trestle legs, but that’s very impractical given how much heavy hardware “permanently” lives on the desk to begin with.

My old standing desk built from Ikea parts

So What Did You Want?

My new office space is a smaller converted attic room, which presented some limiting factors – the primary one being desk width, because of where I can place a standing desk under my sloped ceilings. This limited me to setups around 40 inches wide. Because of another storage dresser, the desk had to overlap my window, but I don't mind. I also wanted to treat myself to a real wood desktop vs the cheap corrugated Ikea top I've had since the beginning.

What’d You Wind Up Buying?

After a ton of research, I wound up selecting Uplift. Went with a walnut butcher block top, C-frame, and decided to be a bit different with their “industrial” finish. I enjoyed having a monitor shelf, so added that option as well. The biggest drawback to my old setup was that I lost desktop space to my keyboard & mouse, so I also opted for an under-desk keyboard tray.

Uplift desk pre-assembly photo
C-frame almost done!
Had to stop and actually do some work, so used a TV tray! 😀
All stood up!

And here’s the final product! I kept my Vivo laptop stand, added some cheap LED under-cabinet kitchen lighting that we had leftover from our last place, and picked up a set of Cyber Acoustics CA-SP34BT 2.1 speakers for music.

Seated position w. my Steelcase Gesture
Standing position

What’s That Underneath Your Desk?

But it's what's underneath my desk that's noteworthy. I'd been using an Ergodriven Topo mat for years and think it's great. But I wanted something more. A few years ago, I'd learned of under-desk ellipticals for sitting desks and began to wonder if there were now specialized treadmills that'd fit under a standing desk.

A quick bit of research yielded a number of results that seemed like fairly cheap Chinese treadmills. If I wanted solid quality, I'd be looking at spending at least $800 if not more. One key detail about an under-desk treadmill is that it's not meant for hardcore running. I also wanted something "portable" enough to move out of the way, which is why I did not look for a traditional running treadmill.

Uplift + GoYouth under-desk treadmill

In the end, I opted to go with a lower-end treadmill. I'm optimistic that I can get into the habit of using it regularly, especially while on webcasts or doing less intensive tasks like reading, e-mail "chores," etc. And if it winds up like other home exercise equipment often does (as a clothes hanger), then I won't feel as bad as if I had burned cash on a higher-end model.

Was It Worth It?

My setup has been operational for a few days now and I'm pleased to share that I've already gotten good use out of the treadmill. I ran it during a few meetings in my last workweek, and after a while, I didn't notice the treadmill at all. I'm only walking between 0.8 MPH and 1.4 MPH at this point, to get used to walking while doing stuff. I don't expect to get much faster than that, but I'm okay with that. As long as this thing keeps me moving for an extended period of time AND I can ingrain its use as a daily habit, I'll call it a win.