Data De-Duplication is Awesome

TL;DR

I had the honor of recording not one, not two, but THREE lightboard videos for Pure Storage! And they’re up on YouTube for you to enjoy!

For today’s blog post, I want to focus our attention on the first Lightboard video: how volumes work on FlashArray, and a distinct benefit of FlashArray’s architecture. If you’re not yet familiar with that, I strongly suggest you read “Re-Thinking What We’ve Known About Storage” first.

Silly Analogy Time!

As many of you know from my presentations, I pride myself on silly analogies to explain concepts. So here’s a silly one that I hope will help to explain a key benefit of our “software defined” volumes.

My wife Deborah (b|t) and I have a chihuahua mix named Sebastian. And like many dog owners, we take way too many pictures of our dog. So let’s pretend that Deborah and I are both standing next to one another and we each snap photos of the dog on our respective smartphones.

Sebastian wasn’t quite ready for me, but posed perfectly for Deborah (of course)!

We also have a home NAS. Deborah and I each have our own volume share on the NAS, and each of our smartphones automatically backs up to our respective share. So once on the NAS, each photo would consume its full X megabytes, right? That’s what we’re all used to.

Software Saves Space!

Now let’s pretend that instead of a regular consumer NAS, I was lucky enough to have a Pure Storage FlashArray as my home NAS. One of its special superpowers is that it deduplicates data, not just within a given volume, but across all volumes on the entire array! So for example, if I have three copies of the same SQL Server 2022 ISO file on the NAS, just in different volumes, FlashArray would dedupe them down to one under the covers.

But FlashArray’s dedupe is not just done at the full file level – it goes deeper than that!

So going back to our example, when we each upload our respective photos to our individual backup volumes, commonalities will be detected and deduplicated. Sebastian’s head is turned differently and he’s holding his right paw up in one photo but not the other. Otherwise, the two photos are practically identical.

So instead of having to store two different photos in full, the array can store the identical elements once as a “shared” canvas, plus each photo’s distinct differences. (If your mind goes to the binary bits and bytes that comprise a digital photo, I’m going to ask you to set that aside and just think about the visual scene that was captured. This is just a silly analogy after all.)
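To make the idea concrete, here’s a toy sketch of block-level deduplication: split each “photo” into fixed-size chunks, hash each chunk, and store each unique chunk only once, keeping a per-photo recipe of hashes for reassembly. This is purely illustrative — the function name, chunk size, and byte strings are invented for the example, and a real array like FlashArray uses far more sophisticated (e.g., variable-block) techniques than this fixed-size toy.

```python
import hashlib

def dedupe_store(blobs, chunk_size=4):
    """Store only the unique fixed-size chunks across all blobs."""
    store = {}    # chunk hash -> chunk bytes (each unique chunk kept once)
    recipes = []  # per-blob ordered list of chunk hashes, for reassembly
    for blob in blobs:
        recipe = []
        for i in range(0, len(blob), chunk_size):
            chunk = blob[i:i + chunk_size]
            key = hashlib.sha256(chunk).hexdigest()
            store.setdefault(key, chunk)  # only stored if not already present
            recipe.append(key)
        recipes.append(recipe)
    return store, recipes

# Two "photos" that share most of their canvas but differ at the end.
photo_1 = b"SKY TREE GRASS DOG-LEFT "
photo_2 = b"SKY TREE GRASS DOG-RITE "

store, recipes = dedupe_store([photo_1, photo_2])
raw_bytes = sum(len(b) for b in (photo_1, photo_2))       # 48 bytes logical
stored_bytes = sum(len(c) for c in store.values())        # 32 bytes physical
print(raw_bytes, stored_bytes)
```

Either photo can still be rebuilt losslessly by joining its recipe’s chunks back together — deduplication saves capacity without losing any data.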

How About Some Numbers?

Let’s say each photo is 10MB, and 85% of the two photos’ canvas is shared while the remaining 15% is unique to each photo. With traditional storage, we’d need 20MB to store both photos. But with deduplication technology, we’d only need to store 11.5MB! That breaks down as 8.5MB (shared: photos 1 & 2) + 1.5MB (unique: photo 1) + 1.5MB (unique: photo 2). That’s a huge space savings from consolidating the duplicate canvas!

Think At Scale

As I mentioned earlier, Deborah and I take TONS of photos of our dog. Most happen to be around our house. And because Sebastian is a jittery little guy, we’ll take a dozen shots just to try to get one good one out of the batch. If we had data deduplication capabilities on our home NAS, that would translate to a huge amount of capacity saved — capacity otherwise wasted storing redundant canvases.

What About the Real World?

Silly analogies aside, what does this look like in the real world? On Pure Storage FlashArray, a single SQL Server database gets an average of 3.5:1 data reduction (comprised of data deduplication + compression), and that ratio skyrockets as you persist additional copies of a given database (e.g., federated client databases, non-prod copies for staging, QA, etc.). If your databases are just a few hundred gigabytes, you might not care. But once you start getting into the terabyte range, the data reduction savings starts to add up FAST.
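Translating a reduction ratio into physical capacity is simple division — here’s a tiny helper (the function name is invented for illustration; the 3.5:1 average is the figure cited above):

```python
def physical_capacity(logical_tb, reduction_ratio):
    """Physical capacity consumed after data reduction."""
    return logical_tb / reduction_ratio

# A 10TB database at the cited 3.5:1 average reduction
print(round(physical_capacity(10, 3.5), 2))  # ~2.86 TB actually consumed
```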

Wouldn’t you be happy if your 10TB SQL Server database only consumed 2.85TB of actual capacity? I sure would be. Thanks for reading!
