Backup Internals – Part 5: Balancing Performance

Welcome back to Part 5 of my Backup Internals Series. In this blog, I want to explore some thoughts around maximizing the performance of your SQL Server backup operations.

Blog Series Table of Contents

Fast, Good, or Cheap…

Many of us have heard the tidbit that there’s 3 options to any endeavor: Fast, Good, and Cheap, but you can only ever pick 2. When it comes to accelerating your (FULL) backup performance, there’s a similar set of choices you can make, most of which have a trade-off of some sort.

Gather Requirements

If I were to have an opportunity to re-evaluate an environment’s backup strategy and schedule, there’s a number of questions I’d want to try to answer first. Remember, the scope of this thought exercise is thinking about FULL Backup strategy from a purely performance angle.

Questions to Ask

  • Backup maintenance windows: How long and how frequently?
    Every night for 4 hours? Anytime over the weekend? For 3 minutes on Sunday, between 22:55 and 22:59?
  • How many databases to back up and how large are they?
    Does the server just have 1 database, but it’s 15TB in size? Or do you have 800 databases, one per customer? And of those 800, what’s the individual size distribution? Maybe 700 databases are smaller than 100GB, another 75 databases are between 100GB and 500GB, and the last 25 databases are +500GB?
  • Exclusivity during backup maintenance window?
    Do you have the luxury of no workload running during your maintenance window? Or do you also have to juggle index maintenance, DBCC CHECKDB, or other application processes like ETLs, nightly processing jobs, etc.? Or to put it another way, do you have free reign to max out your SQL Server resources during your window?

Assumptions

  • 1 SQL Server with many databases
  • CPU = 12 cores; RAM = 256GB
  • Backup storage capacity and storage ingest throughput are not a concern (you bought “orange”)
  • Will ignore RPO/RTO business priorities (which I might tackle in a future blog)

Start Doing Math

Now that you know how much time you have and how many/how much you need to back up, you need to start weighing your options. Do you have the luxury to run one single backup job that will back up your databases one after another? Or do you need to split your databases across multiple backup jobs that run in parallel?

If server resources were infinite, one could theoretically kick off an individual backup job for each and every database, all starting at the exact same time. But of course, that makes no sense.

Revisiting What We’ve Learned

We now know that if our database’s data file(s) all reside on a single volume, we’ll only ever get one Reader Thread. So with a 12 core server, if I have full usage of the server’s resources, I may start with 8 backup output files to get 8 Writer Threads. And I might choose to use a large BUFFERCOUNT value and larger MAXTRANSFERSIZE value, which will result in more RAM consumption.

Now let’s pretend that to meet our requirements, we need to run backups in parallel. You might estimate that you need to run 4 backup jobs simultaneously. If you use the above parameters, you’ll may now start overrunning your CPU! Remember it’s not always about 100% CPU utilization either… all cores could be utilized at say 30%, but you could be context switching like crazy.

Who Doesn’t Love a Trace Flag?

So of course the above means you need to test different permutations. But when you’re doing testing, how can you determine the actual resource utilization of a given backup job? This is where some infrequently highlighted trace flags come into play.

Trace Flag 3213 - Generate diagnostic data about BACKUP & RESTORE operation (least)
Trace Flag 3004 - Generate diagnostic data about BACKUP & RESTORE operation (more)
Trace Flag 3014 - Generate diagnostic data about BACKUP & RESTORE operation (all)
These are officially undocumented

Trace Flag 3604 - Redirects output information to SSMS Results Pane
Trace Flag 3605 - Redirects output information to SQL Server Error Log: use if you need timestamps

You’ll see the above three Trace Flag are suffixed with least, more, and all. That’s because the output that each Trace Flag yields seems to have some overlap. And what’s more challenging is that I’ve found inconsistent documentation in older blogs that cover these Trace Flags as well. So I just opt to use them all to cover all of my bases.

-- Turn on Trace Flags
DBCC TRACEON(3604, 3004, 3014, 3213, -1);
GO

When turned on, you will receive a ton of diagnostic information after running a backup operation. Here’s an example (with prefix removed for brevity):

BACKUP DATABASE Sandbox_MultiFile_SingleVol TO
DISK='NUL'
WITH COPY_ONLY, FORMAT, INIT, STATS = 15
GO

--------------
[prefix]: BACKUP DATABASE started
[prefix]: Opening the database with S lock
[prefix]: Acquiring bulk-op lock on the database
[prefix]: Synchronizing with other operations on the database is complete
[prefix]: Opening the backup media set
[prefix]: The backup media set is open

Backup/Restore buffer configuration parameters

Memory limit: 32765 MB
BufferCount: 7
Sets Of Buffers: 1
MaxTransferSize: 1024 KB
Min MaxTransferSize: 64 KB
Total buffer space: 7 MB
Tabular data device count: 1
Fulltext data device count: 0
Filestream device count: 0
TXF device count: 0
Filesystem i/o alignment: 512
Media Buffer count: 7
Media Buffer size: 1024 KB

[prefix]: Preparing the media set for writing
[prefix]: The media set is ready for backup

[prefix]: Effective options: Checksum=0, Compression=0, Encryption=0, BufferCount=7, MaxTransferSize=1024 KB
[prefix]: Checkpoint LSN: 0:0:0
[prefix]: Checkpoint is complete (elapsed = 17 ms)
[prefix]: Start LSN: 3246:76679:230, SERepl LSN: 0:0:0
[prefix]: Last LSN: 3246:76775:1
[prefix]: Scanning allocation bitmaps
[prefix]: Data section: 204684263424 bytes in total
[prefix]: Scanning allocation bitmaps is complete

[prefix]: Calculating expected size of total data
[prefix]: FID=1, ExpectedExtents=780610, IsDifferentialMapAccurate=1
[prefix]: FID=3, ExpectedExtents=781546, IsDifferentialMapAccurate=1
[prefix]: FID=4, ExpectedExtents=780523, IsDifferentialMapAccurate=1
[prefix]: FID=5, ExpectedExtents=780555, IsDifferentialMapAccurate=1
[prefix]: TotalSize=204684263424 bytes
[prefix]: Estimated total size = 204684312576 bytes (data size = 204684263424 bytes, log size = 49152 bytes)

[prefix]: Work estimation is complete

[prefix]: Do the first force checkpoint before copying data section
[prefix]: Checkpoint LSN: 3246:76679:230
[prefix]: Checkpoint is complete (elapsed = 8 ms)
[prefix]: Writing the leading metadata

Shared Backup BufferQ count: 7

[prefix]: Calculating expected size of total data
[prefix]: FID=1, ExpectedExtents=780610, IsDifferentialMapAccurate=1
[prefix]: FID=3, ExpectedExtents=781546, IsDifferentialMapAccurate=1
[prefix]: FID=4, ExpectedExtents=780523, IsDifferentialMapAccurate=1
[prefix]: FID=5, ExpectedExtents=780555, IsDifferentialMapAccurate=1
[prefix]: TotalSize=204684263424 bytes

[prefix]: Copying data files
[prefix]: Data file backup process set to run in serial mode within a volume.
[prefix]: Number of data file readers = 1

15 percent processed.
30 percent processed.
45 percent processed.
60 percent processed.
75 percent processed.
90 percent processed.

[prefix]: InitialExpectedSize=204684263424 bytes, FinalSize=204684263424 bytes, ExcessMode=0

[prefix]: Calculating expected size of total data
[prefix]: FID=1, ExpectedExtents=780610, IsDifferentialMapAccurate=1
[prefix]: FID=3, ExpectedExtents=781546, IsDifferentialMapAccurate=1
[prefix]: FID=4, ExpectedExtents=780523, IsDifferentialMapAccurate=1
[prefix]: FID=5, ExpectedExtents=780555, IsDifferentialMapAccurate=1
[prefix]: TotalSize=204684263424 bytes

[prefix]: Do the second force checkpoint before copying diff section
[prefix]: Checkpoint LSN: 3246:76679:230
[prefix]: Checkpoint is complete (elapsed = 9 ms)
[prefix]: Start pin the log.
[prefix]: Start LSN: 3246:76778:1, SERepl LSN: 0:0:0
[prefix]: Offline the sparse bitmap
[prefix]: Scanning allocation bitmaps
[prefix]: Diff section: 3473408 bytes in total
[prefix]: Scanning allocation bitmaps is complete

[prefix]: Calculating expected size of total data
[prefix]: FID=1, ExpectedExtents=14, IsDifferentialMapAccurate=1
[prefix]: FID=3, ExpectedExtents=13, IsDifferentialMapAccurate=1
[prefix]: FID=4, ExpectedExtents=13, IsDifferentialMapAccurate=1
[prefix]: FID=5, ExpectedExtents=13, IsDifferentialMapAccurate=1
[prefix]: TotalSize=3473408 bytes

[prefix]: Copying data files
[prefix]: Data file backup process set to run in serial mode within a volume.
[prefix]: Number of data file readers = 1
[prefix]: InitialExpectedSize=3473408 bytes, FinalSize=3473408 bytes, ExcessMode=0

[prefix]: Calculating expected size of total data
[prefix]: FID=1, ExpectedExtents=14, IsDifferentialMapAccurate=1
[prefix]: FID=3, ExpectedExtents=13, IsDifferentialMapAccurate=1
[prefix]: FID=4, ExpectedExtents=13, IsDifferentialMapAccurate=1
[prefix]: FID=5, ExpectedExtents=13, IsDifferentialMapAccurate=1
[prefix]: TotalSize=3473408 bytes
[prefix]: Diff section copy finished
[prefix]: Last LSN: 3246:76781:1
[prefix]: Copying data files is complete

Processed 6244992 pages for database 'Sandbox_MultiFile_SingleVol', file 'Sandbox_MultiFile' on file 1.
Processed 6252472 pages for database 'Sandbox_MultiFile_SingleVol', file 'Sandbox_MultiFile2' on file 1.
Processed 6244288 pages for database 'Sandbox_MultiFile_SingleVol', file 'Sandbox_MultiFile3' on file 1.
Processed 6244544 pages for database 'Sandbox_MultiFile_SingleVol', file 'Sandbox_MultiFile4' on file 1.

[prefix]: Copying transaction log
[prefix]: MediaFamily(0): FID=2, VLFID=3246, DataStreamSize=65536 bytes
Processed 1 pages for database 'Sandbox_MultiFile_SingleVol', file 'Sandbox_MultiFile_log' on file 1.
[prefix]: Copying transaction log is complete

[prefix]: Writing the trailing metadata
[prefix]: Writing the end of backup set
[prefix]: Writing history records for NoteBackup
[prefix]: Writing history records for NoteBackup is complete (elapsed = 37 ms)

BACKUP DATABASE successfully processed 24986297 pages in 87.111 seconds (2240.881 MB/sec).
[prefix]: BACKUP DATABASE finished

I’ve grouped together interesting “sub-operations” like when the tail of the log backup is taken. Note the multiple size estimates that are taken throughout as well. Note that there are 4 data files, which is why you see 4 FID entries each time. But also note that all 4 data files are on 1 single volume!

Key TF Output Highlights

Backup/Restore buffer configuration parameters

Memory limit: 32765 MB
BufferCount: 7
Sets Of Buffers: 1
MaxTransferSize: 1024 KB
Min MaxTransferSize: 64 KB
Total buffer space: 7 MB

[prefix]: Effective options: Checksum=0, Compression=0, Encryption=0, BufferCount=7, MaxTransferSize=1024 KB

[prefix]: Number of data file readers = 1

Let’s compare and contrast this to a backup operation with multiple backup output targets (this different database’s data files are spread across 4 different data volumes).

BACKUP DATABASE [Sandbox_MultiFile_MultiVol] TO
DISK = '\\10.21.200.27\ayun-sql-backups\Sandbox_MultiFile_MultiVol_1.bak',
DISK = '\\10.21.200.28\ayun-sql-backups\Sandbox_MultiFile_MultiVol_2.bak',
DISK = '\\10.21.200.70\ayun-sql-backups\Sandbox_MultiFile_MultiVol_3.bak',
DISK = '\\10.21.200.71\ayun-sql-backups\Sandbox_MultiFile_MultiVol_4.bak',
DISK = '\\10.21.200.72\ayun-sql-backups\Sandbox_MultiFile_MultiVol_5.bak',
DISK = '\\10.21.200.73\ayun-sql-backups\Sandbox_MultiFile_MultiVol_6.bak',
DISK = '\\10.21.200.74\ayun-sql-backups\Sandbox_MultiFile_MultiVol_7.bak',
DISK = '\\10.21.200.75\ayun-sql-backups\Sandbox_MultiFile_MultiVol_8.bak'
WITH FORMAT, INIT, STATS = 10,
MAXTRANSFERSIZE = 2097152,
BUFFERCOUNT = 500, COMPRESSION;
GO

--------------

Backup/Restore buffer configuration parameters
Memory limit: 98295 MB
BufferCount: 500
Sets Of Buffers: 3
MaxTransferSize: 2048 KB
Min MaxTransferSize: 64 KB
Total buffer space: 3000 MB
Filesystem i/o alignment: 512
Media Buffer count: 500
Media Buffer size: 2048 KB
Encode Buffer count: 500

Backup(Sandbox_MultiFile_MultiVol): Number of data file readers = 4

I think there’s a lot of really cool data in here. Remember that you can use Trace Flag 3605 to log this information to the Error Log. Thusly, you could use that to record what your BACKUP jobs are currently doing and review them later at your leisure (just remember it turn the TF off, lest you bloat your logs).

If you want to see more examples of these Trace Flags in action, visit my GitHub and check out my demo scripts from the original presentation. I introduce the Trace Flags in the 1_Baselining.sql script, then leverage them throughout the other demo scripts.

Workers vs CPU Cores

One more thing to keep in mind are the total number of worker threads you have available on your SQL Server. That’s different than the number of CPU cores you happen to have. I’m not going to dive deeper into this one today but will leave this old but still valid blog from Bob Dorr for reference instead: “How It Works: How many databases can be backed up simultaneously?

Bringing It All Together

Hopefully now, with the above information, you can see that you really need to do an analysis and essentially create a matrix, based on your requirements and available resources. This will help you determine how much you want to crank up your database tuneables to best accelerate your FULL backups for your environment.

Hope You’ve Enjoyed This Series

This’ll most likely be the final blog of this series, though I might do an “addendum” with random tidbits. Let me know in the comments if there’s anything else you’d like me to explore in an epilogue.

Thanks for reading!

2 thoughts on “Backup Internals – Part 5: Balancing Performance

Leave a comment

This site uses Akismet to reduce spam. Learn how your comment data is processed.