Linux HowTo: How to recompress daily SQL database dumps to 0.8% of their size
January 23, 2006 - Filed in Linux HowTos by Felix
I have a cron job installed on the office server, that is using mysqldump to create a full database backup every day. The script adds a timestamp into the filename and compresses the output via gzip to save some space. The cron job generates backup files that are stored in the filesystem like this:
/some/backupdir/oxygen-2003-01-01-00.sql.gz
/some/backupdir/oxygen-2003-01-02-00.sql.gz
/some/backupdir/oxygen-2003-01-03-00.sql.gz
/some/backupdir/oxygen-2003-01-04-00.sql.gz
..
After a couple of years, however, this cron job is starting to fill up the office server's harddrive. So it was time to think about a solution that would
- reduce the disk space needed by the backups
- was applicable on the already saved backup files
- did not loose any data
- provided an easy way to reconstruct any backup file desired.
The solution I came up with is to write a script that
- decompresses all gzipped sql.gz files
- creates diff files that only contain the changes between two days' logs
- creates a reconstruction script that can be used to reconstruct all files
- packages up all diffs and the reconstruction script into a bz2 file
Here's the code (please read this posting to its end before using it):
#!/bin/bash
#
# mysqldump_archive_recompress
#
# This script decompresses all SQL backups of a year,
# does an incremental diff and repacks the whole thing as
# gzip compressed tar ball along with a reconstruction
# script
#
# (c)2006 Felix Schwarz.
## Number of the year to convert to an archive
YEAR=2003# The actual script
mkdir sql_recompress_$YEAR
cd sql_recompress_$YEAR
cp -av /some/backupdir/oxygen-$YEAR-*.sql.gz ./
gzip -d *.sql.gz
echo >EMPTY
echo "LASTFILE=EMPTY" >temp_script_$YEAR
echo "echo >EMPTY" >reconstruct_script_$YEAR
find *.sql -printf "diff \$LASTFILE %p >%p.diff\necho patch -b -p0 \$LASTFILE -o %p \< %p.diff >>reconstruct_script_$YEAR\nrm \$LASTFILE\nLASTFILE=%p\n" >>temp_script_$YEAR
echo "rm \$LASTFILE" >>temp_script_$YEAR
bash temp_script_$YEAR
echo "rm EMPTY" >>reconstruct_script_$YEAR
tar -czvf ../oxygen_sql_diffs_$YEAR.tar.gz reconstruct_script_$YEAR *.sql.diff
cd ..
rm -R sql_recompress_$YEAR
echo "Done!"
Compared to the result of recompress-unifying all dumps (which gives me a tar.gz file of 5.443.314.049 bytes), the file created by this script is only 43.863.957 bytes in size. That's 5.4 GB vs. 43 MB. That means the solution presented here needs only 0.8 % of what simple recompression would require. Or an additional compression ratio of 99.2% on top of bzip2.
One weakness of the above code though is, if two backups who directly follow one another are identical. In this case diff will produce an empty file, patch will write an empty result file on reconstruction and all following reconstruction will then fail. This can be easily fixed by removing the respective lines from the reconstruction script. Or you enhance the script to filter out identical files before diffing or handle empty result files from patch by checking its size. There are many possibilities. Yet, since I don't need this functionality, I haven't implemented it in the script.
Feel free to adapt above code to your needs and use it for your own purposes. Just do so on your own risk (I accept no liability whatsoever!). Give me credit ("Based on work (C)2006 by Felix Schwarz (http://www.felix-schwarz.org/).") in case you redistribute any part of it.
Enjoy!


0 comment(s):
Write a comment
The comments are closed for this article.