Save a directory(-tree) in a tar.gz archive
and make incremental backups

Copyright (C) 1996, 1997 DFG/M.Watermann, D-30177 Hannover
Copyright (C) 1998, 2007 M.Watermann, D-10247 Berlin, FRG
All rights reserved
EMail : support@mwat.de

Table of Contents:

    Overview
    Archive Files
    Commandline Options
    Exclusions
    Misc Notes
    System tools used
    ChangeLog

Overview

This script provides a quite simple but often sufficient backup mechanism. Presuming it's called by crond once a day, on every first of month it makes a complete backup of the directory tree given as the last commandline argument (--dirs name; see Commandline Options below), replacing last year's respective monthly backup (if it existed, that is). Similarly, on each Sunday a full backup is made as well, replacing last Sunday's backup file. On weekdays, however, only those files that changed (or were added) since the time of the last full backup (i.e. last Sunday or first-of-month, whichever was later) are saved. These incremental backup files are automatically removed after a new weekly backup has been successfully created.

So, for every directory you back up this way you'll usually get these files: up to twelve monthly full backups (one per first-of-month), one weekly full backup (replaced each Sunday) and up to six daily incremental backups (one per weekday).

If the first day of a month happens to be a Sunday, there will physically be one file less: in this special case (which is not that seldom) the month's backup is hard-linked to the Sunday file, saving processing time and disc space (instead of creating two separate files for month and Sunday with identical contents).
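
A minimal sketch of how such a hard link could be created (the variable names and date calls are assumptions for illustration, not the script's actual code):

    # if the 1st of the month is also a Sunday, link the monthly name to the
    # already created weekly archive instead of packing the tree a second time
    if [ "$(date +%d)" = "01" ] && [ "$(date +%w)" = "0" ]; then
        ln -f "${DEST}/${NAME}-w00.tar.gz" "${DEST}/${NAME}-m$(date +%m)1.tar.gz"
    fi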

Incremental, by the way, is not meant on a true day-by-day basis here, but relative to the last full backup (i.e. last Sunday or first of month, whichever was later). While this may use a little more disc space, it has the benefit that in case of a disaster (and thus the need for restoring your data) you'd need at most two files: the last full and the last incremental backup. Presuming your machine crashes on Friday, you'd have to restore the Sunday (full) backup and the Thursday (incremental) backup, in that order. You would not need the Monday/Tuesday/Wednesday files. If your machine crashed on Monday, however, you'd only have to deal with the Sunday (full) backup.
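
A hedged example of such a two-step restore, using the file naming explained in the Archive Files section below (the mapping of d4 to Thursday assumes the usual day-of-week numbering, and the target directory depends on your setup):

    # machine crashed on Friday: first the last full backup, then the last
    # incremental one on top of it
    cd /    # or wherever the saved tree belongs
    tar xzf /home/users/jane/backup/Castle.home.users.jane-w00.tar.gz   # Sunday, full
    tar xzf /home/users/jane/backup/Castle.home.users.jane-d4.tar.gz    # Thursday, incremental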

The following sections discuss in more detail the handling of the created backup archives, the naming conventions and the various optional commandline switches and how they change the normal behaviour of this script as outlined before.

[up to Table of Contents]

Archive Files

Let's assume you're called Jane (at least as far as the computer world is concerned), your hostname is Castle.do.main and you're going to back up your personal home directory, which happens to be /home/users/jane. The destination of the backup file is a directory exported by some other host within your LAN. I won't go into detail here about how to (auto)mount such a remote directory; it's enough to say that Miss Ruth kindly provided a sym-link into your home as ~jane/backup/. Further assuming this script is reachable through your PATH setting, you'd call it in your personal crontab file like this:

    30 3 * * * incBackup -b ~/backup -d ~

This runs the script at half past three every morning, over the year producing the following files:

    /home/users/jane/backup/Castle.home.users.jane-d1.tar.gz
    /home/users/jane/backup/Castle.home.users.jane-d2.tar.gz
    /home/users/jane/backup/Castle.home.users.jane-d3.tar.gz
    /home/users/jane/backup/Castle.home.users.jane-d4.tar.gz
    /home/users/jane/backup/Castle.home.users.jane-d5.tar.gz
    /home/users/jane/backup/Castle.home.users.jane-d6.tar.gz
    /home/users/jane/backup/Castle.home.users.jane-w00.tar.gz
    /home/users/jane/backup/Castle.home.users.jane-m011.tar.gz
    /home/users/jane/backup/Castle.home.users.jane-m021.tar.gz
    /home/users/jane/backup/Castle.home.users.jane-m031.tar.gz
    /home/users/jane/backup/Castle.home.users.jane-m041.tar.gz
    /home/users/jane/backup/Castle.home.users.jane-m051.tar.gz
    /home/users/jane/backup/Castle.home.users.jane-m061.tar.gz
    /home/users/jane/backup/Castle.home.users.jane-m071.tar.gz
    /home/users/jane/backup/Castle.home.users.jane-m081.tar.gz
    /home/users/jane/backup/Castle.home.users.jane-m091.tar.gz
    /home/users/jane/backup/Castle.home.users.jane-m101.tar.gz
    /home/users/jane/backup/Castle.home.users.jane-m111.tar.gz
    /home/users/jane/backup/Castle.home.users.jane-m121.tar.gz

In case that seems too much for your server's disc (or Miss Ruth) to handle and you know what you're doing, you may use the --shortmonth commandline switch (see below), which leaves the month number out, causing each monthly full backup (on the respective first of month) to overwrite the one of the month before. While saving the disc space of 11 full backups, you won't be able to recover lost data if you only notice the loss after the month has changed (since the current month's full backup wouldn't contain those files anymore). Another option for you could be the --trueincremental switch (see the section Commandline Options below).

As can be seen above, the full path/filenames of the backup archives follow a consistent pattern: /backupdir/hostname.saveddir-WD.tar.gz, where W (when) stands for either m (a monthly full backup), w (a weekly/Sunday full backup) or d (a daily incremental backup), and D (day) stands for either 00 (weekly), 1-6 (daily) or MMD (monthly) with MM indicating the month and D the day-of-month. The hostname, as you see, is used in its short form since a FQDN would only make the filenames much longer without any benefit; and storing the backups from hosts of different domains in one and the same server directory would be a very bad idea anyway. – So it shouldn't be too difficult to figure out the meaning/relevance of a given file in your backup directory.
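
A minimal sketch of how such a name could be assembled (the variable names and the exact date calls are assumptions, not the script's actual code):

    HOST=$(hostname -s)          # short hostname, e.g. "Castle"
    TREE=home.users.jane         # the saved directory with its slashes turned into dots
    DOW=$(date +%w)              # day of week, 0 = Sunday
    if [ "$(date +%d)" = "01" ]; then
        SUFFIX="m$(date +%m)1"   # monthly full backup, e.g. m121 on December 1st
    elif [ "$DOW" = "0" ]; then
        SUFFIX="w00"             # weekly (Sunday) full backup
    else
        SUFFIX="d${DOW}"         # daily incremental backup, d1..d6
    fi
    echo "${HOST}.${TREE}-${SUFFIX}.tar.gz"   # e.g. Castle.home.users.jane-w00.tar.gz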

If you think about it, this procedure looks quite safe, doesn't it? But, alas, there's still a timeframe where there's a chance to lose data (in case something silently corrupts your data, that is). Consider a scenario where everything's all right until, say, the third of December. On that very day you discover (not without lengthy investigation) that your harddisc seems to have some serious problems with the remapping of bad blocks, and the worst of it: either the disc failed to mention its problem to your operating system (the harddisc device drivers) or the latter didn't care very much. Whatever the technical reason was, the result is that several of your files happen to be, well, deranged, but – unfortunately in this case – still readable as far as the operating system is concerned (otherwise you would have noticed the problems earlier). "Well", you might think while Miss Ruth is replacing the faulty disc, "bad luck, but I've got all those backup files on my server, so it's only a question of tar xzf filename to restore the whole thing."

Let's see which files are there for you to start with: The Sunday/weekly full backup, as it turns out, already contains some corrupted files (which also renders the incrementals unusable). Since that very Sunday happens to be the first of the month (in 2002) the same applies to the monthly backup. The weekly backup before that (which was created on 2002-11-24 and might still have had uncorrupted files) is gone (replaced by the one of 2002-12-01). So you'll have to go back to the November full backup (as of Friday 2002-11-01), which means that in effect the changes of a whole month are lost. Not that good at all, wouldn't you say?

This is where the --weeknumber switch comes in: It changes the name of the Sunday backups from .../Castle.home.users.jane-w00.tar.gz to – in this case – .../Castle.home.users.jane-w48.tar.gz, which in turn means that there's also a file from the Sunday before named .../Castle.home.users.jane-w47.tar.gz and respective ones for all the weeks before that as well. So now – with our made-up scenario – how far you have to go back no longer depends on the way the backup files are rotated (replaced) by newer ones, but only on how long your faulty disc has been silently corrupting your files. Probably you'll have to go back just one week, maybe several weeks. In any case, the range of time of lost data is much smaller when going back week by week than it would be going back month by month.

But, as you have probably figured out by now, there's a price to pay for this additional safety: more disc space usage on your backup server (hopefully not the one with the faulty disc). To keep this additional disc space as low as possible the --weeknumber switch has another side effect: the first-of-month is handled like any other day, thus disabling the monthly full backups, which wouldn't make much sense anyway since we've got all the Sunday full backups already.

So, in short, usually you'll get up to 13 full backups per directory (12 first-of-month files and the last Sunday's weekly backup). With the --weeknumber switch instead you'll get up to 52 full backups (one for every Sunday in a year), IOW: four times the usual number. That's the reason why the --weeknumber option has to be given explicitly on the commandline. – Of course, you (or Miss Ruth) could deal with the disc space issue e.g. by using the tmpwatch utility (see man 8 tmpwatch for details) or tmpreaper (see man 8 tmpreaper) to automatically remove backups older than, say, 100 days (actually tmpwatch uses an hours argument but I presume you know the formula for converting days to hours), or you could use find (see man 1 find).
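
For the find variant, a hedged one-liner (path and age are placeholders, adjust them to taste):

    # remove backup archives older than 100 days
    find /path/to/backups -type f -name '*.tar.gz' -mtime +100 -delete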

About cronjob times: If you happen to live in an area with daylight saving time (summertime / wintertime) you should make sure that the chosen time (03:30 above) does not fall into that special period which passes twice (summer to winter) or not at all (winter to summer). In Europe, for example, those are the hours between 01:00 and 03:00 (I guess elsewhere it will be similar).

As an alternative to setting up a personal cronjob like the one shown above you could ask Miss Ruth to put a script like the following in the system's /etc/cron.daily/ directory:

    #!/bin/sh
    # remove backups older than three weeks:
    /usr/sbin/tmpwatch --nodirs --fuser --mtime 504 /path/to/backups/
    # backup several system and user directories:
    /opt/bin/incBackup -b /path/to/backups -d /etc /root /var/www/html \
        /home/users/* /home/helpers/* /home/admins/* /whatever/else
    #EoF

The various user directories, as you see, are given separately instead of saying just /home. The latter would work, of course. It would, however, create one huge backup file containing the data of all the subdirectories under /home (i.e. the user home dirs). Giving the directory names as a wildcard (with /bin/bash expanding it to a number of names), on the other hand, results in separate backup files for each user home directory. That's what you'd most probably want, isn't it?
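
To illustrate the difference (the paths are those of the example script above):

    # one single huge archive covering everything below /home:
    /opt/bin/incBackup -b /path/to/backups -d /home
    # one separate archive per user home directory (bash expands the wildcard):
    /opt/bin/incBackup -b /path/to/backups -d /home/users/*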

As you can see, Miss Ruth keeps her backup archives for at most 504 hours (guess how many days that is). I'd say one could even reduce this to ~300 hours: Someone who – within two weeks – neither notices that something's wrong with his/her data nor copies the backup files to somewhere else for safety suffers not only from technical faults but has some, er, serious, say, mental problems as well.

To have a backup permanently available, it's not enough to just store it on some other machine: What if that one crashes as well? So it's rather likely that you will have to do something with the backup archives. Storing to tape or burning to CD comes to mind. Whatever you're up to, you should watch the file sizes of the produced (full, i.e. monthly/weekly) backup archives and, if necessary, adjust the directories (passed with the --dirs argument) so that the backups won't become larger than whatever fits on your chosen storage media. (And see the section about Exclusions below as well.)

Let's assume a CDROM (i.e. ~600MB) for permanent saving of your archived data. Considering that most of the files in your home directory are (kind of) text and thus compressible to, say, ~60% of their original size, these numbers would mean that about 1GB of (raw) data would fit into a backup archive writable on CD. Of course, this highly depends on the files actually stored in your directories. If there are lots of, say, sound or video files that can't be compressed much further, you might even end up with a 1:1 ratio. Redesigning your directory structure and calling this script not only once for the whole structure but for several sub-directories instead, combined with well chosen Exclusions (see below), might help in such cases. – Anyway, you get the idea, don't you?
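
The rough arithmetic behind that estimate (the 60% figure is, of course, just an assumption about your data):

    600 MB (CD capacity) / 0.6 (compressed size as fraction of original) = ~1000 MB, i.e. ~1 GB of raw data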

Just to summarize it finally: The usual call would be

    incBackup -b /destination/dir -d /some/dir/to/save

with up to 12 monthly full backups per year, one weekly full backup as of last Sunday and six daily incremental files with changes Sunday-to-weekday. The least safe (and least disc space consuming) call would look like

    incBackup -b /destination/dir -s -T -d /some/dir/to/save

with only one monthly full backup per year, one weekly full backup and six daily incremental files with changes day-by-day. In contrast, the safest (and most disc space consuming) call would look like

    incBackup -b /destination/dir -w -d /some/dir/to/save

with up to 52 weekly full backups per year and six daily incremental files (holding changes Sunday-to-weekday). In between would be calls like incBackup -b /destination/dir -T -d /some/dir/to/save (12 monthly and 1 weekly full backup, six day-by-day incrementals) or incBackup -b /destination/dir -T -w -d /some/dir/to/save (52 weekly full backups, six day-by-day incrementals; this is, in fact, the way I use it in my personal crontab to back up my home dir). – It's completely up to you to use those options that best match your very personal balance of comfort, disc usage and safety.
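
For the record, a hedged example of such a personal crontab entry (the time and paths are those of the earlier example; yours will differ):

    30 3 * * * incBackup -b ~/backup -T -w -d ~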

[up to Table of Contents]

Commandline Options

As mentioned (or indicated) before, there are some commandline switches in both short and long variants which must ([m]andatory) or may ([o]ptional) be used. The long option names may be shortened to at least three letters; the phrase prints out means the output is written to stdout (i.e. usually your display/screen) so that you can redirect it to a file or pipe it through some other tool (e.g. lpr). – In alphabetical order:

-b destdir or --backupdir destdir
[m]: Here the destdir argument names the destination directory where to store the backup archives. This directory must be writable, of course, for the user running this script and should provide enough space to hold at least twice the size of the directory to save. It may be a (host-)local directory or a remote directory accessed by e.g. NFS or SMB (Samba). The filesystem doesn't matter as long as it supports so-called long filenames and stores/provides a file modification time. Ah, and hard-links should be supported as well (either by the filesystem directly or by the transport protocol indirectly). So it's merely a matter of taste (of Miss Ruth).
-d dir1..dirN or --dirs dir1..dirN
[m]: This argument names the directories to back up, each of which will be stored in a separate archive file. So even if given a lot of directories you won't end up with one single huge backup file. – Backing up just the root / of your filesystem is not only a bad idea (why waste time and space for data that can be found on your installation CD?), but it won't work at all: Since trailing slashes are automagically removed, the root (/) would be reduced to an empty string (which, of course, is ignored). Please note that --dirs has to be the very last option used on the commandline: all following arguments are assumed to be directory names.
-F or --force
[o]: Usually a full (complete) backup is made on every first of month and every Sunday. If, for one reason or another, you want to create a full backup on a weekday, this switch makes the script behave as if it were the Sunday of the current week. – Please note that a possibly existing full backup of the real Sunday will be replaced, so use with care, e.g. only to create an initial full backup when running this script for the first time.
-h or --help
[o]: Print out a short usage note and terminate without any further work. (See also the --html and --info options below.)
-H or --html
[o]: Print out this text in HTML format, suitable for reading with an HTML viewer (aka web-browser), and terminate without any further work. (See also the --help and --info options.) [This switch produced what you're reading right now.]
-I or --info
[o]: Print out this text in ASCII format, suitable for printing, and terminate without any further work. (See also the --help and --html options above.)
-q or --quiet
[o]: Without this switch tar will (besides error messages) print out a brief totals report when finished with the backup. In case you neither need nor want that, this switch suppresses it. (See also the --verbose option below.)
-s or --shortmonth
[o]: As mentioned above (see section Archive Files), usually the name of the monthly backup will contain (besides the host and directory names) both the number of the month (e.g. 03 for March) and the number of the day (i.e. 1). To omit the month, you can use this switch. In consequence, each monthly backup replaces the one of the month before, not the backup of the same month a year ago as it would otherwise. – This switch obviously has no effect if the --weeknumber option is used as well, since in that case no monthly backups are created anyway. (See also the --force and --weeknumber options as well as the discussion of Archive Files above.)
-S or --strip
[o]: This is an option you'll most probably never use: It prints out the source of this script with all comments and whitespace (and thus formatting) completely stripped off. – While the script's intended functionality is kept intact (as bash does not depend on indentation), the --html and --info options, of course, won't print anything (all the fine docs are gone!), and it will give you a hard time reading and understanding the source (since all the structure/indentation and comments are gone as well). The only benefit of such an operation is the size reduction from ~40KB to ~6KB (from ~970 lines to ~250) and hence, theoretically, a faster startup. So unless you're really very short of disc space, just ignore this option. – I once needed it for an embedded device of a client where space was scarce, and left it here in case someone else might find it useful. (See also the --gzexe option below.)
-T or --trueincremental
[o]: As mentioned elsewhere several times, incremental refers to the last full backup made (either last Sunday or first-of-month). In consequence (a) you have to restore at most two files in case of a crash and (b) the daily backups grow larger from day to day (Tuesday contains the changes of Monday, Wednesday those of Monday and Tuesday, the Thursday file those from Monday to Wednesday, and so on). While this default behaviour is convenient in case a restore is needed, it's not that efficient in terms of disc usage. And since the latter seemed to concern some users more than the former, I introduced this additional switch. It alters the way incremental is interpreted: On weekdays (Sundays still work as usual) it is not the time of the last full backup that is used to determine which files are to be saved, but the time of the most recent backup of any kind, meaning that e.g. the Friday archive will contain only the changes of Thursday. In consequence (a) in case of a disaster you'd have to restore up to seven files (from Sunday to Saturday) and (b) the daily archives will be smaller overall (when saving directories with often-changing contents, that is). – Additionally the names of the backup archives are slightly different (see the section Archive Files above): instead of a d (daily) they have an i (incremental), followed by the day-of-week number, helping you to distinguish the usual daily backups from the true incremental ones. – So you've got the choice between convenience and disc space. Consider it carefully. – Obviously this switch has no effect if the --force option is used. (You can't get both a full backup and an incremental one, can you?)
-v or --verbose
[o]: In case you not only want a totals report by tar but want to see all files processed, this switch is your friend. Additionally it produces a short time summary telling you how long it took to store your backup. – Note that crond usually sends all output of its jobs by email. So check with Miss Ruth that email works (at least locally). (See also the --quiet option above.)
-V or --version
[o]: Print out the current version of this script and terminate without any further action. (See also the --help, --html and --info options above.)
-w or --weeknumber
[o]: Usually the Sunday full backup filename contains just a w00 part to indicate it's a weekly backup (see the section Archive Files above). That file is replaced each Sunday by the new one, so for each directory archived by this script there's always at most one such weekly file. Now, this switch will change the filename part to something like w43 where the number indicates the week of the year (01..53), resulting in up to 52 weekly files, each of which is replaced only by next year's Sunday backup of the respective week. – No separate monthly full backup is made on first-of-month and obviously the --shortmonth option (see above) also doesn't have any effect when used together with this switch.
-Z or --gzexe
[o]: Like the --strip option (see above) this one can be put onto the bells'n'whistles – or better: gimmicks – account. Using it creates not a backup but another version of this script (with an extension of sh appended): stripped down to its bare bones and additionally compressed by gzexe/gzip, thus reduced to an overall size of ~2KB (from ~40KB). If you want to retain the functionality of the --html and --info options (see above), use the --force switch before this one (i.e. incBackup -F -Z), which results in a compressed script of ~15KB in size. Please note that in either case the compressed script relies on gzip and some other utilities (tail, chmod, rm) to run.

To make a long story, er, list short: During normal usage you'll only need the -b and -d options (and possibly -w or -T). Everything else is a luxury or may be potentially dangerous (i.e. less safe) in one aspect or another. As an additional hint, the short commandline switches that are supposed to be given less frequently (or never at all) use uppercase letters.

[up to Table of Contents]

Exclusions

If a directory given with the --dirs option contains (at its relative root) a dot-file named .nobackup, it is assumed (without any testing) to be suitable for passing to tar as a list of patterns for files not (repeat: not) to save (see info tar for more details about this). For a user's home directory such an exclusion file might look like this:

    *~
    *.bak
    */cache/*
    */Cache/*
    */cache?/*
    */devel/*.o

This suppresses backup files made by some editors as well as files in cache directories (e.g. of web-browsers or news-readers); the last line says that we don't want a backup of the object files in our personal development area (after all, we have the sources there). – Generally speaking, you should put into the exclusions file patterns for all those files that are temporary in one way or another, or can be restored (possibly better) from other sources such as install discs. Following this rule keeps your backup archives as small and sensible as possible.
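
A minimal sketch of how such an exclusion file could be handed to tar (the variable names are assumptions, and the script's real invocation differs, see the Misc Notes section below):

    TREE=/home/users/jane
    EXCL=""
    [ -f "${TREE}/.nobackup" ] && EXCL="--exclude-from=${TREE}/.nobackup"
    # pass the pattern list on to tar, if one was found
    tar cz ${EXCL} "${TREE}" > /path/to/backups/Castle.home.users.jane-w00.tar.gz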

If no such .nobackup file exists, however, everything gets stored in the backup archive regardless of whether it makes sense or not, provided there's enough space in the chosen destination directory. You should consider carefully what to exclude; the example above should give you an idea which kinds of files would only waste backup space without any real benefit in case of a disaster.

[up to Table of Contents]

Misc Notes

tar gets called with the --one-file-system option, causing it not to back up files and directories on filesystems other than the one the start directory resides on. This might look like a drawback since the backup archive might seem to be incomplete (compared to the real directory structure in use). But it allows e.g. sym-linked directories from other filesystems to be part of the directory structure without having the files therein saved twice (once as part of your backup and another time when the original directory is saved). And for the cases where real mount points happen to be within the directory to backup: Just call this script with a --dirs argument naming that very mount point and you're done.

Another option passed to tar is the --ignore-failed-read switch which makes sure that the backup archive gets created even in cases where an unreadable file is encountered. This way you'll have at least all readable files saved instead of none at all.

For storing the time of the very last full backup a (hidden) 0-byte dot-file is maintained with a name of .Host.sub.dir.lasttime. This file (i.e. its last-modified date) is used by tar to determine which files should be put into the usual incremental backup archive. Removing that file would result in a complete/full backup the next time this script is run. See the --force commandline argument above for a better way of forcing a full backup. – In case the --trueincremental switch is given, the date of the last backup before the current day (i.e. usually yesterday's file) is used to check for (and back up) files newer than that. If no such yesterday (or day-before-yesterday) file can be found, the date of the last full backup is used, and if that one doesn't exist either, a new full backup is actually made (although the filename remains that of an incremental backup). You don't have to worry about this; I explained it just to assure you that your data will still be saved even if someone deleted one or the other file by mistake.
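
A hedged sketch of how that flag file could drive the backup runs (GNU tar accepts a filename for --newer-mtime and uses its modification time; the names and the file's location here are assumptions, not the script's actual code):

    FLAG=/path/to/backups/.Castle.home.users.jane.lasttime
    # Sunday / first of month: full backup, then renew the flag file's timestamp
    tar cz /home/users/jane > /path/to/backups/Castle.home.users.jane-w00.tar.gz \
        && touch "${FLAG}"
    # weekday: save only the files changed since the flag file was last touched
    tar cz --newer-mtime="${FLAG}" /home/users/jane \
        > /path/to/backups/Castle.home.users.jane-d3.tar.gz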

Over the years of usage those backup files not meant for storage on CD became bigger and bigger, eventually crossing the 2GB border and resulting in corrupted (artificially truncated, thus incomplete) archives. Since I hadn't that much time to investigate, and after experimenting with switching between NFS- and SMB-mounted destination directories, I finally kind of, er, "resolved" the problem by replacing the tar commandline argument -f filename with output redirection. This in consequence means that neither tar nor gzip will ever lseek() (which is limited to 2GB) but only write to stdout, leaving the issue completely to the host/OS receiving the data stream, where the 2GB border may be crossed during write() calls but without lseek(). – Of course I'm well aware that this "solution" is just a workaround (and far from bullet-proof) until all userspace programs (including tar etc.) and filesystems (e.g. SMB and NFS) are fixed to fully use 64-bit addressing schemes. But hey, as long as the hack works, why not use it? – And again: it's better, of course, to keep the backups smaller anyway (see the discussion on Archive Files above).
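
Put together with the options mentioned above, the resulting call is roughly of this shape (a sketch under the stated assumptions, not the script's literal commandline):

    # writing via redirection instead of "-f filename" means tar/gzip only
    # ever write to stdout and never have to seek within the archive
    nice tar cz --one-file-system --ignore-failed-read /home/users/jane \
        > /path/to/backups/Castle.home.users.jane-w00.tar.gz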

[up to Table of Contents]

System tools used

The following utilities are used by this script and are assumed to be reachable in PATH (/sbin:/bin:/usr/sbin:/usr/bin:/opt/bin by default):

date
provides the day-of-month, day-of-week, month-of-year and week-of-year values;
hostname
figures out the short name of the machine running this script;
ln
creates a hard-link if the first of the month happens to be a Sunday (like 1996-09-01, 1996-12-01, 1997-06-01, 1998-02-01, 1998-03-01 etc., just have a look at your preferred calendar);
nice
(if available) reduces the runtime priority;
rm
deletes old destination files (if they exist) to avoid changing a possibly hard-linked file; on Sunday removes obsolete (daily/incremental) archives;
sed
processes/creates the --html and --info output (see Commandline Options above);
tar
finally creates the backup archives.

[up to Table of Contents]

ChangeLog

    $Log: incBackup,v $
    Revision 1.8  2007/11/18 08:35:13  matthias
    * added mentioning of "tmpreaper";
    # fixed a "sed" expression;
    
    Revision 1.7  2007/06/05 07:30:26  matthias
    + implemented use of nice tool;
    * in case of missing (removed) time flag file a weekly backup is done
        possibly overwriting the last weekly backup file;
    * the time flag file is written by echo redirection to avoid problems
        with shells not implementing the redir operator w/o command;

    Revision 1.6  2004/08/09 11:38:05  matthias
    * added a test for the last-time flag for cases where it was removed by
        accident (a complete/weekly backup is made in this case);
    * updated sed scripts and CSS;

    Revision 1.5  2003/01/06 20:50:07  matthias
    + added -V|-W|-Z options;
    # fixed a problem when 1st'o'month was linked to daily instead of weekly
        archive file if that day was a sunday;
    * updated/enhanced docs, especially HTML output (which got a linked ToC);

    Revision 1.4  2002/09/02 15:04:43  matthias
    + added/implemented -T option (incl. docs);

    Revision 1.3  2001/01/26 18:30:05  matthias
    + added -H|-I|-S options and a lot more documentation;
    * tar now uses output redirection (instead of --file) to avoid (?)
        possible problems with archives gt 2GB;

    Revision 1.2  1999/09/12 20:13:53  matthias
    # modified screen output on --help;

    Revision 1.1  1997/03/14 22:02:32  matthias
    + (long delayed) initial CVS checkin;

[up to Table of Contents]

*)   In case you're wondering who that famous Miss Ruth might be: In German the first name Ruth and the English word root sound exactly the same.


Disclaimer: No bits or bytes were harmed and no harddisk destroyed in order to create this page.
All letters and digits on this page are strictly virtual and
any resemblance to real letters or digits – monospaced, serif or sans-serif – is purely coincidental.