Dealing with GridPane Backups (duplicacy) and Fossil Files


Introduction

GridPane’s backup system is built on duplicacy. This article covers how it works and how to manage backups from the terminal.

This article stems from the following live blog post:

https://wpguide.io/live-blog/blog-articles/gridpane-duplicacy-leaving-38gb-of-fossils-behind

Investigating Duplicacy Backup Snapshot Sizes

You can run the following command to see all backups for each site and how much space they’re taking up.

duplicacy check -tabular | less

The output lists every snapshot per site. The chunks/bytes columns show the total chunks and bytes used by each snapshot. The uniq/bytes column is the unique data that can’t be deduplicated, and new/bytes is data duplicacy hasn’t seen before. The bottom row sums each column, so the total under chunks/bytes is the overall storage space used.
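For reference, the report looks roughly like the sketch below. The numbers and site name are made up for illustration; only the column layout is meant to match duplicacy’s tabular report:

```
 snap | rev |                    | files |  bytes | chunks |  bytes | uniq |  bytes |  new |  bytes |
 site1|   1 | @ 2023-01-01 02:00 |  1204 | 2,150M |    431 | 2,101M |   25 |   118M |  431 | 2,101M |
 site1|   2 | @ 2023-01-02 02:00 |  1210 | 2,162M |    440 | 2,118M |   31 |   134M |   67 |   311M |
 site1| all |                    |       |        |    498 | 2,412M |      |        |      |        |
```

The `all` row is the sum for that snapshot id, so its chunks/bytes figure is the total storage that site consumes.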

What are fossil files?

There are a couple of explanations for what fossil files are. Here’s one from the DESIGN.md file in the duplicacy GitHub repository. It goes pretty deep into how the two-step fossil collection works.

Interestingly, the two-step fossil collection algorithm hinges on a basic file operation supported almost universally, file renaming. When the deletion procedure identifies a chunk not referenced by any known snapshots, instead of deleting the chunk file immediately, it changes the name of the chunk file (and possibly moves it to a different directory). A chunk that has been renamed is called a fossil.

The fossil still exists in the file storage. Two rules are enforced regarding the access of fossils:

  • A restore, list, or check procedure that reads existing backups can read the fossil if the original chunk cannot be found.
  • A backup procedure does not check the existence of a fossil. That is, it must upload a chunk if it cannot find the chunk, even if an equivalent fossil exists.

In the first step of the deletion procedure, called the fossil collection step, the names of all identified fossils will be saved in a fossil collection file. The deletion procedure then exits without performing further actions. This step has not effectively changed any chunk references due to the first fossil access rule. If a backup procedure references a chunk after it is marked as a fossil, a new chunk will be uploaded because of the second fossil access rule, as shown in Figure 1.

The second step, called the fossil deletion step, will permanently delete fossils, but only when two conditions are met:

  • For each snapshot id, there is a new snapshot that was not seen by the fossil collection step
  • The new snapshot must finish after the fossil collection step

The first condition guarantees that if a backup procedure references a chunk before the deletion procedure turns it into a fossil, the reference will be detected in the fossil deletion step which will then turn the fossil back into a normal chunk.

https://github.com/georgyo/duplicacy-cli/blob/master/DESIGN.md
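The rename-based mechanism described above can be sketched with ordinary file operations. This is an illustrative simulation only: the chunk name is made up, and real duplicacy storage uses hashed paths, but the `.fsl` suffix matches the fossil files the rest of this article counts.

```shell
# Illustrative simulation of two-step fossil collection.
storage=$(mktemp -d)
mkdir -p "$storage/chunks"
echo "chunk data" > "$storage/chunks/4a5b6c"   # hypothetical chunk

# Step 1 (fossil collection): the unreferenced chunk is renamed,
# not deleted - it becomes a .fsl fossil.
mv "$storage/chunks/4a5b6c" "$storage/chunks/4a5b6c.fsl"

# Fossil access rule 2: a backup procedure ignores fossils, so the
# chunk now looks missing and would be uploaded again.
[ ! -f "$storage/chunks/4a5b6c" ] && echo "chunk missing: backup would re-upload it"

# Step 2 (fossil deletion, a later prune run): the fossil is
# permanently removed once the safety conditions are met.
rm "$storage/chunks/4a5b6c.fsl"
rm -rf "$storage"
```

The rename is the whole trick: because a restore can still read the fossil but a backup cannot, the fossil sits in limbo until a later prune proves nothing references it.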

Prune command and two-step fossil collection

The prune command implements the two-step fossil collection algorithm. It will first find fossil collection files from previous runs and check if contained fossils are eligible for permanent deletion (the fossil deletion step). Then it will search for snapshots to be deleted, mark unreferenced chunks as fossils (by renaming) and save them in a new fossil collection file stored locally (the fossil collection step).

For fossils collected in the fossil collection step to be eligible for safe deletion in the fossil deletion step, at least one new snapshot from each snapshot id must be created between two runs of the prune command. However, some repositories may not be set up to back up on a regular schedule, which blocks other repositories from deleting any fossils. Duplicacy by default will ignore repositories that have no new backup in the past 7 days, and you can also use the -ignore option to skip certain repositories when deciding the deletion criteria.

https://forum.duplicacy.com/t/prune-command-details/1005
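The 7-day inactivity rule can be illustrated with a small shell check. The directory layout and the `is_inactive` helper here are hypothetical; duplicacy makes this decision internally from snapshot timestamps:

```shell
# Hypothetical layout: one file per snapshot id, mtime = last backup time.
snapdir=$(mktemp -d)
touch -d "10 days ago" "$snapdir/site-a"   # no recent backups
touch -d "1 day ago"  "$snapdir/site-b"    # backed up yesterday

# A repository counts as inactive when its newest snapshot is more
# than 7 days old (find -mtime +7 matches such files).
is_inactive() { [ -n "$(find "$1" -mtime +7)" ]; }

for f in "$snapdir"/*; do
  if is_inactive "$f"; then
    echo "$(basename "$f"): inactive - ignored when deciding fossil deletion"
  else
    echo "$(basename "$f"): active - needs a new snapshot before fossils can be deleted"
  fi
done
rm -rf "$snapdir"
```

In this sketch, site-a would be excluded from the deletion criteria, so its stale snapshot id no longer blocks every other site from ever deleting fossils.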

One-liner to check your fossil file sizes

Run this command on your server from anywhere; it will locate the .duplicacy directory automatically.

echo -e "\n** Checking duplicacy storage **"; \
echo -n "Total size of backup chunks: "; du --max-depth="0" -h "/opt/gridpane/backups/duplications/chunks"; \
echo "----"; \
echo -n "Total .fsl files: "; find /opt/gridpane/backups/duplications/chunks -name "*.fsl" | wc -l; \
echo -n "Total .fsl file size: "; find /opt/gridpane/backups/duplications/chunks -type f -name "*.fsl" -print0 | du --files0-from=- -hc | tail -n1; \
echo "----"; \
echo -n "Total normal chunk files: "; find /opt/gridpane/backups/duplications/chunks -type f ! -name "*.fsl" | wc -l; \
echo -n "Total normal chunk file size: "; find /opt/gridpane/backups/duplications/chunks -type f ! -name "*.fsl" -print0 | du --files0-from=- -hc | tail -n1; \
echo "----"; \
echo -n "Duplicacy reporting totals: "; \cd "$(dirname "$(find /var/www/ -name ".duplicacy" | tail -n 1)")" > /dev/null; duplicacy check -tabular | grep Total

Deleting Fossil Files

Attention

This may corrupt your backups; you need to ensure that no backups are running before or during this process.

The prune command prunes not only snapshots and revisions but also, when specified, unused chunks. By design, if you want to clean up after successful backups, you need to run prune and specify how long to keep revisions. Otherwise, duplicacy will simply keep backing up data and never delete anything until a prune operation is run.

Here are some important switches to consider.

Important duplicacy Switches

-keep

Keep 1 revision every n days for revisions older than m days.

The retention policies are specified by the -keep option, which accepts an argument in the form of two numbers n:m, where n indicates the number of days between two consecutive revisions to keep, and m means that the policy only applies to revisions at least m days old. If n is zero, any revisions older than m days will be removed.

Examples:
duplicacy prune -keep 1:7       # Keep a revision per (1) day for revisions older than 7 days
duplicacy prune -keep 7:30      # Keep a revision every 7 days for revisions older than 30 days
duplicacy prune -keep 30:180    # Keep a revision every 30 days for revisions older than 180 days
duplicacy prune -keep 0:360     # Keep no revisions older than 360 days
Multiple -keep options must be sorted by their m values in decreasing order.

For example, to combine the above policies into one line, it would become:

duplicacy prune -keep 0:360 -keep 30:180 -keep 7:30 -keep 1:7
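To make the combined policy concrete, here is a small helper function (not part of duplicacy; the thresholds simply mirror the example above) that reports which rule governs a revision of a given age. The rules are checked from the largest m downward, which is why duplicacy requires -keep options in decreasing order of m:

```shell
# Which of the example -keep rules applies to a revision aged N days?
policy_for_age() {
  if   [ "$1" -ge 360 ]; then echo "0:360 -> deleted"
  elif [ "$1" -ge 180 ]; then echo "30:180 -> one revision kept per 30 days"
  elif [ "$1" -ge 30 ];  then echo "7:30 -> one revision kept per 7 days"
  elif [ "$1" -ge 7 ];   then echo "1:7 -> one revision kept per day"
  else                        echo "no rule -> all revisions kept"
  fi
}

policy_for_age 3     # no rule -> all revisions kept
policy_for_age 45    # 7:30 -> one revision kept per 7 days
policy_for_age 400   # 0:360 -> deleted
```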

-exhaustive

Remove all unreferenced chunks (not just those referenced by deleted revisions).

The -exhaustive option will scan the list of all chunks in the storage, therefore it will find not only unreferenced chunks from deleted revisions, but also chunks that become unreferenced for other reasons, such as those from an incomplete backup.

It will also find any file that does not look like a chunk file.

In contrast, a normal prune command will only identify chunks referenced by deleted revisions but not any other revisions.

-exclusive

Assume exclusive access to the storage (disable two-step fossil collection).

The -exclusive option will assume that no other clients are accessing the storage, effectively disabling the two-step fossil collection algorithm.

With this option, the prune command will immediately remove unreferenced chunks.

WARNING: Only run -exclusive when you are sure that no other backup is running, on any other device or repository.

-dry-run, -d

This option is used to test what changes the prune command would have done. It is guaranteed not to make any changes on the storage, not even creating the local fossil collection file.

Example:
After running this nothing will be modified in the storage, but duplicacy will show all output just like a normal run:

duplicacy prune -dry-run -all -exhaustive -exclusive

-collect-only

Identify and collect fossils, but don’t delete fossils previously collected.

Example:
duplicacy prune -collect-only
The -delete-only option will skip the fossil collection step, while the -collect-only option will skip the fossil deletion step.

-delete-only

Delete fossils previously collected (if deletable) and don’t collect fossils.

Example:
duplicacy prune -delete-only

Deleting Fossil Files

When you set up pruning, fossil files are not deleted by default; you need specific switches to delete them. First, let’s do a dry run, because we always dry run before doing something destructive.

The following command will do a dry run and show you the unreferenced chunks that should be removed.

duplicacy prune -exhaustive -exclusive -d

Once we confirm the output looks reasonable and the number of chunks isn’t unexpectedly large (a large number is normal if you back up frequently or keep revisions for a long time), we can run the command for real.

duplicacy prune -exhaustive -exclusive

Next, we want to delete the fossil files collected previously or in previous prune jobs. But we want to dry run, because we want to confirm how many fossils are being deleted.

duplicacy prune -delete-only -d

If nothing looks abnormal, you can remove the -d and the fossil files will be deleted.

duplicacy prune -delete-only
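The four steps above can be collected into a small script. The DUPLICACY variable is a convenience for this sketch only: it defaults to echoing the commands so you can review them, and you would set DUPLICACY=duplicacy (and run it from a repository directory, with no backups in progress) to execute them for real:

```shell
#!/bin/sh
# Sketch of the full prune sequence. Prints the commands by default;
# set DUPLICACY=duplicacy to actually run them.
DUPLICACY=${DUPLICACY:-"echo duplicacy"}

# 1. Dry-run the exhaustive prune and review the unreferenced chunks
$DUPLICACY prune -exhaustive -exclusive -d

# 2. Remove the unreferenced chunks for real (-exclusive deletes
#    immediately, so no other backup may be running)
$DUPLICACY prune -exhaustive -exclusive

# 3. Dry-run the fossil deletion to confirm the fossil count
$DUPLICACY prune -delete-only -d

# 4. Permanently delete the previously collected fossils
$DUPLICACY prune -delete-only
```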

Dangers in Deleting Fossil Files

Are there any dangers in pruning fossil files?

There is always some danger in destructive operations, so always be cautious. However, the two-step fossil collection does its due diligence to ensure that collected fossils are safe to delete.

You must, however, ensure that a recurring backup is running and creating snapshots (at least two, per the documentation). To avoid deleting a fossil that a snapshot still needs, the fossil deletion step verifies that each snapshot id has produced a new snapshot since the fossil was collected; a fossil that turns out to be referenced is turned back into a normal chunk instead of being deleted.

There are, however, corner cases in which a still-needed fossil may be deleted by mistake.

The first case is a backup that takes more than 7 days and started before the chunk was marked as a fossil. The prune command will think the repository has become inactive and exclude it from the criteria for determining which fossils are safe to delete.

The other case is an initial backup from a newly recreated repository that also started before the chunk was marked as a fossil. Since the prune command doesn’t know that repository exists at fossil deletion time, it may conclude the fossil isn’t needed by any backup and delete it permanently.

Therefore, run the check command after any initial backup, and after any backup that takes more than 7 days. Once a backup passes the check command, it is guaranteed that it won’t be affected by any future prune operations.

https://forum.duplicacy.com/t/prune-command-details/1005

Change Log

  • 02/24/2023
    • Updated article and quoted duplicacy documentation about the two-step fossil collection process when pruning.
    • Changed “pruning” to “deleting” for fossil files, as you don’t prune fossil files.
