Archiving and Restoring Data

Archiving data keeps important information securely stored and easily retrievable, and it frees up space in your lab share.

Suggested Directory Standards

Set an organizational standard for your lab archives:

  • Yearly/Monthly: /wistar/lab/archive-YYYY-MM Example: archive-2025-07
  • Quarterly: /wistar/lab/archive-QX Example: archive-Q1 (for Jan–Mar)
  • Static: /wistar/lab/archive/ (One standard archive directory)
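
As a minimal sketch of these naming schemes, the shell functions below build the directory name from a year and month. The function names are illustrative assumptions and are not part of any provided tool; quarters are assumed to be calendar quarters (Q1 = Jan–Mar).

```shell
# Hypothetical helpers (not part of wi-bundler) that build archive directory
# names following the suggested standards.

# Monthly name, e.g. archive-2025-07. Arguments: YEAR MONTH.
archive_name_monthly() {
    # ${2#0} strips one leading zero so "08" is not read as an octal number.
    printf 'archive-%s-%02d\n' "$1" "${2#0}"
}

# Quarterly name, e.g. archive-Q3. Argument: MONTH (1-12).
# Assumes calendar quarters: Q1 = Jan-Mar, Q2 = Apr-Jun, and so on.
archive_name_quarterly() {
    printf 'archive-Q%d\n' "$(( (${1#0} - 1) / 3 + 1 ))"
}

# Print both names for today's date.
archive_name_monthly "$(date +%Y)" "$(date +%m)"
archive_name_quarterly "$(date +%m)"
```

Whichever scheme you pick, using a helper like this keeps the names consistent across lab members.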

Preparing Data for Archiving

4 TiB File Size Limit

Our storage system does not support files larger than 4 TiB.
  • If your data.tar.gz exceeds 4 TiB, it will be incomplete.
  • The bundle tool (below) automatically splits large files into smaller parts.
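
To check whether any existing file is already over the limit, a sketch along these lines may help. The function name find_oversized is a hypothetical helper, not part of wi-bundler, and it assumes a find implementation that accepts the G size suffix (GNU find does).

```shell
# Hypothetical helper: list regular files under DIR that are larger than LIMIT.
# LIMIT uses find's -size syntax; the default 4096G is 4 TiB expressed in GiB.
# Arguments: DIR [LIMIT]
find_oversized() {
    find "$1" -type f -size "+${2:-4096G}" -print
}

# Example call (the path is a placeholder):
# find_oversized /wistar/lab/archive
```

No output means no file under the directory exceeds the limit.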

The bundle Tool (Prepare)

The bundle tool compresses and packages data into .tar.gz archives and automatically splits them into parts if needed.

Warning

This tool is very powerful: it can generate large files and delete files. Do not run it against any data that is NOT YOURS. If you need help running it, please see Getting Help. Thank you!

Usage

module load wi-bundler
bundle --source DIR --output DIR --log DIR [options]

Required Options

  • -s, --source DIR: Directory to bundle
  • -o, --output DIR: Destination for .tar.gz files
  • -l, --log DIR: Directory for reports/checksums

Optional Flags

  • -y, --yes: Auto-accept all prompts (splitting, overwrite, etc.)
  • --delete-original: Automatically delete the original source directory
  • -n, --dry-run: Show what would happen without writing files
  • -t, --test: Perform a checksum-only validation of the source data

Example: Standard Run

bundle --source /wistar/lab/data \
       --output /wistar/lab/archive \
       --log /wistar/lab/archive_log

If your source is larger than 3 TiB, the tool will create split parts:

data_part_aa.tar.gz
data_part_ab.tar.gz
data_part_ac.tar.gz
...

What You Get

  1. One or more .tar.gz files in the --output directory
  2. A report file in the --log directory, which includes:
       • Directory tree structure
       • Checksums for each file (for later validation)
  3. Original data stays intact (to delete it, use the --delete-original flag or delete it manually)

Save the report file!

Keep the report file in a safe place. It is required if you ever need to restore the data.

The unbundle Tool (Restore)

The unbundle tool extracts archived .tar.gz bundles, including multi-part sets, using pigz for parallel decompression.

Usage

unbundle --source DIR --output DIR --prefix STR [options]

Required Options

  • -s, --source DIR: Directory containing .tar.gz file(s)
  • -o, --output DIR: Destination directory for extracted files
  • -p, --prefix STR: Prefix that the name(s) of the .tar.gz file(s) start with

Optional Flags

  • -y, --yes: Auto-accept all prompts
  • --delete-archive: Auto-delete the archive files
  • -n, --dry-run: Show actions without extracting
  • -f, --force: Overwrite existing files
  • -h, --help: Show the help message

What You Get

  1. Original data is restored to the --output/--prefix path
  2. .tar.gz files stay intact (to delete them, use the --delete-archive flag or delete them manually)

Example: Standard run

unbundle --source /wistar/lab/archive --output /wistar/lab --prefix project1

Schedule Job Submission Example

The recommended way to run these processes is through a scheduled job, which runs in the background for you.

Note: Passing the -y, --yes flag

When using a Slurm job, you must pass the -y or --yes flag, because you cannot type yes at a confirmation prompt in this type of job.

#!/bin/bash
#SBATCH --job-name=(un)bundle
#SBATCH --cpus-per-task=20          # prioritize CPUs
#SBATCH --mem=10G                   # low memory utilization
#SBATCH --time=8:00:00              # could be long running depending on size
#SBATCH --output=%j.out
#SBATCH --error=%j.err

module load wi-bundler

# Prep data for archive (keep only the command you need in a given job)
bundle --source /wistar/lab/data --output /wistar/lab/archive --log /wistar/lab/archive_logs/ --yes

# Restore data from archive
unbundle --source /wistar/lab/archive --output /wistar/lab/data --prefix project1 --yes

Slurm Job Tips

  1. Archiving and (de)compression are very CPU intensive. The tools are designed to run on multiple cores, so prioritize a higher --cpus-per-task.
  2. You must pass the -y or --yes flag, because you cannot type yes for confirmation in a scheduled job.
  3. Depending on the size of your data, the job may take a while, so make sure you give it enough --time.
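
One way to avoid forgetting the --yes flag is to generate the job script from a small helper. The sketch below is an assumption-laden illustration: make_bundle_job and the file name bundle_job.sh are hypothetical, and the resource values simply mirror the example above.

```shell
# Hypothetical helper: print a Slurm batch script that bundles SRC into OUT
# and writes reports to LOG. Arguments: SRC OUT LOG
make_bundle_job() {
    cat <<EOF
#!/bin/bash
#SBATCH --job-name=bundle
#SBATCH --cpus-per-task=20
#SBATCH --mem=10G
#SBATCH --time=8:00:00
#SBATCH --output=%j.out
#SBATCH --error=%j.err

module load wi-bundler
bundle --source "$1" --output "$2" --log "$3" --yes
EOF
}

# Example (paths are placeholders); writes the script to bundle_job.sh:
# make_bundle_job /wistar/lab/data /wistar/lab/archive /wistar/lab/archive_log > bundle_job.sh
```

Submit the generated file with sbatch bundle_job.sh, and check its state with squeue -u $USER.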

Next Steps: Submitting for IT Archival/Retrieval

Pushing TO Archive

Once you have completed the bundling process:

  1. Confirm that all .tar.gz files and the report file are present.
  2. Submit a request via Getting Help and specify the --output directory.
  3. IT will archive the data from the designated location.

Pulling FROM Archive

  1. Submit a request via Getting Help and specify the directory where the .tar.gz files should be placed (this becomes your --source directory for unbundle).
  2. Once notified, confirm that all .tar.gz files are present (see your report file for confirmation).
  3. Optionally, confirm the checksum of the .tar.gz files (run sha256sum path/to/tar.gz)
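
The optional checksum step can be sketched as follows. The format of the bundle report file is not described here, so this example keeps its own checksum list; record_checksums and verify_checksums are hypothetical helper names, not part of wi-bundler.

```shell
# Hypothetical helper: record SHA-256 checksums for every .tar.gz in DIR.
# Run it before requesting archival and keep the list with your report file.
# Arguments: DIR LIST_FILE
record_checksums() {
    ( cd "$1" && sha256sum -- *.tar.gz ) > "$2"
}

# Hypothetical helper: re-check the files in DIR against a saved list.
# Returns non-zero if any file is missing or its checksum differs.
# Arguments: DIR LIST_FILE
verify_checksums() {
    ( cd "$1" && sha256sum -c ) < "$2"
}

# Example (paths are placeholders):
# record_checksums /wistar/lab/archive /wistar/lab/archive_log/parts.sha256
# verify_checksums /wistar/lab/archive /wistar/lab/archive_log/parts.sha256
```

Because the list stores bare file names, the same list works even if the files are restored into a different directory.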

Once confirmed, use the unbundle tool to put your data back in your preferred --output directory.

FAQ

Why do I have to do this?

  • There is a 4 TiB file size limit on your lab shares, so large archives must be broken up into separate "parts".
  • Our archival destination (AWS S3 - Glacier Instant Retrieval) has a minimum billable object size of 128 KB, so many small files need to be consolidated into a few large files.

What happens if my bundle fails or is too large?

  • If the bundle fails, check available disk space and permissions in your --output and --log directories.
  • If you manually created a .tar.gz larger than 4 TiB, re-run bundle so it can split the files into parts.
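
A quick pre-flight check of the --output and --log directories might look like the sketch below. check_dir is a hypothetical helper, not part of wi-bundler; it only tests that the directory exists and is writable, then prints free space with df.

```shell
# Hypothetical pre-flight check: confirm DIR exists and is writable, then
# print its free space. Returns non-zero if the directory is unusable.
# Argument: DIR
check_dir() {
    if [ ! -d "$1" ]; then
        echo "missing directory: $1" >&2
        return 1
    fi
    if [ ! -w "$1" ]; then
        echo "not writable: $1" >&2
        return 1
    fi
    df -h "$1"
}

# Example (paths are placeholders):
# check_dir /wistar/lab/archive && check_dir /wistar/lab/archive_log
```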

How do I restore data if I only have the .tar.gz parts?

  • Use the unbundle tool (see above). It will automatically detect and combine multi-part archives.

What if some parts are missing?

  • unbundle will alert you to any missing parts. Retrieve the missing files before attempting to restore.

Why did my job fail with "cancelled"?

  • You may have forgotten to use the -y, --yes flag with either command.