# Archiving and Restoring Data

Archiving data keeps important information securely stored and easily retrievable while freeing up space in your lab share.
## Suggested Directory Standards

Set an organizational standard for your lab archives:

- **Yearly/Monthly:** `/wistar/lab/archive-YYYY-MM` (e.g. `archive-2025-07`)
- **Quarterly:** `/wistar/lab/archive-QX` (e.g. `archive-Q1` for Jan–Mar)
- **Static:** `/wistar/lab/archive/` (one standard archive directory)
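The dated directory names can be generated automatically. A minimal sketch, using the `/wistar/lab` share path from the examples above:

```shell
# Build this month's archive directory name (Yearly/Monthly standard).
# /wistar/lab is the example share path used throughout this guide.
archive_dir="/wistar/lab/archive-$(date +%Y-%m)"
echo "$archive_dir"          # e.g. /wistar/lab/archive-2025-07
# mkdir -p "$archive_dir"    # uncomment to create it on your share
```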
## Preparing Data for Archiving

### 4 TiB File Size Limit

Our storage system does not support files larger than 4 TiB.

- If your `data.tar.gz` exceeds 4 TiB, it will be incomplete.
- The `bundle` tool (below) automatically splits large files into smaller parts.

### The `bundle` Tool (Prepare)

The `bundle` tool compresses and packages data into `.tar.gz` archives and automatically splits them into parts if needed.
!!! warning
    This tool is very powerful: it can generate large files and it can delete files. Do not run it against any data that is **not yours**. If you need help running it, see Getting Help.
### Usage

```
module load wi-bundler
bundle --source DIR --output DIR --log DIR [options]
```

#### Required Options

- `-s, --source DIR`: Directory to bundle
- `-o, --output DIR`: Destination for `.tar.gz` files
- `-l, --log DIR`: Directory for reports/checksums

#### Optional Flags

- `-y, --yes`: Auto-accept all prompts (splitting, overwrite, etc.)
- `--delete-original`: Delete the original source directory after bundling
- `-n, --dry-run`: Show what would happen without writing files
- `-t, --test`: Perform a checksum-only validation of the source data
### Example: Standard Run

```
bundle --source /wistar/lab/data \
       --output /wistar/lab/archive \
       --log /wistar/lab/archive_log
```

If your source is larger than 3 TiB, the tool will create split parts:

```
data_part_aa.tar.gz
data_part_ab.tar.gz
data_part_ac.tar.gz
...
```
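The `_aa`, `_ab`, ... naming follows the convention of `split(1)`. The sketch below is not how `bundle` works internally, but it illustrates why multi-part archives are safe to rely on: concatenating the parts in order reproduces the original file byte for byte.

```shell
# Demonstrate lossless splitting and reassembly with coreutils split(1).
# (bundle does this for you; this is only an illustration.)
workdir=$(mktemp -d)
cd "$workdir"
head -c 1000000 /dev/urandom > data.tar.gz                  # stand-in archive
split -b 400000 --additional-suffix=.tar.gz data.tar.gz data_part_
ls data_part_*.tar.gz                                       # data_part_aa/ab/ac.tar.gz
cat data_part_*.tar.gz > restored.tar.gz                    # reassemble in order
cmp data.tar.gz restored.tar.gz && echo "parts reassemble cleanly"
```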
### What You Get

- One or more `.tar.gz` files in the `--output` directory
- A report file in the `--log` directory, which includes:
    - The directory tree structure
    - Checksums for each file (for later validation)
- The original data stays intact (to delete it, pass the `--delete-original` flag or delete it manually)
!!! warning "Save the report file!"
    Keep the report file in a safe place. It is required if you ever need to restore the data.
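When you later want to confirm an archive is intact, checksums recorded in `sha256sum` format can be checked with `sha256sum -c`. The exact report layout is not shown in this guide, so the demo below simulates one in a temporary directory; on the cluster, point `sha256sum -c` at the checksum file in your `--log` directory.

```shell
# Simulated checksum validation (the real checksum file lives in --log).
workdir=$(mktemp -d)
cd "$workdir"
printf 'example archive contents\n' > data_part_aa.tar.gz   # stand-in part
sha256sum data_part_aa.tar.gz > report.sha256               # like the bundle report
sha256sum -c report.sha256                                  # prints "data_part_aa.tar.gz: OK"
```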
## The `unbundle` Tool (Restore)

The `unbundle` tool extracts archived `.tar.gz` bundles, including multi-part sets, using `pigz` for parallel decompression.
### Usage

```
unbundle --source DIR --output DIR --prefix STR [options]
```

#### Required Options

- `-s, --source DIR`: Directory containing the `.tar.gz` file(s)
- `-o, --output DIR`: Destination directory for extracted files
- `-p, --prefix STR`: Filename prefix of the `.tar.gz` file(s) to extract

#### Optional Flags

- `-y, --yes`: Auto-accept all prompts
- `--delete-archive`: Delete the archive files after extraction
- `-n, --dry-run`: Show actions without extracting
- `-f, --force`: Overwrite existing files
- `-h, --help`: Show the help message
### What You Get

- The original data is restored to the `--output`/`--prefix` path
- The `.tar.gz` files stay intact (to delete them, pass the `--delete-archive` flag or delete them manually)
### Example: Standard Run

```
unbundle --source /wistar/lab/archive --output /wistar/lab --prefix project1
```
## Scheduled Job Submission Example

The recommended way to run these processes is through a scheduled job, since it will run in the background for you.

!!! note "Passing the `-y, --yes` flag"
    When using a Slurm job, you must pass the `-y` or `--yes` flag, as you will not be able to type "yes" at a confirmation prompt in this type of job.

```
#SBATCH --job-name=(un)bundle
#SBATCH --cpus-per-task=20   # prioritize CPUs
#SBATCH --mem=10G            # low memory utilization
#SBATCH --time=8:00:00       # could be long-running depending on size
#SBATCH --output=%j.out
#SBATCH --error=%j.err

module load wi-bundler

# Prep data for archive
bundle --source /wistar/lab/data --output /wistar/lab/archive --log /wistar/lab/archive_logs/ --yes

# Restore data from archive
unbundle --source /wistar/lab/archive --output /wistar/lab/data --prefix project1 --yes
```
## Slurm Job Tips

- Archival and (de)compression are very CPU-intensive. The tools are designed to run on multiple cores, so request a higher `--cpus-per-task`.
- You must pass the `-y` or `--yes` flag, as you cannot type "yes" for confirmation in a scheduled job.
- Depending on the size of your data, the job may take a while, so make sure you request enough `--time`.
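To use a job template like the one above, save it to a file and submit it with `sbatch`. The filename `bundle_job.sh` is just an example; the `sbatch` and `squeue` commands are part of Slurm and run on the cluster, so they are shown commented out here.

```shell
# Write a minimal job script (sketch; adjust paths and resources to your data).
cd "$(mktemp -d)"
cat > bundle_job.sh <<'EOF'
#!/bin/bash
#SBATCH --job-name=bundle
#SBATCH --cpus-per-task=20
#SBATCH --mem=10G
#SBATCH --time=8:00:00
#SBATCH --output=%j.out
#SBATCH --error=%j.err
module load wi-bundler
bundle --source /wistar/lab/data --output /wistar/lab/archive \
       --log /wistar/lab/archive_logs/ --yes
EOF
# sbatch bundle_job.sh    # submit the job
# squeue -u "$USER"       # check its status
```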
## Next Steps: Submitting for IT Archival/Retrieval

### Pushing TO Archive

Once you have completed the bundling process:

- Confirm that all `.tar.gz` files and the report file are present.
- Submit a request via Getting Help and specify the `--output` directory.
- IT will archive the data from the designated location.
### Pulling FROM Archive

- Submit a request via Getting Help and specify the `--source` directory where the `.tar.gz` files should be placed.
- Once notified, confirm that all `.tar.gz` files are present (check against your report file).
- Optionally, verify the checksums of the `.tar.gz` files (run `sha256sum path/to/tar.gz`).

Once confirmed, use the `unbundle` tool to restore your data to your preferred `--output` directory.
## FAQ

### Why do I have to do this?

- There is a 4 TiB file size limit on lab shares, so large archives must be broken up into separate "parts".
- Our archival destination (AWS S3 Glacier Instant Retrieval) has a 128 KB minimum object size, so small files must be consolidated into single large archives.
### What happens if my bundle fails or is too large?

- If the bundle fails, check available disk space and permissions in your `--output` and `--log` directories.
- If you manually created a `.tar.gz` larger than 4 TiB, re-run `bundle` so it can split the archive into parts.
### How do I restore data if I only have the `.tar.gz` parts?

Use the `unbundle` tool (see above). It will automatically detect and combine multi-part archives.
### What if some parts are missing?

`unbundle` will alert you to any missing parts. Retrieve the missing files before attempting a restore.
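If you kept the report's checksums in `sha256sum` format, `sha256sum -c` can also flag missing parts before you attempt a restore. A simulated example (the checksum-file name and contents here are made up for illustration):

```shell
# Simulate a part set where data_part_ab.tar.gz is missing on disk.
workdir=$(mktemp -d)
cd "$workdir"
printf 'aaa' > data_part_aa.tar.gz
printf 'ccc' > data_part_ac.tar.gz
sha256sum data_part_aa.tar.gz data_part_ac.tar.gz > report.sha256
# Pretend the report also listed data_part_ab.tar.gz:
echo "0000000000000000000000000000000000000000000000000000000000000000  data_part_ab.tar.gz" >> report.sha256
sha256sum -c report.sha256 2>&1 | grep data_part_ab   # the missing part is flagged
```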
### My job failed with "cancelled"

- You may have forgotten to pass the `-y` or `--yes` flag with either command.