Archiving and Restoring Data
Archiving data ensures that important information is securely stored and easily retrievable, and it frees up space in your lab share.
Suggested Directory Standards
Set an organizational standard for your lab archives:
- Yearly/Monthly: /wistar/lab/archive-YYYY-MM
  Example: archive-2025-07
- Quarterly: /wistar/lab/archive-QX
  Example: archive-Q1 (for Jan–Mar)
- Static: /wistar/lab/archive/
  (One standard archive directory)
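As a minimal sketch of the Yearly/Monthly convention above, the helper below builds the directory name from a lab root, a year, and a month. The function name archive_dir_for is hypothetical and not part of any Wistar tooling; the lab root is the example path used in this guide.

```shell
#!/bin/bash
# archive_dir_for (hypothetical helper): print a Yearly/Monthly archive path.
#   $1 = lab root, $2 = four-digit year, $3 = month (1-12, leading zero allowed)
archive_dir_for() {
    # Strip one leading zero so "08" is not misread as an invalid octal number.
    printf '%s/archive-%s-%02d\n' "$1" "$2" "${3#0}"
}

# Path for the current month
archive_dir_for /wistar/lab "$(date +%Y)" "$(date +%m)"

# Path for a fixed month; prints /wistar/lab/archive-2025-07
archive_dir_for /wistar/lab 2025 7
```

The printed path can then be passed to mkdir -p before it is used as an --output_dir.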
Archival Requirements
In order to archive data, your dataset/project must meet the following requirements:
- In .tar.gz format
- Less than 3 TB in size
- Greater than 128 KB in size
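The requirements above can be checked before submitting a request. The following is a sketch only: check_archive is a hypothetical helper, and it assumes binary units (128 KB as 128 * 1024 bytes, 3 TB as 3 * 1024^4 bytes), which this guide does not state explicitly.

```shell
#!/bin/bash
# Size bounds from the requirements list, ASSUMING binary units.
MIN_BYTES=$((128 * 1024))
MAX_BYTES=$((3 * 1024 * 1024 * 1024 * 1024))

# check_archive (hypothetical helper): return 0 if the file exists, is named
# *.tar.gz, and its size lies strictly between the two bounds.
check_archive() {
    case "$1" in
        *.tar.gz) ;;
        *) echo "not a .tar.gz file: $1" >&2; return 1 ;;
    esac
    if [ ! -f "$1" ]; then
        echo "no such file: $1" >&2
        return 1
    fi
    # wc -c reports the size in bytes; tr removes padding some systems add.
    size=$(wc -c < "$1" | tr -d ' ')
    if [ "$size" -le "$MIN_BYTES" ]; then
        echo "too small ($size bytes): $1" >&2
        return 1
    fi
    if [ "$size" -ge "$MAX_BYTES" ]; then
        echo "too large ($size bytes): $1" >&2
        return 1
    fi
    echo "ok ($size bytes): $1"
}

# Example (path is illustrative):
#   check_archive /wistar/lab/archive/data.tar.gz
```

Archives produced by the bundle tool should already satisfy these limits; a check like this is mainly useful for archives created by hand.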
4 TiB File Size Limit
Our storage system does not support files larger than 4 TiB.
- If your data.tar.gz exceeds 4 TiB, it will be incomplete.
- The bundle tool (below) automatically splits large files into smaller parts.
The bundle Tool (Prepare)
The bundle tool compresses and packages data into .tar.gz archives and automatically splits them into parts if needed.
Warning
This tool is very powerful: it can generate large files and it can delete files. Do not run it against any data that is NOT YOURS. If you need help running it, please see Getting Help. Thank you!
Usage
module load wi-bundler
bundle --source_dir DIR --output_dir DIR --log_dir DIR [options]
Required Options
- -s, --source_dir: Directory to bundle
- -o, --output_dir: Destination for .tar.gz files
- -l, --log_dir: Directory for reports/checksums
Optional Flags
- -y, --yes: Auto-accept all prompts (splitting, overwrite, etc.)
- --delete-original: Automatically delete the original source directory
- -h, --help: Show this help message
Example: Standard Run
bundle --source_dir /wistar/lab/data \
--output_dir /wistar/lab/archive \
--log_dir /wistar/lab/archive_log
If your source is larger than 3 TiB, the tool will create split parts:
data_part_aa.tar.gz
data_part_ab.tar.gz
data_part_ac.tar.gz
...
What You Get
- One or more .tar.gz files in the --output_dir directory
- Report files in the --log_dir directory, which include:
  - Directory tree structure
  - Checksums for completed .tar.gz files (for later validation)
- Original data stays intact (if you wish to delete it, use the --delete-original flag or delete it manually)
Save the report files!
Keep the report files (tree and checksums) in a safe place. They are required if you ever need to restore the data.
The unbundle Tool (Restore)
The unbundle tool extracts archived .tar.gz bundles, including multi-part sets, using pigz for parallel decompression.
Usage
unbundle --source_dir DIR --output_dir DIR [options]
Required Options
- -s, --source_dir: Directory containing one or more .tar.gz file(s)
- -o, --output_dir: Destination directory for extracted files
Optional Flags
- -y, --yes: Auto-accept all prompts
- --delete-archives: Automatically delete the archive files
- -h, --help: Show the help message
What You Get
- Original data is restored to the --output_dir path
- .tar.gz files stay intact (if you want to delete them, use the --delete-archives flag or delete them manually)
Example: Standard run
unbundle --source_dir /wistar/lab/archive --output_dir /wistar/lab/restore
Schedule Job Submission Example
The recommended way to run these processes is through a scheduled job, which runs in the background for you.
Note: Passing the -y, --yes flag
When using a Slurm job, you must pass the -y or --yes flag, because you cannot type yes to confirm in this type of job.
#!/bin/bash
#SBATCH --job-name=(un)bundle
#SBATCH --cpus-per-task=20 # prioritize CPUS
#SBATCH --mem=10G # low memory utilization
#SBATCH --time=8:00:00 # could be long running depending on size
#SBATCH --output=%j.out
#SBATCH --error=%j.err
module load wi-bundler
# Prep data for archive
bundle --source_dir /wistar/lab/data --output_dir /wistar/lab/archive --log_dir /wistar/lab/archive_logs/ --yes
# Restore data from archive
unbundle --source_dir /wistar/lab/archive --output_dir /wistar/lab/restore --yes
Better yet, use job arrays to archive multiple projects/datasets; see Job Arrays.
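A job array for several projects might look like the sketch below. It is an illustration under assumptions, not a tested site configuration: the list file projects.txt (one source directory per line), the helper project_for_task, the array range, and the per-project output and log paths are all hypothetical. Only the bundle flags documented above are used.

```shell
#!/bin/bash
#SBATCH --job-name=bundle-array
#SBATCH --array=1-3              # one task per line in projects.txt (assumed 3 lines)
#SBATCH --cpus-per-task=20
#SBATCH --mem=10G
#SBATCH --time=8:00:00
#SBATCH --output=%A_%a.out       # %A = array job ID, %a = array task index
#SBATCH --error=%A_%a.err

# Hypothetical list file: one source directory per line.
PROJECT_LIST="${PROJECT_LIST:-projects.txt}"

# project_for_task (hypothetical helper): print line $2 (1-based) of file $1.
project_for_task() {
    sed -n "${2}p" "$1"
}

# Run the bundle step only inside a Slurm array task.
if [ -n "${SLURM_ARRAY_TASK_ID:-}" ]; then
    module load wi-bundler
    src=$(project_for_task "$PROJECT_LIST" "$SLURM_ARRAY_TASK_ID")
    if [ -z "$src" ]; then
        echo "no project listed for task $SLURM_ARRAY_TASK_ID" >&2
        exit 1
    fi
    name=$(basename "$src")
    # Per-project output and log paths are illustrative.
    bundle --source_dir "$src" \
           --output_dir "/wistar/lab/archive/$name" \
           --log_dir "/wistar/lab/archive_logs/$name" \
           --yes
fi
```

Each array task bundles one directory, so projects are processed independently and a failure in one does not stop the others.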
Slurm Job Tips
- Archiving and (de)compression are very CPU intensive, and the tools are designed to run with multiple cores, so prioritize a higher --cpus-per-task.
- You must pass the -y or --yes flag, as you will not be able to type yes for confirmation in a scheduled job.
- Depending on the size of your data, this may take a while, so ensure you give the job enough --time.
Next Steps: Submitting for IT Archival/Retrieval
Pushing TO Archive
Once you have completed the bundling process:
- Confirm that all .tar.gz files and the report file are present.
- Submit a request via Getting Help and specify the --output_dir directory.
- IT will archive the data from the designated location.
Pulling FROM Archive
- Submit a request via Getting Help and specify your preferred --source_dir directory, where the .tar.gz files will be placed.
- Once notified, confirm that all .tar.gz files are present (see your report file for confirmation).
- Optionally, confirm the checksum of the .tar.gz files (run sha256sum path/to/tar.gz).
Once confirmed, use the unbundle tool to put your data back in your preferred --output_dir directory.
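The optional checksum step can be scripted. The sketch below assumes you copy the expected SHA-256 digest by hand from your report file, since the report's exact layout is not described here; verify_sha256 is a hypothetical helper name.

```shell
#!/bin/bash
# verify_sha256 (hypothetical helper): compare a file's SHA-256 digest with an
# expected hex digest, e.g. one copied from the bundle report.
#   $1 = path to a .tar.gz file, $2 = expected digest
verify_sha256() {
    if [ ! -f "$1" ]; then
        echo "no such file: $1" >&2
        return 1
    fi
    # sha256sum prints "<digest>  <path>"; keep only the digest.
    actual=$(sha256sum "$1" | awk '{print $1}')
    if [ "$actual" = "$2" ]; then
        echo "OK: $1"
    else
        echo "MISMATCH: $1" >&2
        return 1
    fi
}

# Example (path and digest are placeholders):
#   verify_sha256 /wistar/lab/archive/data.tar.gz <digest-from-report>
```

A mismatch suggests the file changed or was only partially transferred, so it is worth resolving before running unbundle.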
FAQ
What is archiving?
Archiving is the process of storing old/unused data in the cloud (such as AWS S3), where it is much cheaper to store. Archiving IS NOT a means to quickly get more space on your lab share. If you are running into space problems, please see Getting Help and we can assist you.
Why is archiving important?
- Storage costs money and can be very expensive. Data organization is paramount to ensuring that Wistar and your lab can continue work. Archiving can reduce cost and improve organization in your lab share.
Why do I have to do this?
- There is a 4 TiB file size limit on your lab shares, so we need to break up archives into separate "parts".
- Our archival destination (AWS S3 - Glacier Instant Retrieval) has a minimum file size of 128 KB, so we need to consolidate data into single large files.
What happens if my bundle fails or is too large?
- If the bundle fails, check available disk space and permissions in your --output_dir and --log_dir directories.
- If you manually created a .tar.gz larger than 4 TiB, re-run bundle so it can split the files into parts.
How do I restore data if I only have the .tar.gz parts?
- Use the unbundle tool (see above). It will automatically detect and combine multi-part archives.
What if some parts are missing?
- unbundle will alert you to any missing parts. Retrieve the missing files before attempting to restore.
My Job Failed with "cancelled"
- You may have forgotten to pass the -y or --yes flag to either command.