Archiving and Restoring Data

Archiving data ensures that important information is securely stored, easily retrievable, and removed from your lab share once it is no longer active.

Important Distinctions:

Archiving ≠ Bundling
Archiving is NOT a temporary solution for freeing up space.
It is a long-term strategy for removing old and completed data/projects from active lab shares.

Suggested Directory Standards

Set an organizational standard for your lab archives:

Standard	Format	Example
Yearly	YYYY	/wistar/lab/archive-2025
Monthly	YYYY-MM	/wistar/lab/archive-2025-07
Quarterly	Qn	/wistar/lab/archive-Q1
Static	n/a	/wistar/lab/archive

Storage Classes

Under your archive folder, we recommend creating two sub-folders named instant-retrieval and deep_glacier, then placing each prepared bundle in the sub-folder that matches its intended storage class. See Amazon S3 Storage Classes for details on all the available options.
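As a rough illustration, the yearly directory standard and the two storage-class sub-folders could be created as follows. This is only a sketch: `LAB_ROOT` is a hypothetical variable, not something provided by the cluster, and it falls back to a temporary directory so the commands are safe to try anywhere.

```shell
# Hypothetical lab root (assumption); defaults to a throwaway temp directory.
lab_root="${LAB_ROOT:-$(mktemp -d)}"

# Yearly standard (YYYY); use "date +%Y-%m" for the monthly standard instead.
archive_dir="$lab_root/archive-$(date +%Y)"

# One sub-folder per storage class, named as recommended above.
mkdir -p "$archive_dir/instant-retrieval" "$archive_dir/deep_glacier"

ls "$archive_dir"
```

On a real lab share you would set `LAB_ROOT` to your lab's path (for example, the `/wistar/lab` prefix used in the table above).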

Instant Retrieval

Use instant retrieval for data or projects you expect to access again, such as results that may need to be revisited or reused in the near future. Keep in mind that Amazon requires a minimum storage duration of 3 months for this class.

Deep Glacier

Use deep glacier for long-term data that you do not expect to touch again, such as final archived projects or datasets stored for compliance purposes. This class has a 6-month minimum storage duration and is intended for data that is rarely, if ever, retrieved.

Archival Requirements

To archive data, your dataset/project must meet the following requirements:

In .tar.gz format
Less than 3 TB in size
Greater than 128 KB in size
Designation for Instant Retrieval (IR) or Deep Glacier (DG)
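The size requirements can be checked by hand before submitting a request. The sketch below is not part of the bundle tool; it creates a small stand-in file so it runs anywhere, and it assumes 128 KB means 128 × 1024 bytes and 3 TB means 3 × 10^12 bytes, which the requirements above do not spell out.

```shell
# Stand-in file so the sketch runs anywhere; point "archive" at a real .tar.gz instead.
archive="$(mktemp -d)/data.tar.gz"
head -c 200000 /dev/zero > "$archive"

min_bytes=$((128 * 1024))                        # 128 KB lower bound (binary KB assumed)
max_bytes=$((3 * 1000 * 1000 * 1000 * 1000))     # 3 TB upper bound (decimal TB assumed)

# File size in bytes; the arithmetic expansion strips any padding from wc.
size_bytes=$(( $(wc -c < "$archive") ))

if [ "$size_bytes" -gt "$min_bytes" ] && [ "$size_bytes" -lt "$max_bytes" ]; then
    size_ok=yes
else
    size_ok=no
fi
echo "$archive: $size_bytes bytes, within limits: $size_ok"
```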

4 TiB File Size Limit

Our storage system does not support files larger than 4 TiB.
- If your data.tar.gz exceeds 4 TiB, it will be incomplete.
- The bundle tool (below) automatically splits large files into smaller parts.

Why Do I Need to do this?

To prepare data for archiving, one must first "bundle" their data into one or more .tar.gz files. This is required due to AWS storage class requirements, storage server limitations, and archiving best practices.

Files smaller than 128 KB are billed as if they were 128 KB. For example, a 1 KB file pushed to an S3 bucket is seen and charged as 128 KB. This cost is negligible for one file, but it grows very quickly if you have millions of files.
Our NAS (Network Attached Storage) has a file size limit of 4 TB. If you attempt to pack a 10 TB dataset into a single .tar.gz, writing stops at 4 TB and the .tar.gz is left incomplete. The bundle tool provides an easy solution by using the split command to break .tar.gz files into 3 TB chunks (leaving a 1 TB buffer).
Finally, the bundle tool produces detailed tree output and sha256 checksums so that you have a record of exactly what was archived.
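The general technique described here (stream a compressed tarball through split, then record checksums) can be sketched with standard tools. This is an illustration under stated assumptions, not the bundle tool's actual implementation: the `work` directory and demo file are hypothetical, 100 KB parts stand in for the 3 TB chunks, and the part names are simplified (no .tar.gz suffix).

```shell
work=$(mktemp -d)
mkdir -p "$work/data" "$work/out" "$work/restore"

# Incompressible demo content so the compressed stream is large enough to split.
head -c 300000 /dev/urandom > "$work/data/sample.bin"

# Stream a gzip-compressed tarball into split; 100 KB parts stand in for 3 TB chunks.
tar -czf - -C "$work" data | split -b 100k - "$work/out/data_part_"

# Record one checksum per part for later validation.
(cd "$work/out" && sha256sum data_part_* > "$work/checksums.sha256")

# Restoring is the reverse: concatenate the parts in order and extract.
cat "$work/out"/data_part_* | tar -xzf - -C "$work/restore"

# Confirm the restored file is byte-for-byte identical to the original.
cmp "$work/data/sample.bin" "$work/restore/data/sample.bin" && echo "round trip OK"
```

Because split names its parts with sorted suffixes (aa, ab, ac, ...), a shell glob returns them in the right order for reassembly.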

The `bundle` Tool (Prepare)

The bundle tool compresses and packages data into .tar.gz archives and automatically splits them into parts if needed.

Available Space for Bundle

To bundle data, you need enough free space to hold the entire bundle. For example, bundling a 5TB folder requires at least 5TB of free space so the process can complete. Use df -h to check the available space on your lab share. If you see an error like No space left on device., it means you have run out of space. The --delete-original option does not change this requirement as the bundle must be fully written before the originals can be removed.
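A quick pre-flight check can compare the size of the source against the free space at the destination. The following is a minimal sketch with hypothetical demo directories; it uses the uncompressed source size as a conservative estimate of the bundle size, which is an assumption rather than how the bundle tool itself calculates space.

```shell
# Demo directories so the sketch runs anywhere; replace with your real paths.
source_dir=$(mktemp -d)
output_dir=$(mktemp -d)
head -c 50000 /dev/zero > "$source_dir/sample.dat"

# Size of the source and free space on the output filesystem, both in 1 KiB blocks.
needed_kb=$(du -sk "$source_dir" | awk '{print $1}')
free_kb=$(df -Pk "$output_dir" | awk 'NR == 2 {print $4}')

if [ "$free_kb" -gt "$needed_kb" ]; then
    echo "OK: $free_kb KB free, $needed_kb KB needed"
else
    echo "Not enough space: $free_kb KB free, $needed_kb KB needed" >&2
fi
```

The `-P` flag keeps the df output on one line per filesystem, so the fourth column of the second line is always the available space.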

Usage

module load wi-bundler
bundle --source_dir DIR --output_dir DIR --log_dir DIR [options]

Required Options

-s, --source_dir: Directory to bundle
-o, --output_dir: Destination for .tar.gz files
-l, --log_dir: Directory for reports/checksums

Optional Flags

-y, --yes: Auto-accept all prompts (splitting, overwrite, etc.)
--delete-original: Automatically delete the original source directory
-h, --help: Show the help message

Example: Standard Run

bundle --source_dir /wistar/lab/data --output_dir /wistar/lab/archive --log_dir /wistar/lab/archive_log

If your source is larger than 3 TiB, the tool will create split parts:

data_part_aa.tar.gz
data_part_ab.tar.gz
data_part_ac.tar.gz
...

What You Get

One or more .tar.gz files in the --output_dir directory
Report files in the --log_dir directory, which includes:
Directory tree structure
Checksums for completed .tar.gz (for later validation)
Original data stays intact (if you wish to delete, use the --delete-original flag or delete manually)

Save the report files!

Keep the report files (tree and checksums) in a safe place. They are required if you ever need to restore the data.

The `unbundle` Tool (Restore)

The unbundle tool extracts archived .tar.gz bundles, including multi-part sets, using pigz for parallel decompression.

Available Space for Unbundle

To unbundle data, you need enough free space to hold all of the extracted data. For example, unbundling a 5TB folder requires at least 5TB of free space so the process can complete. Use df -h to check the available space on your lab share. If you see an error like No space left on device., it means you have run out of space. The --delete-archives option does not change this requirement, as the data must be fully extracted before the archives can be removed.

Usage

unbundle --source_dir DIR --output_dir DIR [options]

Required Options

-s, --source_dir: Directory containing 1 or more .tar.gz file(s)
-o, --output_dir: Destination directory for extracted files

Optional Flags

-y, --yes: Auto-accept all prompts
--delete-archives: Auto-delete the archive files
-h, --help: Show the help message

What You Get

Original data is restored to the --output_dir path
.tar.gz files stay intact (if you want to delete, use the --delete-archives flag or delete manually)

Example: Standard run

unbundle --source_dir /wistar/lab/archive --output_dir /wistar/lab/restore

Schedule Job Submission Example

We recommend running these tools as a scheduled job, which runs in the background for you.

Passing the -y, --yes flag

When using a Slurm job, you must pass the -y or --yes flag because a scheduled job cannot prompt you to type yes for confirmation.

#!/bin/bash
#SBATCH --job-name=(un)bundle
#SBATCH --cpus-per-task=20          # prioritize CPUS
#SBATCH --mem=10G                   # low memory utilization
#SBATCH --time=8:00:00              # could be long running depending on size
#SBATCH --output=%j.out
#SBATCH --error=%j.err

module load wi-bundler

# Prep data for archive
bundle --source_dir /wistar/lab/data --output_dir /wistar/lab/archive --log_dir /wistar/lab/archive_logs/ --yes

# Restore data from archive
unbundle --source_dir /wistar/lab/archive --output_dir /wistar/lab/restore --yes

Better yet, use job arrays to archive multiple projects/datasets; see Job Arrays.

Slurm Job Tips to Know

Archiving and (de)compression are very CPU intensive. The tools are designed to run on multiple cores, so prioritize a higher --cpus-per-task.
You must pass the -y or --yes flag because you cannot type yes for confirmation in a scheduled job.
Depending on the size of your data, the job may take a while, so make sure you give it enough --time.

Next Steps: Submitting for IT Archival/Retrieval

Pushing TO Archive

Once you have completed the bundling process:

Confirm that all .tar.gz files and the report file are present.
Submit a request via Getting Help and specify the --output_dir directory.
Specify which storage class the data should be archived to (Instant Retrieval or Deep Glacier).
IT will archive the data from the designated location.

Pulling FROM Archive

Submit a request via Getting Help and specify the directory where the .tar.gz files should be placed (this becomes your --source_dir for unbundle).
Once notified, confirm that all .tar.gz files are present (see your report file for confirmation).
Optionally, confirm the checksum of each .tar.gz file (run sha256sum path/to/tar.gz).

Once confirmed, use the unbundle tool to put your data back in your preferred --output_dir directory.
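The optional checksum step can be automated with `sha256sum -c`. The sketch below assumes the checksum report uses the standard sha256sum output format (hash, two spaces, file name), which this page does not confirm; `checksums.sha256` and the stand-in file are hypothetical names created only so the example runs.

```shell
restore_src=$(mktemp -d)

# Stand-in file and checksum list (assumptions); in practice these come from IT and your bundle report.
printf 'demo archive contents\n' > "$restore_src/data.tar.gz"
(cd "$restore_src" && sha256sum data.tar.gz > checksums.sha256)

# Verify every file listed in the checksum file; a non-zero exit status means a mismatch.
if (cd "$restore_src" && sha256sum -c checksums.sha256); then
    verified=yes
else
    verified=no
fi
echo "checksums verified: $verified"
```

If your report stores checksums in a different layout, compare the output of sha256sum against it by eye instead.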

FAQ

What is archiving?

Archiving is the process of moving old or unused data to cloud storage (such as AWS S3), where it is much cheaper to store. Archiving IS NOT a means to quickly get more space on your lab share. If you are running into space problems, please see Getting Help and we can assist you.

Why is archiving important?

Storage costs money and can be very expensive. Data organization is paramount to ensuring that Wistar and your lab can continue work. Archiving can reduce cost and improve organization in your lab share.

Why do I have to do this?

There is a 4TB file size limit on your lab shares, so we need to break up large archives into separate "parts".
Our archival destination (AWS S3 - Glacier Instant Retrieval) bills every file as at least 128KB, so we need to consolidate small files into single large files.

What happens if my bundle fails or is too large?

If the bundle fails, check available disk space and permissions in your --output_dir and --log_dir directories.
If you manually created a .tar.gz larger than 4 TiB, re-run bundle so it can split the files into parts.

How do I restore data if I only have the `.tar.gz` parts?

Use the unbundle tool (see above). It will automatically detect and combine multi-part archives.

What if some parts are missing?

unbundle will alert you to any missing parts. Retrieve the missing files before attempting to restore.

My Job Failed with "cancelled"

You may have forgotten to use the -y, --yes flag with either command.
