Intelligent Retrieval Techniques for Archival Data from Amazon Glacier

Dipayan Das
9 min read · Mar 6, 2024


The purpose of this blog is to illuminate effective archival strategies and intelligent methods for retrieving data from archives. I'll illustrate these concepts through an imaginary use case on the AWS platform. Familiarity with AWS services and a grasp of data archival principles are prerequisites for this article, so I encourage you to refresh your basics before delving into the discussion. To facilitate comprehension, the article is divided into six sections for easy navigation.

· Section 1: Archival Search Request

· Section 2: Data Retrieval from Archival Storage

· Section 3: Consolidation of Data from Archival Storage Search

· Section 4: Archival Storage Search Alert

· Section 5: Audit Tables for Archival Storage Search

· Section 6: Monthly Archival Storage Data Search Limit

Use Case: The client is a network service provider that wants to archive its data in a low-cost cold storage solution and enable intelligent search on it. The solution is described under the following assumptions:

  • AWS events stored in S3 buckets that are older than 365 days are transferred to AWS Glacier to control storage cost.
  • An archival search request is made to download data from AWS Glacier based on input from the user's prompt, and the retrieved data is analyzed to provide the response.
  • A generated cold storage search request contains CustomerID, Start Date, End Date, Search Parameters and User Email.
  • Retrieving data from cold storage in AWS Glacier takes about 4–12 hours under the Standard tier.
  • The monthly permissible limit of archival storage searches is based on the customer profile. If the limit is exceeded, the customer is notified through email.
  • The DynamoDB table Glacier Search Catalog captures all search-related requests made by the customer.

The following artifacts must be created before the archival search request can be demonstrated.

Step 1: Create Glacier Vault

○ Each customer vault in Glacier follows the schema outlined below.

○ Every customer onboarded to Threat Manager has its own Glacier vault.

○ Archives are created in these customer-specific vaults every day.

○ The metadata related to the archives is stored in DynamoDB.
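As a minimal sketch of this step, the customer-specific vault can be created with boto3. The "<CustomerID>-glacier-vault" naming follows the WB-glacier-vault example used later in this post; everything else here is illustrative.

    # Minimal sketch: create a customer-specific Glacier vault with boto3.
    # The "<CustomerID>-glacier-vault" naming mirrors the "WB-glacier-vault" example below.
    import boto3

    glacier = boto3.client("glacier")

    def create_customer_vault(customer_id: str) -> str:
        vault_name = f"{customer_id}-glacier-vault"
        # accountId="-" means "the account that owns the current credentials"
        glacier.create_vault(accountId="-", vaultName=vault_name)
        return vault_name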

Step 2: Saving Data to Archival Storage

○ The process for moving events from the S3 bucket to a customer-specific vault in Glacier is outlined below.

○ The important thing to note here is that every insertion into Glacier returns an ArchiveID, which is then stored in a Glacier Index Store in S3.
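A minimal sketch of this flow, assuming the day's events already sit as a single S3 object; bucket and key names are placeholders. The point to note is the archiveId returned by upload_archive, which is what later lands in the Glacier Index Store.

    # Sketch: move one day's event object from S3 into the customer's vault.
    # Bucket and key names are illustrative.
    import boto3

    s3 = boto3.client("s3")
    glacier = boto3.client("glacier")

    def archive_s3_object(events_bucket: str, key: str, vault_name: str) -> str:
        body = s3.get_object(Bucket=events_bucket, Key=key)["Body"].read()
        response = glacier.upload_archive(accountId="-", vaultName=vault_name, body=body)
        # The returned ArchiveID is persisted in the Glacier Index Store (see Steps 3-5).
        return response["archiveId"]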

Step 3: Save the index for each archive in the Glacier Index Store S3 bucket

Step 4: Save the index dataset to S3 using the following prefix:

Glacier_Index_store/<Customer_ID>/<Glacier-ID>/<Year>/<Month>/<Day>/<Archive_ID>index.spark.dataset.csv

Step 5: Save the index data to DynamoDB with the elements Customer_ID, Glacier-ID, Archive_ID and Archive_Date, as sketched below.
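Under the same assumptions (the index CSV has already been generated, and the attribute names follow Step 5), Steps 3 to 5 amount to one S3 put and one DynamoDB put:

    # Sketch of Steps 3-5: write the per-archive index CSV to the Glacier Index Store
    # bucket and record the metadata in the Glacier_Customer_Catalog table.
    # Attribute names are illustrative; archive_date is a datetime.date.
    import boto3

    s3 = boto3.client("s3")
    catalog = boto3.resource("dynamodb").Table("Glacier_Customer_Catalog")

    def save_archive_index(index_bucket, customer_id, glacier_id, archive_id,
                           archive_date, index_csv_bytes):
        key = (f"Glacier_Index_store/{customer_id}/{glacier_id}/"
               f"{archive_date:%Y/%m/%d}/{archive_id}index.spark.dataset.csv")
        s3.put_object(Bucket=index_bucket, Key=key, Body=index_csv_bytes)

        catalog.put_item(Item={
            "Customer_ID": customer_id,
            "Glacier-ID": glacier_id,
            "Archive_ID": archive_id,
            "Archive_Date": archive_date.isoformat(),
        })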

Section 1: Archival Search Request

This section highlights all the AWS services involved in a cold storage search request. The AWS services used for this functionality are Glacier, Athena, Lambda, S3, Glue, Simple Email Service (SES) and DynamoDB. The approach is explained step by step below.

Step 1: Archival Search Request Generation

· AWS Lambda is used to generate the Glacier search. Let's assume the Lambda function is named "searchEventArchives" for the explanation in this blog. This Lambda function has 4 input parameters: CustomerID, Start Date, End Date and Search Parameters. CustomerID is the unique identifier of the customer whose data is to be searched in the Glacier archive. Start Date and End Date define the date range for the required archival data. Search Parameters are fields that exist in the archival data file, for example source_ip, target_ip, source_hostname or any other searchable field; this field is not mandatory for invoking the Glacier search. A list of example Lambda environment variables is given below to make this easier to understand.

○ ATHENA_DATABASE: Name of the Athena database

○ ATHENA_OUTPUT_LOCATION: S3 location where Athena query results are written

○ GLACIER_CATALOG_TABLE: Name of the DynamoDB table where search-related information is captured

○ GLACIER_INDEX_STORE_BUCKET_NAME: Bucket name where glacier index store exists

○ GLACIER_OUTPUT_LOCATION: Bucket name where the cold storage output (from Glacier) will be stored

· Every search request creates a unique SearchID

· A row is inserted into the catalog DynamoDB table with all search-related information, as sketched below
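A minimal sketch of what "searchEventArchives" could look like. The attribute names, the "Pending" status value and the use of a UUID for the SearchID are illustrative choices rather than the exact implementation.

    # Sketch of the "searchEventArchives" handler: generate a SearchID and log the
    # request in the catalog table named by GLACIER_CATALOG_TABLE.
    import os
    import uuid
    from datetime import datetime, timezone

    import boto3

    dynamodb = boto3.resource("dynamodb")

    def lambda_handler(event, context):
        search_id = str(uuid.uuid4())
        catalog = dynamodb.Table(os.environ["GLACIER_CATALOG_TABLE"])
        catalog.put_item(Item={
            "CustomerID": event["CustomerID"],
            "SearchID": search_id,
            "StartDate": event["StartDate"],
            "EndDate": event["EndDate"],
            "SearchParameters": event.get("SearchParameters"),  # optional field
            "UserEmail": event["UserEmail"],
            "RequestStatus": "Pending",
            "RequestTime": datetime.now(timezone.utc).isoformat(),
        })
        return {"SearchID": search_id}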

Step 2: Find Archives from Glacier Index Store

  • Index files are created when an archive is uploaded to Glacier
  • Archive IDs are identified from the Glacier Index Store in S3
  • Each index file captures the event data as distinct combinations of 9 fields that can be used for searching
  • The index file is stored in S3 with the prefix below:

CustomerID/VaultName/Year/Month/Day/ArchiveID

For example: WB/WB-glacier-vault/2024/03/02/ACH101

  • Create an Athena table over the location GlacierIndexBucket/CustomerID
  • Execute an Athena query that extracts the ArchiveID from the "$path" of every index file containing data that matches the search parameters, with the day portion of "$path" falling between the start date and the end date (a sketch follows)
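A sketch of that Athena query, issued from Python. The table name glacier_index_store, the database name, the results bucket and the regexes that parse "$path" are assumptions about the index layout from Step 4; "$path", regexp_extract and date_parse are standard Athena/Presto features.

    # Sketch: use Athena's "$path" pseudo-column to recover the ArchiveID and the day
    # partition of each matching index file. Table, database, bucket and column names
    # are illustrative.
    import boto3

    athena = boto3.client("athena")

    QUERY = """
    SELECT DISTINCT
        regexp_extract("$path", '([^/]+)index\\.spark\\.dataset\\.csv$', 1) AS archive_id
    FROM glacier_index_store
    WHERE source_ip = '10.0.0.12'                            -- example search parameter
      AND date_parse(regexp_extract("$path", '(\\d{4}/\\d{2}/\\d{2})', 1), '%Y/%m/%d')
          BETWEEN date '2024-03-01' AND date '2024-03-02'    -- start date / end date
    """

    response = athena.start_query_execution(
        QueryString=QUERY,
        QueryExecutionContext={"Database": "glacier_db"},                    # ATHENA_DATABASE
        ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},   # ATHENA_OUTPUT_LOCATION
    )
    print(response["QueryExecutionId"])  # poll get_query_execution / get_query_results next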

Step 3: Generate Glacier Select Query

· Glacier Select allows you to query data directly from Glacier

· Glacier has multiple vaults, and one vault consists of multiple archives

· A Glacier Select query is triggered at the archive level

· A Glacier Select query can be issued for every ArchiveID stored in the Glacier Index Store
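A sketch of issuing one such job per ArchiveID with boto3's initiate_job. The SQL expression, the CSV serialization settings and the output prefix are illustrative; the Standard tier matches the 4–12 hour retrieval window mentioned earlier.

    # Sketch: start a Glacier Select job for a single archive; Glacier writes the output
    # under CustomerID/SearchID/ArchiveID/JobID/ in the Glacier Output bucket.
    import boto3

    glacier = boto3.client("glacier")

    def start_glacier_select(vault_name, archive_id, output_bucket, output_prefix):
        response = glacier.initiate_job(
            accountId="-",
            vaultName=vault_name,
            jobParameters={
                "Type": "select",
                "ArchiveId": archive_id,
                "Tier": "Standard",  # 4-12 hours; "Expedited" is possible for urgent searches
                "SelectParameters": {
                    "ExpressionType": "SQL",
                    # Example filter on a positional CSV column; adapt to the search parameters
                    "Expression": "SELECT * FROM archive WHERE _1 = '10.0.0.12'",
                    "InputSerialization": {"csv": {"FileHeaderInfo": "NONE"}},
                    "OutputSerialization": {"csv": {}},
                },
                "OutputLocation": {
                    "S3": {"BucketName": output_bucket, "Prefix": output_prefix}
                },
            },
        )
        return response["jobId"]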

Step 4: Output Folder Creation

  • The Glacier output structure is created in the Glacier Output S3 bucket with the prefix below:

§ CustomerID/SearchID/ArchiveID/JobID/

  • result_manifest.txt, job.txt and a ‘results’ folder are created within the above prefix as a result of the Glacier Select query
  • job.txt is generated when the cold storage search request is created.
  • result_manifest.txt and the ‘results’ folder are created when Glacier produces the search result. It takes 4–12 hours to get data from Glacier in the Standard tier; you can also use expedited retrieval depending on the use case.

Section 2: Data Retrieval from Archival Storage

· A search request can produce Glacier result files from multiple archives

· The result files are copied into an ArchiveID folder

· These files, which sit under multiple ArchiveID folders, need to be moved under one SearchID. Let’s assume a Lambda function named “searchArchivesFileConsolidation” does this job; it is explained below:

o result_manifest.txt contains the paths of the files created by the cold storage search.

o The cold storage search results are stored in the ‘results’ folder

o Multiple files can be downloaded as the result of a Glacier Select query for one archive, depending on the size of the archive

o Once a file is created within the ‘results’ folder, the same Lambda function, searchArchivesFileConsolidation, automatically copies it from that location to the location below:

o CustomerID/SearchID/Consolidated

o All files for a particular search, coming from different archives, are stored in the above location, and the consolidation process is carried out after this (see the sketch below).
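A minimal sketch of "searchArchivesFileConsolidation", assuming it is triggered by S3 object-created notifications on the results/ prefix; prefixing the copied file name with its ArchiveID is one simple way to avoid name collisions between archives.

    # Sketch: on each object created under .../results/, copy it to the consolidated
    # prefix of its SearchID. Key layout follows Step 4:
    # CustomerID/SearchID/ArchiveID/JobID/results/<file>
    import urllib.parse

    import boto3

    s3 = boto3.client("s3")

    def lambda_handler(event, context):
        for record in event["Records"]:
            bucket = record["s3"]["bucket"]["name"]
            key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
            parts = key.split("/")
            customer_id, search_id, archive_id = parts[0], parts[1], parts[2]
            dest_key = f"{customer_id}/{search_id}/Consolidated/{archive_id}_{parts[-1]}"
            s3.copy_object(Bucket=bucket, Key=dest_key,
                           CopySource={"Bucket": bucket, "Key": key})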

Section 3: Consolidation of Data from Archival Storage Search

· A search request could output Glacier files from multiple archives.

· The result files from Glacier are copied under a customer-specific ArchiveID folder and eventually moved to a common location under CustomerID/SearchID/Consolidated

· Although these result files sit under one SearchID, there can be multiple result files, and they need to be consolidated before being sent to the customer who initiated the search request.

· Let’s consider another Lambda function, “searchAlertArchives”, developed to consolidate the Glacier result files and send one file to the customer. This Lambda function is scheduled every 5 minutes and performs the steps below.

First run:

o Checks DynamoDB and extracts all SearchIDs where “RequestStatus” is “Pending”

o For every extracted SearchID, checks whether the data retrieval process is complete.

o Creates a tabular view over the path CustomerID/SearchID/Consolidated using an Athena DDL query.

o Once table creation is complete, it runs an AWS Glue ETL job with the details of the newly created table as input and retrieves the Glue job run ID in return (see the sketch after the first-run steps). It also performs the modifications below in DynamoDB:

§ Changes “RequestStatus” to “FileConsolidationStarted”

§ Adds an attribute “GlueJobID” to the DynamoDB item with its value set to the Glue job run ID retrieved in the step above

§ Adds another attribute, “GlueJobStatus”, with the value “RUNNING”

o Meanwhile, the Glue job runs in the background and stores the consolidated output in S3 based on the conditions below:

§ If the data retrieved from Glacier is up to 100 GB: 1 single consolidated output file

§ If the data retrieved from Glacier is between 100 GB and 300 GB: 5 partitioned output files

§ If the data retrieved from Glacier is more than 300 GB: 10 partitioned output files

o The path where the Glue job stores the file(s) is CustomerID/SearchID/UnifiedOutput
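A sketch of the Glue hand-off in the first run. The Glue job name, its arguments and the catalog key schema are assumptions made for illustration.

    # Sketch: start the Glue consolidation job for a pending SearchID and record the
    # run in the search catalog table. Names and key schema are illustrative.
    import boto3

    glue = boto3.client("glue")
    catalog = boto3.resource("dynamodb").Table("Glacier_Search_Catalog")

    def start_consolidation(customer_id, search_id, athena_table):
        run = glue.start_job_run(
            JobName="glacier-file-consolidation",  # assumed Glue ETL job name
            Arguments={
                "--source_table": athena_table,
                "--output_prefix": f"{customer_id}/{search_id}/UnifiedOutput/",
            },
        )
        catalog.update_item(
            Key={"CustomerID": customer_id, "SearchID": search_id},
            UpdateExpression="SET RequestStatus = :s, GlueJobID = :j, GlueJobStatus = :g",
            ExpressionAttributeValues={
                ":s": "FileConsolidationStarted",
                ":j": run["JobRunId"],
                ":g": "RUNNING",
            },
        )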

Second and all further runs:

o From the second run onwards, before performing the above steps on new search IDs, it performs an additional check on the search IDs for which a Glue job execution was started in the previous run:

§ First, it retrieves the list of search IDs for which a Glue job execution was started in the previous run.

§ It checks the current status of that Glue job using the Glue job run ID that was added to the search item in DynamoDB (see the sketch below).

§ The Glue job run statuses of interest are “SUCCEEDED”, “FAILED” and “RUNNING”

§ If the status is “SUCCEEDED”, the Glue job has run successfully for that search ID and its consolidated file(s) are available in S3.

§ It then changes “RequestStatus” to “FileConsolidationCompleted” and “GlueJobStatus” to “SUCCEEDED”
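A sketch of the status check performed on later runs, reusing the Glue job run ID recorded earlier; the job name and key schema are the same assumptions as in the previous sketch.

    # Sketch: poll the recorded Glue run and, once it has SUCCEEDED, flag the search as
    # consolidated so that the alert step (Section 4) can pick it up.
    import boto3

    glue = boto3.client("glue")
    catalog = boto3.resource("dynamodb").Table("Glacier_Search_Catalog")

    def check_consolidation(customer_id, search_id, glue_job_run_id):
        state = glue.get_job_run(JobName="glacier-file-consolidation",
                                 RunId=glue_job_run_id)["JobRun"]["JobRunState"]
        if state == "SUCCEEDED":
            catalog.update_item(
                Key={"CustomerID": customer_id, "SearchID": search_id},
                UpdateExpression="SET RequestStatus = :s, GlueJobStatus = :g",
                ExpressionAttributeValues={":s": "FileConsolidationCompleted",
                                           ":g": "SUCCEEDED"},
            )
        return state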

Section 4: Archival Storage Search Alert

· Once the status changes explained in Section 3 are made in DynamoDB, the function sends the path of the consolidated file(s) to the user in an email, as sketched below.

· After sending the email, it changes “RequestStatus” to “ProcessComplete”.
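A sketch of the alert itself. The sender address, subject and message body are placeholders, and the sender must be a verified SES identity.

    # Sketch: email the consolidated-output location to the requesting user via SES,
    # then mark the request as complete.
    import boto3

    ses = boto3.client("ses")
    catalog = boto3.resource("dynamodb").Table("Glacier_Search_Catalog")

    def send_search_alert(customer_id, search_id, user_email, output_bucket):
        path = f"s3://{output_bucket}/{customer_id}/{search_id}/UnifiedOutput/"
        ses.send_email(
            Source="no-reply@example.com",  # must be a verified SES identity (placeholder)
            Destination={"ToAddresses": [user_email]},
            Message={
                "Subject": {"Data": f"Archival search {search_id} is complete"},
                "Body": {"Text": {"Data": f"Your consolidated search results are available at {path}"}},
            },
        )
        catalog.update_item(
            Key={"CustomerID": customer_id, "SearchID": search_id},
            UpdateExpression="SET RequestStatus = :s",
            ExpressionAttributeValues={":s": "ProcessComplete"},
        )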

Section 5: Audit Tables for Archival Storage Search

· The DynamoDB audit tables involved in cold storage data upload and search are listed below:

o Glacier_Customer_Catalog: captures archive-related information. Its key schema is:

Hash Key: CustomerID

Range Key: Archive Date

o Glacier_Search_Catalog: captures all search-related information and indicates the status of each search request. Its key schema is:

Hash Key: CustomerID

Range Key: Archive Date

o Search_Result_Master: keeps track of the monthly cold storage search volume for each customer. Its key schema is:

Hash Key: CustomerID

Range Key: Search Month Year
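For completeness, a sketch of how one of these audit tables could be provisioned; the attribute types, the attribute name spelling and the on-demand billing mode are assumptions.

    # Sketch: provision the Glacier_Customer_Catalog audit table with the hash/range
    # keys listed above. Attribute types and billing mode are illustrative.
    import boto3

    dynamodb = boto3.client("dynamodb")

    dynamodb.create_table(
        TableName="Glacier_Customer_Catalog",
        KeySchema=[
            {"AttributeName": "CustomerID", "KeyType": "HASH"},
            {"AttributeName": "ArchiveDate", "KeyType": "RANGE"},
        ],
        AttributeDefinitions=[
            {"AttributeName": "CustomerID", "AttributeType": "S"},
            {"AttributeName": "ArchiveDate", "AttributeType": "S"},
        ],
        BillingMode="PAY_PER_REQUEST",
    )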

Section 6: Monthly Archival Storage Data Search Limit

  • Step 1: Search all the archives in the Glacier Index Store on the basis of the start date, end date and search parameters
  • Step 2: Estimate the number of rows involved in a particular search by reading the ‘count’ value from the index file of each archive involved in the search

· Note: The count value represents the number of distinct combinations of the 9 searchable fields (source_ip, target_ip, source_hostname, target_hostname, source_mac, target_mac, user_source, device_name, user_target) in an index file

  • Step 3: Estimate the data size involved in a particular search by multiplying the number of rows computed in Step 2 by 1.2 KB

· Note: The approximate row size for a single event is 1.2 KB

  • Step 4: Check whether a row exists in the Search_Result_Master DynamoDB table for that customer and month-year
  • Step 5: If no row exists in the Search_Result_Master table, insert one with the estimated data size for that search as computed in Step 3
  • Step 6: If a row exists in the Search_Result_Master table, check whether the value of the Estimated Search Data Size (GB) attribute is more than 500 GB
  • Step 7: If it is less than 500 GB, update the Estimated Search Data Size (GB) attribute by adding the estimated data size for the current search (computed in Step 3) to the existing value in the table
  • Note: Assume the estimated data size for the current search is 2 GB
  • Step 8: If it is more than 500 GB, send an email to the user address captured during the search request, notifying them that the monthly archival search limit has been exceeded (a sketch of Steps 2–8 follows)
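A minimal sketch of Steps 2 through 8 under the stated assumptions (1.2 KB per row, a 500 GB monthly cap); the attribute names in Search_Result_Master are illustrative, and the notification email itself would be sent via SES as in Section 4.

    # Sketch: estimate the search size from the per-archive index counts and accumulate
    # it in Search_Result_Master, refusing the search once the 500 GB monthly cap is hit.
    from decimal import Decimal

    import boto3

    master = boto3.resource("dynamodb").Table("Search_Result_Master")

    ROW_SIZE_KB = 1.2       # approximate size of one event row
    MONTHLY_LIMIT_GB = 500  # permissible archival search volume per month

    def check_and_update_quota(customer_id, month_year, index_row_counts):
        estimated_gb = sum(index_row_counts) * ROW_SIZE_KB / (1024 * 1024)  # KB -> GB
        item = master.get_item(Key={"CustomerID": customer_id,
                                    "SearchMonthYear": month_year}).get("Item")
        used_gb = float(item["EstimatedSearchDataSizeGB"]) if item else 0.0
        if used_gb > MONTHLY_LIMIT_GB:
            return False  # Step 8: caller emails the user that the monthly limit is exceeded
        master.put_item(Item={  # Steps 5 and 7: insert or overwrite the running total
            "CustomerID": customer_id,
            "SearchMonthYear": month_year,
            "EstimatedSearchDataSizeGB": Decimal(str(round(used_gb + estimated_gb, 3))),
        })
        return True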

Written by Dipayan Das

Dipayan is a Big Data Architect who is passionate about enhancing quality of life through technology. He holds a Master's degree in AI and Robotics.
