
Batch Processing: The Serverless Way

Simple yet elegant batch processing with S3, Lambda, DynamoDB and SNS

We’ll quickly leaf through the preamble of what we intend to achieve and then talk about some of the ‘potholes’ to watch out for along the way. Hopefully, with the friendly AWS neighbourhood services, it’s going to be a pleasant ride.

The Solution Through A Peephole

Figure-1: Schematic solution

OK, so what’s going on here? A decent lot, actually. Here are the bullet points that define the solution:

  • The CSV files containing a batch of records are placed in a predefined (and pre-configured) S3 bucket (named inbound-file-drop in the diagram)
  • Did we say ‘pre-configured’? Yes, the S3 bucket needs to be pre-configured to forward S3 event notifications to a destination. In this case, it happens to be a Lambda function (batch_processing_lambda)
  • The Lambda function itself is configured with a resource based policy that allows S3 events to invoke the function
  • With these configurations in place, whenever a CSV file is placed (PUT) in the S3 bucket, S3 triggers an event notification to Lambda. Internally, the event is stored in an event queue which is part of the Lambda infrastructure and subsequently submitted to the Lambda function for processing
  • The Lambda function, seen in the above diagram sporting a cap and evidently looking quite dapper in it, is actually configured with a Lambda execution role (IAM role). This role allows it to retrieve the recently placed file from the S3 bucket and push records from the file into a DynamoDB table. Eventually, after it is satisfied with the processing of the batch, this role allows the Lambda function to send out some pleasant (or, in certain cases, not-so-pleasant) notifications

The Lambda function might also remove the files from inbound-file-drop after processing and move them to an archival-bucket and/or error-bucket, depending on the result of processing.

Additionally, the output of the batch process is snugly placed into a DynamoDB table. Needless to say, we could have employed API Gateway coupled with a Lambda function to create an API that could query the data from this table.

However, we will skip these steps in this post for the sake of brevity.

Devil Is In The Details

Alright, the stage is set to meet the devil and we will do so step-by-step (yeah, as if it could be done in a jiffy, right ?! 😏 ). We will assume that AWS CLI has been configured with a region code and appropriate credentials that have the required permissions to create and configure the components mentioned in the schematic solution (a role with AdministratorAccess permission is a good start but does not follow the ‘principle of least privilege’). Here’s the link on how to install AWS CLI version 2.

Once the installation is over, we can verify it and configure the credentials with the following commands:

$ aws --version   # Check the installation
$ aws configure # Configure the credentials under default profile

In the steps that follow, CLI commands use ‘<account-id>’ to represent the AWS Account ID. This will have to be replaced with the actual account identifier during execution. The region code will also have to be updated appropriately, if required.

A NOTE ON THE BATCH FILE

We assume the batch file has a Comma Separated Values (CSV) format and contains a header row along with data rows. For this example, we also comfortably assume a simple record structure with student Id, first name, last name and course name. Here’s a snapshot of a sample CSV conforming to the aforementioned structure.

student_id,fname,lname,course
A10001,Jack,Reacher,"Advanced Physical Security"
A10002,Sherlock,Holmes,"Science Of Deduction"
A10002,Sherlock,Holmes,"Modern Forensics"
A10003,Jane,Marple,"Knitting For Pros"

Things to keep in mind:

✓ Process and validate the CSV header row, if applicable.

✓ Understand how an empty value is represented in the CSV (NULL, “”, etc.).

✓ Remember to validate each record against a schema.

✓ Consider the maximum size of the batch file. Lambda may not be the best choice for processing very large files.
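The checks above can be sketched as a small parsing-and-validation helper. This is a hypothetical illustration, not the article’s actual implementation: the header names come from the sample file, and the rules (correct column count, no empty values) are assumptions.

```javascript
// Hypothetical validation helper for the sample CSV structure shown above.
// Header names and validation rules are assumptions, not a prescribed schema.

const EXPECTED_HEADER = ['student_id', 'fname', 'lname', 'course'];

// Minimal CSV line splitter; handles simple double-quoted fields only
// (no escaped quotes or embedded newlines).
function splitCsvLine(line) {
  const fields = [];
  let current = '';
  let inQuotes = false;
  for (const ch of line) {
    if (ch === '"') {
      inQuotes = !inQuotes;
    } else if (ch === ',' && !inQuotes) {
      fields.push(current);
      current = '';
    } else {
      current += ch;
    }
  }
  fields.push(current);
  return fields;
}

// Validates the header row and each data row; returns { records, errors }.
function parseBatchFile(text) {
  const lines = text.split(/\r?\n/).filter((l) => l.trim() !== '');
  const header = splitCsvLine(lines[0]);
  if (header.join(',') !== EXPECTED_HEADER.join(',')) {
    throw new Error(`Unexpected header: ${lines[0]}`);
  }
  const records = [];
  const errors = [];
  lines.slice(1).forEach((line, i) => {
    const fields = splitCsvLine(line);
    // Schema check: right number of columns, no empty values.
    if (fields.length !== EXPECTED_HEADER.length || fields.some((f) => f === '')) {
      errors.push({ row: i + 2, line });
      return;
    }
    const record = {};
    EXPECTED_HEADER.forEach((name, j) => { record[name] = fields[j]; });
    records.push(record);
  });
  return { records, errors };
}
```

Rejected rows are collected with their row numbers rather than aborting the whole batch, so a single bad record does not block the rest of the file.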

CREATE DYNAMODB TABLE

We will generally have to put some thought into the design of the DynamoDB table. Understanding the access patterns and selecting the PARTITION key and SORT key is very important. Furthermore, there might be a need to set specific read and write throughput. For this example, we assume that there will be two types of access patterns for this data-set.

✓ Fetch unique record by student_id and course

✓ Fetch all the courses attended by a student (identified by student_id)

We can easily observe that the primary key (to uniquely identify an item/record) for this data-set is a combination of student_id and course. Based on the access patterns, this makes student_id the PARTITION key and course the SORT key. With this knowledge, we can now create the DynamoDB table:

$ aws dynamodb create-table --table-name PRELIM_STUDENT_DATA \
--attribute-definitions \
AttributeName=student_id,AttributeType=S \
AttributeName=course,AttributeType=S \
--key-schema \
AttributeName=student_id,KeyType=HASH \
AttributeName=course,KeyType=RANGE \
--billing-mode PROVISIONED \
--provisioned-throughput \
ReadCapacityUnits=5,WriteCapacityUnits=5

Things to keep in mind: Select the partition or hash key (and optionally the sort or range key) based on access patterns.
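To see why this key choice satisfies both access patterns, here is a sketch of the two queries as DynamoDB DocumentClient-style parameter objects. Nothing here calls AWS; in a real Node.js program the objects would be passed to `new AWS.DynamoDB.DocumentClient().query(...)`. The function names are made up for illustration.

```javascript
// Sketch of the two access patterns as DocumentClient query parameters.
// The table and attribute names come from the create-table command above.

const TABLE_NAME = 'PRELIM_STUDENT_DATA';

// Access pattern 1: fetch the unique record for a student_id + course pair.
function uniqueRecordQuery(studentId, course) {
  return {
    TableName: TABLE_NAME,
    KeyConditionExpression: 'student_id = :sid AND course = :crs',
    ExpressionAttributeValues: { ':sid': studentId, ':crs': course },
  };
}

// Access pattern 2: fetch all courses attended by a student.
// Querying on the partition key alone returns every item in that partition,
// i.e. one item per course the student attends.
function allCoursesQuery(studentId) {
  return {
    TableName: TABLE_NAME,
    KeyConditionExpression: 'student_id = :sid',
    ExpressionAttributeValues: { ':sid': studentId },
  };
}
```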

CREATE SNS TOPIC AND SUBSCRIPTION

Create an SNS topic with the following command and subsequently create an Email subscription.

Note: We will have to confirm the subscription by visiting our mailbox

$ aws sns create-topic --name batch_processing_notification
$ aws sns subscribe \
--topic-arn arn:aws:sns:ap-south-1:<account-id>:batch_processing_notification \
--protocol email \
--notification-endpoint my-email@foobar.com

CREATE S3 BUCKET

It’s time to create the S3 bucket where the CSV files will be pushed for processing.

$ aws s3 mb s3://inbound-file-drop

CREATE LAMBDA FUNCTION AND EXECUTION ROLE

Before creating the Lambda function, let’s establish the Lambda execution role. First, however, we have to understand the difference between a trust policy and a policy document.

A ‘trust policy’ simply defines the trusted entities that can assume a role. In this case, the trusted entity is the Lambda service, which can perform the “sts:AssumeRole” action on the IAM role. Here’s the trust policy document:

>>>>>>>>>> File Name: trustpolicy.json <<<<<<<<<<
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Service": "lambda.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}

A ‘policy document’, on the other hand, defines which actions can be taken by the trusted entities that are allowed to assume the role. In this case, we will allow the Lambda function to perform the following actions:

✓ “s3:GetObject” on the inbound-file-drop S3 bucket

✓ “dynamodb:PutItem” on the specific DynamoDB table

✓ “sns:Publish” on the already created SNS topic

✓ Necessary permissions to log events into CloudWatch Logs

>>>>>>>>>> File Name: policy.json <<<<<<<<<<
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject"
      ],
      "Resource": [
        "arn:aws:s3:::inbound-file-drop/*"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "sns:Publish"
      ],
      "Resource": [
        "arn:aws:sns:ap-south-1:<account-id>:batch_processing_notification"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "dynamodb:PutItem"
      ],
      "Resource": [
        "arn:aws:dynamodb:ap-south-1:<account-id>:table/PRELIM_STUDENT_DATA"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "logs:CreateLogStream",
        "logs:CreateLogGroup",
        "logs:PutLogEvents"
      ],
      "Resource": [
        "arn:aws:logs:*:<account-id>:log-group:/aws/lambda/batch_processing_lambda:log-stream:*",
        "arn:aws:logs:*:<account-id>:log-group:/aws/lambda/batch_processing_lambda"
      ]
    }
  ]
}

Things to keep in mind: While we are going to use DynamoDB ‘PutItem’ to push the records into the database, ‘BatchWriteItem’ is often used for bulk data creation or deletion. However, we must understand the limits associated with BatchWriteItem. For example: individual items can be up to 400 KB max, and each call may comprise as many as 25 put or delete requests, etc.

Now we will create the IAM policy with the policy document defined by policy.json, followed by the IAM role with the trust policy defined by trustpolicy.json and finally attach the policy to the role.

$ aws iam create-policy --policy-name BatchProcessingLambdaPolicy --policy-document file://./policy.json    # Create policy
$ aws iam create-role --role-name BatchProcessingLambdaRole --assume-role-policy-document file://./trustpolicy.json  # Create role
$ aws iam attach-role-policy --role-name BatchProcessingLambdaRole --policy-arn arn:aws:iam::<account-id>:policy/BatchProcessingLambdaPolicy # Attach policy to the role

Having created the Lambda execution role, we can push forward in our journey and create the actual Lambda function.

Here’s a sample Lambda function that extracts the name of the bucket and CSV file from the S3 event notification. It then reads the file from the S3 bucket, transforms the records into JSON, and attempts to push them into a DynamoDB table. Finally, the function sends an SNS notification with some details about the particular batch run.

The environment variables need to be set correctly for this Lambda function to work.

$ aws lambda create-function \
--function-name batch_processing_lambda \
--runtime nodejs12.x \
--zip-file fileb://batch_processing_lambda.zip \
--handler index.handler \
--role arn:aws:iam::<account-id>:role/BatchProcessingLambdaRole

While we assume a simple implementation of the batch processing Lambda function in this post, there are several interesting questions to be answered when implementing a batch processing solution:

✓ Do we need ordered processing of the files? If so, what is the ordering key?

✓ Is there a need to process the files sequentially as they arrive? In other words, is concurrent processing by Lambda going to mess up the data?

✓ How much memory will be required for this processing?

✓ What should be the processing timeout?

Things to keep in mind: In case sequential processing is required, we will have to set the reserved concurrency of the Lambda function to 1. However, we must also understand that this reserved concurrency applies across all versions of the function.

ADD RESOURCE BASED POLICY TO LAMBDA FUNCTION

For the S3 bucket to be able to push event notifications to the Lambda function, the Lambda, in turn, should allow such events to invoke it. Hence, we add a resource-based policy to the Lambda function with the following command.

$ aws lambda add-permission --function-name batch_processing_lambda --principal s3.amazonaws.com \
--statement-id s3invoke --action "lambda:InvokeFunction" \
--source-arn arn:aws:s3:::inbound-file-drop

CONFIGURE S3 TO SEND EVENT NOTIFICATION

Next, we will configure the bucket to send S3 event notifications to the Lambda function whenever an object is ‘Put’ into it. For that, we will create a JSON file. We will additionally toss in a filter that checks whether the file has a ‘.csv’ extension and triggers the event notification only if the extension matches.

>>>>>>>>>> File Name: notification.json <<<<<<<<<<
{
  "LambdaFunctionConfigurations": [
    {
      "LambdaFunctionArn": "arn:aws:lambda:ap-south-1:<account-id>:function:batch_processing_lambda",
      "Events": [
        "s3:ObjectCreated:Put"
      ],
      "Filter": {
        "Key": {
          "FilterRules": [
            {
              "Name": "suffix",
              "Value": ".csv"
            }
          ]
        }
      }
    }
  ]
}

Things to keep in mind: If the name of the file to be processed has embedded spaces, S3 will encode each space as a ‘+’ character in the object key of the event notification, so the key must be decoded before fetching the object.
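A tiny helper like the following (the name decodeS3Key is made up for illustration) is commonly used inside the Lambda to undo that encoding before calling GetObject:

```javascript
// The object key in an S3 event notification is URL-encoded, and spaces
// arrive as '+'. Decode before calling s3.getObject, or the lookup will
// fail for file names containing spaces or non-ASCII characters.
function decodeS3Key(rawKey) {
  return decodeURIComponent(rawKey.replace(/\+/g, ' '));
}
```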

Finally, we will associate the notification JSON with the S3 bucket (inbound-file-drop) and we are done with the setup.

$ aws s3api put-bucket-notification-configuration --bucket inbound-file-drop --notification-configuration file://notification.json

Things to keep in mind: The S3 bucket should have appropriate bucket policies in Production environment.

Conclusion

For complex batch processing capabilities in the AWS world, AWS Batch might be a more apt choice, specifically for batches with longer processing times. On the other hand, if one is already using a batch framework, then Amazon EKS or ECS would be a good choice.

However, for simple batch processing requirements, well known AWS services like S3, Lambda, DynamoDB & SNS could be knitted together to create a robust solution.

Happy learning…

Solutions architect by profession, programmer by passion and photographer by choice…