How to create a serverless app with AWS SAM for Big Data Handling
A developer journey through the pros and cons of AWS Lambda functions, SQS, DynamoDB, API Gateway, and CloudFormation (with SAM templates), and the real power of Lambda Powertools and Pydantic.
Introduction
Have you ever had to manage big data, such as system logs? No worries — AWS SAM and DynamoDB are here to help you. That is exactly how this article came about: we had system logs stored in a MySQL server, and one of our clients complained that the logs listing page in the administration panel took too long to load (there were about one million logs). MySQL wasn't the right way to store this data, so we decided to separate the logs from the main application and find a better solution.
Big data can be handled in many different ways — serverless, or with a dedicated server that processes the requests. But here (just like in every other task) the real question is "What exactly do we need?". This question is crucial because it can push us in the right direction. There is a lot of hidden work behind the scenes — we don't want unnecessary services to maintain, as they would make the task and the application too complex (Elasticsearch, for example). At the same time, we want to support all of the requested functionality — searching, sorting, filtering, writing, etc.
Structure
Used resources:
- DynamoDB — for storing the logs
- Lambda functions — for reading and writing logs
- Simple Queue Service — for triggering the “logs writing” lambda function; two queues — LogsQueue is the main queue, DeadLetterQueue is the “failed jobs” queue
- API Gateway — for triggering the “logs reading” lambda function
- CloudFormation (via the Serverless Application Model templates) — for creating and managing all resources in the stack
The main application creates a job that contains the log information in the SQS Queue (LogsQueue). Then, the LogsQueue triggers a lambda function that writes the logs to the database (a DynamoDB table called LogsTable). As for the logs listing and filtering — there is an API Gateway with two available routes: “/” and “/{uuid}” associated with two lambda functions respectively — LogsReader (for retrieving and filtering all logs) and LogReader (for retrieving only a specific log using its UUID). These two functions read directly from the database. The whole structure is illustrated below.
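To make that flow concrete, here is a rough sketch of what the producer side could look like in the main application. The queue URL and payload fields here are illustrative, not taken from the real project:

import json

import boto3

# Illustrative queue URL; the real one comes from the deployed stack
LOGS_QUEUE_URL = "https://sqs.eu-west-1.amazonaws.com/123456789012/LogsQueue"

sqs = boto3.client("sqs")

def push_log(log: dict) -> None:
    """Create a "log writing" job by sending the log entry to LogsQueue."""
    sqs.send_message(QueueUrl=LOGS_QUEUE_URL, MessageBody=json.dumps(log))

push_log({
    "account_id": 42,
    "user_id": 7,
    "type": "auth",
    "sub_type": "login",
    "url": "/admin/login",
    "payload": "login form submitted",
})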
SQS Queues
The Logs serverless application uses two queues — one main queue (LogsQueue) and one dead-letter queue (DeadLetterQueue). The main queue is the bridge that connects the main application and the lambda function for storing the logs in the database.
Sometimes messages can't be processed, for a variety of reasons: a bad condition in the producer or consumer application, or an unanticipated state change that interferes with your application's code. Such messages are forwarded to the dead-letter queue once they exceed the maximum number of delivery attempts (the maxReceiveCount in the queue's redrive policy), so you can run the processing again later. Dead-letter queues are also helpful for debugging your application, because they let you isolate unconsumed messages and figure out why their processing fails.
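Under the hood, this pairing is a redrive policy on the main queue. In the project it is wired up through the SAM template, but the equivalent boto3 call (with illustrative ARNs and count) shows what is actually being configured:

import json

import boto3

sqs = boto3.client("sqs")

# Illustrative values; in the project both queues are created by the SAM template
logs_queue_url = "https://sqs.eu-west-1.amazonaws.com/123456789012/LogsQueue"
dlq_arn = "arn:aws:sqs:eu-west-1:123456789012:DeadLetterQueue"

# After five failed processing attempts, SQS moves the message to the DLQ
sqs.set_queue_attributes(
    QueueUrl=logs_queue_url,
    Attributes={
        "RedrivePolicy": json.dumps({
            "deadLetterTargetArn": dlq_arn,
            "maxReceiveCount": "5",
        })
    },
)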
Lambda Functions
With the compute service Lambda, you can run code without setting up or maintaining servers. Additionally, you can use layers to organise your app and, even better, reuse those layers across lambda functions. By doing this, you can separate the core code components (helpers, DB/storage connections, Pydantic models, etc.) and reuse them throughout all lambda functions in the application without duplicating code.
I’ve created one base layer (UtilsLayer) where I’ve put the DynamoDB connection and pagination clients (from boto3), the base models that the functions will inherit, helper functions, etc. Pydantic and Lambda Powertools played a big role here.
Lambda Powertools and Pydantic — the best combination for lambda function data handling
This is an awesome collection of Python tools for AWS Lambda functions that makes it easier to implement best practices like tracing, structured logging, validation, event parsing, and many more. The automatic event parsing is just fantastic — nice syntax, quick validation, and data management. Lambda Powertools can be used as a Python package or directly as a layer in the application. Another very powerful aspect of this library is that it supports Pydantic. This raises the level of the application structure dramatically — you can implement the whole validation with just one decorator (@event_parser). Feel free to check their documentation, it is worth it.
Here is an example of how Lambda Powertools and Pydantic are used together in the Logs Serverless Application.
You can see the base LogModel with all of its fields declared. It is located in the Utils Layer since all functions will use it.
from datetime import datetime
from typing import Optional
from uuid import UUID, uuid4

from aws_lambda_powertools.utilities.parser import BaseModel, Field


class LogModel(BaseModel):
    id: UUID = Field(default_factory=uuid4)
    account_id: int = Field(gt=-1)
    user_id: int = Field(gt=-1)
    type: str = Field(min_length=1)
    sub_type: str = Field(min_length=1)
    url: str = Field(min_length=1)
    payload: Optional[str] = Field(min_length=1)
    submitter_id: Optional[str]
    submitter_country: Optional[str]
    submitter_city: Optional[str]
    submitter_platform: Optional[str]
    submitter_browser: Optional[str]
    submitter_agent: Optional[str]
    created_at: datetime = Field(default_factory=datetime.now)

    def to_dict(self, *args, **kwargs):
        data = self.dict(*args, **kwargs)
        data['id'] = data['id'].hex
        # account_id and user_id are currently BigInt in MySQL;
        # convert them to strings for future DB structure updates
        if isinstance(data['account_id'], int):
            data['account_id'] = str(data['account_id'])
        if isinstance(data['user_id'], int):
            data['user_id'] = str(data['user_id'])
        data['created_at'] = data['created_at'].isoformat()
        # Internal fields for the GSIs
        data['account_id#type'] = f'{data["account_id"]}#{data["type"]}'
        data['status'] = 'OK'
        return data
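To illustrate (with made-up values), constructing a model and serialising it for DynamoDB looks like this:

log = LogModel(
    account_id=42,
    user_id=7,
    type="auth",
    sub_type="login",
    url="/admin/login",
    payload="login form submitted",
)
item = log.to_dict()
# item["id"] is now a hex string, item["account_id#type"] == "42#auth",
# and item["status"] == "OK", so the dict is ready for put_item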
And this is the LogsWriterFunction input validation model. It looks simple, doesn’t it?
from typing import List

from aws_lambda_powertools.utilities.parser.models import SqsModel, SqsRecordModel

from models import LogModel


class Params(SqsRecordModel):
    body: LogModel


class WriteLogModel(SqsModel):
    Records: List[Params]
And now comes the best part — the handler method (LogsWriterFunction). The whole complex validation logic happens behind the scenes. The code is shorter, simpler, and nicely structured.
from aws_lambda_powertools.utilities.parser import event_parser
from aws_lambda_powertools.utilities.typing import LambdaContext

# WriteLogModel is the validation model above; save_log is the DB helper


@event_parser(model=WriteLogModel)
def lambda_handler(event: WriteLogModel, context: LambdaContext):
    for record in event.Records:
        save_log(record.body)
    return {"statusCode": 200}
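The save_log helper itself isn't shown in the article; assuming the LogsTable created by the stack, a minimal sketch of it could be as simple as:

import boto3

from models import LogModel

# Illustrative table name; in the project it would come from the SAM template
dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("LogsTable")

def save_log(log: LogModel) -> None:
    """Persist one validated log entry to DynamoDB."""
    table.put_item(Item=log.to_dict())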
DynamoDB
Probably the most complex and hard-to-research part was the logs filtering. Unless you add Elasticsearch as an extra service on top of DynamoDB, this database doesn't offer many options for searching (or at least, not many efficient ones). Yes, you can search using the Scan operation instead of Query, but it's slow and not recommended for large amounts of data. The real power of DynamoDB is its storage partitioning — it's ideal for storing big data.
In this project, we benefit from a feature called "Global Secondary Index", or "GSI" for short. It saved us from creating a new Elasticsearch instance, which would have been more expensive and would have required maintenance. These indexes are a powerful tool for handling "not too complex" filtering cases.
Each global secondary index requires a partition key and an optional sort key, and the index key schema can differ from the base table schema: you can create an index with a composite (partition + sort) primary key on a table that has a simple primary key, or the other way around. Every GSI maintains an internal copy of the main table's data, with the requested fields as partition and sort keys, so you can search and filter (by the sort key) very quickly. A query sketch using these indexes follows the list below.
GSIs
- AccountIndex — used for filtering by account
  * Partition key: account_id
  * Sort key: created_at
- TypeIndex — used for filtering by type
  * Partition key: type
  * Sort key: created_at
- AccountTypeIndex — used for filtering by account and type simultaneously
  * Partition key: account_id#type
  * Sort key: created_at
- SortingIndex — used for sorting all available logs in the "all logs" API endpoint. The Scan operation cannot sort the logs because they live in different partitions, so this index's sort key is the only way to "cheat" and sort them. The approach has plenty of cons, and because all the data ends up in one place, it's recommended to use it with a limit and pagination.
  * Partition key: status (set to "OK" for all records, so all records land in the same partition)
  * Sort key: created_at
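Querying one of these indexes with boto3 then looks roughly like this sketch (the index and field names are the ones above; the account value is illustrative):

import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("LogsTable")

# Newest-first logs for a single account, served by the AccountIndex GSI
response = table.query(
    IndexName="AccountIndex",
    KeyConditionExpression=Key("account_id").eq("42"),
    ScanIndexForward=False,  # created_at (the sort key) in descending order
    Limit=50,
)
logs = response["Items"]
# response.get("LastEvaluatedKey"), if present, is the cursor for the next page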
Other fields
- id — UUID V4
- account_id — string
- user_id — string
- type — string
- sub_type — string
- url — string
- payload — string/JSON
- submitter_id — string, optional
- submitter_country — string, optional
- submitter_city — string, optional
- submitter_platform — string, optional
- submitter_browser — string, optional
- submitter_agent — string, optional
- created_at — string, ISO 8601
AWS SAM — A CloudFormation Template Translator
The AWS Serverless Application Model (SAM) is an open-source framework for developing serverless apps. It offers a straightforward syntax for defining functions, APIs, databases, and event source mappings. You can define and model the application you want in YAML with just a few lines per resource. You can build serverless applications more quickly because SAM expands and translates the SAM syntax into AWS CloudFormation syntax during deployment.
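For a taste of the syntax, here is a trimmed-down, illustrative template fragment (not the project's actual one) wiring the writer function to its queue:

AWSTemplateFormatVersion: '2010-09-09'
Transform: AWS::Serverless-2016-10-31

Resources:
  DeadLetterQueue:
    Type: AWS::SQS::Queue

  LogsQueue:
    Type: AWS::SQS::Queue
    Properties:
      RedrivePolicy:
        deadLetterTargetArn: !GetAtt DeadLetterQueue.Arn
        maxReceiveCount: 5

  LogsWriterFunction:
    Type: AWS::Serverless::Function
    Properties:
      Handler: app.lambda_handler
      Runtime: python3.9
      Events:
        LogsQueueEvent:
          Type: SQS
          Properties:
            Queue: !GetAtt LogsQueue.Arn

During deployment, SAM expands the AWS::Serverless::Function resource into the underlying CloudFormation resources (the function itself, its role, and the event source mapping).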
The Serverless Application Model Command Line Interface (SAM CLI) is an extension of the AWS CLI that adds functionality for building and testing Lambda applications. It uses Docker to run your functions in an Amazon Linux environment that matches Lambda. It can also emulate your application’s build environment and API. Using SAM CLI you can also easily deploy your application.
The best part is that the SAM templates are reusable — if you put them into a completely new account and deploy them using SAM CLI, everything will be set up after a few minutes.
The Logs serverless application uses SAM for creating the resources and their connections, for deployment and testing, and for most of the other DevOps-related tasks in the project.
I hope this article gave you an overall idea of how powerful serverless applications are. Combined with tools like Lambda Powertools and Pydantic, they can lead to amazing results and solutions.
Happy Coding!
Originally published on the MTR Design company website.
MTR Design is a Bulgarian web development company with expertise in a wide range of technologies (Python, PHP, NodeJS, React and React Native, VueJS, iOS and Android development). We are currently open to new projects and collaborations, so if there are any projects we can help with, please reach out (use the contact form on our website, or email us at office@mtr-design.com) and we will be happy to discuss them.