AWS Glue Notes

Intro

Glue is a serverless service that catalogs data from multiple sources, like DynamoDB and S3, into tables that can then be transformed by Glue jobs or consumed by services like Athena and machine learning models. It can combine data from many different files in S3 into one comprehensive table that can then be queried in a variety of ways.

This article is not going to be a full tutorial on Glue; I'm just going to note the parts that I find particularly useful for myself, so if you are looking for a full walk-through please look elsewhere.

Tables

In AWS Glue, a table is similar to a table in DynamoDB, except that it can combine the data from hundreds or thousands of different files or locations into one source, as long as those sources share a common schema (the table itself stores the schema and pointers to the underlying data rather than a copy of the data).

For example, let's say I have an S3 bucket that gets a CSV file uploaded to it every time a customer makes a purchase at my company, with the details of the purchase. There will be thousands of CSV files in that bucket that share a common schema. Glue can combine all of that data into one table that can then be queried or transformed in a variety of ways.
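As a rough illustration of the consumption side, here is a minimal sketch that queries such a combined table with Athena through boto3. The database, table, and bucket names are hypothetical, and Athena is assumed to be pointed at the Glue Data Catalog (its default).

```python
import time
import boto3

athena = boto3.client("athena")

# Hypothetical names: "sales_db" is the Glue database, "purchases" the crawled table.
query = athena.start_query_execution(
    QueryString="SELECT customer_id, SUM(amount) AS total FROM purchases GROUP BY customer_id",
    QueryExecutionContext={"Database": "sales_db"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results-bucket/"},
)

# Poll until the query finishes, then fetch the results.
query_id = query["QueryExecutionId"]
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
    print(rows)
```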

You can also catalog data from a variety of other sources, such as Redshift, MongoDB, and Kafka.

Glue is the first step in the AWS environment for combining and organizing disparate data sources for consumption by other AWS services like Athena or SageMaker. In fact, Athena reads its table metadata from the AWS Glue Data Catalog, so cataloging your data in Glue is essentially how you feed data into that service.

Tables are organized into Databases; the tables do not have to share a similar schema to be in the same database. Databases are really just folders for tables, used to organize your tables into groups.

Tables can be generated manually, where you create the schema yourself by hand, or they can be generated automatically by crawlers. Crawlers are the second most complicated thing you will do in Glue, behind Python scripts.
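For the manual route, here is a rough boto3 sketch of defining a table by hand. The database name, column names, and S3 location are hypothetical, and the underlying files are assumed to be CSV.

```python
import boto3

glue = boto3.client("glue")

# Hypothetical: a hand-written table over CSV purchase files in S3.
glue.create_table(
    DatabaseName="sales_db",
    TableInput={
        "Name": "purchases_manual",
        "TableType": "EXTERNAL_TABLE",
        "Parameters": {"classification": "csv", "skip.header.line.count": "1"},
        "StorageDescriptor": {
            "Columns": [
                {"Name": "order_id", "Type": "string"},
                {"Name": "customer_id", "Type": "string"},
                {"Name": "amount", "Type": "double"},
            ],
            "Location": "s3://my-sales-datalake/purchases/",
            "InputFormat": "org.apache.hadoop.mapred.TextInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
            "SerdeInfo": {
                "SerializationLibrary": "org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe",
                "Parameters": {"field.delim": ","},
            },
        },
    },
)
```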

Crawlers

Crawlers scan one or more data stores and create (or update) a unified table containing all the data that shares a matching schema.

Crawlers can be instructed to ignore items in the folders they are crawling with a syntax almost identical to that used in a .gitignore file.
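Here is a minimal sketch of creating a crawler with exclude patterns via boto3. The crawler name, IAM role, bucket path, and patterns are all hypothetical.

```python
import boto3

glue = boto3.client("glue")

# Hypothetical crawler over the purchases prefix, ignoring temp and manifest files.
glue.create_crawler(
    Name="purchases-crawler",
    Role="arn:aws:iam::123456789012:role/MyGlueCrawlerRole",
    DatabaseName="sales_db",
    Targets={
        "S3Targets": [
            {
                "Path": "s3://my-sales-datalake/purchases/",
                "Exclusions": ["**/_temporary/**", "**.tmp", "**/manifest.json"],
            }
        ]
    },
)

glue.start_crawler(Name="purchases-crawler")
```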

Datalake Structure

There are ways to structure a datalake in S3 that are better suited than others for consumption by Glue Crawlers. It is important for the crawler to be able to partition your data so that it can be queried efficiently later.

📘 AWS Glue Docs: How Does a Crawler Determine When to Create Partitions?

For example, at the base level of your S3 datalake file structure you should start by grouping all files with a similar schema together, then group by other shared elements like date, and lastly by unique identifiers like user ID or order number. This will allow Glue Crawlers to partition your data as efficiently as possible.

One of the useful things to note in that documentation is that we can actually name the partitions automatically if we use the syntax partition=value in the folder structure. For example:

s3://sales/year=2019/month=Jan/day=1
s3://sales/year=2019/month=Jan/day=2
s3://sales/year=2019/month=Feb/day=1
s3://sales/year=2019/month=Feb/day=2

will automatically generate the partitions year, month, and day. It's also nice to note that we can apparently have as many partition keys as we would like (or at least more than two, which is the limit in DynamoDB without creating additional indexes).
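Once the crawler has run, you can sanity-check the partitions it discovered with boto3. A small sketch, assuming the hypothetical sales_db database and a sales table created by the crawler:

```python
import boto3

glue = boto3.client("glue")

# List the partitions the crawler created for the (hypothetical) sales table.
response = glue.get_partitions(DatabaseName="sales_db", TableName="sales")

for partition in response["Partitions"]:
    # Values line up with the partition keys, e.g. ["2019", "Feb", "2"] for year/month/day.
    print(partition["Values"])
```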

Note that this doesn't actually seem like a good date format to use; month=Jan sorts alphabetically rather than chronologically, so I would strongly recommend something sortable like zero-padded numbers (month=01) or a single YYYYMMDD partition if you can help it.

Bookmarks

Bookmarks are a Glue feature, available in both Jobs and Crawlers, that allows Glue to crawl or transform only the files that have changed since the last run. This is an extremely useful, even necessary, feature. If bookmarks are not enabled, a crawler will crawl all the files in a given location every time it runs, which will take longer and longer as the datalake grows. Not only are we duplicating effort, we are spending unnecessary time and money performing the same index or transform over and over. Always use bookmarks if your datalake is going to grow over time.
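For Jobs, bookmarks are turned on with the --job-bookmark-option default argument. A rough boto3 sketch, with a hypothetical job name, role, and script location:

```python
import boto3

glue = boto3.client("glue")

# Hypothetical ETL job with bookmarks enabled so only new/changed files are processed.
glue.create_job(
    Name="transform-purchases",
    Role="arn:aws:iam::123456789012:role/MyGlueJobRole",
    Command={
        "Name": "glueetl",
        "ScriptLocation": "s3://my-glue-scripts/transform_purchases.py",
        "PythonVersion": "3",
    },
    DefaultArguments={"--job-bookmark-option": "job-bookmark-enable"},
    GlueVersion="3.0",
    WorkerType="G.1X",
    NumberOfWorkers=2,
)
```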

One of the useful things to understand about bookmarks is that they work off of the last modified date of a file. This matters because you might wonder: if I delete a file and replace it with another file in the exact same location with the same name, will the new file get crawled? The answer is yes, because bookmarks use the last modified timestamp, not the file name, to keep track of which files have been processed.

Jobs

Jobs are transformations, such as mapping or dropping fields, converting to another format, or even just decompressing files. There is lots of information about this out there; a minimal job script sketch follows the note below. Some things to note:

  • If a job is set to output a file in an S3 bucket folder, and the folder does not exist, the job will create the folder.
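For reference, here is a rough sketch of what a basic Glue ETL job script can look like, reading from the catalog and writing Parquet back to S3. The database, table, field names, and output bucket are hypothetical; the transformation_ctx values are what let bookmarks track progress.

```python
import sys
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.transforms import ApplyMapping
from awsglue.utils import getResolvedOptions

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glue_context = GlueContext(sc)
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read from the catalog table (hypothetical database/table names).
dyf = glue_context.create_dynamic_frame.from_catalog(
    database="sales_db",
    table_name="purchases",
    transformation_ctx="read_purchases",  # required for bookmarks to track this source
)

# Keep and cast a couple of fields; fields not listed here are dropped by ApplyMapping.
mapped = ApplyMapping.apply(
    frame=dyf,
    mappings=[
        ("order_id", "string", "order_id", "string"),
        ("amount", "string", "amount", "double"),
    ],
    transformation_ctx="map_fields",
)

# Write out as Parquet; Glue will create the target folder if it doesn't exist.
glue_context.write_dynamic_frame.from_options(
    frame=mapped,
    connection_type="s3",
    connection_options={"path": "s3://my-transformed-bucket/purchases/"},
    format="parquet",
    transformation_ctx="write_parquet",
)

job.commit()  # commits the bookmark state
```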

Workflows

Workflows allow you to string together several Crawlers and Jobs using triggers, which fire when the previous step (or steps) completes.

There are some limitations, however; here are some things to keep in mind.

Max Crawlers Per Action

A Workflow trigger cannot start more than two Crawlers at a time. So if you are trying to trigger a large number of Crawlers simultaneously, you will want to use something like Step Functions.
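One way around the limit is a small Lambda (invoked from a Step Functions task) that kicks off the crawlers itself. A sketch, with hypothetical crawler names:

```python
import boto3

glue = boto3.client("glue")

# Hypothetical list of crawlers to kick off in one go.
CRAWLERS = ["purchases-crawler", "refunds-crawler", "inventory-crawler"]

def lambda_handler(event, context):
    started = []
    for name in CRAWLERS:
        try:
            glue.start_crawler(Name=name)
            started.append(name)
        except glue.exceptions.CrawlerRunningException:
            # Already running; skip it rather than failing the whole batch.
            pass
    return {"started": started}
```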

Workflow Has Reached Max Concurrency Error

There is a known bug with Workflows and Jobs that have the "STOPPED" status and that also have max concurrency set to 1. If you attempt to start a Workflow under these two circumstances you will get a "Workflow has reached max concurrency" error, even though the Workflow or Job is not running any instances. This has been a known bug for at least 8 months as of this writing and it's still not fixed.

The recommended fix is to set the max concurrency to two or higher. However, this creates another issue, as multiple instances of a Job or Workflow running concurrently can break the bookmark feature, which is a big problem.

The way that I have handled this is to set the maximum concurrency to two, and then, when I trigger my Workflows using Step Functions, check the status of the Workflow first to make sure that it is not already running. If it is running (or stopping), the state machine goes into a wait loop until the current run is done. In this way there can only ever be one instance of the Workflow running, essentially enforcing a max concurrency of one manually.

(Image: Step Function that limits the Workflow to single concurrency)
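The status check itself can live in a Lambda that the state machine calls before deciding whether to start or wait. A rough boto3 sketch, with a hypothetical Workflow name:

```python
import boto3

glue = boto3.client("glue")

WORKFLOW_NAME = "sales-etl-workflow"  # hypothetical

def lambda_handler(event, context):
    # Look at the most recent run of the Workflow.
    workflow = glue.get_workflow(Name=WORKFLOW_NAME, IncludeGraph=False)["Workflow"]
    last_run = workflow.get("LastRun")
    status = last_run["Status"] if last_run else "NONE"

    if status in ("RUNNING", "STOPPING"):
        # Tell the state machine to wait and loop back to this check.
        return {"busy": True, "status": status}

    # Safe to start a new run.
    run_id = glue.start_workflow_run(Name=WORKFLOW_NAME)["RunId"]
    return {"busy": False, "runId": run_id}
```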

Glue DataBrew

Glue DataBrew provides a really nice UI where you can create jobs to transform data from data catalogs or other sources. However, some features are missing right now that would make it far more useful. For example, the jobs and recipes that you create in DataBrew cannot be added to Glue Workflows, cannot be exported as Python to be used in a Glue Studio custom transform, and do not appear to support bookmarks.

So they are really useful for a one-time cleanup of a dataset; however, if you are trying to include them in an automated workflow it gets much trickier.

The only way that I could see to use them right now in an automated workflow would be to have Step Functions orchestrate the following (a rough sketch of the DataBrew step follows below):

  • Have a Glue Crawler or Job place the incoming files in a designated temporary S3 bucket
  • Trigger the Glue DataBrew job to process the files in that bucket
  • Trigger another standard Glue Workflow/Job/Crawler to move the files to the transformed data repository
  • Go back and delete the files from the temporary bucket/folder with another Step Functions step

In this way you could sidestep the bookmark issue and the missing Glue integration by handling everything with Step Functions.
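For the DataBrew step in the middle, the start-and-poll piece might look roughly like this; the job name is hypothetical:

```python
import time
import boto3

databrew = boto3.client("databrew")

JOB_NAME = "clean-purchases-recipe-job"  # hypothetical DataBrew job

def run_databrew_job():
    run_id = databrew.start_job_run(Name=JOB_NAME)["RunId"]

    # Poll until the run reaches a terminal state.
    while True:
        state = databrew.describe_job_run(Name=JOB_NAME, RunId=run_id)["State"]
        if state in ("SUCCEEDED", "FAILED", "STOPPED", "TIMEOUT"):
            return state
        time.sleep(30)

if __name__ == "__main__":
    print(run_databrew_job())
```

In a real state machine you would probably split this into a "start" task and a "check status" task separated by a Wait state, rather than sleeping inside one Lambda, but the boto3 calls are the same.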