One of the issues that will commonly arise for me when querying Glue tables with Athena is that I will get a HIVE_PARTITION_SCHEMA_MISMATCH error. Usually what this means is that one of the columns in a data partition does not have a matching data type to the table schema.
How does something like that happen? There are many ways but here is one possible scenario. Let's say that the Glue Table is being created by a Glue Crawler, which is parsing the data types for all the columns using some set of sample data. It's possible that the initial parsing of the data types will not hold up in the future. For example if I have sales data and in the limited sample data set all the sales amounts are round integers, or zeros (EG 20 or 0). The crawler will assume that the datatype for this column is int. That will cause a HIVE_PARTITION_SCHEMA_MISMATCH error later if a new partition with a sales amount like 5.75 is added (which would be datatype double).
Unfortunately we can't simply solve this problem by running the Glue Crawler again. Let's take a look at the advanced options in the Glue Crawler.
Crawlers > Edit Crawler > Step 4: Set output and scheduling > Advanced Options
In my experience I have not had a lot of success with the "Update all new and existing partitions with metadata from the table" option. The ideal situation would be that if a partition came in with datatype double on our sales column, that would update the table definition to double, and then all the partition schemas would also be updated. In my experience it just doesn't work.
The most reliable method that I have found is to lock the table definition so that it can only be changed manually by selecting the "Ignore the change and don't update the table in the data catalog" and the "Update all new and existing partitions with metadata from the table" option together. In this way all of the incoming data partition schemas will be forced to have the same schema as the table schema, which should now be immutable unless changed manually. If the data coming in doesn't fit that schema you have a different problem entirely.
But this leaves us with the question of how to handle all the partitions that already exist in our table that don't match the schema.
The solution to this problem is to delete all the partitions and run the crawler again, which will lock all the new partitions into the correct schema.
Remember, deleting the table partitions will not affect the actual data in any way. This is just the index of the data for the table, which can be easily recreated with another crawl run. Unfortunately there is not a very simple way to delete all the partitions through the GUI.
We have to use the SDK or the console to accomplish this. Here is the lambda function that i've written which handles this.
This is confirmed functional in Lambda running Node v16.
constAWS=require('aws-sdk'); // Set the region and initialize the Glue client AWS.config.update({region:'us-east-1'}); const glue =newAWS.Glue(); exports.handler=async(event, context)=>{ // Specify the database and table names const databaseName = event.database; const tableName = event.table; console.log("database:", databaseName); console.log("table:", tableName); asyncfunctiongetAllPartitions(databaseName, tableName){ let partitions =[]; let nextToken; do{ const params ={ DatabaseName: databaseName, TableName: tableName, NextToken: nextToken, }; const response =await glue.getPartitions(params).promise(); partitions = partitions.concat(response.Partitions||[]); nextToken = response.NextToken; }while(nextToken); return partitions; } try{ // Get a list of partitions for the table const partitions =awaitgetAllPartitions(databaseName, tableName); console.log(`retrieved ${partitions.length} partitions`); // Delete all partitions in batches of 25 const partitionBatches =[]; for(let i =0; i < partitions.length; i +=25){ partitionBatches.push(partitions.slice(i, i +25)); } awaitPromise.all(partitionBatches.map((partitionBatch)=>{ let partitionValuesArray =[]; partitionBatch.forEach((partition=>{ partitionValuesArray.push({Values: partition.Values}); })); glue.batchDeletePartition({ DatabaseName: databaseName, TableName: tableName, PartitionsToDelete: partitionValuesArray, }).promise(); })); console.log(`Deleted ${partitions.length} partitions`); } catch(err){ console.error(err); } };
Once all of the table partitions are deleted you will have a clean slate to run the crawler again and recreate the partitions, which (if you've set the settings on your crawler as shown) will force the columns to match the schema that we have set in the table.
Free desktop AI Chat client, designed for developers and businesses. Unlocks advanced model settings only available in the API. Includes quality of life features like custom syntax highlighting.
React is a declarative, efficient, and flexible JavaScript library for building user interfaces. It lets you compose complex UIs from small and isolated pieces of code called “components”.
Bidbear is a report automation tool. It downloads Amazon Seller and Advertising reports, daily, to a private database. It then merges and formats the data into beautiful, on demand, exportable performance reports.
JavaScript is a lightweight, interpreted programming language with first-class functions that is known for its role in web development. It is an essential part of web applications and allows for interactive elements.
Docker is a platform for developers and sysadmins to develop, deploy, and run applications with containers. The use of Linux containers to deploy applications is called containerization.
React is a declarative, efficient, and flexible JavaScript library for building user interfaces. It lets you compose complex UIs from small and isolated pieces of code called “components”.
Redux is a predictable state container for JavaScript apps. It helps you write applications that behave consistently, run in different environments (client, server, and native), and are easy to test.
Vite is a modern, fast front-end build tool that significantly improves the frontend development experience. It features a dev server with Hot Module Replacement (HMR) and a build command that bundles your code with Rollup, which is configured to output highly optimized static assets for production.
Next.js is an open-source React front-end development web framework that enables functionality such as server-side rendering and generating static websites for React based web applications.
Docusaurus is a project for building, deploying, and maintaining open source project websites easily. It's optimized for documentation and can leverage the power of Markdown and React to enable seamless documentation sites with built-in search capabilities.
Stripe is a technology company that builds economic infrastructure for the internet. Businesses of every size use the company's software to accept payments and manage their businesses online.
Sentry is an open-source error tracking tool that helps developers monitor and fix crashes in real time. The tool streamlines the error resolution process and aggregates errors across different programming languages.
D3.js is a JavaScript library for producing dynamic, interactive data visualizations in web browsers. It uses HTML, SVG, and CSS to bring data to life through a data-driven approach to DOM manipulation.
React Flow is a highly customizable library for building interactive node-based editors, diagrams, and flow charts. It provides a set of tools that make it easy to drag, drop, and connect nodes on a canvas.
TipTap is a headless, framework-agnostic text editor capable of handling a variety of complex text editing needs. It is built on top of ProseMirror and is extendable and customizable to fit different editing requirements.
JavaScript is a lightweight, interpreted programming language with first-class functions that is known for its role in web development. It is an essential part of web applications and allows for interactive elements.
Python is a high-level, interpreted, general-purpose programming language. Its design philosophy emphasizes code readability with its use of significant whitespace.
Amazon Cognito provides authentication, authorization, and user management for your web and mobile apps. Users can sign in directly with a username and password, or through a third party such as Facebook, Amazon, or Google.
Amazon API Gateway is a fully managed service that makes it easy for developers to create, publish, maintain, monitor, and secure APIs at any scale. It acts as a front door for applications to access data, business logic, or functionality from your back-end services.
AWS Lambda is a compute service that lets you run code without provisioning or managing servers. Lambda executes your code only when needed and scales automatically, from a few requests per day to thousands per second.
Amazon Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL. It is serverless, so there is no infrastructure to manage, and you pay only for the queries that you run.
AWS Glue is a fully managed extract, transform, and load (ETL) service that makes it simple and cost-effective to categorize your data, clean it, enrich it, and move it reliably between various data stores.
AWS Step Functions is a service that lets you coordinate multiple AWS services into serverless workflows so you can build and update apps quickly. It provides a reliable way to orchestrate the components of distributed applications and microservices using visual workflows.
Amazon Simple Queue Service (SQS) is a fully managed message queuing service that enables you to decouple and scale microservices, distributed systems, and serverless applications. SQS eliminates the complexity and overhead associated with managing and operating message oriented middleware.
Amazon DynamoDB is a fast and flexible NoSQL database service for all applications that need consistent, single-digit millisecond latency at any scale. It is a fully managed database and supports both document and key-value store models.
Amazon S3 (Simple Storage Service) is an object storage service that offers industry-leading scalability, data availability, security, and performance. This means customers of all sizes and industries can use it to store and protect any amount of data for a range of use cases, such as websites, mobile applications, backup and restore, archive, enterprise applications, IoT devices, and big data analytics.
Amazon CloudWatch is a monitoring and observability service built for DevOps engineers, developers, site reliability engineers (SREs), and IT managers. CloudWatch provides data and actionable insights to monitor applications, understand and respond to system-wide performance changes, optimize resource utilization, and get a unified view of operational health.
Amazon CloudFront is a fast content delivery network (CDN) service that securely delivers data, videos, applications, and APIs to customers globally with low latency, high transfer speeds, all within a developer-friendly environment.
Amazon Route 53 is a highly available and scalable cloud Domain Name System (DNS) web service. It is designed to give developers and businesses an extremely reliable and cost-effective way to route end users to Internet applications.
Amazon EventBridge is a serverless event bus service that makes it easy to connect applications together using data from your own applications, integrated SaaS applications, and AWS services. EventBridge helps facilitate event-driven architectures by routing events between AWS services, integrated SaaS applications, and your own applications.