MMS • Nsikan Essien
Article originally posted on InfoQ.
AWS recently launched a history feature for AWS Glue Crawlers, which lets users inspect Crawler runs and their associated schema changes over the last 12 months.
AWS Glue is a serverless data-integration service. It is a suite of AWS integrations built around two major components: its data-cataloging functionality, Glue Data Catalog (based on Apache Hive Metastore), and its extract-transform-load (ETL) pipeline capability, Glue ETL (based on Apache Spark). The Glue Data Catalog, whose changes the Crawler history feature displays, is the metadata store of the Glue ecosystem. The catalog houses table definitions, which describe the schema of data held in a location outside of Glue, such as Amazon Simple Storage Service (S3) or Amazon Relational Database Service (RDS). Glue ETL can then use the catalog as a reference to the sources or targets of data for its pipelines, as can other AWS analytics services such as Amazon Athena. Table definitions can be added manually or created using Crawlers.
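To make the idea of a table definition concrete, the following is a minimal sketch of how one might be read, assuming the dictionary layout returned in the `Table` element of a Glue `GetTable` response. The `describe_table` helper, the `orders` table, the bucket path, and the column names are all hypothetical and used only for illustration.

```python
def describe_table(table):
    """Flatten a Glue `Table` structure into its location, columns, and partition keys."""
    storage = table.get("StorageDescriptor", {})
    return {
        "name": table.get("Name"),
        # Where the data actually lives, outside of Glue.
        "location": storage.get("Location"),
        "columns": {c["Name"]: c.get("Type") for c in storage.get("Columns", [])},
        "partition_keys": [k["Name"] for k in table.get("PartitionKeys", [])],
    }


# Hypothetical table definition, shaped like the `Table` element of a GetTable response.
example_table = {
    "Name": "orders",
    "StorageDescriptor": {
        "Location": "s3://example-bucket/orders/",
        "Columns": [
            {"Name": "order_id", "Type": "bigint"},
            {"Name": "amount", "Type": "double"},
        ],
    },
    "PartitionKeys": [{"Name": "order_date", "Type": "string"}],
}

if __name__ == "__main__":
    print(describe_table(example_table))
    # Against a real catalog (needs boto3 and AWS credentials; names are assumptions):
    #   import boto3
    #   table = boto3.client("glue").get_table(DatabaseName="example_db", Name="orders")["Table"]
    #   print(describe_table(table))
```

The definition holds only metadata: the schema and a pointer to the data's location, which is what lets services such as Glue ETL and Athena share it.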
Crawlers are jobs that create or update table definitions in a Glue Data Catalog when they complete. They can be run ad hoc or on a schedule, and they interrogate the target data source by classifying and grouping the scanned data. Crawlers use built-in classifiers to infer the data’s schema and format, and can be extended with user-defined custom classifiers for more complex use cases.
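As a rough illustration of the options described above, the sketch below assembles the arguments for Glue's `CreateCrawler` operation and then starts a run, assuming a boto3-style Glue client. The helper names (`build_crawler_request`, `create_and_start`), the role ARN, database, bucket path, and classifier name are hypothetical.

```python
def build_crawler_request(name, role_arn, database, s3_path, schedule=None, classifiers=()):
    """Assemble keyword arguments for Glue's CreateCrawler operation."""
    request = {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,  # catalog database that receives the table definitions
        "Targets": {"S3Targets": [{"Path": s3_path}]},
    }
    if schedule:
        # Glue expects a cron expression, e.g. "cron(0 12 * * ? *)" for daily at 12:00 UTC.
        request["Schedule"] = schedule
    if classifiers:
        # Names of custom classifiers that were registered separately.
        request["Classifiers"] = list(classifiers)
    return request


def create_and_start(glue_client, request):
    """Create the crawler, then trigger one ad hoc run."""
    glue_client.create_crawler(**request)
    glue_client.start_crawler(Name=request["Name"])


if __name__ == "__main__":
    # All values below are placeholders; this only prints the request.
    request = build_crawler_request(
        name="example-crawler",
        role_arn="arn:aws:iam::123456789012:role/ExampleGlueRole",
        database="example_db",
        s3_path="s3://example-bucket/orders/",
        schedule="cron(0 12 * * ? *)",
        classifiers=["example-custom-classifier"],
    )
    print(request)
    # With boto3 and AWS credentials available:
    #   import boto3
    #   create_and_start(boto3.client("glue"), request)
```

Omitting `schedule` leaves the crawler to be run on demand only, while omitting `classifiers` leaves schema inference to the built-in classifiers.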
For each Crawler run, the history feature shows contextual information such as the duration of the run, the associated computing costs, and the changes made to the metadata store.
Because AWS Glue combines several complementary tools, comparisons with other solutions usually focus on its individual components rather than on the offering as a whole. The Glue Catalog is often compared to the Apache Hive Metastore, while Glue ETL offers functionality that is also available in Amazon EMR (Elastic MapReduce). Yoni Augarten of lakeFS, in a comparison of Glue Catalog and Hive Metastore, recommended Hive for larger organizations heavily invested in the Hadoop ecosystem and Glue Catalog for smaller teams with more straightforward requirements.
The Crawler history feature is available in the AWS console, programmatically through the ListCrawls web API, and through any of the official AWS SDKs.
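For programmatic access, a minimal sketch of reading a crawler's history through the Python SDK might look like the following. It assumes a boto3-style Glue client exposing `list_crawls`, and the response fields `CrawlId`, `State`, `StartTime`, `EndTime`, `DPUHour`, and `Summary`. The helper names and the crawler name `example-crawler` are hypothetical.

```python
def list_crawl_history(glue_client, crawler_name, page_size=50):
    """Yield every recorded crawl for one crawler, following NextToken pagination."""
    kwargs = {"CrawlerName": crawler_name, "MaxResults": page_size}
    while True:
        response = glue_client.list_crawls(**kwargs)
        yield from response.get("Crawls", [])
        token = response.get("NextToken")
        if not token:
            break
        kwargs["NextToken"] = token


def summarize(crawl):
    """Reduce one crawl record to its run duration, compute usage, and change summary."""
    start, end = crawl.get("StartTime"), crawl.get("EndTime")
    # A crawl that is still running has no EndTime, so no duration can be computed.
    duration = (end - start).total_seconds() if start and end else None
    return {
        "crawl_id": crawl.get("CrawlId"),
        "state": crawl.get("State"),
        "duration_seconds": duration,
        "dpu_hours": crawl.get("DPUHour"),
        "summary": crawl.get("Summary"),
    }


# With boto3 and AWS credentials available (the crawler name is a placeholder):
#   import boto3
#   glue = boto3.client("glue")
#   for crawl in list_crawl_history(glue, "example-crawler"):
#       print(summarize(crawl))
```

Passing the client in as an argument keeps the helpers independent of any one SDK session, so they can be exercised with a stub before being pointed at a real account.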