PyAthena is a good library for accessing Amazon Athena, and it works seamlessly once you've configured the credentials. However, the fetch method of the default database cursor is very slow for large datasets (from around 10 MB up). It's much faster to export the data to S3 and then download it into Python directly. I will focus on Athena for this example, but the same method applies to Presto with a few small changes to the queries.

The final method looks like this:

    def download_table(cursor, outfolder, query, format='AVRO'):
        """Use PyAthena cursor to download query to outfolder in format.

        Note that all columns in query must be named for this to work.
        Multiple files may be created in outfolder.
        """
        create_table_as(cursor, table, query, format)
        s3_locations = table_file_location(cursor, table)
        ...
        # Optionally remove underlying S3 files here

It wraps the input query in a CTAS to change the output format:

    def create_table_as(cursor, table, query, format='AVRO'):
        cursor.execute(f"CREATE TABLE IF NOT EXISTS ...")

The individual files can then be read in with fastavro for Avro, pyarrow for Parquet, or json for JSON. Note that because the result can be spread across multiple files, any sorting from the query may be lost unless you merge-sort the input. The full details (streaming instead of downloading) are available in the sample implementation.

There's a lot that could be done to make this faster or more convenient:

- The queries could be executed without blocking using the AsynchronousCursor.
- The S3 files could be downloaded in parallel, which may be faster.
- The files don't need to be downloaded at all when passing an S3 path to pandas or using s3fs (though this is usually slower).

Avro can represent almost all Athena/Presto datatypes (except Map), has excellent support through fastavro, and the files can be concatenated together into a single outfile. The only major drawback is that Avro doesn't have native pandas support, but it is very easy to convert.
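The CTAS statement executed by create_table_as was truncated in this copy of the article. As a minimal sketch of the idea (the helper name, WITH properties, and table name below are my assumptions based on Athena's documented CTAS syntax, not the original code), the statement could be generated like this:

```python
def build_ctas(table, query, fmt="AVRO", external_location=None):
    """Wrap an input query in an Athena CTAS statement so the engine
    writes the query results to S3 in the requested file format."""
    props = [f"format = '{fmt}'"]
    if external_location is not None:
        # Where Athena should write the data files for the new table.
        props.append(f"external_location = '{external_location}'")
    return (
        f"CREATE TABLE IF NOT EXISTS {table}\n"
        f"WITH ({', '.join(props)}) AS\n"
        f"{query}"
    )

sql = build_ctas("tmp_export", "SELECT id, name FROM users")
```

The resulting string would then be passed to cursor.execute; because the data lands in S3 as ordinary files, it can be downloaded without going through the slow row-by-row fetch.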
Note that since this article was originally written, Athena has added an UNLOAD command for exporting a query result in a given file format, and AWS Data Wrangler now has convenient wrappers for quickly exporting data from Athena by running a CTAS or UNLOAD query in the background.

There is another way: directly reading the output of an Athena query as a CSV from S3. It has some limitations, but it is very robust and, for large data files, a very quick way to export the data. I have a sample implementation showing how to query Avro with query_avro and how to use the CSV trick with query. I will focus on Athena, but most of it will apply to Presto using presto-python-client with some minor changes to DDLs and authentication.
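The CSV trick works because Athena already writes every query's result to the configured output (staging) location as a file named after the query execution id. A minimal sketch of locating that file (the helper name is mine, not from the sample implementation):

```python
def athena_result_csv(output_location, query_execution_id):
    """Return the S3 path where Athena stores a query's result CSV.

    Athena names the result file <QueryExecutionId>.csv inside the
    configured query result (staging) location.
    """
    return f"{output_location.rstrip('/')}/{query_execution_id}.csv"

path = athena_result_csv("s3://my-bucket/athena-staging/", "abc-123-def")
# The path can then be handed to pandas.read_csv (with s3fs installed),
# skipping the slow cursor fetch entirely.
```

The main limitations are those of CSV itself: type information is flattened to strings, so complex and binary columns don't round-trip as cleanly as they do with the Avro export.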