PyAthena is a good library for accessing Amazon Athena, and it works seamlessly once you've configured the credentials. However, the fetch method of the default database cursor is very slow for large datasets (from around 10 MB up). It's much faster to export the data to S3 and then download it into Python directly. I will focus on Athena for this example, but the same method applies to Presto with a few small changes to the queries.

The final method looks like this:

    def download_table(cursor, outfolder, query, format='AVRO'):
        """Use PyAthena cursor to download query to outfolder in format.

        Note that all columns in query must be named for this to work.
        Multiple files may be created in outfolder.
        """
        create_table_as(cursor, table, query, format)
        s3_locations = table_file_location(cursor, table)
        ...
        # Optionally remove underlying S3 files here

It wraps the input query in a CTAS to change the output format:

    def create_table_as(cursor, table, query, format='AVRO'):
        cursor.execute(f"CREATE TABLE IF NOT EXISTS ...")

The individual files can then be read in with fastavro for Avro, pyarrow for Parquet, or json for JSON. Note that because the result can be spread across multiple files, any sorting from the query may be lost unless you merge-sort the input. The full details (streaming instead of downloading) are available in the sample implementation.

There's a lot that could be done to make this faster or more convenient:

- The queries could be executed without blocking using the AsynchronousCursor.
- The S3 files could be downloaded in parallel, which may be faster.
- The files don't need to be downloaded at all when passing an S3 path to pandas or using s3fs (though this is usually slower).

Avro can represent almost all Athena/Presto datatypes (except Map), has excellent support through fastavro, and the files can be concatenated together into a single outfile. The only major drawback is that Avro doesn't have native pandas support, but it is very easy to convert.
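The CTAS statement executed by create_table_as was truncated in this copy of the article. As a minimal sketch of the idea (the helper name, WITH properties, and table name below are my assumptions based on Athena's documented CTAS syntax, not the original code), the statement could be generated like this:

```python
def build_ctas(table, query, fmt="AVRO", external_location=None):
    """Wrap an input query in an Athena CTAS statement so the engine
    writes the query results to S3 in the requested file format."""
    props = [f"format = '{fmt}'"]
    if external_location is not None:
        # Where Athena should write the data files for the new table.
        props.append(f"external_location = '{external_location}'")
    return (
        f"CREATE TABLE IF NOT EXISTS {table}\n"
        f"WITH ({', '.join(props)}) AS\n"
        f"{query}"
    )

sql = build_ctas("tmp_export", "SELECT id, name FROM users")
```

The resulting string would then be passed to cursor.execute; because the data lands in S3 as ordinary files, it can be downloaded without going through the slow row-by-row fetch.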
Note that since this article was originally written, Athena has added an UNLOAD command for exporting a query result in a given file format, and AWS Data Wrangler now has convenient wrappers for quickly exporting data from Athena by running a CTAS or UNLOAD query in the background.

There is another way: directly reading the output of an Athena query as a CSV from S3. It has some limitations, but it is very robust and, for large data files, a very quick way to export the data. I have a sample implementation showing how to query Avro with query_avro and how to use the CSV trick with query. I will focus on Athena, but most of it will apply to Presto using presto-python-client with some minor changes to DDLs and authentication.
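The CSV trick works because Athena already writes every query's result to the configured output (staging) location as a file named after the query execution id. A minimal sketch of locating that file (the helper name is mine, not from the sample implementation):

```python
def athena_result_csv(output_location, query_execution_id):
    """Return the S3 path where Athena stores a query's result CSV.

    Athena names the result file <QueryExecutionId>.csv inside the
    configured query result (staging) location.
    """
    return f"{output_location.rstrip('/')}/{query_execution_id}.csv"

path = athena_result_csv("s3://my-bucket/athena-staging/", "abc-123-def")
# The path can then be handed to pandas.read_csv (with s3fs installed),
# skipping the slow cursor fetch entirely.
```

The main limitations are those of CSV itself: type information is flattened to strings, so complex and binary columns don't round-trip as cleanly as they do with the Avro export.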