airflow.contrib.operators.mysql_to_gcs

Module Contents

airflow.contrib.operators.mysql_to_gcs.PY3[source]
class airflow.contrib.operators.mysql_to_gcs.MySqlToGoogleCloudStorageOperator(sql, bucket, filename, schema_filename=None, approx_max_file_size_bytes=1900000000, mysql_conn_id='mysql_default', google_cloud_storage_conn_id='google_cloud_default', schema=None, delegate_to=None, export_format='json', field_delimiter=',', *args, **kwargs)[source]

Bases: airflow.models.BaseOperator

Copy data from MySQL to Google Cloud Storage in JSON or CSV format.

The JSON data files generated are newline-delimited so that they can be loaded into BigQuery. Reference: https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-json#limitations. A minimal usage sketch follows the parameter list below.

Parameters
  • sql (str) – The SQL to execute on the MySQL table.

  • bucket (str) – The bucket to upload to.

  • filename (str) – The filename to use as the object name when uploading to Google Cloud Storage. A {} should be specified in the filename to allow the operator to inject file numbers in cases where the file is split due to size.

  • schema_filename (str) – If set, the filename to use as the object name when uploading a .json file containing the BigQuery schema fields for the table that was dumped from MySQL.

  • approx_max_file_size_bytes (long) – This operator supports the ability to split large table dumps into multiple files (see notes in the filename param docs above). Google Cloud Storage allows files to be a maximum of 4GB. This param allows developers to specify the file size of the splits.

  • mysql_conn_id (str) – Reference to a specific MySQL hook.

  • google_cloud_storage_conn_id (str) – Reference to a specific Google Cloud Storage hook.

  • schema (str or list) – The schema to use, if any. Should be a list of dict or a str. Pass a string if using a Jinja template; otherwise, pass a list of dict. Examples can be seen at: https://cloud.google.com/bigquery/docs/schemas#specifying_a_json_schema_file

  • delegate_to (str) – The account to impersonate, if any. For this to work, the service account making the request must have domain-wide delegation enabled.

  • export_format (str) – Desired format of files to be exported.

  • field_delimiter (str) – The delimiter to be used for CSV files.
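
For illustration, a task that dumps a table to newline-delimited JSON might be declared as follows (a minimal sketch; the SQL, bucket name, object names, and DAG settings are placeholders, not part of this module):

    from datetime import datetime

    from airflow import DAG
    from airflow.contrib.operators.mysql_to_gcs import MySqlToGoogleCloudStorageOperator

    with DAG(dag_id='mysql_to_gcs_example',
             start_date=datetime(2019, 1, 1),
             schedule_interval=None) as dag:
        export_orders = MySqlToGoogleCloudStorageOperator(
            task_id='export_orders',
            sql='SELECT * FROM orders',                # placeholder query
            bucket='my-bucket',                        # placeholder bucket
            filename='exports/orders_{}.json',         # {} receives the split number
            schema_filename='schemas/orders.json',     # optional BigQuery schema object
            mysql_conn_id='mysql_default',
            google_cloud_storage_conn_id='google_cloud_default',
            export_format='json',
        )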

template_fields = ['sql', 'bucket', 'filename', 'schema_filename', 'schema'][source]
template_ext = ['.sql'][source]
ui_color = '#a0e08c'[source]
execute(self, context)[source]
_query_mysql(self)[source]

Queries MySQL and returns a cursor to the results.
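
A minimal sketch of what this query step amounts to, written as a standalone function and assuming the standard MySqlHook; the operator's actual implementation may differ:

    from airflow.hooks.mysql_hook import MySqlHook

    def query_mysql(mysql_conn_id, sql):
        # Open a connection through the configured Airflow connection and
        # return a live cursor positioned at the query results.
        hook = MySqlHook(mysql_conn_id=mysql_conn_id)
        conn = hook.get_conn()
        cursor = conn.cursor()
        cursor.execute(sql)
        return cursor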

_write_local_data_files(self, cursor)[source]

Takes a cursor and writes the results to one or more local files, splitting the output when it exceeds the configured size.

Returns

A dictionary where keys are filenames to be used as object names in GCS, and values are file handles to local files that contain the data for the GCS objects.
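
A rough sketch of how such a split-and-write step can work, written as a standalone function (type conversion of row values is simplified to default=str here; the real method is more involved):

    import json
    from tempfile import NamedTemporaryFile

    def write_local_data_files(cursor, filename_template, approx_max_file_size_bytes):
        # Stream rows into newline-delimited JSON temp files, starting a new
        # split whenever the current file grows past the size limit.
        columns = [field[0] for field in cursor.description]
        file_no = 0
        tmp_file = NamedTemporaryFile(delete=True)
        files_to_upload = {filename_template.format(file_no): tmp_file}
        for row in cursor:
            line = json.dumps(dict(zip(columns, row)), default=str) + '\n'
            tmp_file.write(line.encode('utf-8'))
            if tmp_file.tell() >= approx_max_file_size_bytes:
                file_no += 1
                tmp_file = NamedTemporaryFile(delete=True)
                files_to_upload[filename_template.format(file_no)] = tmp_file
        return files_to_upload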

_configure_csv_file(self, file_handle, schema)[source]

Configures a CSV writer on the given file_handle and writes the schema as the header row of the new file.
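
A minimal sketch of that CSV setup, assuming a text-mode file handle and the standard-library csv module (the operator itself may use a different CSV library):

    import csv

    def configure_csv_file(file_handle, schema, field_delimiter=','):
        # Wrap the open file handle in a csv writer and emit the column
        # names as the header row before any data rows are written.
        writer = csv.writer(file_handle, delimiter=field_delimiter)
        writer.writerow(schema)
        return writer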

_write_local_schema_file(self, cursor)[source]

Takes a cursor and writes the BigQuery schema for the results to a local file in .json format.

Returns

A dictionary where the key is a filename to be used as an object name in GCS, and the value is a file handle to a local file that contains the BigQuery schema fields in .json format.
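
A sketch of how such a schema file can be derived from the cursor, assuming a type_map callable like the one documented below; the indices into cursor.description follow the DB-API tuple (name, type code, ..., null_ok):

    import json
    from tempfile import NamedTemporaryFile

    def write_local_schema_file(cursor, schema_filename, type_map):
        # Build one schema entry per column, translating the MySQLdb type
        # code to a BigQuery type and marking nullable columns.
        schema = []
        for field in cursor.description:
            schema.append({
                'name': field[0],
                'type': type_map(field[1]),
                'mode': 'NULLABLE' if field[6] else 'REQUIRED',
            })
        tmp_schema_file = NamedTemporaryFile(delete=True)
        tmp_schema_file.write(json.dumps(schema).encode('utf-8'))
        return {schema_filename: tmp_schema_file}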

_upload_to_gcs(self, files_to_upload)[source]

Upload all of the file splits (and optionally the schema .json file) to Google Cloud Storage.
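
A minimal sketch of that upload loop, assuming the contrib GoogleCloudStorageHook; the MIME type below is an assumption for JSON exports:

    from airflow.contrib.hooks.gcs_hook import GoogleCloudStorageHook

    def upload_to_gcs(files_to_upload, bucket,
                      google_cloud_storage_conn_id='google_cloud_default'):
        # Flush each temp file and push it to GCS under the object name it
        # was keyed by in files_to_upload.
        hook = GoogleCloudStorageHook(
            google_cloud_storage_conn_id=google_cloud_storage_conn_id)
        for object_name, file_handle in files_to_upload.items():
            file_handle.flush()
            hook.upload(bucket, object_name, file_handle.name, 'application/json')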

static _convert_types(schema, col_type_dict, row)[source]

Takes a value from MySQLdb and converts it to a value that's safe for JSON/Google Cloud Storage/BigQuery. Dates are converted to UTC seconds. Decimals are converted to floats. Binary type fields are encoded with base64, as imported BYTES data must be base64-encoded according to the BigQuery data type documentation: https://cloud.google.com/bigquery/data-types
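
A sketch of that per-value conversion, written as a standalone helper (the operator's own rules may cover more types):

    import base64
    import calendar
    import datetime
    from decimal import Decimal

    def convert_value(value):
        # Normalize a single MySQLdb value for JSON/GCS/BigQuery: dates and
        # datetimes become UTC epoch seconds, Decimals become floats, and
        # raw bytes are base64-encoded so BigQuery accepts them as BYTES.
        if isinstance(value, (datetime.datetime, datetime.date)):
            return calendar.timegm(value.timetuple())
        if isinstance(value, Decimal):
            return float(value)
        if isinstance(value, bytes):
            return base64.standard_b64encode(value).decode('ascii')
        return value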

_get_col_type_dict(self)[source]

Returns a dict mapping column name to column type, based on self.schema if it is not None.
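
A minimal sketch, assuming self.schema is a list of dicts with 'name' and 'type' keys as in the BigQuery schema examples linked above:

    def get_col_type_dict(schema):
        # e.g. [{'name': 'id', 'type': 'INTEGER'}, ...] -> {'id': 'INTEGER', ...}
        if not schema:
            return {}
        return {field['name']: field['type'] for field in schema}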

classmethod type_map(cls, mysql_type)[source]

Helper function that maps MySQL field types to BigQuery field types. Used when a schema_filename is set.
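
An illustrative sketch of such a mapping, using the MySQLdb field type constants; the exact table the operator uses may differ:

    from MySQLdb.constants import FIELD_TYPE

    def type_map(mysql_type):
        # Map a MySQLdb field type code to a BigQuery column type, falling
        # back to STRING for anything unrecognized.
        mapping = {
            FIELD_TYPE.TINY: 'INTEGER',
            FIELD_TYPE.SHORT: 'INTEGER',
            FIELD_TYPE.LONG: 'INTEGER',
            FIELD_TYPE.LONGLONG: 'INTEGER',
            FIELD_TYPE.INT24: 'INTEGER',
            FIELD_TYPE.FLOAT: 'FLOAT',
            FIELD_TYPE.DOUBLE: 'FLOAT',
            FIELD_TYPE.DECIMAL: 'FLOAT',
            FIELD_TYPE.TIMESTAMP: 'TIMESTAMP',
            FIELD_TYPE.DATETIME: 'TIMESTAMP',
            FIELD_TYPE.DATE: 'TIMESTAMP',
            FIELD_TYPE.BIT: 'INTEGER',
            FIELD_TYPE.BLOB: 'BYTES',
        }
        return mapping.get(mysql_type, 'STRING')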