airflow.providers.yandex.operators.yandexcloud_dataproc

Module Contents

class airflow.providers.yandex.operators.yandexcloud_dataproc.DataprocCreateClusterOperator(*, folder_id: Optional[str] = None, cluster_name: Optional[str] = None, cluster_description: str = '', cluster_image_version: str = '1.1', ssh_public_keys: Optional[Union[str, Iterable[str]]] = None, subnet_id: Optional[str] = None, services: Iterable[str] = ('HDFS', 'YARN', 'MAPREDUCE', 'HIVE', 'SPARK'), s3_bucket: Optional[str] = None, zone: str = 'ru-central1-b', service_account_id: Optional[str] = None, masternode_resource_preset: str = 's2.small', masternode_disk_size: int = 15, masternode_disk_type: str = 'network-ssd', datanode_resource_preset: str = 's2.small', datanode_disk_size: int = 15, datanode_disk_type: str = 'network-ssd', datanode_count: int = 2, computenode_resource_preset: str = 's2.small', computenode_disk_size: int = 15, computenode_disk_type: str = 'network-ssd', computenode_count: int = 0, connection_id: Optional[str] = None, **kwargs)[source]

Bases: airflow.models.BaseOperator

Creates a Yandex.Cloud Data Proc cluster.

Parameters
  • folder_id (Optional[str]) -- ID of the folder in which the cluster should be created.

  • cluster_name (Optional[str]) -- Cluster name. Must be unique inside the folder.

  • cluster_description (str) -- Cluster description.

  • cluster_image_version (str) -- Cluster image version. Use the default unless a specific version is required.

  • ssh_public_keys (Optional[Union[str, Iterable[str]]]) -- List of SSH public keys that will be deployed to created compute instances.

  • subnet_id (Optional[str]) -- ID of the subnetwork. All Data Proc cluster nodes will use one subnetwork.

  • services (Iterable[str]) -- List of services that will be installed on the cluster. Possible options: HDFS, YARN, MAPREDUCE, HIVE, TEZ, ZOOKEEPER, HBASE, SQOOP, FLUME, SPARK, ZEPPELIN, OOZIE

  • s3_bucket (Optional[str]) -- Yandex.Cloud S3 bucket to store cluster logs. Jobs will not work if the bucket is not specified.

  • zone (str) -- Availability zone to create the cluster in. Currently available zones are ru-central1-a, ru-central1-b and ru-central1-c.

  • service_account_id (Optional[str]) -- Service account ID for the cluster. The service account can be created inside the folder.

  • masternode_resource_preset (str) -- Resources preset (CPU+RAM configuration) for the master node of the cluster.

  • masternode_disk_size (int) -- Masternode storage size in GiB.

  • masternode_disk_type (str) -- Masternode storage type. Possible options: network-ssd, network-hdd.

  • datanode_resource_preset (str) -- Resources preset (CPU+RAM configuration) for the data nodes of the cluster.

  • datanode_disk_size (int) -- Datanodes storage size in GiB.

  • datanode_disk_type (str) -- Datanodes storage type. Possible options: network-ssd, network-hdd.

  • computenode_resource_preset (str) -- Resources preset (CPU+RAM configuration) for the compute nodes of the cluster.

  • computenode_disk_size (int) -- Computenodes storage size in GiB.

  • computenode_disk_type (str) -- Computenodes storage type. Possible options: network-ssd, network-hdd.

  • connection_id (Optional[str]) -- ID of the Yandex.Cloud Airflow connection.

execute(self, context)[source]
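
A minimal usage sketch of this operator inside a DAG, assuming a Yandex.Cloud Airflow connection named yandexcloud_default; the folder, subnet and bucket identifiers below are hypothetical placeholders, not values taken from this module:

    from datetime import datetime

    from airflow import DAG
    from airflow.providers.yandex.operators.yandexcloud_dataproc import (
        DataprocCreateClusterOperator,
    )

    with DAG(
        dag_id='example_yandexcloud_dataproc',
        start_date=datetime(2021, 1, 1),
        schedule_interval=None,
    ) as dag:
        create_cluster = DataprocCreateClusterOperator(
            task_id='create_cluster',
            connection_id='yandexcloud_default',   # assumed connection ID
            folder_id='b1g0000000000000000',       # hypothetical folder ID
            subnet_id='e9b0000000000000000',       # hypothetical subnet ID
            s3_bucket='my-dataproc-logs',          # bucket for cluster and job logs
            zone='ru-central1-b',
            computenode_count=1,                   # add one compute node to the defaults
        )
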
class airflow.providers.yandex.operators.yandexcloud_dataproc.DataprocDeleteClusterOperator(*, connection_id: Optional[str] = None, cluster_id: Optional[str] = None, **kwargs)[source]

Bases: airflow.models.BaseOperator

Deletes a Yandex.Cloud Data Proc cluster.

Parameters
  • connection_id (Optional[str]) -- ID of the Yandex.Cloud Airflow connection.

  • cluster_id (Optional[str]) -- ID of the cluster to remove. (templated)

template_fields = ['cluster_id'][source]
execute(self, context)[source]
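
A usage sketch, assumed to live inside the same DAG as a create task; cluster_id is templated and may be omitted so that the operator falls back to the Dataproc hook, as noted above:

    from airflow.providers.yandex.operators.yandexcloud_dataproc import (
        DataprocDeleteClusterOperator,
    )

    # Inside a `with DAG(...)` block:
    delete_cluster = DataprocDeleteClusterOperator(
        task_id='delete_cluster',
        connection_id='yandexcloud_default',  # assumed connection ID
        # cluster_id omitted: the operator will try to take it from the hook,
        # e.g. after DataprocCreateClusterOperator ran earlier in the DAG.
        trigger_rule='all_done',              # delete the cluster even if jobs failed
    )

In a typical DAG the tasks would be chained so that cleanup always runs last, for example create_cluster >> hive_query >> delete_cluster.
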
class airflow.providers.yandex.operators.yandexcloud_dataproc.DataprocCreateHiveJobOperator(*, query: Optional[str] = None, query_file_uri: Optional[str] = None, script_variables: Optional[Dict[str, str]] = None, continue_on_failure: bool = False, properties: Optional[Dict[str, str]] = None, name: str = 'Hive job', cluster_id: Optional[str] = None, connection_id: Optional[str] = None, **kwargs)[source]

Bases: airflow.models.BaseOperator

Runs a Hive job in a Data Proc cluster.

Parameters
  • query (Optional[str]) -- Hive query.

  • query_file_uri (Optional[str]) -- URI of the script that contains Hive queries. Can be placed in HDFS or S3.

  • properties (Optional[Dict[str, str]]) -- A mapping of property names to values, used to configure Hive.

  • script_variables (Optional[Dict[str, str]]) -- Mapping of query variable names to values.

  • continue_on_failure (bool) -- Whether to continue executing queries if a query fails.

  • name (str) -- Name of the job. Used for labeling.

  • cluster_id (Optional[str]) -- ID of the cluster to run the job in. Will try to take the ID from the Dataproc Hook object if not specified. (templated)

  • connection_id (Optional[str]) -- ID of the Yandex.Cloud Airflow connection.

template_fields = ['cluster_id'][source]
execute(self, context)[source]
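
A sketch showing both ways of submitting Hive work, assuming the cluster ID is resolved from the hook as described above; the query text, S3 path, variables and property are illustrative only:

    from airflow.providers.yandex.operators.yandexcloud_dataproc import (
        DataprocCreateHiveJobOperator,
    )

    # Inside a `with DAG(...)` block:
    hive_query = DataprocCreateHiveJobOperator(
        task_id='hive_query',
        query='SHOW DATABASES;',              # inline query
    )

    hive_script = DataprocCreateHiveJobOperator(
        task_id='hive_script',
        query_file_uri='s3a://my-bucket/hive/query.sql',  # hypothetical script location
        script_variables={'LIMIT': '100'},
        properties={'hive.exec.max.dynamic.partitions': '1000'},
        continue_on_failure=False,
    )
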
class airflow.providers.yandex.operators.yandexcloud_dataproc.DataprocCreateMapReduceJobOperator(*, main_class: Optional[str] = None, main_jar_file_uri: Optional[str] = None, jar_file_uris: Optional[Iterable[str]] = None, archive_uris: Optional[Iterable[str]] = None, file_uris: Optional[Iterable[str]] = None, args: Optional[Iterable[str]] = None, properties: Optional[Dict[str, str]] = None, name: str = 'Mapreduce job', cluster_id: Optional[str] = None, connection_id: Optional[str] = None, **kwargs)[source]

Bases: airflow.models.BaseOperator

Runs a MapReduce job in a Data Proc cluster.

Parameters
  • main_jar_file_uri (Optional[str]) -- URI of the JAR file with the job. Can be placed in HDFS or S3. Can be specified instead of main_class.

  • main_class (Optional[str]) -- Name of the main class of the job. Can be specified instead of main_jar_file_uri.

  • file_uris (Optional[Iterable[str]]) -- URIs of files used in the job. Can be placed in HDFS or S3.

  • archive_uris (Optional[Iterable[str]]) -- URIs of archive files used in the job. Can be placed in HDFS or S3.

  • jar_file_uris (Optional[Iterable[str]]) -- URIs of JAR files used in the job. Can be placed in HDFS or S3.

  • properties (Optional[Dict[str, str]]) -- Properties for the job.

  • args (Optional[Iterable[str]]) -- Arguments to be passed to the job.

  • name (str) -- Name of the job. Used for labeling.

  • cluster_id (Optional[str]) -- ID of the cluster to run the job in. Will try to take the ID from the Dataproc Hook object if not specified. (templated)

  • connection_id (Optional[str]) -- ID of the Yandex.Cloud Airflow connection.

template_fields = ['cluster_id'][source]
execute(self, context)[source]
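
A sketch of a streaming-style MapReduce job, assuming the mapper and reducer scripts live in S3; the class name, paths and property below are illustrative assumptions:

    from airflow.providers.yandex.operators.yandexcloud_dataproc import (
        DataprocCreateMapReduceJobOperator,
    )

    # Inside a `with DAG(...)` block:
    mapreduce_job = DataprocCreateMapReduceJobOperator(
        task_id='mapreduce_job',
        main_class='org.apache.hadoop.streaming.HadoopStreaming',
        file_uris=[
            's3a://my-bucket/scripts/mapper.py',   # hypothetical script locations
            's3a://my-bucket/scripts/reducer.py',
        ],
        args=[
            '-mapper', 'mapper.py',
            '-reducer', 'reducer.py',
            '-input', 's3a://my-bucket/input',     # hypothetical input/output paths
            '-output', 's3a://my-bucket/output',
        ],
        properties={'mapreduce.job.maps': '2'},
    )
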
class airflow.providers.yandex.operators.yandexcloud_dataproc.DataprocCreateSparkJobOperator(*, main_class: Optional[str] = None, main_jar_file_uri: Optional[str] = None, jar_file_uris: Optional[Iterable[str]] = None, archive_uris: Optional[Iterable[str]] = None, file_uris: Optional[Iterable[str]] = None, args: Optional[Iterable[str]] = None, properties: Optional[Dict[str, str]] = None, name: str = 'Spark job', cluster_id: Optional[str] = None, connection_id: Optional[str] = None, **kwargs)[source]

Bases: airflow.models.BaseOperator

Runs a Spark job in a Data Proc cluster.

Parameters
  • main_jar_file_uri (Optional[str]) -- URI of the JAR file with the job. Can be placed in HDFS or S3.

  • main_class (Optional[str]) -- Name of the main class of the job.

  • file_uris (Optional[Iterable[str]]) -- URIs of files used in the job. Can be placed in HDFS or S3.

  • archive_uris (Optional[Iterable[str]]) -- URIs of archive files used in the job. Can be placed in HDFS or S3.

  • jar_file_uris (Optional[Iterable[str]]) -- URIs of JAR files used in the job. Can be placed in HDFS or S3.

  • properties (Optional[Dict[str, str]]) -- Properties for the job.

  • args (Optional[Iterable[str]]) -- Arguments to be passed to the job.

  • name (str) -- Name of the job. Used for labeling.

  • cluster_id (Optional[str]) -- ID of the cluster to run the job in. Will try to take the ID from the Dataproc Hook object if not specified. (templated)

  • connection_id (Optional[str]) -- ID of the Yandex.Cloud Airflow connection.

template_fields = ['cluster_id'][source]
execute(self, context)[source]
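
A sketch of submitting the stock SparkPi example, assuming the Spark examples JAR ships with the cluster image at the path shown (adjust for your image):

    from airflow.providers.yandex.operators.yandexcloud_dataproc import (
        DataprocCreateSparkJobOperator,
    )

    # Inside a `with DAG(...)` block:
    spark_job = DataprocCreateSparkJobOperator(
        task_id='spark_job',
        main_jar_file_uri='file:///usr/lib/spark/examples/jars/spark-examples.jar',  # assumed path
        main_class='org.apache.spark.examples.SparkPi',
        args=['10'],                           # number of partitions for the Pi estimate
    )
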
class airflow.providers.yandex.operators.yandexcloud_dataproc.DataprocCreatePysparkJobOperator(*, main_python_file_uri: Optional[str] = None, python_file_uris: Optional[Iterable[str]] = None, jar_file_uris: Optional[Iterable[str]] = None, archive_uris: Optional[Iterable[str]] = None, file_uris: Optional[Iterable[str]] = None, args: Optional[Iterable[str]] = None, properties: Optional[Dict[str, str]] = None, name: str = 'Pyspark job', cluster_id: Optional[str] = None, connection_id: Optional[str] = None, **kwargs)[source]

Bases: airflow.models.BaseOperator

Runs a PySpark job in a Data Proc cluster.

Parameters
  • main_python_file_uri (Optional[str]) -- URI of the Python file with the job. Can be placed in HDFS or S3.

  • python_file_uris (Optional[Iterable[str]]) -- URIs of Python files used in the job. Can be placed in HDFS or S3.

  • file_uris (Optional[Iterable[str]]) -- URIs of files used in the job. Can be placed in HDFS or S3.

  • archive_uris (Optional[Iterable[str]]) -- URIs of archive files used in the job. Can be placed in HDFS or S3.

  • jar_file_uris (Optional[Iterable[str]]) -- URIs of JAR files used in the job. Can be placed in HDFS or S3.

  • properties (Optional[Dict[str, str]]) -- Properties for the job.

  • args (Optional[Iterable[str]]) -- Arguments to be passed to the job.

  • name (str) -- Name of the job. Used for labeling.

  • cluster_id (Optional[str]) -- ID of the cluster to run the job in. Will try to take the ID from the Dataproc Hook object if not specified. (templated)

  • connection_id (Optional[str]) -- ID of the Yandex.Cloud Airflow connection.

template_fields = ['cluster_id'][source]
execute(self, context)[source]
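
A sketch of a PySpark submission, assuming the job script and its helper module are stored in S3 (paths are hypothetical):

    from airflow.providers.yandex.operators.yandexcloud_dataproc import (
        DataprocCreatePysparkJobOperator,
    )

    # Inside a `with DAG(...)` block:
    pyspark_job = DataprocCreatePysparkJobOperator(
        task_id='pyspark_job',
        main_python_file_uri='s3a://my-bucket/scripts/job.py',    # hypothetical entry point
        python_file_uris=['s3a://my-bucket/scripts/helpers.py'],  # extra modules shipped with the job
        args=['--output', 's3a://my-bucket/output'],
    )
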
