Yandex.Cloud Data Proc Operators

Yandex.Cloud Data Proc is a service that helps you deploy Apache Hadoop® and Apache Spark™ clusters in the Yandex.Cloud infrastructure.

You can control the cluster size, node capacity, and set of Apache® services (Spark, HDFS, YARN, Hive, HBase, Oozie, Sqoop, Flume, Tez, Zeppelin).

Apache Hadoop is used for storing and analyzing structured and unstructured big data.

Apache Spark is a tool for fast data processing that can be integrated with Apache Hadoop as well as with other storage systems.

Prerequisite Tasks

  1. Install the Yandex.Cloud provider dependencies first: pip install 'apache-airflow[yandexcloud]'.

  2. Restart the Airflow webserver and scheduler.

  3. Make sure the Yandex.Cloud connection type has been defined in Airflow. Open the connections list and look for a connection with 'yandexcloud' type.

  4. Fill in the required fields of the Yandex.Cloud connection; one way of doing this without the UI is sketched after this list.
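
For example, the connection can be supplied through an environment variable instead of the UI. The sketch below assumes Airflow 2.3+ (which accepts a JSON-serialized connection in AIRFLOW_CONN_* variables) and that the provider's default connection id is yandexcloud_default; the extra field names (folder_id, service_account_json_path) and their values are illustrative and may differ between provider versions.

    # Minimal sketch: expose a Yandex.Cloud connection to Airflow through an
    # environment variable. Set the variable in the environment of the
    # scheduler and workers (or export it in your deployment instead).
    import json
    import os

    # Assumed connection id and extra field names -- check your provider version.
    os.environ["AIRFLOW_CONN_YANDEXCLOUD_DEFAULT"] = json.dumps(
        {
            "conn_type": "yandexcloud",
            "extra": {
                "folder_id": "b1g0000000000000000",                # hypothetical folder id
                "service_account_json_path": "/opt/keys/sa.json",  # hypothetical key path
            },
        }
    )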

Using the operators

See the usage examples in the example DAGs.
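
For instance, a minimal DAG could create a Data Proc cluster, run a PySpark job on it, and then delete the cluster. The sketch below assumes the operators DataprocCreateClusterOperator, DataprocCreatePysparkJobOperator, and DataprocDeleteClusterOperator from the yandex provider (the import path may be airflow.providers.yandex.operators.dataproc in newer provider versions), a configured yandexcloud_default connection, and placeholder values for the zone, bucket, and job file URI.

    # A minimal sketch of a Data Proc workflow: create a cluster, run a PySpark
    # job on it, then tear the cluster down. Zone, bucket and file URIs below
    # are placeholders.
    from datetime import datetime

    from airflow import DAG
    from airflow.providers.yandex.operators.yandexcloud_dataproc import (
        DataprocCreateClusterOperator,
        DataprocCreatePysparkJobOperator,
        DataprocDeleteClusterOperator,
    )

    with DAG(
        dag_id="example_yandexcloud_dataproc",
        start_date=datetime(2024, 1, 1),
        schedule=None,  # trigger manually; use schedule_interval on Airflow < 2.4
        catchup=False,
    ) as dag:
        create_cluster = DataprocCreateClusterOperator(
            task_id="create_cluster",
            zone="ru-central1-b",                 # availability zone for the cluster
            s3_bucket="my-dataproc-logs-bucket",  # hypothetical bucket for job logs
            computenode_count=1,
            datanode_count=0,
            services=("HDFS", "YARN", "SPARK"),   # subset of the supported services
        )

        pyspark_job = DataprocCreatePysparkJobOperator(
            task_id="run_pyspark_job",
            main_python_file_uri="s3a://my-dataproc-bucket/jobs/main.py",  # placeholder
        )

        delete_cluster = DataprocDeleteClusterOperator(
            task_id="delete_cluster",
            trigger_rule="all_done",  # remove the cluster even if the job fails
        )

        create_cluster >> pyspark_job >> delete_cluster

In the provider's own example DAG, only the create step carries the cluster parameters; the job and delete operators pick up the id of the cluster created earlier in the same run, so they do not need it passed explicitly (check the docs for your provider version).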
