Yandex.Cloud Data Proc Operators

Yandex.Cloud Data Proc is a service that helps you deploy Apache Hadoop® and Apache Spark™ clusters in the Yandex.Cloud infrastructure.

You can control the cluster size, node capacity, and set of Apache® services (Spark, HDFS, YARN, Hive, HBase, Oozie, Sqoop, Flume, Tez, Zeppelin).

Apache Hadoop is used for storing and analyzing structured and unstructured big data.

Apache Spark is a tool for fast data processing that can be integrated with Apache Hadoop as well as with other storage systems.

Prerequisite Tasks

  1. Install the Yandex.Cloud provider dependencies first: pip install 'apache-airflow[yandexcloud]'.

  2. Restart the Airflow webserver and scheduler.

  3. Make sure the Yandex.Cloud connection type has been defined in Airflow. Open the connections list and look for a connection with 'yandexcloud' type.

  4. Fill in the required fields of the Yandex.Cloud connection; one way of doing this without the UI is sketched after this list.
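
For example, the connection can be supplied through an environment variable instead of the UI. The sketch below assumes Airflow 2.3+ (which accepts a JSON-serialized connection in AIRFLOW_CONN_* variables) and that the provider's default connection id is yandexcloud_default; the extra field names (folder_id, service_account_json_path) and their values are illustrative and may differ between provider versions.

    # Minimal sketch: expose a Yandex.Cloud connection to Airflow through an
    # environment variable. Set the variable in the environment of the
    # scheduler and workers (or export it in your deployment instead).
    import json
    import os

    # Assumed connection id and extra field names -- check your provider version.
    os.environ["AIRFLOW_CONN_YANDEXCLOUD_DEFAULT"] = json.dumps(
        {
            "conn_type": "yandexcloud",
            "extra": {
                "folder_id": "b1g0000000000000000",                # hypothetical folder id
                "service_account_json_path": "/opt/keys/sa.json",  # hypothetical key path
            },
        }
    )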

Using the operators

See the usage examples in the example DAGs.
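
For instance, a minimal DAG could create a Data Proc cluster, run a PySpark job on it, and then delete the cluster. The sketch below assumes the operators DataprocCreateClusterOperator, DataprocCreatePysparkJobOperator, and DataprocDeleteClusterOperator from the yandex provider (the import path may be airflow.providers.yandex.operators.dataproc in newer provider versions), a configured yandexcloud_default connection, and placeholder values for the zone, bucket, and job file URI.

    # A minimal sketch of a Data Proc workflow: create a cluster, run a PySpark
    # job on it, then tear the cluster down. Zone, bucket and file URIs below
    # are placeholders.
    from datetime import datetime

    from airflow import DAG
    from airflow.providers.yandex.operators.yandexcloud_dataproc import (
        DataprocCreateClusterOperator,
        DataprocCreatePysparkJobOperator,
        DataprocDeleteClusterOperator,
    )

    with DAG(
        dag_id="example_yandexcloud_dataproc",
        start_date=datetime(2024, 1, 1),
        schedule=None,  # trigger manually; use schedule_interval on Airflow < 2.4
        catchup=False,
    ) as dag:
        create_cluster = DataprocCreateClusterOperator(
            task_id="create_cluster",
            zone="ru-central1-b",                 # availability zone for the cluster
            s3_bucket="my-dataproc-logs-bucket",  # hypothetical bucket for job logs
            computenode_count=1,
            datanode_count=0,
            services=("HDFS", "YARN", "SPARK"),   # subset of the supported services
        )

        pyspark_job = DataprocCreatePysparkJobOperator(
            task_id="run_pyspark_job",
            main_python_file_uri="s3a://my-dataproc-bucket/jobs/main.py",  # placeholder
        )

        delete_cluster = DataprocDeleteClusterOperator(
            task_id="delete_cluster",
            trigger_rule="all_done",  # remove the cluster even if the job fails
        )

        create_cluster >> pyspark_job >> delete_cluster

In the provider's own example DAG, only the create step carries the cluster parameters; the job and delete operators pick up the id of the cluster created earlier in the same run, so they do not need it passed explicitly (check the docs for your provider version).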
