tests.system.providers.google.cloud.dataproc.example_dataproc_pyspark

Example Airflow DAG for DataprocSubmitJobOperator with a PySpark job.
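The listing below only links to the attribute values. As orientation, here is a minimal sketch of how a DAG like this typically stages the script and submits it with DataprocSubmitJobOperator; the cluster configuration, naming scheme, and environment lookups are illustrative assumptions, not the verbatim module source.

# Minimal sketch of the example's shape (assumed, not the verbatim module source).
import os
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.operators.dataproc import (
    DataprocCreateClusterOperator,
    DataprocDeleteClusterOperator,
    DataprocSubmitJobOperator,
)
from airflow.providers.google.cloud.operators.gcs import (
    GCSCreateBucketOperator,
    GCSDeleteBucketOperator,
)
from airflow.providers.google.cloud.transfers.local_to_gcs import LocalFilesystemToGCSOperator
from airflow.utils.trigger_rule import TriggerRule

# The real module reads these identifiers from the test environment.
ENV_ID = os.environ.get("SYSTEM_TESTS_ENV_ID", "example-env")
PROJECT_ID = os.environ.get("SYSTEM_TESTS_GCP_PROJECT", "example-project")
DAG_ID = "dataproc_pyspark"
REGION = "europe-west1"
BUCKET_NAME = f"bucket_{DAG_ID}_{ENV_ID}"
# Dataproc cluster names may not contain underscores.
CLUSTER_NAME = f"cluster-{DAG_ID}-{ENV_ID}".replace("_", "-")
JOB_FILE_NAME = "dataproc-pyspark-job.py"
JOB_FILE_LOCAL_PATH = f"/tmp/{JOB_FILE_NAME}"

# Illustrative cluster layout; the real CLUSTER_CONFIG is behind the [source] link.
CLUSTER_CONFIG = {
    "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-4"},
    "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-4"},
}

# A Dataproc Job resource pointing at the PySpark script staged in the bucket.
PYSPARK_JOB = {
    "reference": {"project_id": PROJECT_ID},
    "placement": {"cluster_name": CLUSTER_NAME},
    "pyspark_job": {"main_python_file_uri": f"gs://{BUCKET_NAME}/{JOB_FILE_NAME}"},
}

with DAG(DAG_ID, schedule="@once", start_date=datetime(2021, 1, 1), catchup=False) as dag:
    create_bucket = GCSCreateBucketOperator(task_id="create_bucket", bucket_name=BUCKET_NAME)
    # Assumes the PySpark script (JOB_FILE_CONTENT below) has been written to
    # JOB_FILE_LOCAL_PATH before this task runs.
    upload_file = LocalFilesystemToGCSOperator(
        task_id="upload_file", src=JOB_FILE_LOCAL_PATH, dst=JOB_FILE_NAME, bucket=BUCKET_NAME
    )
    create_cluster = DataprocCreateClusterOperator(
        task_id="create_cluster",
        project_id=PROJECT_ID,
        cluster_config=CLUSTER_CONFIG,
        region=REGION,
        cluster_name=CLUSTER_NAME,
    )
    pyspark_task = DataprocSubmitJobOperator(
        task_id="pyspark_task", job=PYSPARK_JOB, region=REGION, project_id=PROJECT_ID
    )
    # Teardown tasks run even if the job fails.
    delete_cluster = DataprocDeleteClusterOperator(
        task_id="delete_cluster",
        project_id=PROJECT_ID,
        cluster_name=CLUSTER_NAME,
        region=REGION,
        trigger_rule=TriggerRule.ALL_DONE,
    )
    delete_bucket = GCSDeleteBucketOperator(
        task_id="delete_bucket", bucket_name=BUCKET_NAME, trigger_rule=TriggerRule.ALL_DONE
    )

    create_bucket >> upload_file >> create_cluster >> pyspark_task >> delete_cluster >> delete_bucket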

Module Contents

tests.system.providers.google.cloud.dataproc.example_dataproc_pyspark.ENV_ID[source]
tests.system.providers.google.cloud.dataproc.example_dataproc_pyspark.DAG_ID = 'dataproc_pyspark'[source]
tests.system.providers.google.cloud.dataproc.example_dataproc_pyspark.PROJECT_ID[source]
tests.system.providers.google.cloud.dataproc.example_dataproc_pyspark.BUCKET_NAME[source]
tests.system.providers.google.cloud.dataproc.example_dataproc_pyspark.CLUSTER_NAME[source]
tests.system.providers.google.cloud.dataproc.example_dataproc_pyspark.REGION = 'europe-west1'[source]
tests.system.providers.google.cloud.dataproc.example_dataproc_pyspark.ZONE = 'europe-west1-b'[source]
tests.system.providers.google.cloud.dataproc.example_dataproc_pyspark.CLUSTER_CONFIG[source]
tests.system.providers.google.cloud.dataproc.example_dataproc_pyspark.JOB_FILE_NAME = 'dataproc-pyspark-job.py'[source]
tests.system.providers.google.cloud.dataproc.example_dataproc_pyspark.JOB_FILE_LOCAL_PATH[source]
tests.system.providers.google.cloud.dataproc.example_dataproc_pyspark.JOB_FILE_CONTENT = Multiline-String[source]
"""from operator import add
from random import random

from pyspark.sql import SparkSession


def f(_: int) -> float:
    x = random() * 2 - 1
    y = random() * 2 - 1
    return 1 if x**2 + y**2 <= 1 else 0


spark = SparkSession.builder.appName("PythonPi").getOrCreate()
partitions = 2
n = 100000 * partitions
count = spark.sparkContext.parallelize(range(1, n + 1), partitions).map(f).reduce(add)
print(f"Pi is roughly {4.0 * count / n:f}")

spark.stop()
"""
tests.system.providers.google.cloud.dataproc.example_dataproc_pyspark.PYSPARK_JOB[source]
tests.system.providers.google.cloud.dataproc.example_dataproc_pyspark.create_bucket[source]
tests.system.providers.google.cloud.dataproc.example_dataproc_pyspark.test_run[source]
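The test_run attribute is the hook that lets pytest execute this example DAG end to end. A sketch of the usual convention, assuming this module follows the shared Airflow system-test pattern and that dag is the DAG object defined above:

# Conventional system-test hook (assumption: this module uses the shared
# tests.system.utils helper; `dag` is the DAG object defined in the module).
from tests.system.utils import get_test_run  # noqa: E402

# Needed so the example DAG can be run with pytest.
test_run = get_test_run(dag)

Many of these system tests also chain list(dag.tasks) >> watcher() from tests.system.utils.watcher so that failures in teardown tasks are propagated; whether this module does so is not visible in the listing above.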
