Papermill

Apache Airflow integrates with Papermill, a tool for parameterizing and executing Jupyter notebooks. For example, you may have a financial report that you want to run with different values on the first or last day of a month, or at the beginning or end of the year. Defining parameters in your notebook and running it with the PapermillOperator covers this case.

Usage

Creating a notebook

To parameterize your notebook, designate a cell with the tag parameters. Papermill looks for this cell and treats its contents as defaults for the parameters passed in at execution time. When the notebook runs, Papermill adds a new cell tagged injected-parameters that contains the input parameters and overwrites the values set in the parameters cell. If no cell is tagged parameters, the injected cell is inserted at the top of the notebook.
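
The placement rule can be illustrated with a small standalone sketch. It does not call Papermill; it only mimics the behavior on a notebook represented as a plain dictionary in the nbformat v4 layout, assuming the injected cell lands directly after the parameters cell. The helper names make_code_cell and inject_parameters are hypothetical and exist only for this illustration.

```python
import copy

PARAMETERS_TAG = "parameters"
INJECTED_TAG = "injected-parameters"


def make_code_cell(source, tags=()):
    """Build a minimal nbformat-v4-style code cell as a plain dict."""
    return {
        "cell_type": "code",
        "metadata": {"tags": list(tags)},
        "source": source,
        "outputs": [],
        "execution_count": None,
    }


def inject_parameters(notebook, parameters):
    """Return a copy of `notebook` with an injected-parameters cell added.

    Illustrative only: this mimics the placement rule, it is not Papermill's code.
    """
    nb = copy.deepcopy(notebook)
    source = "\n".join(f"{name} = {value!r}" for name, value in parameters.items())
    injected = make_code_cell(source, tags=[INJECTED_TAG])

    # Find the first cell tagged `parameters`, if there is one.
    index = next(
        (i for i, cell in enumerate(nb["cells"])
         if PARAMETERS_TAG in cell["metadata"].get("tags", [])),
        None,
    )
    # Insert after the defaults cell when it exists, otherwise at the top.
    insert_at = 0 if index is None else index + 1
    nb["cells"].insert(insert_at, injected)
    return nb


notebook = {
    "nbformat": 4,
    "nbformat_minor": 2,
    "metadata": {},
    "cells": [
        make_code_cell('msgs = "default message"', tags=[PARAMETERS_TAG]),
        make_code_cell("print(msgs)"),
    ],
}

result = inject_parameters(notebook, {"msgs": "Ran from Airflow"})
for cell in result["cells"]:
    print(cell["metadata"]["tags"], "->", cell["source"])
```

Because the injected cell runs after the defaults cell, its assignments win, and every later cell sees the value passed in at execution time.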

Note that the Jupyter Notebook interface supports cell tags out of the box, but for JupyterLab you need to install the celltags extension:

jupyter labextension install @jupyterlab/celltags

Save your notebook in a location that Airflow can access. Papermill supports S3, GCS, Azure, and the local filesystem. HDFS is not supported.

Example DAG

Use the PapermillOperator to execute a Jupyter notebook:

airflow/contrib/example_dags/example_papermill_operator.py

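# Import path as in Airflow 1.10.x (assumed from the example's location).
from airflow.operators.papermill_operator import PapermillOperator

# Define this task inside a DAG; {{ execution_date }} is filled in per run.
run_this = PapermillOperator(
    task_id="run_example_notebook",
    input_nb="/tmp/hello_world.ipynb",
    output_nb="/tmp/out-{{ execution_date }}.ipynb",
    parameters={"msgs": "Ran from Airflow at {{ execution_date }}!"},
)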
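
To preview what the templated values in this example look like for a single run, the sketch below substitutes the placeholder by hand. It is a stand-in only: plain string replacement takes the place of Airflow's Jinja rendering, the run date is hypothetical, and it assumes execution_date renders as an ISO 8601 timestamp. The render helper is not part of Airflow or Papermill.

```python
from datetime import datetime, timezone


def render(template, execution_date):
    """Stand-in for Jinja: substitute the one placeholder used in the example."""
    return template.replace("{{ execution_date }}", execution_date.isoformat())


# Hypothetical run date, chosen only for illustration.
run_date = datetime(2019, 1, 1, tzinfo=timezone.utc)

output_nb = render("/tmp/out-{{ execution_date }}.ipynb", run_date)
parameters = {
    key: render(value, run_date)
    for key, value in {"msgs": "Ran from Airflow at {{ execution_date }}!"}.items()
}

print(output_nb)
print(parameters)
```

Each run therefore writes its own output notebook, named after the run's timestamp, and the notebook receives msgs with the same timestamp embedded.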
