Papermill
Apache Airflow integrates with Papermill, a tool for parameterizing and executing Jupyter Notebooks. Suppose you have a financial report that you want to run with different values on the first or last day of a month, or at the beginning or end of the year. Parameterizing the notebook and running it with the PapermillOperator makes this straightforward.
Usage
Creating a notebook
To parameterize your notebook, designate a cell with the tag parameters. Papermill looks for the parameters cell and treats its contents as defaults for the parameters passed in at execution time. It then adds a new cell tagged injected-parameters, containing the input parameters, to overwrite the values in parameters. If no cell is tagged parameters, the injected cell is inserted at the top of the notebook.
Note that Jupyter Notebook supports cell tags out of the box, but for JupyterLab you need to install
the celltags extension: jupyter labextension install @jupyterlab/celltags
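The tagging mechanism described above can be sketched with the standard library alone. This is an illustration of the idea, not Papermill's implementation: the helper names (make_code_cell, injected_cell_index, inject_parameters) are hypothetical, and the sketch assumes the injected cell is placed directly after the parameters cell when one exists. The only format detail it relies on is that a notebook file is JSON and that a cell's tags are stored under its metadata.

```python
import json


def make_code_cell(source, tags=None):
    """Build a minimal nbformat-4 code cell; tags live in the cell metadata."""
    return {
        "cell_type": "code",
        "execution_count": None,
        "metadata": {"tags": list(tags or [])},
        "outputs": [],
        "source": source,
    }


def injected_cell_index(notebook):
    """Return where an injected-parameters cell would go (hypothetical helper)."""
    for index, cell in enumerate(notebook["cells"]):
        if "parameters" in cell.get("metadata", {}).get("tags", []):
            return index + 1  # assumed: directly after the defaults cell
    return 0  # no tagged cell: top of the notebook


def inject_parameters(notebook, parameters):
    """Return a copy of the notebook with an injected-parameters cell added."""
    lines = [f"{name} = {value!r}" for name, value in parameters.items()]
    cell = make_code_cell("\n".join(lines), tags=["injected-parameters"])
    patched = json.loads(json.dumps(notebook))  # deep copy via JSON round trip
    patched["cells"].insert(injected_cell_index(notebook), cell)
    return patched


notebook = {
    "nbformat": 4,
    "nbformat_minor": 4,
    "metadata": {},
    "cells": [
        make_code_cell("import datetime"),
        make_code_cell('msgs = "default message"', tags=["parameters"]),
        make_code_cell("print(msgs)"),
    ],
}

patched = inject_parameters(notebook, {"msgs": "Ran from Airflow!"})
for cell in patched["cells"]:
    print(cell["metadata"]["tags"], cell["source"])
```

Because the injected cell runs after the defaults, its assignment to msgs replaces the default value before any later cell reads it.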
Make sure that you save your notebook somewhere Airflow can access it. Papermill supports S3, GCS, Azure, and local storage. HDFS is not supported.
Example DAG
Use the PapermillOperator to execute a Jupyter notebook:
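    from airflow.providers.papermill.operators.papermill import PapermillOperator

    # Both output_nb and parameters are templated, so each run writes its own notebook.
    run_this = PapermillOperator(
        task_id="run_example_notebook",
        input_nb="/tmp/hello_world.ipynb",
        output_nb="/tmp/out-{{ execution_date }}.ipynb",
        parameters={"msgs": "Ran from Airflow at {{ execution_date }}!"},
    )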