All code in this guide can be found in the GitHub repo.

To get the most out of this guide, you should have an understanding of Airflow DAGs and tasks. See DAG writing best practices in Apache Airflow.

Before you dive into the specifics, there are a couple of important concepts to understand before you write DAGs that pass data between tasks.

## Ensure idempotency

An important concept for any data pipeline, including an Airflow DAG, is idempotency. This is the property whereby an operation can be applied multiple times without changing the result. This concept is often associated with your entire DAG: if you execute the same DAG run multiple times, you will get the same result. However, it also applies to the tasks within your DAG. If every task in your DAG is idempotent, your full DAG is idempotent as well.

When designing a DAG that passes data between tasks, it's important that you ensure each task is idempotent. This helps with recovery and ensures no data is lost if a failure occurs.

Knowing the size of the data you are passing between Airflow tasks is important when deciding which implementation method to use. As you'll learn, XComs are one method of passing data between tasks, but they are only appropriate for small amounts of data. Large datasets require a method that uses intermediate storage and possibly an external processing framework.

The first method for passing data between Airflow tasks is XCom, a key Airflow feature for sharing task data. XComs allow tasks to exchange task metadata or small amounts of data. They are defined by a key, value, and timestamp. XComs can be "pushed", meaning sent by a task, or "pulled", meaning received by a task. When an XCom is pushed, it is stored in the Airflow metadata database and made available to all other tasks. Any time a task returns a value (for example, when the Python callable for your PythonOperator has a return statement), that value is automatically pushed to XCom. Tasks can also be configured to push XComs explicitly by calling the xcom_push() method; similarly, xcom_pull() can be used in a task to receive an XCom. You can view your XComs in the Airflow UI by going to Admin > XComs.

XComs should be used to pass small amounts of data between tasks. For example, task metadata, dates, model accuracy, or single-value query results are all ideal data to use with XCom. While you can technically pass large amounts of data with XCom, be very careful when doing so, and consider using a custom XCom backend and scaling your Airflow resources.

When you use the standard XCom backend, the size limit for an XCom is determined by your metadata database, and these limits aren't very big (MySQL, for example, caps each value at 64 KB). If you think the data you pass via XCom might exceed the size limit of your metadata database, either use a custom XCom backend or intermediary data storage.

The second limitation of the standard XCom backend is that only certain types of data can be serialized. Airflow supports JSON serialization, as well as Pandas DataFrame serialization in version 2.6 and later. If you need to serialize other data types, you can do so using a custom XCom backend.

Using a custom XCom backend means you can push and pull XComs to and from an external system such as S3, GCS, or HDFS rather than the default of Airflow's metadata database. You can also implement your own serialization and deserialization methods to define how XComs are handled. To learn how to implement a custom XCom backend, follow this step-by-step tutorial.

In this section, you'll review a DAG that uses XCom to pass data between tasks. The DAG uses XComs to analyze cat facts that are retrieved from an API.
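To make the push/pull model concrete, here is a toy, in-memory sketch of how XComs behave. This is not Airflow's implementation: `ToyXComStore` and its methods are invented for illustration. In real Airflow, pushed values live in the metadata database and are keyed by `dag_id`, `task_id`, `run_id`, and `key`.

```python
import datetime

class ToyXComStore:
    """Toy illustration of XCom semantics: values keyed per task, each
    stored with a key, a value, and a timestamp."""

    def __init__(self):
        self._data = {}

    def push(self, task_id, key, value):
        # Like a real XCom, record the value together with a timestamp.
        timestamp = datetime.datetime.now(datetime.timezone.utc)
        self._data[(task_id, key)] = (value, timestamp)

    def pull(self, task_id, key="return_value"):
        # "return_value" mirrors the default key Airflow uses when a
        # task's callable simply returns a value.
        value, _timestamp = self._data[(task_id, key)]
        return value

store = ToyXComStore()
store.push("extract", "return_value", {"row_count": 42})
print(store.pull("extract"))  # {'row_count': 42}
```

A downstream task pulling by `task_id` alone receives whatever the upstream task returned, which is exactly how implicit XCom passing feels in practice.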
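The serialization limitation is easy to demonstrate with the standard library alone: JSON-friendly values round-trip cleanly, while types outside JSON's model (here, a Python set) fail, which is the kind of case where a custom XCom backend with its own serializer helps. This snippet only exercises `json`, not Airflow itself.

```python
import json

# Small, JSON-serializable values -- metadata, dates as strings, single
# query results -- are the kind of data the default XCom backend handles.
ok = json.dumps({"model_accuracy": 0.93, "run_date": "2023-01-01"})

# A Python set is not part of JSON's type system, so serialization fails.
try:
    json.dumps({"ids": {1, 2, 3}})
    set_is_serializable = True
except TypeError:
    set_is_serializable = False

print(set_is_serializable)  # False
```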
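As a preview of the kind of data flow such a DAG involves, the sketch below shows two plain Python functions standing in for the DAG's task callables. Everything here is hypothetical: the function names, the hardcoded sample fact (a stand-in for a real API response), and the word-count "analysis" are invented for illustration. In the actual DAG, the first callable's return value would be pushed to XCom automatically and pulled by the second task.

```python
def get_a_cat_fact():
    # Stand-in for the API call; a real task would fetch a fact from a
    # cat-fact API and return the JSON payload, pushing it to XCom.
    return {"fact": "Cats have five toes on their front paws."}

def analyze_cat_facts(payload):
    # A tiny "analysis": count the words in the fact. Small results like
    # this are ideal XCom values.
    return {"word_count": len(payload["fact"].split())}

fact = get_a_cat_fact()
print(analyze_cat_facts(fact))  # {'word_count': 8}
```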