E-book
Author Ruiter, Julian de

Title Data Pipelines with Apache Airflow
Published [Place of publication not identified] : Simon & Schuster : Manning, 2021

Description 1 online resource
Contents Intro -- inside front cover -- Data Pipelines with Apache Airflow -- Copyright -- brief contents -- contents -- front matter -- preface -- acknowledgments -- Bas Harenslak -- Julian de Ruiter -- about this book -- Who should read this book -- How this book is organized: A road map -- About the code -- LiveBook discussion forum -- about the authors -- about the cover illustration -- Part 1. Getting started -- 1 Meet Apache Airflow -- 1.1 Introducing data pipelines -- 1.1.1 Data pipelines as graphs -- 1.1.2 Executing a pipeline graph -- 1.1.3 Pipeline graphs vs. sequential scripts
1.1.4 Running pipeline using workflow managers -- 1.2 Introducing Airflow -- 1.2.1 Defining pipelines flexibly in (Python) code -- 1.2.2 Scheduling and executing pipelines -- 1.2.3 Monitoring and handling failures -- 1.2.4 Incremental loading and backfilling -- 1.3 When to use Airflow -- 1.3.1 Reasons to choose Airflow -- 1.3.2 Reasons not to choose Airflow -- 1.4 The rest of this book -- Summary -- 2 Anatomy of an Airflow DAG -- 2.1 Collecting data from numerous sources -- 2.1.1 Exploring the data -- 2.2 Writing your first Airflow DAG -- 2.2.1 Tasks vs. operators
2.2.2 Running arbitrary Python code -- 2.3 Running a DAG in Airflow -- 2.3.1 Running Airflow in a Python environment -- 2.3.2 Running Airflow in Docker containers -- 2.3.3 Inspecting the Airflow UI -- 2.4 Running at regular intervals -- 2.5 Handling failing tasks -- Summary -- 3 Scheduling in Airflow -- 3.1 An example: Processing user events -- 3.2 Running at regular intervals -- 3.2.1 Defining scheduling intervals -- 3.2.2 Cron-based intervals -- 3.2.3 Frequency-based intervals -- 3.3 Processing data incrementally -- 3.3.1 Fetching events incrementally
3.3.2 Dynamic time references using execution dates -- 3.3.3 Partitioning your data -- 3.4 Understanding Airflow's execution dates -- 3.4.1 Executing work in fixed-length intervals -- 3.5 Using backfilling to fill in past gaps -- 3.5.1 Executing work back in time -- 3.6 Best practices for designing tasks -- 3.6.1 Atomicity -- 3.6.2 Idempotency -- Summary -- 4 Templating tasks using the Airflow context -- 4.1 Inspecting data for processing with Airflow -- 4.1.1 Determining how to load incremental data -- 4.2 Task context and Jinja templating -- 4.2.1 Templating operator arguments
4.2.2 What is available for templating? -- 4.2.3 Templating the PythonOperator -- 4.2.4 Providing variables to the PythonOperator -- 4.2.5 Inspecting templated arguments -- 4.3 Hooking up other systems -- Summary -- 5 Defining dependencies between tasks -- 5.1 Basic dependencies -- 5.1.1 Linear dependencies -- 5.1.2 Fan-in/-out dependencies -- 5.2 Branching -- 5.2.1 Branching within tasks -- 5.2.2 Branching within the DAG -- 5.3 Conditional tasks -- 5.3.1 Conditions within tasks -- 5.3.2 Making tasks conditional -- 5.3.3 Using built-in operators -- 5.4 More about trigger rules
Summary Data Pipelines with Apache Airflow teaches you how to build and maintain effective data pipelines. You'll explore the most common usage patterns, including aggregating multiple data sources, connecting to and from data lakes, and cloud deployment. Part reference and part tutorial, this practical guide covers every aspect of the directed acyclic graphs (DAGs) that power Airflow, and how to customize them for your pipeline's needs.
Notes © 2021 Manning Publications Co. All rights reserved. 2021
Vendor-supplied metadata
Subject Data mining.
Cloud computing.
Programming languages (Electronic computers)
Python (Computer program language)
Big data.
Machine learning.
Electronic data processing.
Information storage and retrieval systems -- Scalability.
Application program interfaces (Computer software)
Data Mining
Machine Learning
APIs (interfaces)
Python.
Programming Languages.
COMPUTERS.
Data Visualization.
Form Electronic book
Author Harenslak, Bas
ISBN 9781638356837
1638356831
9781617296901
1617296902