Airflow User Survey 2022
This year’s survey has come and gone, and with it we’ve got a new batch of data for everyone! We collected 210 responses over two weeks. We continue to see growth in both contributions and downloads over the last two years, and expect that trend will continue through 2022.
The raw response data will be made available here soon; in the meantime, feel free to email [email protected] for a copy.
TL;DR
Overview of the user
As in previous years, more than half of Airflow users are Data Engineers (54%). Solutions Architects (13%), Developers (12%), DevOps (6%) and Data Scientists (4%) are also active Airflow users! There was a slight increase in the representation of Solutions Architect roles compared to results from 2020 and 2019.
Airflow is used and popular in bigger companies: 64% of Airflow users work for companies with 200+ employees, an 11 percentage point increase compared to 2020.
62% of the survey participants have more than 6 Airflow users in their company.
More Airflow users (65.9%) are very likely to recommend Apache Airflow compared to the survey results in 2020 and 2019. There is a general positive trend in willingness to recommend Airflow: 93% of surveyed Airflow users are likely to recommend it (85.7% in 2019 and 92% in 2020), and only 1% of users are unlikely to (3.6% in 2019 and 3.5% in 2020).
Airflow documentation is a critical source of information, with more than 90% of survey participants using it (a 15 percentage point increase compared to 2020). The documentation is also one of the top areas to improve! Interestingly, Stack Overflow usage is also significant, with about 60% of users declaring it as a source of information (a 24 percentage point increase compared to 2020).
Deployments
85% of the Airflow users have between 1 and 7 active Airflow instances. 62.5% of the Airflow users have between 11 and 250 DAGs in their largest Airflow instance. 75% of the surveyed Airflow users have between 1 and 100 tasks per DAG.
Close to 85% of users use one of the Airflow 2 versions, 9.2% of users still use 1.10.15, and the remaining 6.3% are still using older Airflow 1 versions. The good news is that the majority of users on Airflow 1 are planning migration to Airflow 2 quite soon, with resources and capacity being the main blockers.
In comparison to results from 2020, more users were interested in monitoring in general, and specifically in using tools such as external monitoring services (40.7%, up from 29.6%) and information from the metadata database (35.7%, up from 25.1%).
Celery (52.7%) and Kubernetes (39.4%) are the most common executors used.
Usage
81.3% of Airflow users who responded to the survey don’t have any customisation of Airflow.
XCom (69.8%) is the most popular method to pass inputs and outputs between tasks; however, saving and retrieving inputs and outputs from storage still plays an important role (49%).
Lineage is still quite a new topic for Airflow users: most of them don’t use lineage solutions but might be interested if supported by Airflow (47.5%), are not familiar with data lineage (29%), or say that data lineage is not their concern (13%).
The Airflow web UI is used heavily for Monitoring Runs (95.9%), Accessing Task Logs (89.8%), Manually triggering DAGs (85.2%), Clearing Tasks (82.7%) and Marking Tasks as successful (60.7%). The top 3 views used are: List of DAGs, Task Logs and DAG Runs, which is very similar to results from 2020 and 2019.
Community and contribution
Most Airflow users (57.1%) are aware they could contribute but do not, and an additional 21.7% contribute very rarely. 14.8% of users were not aware they could contribute. There is much more to be done to engage our community to be more active contributors and raise the current 6.4% of users who actively contribute, especially considering that one important blocker for contribution is lack of knowledge on how to start (37.7%).
The future of Airflow
The top area for improvement is still the Airflow web UI (49.5%), closely followed by more telemetry for logging, monitoring and alerting purposes (48%). However, all those efforts should go hand in hand with improved documentation (36.6%) and resources about using Airflow, especially when we take into account the need to onboard new users (36.6%).
DAG Versioning (66.2%) is the winner for new features in Airflow, and it’s not a surprise, as this feature may positively impact the daily work of Airflow users. It is followed by three other ideas: Dependency management and Data-driven scheduling (42.6%), More dynamic task structure (42.1%) and Multi-Tenancy (37.9%).
Overview of the user
What best describes your current occupation? (single choice)
| Occupation | No. | % |
|---|---|---|
| Data Engineer | 114 | 54% |
| Solutions Architect | 27 | 13% |
| Developer | 25 | 12% |
| DevOps | 12 | 6% |
| Data Scientist | 8 | 4% |
| Support Engineer | 5 | 2% |
| Data Analyst | 3 | 1% |
| Business Analyst | 2 | 1% |
| Other | 14 | 7% |
According to the survey, more than half of Airflow users are Data Engineers (54%). Roles of the remaining Airflow users might be broken down into Solutions Architects (13%), Developers (12%), DevOps (6%) and Data Scientists (4%). The 2022 results are similar to those from 2019 and 2020 with a slight increase in the representation of Solutions Architect roles.
How often do you interact with Airflow? (single choice)
| Frequency | No. | % |
|---|---|---|
| Every day | 154 | 73% |
| At least once per week | 36 | 17% |
| At least once per month | 11 | 5% |
| Less than once per month | 9 | 4% |
Users who took the survey are actively using Airflow as part of their current role. 73% of Airflow users who responded use it on a daily basis, 17% weekly.
How many people work at your company? (single choice)
| Company size | No. | % |
|---|---|---|
| 201-5000 | 85 | 41% |
| 5000+ | 49 | 23% |
| 51-200 | 46 | 22% |
| 11-50 | 20 | 10% |
| 1-10 | 9 | 4% |
Airflow is a framework that is used and popular in bigger companies: 64% of Airflow users who responded (compared to 52.7% in 2020) work for companies with more than 200 employees (41% in companies of 201-5000 employees and 23% in companies of 5000+ employees).
How many people at your company use Airflow? (single choice)
| Airflow users | No. | % |
|---|---|---|
| 6-20 | 80 | 38% |
| 1-5 | 61 | 29% |
| 51-200 | 49 | 24% |
| 200+ | 18 | 9% |
Airflow is generally used by small to medium-sized teams. 62% of the survey participants have more than 6 Airflow users in their company (38% have between 6 and 20 users, 24% between 51 and 200 users).
How likely are you to recommend Apache Airflow? (single choice)
| Answer | % 2019 | % 2020 | % 2022 |
|---|---|---|---|
| Very Likely | 45.4% | 61.6% | 65.9% |
| Likely | 40.3% | 30.4% | 26.9% |
| Neutral | 10.7% | 5.4% | 6.3% |
| Unlikely | 2.6% | 1.5% | 0.5% |
| Very Unlikely | 1% | 1% | 0.5% |
According to the survey, more Airflow users (65.9%) are very likely to recommend Apache Airflow compared to the survey results in 2020 and 2019. There is a general positive trend in willingness to recommend Airflow: 93% of surveyed Airflow users are likely to recommend it (92% in 2020 and 85.7% in 2019), and only 1% of users are unlikely to (3.6% in 2019 and 3.5% in 2020).
What is your source of information about Airflow? (multiple choice)
| Source | No. | % |
|---|---|---|
| Documentation | 189 | 90.4% |
| Airflow website (Blog, etc.) | 142 | 67.9% |
| Stack Overflow | 126 | 60.3% |
| GitHub Issues | 104 | 49.8% |
| Slack | 96 | 45.9% |
| Airflow Summit Videos | 88 | 42.1% |
| GitHub Discussions | 76 | 36.4% |
| Airflow Community Webinars | 41 | 19.6% |
| Astronomer Registry | 51 | 24.4% |
| Airflow Mailing List | 34 | 16.3% |
Airflow documentation is a critical source of information, with more than 90% of survey participants using the documentation. Its importance has increased compared to 2020, when documentation was at about the 75% level. Moreover, more than 60% of users are getting information from the Airflow website (67.9%) and Stack Overflow (60.3%), which is also a big increase compared to the 36% level in 2020. What’s interesting is that Slack usage decreased from 63.05% in 2020 to 45.9% in 2022.
Deployments
How many active DAGs do you have in your largest Airflow instance? (single choice)
| Active DAGs | No. | % |
|---|---|---|
| 51-250 | 66 | 31.7% |
| 11-50 | 64 | 30.8% |
| 5-10 | 25 | 12.0% |
| 251-500 | 20 | 9.6% |
| <5 | 14 | 6.7% |
| 1000+ | 10 | 4.8% |
| 501-1000 | 9 | 4.3% |
62.5% of the Airflow users surveyed have between 11 and 250 DAGs in their largest Airflow instance.
How many active Airflow instances do you have? (single choice)
| Active instances | No. | % |
|---|---|---|
| 1 | 52 | 25.2% |
| 2 | 46 | 22.3% |
| 4-7 | 40 | 19.4% |
| 3 | 37 | 18.0% |
| 20+ | 19 | 9.2% |
| 8-10 | 7 | 3.4% |
| 11-20 | 5 | 2.4% |
85% of the Airflow users surveyed have between 1 and 7 active Airflow instances, and nearly 50% have only 1 or 2.
What is the maximum number of tasks that you have used in a single DAG? (single choice)
| Max tasks per DAG | No. | % |
|---|---|---|
| 11-25 | 51 | 24.5% |
| 26-50 | 41 | 19.7% |
| 51-100 | 35 | 16.8% |
| <10 | 29 | 13.9% |
| 101-250 | 23 | 11.1% |
| 501-1000 | 9 | 4.3% |
| 1000-2500 | 8 | 3.8% |
| 251-500 | 8 | 3.8% |
| 2500-5000 | 4 | 1.9% |
75% of the surveyed Airflow users have between 1 and 100 tasks per DAG.
How many schedulers do you have in your largest Airflow instance? (single choice)
| Schedulers | No. | % |
|---|---|---|
| 1 | 113 | 55.1% |
| 2 | 61 | 29.8% |
| 3 | 18 | 8.8% |
| 4+ | 13 | 6.3% |
More than half of the Airflow users who responded to the survey have 1 scheduler in their largest Airflow instance; however, it’s worth noting that almost half of the users have decided to run 2 or more schedulers.
What executor type do you use? (multiple choice)
| Executor | No. | % |
|---|---|---|
| Celery | 107 | 52.7% |
| Kubernetes | 80 | 39.4% |
| Local | 49 | 24.1% |
| Sequential | 21 | 10.3% |
| CeleryKubernetes | 14 | 6.9% |
Celery (52.7%) and Kubernetes (39.4%) are the most common executors used. The CeleryKubernetes executor (6.9%) has also started to be noticed and used by Airflow users.
If you use the Celery executor, how many workers do you have in your largest Airflow instance? (single choice)
| Celery workers | No. | % |
|---|---|---|
| 2-5 | 64 | 44.8% |
| 10+ | 28 | 19.6% |
| 1 | 26 | 18.2% |
| 6-10 | 25 | 17.5% |
Amongst Celery executor users who responded to the survey, close to half (44.8%) have between 2 and 5 workers in their largest Airflow instance. It’s notable that nearly a fifth (19.6%) have more than 10 workers.
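For readers less familiar with the Celery executor, one common reason to run several workers is to route heavy tasks to a dedicated queue. The sketch below is illustrative only and is not taken from the survey; the DAG id, queue name and commands are placeholders.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="celery_queue_example",          # placeholder DAG id
    start_date=datetime(2022, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    # Runs on any worker listening on the default queue.
    small_job = BashOperator(task_id="small_job", bash_command="echo small")

    # Runs only on workers started with: airflow celery worker --queues heavy
    big_job = BashOperator(
        task_id="big_job",
        bash_command="echo big",
        queue="heavy",                      # placeholder queue name
    )

    small_job >> big_job
```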
Which version of Airflow do you currently use? (single choice)
| Airflow version | No. | % |
|---|---|---|
| 1.10.14 or older | 13 | 6.3% |
| 1.10.15 | 19 | 9.2% |
| 2.0.x | 23 | 11.1% |
| 2.1.x | 24 | 11.6% |
| 2.2.x | 79 | 38.2% |
| 2.3.x | 49 | 23.7% |
It’s good to see that close to 85% of users who responded to the survey use one of the Airflow 2 versions; 9.2% of users still use 1.10.15, while the remaining 6.3% are still using older Airflow 1.10 versions.
The good news is that the majority of users on Airflow 1 are planning migration to Airflow 2 quite soon; for now, the capacity needed to undertake such a significant effort is, in their opinion, the main constraint. However, the survey’s comments also show that some users are generally skeptical towards migration to Airflow 2, with negative opinions about the new scheduler or about compatibility with the Helm chart.
As for migration to the newest version of Airflow 2, users who responded to the survey are committed and are especially waiting for the features related to dynamic DAGs. However, some users also reported that they are first waiting to resolve some dependencies they have, or prefer to wait a little longer for the community to test the new version before they decide to move on.
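As a rough illustration of what “dynamic DAG” features can mean in practice, the sketch below uses dynamic task mapping, which was introduced in Airflow 2.3; the task names and values are made up for the example.

```python
from datetime import datetime

from airflow.decorators import dag, task


@dag(start_date=datetime(2022, 1, 1), schedule_interval=None, catchup=False)
def dynamic_mapping_example():
    @task
    def list_batches():
        # In a real DAG this could come from an API call or a database query.
        return [1, 2, 3]

    @task
    def process(batch: int):
        print(f"processing batch {batch}")

    # One mapped task instance is created at runtime for each returned batch.
    process.expand(batch=list_batches())


dynamic_mapping_example()
```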
What metrics do you use to monitor Airflow? (multiple choice)
| Monitoring | No. | % |
|---|---|---|
| External monitoring service | 81 | 40.7% |
| Information from metadatabase | 71 | 35.7% |
| Statsd | 54 | 27.1% |
| I do not use monitoring | 47 | 23.6% |
| Other | 14 | 7% |
In comparison to results from 2020, more users are monitoring Airflow in some way. External monitoring services (40.7%) and information from the metadata database (35.7%) have started to play a more important role in Airflow monitoring.
How do you deploy Airflow? (multiple choice)
| Deployment method | No. | % |
|---|---|---|
| On virtual machines (for example using AWS EC2) | 63 | 30.6% |
| Using a managed service like Astronomer, Google Composer or AWS MWAA | 54 | 26.2% |
| On Kubernetes (using Apache Airflow’s helm chart) | 46 | 22.3% |
| On premises | 43 | 20.9% |
| On Kubernetes (using custom deployments) | 39 | 18.9% |
| On Kubernetes (using another helm chart) | 21 | 10.2% |
| Other | 13 | 6.5% |
More than half of Airflow users who responded (51.4%) deploy Airflow on Kubernetes. This is about 20 percent more than in 2020. The remaining top deployment methods are on virtual machines (30.6%) and via managed services (26.2%).
How do you distribute your DAGs from your developer environment to the cloud? (single choice)
| Distribution method | No. | % |
|---|---|---|
| Using a synchronizing process (Git sync, GCS fuse, etc) | 100 | 49% |
| Bake them into the docker image | 51 | 25% |
| Shared files system | 30 | 14.7% |
| Other | 16 | 7.9% |
| I don’t know | 7 | 3.4% |
According to the survey responses, the most popular way of distributing DAGs is a synchronizing process: about half of Airflow users (49%) use such a process to distribute DAGs from developer environments to the cloud.
Usage
Do you have any customisation of Airflow? (single choice)
| Customisation | No. | % |
|---|---|---|
| No, we use vanilla airflow | 165 | 81.3% |
| Yes, we have a separate fork | 13 | 6.4% |
| Yes, we use a 3rd-party fork | 12 | 5.9% |
| Yes, we’ve backpropagated bug fixes to an older version | 13 | 6.4% |
More Airflow users (81.3%, compared to 75.9% in 2020) don’t have any customisation of Airflow. Those Airflow users who do have customisations (18.7%) introduced them mainly to separate development and production workflows, to backport bug fixes, to apply security fixes, or to run a backfill command on a Kubernetes pod.
Which Metadata Database do you use? (single choice)
| Metadata database | No. | % |
|---|---|---|
| PostgreSQL 13 | 86 | 43.9% |
| PostgreSQL 12 | 74 | 37.8% |
| MySQL 8 | 22 | 11.2% |
| MySQL 5 | 9 | 4.6% |
| MariaDB | 4 | 2.0% |
| MsSQL | 1 | 0.5% |
According to the survey responses, the most popular metadata databases are PostgreSQL 13 (43.9%) and PostgreSQL 12 (37.8%). This represents a sharp increase from 2020, up from 68.9% to 81.7% total on PostgreSQL, with a corresponding decrease in MySQL, down from 23% to 15%. This is an interesting result, taking into account the community discussion about not adding support for more database backends, or even deciding on supporting a single database.
What’s the primary method by which you integrate with providers and external services in your Airflow DAGs? (single choice)
| Integration method | No. | % |
|---|---|---|
| Using existing dedicated operators / hooks | 70 | 34.5% |
| Using Bash/Python operators | 58 | 28.6% |
| Using custom operators / hooks | 50 | 24.6% |
| Using KubernetesPodOperator | 25 | 12.3% |
According to the survey responses, the most popular ways of connecting Airflow to external services are: using existing dedicated operators / hooks (34.5%), using Bash/Python operators (28.6%), and using custom operators / hooks (24.6%). Using the KubernetesPodOperator (12.3%) is less popular among the survey responses. The ranking of integration methods is similar to the one from 2020.
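To make the two most popular styles concrete, here is a minimal sketch contrasting a dedicated provider operator with a plain PythonOperator; the DAG id, connection id, SQL and callable are placeholders, and the example assumes the Postgres provider package is installed.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.postgres.operators.postgres import PostgresOperator


def _call_external_service():
    # "Bash/Python operator" style: arbitrary code talking to the service directly.
    print("calling an external API with plain Python")


with DAG(
    dag_id="integration_styles_example",      # placeholder DAG id
    start_date=datetime(2022, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    # Dedicated operator/hook style: the provider handles the connection details.
    create_table = PostgresOperator(
        task_id="create_table",
        postgres_conn_id="my_postgres",       # placeholder connection id
        sql="CREATE TABLE IF NOT EXISTS survey_demo (id INT);",
    )

    call_api = PythonOperator(
        task_id="call_api",
        python_callable=_call_external_service,
    )

    create_table >> call_api
```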
What providers do you use in your Airflow DAGs? (multiple choice)
| Provider | No. | % |
|---|---|---|
| Amazon Web Services | 112 | 55.4% |
| Google Cloud Platform / Google APIs | 79 | 39.1% |
| Internal company systems | 75 | 37.1% |
| Hadoop / Spark / Flink / Other Apache software | 57 | 28.2% |
| Microsoft Azure | 17 | 8.4% |
| Other | 21 | 10.5% |
| I do not use external services in my Airflow DAGs | 14 | 6.9% |
It’s not surprising that Amazon Web Services (55.4% vs 59.6% in 2020) leads the list of Airflow providers, followed by Google Cloud Platform (39.1% vs 47.7% in 2020), internal company systems (37.1% vs 55.6% in 2020), and other Apache products (28.2% vs 35.47% in 2020).
How frequently do you upgrade Airflow environments? (single choice)
| Upgrade frequency | No. | % |
|---|---|---|
| Every 12 months | 46 | 22.9% |
| Every 6 months | 49 | 24.4% |
| Once a quarter | 47 | 23.4% |
| Whenever there is a newer version | 59 | 29.4% |
The different frequencies of Airflow environment upgrades are almost equally popular amongst Airflow users who responded to the survey.
Do you upgrade providers separately from the core? (single choice)
| Provider upgrades | No. | % |
|---|---|---|
| When I need it | 83 | 42.8% |
| Never - always use the providers that come with Airflow | 68 | 35.1% |
| I did not know I can upgrade providers separately | 32 | 16.5% |
| I upgrade providers when they are released | 11 | 5.7% |
According to the survey responses, Airflow users most often upgrade providers when they need it (42.8%) or prefer to stay with providers that come with Airflow (35.1%). It’s surprising that 16.5% of Airflow users who responded to the survey were not aware that they can upgrade their providers separately from the core Airflow.
How do you pass inputs and outputs between tasks? (multiple choice)
| Method | No. | % |
|---|---|---|
| Xcom | 141 | 69.8% |
| Saving and retrieving from Storage | 99 | 49% |
| TaskFlow | 37 | 18.3% |
| Other | 5 | 2.5% |
| We don’t | 29 | 14.4% |
According to the survey responses, XCom (69.8%) is the most popular method to pass inputs and outputs between tasks; however, saving and retrieving inputs and outputs from storage still plays an important role (49%). It’s interesting that close to 15% of Airflow users who responded to the survey declare that they do not pass any outputs or inputs between tasks.
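As a minimal sketch of how XCom passing looks in practice (here via the TaskFlow API, which pushes and pulls XComs implicitly), consider the example below; the task names and values are illustrative.

```python
from datetime import datetime

from airflow.decorators import dag, task


@dag(start_date=datetime(2022, 1, 1), schedule_interval=None, catchup=False)
def xcom_example():
    @task
    def extract() -> dict:
        # The return value is pushed to XCom automatically.
        return {"rows": 42}

    @task
    def load(payload: dict):
        # TaskFlow pulls the upstream XCom and passes it in as an argument.
        print(f"loaded {payload['rows']} rows")

    load(extract())


xcom_example()
```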
Do you use a data lineage backend? (multiple choice)
| Lineage backend | No. | % |
|---|---|---|
| No, but I will use such feature if fully supported in Airflow | 95 | 47.5% |
| I’m not familiar with data lineage | 58 | 29% |
| No, data lineage isn’t a concern for my usage | 26 | 13% |
| Yes, I send lineage to an Open Source lineage repository | 15 | 7.5% |
| Yes, I send lineage to an Enterprise lineage repository | 7 | 3.5% |
| Yes, I send lineage to a custom internal lineage repository | 9 | 4.5% |
When asked which lineage backend they use, the answers indicated that, while lineage itself is quite a new topic, there is interest in the feature as a whole. Most Airflow users responded that they don’t currently use lineage solutions but might be interested in the future if supported by Airflow (47.5%), are not familiar with data lineage (29%), or that data lineage is not their concern (13%).
Which interfaces of Airflow do you use as part of your current role? (multiple choice)
| Interface | No. | % |
|---|---|---|
| Original Airflow Graphical User Interface | 189 | 94% |
| CLI | 98 | 48.8% |
| API | 80 | 39.8% |
| Custom (own created) Airflow Graphical User Interface | 12 | 6% |
| GCP Composer | 1 | 0.5% |
It’s clear that usage of the Airflow web UI is important, as 94% of users who responded to the survey declare using it as part of their current role. Usage of the CLI (48.8%) and the API (39.8%) goes hand in hand but is not as common as Airflow web UI usage.
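For context, a hedged example of what API usage can look like: Airflow 2 ships a stable REST API, and a DAG run can be triggered with a plain HTTP call. The URL, credentials and DAG id below are placeholders, and the deployment must have an auth backend (for example basic auth) enabled.

```python
import requests

AIRFLOW_API = "http://localhost:8080/api/v1"    # placeholder base URL

response = requests.post(
    f"{AIRFLOW_API}/dags/example_dag/dagRuns",  # placeholder dag_id
    auth=("admin", "admin"),                    # placeholder credentials
    json={"conf": {}},
    timeout=30,
)
response.raise_for_status()
print("triggered run:", response.json()["dag_run_id"])
```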
(If GUI Marked) What do you use the GUI for? (multiple choice)
| GUI use | No. | % |
|---|---|---|
| Monitoring Runs | 188 | 95.9% |
| Accessing Task Logs | 176 | 89.8% |
| Manually triggering DAGs | 167 | 85.2% |
| Clearing Tasks | 162 | 82.7% |
| Marking Tasks as successful | 119 | 60.7% |
| Other | 6 | 3% |
The Airflow web UI is used heavily for monitoring, i.e. Monitoring Runs (95.9%), and for troubleshooting: Accessing Task Logs (89.8%), Manually triggering DAGs (85.2%), Clearing Tasks (82.7%) and Marking Tasks as successful (60.7%).
(if CLI Marked) What do you use the CLI For? (multiple choice)
| CLI use | No. | % |
|---|---|---|
| Backfilling | 63 | 56.8% |
| Manually triggering DAGs | 52 | 46.8% |
| Clearing Tasks | 26 | 23.4% |
| Monitoring Runs | 25 | 22.5% |
| Accessing Task Logs | 21 | 18.9% |
| Marking Tasks as successful | 11 | 9.9% |
| Other | 17 | 15.3% |
Compared to Airflow web UI, Airflow CLI is used mainly for Backfilling (56.8%) and Manually triggering DAGs (46.8%).
In Airflow, which UI views are important for you? (multiple choice)
| UI view | No. | % |
|---|---|---|
| List of DAGs | 178 | 89.4% |
| Task Logs | 162 | 81.4% |
| DAG Runs | 160 | 80.4% |
| Graph view | 147 | 73.9% |
| Grid/Tree View | 138 | 69.3% |
| Run Details | 117 | 58.8% |
| DAG details | 111 | 55.8% |
| Task Instances | 102 | 51.3% |
| Task Duration | 91 | 45.7% |
| Code | 90 | 45.2% |
| Task Tries | 60 | 30.2% |
| Gantt | 48 | 21.4% |
| Landing Times | 27 | 13.6% |
| Other | 4 | 2% |
The UI views importance ranking shows that the majority of Airflow users use the web UI mostly for monitoring and/or troubleshooting purposes, with the top 3 views being List of DAGs (89.4%), Task Logs (81.4%) and DAG Runs (80.4%). The results are very similar to those from 2020 and 2019.
Community and contribution
Are you participating in the Airflow community discussions? (single choice)
| Participation | No. | % |
|---|---|---|
| I see them from time to time | 99 | 48.3% |
| I regularly follow what’s being discussed but don’t participate | 53 | 25.9% |
| I didn’t know I could | 41 | 20.0% |
| I actively participate in the discussions | 12 | 5.9% |
| Contribution to Airflow | No. | % |
|---|---|---|
| I know I can but I do not contribute | 116 | 57.1% |
| Very rarely when it relates to what I need | 44 | 21.7% |
| I do not know I could | 30 | 14.8% |
| I regularly contribute by discussing, reviewing and submitting PR | 13 | 6.4% |
Results related to Airflow contribution are very similar to those about participating in the Airflow community discussions. Most of the Airflow users who responded to the survey are aware that they could contribute but do not (57.1%), or contribute very rarely (21.7%). 14.8% of users were not aware they could contribute. Once again, it’s a clear indicator that there is much more to be done to engage our community to be more active contributors and raise the current 6.4% of users who actively contribute.
If you do not contribute - why?
| Reason | No. | % |
|---|---|---|
| I have no time to contribute even if would like to | 65 | 38.9% |
| I don’t know how to start | 63 | 37.7% |
| I don’t have a need to contribute | 19 | 11.4% |
| I didn’t know I could | 12 | 7.2% |
| My employer has policy that makes it difficult to contribute | 8 | 4.8% |
According to the survey results, the most important blocker for contributing to Airflow is limited time (38.9%), but a surprisingly important blocker is also the lack of knowledge on how to start (37.7%), followed by not knowing that contributing is possible (7.2%).
The future of Airflow
In your opinion, what could be improved in Airflow? (multiple choice)
| Area for improvement | No. | % |
|---|---|---|
| Web UI | 100 | 49.5% |
| Logging, monitoring and alerting | 97 | 48.0% |
| Examples, how-to, onboarding documentation | 74 | 36.6% |
| Technical documentation | 74 | 36.6% |
| Scheduler performance | 56 | 27.7% |
| Reliability | 52 | 25.7% |
| DAG authoring | 48 | 23.8% |
| REST API | 43 | 21.3% |
| Authentication and authorization | 41 | 20.3% |
| External integration e.g. AWS, GCP, Apache products | 41 | 20.3% |
| Better support for various deployments (Docker-compose/Nomad/Others) | 39 | 19.3% |
| Everything works fine for me | 19 | 9.4% |
| I don’t know | 4 | 2.0% |
The results are quite self-explanatory. According to the survey results, the top area for improvement is still the Airflow web UI (49.5%), closely followed by more telemetry for logging, monitoring and alerting purposes (48%). However, all those efforts should go hand in hand with improved documentation (36.6%) and resources about using Airflow, especially when we take into account the need to onboard new users (36.6%).
Which features would you like to see in Airflow?
| Feature | No. | % |
|---|---|---|
| DAG Versioning | 129 | 66.2% |
| Dependency management and Data-driven scheduling | 83 | 42.6% |
| More dynamic task structure | 82 | 42.1% |
| Multi-Tenancy | 74 | 37.9% |
| Signal-based scheduling | 67 | 34.4% |
| Better Security (Isolation) | 65 | 33.3% |
| Submitting new DAGs externally via API | 53 | 27.2% |
| Composable Operators | 46 | 23.6% |
| Support for native cloud executors (AWS/GCP/Azure etc.) | 44 | 22.6% |
| Better support for Machine Learning | 38 | 19.5% |
| Remote CLI | 36 | 18.5% |
| Support for hybrid executors | 22 | 11.3% |
According to the survey results, DAG Versioning is the winner for new features in Airflow, and it’s not a surprise, as this feature may positively impact the daily work of Airflow users. It is followed by three other ideas: Dependency management and Data-driven scheduling (42.6%), More dynamic task structure (42.1%) and Multi-Tenancy (37.9%). Another interesting point from this question is that only 11.3% think that support for hybrid executors is needed in Airflow.
Data
If you’re interested in taking a look at the raw data yourself, it’s available here: [Airflow User Survey 2022.csv](/data/survey-responses/airflow-user-survey-responses-2022.csv.zip)
[Less]
|
|
Posted
over 3 years
ago
Airflow User Survey 2022
This year’s survey has come and gone, and with it we’ve got a new batch of data for everyone! We collected 210 responses over two weeks. We continue to see growth in both contributions and downloads
... [More]
over the last two years, and expect that trend will continue through 2022.
The raw response data will be made available here soon, in the meantime, feel free to email [email protected] for a copy.
TL;DR
Overview of the user
Like previous years, more than half of the Airflow users are Data Engineers (54%). Solutions Architects (13%), Developers (12%), DevOps (6%) and Data Scientists (4%) are also active Airflow users! There was a slight increase in the representation of Solutions Architect roles compared to results from 2020 and 2019 .
Airflow is used and popular in bigger companies, 64% of Airflow users work for companies with 200+ employees which is an 11 percent increase compared to 2020.
62% of the survey participants have more than 6 Airflow users in their company.
More Airflow users (65.9%) are willing to recommend Apache Airflow compared to the survey results in 2020 and 2019. There is a general positive trend in a willingness to recommend Airflow, 93% of surveyed Airflow users are willing to recommend Airflow ( 85.7% in 2019 and 92% in 2020 ), only 1% of users are not likely to recommend (3.6% in 2019 and 3.5% in 2020).
Airflow documentation is a critical source of information, with more than 90% (15% increase compared to results from 2020) of survey participants using the documentation. Airflow documentation is also one of the top areas to improve! What’s interesting, also Stack Overflow usage is critical, with about 60% users declaring to use it as a source of information (24% increase compared to results from 2020).
Deployments
85% of the Airflow users have between 1 to 7 active Airflow instances. 62.5% of the Airflow users have between 11 to 250 DAGs in their largest Airflow instance. 75% of the surveyed Airflow users have between 1 to 100 tasks per DAG.
Close to 85% of users use one of the Airflow 2 versions, 9.2% users still use 1.10.15, while the remaining 6.3% are still using olderAirflow 1 versions. The good news is that the majority of users on Airflow 1 are planning migration to Airflow 2 quite soon, with resources and capacity being the main blockers.
In comparison to results from 2020, more users were interested in monitoring in general and specifically in using tools such as external monitoring services (40.7%, up from 29.6%) and information from metabase (35.7%, up from 25.1%).
Celery (52.7%) and Kubernetes (39.4%) are the most common executors used.
Usage
81.3% of Airflow users who responded to the survey don’t have any customisation of Airflow.
Xcom (69.8%) is the most popular method to pass inputs and outputs between tasks, however Saving and Retrieving Inputs and Outputs from Storage still plays an important role (49%).
Lineage itself is a quite new topic for Airflow users, most of them don’t use lineage solutions but might be interested if supported by Airflow (47.5%), are not familiar with data lineage (29%) or that data lineage is not their concern (13%).
The Airflow web UI is used heavily for Monitoring Runs (95.9%), Accessing Task Logs (89.8%), Manually triggering DAGs (85.2%), Clearing Tasks (82.7%) and Marking Tasks as successful (60.7%). The top 3 views used are: List of DAGs, Task Logs and DAG Runs, which is very similar to results from 2020 and 2019.
Community and contribution
Most Airflow users (57.1%) are aware they could contribute but do not, and an additional 21.7% contribute very rarely. 14.8% of users were not aware they could contribute. There is much more to be done to engage our community to be more active contributors and raise the current 6.4% of users who actively contribute, especially considering that one important blocker for contribution is lack of knowledge on how to start (37.7%).
The future of Airflow
The top area for improvement is still the Airflow web UI (49.5%), closely followed by more telemetry for logging, monitoring and alerting purposes (48%). However all those efforts should go in line with improved documentation (36.6.%) and resources about using the Airflow, especially when we take into account the need of onboarding new users (36.6%).
DAG Versioning(66.2%) is a winner for new features in Airflow, and it’s not a surprise as this feature may positively impact daily work of Airflow users. It is followed by three other ideas: Dependency management and Data-driven scheduling (42.6%), More dynamic task structure (42.1%) and Multi-Tenancy (37.9%).
Overview of the user
What best describes your current occupation? (single choice)
No.
%
Data Engineer
114
54%
Solutions Architect
27
13%
Developer
25
12%
DevOps
12
6%
Data Scientist
8
4%
Support Engineer
5
2%
Data Analyst
3
1%
Business Analyst
2
1%
Other
14
7%
According to the survey, more than half of Airflow users are Data Engineers (54%). Roles of the remaining Airflow users might be broken down into Solutions Architects (13%), Developers (12%), DevOps (6%) and Data Scientists (4%). The 2022 results are similar to those from 2019 and 2020 with a slight increase in the representation of Solutions Architect roles.
How often do you interact with Airflow? (single choice)
No.
%
Every day
154
73%
At least once per week
36
17%
At least once per month
11
5%
Less than once per month
9
4%
Users who took the survey are actively using Airflow as part of their current role. 73% of Airflow users who responded use it on a daily basis, 17% weekly.
How many people work at your company? (single choice)
No.
%
201-5000
85
41%
5000+
49
23%
51-200
46
22%
11-50
20
10%
1-10
9
4%
Airflow is a framework that is used and popular in bigger companies, 64% of Airflow users who responded (compared to 52.7% in 2020) work for companies bigger than 200 employees (41% in companies size 201-5000 and 23% in companies size 5000+).
How many people at your company use Airflow? (single choice)
No.
%
6-20
80
38%
1-5
61
29%
51-200
49
24%
200+
18
9%
Airflow is generally used by small to medium-sized teams. 62% of the survey participants have more than 6 Airflow users in their company (38% have between 6 and 200 users, 24% between 51-200 users).
How likely are you to recommend Apache Airflow? (single choice)
% 2019
% 2020
% 2022
Very Likely
45.4%
61.6%
65.9%
Likely
40.3%
30.4%
26.9%
Neutral
10.7%
5.4%
6.3%
Unlikely
2.6%
1.5%
0.5%
Very Unlikely
1%
1%
0.5%
According to the survey, more Airflow users (65.9%) are willing to recommend Apache Airflow compared to the survey results in 2020 and 2019. There is a general positive trend in a willingness to recommend Airflow, 93% of surveyed Airflow users are willing to recommend Airflow (92% in 2020 and 85.7% in 2019), only 1% of users are not likely to recommend (3.6% in 2019 and 3.5% in 2020 ).
What is your source of information about Airflow? (multiple choice)
No.
%
Documentation
189
90.4%
Airflow website (Blog, etc.)
142
67.9%
Stack Overflow
126
60.3%
Github Issues
104
49.8%
Slack
96
45.9%
Airflow Summit Videos
88
42.1%
GitHub Discussions
76
36.4%
Airflow Community Webinars
41
19.6%
Astronomer Registry
51
24.4%
Airflow Mailing List
34
16.3%
Airflow documentation is a critical source of information, with more than 90% of survey participants using the documentation. It is of increasing importance compared to results from 2020 where documentation was at about 75% level. Moreover, more than 60% of users are getting information from the Airflow website (67.9% ) and Stack Overflow (60.3%) which is also a big increase compared to 36% level in 2020. What’s interesting is that Slack usage decreased from 63.05% in 2020 to 45.9% in 2022.
Deployments
How many active DAGs do you have in your largest Airflow instance? (single choice)
No.
%
51-250
66
31.7%
11-50
64
30.8%
5-10
25
12.0%
251-500
20
9.6%
<5
14
6.7%
1000+
10
4.8%
501-1000
9
4.3%
62.5% of the Airflow users surveyed have between 11 to 250 DAGs in their largest Airflow instance.
How many active Airflow instances do you have? (single choice)
No.
%
1
52
25.2%
2
46
22.3%
4-7
40
19.4%
3
37
18.0%
20+
19
9.2%
8-10
7
3.4%
11-20
5
2.4%
85% of the Airflow users surveyed have between 1 and 7 active Airflow instances, and nearly 50% have only 1 or 2.
What is the maximum number of tasks that you have used in a single DAG?(single choice)
No.
%
11-25
51
24.5%
26-50
41
19.7%
51-100
35
16.8%
<10
29
13.9%
101-250
23
11.1%
501-1000
9
4.3%
1000-2500
8
3.8%
251-500
8
3.8%
2500-5000
4
1.9%
75% of the surveyed Airflow users have between 1 and 100 tasks per DAG.
How many schedulers do you have in your largest Airflow instance? (single choice)
No.
%
1
113
55.1%
2
61
29.8%
3
18
8.8%
4+
13
6.3%
More than half of Airflow users who responded to the survey have 1 scheduler in their largest Airflow instance, however it’s important to notice that the second half of Airflow users decided to have 2 schedulers and more.
What executor type do you use? (multiple choice)
No.
%
Celery
107
52.7 %
Kubernetes
80
39.4%
Local
49
24.1%
Sequential
21
10.3%
CeleryKubernetes
14
6.9%
Celery (52.7%) and Kubernetes (39.4%) are the most common executors used. CeleryKubernetes (6.9%) executor also started to be noticed and used by Airflow users.
If you use the Celery executor, how many workers do you have in your largest Airflow instance? (single choice)
No.
%
2-5
64
44.8%
10+
28
19.6%
1
26
18.2%
6-10
25
17.5%
Amongst Celery executor users who responded to the survey, close to half the number (44.8%) have between 2 to 5 workers in their largest Airflow instance. It’s notable that nearly a fifth (19.6%) have more than 10 workers.
Which version of Airflow do you currently use? (single choice)
No.
%
1.10.14 or older
13
6.3%
1.10.15
19
9.2%
2.0.x
23
11.1%
2.1.x
24
11.6%
2.2.x
79
38.2%
2.3.x
49
23.7%
It’s good to see that close to 85% of users who responded to the survey use one of the Airflow 2 versions, 9.2% users still use 1.10.15, while the remaining 6.3% are still using older Airflow 1.10 versions.
The good news is that the majority of users on Airflow 1 are planning migration to Airflow 2 quite soon, as for now they have capacity constraints to undertake such a significant effort in their opinion. However, it can also be noticed in the survey’s comments that some users are generally skeptical towards migration to Airflow 2, they have negative opinions about the new scheduler or compatibility with the helm chart.
As to plans about migration to the newest version of Airflow 2, users who responded to the survey are committed and waiting especially for the features related to dynamic DAGs. However, some users also reported that they are waiting to solve some dependencies they have or they prefer to wait a little bit more for the community to test the new version before they decide to move on.
What metrics do you use to monitor Airflow? (multiple choice)
No.
%
External monitoring service
81
40.7%
Information from metadatabase
71
35.7%
Statsd
54
27.1%
I do not use monitoring
47
23.6%
Other
14
7%
In comparison to results from 2020, more users are monitoring airflow in some way. External monitoring services (40.7%) and information from metabase (35.7%) started to play a more important role in Airflow monitoring.
How do you deploy Airflow? (multiple choice)
No.
%
On virtual machines (for example using AWS EC2)
63
30.6 %
Using a managed service like Astronomer, Google Composer or AWS MWAA
54
26.2 %
On Kubernetes (using Apache Airflow’s helm chart)
46
22.3%
On premises
43
20.9%
On Kubernetes (using custom deployments)
39
18.9%
On Kubernetes (using another helm chart)
21
10.2%
Other
13
6.5%
More than half of Airflow users who responded (51.4%) deploy Airflow on Kubernetes. This is about 20 percent more than in 2020. The remaining top deployment methods are on virtual machines (30.6%) and via managed services (26.2%).
How do you distribute your DAGs from your developer environment to the cloud? (single choice)
No.
%
Using a synchronizing process (Git sync, GCS fuse, etc)
100
49%
Bake them into the docker image
51
25%
Shared files system
30
14.7%
Other
16
7.9%
I don’t know
7
3.4%
According to the survey responses, the most popular way of distributing DAGs is a synchronizing process, about half of Airflow users (49%) use this process to distribute DAGs from developer environments to the cloud.
Usage
Do you have any customisation of Airflow? (single choice)
No.
%
No, we use vanilla airflow
165
81.3%
Yes, we have a separate fork
13
6.4%
Yes, we use a 3rd-party fork
12
5.9%
Yes, we’ve backpropagated bug fixes to an older version
13
6.4%
More Airflow users (81.3%) don’t have any customisation of Airflow (compared to 75.9% in 2020). Those Airflow users who have customisations (18.7%) decided to introduce them mainly to separate development and production workflows, to backport bug fixes, due to security fixes or to run a backfill command on Kubernetes pod.
Which Metadata Database do you use? (single choice)
No.
%I
PostgreSQL 13
86
43.9%
PostgreSQL 12
74
37.8%
MySQL 8
22
11.2%
MySQL 5
9
4.6%
MariaDB
4
2.0%
MsSQL
1
0.5%
According to the survey responses, the most popular metadata databases are PostgreSQL 13 (43.9%) and PostgreSQL 12 (37.8%). This represents a sharp increase from 2020, up from 68.9% to 81.7% total on PostgreSQL, with a corresponding decrease in MySQL, down from 23% to 15%. This is an interesting result taking into account community discussion about not adding support for more database backend or even deciding on single database support.
What’s the primary method by which you integrate with providers and external services in your Airflow DAGs? (single choice)
No.
%
Using existing dedicated operators / hooks
70
34.5%
Using Bash/Python operators
58
28.6%
Using custom operators / hooks
50
24.6%
Using KubernetesPodOperator
25
12.3%
According to the survey responses, the following ways of using Airflow to connect to external services are the most popular: Using existing dedicated operators / hooks (34.5%), Using Bash/Python operators (28.6%), Using custom operators / hooks (24.6%). Using KubernetesPodOperator (12.3%) is less popular regarding the survey responses. The integration with providers and external services methods ranking is similar to the one from 2020.
What providers do you use in your Airflow DAGs? (multiple choice)
No.
%
Amazon Web Services
112
55.4%
Google Cloud Platform / Google APIs
79
39.1%
Internal company systems
75
37.1%
Hadoop / Spark / Flink / Other Apache software
57
28.2%
Microsoft Azure
17
8.4%
Other
21
10.5%
I do not use external services in my Airflow DAGs
14
6.9%
It’s not surprising that Amazon Web Services (55.4% vs 59.6% in 2020), on the next three positions Google Cloud Platform (39.1% vs 47.7% in 2020 ), Internal company systems (37.1% vs 55.6% in 2020), and other Apache products (28.2% vs 35.47% in 2020) are leading Airflow providers.
How frequently do you upgrade Airflow environments? (single choice)
No.
%
every 12 months
46
22.9%
every 6 months
49
24.4%
once a quarter
47
23.4%
Whenever there is a newer version
59
29.4%
Different frequencies of Airflow environments upgrades are almost equally popular amongst Airflow users who responded to the survey.
Do you upgrade providers separately from the core? (single choice)
No.
%
When I need it
83
42.8%
Never - always use the providers that come with Airflow
68
35.1%
I did not know I can upgrade providers separately
32
16.5%
I upgrade providers when they are released
11
5.7%
According to the survey responses, Airflow users most often upgrade providers when they need it (42.8%) or prefer to stay with providers that come with Airflow (35.1%). It’s surprising that 16.5% of Airflow users who responded to the survey were not aware that they can upgrade their providers separately from the core Airflow.
How do you pass inputs and outputs between tasks? (multiple choice)
No.
%
Xcom
141
69.8%
Saving and retrieving from Storage
99
49%
TaskFlow
37
18.3%
Other
5
2.5%
We don’t
29
14.4%
According to the survey responses, Xcom (69.8%) is the most popular method to pass inputs and outputs between tasks, however Saving and Retrieving Inputs and Outputs from Storage still plays an important role (49%). It’s interesting that close to 15% of Airflow users who responded to the survey declare to not pass any outputs or inputs between tasks.
Do you use a data lineage backend? (multiple choice)
No.
%
No, but I will use such feature if fully supported in Airflow
95
47.5%
I’m not familiar with data lineage
58
29%
No, data lineage isn’t a concern for my usage
26
13%
Yes, I send lineage to an Open Source lineage repository
15
7.5%
Yes, I send lineage to an Enterprise lineage repository
7
3.5%
Yes, I send lineage to a custom internal lineage repository
9
4.5%
When asked what lineage backend Airflow users use, the answers indicated that, while lineage itself is a quite new topic, there is interest in the feature as a whole. Most Airflow users responded that they don’t use lineage solutions currently but might be interested in the future if supported by Airflow (47.5%), are not familiar with data lineage (29%) or that data lineage is not their concern (13%).
Which interfaces of Airflow do you use as part of your current role? (multiple choice)
No.
%
Original Airflow Graphical User Interface
189
94%
CLI
98
48.8%
API
80
39.8%
Custom (own created) Airflow Graphical User Interface
12
6%
GCP Composer
1
0.5%
It’s clear that usage of Airflow web UI is important as 94% of users who responded to the survey declare to use it as a part of their current role. Usage of CLI (48.8%) and API (39.8%) goes in pairs but are not so common compared to Airflow web UI usage.
(If GUI Marked) What do you use the GUI for? (multiple choice)
No.
%
Monitoring Runs
188
95.9%
Accessing Task Logs
176
89.8%
Manually triggering DAGs
167
85.2%
Clearing Tasks
162
82.7%
Marking Tasks as successful
119
60.7%
Other
6
3%
Airflow web UI is used heavily for monitoring: Monitoring Runs (95.9%) and troubleshooting: Accessing Task Logs (89.8%), Manually triggering DAGs (85.2%), Clearing Tasks (82.7%) and Marking Tasks as successful (60.7%).
(if CLI Marked) What do you use the CLI For? (multiple choice)
No.
%
Backfilling
63
56.8%
Manually triggering DAGs
52
46.8%
Clearing Tasks
26
23.4%
Monitoring Runs
25
22.5%
Accessing Task Logs
21
18.9%
Marking Tasks as successful
11
9.9%
Other
17
15.3%
Compared to Airflow web UI, Airflow CLI is used mainly for Backfilling (56.8%) and Manually triggering DAGs (46.8%).
In Airflow, which UI views are important for you? (multiple choice)
No.
%
List of DAGs
178
89.4%
Task Logs
162
81.4%
DAG Runs
160
80.4%
Graph view
147
73.9%
Grid/Tree View
138
69.3%
Run Details
117
58.8%
DAG details
111
55.8%
Task Instances
102
51.3%
Task Duration
91
45.7%
Code
90
45.2%
Task Tries
60
30.2%
Gantt
48
21.4%
Landing Times
27
13.6%
Other
4
2%
UI views importance ranking shows that the majority Airflow users use Web UI mostly for monitoring and/or troubleshooting purposes, where the top 3 views are List of DAGs (89.4%), Task Logs (81.4%) and DAG Runs (80.4%). The results are very similar to those from 2020 and 2019.
Community and contribution
Are you participating in the Airflow community discussions? (single choice)
No.
%
I see them from time to time
99
48.3%
I regularly follow what’s being discussed but don’t participate
53
25.9%
I didn’t know I could
41
20.0%
I actively participate in the discussions
12
5.9%
No.
%
I know I can but I do not contribute
116
57.1%
Very rarely when it relates to what I need
44
21.7%
I do not know I could
30
14.8%
I regularly contribute by discussing, reviewing and submitting PR
13
6.4%
Results related to the Airflow contribution are very similar to those about participating in the Airflow community discussions. Most of the Airflow users (57.1%) who responded to the survey are aware but do not contribute or contribute very rarely (21.7%). 14.8% of users were not aware they could contribute. Once again, it’s a clear indicator that there is much more to be done to engage our community to be more active contributors and raise the current 6.4% of users who actively contribute.
If you do not contribute - why?
No.
%
I have no time to contribute even if would like to
65
38.9%
I don’t know how to start
63
37.7%
I don’t have a need to contribute
19
11.4%
I didn’t know I could
12
7.2%
My employer has policy that makes it difficult to contribute
8
4.8%
According to the survey results, the most important blocker for the Airflow contribution is limited time (38.9%), but surprisingly interesting and important blocker is also lack of knowledge on how to start (37.7%), followed by lack of knowledge that it’s possible to contribute (7.2%).
The future of Airflow
In your opinion, what could be improved in Airflow? (multiple choice)
No.
%
Web UI
100
49.5%
Logging, monitoring and alerting
97
48.0%
Examples, how-to, onboarding documentation
74
36.6%
Technical documentation
74
36.6%
Scheduler performance
56
27.7%
Reliability
52
25.7%
DAG authoring
48
23.8%
REST API
43
21.3%
Authentication and authorization
41
20.3%
External integration e.g. AWS, GCP, Apache products
41
20.3%
Better support for various deployments (Docker-compose/Nomad/Others)
39
19.3%
Everything works fine for me
19
9.4%
I don’t know
4
2.0%
The results are quite self-explanatory. According to the survey results, the top area for improvement is still the Airflow web UI (49.5%), closely followed by more telemetry for logging, monitoring and alerting purposes (48%). However all those efforts should go in line with improved documentation (36.6.%) and resources about using the Airflow, especially when we take into account the need of onboarding new users (36.6%).
Which features would you like to see in Airflow?
No.
%
DAG Versioning
129
66.2%
Dependency management and Data-driven scheduling
83
42.6%
More dynamic task structure
82
42.1%
Multi-Tenancy
74
37.9%
Signal-based scheduling
67
34.4%
Better Security (Isolation)
65
33.3%
Submitting new DAGs externally via API
53
27.2%
Composable Operators
46
23.6%
Support for native cloud executors (AWS/GCP/Azure etc.)
44
22.6%
Better support for Machine Learning
38
19.5%
Remote CLI
36
18.5%
Support for hybrid executors
22
11.3%
According to the survey results, DAG Versioning is a winner for new features in Airflow, and it’s not a surprise as this feature may positively impact daily work of Airflow users. It is followed by three other ideas: Dependency management and Data-driven scheduling (42.6%), More dynamic task structure (42.1%) and Multi-Tenancy (37.9%). Another interesting point from that question is that only 11.3% think that support for hybrid executors is needed in Airflow.
Data
If you’re interested in taking a look at the raw data yourself, it’s available here: (Airflow User Survey 2022.csv)[/data/survey-responses/airflow-user-survey-responses-2022.csv.zip]
[Less]
|
|
Posted
over 3 years
ago
Airflow User Survey 2022
This year’s survey has come and gone, and with it we’ve got a new batch of data for everyone! We collected 210 responses over two weeks. We continue to see growth in both contributions and downloads
... [More]
over the last two years, and expect that trend will continue through 2022.
The raw response data will be made available here soon, in the meantime, feel free to email [email protected] for a copy.
TL;DR
Overview of the user
Like previous years, more than half of the Airflow users are Data Engineers (54%). Solutions Architects (13%), Developers (12%), DevOps (6%) and Data Scientists (4%) are also active Airflow users! There was a slight increase in the representation of Solutions Architect roles compared to results from 2020 and 2019 .
Airflow is used and popular in bigger companies, 64% of Airflow users work for companies with 200+ employees which is an 11 percent increase compared to 2020.
62% of the survey participants have more than 6 Airflow users in their company.
More Airflow users (65.9%) are willing to recommend Apache Airflow compared to the survey results in 2020 and 2019. There is a general positive trend in a willingness to recommend Airflow, 93% of surveyed Airflow users are willing to recommend Airflow ( 85.7% in 2019 and 92% in 2020 ), only 1% of users are not likely to recommend (3.6% in 2019 and 3.5% in 2020).
Airflow documentation is a critical source of information, with more than 90% (15% increase compared to results from 2020) of survey participants using the documentation. Airflow documentation is also one of the top areas to improve! What’s interesting, also Stack Overflow usage is critical, with about 60% users declaring to use it as a source of information (24% increase compared to results from 2020).
Deployments
85% of the Airflow users have between 1 to 7 active Airflow instances. 62.5% of the Airflow users have between 11 to 250 DAGs in their largest Airflow instance. 75% of the surveyed Airflow users have between 1 to 100 tasks per DAG.
Close to 85% of users use one of the Airflow 2 versions, 9.2% users still use 1.10.15, while the remaining 6.3% are still using olderAirflow 1 versions. The good news is that the majority of users on Airflow 1 are planning migration to Airflow 2 quite soon, with resources and capacity being the main blockers.
In comparison to results from 2020, more users were interested in monitoring in general and specifically in using tools such as external monitoring services (40.7%, up from 29.6%) and information from metabase (35.7%, up from 25.1%).
Celery (52.7%) and Kubernetes (39.4%) are the most common executors used.
Usage
81.3% of Airflow users who responded to the survey don’t have any customisation of Airflow.
Xcom (69.8%) is the most popular method to pass inputs and outputs between tasks, however Saving and Retrieving Inputs and Outputs from Storage still plays an important role (49%).
Lineage itself is a quite new topic for Airflow users, most of them don’t use lineage solutions but might be interested if supported by Airflow (47.5%), are not familiar with data lineage (29%) or that data lineage is not their concern (13%).
The Airflow web UI is used heavily for Monitoring Runs (95.9%), Accessing Task Logs (89.8%), Manually triggering DAGs (85.2%), Clearing Tasks (82.7%) and Marking Tasks as successful (60.7%). The top 3 views used are: List of DAGs, Task Logs and DAG Runs, which is very similar to results from 2020 and 2019.
Community and contribution
Most Airflow users (57.1%) are aware they could contribute but do not, and an additional 21.7% contribute very rarely. 14.8% of users were not aware they could contribute. There is much more to be done to engage our community to be more active contributors and raise the current 6.4% of users who actively contribute, especially considering that one important blocker for contribution is lack of knowledge on how to start (37.7%).
The future of Airflow
The top area for improvement is still the Airflow web UI (49.5%), closely followed by more telemetry for logging, monitoring and alerting purposes (48%). However all those efforts should go in line with improved documentation (36.6.%) and resources about using the Airflow, especially when we take into account the need of onboarding new users (36.6%).
DAG Versioning(66.2%) is a winner for new features in Airflow, and it’s not a surprise as this feature may positively impact daily work of Airflow users. It is followed by three other ideas: Dependency management and Data-driven scheduling (42.6%), More dynamic task structure (42.1%) and Multi-Tenancy (37.9%).
Overview of the user
What best describes your current occupation? (single choice)
No.
%
Data Engineer
114
54%
Solutions Architect
27
13%
Developer
25
12%
DevOps
12
6%
Data Scientist
8
4%
Support Engineer
5
2%
Data Analyst
3
1%
Business Analyst
2
1%
Other
14
7%
According to the survey, more than half of Airflow users are Data Engineers (54%). Roles of the remaining Airflow users might be broken down into Solutions Architects (13%), Developers (12%), DevOps (6%) and Data Scientists (4%). The 2022 results are similar to those from 2019 and 2020 with a slight increase in the representation of Solutions Architect roles.
How often do you interact with Airflow? (single choice)
No.
%
Every day
154
73%
At least once per week
36
17%
At least once per month
11
5%
Less than once per month
9
4%
Users who took the survey are actively using Airflow as part of their current role. 73% of Airflow users who responded use it on a daily basis, 17% weekly.
How many people work at your company? (single choice)
No.
%
201-5000
85
41%
5000+
49
23%
51-200
46
22%
11-50
20
10%
1-10
9
4%
Airflow is a framework that is used and popular in bigger companies, 64% of Airflow users who responded (compared to 52.7% in 2020) work for companies bigger than 200 employees (41% in companies size 201-5000 and 23% in companies size 5000+).
How many people at your company use Airflow? (single choice)
No.
%
6-20
80
38%
1-5
61
29%
51-200
49
24%
200+
18
9%
Airflow is generally used by small to medium-sized teams. 62% of the survey participants have more than 6 Airflow users in their company (38% have between 6 and 200 users, 24% between 51-200 users).
How likely are you to recommend Apache Airflow? (single choice)
% 2019
% 2020
% 2022
Very Likely
45.4%
61.6%
65.9%
Likely
40.3%
30.4%
26.9%
Neutral
10.7%
5.4%
6.3%
Unlikely
2.6%
1.5%
0.5%
Very Unlikely
1%
1%
0.5%
According to the survey, more Airflow users (65.9%) are willing to recommend Apache Airflow compared to the survey results in 2020 and 2019. There is a general positive trend in a willingness to recommend Airflow, 93% of surveyed Airflow users are willing to recommend Airflow (92% in 2020 and 85.7% in 2019), only 1% of users are not likely to recommend (3.6% in 2019 and 3.5% in 2020 ).
What is your source of information about Airflow? (multiple choice)
No.
%
Documentation
189
90.4%
Airflow website (Blog, etc.)
142
67.9%
Stack Overflow
126
60.3%
Github Issues
104
49.8%
Slack
96
45.9%
Airflow Summit Videos
88
42.1%
GitHub Discussions
76
36.4%
Airflow Community Webinars
41
19.6%
Astronomer Registry
51
24.4%
Airflow Mailing List
34
16.3%
Airflow documentation is a critical source of information, with more than 90% of survey participants using the documentation. It is of increasing importance compared to results from 2020 where documentation was at about 75% level. Moreover, more than 60% of users are getting information from the Airflow website (67.9% ) and Stack Overflow (60.3%) which is also a big increase compared to 36% level in 2020. What’s interesting is that Slack usage decreased from 63.05% in 2020 to 45.9% in 2022.
Deployments
How many active DAGs do you have in your largest Airflow instance? (single choice)
No.
%
51-250
66
31.7%
11-50
64
30.8%
5-10
25
12.0%
251-500
20
9.6%
<5
14
6.7%
1000+
10
4.8%
501-1000
9
4.3%
62.5% of the Airflow users surveyed have between 11 to 250 DAGs in their largest Airflow instance.
How many active Airflow instances do you have? (single choice)
No.
%
1
52
25.2%
2
46
22.3%
4-7
40
19.4%
3
37
18.0%
20+
19
9.2%
8-10
7
3.4%
11-20
5
2.4%
85% of the Airflow users surveyed have between 1 and 7 active Airflow instances, and nearly 50% have only 1 or 2.
What is the maximum number of tasks that you have used in a single DAG?(single choice)
No.
%
11-25
51
24.5%
26-50
41
19.7%
51-100
35
16.8%
<10
29
13.9%
101-250
23
11.1%
501-1000
9
4.3%
1000-2500
8
3.8%
251-500
8
3.8%
2500-5000
4
1.9%
75% of the surveyed Airflow users have between 1 and 100 tasks per DAG.
How many schedulers do you have in your largest Airflow instance? (single choice)
No.
%
1
113
55.1%
2
61
29.8%
3
18
8.8%
4+
13
6.3%
More than half of Airflow users who responded to the survey have 1 scheduler in their largest Airflow instance, however it’s important to notice that the second half of Airflow users decided to have 2 schedulers and more.
What executor type do you use? (multiple choice)
No.
%
Celery
107
52.7 %
Kubernetes
80
39.4%
Local
49
24.1%
Sequential
21
10.3%
CeleryKubernetes
14
6.9%
Celery (52.7%) and Kubernetes (39.4%) are the most common executors used. CeleryKubernetes (6.9%) executor also started to be noticed and used by Airflow users.
If you use the Celery executor, how many workers do you have in your largest Airflow instance? (single choice)
No.
%
2-5
64
44.8%
10+
28
19.6%
1
26
18.2%
6-10
25
17.5%
Amongst Celery executor users who responded to the survey, close to half the number (44.8%) have between 2 to 5 workers in their largest Airflow instance. It’s notable that nearly a fifth (19.6%) have more than 10 workers.
Which version of Airflow do you currently use? (single choice)
No.
%
1.10.14 or older
13
6.3%
1.10.15
19
9.2%
2.0.x
23
11.1%
2.1.x
24
11.6%
2.2.x
79
38.2%
2.3.x
49
23.7%
It’s good to see that close to 85% of users who responded to the survey use one of the Airflow 2 versions, 9.2% users still use 1.10.15, while the remaining 6.3% are still using older Airflow 1.10 versions.
The good news is that the majority of users on Airflow 1 are planning migration to Airflow 2 quite soon, as for now they have capacity constraints to undertake such a significant effort in their opinion. However, it can also be noticed in the survey’s comments that some users are generally skeptical towards migration to Airflow 2, they have negative opinions about the new scheduler or compatibility with the helm chart.
As to plans about migration to the newest version of Airflow 2, users who responded to the survey are committed and waiting especially for the features related to dynamic DAGs. However, some users also reported that they are waiting to solve some dependencies they have or they prefer to wait a little bit more for the community to test the new version before they decide to move on.
What metrics do you use to monitor Airflow? (multiple choice)
No.
%
External monitoring service
81
40.7%
Information from metadatabase
71
35.7%
Statsd
54
27.1%
I do not use monitoring
47
23.6%
Other
14
7%
In comparison to results from 2020, more users are monitoring airflow in some way. External monitoring services (40.7%) and information from metabase (35.7%) started to play a more important role in Airflow monitoring.
How do you deploy Airflow? (multiple choice)
On virtual machines (for example using AWS EC2): 63 (30.6%)
Using a managed service like Astronomer, Google Composer or AWS MWAA: 54 (26.2%)
On Kubernetes (using Apache Airflow’s helm chart): 46 (22.3%)
On premises: 43 (20.9%)
On Kubernetes (using custom deployments): 39 (18.9%)
On Kubernetes (using another helm chart): 21 (10.2%)
Other: 13 (6.5%)
More than half of Airflow users who responded (51.4%) deploy Airflow on Kubernetes. This is about 20 percent more than in 2020. The remaining top deployment methods are on virtual machines (30.6%) and via managed services (26.2%).
How do you distribute your DAGs from your developer environment to the cloud? (single choice)
Using a synchronizing process (Git sync, GCS fuse, etc): 100 (49%)
Bake them into the docker image: 51 (25%)
Shared files system: 30 (14.7%)
Other: 16 (7.9%)
I don’t know: 7 (3.4%)
According to the survey responses, the most popular way of distributing DAGs is a synchronizing process, about half of Airflow users (49%) use this process to distribute DAGs from developer environments to the cloud.
Usage
Do you have any customisation of Airflow? (single choice)
No, we use vanilla airflow: 165 (81.3%)
Yes, we have a separate fork: 13 (6.4%)
Yes, we use a 3rd-party fork: 12 (5.9%)
Yes, we’ve backpropagated bug fixes to an older version: 13 (6.4%)
More Airflow users (81.3%) run Airflow without any customisation (compared to 75.9% in 2020). Those who do customise Airflow (18.7%) introduced their changes mainly to separate development and production workflows, to backport bug fixes, to apply security fixes, or to run a backfill command on a Kubernetes pod.
Which Metadata Database do you use? (single choice)
PostgreSQL 13: 86 (43.9%)
PostgreSQL 12: 74 (37.8%)
MySQL 8: 22 (11.2%)
MySQL 5: 9 (4.6%)
MariaDB: 4 (2.0%)
MsSQL: 1 (0.5%)
According to the survey responses, the most popular metadata databases are PostgreSQL 13 (43.9%) and PostgreSQL 12 (37.8%). This represents a sharp increase from 2020, up from 68.9% to 81.7% total on PostgreSQL, with a corresponding decrease in MySQL, down from 23% to 15%. This is an interesting result given the ongoing community discussion about not adding support for more database backends, or even standardizing on a single supported database.
What’s the primary method by which you integrate with providers and external services in your Airflow DAGs? (single choice)
Using existing dedicated operators / hooks: 70 (34.5%)
Using Bash/Python operators: 58 (28.6%)
Using custom operators / hooks: 50 (24.6%)
Using KubernetesPodOperator: 25 (12.3%)
According to the survey responses, the most popular ways of connecting Airflow to external services are: using existing dedicated operators / hooks (34.5%), using Bash/Python operators (28.6%), and using custom operators / hooks (24.6%). Using KubernetesPodOperator (12.3%) is less common. The ranking of integration methods is similar to the one from 2020.
What providers do you use in your Airflow DAGs? (multiple choice)
Amazon Web Services: 112 (55.4%)
Google Cloud Platform / Google APIs: 79 (39.1%)
Internal company systems: 75 (37.1%)
Hadoop / Spark / Flink / Other Apache software: 57 (28.2%)
Microsoft Azure: 17 (8.4%)
Other: 21 (10.5%)
I do not use external services in my Airflow DAGs: 14 (6.9%)
It's not surprising that Amazon Web Services (55.4% vs 59.6% in 2020) leads among providers, followed by Google Cloud Platform (39.1% vs 47.7% in 2020), internal company systems (37.1% vs 55.6% in 2020), and other Apache products (28.2% vs 35.47% in 2020).
How frequently do you upgrade Airflow environments? (single choice)
Every 12 months: 46 (22.9%)
Every 6 months: 49 (24.4%)
Once a quarter: 47 (23.4%)
Whenever there is a newer version: 59 (29.4%)
The different frequencies of Airflow environment upgrades are almost equally popular amongst Airflow users who responded to the survey.
Do you upgrade providers separately from the core? (single choice)
When I need it: 83 (42.8%)
Never - always use the providers that come with Airflow: 68 (35.1%)
I did not know I can upgrade providers separately: 32 (16.5%)
I upgrade providers when they are released: 11 (5.7%)
According to the survey responses, Airflow users most often upgrade providers when they need it (42.8%) or prefer to stay with providers that come with Airflow (35.1%). It’s surprising that 16.5% of Airflow users who responded to the survey were not aware that they can upgrade their providers separately from the core Airflow.
How do you pass inputs and outputs between tasks? (multiple choice)
Xcom: 141 (69.8%)
Saving and retrieving from Storage: 99 (49%)
TaskFlow: 37 (18.3%)
Other: 5 (2.5%)
We don’t: 29 (14.4%)
According to the survey responses, Xcom (69.8%) is the most popular method to pass inputs and outputs between tasks, however saving and retrieving inputs and outputs from storage still plays an important role (49%). It's interesting that close to 15% of Airflow users who responded to the survey report that they do not pass any inputs or outputs between tasks.
Do you use a data lineage backend? (multiple choice)
No, but I will use such feature if fully supported in Airflow: 95 (47.5%)
I’m not familiar with data lineage: 58 (29%)
No, data lineage isn’t a concern for my usage: 26 (13%)
Yes, I send lineage to an Open Source lineage repository: 15 (7.5%)
Yes, I send lineage to an Enterprise lineage repository: 7 (3.5%)
Yes, I send lineage to a custom internal lineage repository: 9 (4.5%)
When asked what lineage backend Airflow users use, the answers indicated that, while lineage is still quite a new topic, there is interest in the feature as a whole. Most Airflow users responded that they don't use lineage solutions currently but might be interested in the future if supported by Airflow (47.5%), are not familiar with data lineage (29%), or that data lineage is not their concern (13%).
Which interfaces of Airflow do you use as part of your current role? (multiple choice)
Original Airflow Graphical User Interface: 189 (94%)
CLI: 98 (48.8%)
API: 80 (39.8%)
Custom (own created) Airflow Graphical User Interface: 12 (6%)
GCP Composer: 1 (0.5%)
It's clear that the Airflow web UI is important, as 94% of users who responded to the survey declare that they use it as part of their current role. Usage of the CLI (48.8%) and the API (39.8%) roughly go together, but neither is as common as the web UI.
(If GUI Marked) What do you use the GUI for? (multiple choice)
Monitoring Runs: 188 (95.9%)
Accessing Task Logs: 176 (89.8%)
Manually triggering DAGs: 167 (85.2%)
Clearing Tasks: 162 (82.7%)
Marking Tasks as successful: 119 (60.7%)
Other: 6 (3%)
Airflow web UI is used heavily for monitoring: Monitoring Runs (95.9%) and troubleshooting: Accessing Task Logs (89.8%), Manually triggering DAGs (85.2%), Clearing Tasks (82.7%) and Marking Tasks as successful (60.7%).
(if CLI Marked) What do you use the CLI For? (multiple choice)
Backfilling: 63 (56.8%)
Manually triggering DAGs: 52 (46.8%)
Clearing Tasks: 26 (23.4%)
Monitoring Runs: 25 (22.5%)
Accessing Task Logs: 21 (18.9%)
Marking Tasks as successful: 11 (9.9%)
Other: 17 (15.3%)
Compared to Airflow web UI, Airflow CLI is used mainly for Backfilling (56.8%) and Manually triggering DAGs (46.8%).
In Airflow, which UI views are important for you? (multiple choice)
List of DAGs: 178 (89.4%)
Task Logs: 162 (81.4%)
DAG Runs: 160 (80.4%)
Graph view: 147 (73.9%)
Grid/Tree View: 138 (69.3%)
Run Details: 117 (58.8%)
DAG details: 111 (55.8%)
Task Instances: 102 (51.3%)
Task Duration: 91 (45.7%)
Code: 90 (45.2%)
Task Tries: 60 (30.2%)
Gantt: 48 (21.4%)
Landing Times: 27 (13.6%)
Other: 4 (2%)
The UI views importance ranking shows that the majority of Airflow users use the web UI mostly for monitoring and/or troubleshooting purposes, with the top 3 views being List of DAGs (89.4%), Task Logs (81.4%) and DAG Runs (80.4%). The results are very similar to those from 2020 and 2019.
Community and contribution
Are you participating in the Airflow community discussions? (single choice)
I see them from time to time: 99 (48.3%)
I regularly follow what’s being discussed but don’t participate: 53 (25.9%)
I didn’t know I could: 41 (20.0%)
I actively participate in the discussions: 12 (5.9%)
Do you contribute to Apache Airflow? (single choice)
I know I can but I do not contribute: 116 (57.1%)
Very rarely when it relates to what I need: 44 (21.7%)
I do not know I could: 30 (14.8%)
I regularly contribute by discussing, reviewing and submitting PR: 13 (6.4%)
Results related to the Airflow contribution are very similar to those about participating in the Airflow community discussions. Most of the Airflow users (57.1%) who responded to the survey are aware but do not contribute or contribute very rarely (21.7%). 14.8% of users were not aware they could contribute. Once again, it’s a clear indicator that there is much more to be done to engage our community to be more active contributors and raise the current 6.4% of users who actively contribute.
If you do not contribute - why?
I have no time to contribute even if would like to: 65 (38.9%)
I don’t know how to start: 63 (37.7%)
I don’t have a need to contribute: 19 (11.4%)
I didn’t know I could: 12 (7.2%)
My employer has policy that makes it difficult to contribute: 8 (4.8%)
According to the survey results, the most important blocker for contributing to Airflow is limited time (38.9%), but a surprisingly important blocker is also the lack of knowledge on how to start (37.7%), followed by not knowing that contributing is possible at all (7.2%).
The future of Airflow
In your opinion, what could be improved in Airflow? (multiple choice)
Web UI: 100 (49.5%)
Logging, monitoring and alerting: 97 (48.0%)
Examples, how-to, onboarding documentation: 74 (36.6%)
Technical documentation: 74 (36.6%)
Scheduler performance: 56 (27.7%)
Reliability: 52 (25.7%)
DAG authoring: 48 (23.8%)
REST API: 43 (21.3%)
Authentication and authorization: 41 (20.3%)
External integration e.g. AWS, GCP, Apache products: 41 (20.3%)
Better support for various deployments (Docker-compose/Nomad/Others): 39 (19.3%)
Everything works fine for me: 19 (9.4%)
I don’t know: 4 (2.0%)
The results are quite self-explanatory. According to the survey results, the top area for improvement is still the Airflow web UI (49.5%), closely followed by more telemetry for logging, monitoring and alerting purposes (48%). However, all those efforts should go hand in hand with improved technical documentation (36.6%) and resources about using Airflow, especially when we take into account the need to onboard new users (36.6%).
Which features would you like to see in Airflow?
DAG Versioning: 129 (66.2%)
Dependency management and Data-driven scheduling: 83 (42.6%)
More dynamic task structure: 82 (42.1%)
Multi-Tenancy: 74 (37.9%)
Signal-based scheduling: 67 (34.4%)
Better Security (Isolation): 65 (33.3%)
Submitting new DAGs externally via API: 53 (27.2%)
Composable Operators: 46 (23.6%)
Support for native cloud executors (AWS/GCP/Azure etc.): 44 (22.6%)
Better support for Machine Learning: 38 (19.5%)
Remote CLI: 36 (18.5%)
Support for hybrid executors: 22 (11.3%)
According to the survey results, DAG Versioning is a winner for new features in Airflow, and it’s not a surprise as this feature may positively impact daily work of Airflow users. It is followed by three other ideas: Dependency management and Data-driven scheduling (42.6%), More dynamic task structure (42.1%) and Multi-Tenancy (37.9%). Another interesting point from that question is that only 11.3% think that support for hybrid executors is needed in Airflow.
Data
If you’re interested in taking a look at the raw data yourself, it’s available here: Airflow User Survey 2022.csv (/data/survey-responses/airflow-user-survey-responses-2022.csv.zip)
|
|
Posted
over 3 years
ago
The biggest Airflow Event of the Year returns May 23–27! Airflow Summit 2022 will bring together the global
community of Apache Airflow practitioners and data leaders.
What’s on the Agenda
During the free conference, you will hear about Apache Airflow best practices, trends in building data
pipelines, data governance, Airflow and machine learning, and the future of Airflow. There will also be
a series of presentations on non-code contributions driving the open-source project.
How to Attend
This year’s edition will include a variety of online sessions across different time zones.
Additionally, you can take part in local in-person events organized worldwide for data
communities to watch the event and network.
Interested?
🪶 Register for Airflow Summit 2022 today
🤝 Check out the in-person events planned for Airflow Summit 2022.
|
|
Posted
over 3 years
ago
Apache Airflow 2.3.0 contains over 700 commits since 2.2.0 and includes 50 new features, 99 improvements, 85 bug fixes, and several doc changes.
Details:
📦 PyPI: https://pypi.org/project/apache-airflow/2.3.0/
📚 Docs: https://airflow.apache.org/docs/apache-airflow/2.3.0/
🛠️ Release Notes: https://airflow.apache.org/docs/apache-airflow/2.3.0/release_notes.html
🐳 Docker Image: docker pull apache/airflow:2.3.0
🚏 Constraints: https://github.com/apache/airflow/tree/constraints-2.3.0
As the changelog is quite large, the following are some notable new features that shipped in this release.
Dynamic Task Mapping (AIP-42)
There’s now first-class support for dynamic tasks in Airflow. What this means is that you can generate tasks dynamically at runtime. Much like using a for loop
to create a list of tasks, here you can create the same tasks without having to know the exact number of tasks ahead of time.
You can have a task generate the list to iterate over, which is not possible with a for loop.
Here is an example:
from datetime import datetime

from airflow import DAG
from airflow.decorators import task

@task
def make_list():
    # This can come from an API call, a database query, or almost anything you like,
    # as long as the resulting list/dictionary can be stored in the current XCom backend.
    return [1, 2, {"a": "b"}, "str"]

@task
def consumer(arg):
    print(list(arg))

with DAG(dag_id="dynamic-map", start_date=datetime(2022, 4, 2)) as dag:
    consumer.expand(arg=make_list())
More information can be found here: Dynamic Task Mapping
Grid View replaces Tree View
Grid view replaces tree view in Airflow 2.3.0.
Purge history from metadata database
Airflow 2.3.0 introduces a new airflow db clean command that can be used to purge old data from the metadata database.
You would want to use this command when you need to reduce the size of the metadata database.
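As a rough sketch (the flag names below are taken from the 2.3 CLI and may differ between versions), a dry run followed by the actual cleanup might look like this:
# Preview what would be purged for entries older than the given timestamp, without deleting anything
airflow db clean --clean-before-timestamp '2022-01-01 00:00:00+00:00' --dry-run
# Run the actual cleanup once the preview looks right (--yes skips the confirmation prompt)
airflow db clean --clean-before-timestamp '2022-01-01 00:00:00+00:00' --yes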
More information can be found here: Purge history from metadata database
LocalKubernetesExecutor
There is a new executor named LocalKubernetesExecutor. This executor helps you run some tasks using LocalExecutor and run another set of tasks using the KubernetesExecutor in the same deployment based on the task’s queue.
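A minimal sketch of how that routing might look in a DAG (the "kubernetes" queue name is an assumption here; it is controlled by the [local_kubernetes_executor] kubernetes_queue setting):
from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(dag_id="local_k8s_example", start_date=datetime(2022, 5, 1), schedule_interval=None) as dag:
    # Default queue: handled by the LocalExecutor side of LocalKubernetesExecutor
    quick_check = BashOperator(task_id="quick_check", bash_command="echo fast")
    # Tasks on the configured Kubernetes queue run via the KubernetesExecutor side
    heavy_job = BashOperator(task_id="heavy_job", bash_command="echo heavy", queue="kubernetes")
    quick_check >> heavy_job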
More information can be found here: LocalKubernetesExecutor
DagProcessorManager as standalone process (AIP-43)
As of 2.3.0, you can run the DagProcessorManager as a standalone process. Because the DagProcessorManager runs user code, separating it from the scheduler process and running it as an independent process on a different host is a good idea.
The airflow dag-processor CLI command starts the DagProcessorManager as its own process. Before you can run it as a standalone process, you need to set [scheduler] standalone_dag_processor to True.
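Put together, a minimal sketch of the setup looks like this (the environment-variable form follows Airflow's usual config mapping):
# airflow.cfg (or export AIRFLOW__SCHEDULER__STANDALONE_DAG_PROCESSOR=True)
[scheduler]
standalone_dag_processor = True

# Then, on the host that should parse DAG files:
airflow dag-processor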
More information can be found here: dag-processor CLI command
JSON serialization for connections
You can now create connections using the JSON serialization format.
airflow connections add 'my_prod_db' \
--conn-json '{
"conn_type": "my-conn-type",
"login": "my-login",
"password": "my-password",
"host": "my-host",
"port": 1234,
"schema": "my-schema",
"extra": {
"param1": "val1",
"param2": "val2"
}
}'
You can also use the JSON serialization format when defining a connection via environment variables.
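For example (with purely illustrative connection values), the same connection defined as an environment variable could look like this:
export AIRFLOW_CONN_MY_PROD_DB='{
    "conn_type": "my-conn-type",
    "login": "my-login",
    "password": "my-password",
    "host": "my-host",
    "port": 1234,
    "schema": "my-schema",
    "extra": {"param1": "val1", "param2": "val2"}
}'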
More information can be found here: JSON serialization for connections
Airflow db downgrade and Offline generation of SQL scripts
Airflow 2.3.0 introduced a new command airflow db downgrade that will downgrade the database to your chosen version.
You can also generate the downgrade/upgrade SQL scripts for your database and run them manually against your database, or just view the SQL queries that would be run by the downgrade/upgrade command.
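As a rough sketch (the flag names here are assumptions and may not match your Airflow version exactly), a downgrade and an offline SQL preview could look like:
# Downgrade the metadata database to the schema used by an earlier Airflow version
airflow db downgrade --to-version "2.2.5"
# Only print the SQL an upgrade would run, without applying it
airflow db upgrade --show-sql-only --from-version "2.2.5" --to-version "2.3.0"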
More information can be found here: Airflow db downgrade and Offline generation of SQL scripts
Reuse of decorated tasks
You can now reuse decorated tasks across your dag files. A decorated task has an override method that allows you to override its arguments.
Here’s an example:
from datetime import datetime

from airflow.decorators import dag, task

@task
def add_task(x, y):
    print(f"Task args: x={x}, y={y}")
    return x + y

@dag(start_date=datetime(2022, 1, 1))
def mydag():
    start = add_task.override(task_id="start")(1, 2)
    for i in range(3):
        start >> add_task.override(task_id=f"add_start_{i}")(start, i)

example_dag = mydag()
More information can be found here: Reuse of decorated tasks
Other small features
This isn’t a comprehensive list, but some noteworthy or interesting small features include:
Support different timeout value for dag file parsing
airflow dags reserialize command to reserialize dags
Events Timetable
SmoothOperator - Operator that does literally nothing except logging a YouTube link to
Sade’s “Smooth Operator”. Enjoy!
Contributors
Thanks to everyone who contributed to this release: Ash Berlin-Taylor, Brent Bovenzi, Daniel Standish, Elad, Ephraim Anierobi, Jarek Potiuk, Jed Cunningham, Josh Fell, Kamil Breguła, Kanthi, Kaxil Naik, Khalid Mammadov, Malthe Borch, Ping Zhang, Tzu-ping Chung and many others who keep making Airflow better for everyone.
|
|
Posted
about 4 years
ago
I’m proud to announce that Apache Airflow 2.2.0 has been released. It contains over 600 commits since 2.1.4 and includes 30 new features, 84 improvements, 85 bug fixes, and many internal and doc changes.
Details:
📦 PyPI: https://pypi.org/project/apache-airflow/2.2.0/
📚 Docs: https://airflow.apache.org/docs/apache-airflow/2.2.0/
🛠️ Changelog: https://airflow.apache.org/docs/apache-airflow/2.2.0/changelog.html
🐳 Docker Image: docker pull apache/airflow:2.2.0
🚏 Constraints: https://github.com/apache/airflow/tree/constraints-2.2.0
As the changelog is quite large, the following are some notable new features that shipped in this release.
Custom Timetables (AIP-39)
Airflow has historically used cron expressions and timedeltas to represent when a DAG should run. This worked for a lot of use cases, but not all. For example, running daily on Monday-Friday, but not on weekends wasn’t possible.
To provide more scheduling flexibility, determining when a DAG should run is now done with Timetables. Of course, backwards compatibility has been maintained - cron expressions and timedeltas are still fully supported. However, timetables are pluggable, so you can add your own custom timetable to fit your needs! For example, you could write a timetable that schedules DagRuns on whatever irregular cadence your data requires.
execution_date has long been confusing to new Airflowers, so as part of this change a new concept named data_interval has been added to Airflow to replace it: the period of data that a task should operate on. The following context variables are now available (see the small example after this list):
logical_date (aka execution_date)
data_interval_start (same value as execution_date for cron)
data_interval_end (aka next_execution_date)
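One way to read these values is to declare matching argument names in a TaskFlow function, which Airflow fills in from the task context at runtime; a minimal sketch:
from datetime import datetime

from airflow.decorators import dag, task

@dag(start_date=datetime(2021, 10, 1), schedule_interval="@daily", catchup=False)
def data_interval_example():
    @task
    def report(logical_date=None, data_interval_start=None, data_interval_end=None):
        # These arguments are injected from the task context at runtime
        print(f"Run {logical_date}: data from {data_interval_start} to {data_interval_end}")
    report()

example_dag = data_interval_example()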
If you write your own timetables, keep in mind they should be idempotent and fast as they are used in the scheduler to create DagRuns.
More information can be found at: Customizing DAG Scheduling with Timetables
Deferrable Tasks (AIP-40)
Deferrable tasks allow operators or sensors to defer themselves until a light-weight async check passes, at which point they can resume executing. Most importantly, this frees the worker slot, and notably any resources it was using, to be returned to Airflow. This makes simple things like monitoring a job in an external system or watching for an event much cheaper.
To support this feature, a new component has been added to Airflow, the triggerer, which is the daemon process that runs the asyncio event loop.
Airflow 2.2.0 ships with 2 deferrable sensors, DateTimeSensorAsync and TimeDeltaSensorAsync, both of which are drop-in replacements for the existing corresponding sensor.
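For example, swapping in the async variant is a one-line change (assuming the import path below, and that a triggerer process is running):
from datetime import datetime

from airflow import DAG
from airflow.sensors.date_time import DateTimeSensorAsync

with DAG(dag_id="deferrable_example", start_date=datetime(2021, 10, 1), schedule_interval="@daily", catchup=False) as dag:
    # Defers to the triggerer instead of occupying a worker slot while waiting
    wait_a_bit = DateTimeSensorAsync(
        task_id="wait_a_bit",
        target_time="{{ data_interval_end + macros.timedelta(minutes=10) }}",
    )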
More information can be found at:
Deferrable Operators & Triggers
Custom @task decorators and @task.docker
Airflow 2.2.0 allows providers to create custom @task decorators in the TaskFlow interface.
The @task.docker decorator is one such decorator that allows you to run a function in a Docker container. Airflow handles getting the code into the container and returning the XCom result - you just worry about your function. This is particularly useful when you have conflicting dependencies between Airflow itself and the tasks you need to run.
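A small sketch (the image name is arbitrary, and the Docker provider package must be installed for @task.docker to be available):
from datetime import datetime

from airflow.decorators import dag, task

@dag(start_date=datetime(2021, 10, 1), schedule_interval=None)
def docker_taskflow_example():
    @task.docker(image="python:3.9-slim")
    def double(x: int):
        # Runs inside the container; Airflow ships the function there and returns the result via XCom
        return x * 2
    double(21)

example_dag = docker_taskflow_example()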
More information on creating custom @task decorators can be found at: Creating Custom @task Decorators
More information on the @task.docker decorator can be found at: Using the Taskflow API with Docker or Virtual Environments
Validation of DAG params
You can now apply validation on DAG params by passing a Param object for each param. The Param object supports the full JSON Schema validation specification.
Currently this only functions with manually triggered DAGs, but it does set the stage for future params-related functionality.
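A short sketch of attaching validation to a DAG param (the param name and bounds are illustrative):
from datetime import datetime

from airflow import DAG
from airflow.models.param import Param
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="param_validation_example",
    start_date=datetime(2021, 10, 1),
    schedule_interval=None,
    params={"retries": Param(3, type="integer", minimum=0, maximum=10)},
) as dag:
    # A manual trigger supplying e.g. retries=-1 or retries="many" fails validation
    show = BashOperator(task_id="show", bash_command="echo {{ params.retries }}")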
More information can be found at: Params
Other small features
This isn’t a comprehensive list, but some noteworthy or interesting small features include:
Testing Connections from the UI - test the credentials for your Connection actually work
Duplicating Connections from the UI
DAGs “Next run” info is shown in the UI, including when the run will actually start
airflow standalone command runs all of the Airflow components directly without docker - great for local development
Contributors
Thanks to everyone who contributed to this release: Andrew Godwin, Ash Berlin-Taylor, Brent Bovenzi, Elad Kalif, Ephraim Anierobi, James Timmins, Jarek Potiuk, Jed Cunningham, Josh Fell, Kamil Breguła, Kaxil Naik, Malthe Borch, Sam Wheating, Sumit Maheshwari, Tzu-ping Chung and many others