
Self-Hosting Apache Airflow on AWS - Part 1

An orchestrator is a vital component of any modern data engineering stack. It allows you to define, schedule, and monitor complex data pipelines as code, ensuring tasks run in the right order and failures are handled gracefully. There are quite a few choices when it comes to orchestrators, but the most popular ones are Apache Airflow, Prefect, and Dagster, with Apache Airflow being the most widely adopted.

All three can be self-hosted or run as a managed service: by their creators in the case of Dagster and Prefect, or by multiple vendors/clouds in the case of Airflow. In this blog post, we will focus on Airflow.

Just Use ‘X’ (AWS MWAA, GCP Composer, or Astronomer)

For many teams, it’s the right call. It simplifies operations significantly and lets you focus on what’s important (unless your business model is running Airflow SaaS). Managing your own infrastructure requires skills that are not necessarily adjacent to pipeline or software development, often requiring a dedicated person to manage it.

The initial setup is just part of the burden, as you also have to monitor, patch/update, handle disaster recovery, and more - all while dealing with potential time sinks that can pull you away from revenue-generating activities. However, not everything is rainbows. Like everything in software - it depends.

Startups and smaller teams that need velocity should use managed platforms. They prioritize time to market and enable fast, efficient data processing. Larger teams with dedicated resources can venture into self-hosting to optimize costs and increase flexibility. With modern cloud providers offering easy-to-manage compute, self-hosted solutions now feel almost “serverless.”

AWS MWAA: Managed Doesn’t Always Mean Better

AWS is my cloud platform of choice, so let’s look at its managed Airflow offering - MWAA. Despite its warts and limitations, it’s a genuinely solid product. It provides all the features you’d expect from an Airflow deployment: high availability, workers that scale automatically with the number of queued tasks, and even a web server that scales to handle increased user traffic. It supports custom plugins and Airflow providers, monitoring is built in, and you can install any Python package compatible with your Airflow version. Nonetheless, there are important reasons why one might go the self-managed route instead of MWAA.

  • MWAA pricing scales steeply with environment size. At medium-to-large workloads, self-hosted setups on right-sized EC2 or Fargate can cut workflow infrastructure costs by 40–70%.
  • MWAA lags behind Airflow releases by at least a minor version. The most current version as of today - 3.0.6 - was released back in August 2025. That’s usually fine; however, this particular release ships with an executor bug caused by the deprecation of fork in Python 3.12, resulting in numerous SIGKILLs.
  • MWAA imposes constraints on plugin packages and their sizes. Complex dependency trees, private PyPI mirrors, or custom operators can be painful to shoehorn in.

That got me thinking: how crazy would it be to try and host your own Airflow on AWS? Can’t be too hard, right? Throw a bunch of containers in ECS, provision an RDS database, connect them, and you’re set! That’s exactly what I did, and I’ll walk you through the process.

Self-Hosted Airflow

The full code is available on GitHub

Before we start digging further, I want to give the lay of the land first.

The repo is divided into two layers - platform and services. The platform layer contains core AWS resources that services build upon. Terraform is used to provision the infrastructure, so it makes sense to split these layers into two separate states: I don’t want changes to my services to affect core platform resources.

I will also skip over some resources in the Terraform code snippets to keep this post from turning into a manual. Many resources, such as IAM roles and policies, are essential - many AWS services won’t function without them - so please refer to the GitHub repo to get the full picture.

That being said, here are some components that make up the Airflow service:

  • ECS for container orchestration
  • RDS for metadata database (PostgreSQL)
  • SSM parameters for configuration storage
  • Tailscale router for secure access to my VPC
  • Tailscale proxy for direct access to Airflow UI

ECS is awesome, especially Fargate. You get a lot of features for a fraction of the complexity of Kubernetes, so it’s a no-brainer choice in our scenario. I decided to try out a recently added mode called Managed Instances, basically a hybrid between ECS on EC2 and Fargate. The value proposition is that AWS manages the underlying EC2 instances for a small fee, giving you more flexibility to choose different compute families (you can even pick VMs with GPUs) while avoiding the Fargate premium, which quickly gets expensive for always-on workloads.

Tailscale is another cool piece of tech: a VPN service that makes it easy to connect your devices and servers into a private network. You install the Tailscale client on each device, join the tailnet, and they instantly see each other over a private IP range (100.x.x.x), regardless of whether they’re behind NAT or firewalls.

Platform

Networking

An AWS account already has a default VPC you can use, but I prefer to ignore it and create a separate VPC fully from scratch - I called it main. It has two private subnets and one public subnet. All are dual-stack, meaning deployed resources get both IPv4 and IPv6 addresses. The benefit here is that we can route IPv6 traffic through an egress-only internet gateway, which is free, and avoid NAT gateway costs whenever our destinations support IPv6. NAT is still used for services that don’t.
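A sketch of that routing split (the route table and NAT gateway names here are illustrative, not necessarily the names used in the repo):

```hcl
# Egress-only internet gateway: free IPv6 egress for private subnets
resource "aws_egress_only_internet_gateway" "main" {
  vpc_id = aws_vpc.main.id
}

# IPv6 traffic leaves through the egress-only gateway at no extra cost
resource "aws_route" "private_ipv6_egress" {
  route_table_id              = aws_route_table.private.id
  destination_ipv6_cidr_block = "::/0"
  egress_only_gateway_id      = aws_egress_only_internet_gateway.main.id
}

# IPv4-only destinations still fall back to the (paid) NAT gateway
resource "aws_route" "private_ipv4_nat" {
  route_table_id         = aws_route_table.private.id
  destination_cidr_block = "0.0.0.0/0"
  nat_gateway_id         = aws_nat_gateway.main.id
}
```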

Tailscale is deployed on EC2 as a subnet router in the public subnet, enabling peer-to-peer traffic to authorized clients. I’m using a public Terraform module here to simplify the setup.

module "tailscale" {
  source  = "masterpointio/tailscale/aws"
  version = "2.1.0"

  vpc_id     = aws_vpc.main.id
  subnet_ids = [aws_subnet.public.id] # Ensure subnet router is in a public subnet

  advertise_routes = [
    aws_subnet.public.cidr_block,
    aws_subnet.private_a.cidr_block,
    aws_subnet.private_b.cidr_block,
  ]

  additional_security_group_ids = [aws_security_group.tailscale.id]  # Attach the security group to the subnet router
  tailscaled_extra_flags        = ["--port=${local.tailscale_port}"] # Ensure `tailscaled` listens on the same port as the security group is configured

  instance_type           = "t4g.nano"
  desired_capacity        = 1
  ssh_enabled             = false
  session_logging_enabled = false
  ssm_state_enabled       = true

  name        = "tailscale-subnet-router"
  primary_tag = "router"

  tags = local.common_tags
}

By advertising all my subnets, I can securely access my deployed resources from my laptop with the Tailscale client. We can go even further and use the AWS hostnames assigned to resources like RDS, instead of IPs, by utilizing Tailscale’s split DNS feature. All I had to do was point to the AWS DNS server at the base of the VPC CIDR + 2 (10.0.0.2) and provide the region-specific domain, rds.amazonaws.com. You can read more about it here.
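Split DNS can be set up in the Tailscale admin console, but it can also be done declaratively. A sketch using the tailscale Terraform provider (this resource is my illustration, not part of the repo; a 10.0.0.0/16 VPC is assumed):

```hcl
# Route only *.rds.amazonaws.com lookups to the VPC resolver (CIDR base + 2),
# so RDS hostnames resolve to private IPs for clients on the tailnet.
resource "tailscale_dns_split_nameservers" "rds" {
  domain      = "rds.amazonaws.com"
  nameservers = [cidrhost(aws_vpc.main.cidr_block, 2)] # 10.0.0.2 for 10.0.0.0/16
}
```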

ECS

The ECS cluster is defined in the platform layer because many services can utilize it:

resource "aws_ecs_cluster" "this" {
  name = "data-platform-prod"
}

As mentioned previously, I’m using ECS Managed Instances, which requires a custom capacity provider to provision the VMs the containers will run on. Here is a snippet of the provider:

# Capacity provider for ECS managed instances – used for Airflow core components that require EC2 hosts.
resource "aws_ecs_capacity_provider" "managed_instances" {
  name    = "managed-instances-provider"
  cluster = aws_ecs_cluster.this.name

  managed_instances_provider {
    infrastructure_role_arn = aws_iam_role.ecs_infrastructure.arn

    instance_launch_template {
      ec2_instance_profile_arn = aws_iam_instance_profile.ecs_managed_instance.arn
      monitoring               = "BASIC"

      storage_configuration {
        storage_size_gib = 20
      }

      instance_requirements {
        cpu_manufacturers = ["amazon-web-services"]

        bare_metal            = "excluded"
        burstable_performance = "included"

        allowed_instance_types = var.ecs_instance_types

        memory_mib {
          min = var.ecs_instance_memory_mib.min
          max = var.ecs_instance_memory_mib.max
        }

        vcpu_count {
          min = var.ecs_instance_vcpu_count.min
          max = var.ecs_instance_vcpu_count.max
        }
      }

      network_configuration {
        subnets         = [for s in local.ecs_subnets : s.id]
        security_groups = [aws_security_group.managed_instances.id]
      }
    }
  }
}

As you can see, you can configure quite a few parameters here to best satisfy your compute needs. After that, you have to register the provider with the cluster like so:

resource "aws_ecs_cluster_capacity_providers" "this" {
  cluster_name = aws_ecs_cluster.this.name

  capacity_providers = [
    aws_ecs_capacity_provider.managed_instances.name,
    "FARGATE"
  ]

  default_capacity_provider_strategy {
    capacity_provider = aws_ecs_capacity_provider.managed_instances.name
    weight            = 100
  }

  depends_on = [time_sleep.capacity_provider_active]
}

Make sure to specify default_capacity_provider_strategy; otherwise, every service in the cluster must specify its own strategy. I also had to add an artificial sleep between the creation of the provider and its registration using the time_sleep resource, because it takes a while for the AWS backend to register that the provider was created.
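The delay itself is a plain time_sleep resource wired between the two; the duration is just a value that worked in practice, not an AWS-documented requirement:

```hcl
# Wait for the capacity provider to become usable on the AWS side before
# the cluster attachment references it. 30s is an educated guess.
resource "time_sleep" "capacity_provider_active" {
  create_duration = "30s"

  depends_on = [aws_ecs_capacity_provider.managed_instances]
}
```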

That’s all the auxiliary resources that we need to start provisioning Airflow. Done with the platform!

Airflow Service

The most interesting resources are probably the RDS setup and the actual task definitions that will be running in the cluster. Let’s look at the RDS first.

RDS

We create a subnet group spanning two availability zones via our private subnets. An RDS deployment requires at least two AZs even if you are not using a high-availability setup.

resource "aws_db_subnet_group" "this" {
  name       = "airflow-db-subnet-group"
  subnet_ids = data.aws_subnets.private_subnets.ids
}

Next, we create a security group so that only our ECS Airflow tasks and the Tailscale subnet router can access the database - the latter lets us connect to RDS with SQL developer tools like DBeaver.

resource "aws_security_group" "rds" {
  name        = "airflow-rds-sg"
  description = "Controls access to the Metadata RDS instance."
  vpc_id      = data.aws_vpc.this.id

  ingress {
    description = "DB port from allowed security groups"
    from_port   = 5432
    to_port     = 5432
    protocol    = "tcp"
    security_groups = [
      aws_security_group.airflow.id,       # Allow access from Airflow tasks
      data.aws_security_group.tailscale.id # Allow access from Tailscale router (for maintenance/debugging)
    ]
  }
}

Next comes the database instance itself:

resource "aws_db_instance" "this" {
  identifier = "airflow-db"

  engine         = "postgres"
  engine_version = "16"

  instance_class    = var.db_instance_class
  allocated_storage = var.db_storage_size

  db_name = local.airflow_db

  username = random_pet.admin_user.id
  password = random_password.admin_user.result

  # Networking
  db_subnet_group_name   = aws_db_subnet_group.this.name
  vpc_security_group_ids = [aws_security_group.rds.id]
  network_type           = "DUAL"
  multi_az               = false

  # Updates
  auto_minor_version_upgrade  = true
  allow_major_version_upgrade = false

  # Backups
  deletion_protection     = false
  skip_final_snapshot     = true
  backup_retention_period = 1

  # Metrics
  database_insights_mode       = "standard"
  performance_insights_enabled = false

  # Encryption at rest
  storage_encrypted = false

  # No public endpoint – access only from within the VPC
  publicly_accessible = false
}

The admin username and password are generated by Terraform’s random provider and securely stored in SSM. The “randomness” is attached to a dedicated variable, var.db_admin_credentials_ver, that dictates the current version of those credentials. Changing it rotates the admin credentials.
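A sketch of that wiring (the SSM parameter path is illustrative): keepers ties each random value to the version variable, so bumping var.db_admin_credentials_ver forces new credentials on the next apply.

```hcl
# Regenerated whenever var.db_admin_credentials_ver changes
resource "random_pet" "admin_user" {
  separator = "_" # RDS master usernames only allow letters, digits, underscores
  keepers   = { version = var.db_admin_credentials_ver }
}

resource "random_password" "admin_user" {
  length  = 32
  special = false
  keepers = { version = var.db_admin_credentials_ver }
}

# Stored encrypted so deployment tooling can read it later
resource "aws_ssm_parameter" "db_admin_password" {
  name  = "/airflow/db/admin-password" # illustrative path
  type  = "SecureString"
  value = random_password.admin_user.result
}
```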

Make sure to increase the backup retention period, and consider keeping the final snapshot or enabling deletion protection to guard against accidental deletion.


Airflow requires a dedicated user that owns its database to exist before the migration process runs. We could package a custom Lambda or an ECS task to run the setup code. Instead, we can use the declarative power of Terraform and the networking capabilities of Tailscale to perform the bootstrapping with the postgresql provider.

resource "postgresql_role" "airflow_user" {
  name     = local.u_airflow
  password = local.u_airflow_pass
  login    = true

  # Do not drop the role on destroy, as it may still own database objects.
  skip_drop_role = true

  depends_on = [aws_db_instance.this]
}

resource "postgresql_grant" "db_privileges" {
  database    = local.airflow_db
  role        = local.u_airflow
  object_type = "database"
  privileges  = ["ALL"]
}

resource "postgresql_grant" "airflow_user_public_schema" {
  database    = local.airflow_db
  role        = local.u_airflow
  schema      = "public"
  object_type = "schema"
  privileges  = ["ALL"] # ALL on a schema = CREATE + USAGE

  depends_on = [postgresql_grant.db_privileges]
}

Because we configured the Tailscale router earlier, we can access the RDS instance from the Tailscale network when connected. This allows us to declaratively create an Airflow user and define its permissions.
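For completeness, the provider configuration looks roughly like this - it assumes terraform apply runs from a machine joined to the tailnet, since the RDS endpoint is only reachable from inside the VPC:

```hcl
# Connects as the RDS admin user; host comes straight from the instance
# resource, which also gives Terraform the right dependency ordering.
provider "postgresql" {
  host      = aws_db_instance.this.address
  port      = 5432
  username  = random_pet.admin_user.id
  password  = random_password.admin_user.result
  sslmode   = "require"
  superuser = false # RDS never grants a true superuser role
}
```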

ECS Service and Task Definitions

Airflow can be deployed in two ways: as a standalone setup, where all components (scheduler, UI, triggerer, and DAG parser) run in the same container, or with each component deployed separately and fronted by a service. The second approach is more complex than the simpler standalone option. However, with a standalone deployment, you lose important benefits such as high availability and horizontal scaling. The good news is that ECS makes it easy to transition from a standalone setup later, so it may be better to start simple and refactor when the added complexity is justified. We’ll begin with the first option.

Here is the ECS service for the Airflow task:

resource "aws_ecs_service" "airflow_service" {
  name                = "airflow-service"
  cluster             = var.cluster
  desired_count       = 0
  scheduling_strategy = "REPLICA"
  task_definition     = aws_ecs_task_definition.airflow.arn

  propagate_tags = "SERVICE"

  deployment_minimum_healthy_percent = 0
  deployment_maximum_percent         = 100
  enable_ecs_managed_tags            = true

  deployment_configuration {
    strategy = "ROLLING"
  }

  # If a deployment fails, roll back to the last known good state
  # Consider subscribing to EventBridge events for ECS deployment failures to trigger alerts
  deployment_circuit_breaker {
    enable   = true
    rollback = true
  }

  network_configuration {
    assign_public_ip = false
    subnets          = data.aws_subnets.private_subnets.ids
    security_groups  = [aws_security_group.airflow.id]
  }

  lifecycle {
    ignore_changes = [desired_count, capacity_provider_strategy]
  }

  tags = merge(local.common_tags, {
    Component    = "standalone"
    ImageVersion = var.airflow_version
    Type         = "managed-instances"
  })
}

Notice that desired_count is set to 0 and effectively ignored. We manage this parameter through deployment workflows outside of Terraform. We also enable the deployment circuit breaker to automatically roll back to the previous task revision if the latest deployment fails to stabilize. Additionally, we set deployment_minimum_healthy_percent to 0 and deployment_maximum_percent to 100, allowing all running container instances to be deprovisioned when a new deployment is triggered.

While this approach causes brief service interruptions, it allows ECS to provision appropriately sized infrastructure for Airflow without over-provisioning, since we don’t need to support two Airflow instances running simultaneously. If uninterrupted service is required, it may be better to deploy each Airflow component separately to achieve true high availability.
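As hinted in the comments of the service snippet, a rollback alone is silent - an EventBridge rule can surface failed deployments. A sketch, with the SNS topic assumed to exist elsewhere:

```hcl
# Fire whenever an ECS service deployment fails in this account/region
resource "aws_cloudwatch_event_rule" "ecs_deployment_failed" {
  name = "ecs-deployment-failed"

  event_pattern = jsonencode({
    source        = ["aws.ecs"]
    "detail-type" = ["ECS Deployment State Change"]
    detail = {
      eventName = ["SERVICE_DEPLOYMENT_FAILED"]
    }
  })
}

# Forward matching events to an alerting SNS topic (assumed defined elsewhere)
resource "aws_cloudwatch_event_target" "deployment_alerts" {
  rule = aws_cloudwatch_event_rule.ecs_deployment_failed.name
  arn  = aws_sns_topic.alerts.arn
}
```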

Let’s look at the tasks now:

# This task definition is used for running one-off commands on Airflow cluster. It won't be used for the main Airflow service.
resource "aws_ecs_task_definition" "airflow_utils" {
  family             = "airflow-utils"
  task_role_arn      = aws_iam_role.airflow_task.arn
  execution_role_arn = aws_iam_role.execution.arn

  cpu    = "512"
  memory = "1024"

  network_mode             = "awsvpc"
  requires_compatibilities = ["MANAGED_INSTANCES"]

  runtime_platform {
    operating_system_family = "LINUX"
    cpu_architecture        = "ARM64"
  }

  container_definitions = jsonencode([
    {
      name      = "airflow-utils"
      image     = "apache/airflow:${var.airflow_version}"
      essential = true
      command   = []

      secrets = [{
        name      = "AIRFLOW__DATABASE__SQL_ALCHEMY_CONN"
        valueFrom = aws_ssm_parameter.airflow_db_connection_str.arn
      }]

      environment = [
        {
          name  = "AIRFLOW__CORE__AUTH_MANAGER"
          value = "airflow.providers.fab.auth_manager.fab_auth_manager.FabAuthManager"
        }
      ]

      mountPoints    = []
      portMappings   = []
      systemControls = []
      volumesFrom    = []

      logConfiguration = {
        logDriver = "awslogs"
        options = {
          awslogs-group         = aws_cloudwatch_log_group.airflow.name
          awslogs-region        = local.aws_region
          awslogs-stream-prefix = "tasks"
        }
      }
    }
  ])
}

airflow_utils will be used by our deployment system to run commands on the cluster, such as creating an initial user, running migrations, and more. We use the RDS connection string stored in SSM and reference it via a parameter. We also use FabAuthManager as the authentication mechanism, since newer Airflow versions default to a more minimal variant. There is also an experimental AWS Auth Manager that I think is worth exploring, as it would allow us to keep using familiar IAM constructs to manage Airflow users and their permissions.
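The referenced parameter holds a standard SQLAlchemy URL built from the values created earlier - something along these lines (the parameter path is illustrative):

```hcl
# Full SQLAlchemy connection string for the Airflow metadata database.
# Stored as a SecureString since it embeds the password.
resource "aws_ssm_parameter" "airflow_db_connection_str" {
  name = "/airflow/db/connection-string" # illustrative path
  type = "SecureString"
  value = format(
    "postgresql+psycopg2://%s:%s@%s:5432/%s",
    local.u_airflow,
    local.u_airflow_pass,
    aws_db_instance.this.address,
    local.airflow_db,
  )
}
```

Note that injecting a SecureString into a task requires the ECS execution role to have ssm:GetParameters and, for custom keys, kms:Decrypt.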

Now, the airflow task itself:

resource "aws_ecs_task_definition" "airflow" {
  family             = "airflow"
  task_role_arn      = aws_iam_role.airflow_task.arn
  execution_role_arn = aws_iam_role.execution.arn

  cpu    = "1024"
  memory = "2048"

  network_mode             = "awsvpc"
  requires_compatibilities = ["MANAGED_INSTANCES"]

  runtime_platform {
    operating_system_family = "LINUX"
    cpu_architecture        = "ARM64"
  }

  volume {
    name = "ts-serve-config"
  }

  container_definitions = jsonencode([
    {
      name       = "ts-config-init"
      image      = "amazon/aws-cli"
      essential  = false
      entryPoint = ["/bin/sh", "-c"]
      command = [
        "aws ssm get-parameter --name ${aws_ssm_parameter.tailscale_serve_config.name} --region ${local.aws_region} --query Parameter.Value --output text > /var/ts-config/serve.json"
      ]

      mountPoints = [{
        sourceVolume  = "ts-serve-config"
        containerPath = "/var/ts-config"
        readOnly      = false
      }]

      environment    = []
      portMappings   = []
      systemControls = []
      volumesFrom    = []

    },

    {
      name      = "airflow"
      image     = "apache/airflow:${var.airflow_version}"
      essential = true
      command = [
        "bash",
        "-c",
        <<-EOC
          airflow api-server --port 8080 & \
          airflow scheduler & \
          airflow dag-processor & \
          airflow triggerer
        EOC
      ]

      secrets = [{
        name      = "AIRFLOW__DATABASE__SQL_ALCHEMY_CONN"
        valueFrom = aws_ssm_parameter.airflow_db_connection_str.arn
      }]

      environment = [
        {
          name  = "AIRFLOW__CORE__AUTH_MANAGER"
          value = "airflow.providers.fab.auth_manager.fab_auth_manager.FabAuthManager"
        }
      ]
      mountPoints = []

      portMappings = [{
        containerPort = 8080
        hostPort      = 8080
        protocol      = "tcp"
      }]

      systemControls = []
      volumesFrom    = []

      healthCheck = {
        command     = ["CMD-SHELL", "curl -f http://localhost:8080/api/v2/monitor/health || exit 1"]
        interval    = 30
        timeout     = 5
        retries     = 3
        startPeriod = 60
      }

      logConfiguration = {
        logDriver = "awslogs"
        options = {
          awslogs-group         = aws_cloudwatch_log_group.airflow.name
          awslogs-region        = local.aws_region
          awslogs-stream-prefix = "airflow"
        }
      }
    },

    {
      name      = "tailscale"
      image     = "tailscale/tailscale:stable"
      essential = true

      secrets = [{
        name      = "TS_AUTHKEY"
        valueFrom = data.aws_ssm_parameter.tailscale_auth_key.arn
      }]

      environment = [
        {
          name  = "TS_HOSTNAME"
          value = "airflow"
        },
        {
          name  = "TS_USERSPACE"
          value = "false"
        },
        {
          name  = "TS_SERVE_CONFIG"
          value = "/var/ts-config/serve.json"
        },
        {
          name  = "TS_TAILSCALED_EXTRA_ARGS"
          value = "--state=${local.ssm_tailscale_state_arn}"
        }

      ]

      mountPoints = [{
        sourceVolume  = "ts-serve-config"
        containerPath = "/var/ts-config"
        readOnly      = true
      }]

      linuxParameters = {
        capabilities = {
          add  = ["NET_ADMIN", "NET_RAW"]
          drop = []
        }
        devices = [{
          hostPath      = "/dev/net/tun"
          containerPath = "/dev/net/tun"
          permissions   = ["read", "write", "mknod"]
        }]
      }

      portMappings   = []
      systemControls = []
      volumesFrom    = []

      dependsOn = [
        {
          containerName = "ts-config-init"
          condition     = "SUCCESS"
        },
        {
          containerName = "airflow"
          condition     = "HEALTHY"
        }
      ]

      logConfiguration = {
        logDriver = "awslogs"
        options = {
          awslogs-group         = aws_cloudwatch_log_group.airflow.name
          awslogs-region        = local.aws_region
          awslogs-stream-prefix = "tailscale"
        }
      }
    },
  ])
}

There are a few things going on here, so let’s break it down. ECS allows you to define sidecars, which are supporting containers that run alongside your application. They run on the same network, meaning they can communicate over localhost. In this task definition, I’m setting up a Tailscale proxy that will route traffic from its network interface to the Airflow UI running on port 8080.

In the section below, we are fetching a Tailscale proxy config from SSM, storing it in the container volume, and exiting.

{
  name       = "ts-config-init"
  image      = "amazon/aws-cli"
  essential  = false
  entryPoint = ["/bin/sh", "-c"]
  command = [
    "aws ssm get-parameter --name ${aws_ssm_parameter.tailscale_serve_config.name} --region ${local.aws_region} --query Parameter.Value --output text > /var/ts-config/serve.json"
  ]

  mountPoints = [{
    sourceVolume  = "ts-serve-config"
    containerPath = "/var/ts-config"
    readOnly      = false
  }]

  environment    = []
  portMappings   = []
  systemControls = []
  volumesFrom    = []
}

Notice that this sidecar is marked as non-essential so that the Airflow service won’t try to bring it back up after it exits.

The second container is our main Airflow standalone deployment:

{
  name      = "airflow"
  image     = "apache/airflow:${var.airflow_version}"
  essential = true
  command = [
    "bash",
    "-c",
    <<-EOC
      airflow api-server --port 8080 & \
      airflow scheduler & \
      airflow dag-processor & \
      airflow triggerer
    EOC
  ]

  secrets = [{
    name      = "AIRFLOW__DATABASE__SQL_ALCHEMY_CONN"
    valueFrom = aws_ssm_parameter.airflow_db_connection_str.arn
  }]

  environment = [
    {
      name  = "AIRFLOW__CORE__AUTH_MANAGER"
      value = "airflow.providers.fab.auth_manager.fab_auth_manager.FabAuthManager"
    }
  ]
  mountPoints = []

  portMappings = [{
    containerPort = 8080
    hostPort      = 8080
    protocol      = "tcp"
  }]

  systemControls = []
  volumesFrom    = []

  healthCheck = {
    command     = ["CMD-SHELL", "curl -f http://localhost:8080/api/v2/monitor/health || exit 1"]
    interval    = 30
    timeout     = 5
    retries     = 3
    startPeriod = 60
  }
  ...
}

We run the start command sequence and define a health check to curl the health endpoint every 30 seconds with 3 attempts before failing. We mark the container as essential and provide the same environment variables as our airflow_utils task.

For the Tailscale proxy, we provide the required config, auth key, and other configuration values. We make sure to wait until Airflow is healthy and the first sidecar has exited. We also request container capabilities like NET_ADMIN and NET_RAW to allow the Tailscale container to configure its network. These capabilities can only be specified when running ECS on EC2 or Managed Instances; they are not supported on Fargate. This is not a big problem, though: Tailscale can run in userspace mode, so no special kernel capabilities are required when deploying on Fargate.
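For reference, a Fargate-friendly sketch of the same container: with TS_USERSPACE set to true, the linuxParameters block and the /dev/net/tun device mapping are dropped entirely (secrets, mounts, dependencies, and logging omitted for brevity):

```hcl
{
  name      = "tailscale"
  image     = "tailscale/tailscale:stable"
  essential = true

  environment = [
    { name = "TS_HOSTNAME", value = "airflow" },
    { name = "TS_USERSPACE", value = "true" }, # userspace networking, no TUN device
    { name = "TS_SERVE_CONFIG", value = "/var/ts-config/serve.json" }
  ]
  # No linuxParameters needed - tailscaled runs without NET_ADMIN/NET_RAW
}
```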

{
  name      = "tailscale"
  image     = "tailscale/tailscale:stable"
  essential = true

  secrets = [{
    name      = "TS_AUTHKEY"
    valueFrom = data.aws_ssm_parameter.tailscale_auth_key.arn
  }]

  environment = [
    {
      name  = "TS_HOSTNAME"
      value = "airflow"
    },
    {
      name  = "TS_USERSPACE"
      value = "false"
    },
    {
      name  = "TS_SERVE_CONFIG"
      value = "/var/ts-config/serve.json"
    },
    {
      name  = "TS_TAILSCALED_EXTRA_ARGS"
      value = "--state=${local.ssm_tailscale_state_arn}"
    }

  ]

  mountPoints = [{
    sourceVolume  = "ts-serve-config"
    containerPath = "/var/ts-config"
    readOnly      = true
  }]

  linuxParameters = {
    capabilities = {
      add  = ["NET_ADMIN", "NET_RAW"]
      drop = []
    }
    devices = [{
      hostPath      = "/dev/net/tun"
      containerPath = "/dev/net/tun"
      permissions   = ["read", "write", "mknod"]
    }]
  }

  dependsOn = [
    {
      containerName = "ts-config-init"
      condition     = "SUCCESS"
    },
    {
      containerName = "airflow"
      condition     = "HEALTHY"
    }
  ]

  ...
}

This is what the Tailscale serve config looks like:

resource "aws_ssm_parameter" "tailscale_serve_config" {
  name        = "/tailscale/airflow-serve-config"
  description = "Tailscale serve config for Airflow (used to expose the Airflow webserver securely without a load balancer)"
  type        = "String"
  value = jsonencode({
    TCP = {
      "443" = {
        HTTPS = true
      }
    }
    Web = {
      "$${TS_CERT_DOMAIN}:443" = {
        Handlers = {
          "/" = {
            Proxy = "http://127.0.0.1:8080"
          }
        }
      }
    }
  })
}

With such a setup, we can access the Airflow UI directly using a Tailscale-assigned hostname and automatic SSL certificates. No browser warnings!

You can already tell Tailscale is pretty awesome when it takes the spotlight away from Airflow so frequently :)

This also removes the need for an ALB; however, once you have several Airflow UI instances running, you would reintroduce one. That concludes the infrastructure part - now on to the deployment glue.

Bootstrapping Script

Once the infrastructure is provisioned, remember that the desired count on the Airflow service is 0. Nothing will run until we change it to a value greater than zero.

However, before we change the count, we have to run a few steps:

  • Migrate the database
  • Create an admin user
  • Update the service to a desired count of 1

The Python script below does exactly that. It also parses Terraform outputs for use in the subsequent commands, which are executed via the airflow_utils task we saw earlier.

/deployments/main.py
def deploy_airflow(terraform_dir: str) -> None:
    """
    Deployment pipeline for Airflow service.
    """

    logger.info(f"Parsing Terraform output from: {terraform_dir}")
    tf_output = parse_terraform_output(terraform_dir)

    # Create reusable context and AWS clients
    ctx = AirflowContext.from_terraform_output(tf_output)
    ecs = boto3.client("ecs")
    waiter = ecs.get_waiter("tasks_stopped")

    # Start db-migrate task
    logger.info(f"Launching db-migrate task in ECS cluster '{ctx.cluster_name}'...")
    db_migrate(ctx, ecs, waiter)

    # Start create-admin-user task
    logger.info(
        f"Launching create-admin-user task in ECS cluster '{ctx.cluster_name}'..."
    )
    create_admin_user(ctx, ecs, waiter)

    # Update airflow service to 1 desired count
    logger.info(
        f"Updating Airflow service '{ctx.service_name}' to 1 desired instance..."
    )
    ecs.update_service(
        cluster=ctx.cluster_name, service=ctx.service_name, desiredCount=1
    )
    logger.info("Airflow service update complete.")

What’s Next

The skeleton is built, now it’s time to put some meat on the frame. Notice how we are using a public image instead of building one ourselves.

While it’s convenient to use a public image, many Airflow use cases require at least some degree of modification to the base image. A proper CI/CD workflow with GitHub Actions is also needed to support the Airflow life cycle (DB backups, updates, rollbacks, etc.). And we haven’t even gotten to configuring Airflow itself, so stay tuned for part 2!

This post is licensed under CC BY 4.0 by the author.
