Traefik 3.0 service discovery in Docker Swarm mode

I recently decided to set up a Docker Swarm cluster for a project I was working on. If you aren’t familiar with Swarm mode, it is similar in some ways to k8s but much less complex, and it is built into Docker. If you are looking for a fairly straightforward way to deploy containers across a number of nodes without all the overhead of k8s, it can be a good choice. It isn’t a very popular or widespread solution these days, though.

Anyway, I set up a VM scale set in Azure with 10 Ubuntu 22.04 VMs and wrote some Ansible scripts to automate installing Docker on each machine, setting up 3 of them as swarm managers and the other 7 as worker nodes. I ssh’d into the primary manager node and created a Docker Compose file for launching an observability stack.

Here is what that docker-compose.yml looks like:

---
services:
  otel-collector:
    image: otel/opentelemetry-collector-contrib:0.88.0
    volumes:
      - /home/user/repo/common/devops/observability/otel-config.yaml:/etc/otel/config.yaml
      - /home/user/repo/log:/log/otel
    command: --config /etc/otel/config.yaml
    environment:
      JAEGER_ENDPOINT: 'tempo:4317'
      LOKI_ENDPOINT: 'http://loki:3100/loki/api/v1/push'
    ports:
      - '8889:8889' # Prometheus metrics exporter (scrape endpoint)
      - '13133:13133' # health_check extension
      - '55679:55679' # ZPages extension
    deploy:
      placement:
        constraints:
          - node.hostname==dockerswa2V8BY4
    networks:
      - traefik
  prometheus:
    container_name: prometheus
    image: prom/prometheus:v2.42.0
    volumes:
      - /home/user/repo/common/devops/observability/prometheus.yml:/etc/prometheus/prometheus.yml
    ports:
      - '9090:9090'
    deploy:
      placement:
        constraints:
          - node.hostname==dockerswa2V8BY4
    networks:
      - traefik
  loki:
    container_name: loki
    image: grafana/loki:2.7.4
    ports:
      - '3100:3100'
    networks:
      - traefik
  grafana:
    container_name: grafana
    image: grafana/grafana:9.4.3
    volumes:
      - /home/user/repo/common/devops/observability/grafana-datasources.yml:/etc/grafana/provisioning/datasources/datasources.yml
    environment:
      GF_AUTH_ANONYMOUS_ENABLED: 'false'
      GF_AUTH_ANONYMOUS_ORG_ROLE: 'Admin'
    expose:
      - '3000'
    labels:
      - traefik.constraint-label=traefik
      - traefik.http.middlewares.https-redirect.redirectscheme.scheme=https
      - traefik.http.middlewares.https-redirect.redirectscheme.permanent=true
      - traefik.http.routers.grafana-http.rule=Host(`swarm-grafana.mydomain.com`)
      - traefik.http.routers.grafana-http.entrypoints=http
      - traefik.http.routers.grafana-http.middlewares=https-redirect
      # grafana-https is the actual router, using HTTPS
      - traefik.http.routers.grafana-https.rule=Host(`swarm-grafana.mydomain.com`)
      - traefik.http.routers.grafana-https.entrypoints=https
      - traefik.http.routers.grafana-https.tls=true
      # Send this router's traffic to the grafana service
      - traefik.http.routers.grafana-https.service=grafana
      # Use the "le" (Let's Encrypt) certificate resolver
      - traefik.http.routers.grafana-https.tls.certresolver=le
      # Tell Traefik which container port Grafana listens on
      - traefik.http.services.grafana.loadbalancer.server.port=3000
    deploy:
      placement:
        constraints:
          - node.hostname==dockerswa2V8BY4
    networks:
      - traefik
  # Tempo runs as user 10001, and docker compose creates the volume as root.
  # As such, we need to chown the volume in order for Tempo to start correctly.
  init:
    image: &tempoImage grafana/tempo:latest
    user: root
    entrypoint:
      - 'chown'
      - '10001:10001'
      - '/var/tempo'
    volumes:
      - /home/user/repo/tempo-data:/var/tempo
    deploy:
      placement:
        constraints:
          - node.hostname==dockerswa2V8BY4

  tempo:
    image: *tempoImage
    container_name: tempo
    command: ['-config.file=/etc/tempo.yaml']
    volumes:
      - /home/user/repo/common/devops/observability/tempo.yaml:/etc/tempo.yaml
      - /home/user/repo/tempo-data:/var/tempo
    deploy:
      placement:
        constraints:
          - node.hostname==dockerswa2V8BY4
    ports:
      - '14268' # jaeger ingest
      - '3200' # tempo
      - '4317' # otlp grpc
      - '4318' # otlp http
      - '9411' # zipkin
    depends_on:
      - init
    networks:
      - traefik
networks:
  traefik:
    external: true

Pretty straightforward, so I proceeded to deploy it into the swarm:

docker stack deploy -c docker-compose.yml observability

Everything deployed properly, but when I viewed the Traefik logs there was an issue with every service except grafana. I got errors like this:

traefik_traefik.1.tm5iqb9x59on@dockerswa2V8BY4    | 2024-05-11T13:14:16Z ERR error="service \"observability-prometheus\" error: port is missing" container=observability-prometheus-37i852h4o36c23lzwuu9pvee9 providerName=swarm

It drove me crazy for about half a day. I couldn’t find any reason why the grafana service worked as expected but none of the others did. Part of my love/hate relationship with Traefik stems from the fact that configuration issues like this can be hard to track down and debug. Ultimately, after lots of searching and banging my head against a wall, I found the answer in the Traefik docs and thought I would share it here for anyone else who might run into this issue. Again, this solution is specific to Docker Swarm mode.

https://doc.traefik.io/traefik/providers/swarm/#configuration-examples

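Expand the first example in that section and you will see the solution: in Swarm mode, Traefik reads labels from the service rather than from the individual containers, so in a compose file they belong under deploy.

It turns out I just needed to update my docker-compose.yml to nest the labels under the deploy section and redeploy, and everything worked as expected.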
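
For illustration, here is roughly what the grafana service looks like with the labels moved under deploy. This is a sketch rather than a copy of my final file: it reuses the labels from the compose file above unchanged and trims the volumes and environment settings for brevity, so treat the exact layout as an assumption.

```yaml
services:
  grafana:
    image: grafana/grafana:9.4.3
    deploy:
      placement:
        constraints:
          - node.hostname==dockerswa2V8BY4
      # In Swarm mode the Traefik labels sit here, on the service,
      # instead of at the top level of the service definition
      labels:
        - traefik.constraint-label=traefik
        - traefik.http.middlewares.https-redirect.redirectscheme.scheme=https
        - traefik.http.middlewares.https-redirect.redirectscheme.permanent=true
        - traefik.http.routers.grafana-http.rule=Host(`swarm-grafana.mydomain.com`)
        - traefik.http.routers.grafana-http.entrypoints=http
        - traefik.http.routers.grafana-http.middlewares=https-redirect
        - traefik.http.routers.grafana-https.rule=Host(`swarm-grafana.mydomain.com`)
        - traefik.http.routers.grafana-https.entrypoints=https
        - traefik.http.routers.grafana-https.tls=true
        - traefik.http.routers.grafana-https.service=grafana
        - traefik.http.routers.grafana-https.tls.certresolver=le
        # Swarm services need the port spelled out for Traefik
        - traefik.http.services.grafana.loadbalancer.server.port=3000
    networks:
      - traefik
networks:
  traefik:
    external: true
```

The only structural change is the indentation: labels becomes a child of deploy, next to placement, and the label values themselves stay the same.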