I recently decided to set up a Docker Swarm cluster for a project I was working on. If you aren’t familiar with Swarm mode, it is similar in some ways to Kubernetes (k8s) but with much less complexity, and it is built into Docker. If you are looking for a fairly straightforward way to deploy containers across a number of nodes without all the overhead of k8s, it can be a good choice, although it isn’t a very popular or widespread solution these days.
Anyway, I set up a Virtual Machine Scale Set in Azure with 10 Ubuntu 22.04 VMs and wrote some Ansible scripts to automate installing Docker on each machine, setting up 3 of them as swarm managers and the other 7 as worker nodes. I then ssh’d into the primary manager node and created a Docker Compose file for launching an observability stack.
Here is what that docker-compose.yml looks like:
---
services:
  otel-collector:
    image: otel/opentelemetry-collector-contrib:0.88.0
    volumes:
      - /home/user/repo/common/devops/observability/otel-config.yaml:/etc/otel/config.yaml
      - /home/user/repo/log:/log/otel
    command: --config /etc/otel/config.yaml
    environment:
      JAEGER_ENDPOINT: 'tempo:4317'
      LOKI_ENDPOINT: 'http://loki:3100/loki/api/v1/push'
    ports:
      - '8889:8889' # Prometheus metrics exporter (scrape endpoint)
      - '13133:13133' # health_check extension
      - '55679:55679' # ZPages extension
    deploy:
      placement:
        constraints:
          - node.hostname==dockerswa2V8BY4
    networks:
      - traefik
  prometheus:
    container_name: prometheus
    image: prom/prometheus:v2.42.0
    volumes:
      - /home/user/repo/common/devops/observability/prometheus.yml:/etc/prometheus/prometheus.yml
    ports:
      - '9090:9090'
    deploy:
      placement:
        constraints:
          - node.hostname==dockerswa2V8BY4
    networks:
      - traefik
  loki:
    container_name: loki
    image: grafana/loki:2.7.4
    ports:
      - '3100:3100'
    networks:
      - traefik
  grafana:
    container_name: grafana
    image: grafana/grafana:9.4.3
    volumes:
      - /home/user/repo/common/devops/observability/grafana-datasources.yml:/etc/grafana/provisioning/datasources/datasources.yml
    environment:
      GF_AUTH_ANONYMOUS_ENABLED: 'false'
      GF_AUTH_ANONYMOUS_ORG_ROLE: 'Admin'
    expose:
      - '3000'
    labels:
      - traefik.constraint-label=traefik
      - traefik.http.middlewares.https-redirect.redirectscheme.scheme=https
      - traefik.http.middlewares.https-redirect.redirectscheme.permanent=true
      - traefik.http.routers.grafana-http.rule=Host(`swarm-grafana.mydomain.com`)
      - traefik.http.routers.grafana-http.entrypoints=http
      - traefik.http.routers.grafana-http.middlewares=https-redirect
      # traefik-https the actual router using HTTPS
      # Uses the environment variable DOMAIN
      - traefik.http.routers.grafana-https.rule=Host(`swarm-grafana.mydomain.com`)
      - traefik.http.routers.grafana-https.entrypoints=https
      - traefik.http.routers.grafana-https.tls=true
      # Use the special Traefik service api@internal with the web UI/Dashboard
      - traefik.http.routers.grafana-https.service=grafana
      # Use the "le" (Let's Encrypt) resolver created below
      - traefik.http.routers.grafana-https.tls.certresolver=le
      # Enable HTTP Basic auth, using the middleware created above
      - traefik.http.services.grafana.loadbalancer.server.port=3000
    deploy:
      placement:
        constraints:
          - node.hostname==dockerswa2V8BY4
    networks:
      - traefik
  # Tempo runs as user 10001, and docker compose creates the volume as root.
  # As such, we need to chown the volume in order for Tempo to start correctly.
  init:
    image: &tempoImage grafana/tempo:latest
    user: root
    entrypoint:
      - 'chown'
      - '10001:10001'
      - '/var/tempo'
    volumes:
      - /home/user/repo/tempo-data:/var/tempo
    deploy:
      placement:
        constraints:
          - node.hostname==dockerswa2V8BY4
  tempo:
    image: *tempoImage
    container_name: tempo
    command: ['-config.file=/etc/tempo.yaml']
    volumes:
      - /home/user/repo/common/devops/observability/tempo.yaml:/etc/tempo.yaml
      - /home/user/repo/tempo-data:/var/tempo
    deploy:
      placement:
        constraints:
          - node.hostname==dockerswa2V8BY4
    ports:
      - '14268' # jaeger ingest
      - '3200' # tempo
      - '4317' # otlp grpc
      - '4318' # otlp http
      - '9411' # zipkin
    depends_on:
      - init
    networks:
      - traefik
networks:
  traefik:
    external: true
Pretty straightforward, so I proceeded to deploy it into the swarm:
docker stack deploy -c docker-compose.yml observability
Everything deployed properly, but when I viewed the Traefik logs there was an issue with every service except grafana. I got errors like this:
traefik_traefik.1.tm5iqb9x59on@dockerswa2V8BY4 | 2024-05-11T13:14:16Z ERR error="service \"observability-prometheus\" error: port is missing" container=observability-prometheus-37i852h4o36c23lzwuu9pvee9 providerName=swarm
It drove me crazy for about half a day or so. I couldn’t find any reason why the grafana service worked as expected but none of the others did. Part of my love/hate relationship with Traefik stems from the fact that configuration issues like this can be hard to track down and debug. Ultimately, after lots of searching and banging my head against a wall, I found the answer in the Traefik docs and thought I would share it here for anyone else who might run into this issue. Again, this solution is specific to Docker Swarm mode.
https://doc.traefik.io/traefik/providers/swarm/#configuration-examples
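Expand the first configuration example there and you will see the solution. In Swarm mode, Traefik uses the labels found on the service rather than on individual containers, so in a compose file they have to be defined in the deploy part of the service. It turns out I just needed to update my docker-compose.yml to nest the labels under a deploy section and redeploy, and everything was working as expected.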
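For reference, here is a minimal sketch of what that change looks like for the grafana service. It is based on the compose file above, not a copy of my final file: only the position of the labels block changes, and the volumes, environment and comments are left out for brevity.

```yaml
services:
  grafana:
    image: grafana/grafana:9.4.3
    networks:
      - traefik
    deploy:
      placement:
        constraints:
          - node.hostname==dockerswa2V8BY4
      # In Swarm mode Traefik reads labels from the service,
      # so they go under deploy instead of directly under the service.
      labels:
        - traefik.constraint-label=traefik
        - traefik.http.middlewares.https-redirect.redirectscheme.scheme=https
        - traefik.http.middlewares.https-redirect.redirectscheme.permanent=true
        - traefik.http.routers.grafana-http.rule=Host(`swarm-grafana.mydomain.com`)
        - traefik.http.routers.grafana-http.entrypoints=http
        - traefik.http.routers.grafana-http.middlewares=https-redirect
        - traefik.http.routers.grafana-https.rule=Host(`swarm-grafana.mydomain.com`)
        - traefik.http.routers.grafana-https.entrypoints=https
        - traefik.http.routers.grafana-https.tls=true
        - traefik.http.routers.grafana-https.service=grafana
        - traefik.http.routers.grafana-https.tls.certresolver=le
        # Swarm gives Traefik no port detection, so the port label is required.
        - traefik.http.services.grafana.loadbalancer.server.port=3000

networks:
  traefik:
    external: true
```

Running the same docker stack deploy -c docker-compose.yml observability command again updates the existing stack with the new labels. For services that should not be routed through Traefik at all, the Traefik docs also describe a traefik.enable=false label, which would go in the same deploy labels block. I mention it only as an option.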