Introduction
I've been running digrin.com on a Scaleway.com VM instance since 2017 without too many issues. It was a server with 4GB of RAM and 4 shared cores, running Rancher 1.6 to orchestrate docker containers, and it cost about $10 monthly. However, I was running close to the memory limit: I had to merge workers into a single container and limit redis memory. Rancher 1.6 has been deprecated for quite a few years now; I originally installed it because we used it at work and I wanted to get familiar with it. As I didn't experience outages or unknown issues, I was fine with running it even though it was deprecated. But on 13.3.2023 I got an email from the server provider titled "Deprecation of Bootscript feature", so I thought I would try to migrate to something newer. As we use kubernetes at work, I decided to give that a try. I will write my notes and links here, in case I need to come back to them in the future.
Requirements
As I mentioned, I'm running digrin.com on a Scaleway.com instance for $10. I like to keep my costs low, so that I don't need to squeeze money out of digrin.com. I compared hetzner to scaleway and hetzner was a bit cheaper: I got 8GB of RAM and 4 shared cores for €15.72. Two weeks later hetzner released new ARM64 servers with 8GB of RAM and 4 shared cores for €7.73 (I tried to run my stack on an ARM64 server, but Gitlab does not support builds there). I was considering managed PostgreSQL, Redis and a k8s cluster, but prices there are nowhere near what I am willing to pay for hobby projects. IMO I should be fine with 8GB of RAM and 4 cores, and I will not need multiple servers for quite some time. I chose k3s because it's lightweight, and I've read Rancher did quite a nice job there.
Setup
I got a server from hetzner with the latest Debian and installed k3s with:
curl -sfL https://get.k3s.io | INSTALL_K3S_EXEC="server" sh
It didn't work out of the box, though; I first had to install AppArmor:
$ apt update && apt install apparmor apparmor-utils
and then install k3s again:
root@thor:~# curl -sfL https://get.k3s.io | INSTALL_K3S_EXEC="server" sh
[INFO] Finding release for channel stable
[INFO] Using v1.26.3+k3s1 as release
[INFO] Downloading hash https://github.com/k3s-io/k3s/releases/download/v1.26.3+k3s1/sha256sum-amd64.txt
[INFO] Skipping binary downloaded, installed k3s matches hash
[INFO] Skipping installation of SELinux RPM
[INFO] Skipping /usr/local/bin/kubectl symlink to k3s, already exists
[INFO] Skipping /usr/local/bin/crictl symlink to k3s, already exists
[INFO] Skipping /usr/local/bin/ctr symlink to k3s, command exists in PATH at /usr/bin/ctr
[INFO] Creating killall script /usr/local/bin/k3s-killall.sh
[INFO] Creating uninstall script /usr/local/bin/k3s-uninstall.sh
[INFO] env: Creating environment file /etc/systemd/system/k3s.service.env
[INFO] systemd: Creating service file /etc/systemd/system/k3s.service
[INFO] systemd: Enabling k3s unit
Created symlink /etc/systemd/system/multi-user.target.wants/k3s.service → /etc/systemd/system/k3s.service.
[INFO] systemd: Starting k3s
As I already have some work clusters in my kubeconfig, I could not just copy the kubeconfig from the k3s server over it. I used info from here to copy just the required parts into my kubeconfig file, so now I have 3 contexts there.
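For future reference, the merge can be done roughly like this (a minimal sketch; the file paths and the KUBECONFIG flattening trick are my own notes, not necessarily what the linked guide does; k3s keeps its kubeconfig at /etc/rancher/k3s/k3s.yaml):
# copy the k3s kubeconfig locally (change the server: field from 127.0.0.1 to the server's public IP first)
scp root@{server_ip}:/etc/rancher/k3s/k3s.yaml ~/.kube/k3s.yaml
# merge it with the existing kubeconfig and flatten everything into a single file
KUBECONFIG=~/.kube/config:~/.kube/k3s.yaml kubectl config view --flatten > ~/.kube/config_merged
mv ~/.kube/config_merged ~/.kube/config
kubectl config get-contexts   # the k3s context should now show up next to the work ones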
Redis container
The next step was to install a redis container that I will share across all my apps. I used this tutorial with memory limits and this ConfigMap:
apiVersion: v1
kind: ConfigMap
metadata:
  name: redis-config
data:
  redis-config: |
    # limit memory usage to something sane
    maxmemory 700mb
    # drop only least recently used keys on overuse
    maxmemory-policy volatile-lru
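The ConfigMap only does something once it is mounted into the redis pod. A minimal sketch of that wiring, following the pattern from the tutorial (the image tag, mount path and pod name are placeholders of mine):
apiVersion: v1
kind: Pod
metadata:
  name: redis
spec:
  containers:
    - name: redis
      image: redis:7
      # point redis at the config file rendered from the ConfigMap
      command: ["redis-server", "/redis-config/redis.conf"]
      volumeMounts:
        - name: config
          mountPath: /redis-config
  volumes:
    - name: config
      configMap:
        name: redis-config
        items:
          - key: redis-config
            path: redis.conf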
Postgres container
I used this tutorial to set up a postgres container, but when I tested restarting the deployment, it always broke the database within a few restarts. I googled whether it's a common issue and found this command, but it is dangerous and data loss might happen (it happened to me, only 4 tables out of 40 were left :D ), so this was not acceptable:
docker run --rm -it -u=postgres -e POSTGRES_PASSWORD={password} -v /mnt/data:/some_dir postgres:15.2 /bin/bash -c "pg_resetwal -f /some_dir"
As this was my first day playing with k3s, I was quite sad that a random tutorial wasn't foolproof. I asked about it on stackoverflow and it turns out I should have used a StatefulSet instead of a Deployment.
In Rancher 1.6 I was using PostgreSQL 13, so I thought I might as well migrate to postgres:15.2 here. It works fine, no breaking changes so far.
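For the record, here is a minimal StatefulSet sketch for postgres (the secret name, volume name and storage size are just placeholders, not my actual manifest; k3s's default local-path storage class provisions the volume):
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: postgres
spec:
  serviceName: postgres
  replicas: 1
  selector:
    matchLabels:
      app: postgres
  template:
    metadata:
      labels:
        app: postgres
    spec:
      containers:
        - name: postgres
          image: postgres:15.2
          envFrom:
            # assumed secret holding POSTGRES_USER / POSTGRES_PASSWORD / POSTGRES_DB
            - secretRef:
                name: postgres-secret
          ports:
            - containerPort: 5432
          volumeMounts:
            - name: pgdata
              mountPath: /var/lib/postgresql/data
  volumeClaimTemplates:
    - metadata:
        name: pgdata
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 10Gi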
Secrets
In Rancher 1.6 I used environment variables to pass secrets to containers; in k3s I use kubernetes secrets. I didn't want to manage a vault service, so I created a separate private repo where I store the secrets.
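Such a secret manifest can look roughly like this (the key names are made up; the secret name matches the one the web deployment references later via envFrom):
apiVersion: v1
kind: Secret
metadata:
  name: digrin-production-secret
type: Opaque
stringData:
  # example keys only, the real values live in the private repo
  DATABASE_URL: postgres://user:password@postgres:5432/digrin
  DJANGO_SECRET_KEY: change-me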
Container registry
Gitlab.com provides a registry, which I was using in Rancher 1.6. But Rancher 1.6 had a UI where I would just input a login and password and it would just work. I even had a backup registry there, because one day the gitlab registry was down and my server could not pull the image -> downtime.
In k3s it seems to be a bit harder; I used this tutorial to connect to the gitlab registry.
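I won't repeat the tutorial, but the core of it is an image pull secret along these lines (a sketch assuming a GitLab deploy token with the read_registry scope; the secret name is the one referenced in imagePullSecrets in the deployment below):
kubectl create secret docker-registry registry-credentials \
  --docker-server=registry.gitlab.com \
  --docker-username={deploy_token_username} \
  --docker-password={deploy_token_password}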
Staging web service
As I had Redis, PostgreSQL and the registry set up, I could finally start working on the web service. I didn't have a sandbox/staging environment in Rancher 1.6 because of memory limits, but here I could build one to make sure everything works before migrating production.
The first thing was to set up ALLOWED_HOSTS in Django settings, so I could access the service from the k3s cluster. In k3s I needed the django-allow-cidr middleware for Django, following this tutorial. I got stuck here for some time, as I was forcing CIDR ranges into ALLOWED_HOSTS instead of ALLOWED_CIDR_NETS (the names look totally the same :D ).
Ingress controller
In Rancher 1.6 I used an nginx container, so in k3s I also wanted the nginx ingress controller. But as I didn't disable Traefik during the first k3s install, I never got it working and figured that if Traefik is the default, I will just use that. There is a simple command to forward ports so I could view the Traefik dashboard on localhost (not that this was useful, but I want to keep the command here for later use, just in case):
kubectl port-forward -n kube-system traefik-56b8c5fb5c-sp67z 9000:9000
Dashboard is available on http://localhost:9000/dashboard/#/
In Rancher 1.6, I had to run certbot to renew https certificates quarterly (thank you Letsencrypt!). There was a template for that, so it was easy. With k3s, I found out Traefik can do that by default. It even supports multiple DNS providers, and as Cloudflare was one of them, I used that. I found a github gist that worked out of the box.
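I won't copy the gist here, but the idea is to override the bundled Traefik chart's values with a HelmChartConfig so it gets a Let's Encrypt resolver doing the Cloudflare DNS challenge. A minimal sketch with my own placeholder names (the resolver name, email and the secret holding the Cloudflare API token are not necessarily what the gist uses):
apiVersion: helm.cattle.io/v1
kind: HelmChartConfig
metadata:
  name: traefik
  namespace: kube-system
spec:
  valuesContent: |-
    additionalArguments:
      - "--certificatesresolvers.letsencrypt.acme.dnschallenge.provider=cloudflare"
      - "--certificatesresolvers.letsencrypt.acme.email=admin@example.com"
      - "--certificatesresolvers.letsencrypt.acme.storage=/data/acme.json"
    env:
      - name: CF_DNS_API_TOKEN
        valueFrom:
          secretKeyRef:
            name: cloudflare-api-token
            key: token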
Static files
In Rancher 1.6, I was using an nginx container which mounted static files from the host and served them. In k3s I wasn't using nginx, so I moved static file serving into the web container with whitenoise. I liked that nginx handled static files and I thought it was fast. With whitenoise there is a bit more work in the web container, but I guess at least it's closer to the app. Apart from installing whitenoise, I also had to exclude static file requests from logging (I don't like polluted logs).
Observability
logz.io was my logging choice in Rancher 1.6, as we also used it at work back in 2017. But the free tier keeps just 1 day of logs. At work we started playing with grafana, so I thought I would give that a try. Shipping logs there was easy, just running the one command grafana provided. I would prefer manifests I can edit instead of a one-off command, so I just dumped the generated manifests from k3s. For k3s logs, I had to set cri: {} (the pipeline stage that parses containerd's log format), so I can json-parse my logs in grafana (my app produces json logs). The free tier in grafana seems to be superior to logz.io (14 days retention), and I believe I can hook up metrics there as well (maybe in the future).
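For context, the relevant bit of the log-shipping config looks roughly like this (promtail-style pipeline stages; the json expressions are made-up field names, the rest of the scrape config is whatever grafana generated):
pipeline_stages:
  - cri: {}          # strip containerd's CRI prefix (timestamp, stream, flags) that k3s produces
  - json:            # then parse the application's json log line
      expressions:
        level: level
        message: message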
I used to use datadog in Rancher 1.6, but I had to remove it when I was facing memory issues. Only time will tell if grafana will be sufficient, or whether I will need to install the datadog agent as well.
Release
With Rancher 1.6, I was releasing from a gitlab CI job. Gitlab had support for kubernetes, but they deprecated it in favour of the GitLab agent for Kubernetes. As releasing is just running one kubectl command, I wanted to avoid having unnecessary agents running on my server, so I use a gitlab CI job to run the release, as mentioned in this blogpost.
deploy-web:
  stage: deploy-thor-production
  image:
    name: bitnami/kubectl:latest
    entrypoint: [""]
  script:
    - kubectl config set-cluster k3s --server="${K3S_SERVER}"
    - kubectl config set clusters.k3s.certificate-authority-data ${CERTIFICATE_AUTHORITY_DATA}
    - kubectl config set-credentials gitlab --token="${GITLAB_SA_TOKEN}"
    - kubectl config set-context default --cluster=k3s --user=gitlab
    - kubectl config use-context default
    - sed -i "s/<VERSION>/${CI_COMMIT_SHA}/g" k8s/digrin-production/deployment-web.yaml
    - kubectl apply -f k8s/digrin-production/deployment-web.yaml
  when: manual
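The GITLAB_SA_TOKEN above is a token for a service account inside the cluster. A minimal sketch of creating one (the account name and the very broad cluster-admin binding are my assumptions here; a tighter role would be better):
kubectl create serviceaccount gitlab -n default
kubectl create clusterrolebinding gitlab-deploy \
  --clusterrole=cluster-admin --serviceaccount=default:gitlab
# since kubernetes 1.24 tokens are no longer auto-created, request one explicitly
kubectl create token gitlab -n default --duration=8760h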
And my deployment-web.yaml looks something like this:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: digrin-web
  labels:
    app: digrin-web
spec:
  replicas: 2
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0%
  progressDeadlineSeconds: 120
  selector:
    matchLabels:
      app: digrin-web
  template:
    metadata:
      labels:
        app: digrin-web
    spec:
      containers:
        - image: registry.gitlab.com/digrin/digrin:<VERSION>
          name: digrin-web
          envFrom:
            - secretRef:
                name: digrin-production-secret
          command: ['/bin/sh', 'extras/docker-entrypoint.sh']
          ports:
            - containerPort: 8000
              name: gunicorn
          livenessProbe:
            httpGet:
              path: /robots.txt
              port: 8000
              scheme: HTTP
            initialDelaySeconds: 20
            periodSeconds: 10
            timeoutSeconds: 3
            failureThreshold: 9
          readinessProbe:
            httpGet:
              path: /robots.txt
              port: 8000
              scheme: HTTP
            initialDelaySeconds: 20
            periodSeconds: 2
            failureThreshold: 5
            timeoutSeconds: 2
      imagePullSecrets:
        - name: registry-credentials
Production Migration Plan
- Disable caching on Cloudflare. Add a firewall rule to block all traffic except my IP (cloudflare doesn't have a maintenance mode).
- Create a new backup of the up-to-date production database (see the dump sketch after this list). Download the backup and push it to the k3s PostgreSQL server:
  PGPASSWORD={password} psql -h {host/ip} -p {exposed_pg_port} -d {db_name} -U {pg_user} -f digrin_dump.sql -W
- Release the production version with async and periodic tasks enabled (they were disabled while testing).
- Run the script that backs up the PostgreSQL database.
- Test the web works fine:
  - Login
  - Check portfolio and all tabs
  - Create portfolio
  - Edit portfolio
  - Import test portfolio
- Update DNS records to point to the k3s server.
- Disable the firewall rule on cloudflare.com.
- Profit!
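The dump on the old server can be created roughly like this (a sketch, the flags and placeholders are mine):
PGPASSWORD={password} pg_dump -h {old_host} -p {old_pg_port} -U {pg_user} -d {db_name} \
  --no-owner --clean --if-exists > digrin_dump.sql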
Production Migration
In reality, the DB export and import took longer than I expected:
9:57 Cloudflare firewall rule applied.
10:06 psql DB load finished.
10:07 Workers being released.
10:11 DNS updated.
10:17 DNS applied, website runs.
Conclusion
I was quite happy with Rancher 1.6, but now I am even happier with k3s. I like having all manifests versioned now. That was not the case with Rancher 1.6, where I was storing docker compose files, but I believe they were mostly outdated. The same goes for secrets/env variables.
I started with Rancher 1.6 because, as a Python developer, I was quite lost whenever something devops-related had to be done at work. Installing my own Rancher 1.6 instance cost me a few hairs on the learning curve, but I was much more confident after setting it up. I believe it is the same case with k3s: I was struggling for a few evenings (a few hours every other day), but now I am more confident with kubernetes (still a noob, but it runs my hobby projects). If you also have a technology you are not confident in, I recommend playing with it!