Self-hosted n8n does not recover automatically after postgres restart

Describe the problem/error/question

Self-hosted n8n does not recover automatically after underlying Postgres restart.
When I open n8n in the browser, it loads all the JS/CSS as usual, but the /rest/login request never completes (it waits for a response indefinitely), so only a blank screen is shown.
This has happened several times over the last few weeks.
Each time I have to restart the n8n container manually.

What is the error message (if any)?

In the n8n container logs I see this (definitely relevant, since the Postgres container is dead at that point):

Error: getaddrinfo EAI_AGAIN postgres

or this (I don’t know if it is relevant):

Error: Connection terminated unexpectedly

In the Postgres container logs I see this:

2024-08-31 10:04:23.001 UTC [480778] LOG:  checkpoint starting: time
2024-08-31 10:04:31.693 UTC [480778] LOG:  checkpoint complete: wrote 52 buffers (0.3%); 0 WAL file(s) added, 0 removed, 0 recycled; write=8.424 s, sync=0.002 s, total=8.656 s; sync files=16, longest=0.001 s, average=0.001 s; distance=343 kB, estimate=713 kB; lsn=0/E5AC52D8, redo lsn=0/E5AC52A0
2024-08-31 10:07:34.843 UTC [1] LOG:  server process (PID 901754) exited with exit code 2
2024-08-31 10:07:34.894 UTC [1] LOG:  terminating any other active server processes
2024-08-31 10:07:35.882 UTC [1] LOG:  all server processes terminated; reinitializing
2024-08-31 10:07:41.025 UTC [901757] LOG:  database system was interrupted; last known up at 2024-08-31 10:04:31 UTC
2024-08-31 10:07:44.640 UTC [901757] LOG:  database system was not properly shut down; automatic recovery in progress
2024-08-31 10:07:45.083 UTC [901757] LOG:  redo starts at 0/E5AC52A0
2024-08-31 10:07:45.084 UTC [901757] LOG:  invalid record length at 0/E5AC5388: expected at least 24, got 0
2024-08-31 10:07:45.084 UTC [901757] LOG:  redo done at 0/E5AC5350 system usage: CPU: user: 0.00 s, system: 0.07 s, elapsed: 0.18 s
2024-08-31 10:07:45.210 UTC [901758] LOG:  checkpoint starting: end-of-recovery immediate wait
2024-08-31 10:07:45.648 UTC [901758] LOG:  checkpoint complete: wrote 2 buffers (0.0%); 0 WAL file(s) added, 0 removed, 0 recycled; write=0.025 s, sync=0.091 s, total=0.544 s; sync files=3, longest=0.090 s, average=0.030 s; distance=0 kB, estimate=0 kB; lsn=0/E5AC5388, redo lsn=0/E5AC5388
2024-08-31 10:07:48.478 UTC [1] LOG:  database system is ready to accept connections

Please share your workflow

n/a

Share the output returned by the last node

n/a

Information on your n8n setup

  • n8n version: 1.47.3
  • Database (default: SQLite): postgres v16
  • n8n EXECUTIONS_PROCESS setting (default: own, main): not set (so presumably own)
  • Running n8n via (Docker, npm, n8n cloud, desktop app): Docker
  • Operating system: Ubuntu 22.04

docker-compose

services:
  traefik:
    image: traefik:v3.1
    restart: always
    command:
      - '--api=true'
      - '--api.insecure=true'
      - '--api.dashboard=true'
      - '--providers.docker=true'
      - '--providers.docker.exposedbydefault=false'
      - '--entrypoints.web.address=:80'
      - '--entrypoints.websecure.address=:443'
      - '--entrypoints.web.http.redirections.entryPoint.to=websecure'
      - '--entrypoints.web.http.redirections.entryPoint.scheme=https'
      - '--certificatesresolvers.mytlschallenge.acme.tlschallenge=true'
      - '--certificatesresolvers.mytlschallenge.acme.email=${SSL_EMAIL}'
      - '--certificatesresolvers.mytlschallenge.acme.storage=/letsencrypt/acme.json'
    ports:
      - '80:80'
      - '443:443'
    volumes:
      - ${LETSENCRYPT_DATA_PATH}:/letsencrypt
      - /var/run/docker.sock:/var/run/docker.sock:ro

  postgres:
    image: postgres:16
    restart: always
    environment:
      - POSTGRES_USER
      - POSTGRES_PASSWORD
      - POSTGRES_DB
      - POSTGRES_NON_ROOT_USER
      - POSTGRES_NON_ROOT_PASSWORD
    volumes:
      - ${POSTGRES_DATA_PATH}:/var/lib/postgresql/data
      - ./init-data.sh:/docker-entrypoint-initdb.d/init-data.sh
    healthcheck:
      test: ['CMD-SHELL', 'pg_isready -h localhost -U ${POSTGRES_USER} -d ${POSTGRES_DB}']
      interval: 5s
      timeout: 5s
      retries: 10

  n8n:
    image: docker.n8n.io/n8nio/n8n:1.47.3
    restart: always
    environment:
      - GENERIC_TIMEZONE=Europe/London
      - TZ=Europe/London
      - DB_TYPE=postgresdb
      - DB_POSTGRESDB_HOST=postgres
      - DB_POSTGRESDB_PORT=5432
      - DB_POSTGRESDB_DATABASE=${POSTGRES_DB}
      - DB_POSTGRESDB_USER=${POSTGRES_NON_ROOT_USER}
      - DB_POSTGRESDB_PASSWORD=${POSTGRES_NON_ROOT_PASSWORD}
      - N8N_HOST=${DOMAIN_NAME}
      - N8N_PORT=5678
      - N8N_PROTOCOL=https
      - NODE_ENV=production
      - WEBHOOK_URL=https://${DOMAIN_NAME}
      - N8N_AUTH_EXCLUDE_ENDPOINTS=api
      - N8N_DEFAULT_BINARY_DATA_MODE=filesystem
    ports:
      - '5678:5678'
    labels:
      - traefik.enable=true
      - traefik.http.routers.n8n.rule=Host(`${DOMAIN_NAME}`)
      - traefik.http.routers.n8n.tls=true
      - traefik.http.routers.n8n.entrypoints=websecure
      - traefik.http.routers.n8n.tls.certresolver=mytlschallenge
      - traefik.http.middlewares.n8n.headers.SSLRedirect=true
      - traefik.http.middlewares.n8n.headers.STSSeconds=315360000
      - traefik.http.middlewares.n8n.headers.browserXSSFilter=true
      - traefik.http.middlewares.n8n.headers.contentTypeNosniff=true
      - traefik.http.middlewares.n8n.headers.forceSTSHeader=true
      - traefik.http.middlewares.n8n.headers.SSLHost=${DOMAIN_NAME}
      - traefik.http.middlewares.n8n.headers.STSIncludeSubdomains=true
      - traefik.http.middlewares.n8n.headers.STSPreload=true
      - traefik.http.routers.n8n.middlewares=n8n@docker
    volumes:
      - ${N8N_DATA_PATH}:/home/node/.n8n
      - ${N8N_LOCAL_FILES_PATH}:/files
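    # Note: condition: service_healthy below only gates startup ordering; it does
    # not restart n8n if Postgres dies later while both containers keep running.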
    depends_on:
      postgres:
        condition: service_healthy

I found a great thread about healthchecks for Docker containers.

I tried to implement that strategy, but the /healthz endpoint still returns HTTP 200 OK when my Postgres container is stopped.
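For context, the strategy is roughly this healthcheck on the n8n service (a minimal sketch of what I tried; the interval/timeout values are arbitrary):

  n8n:
    # ...
    healthcheck:
      # busybox wget is available inside the n8n image
      test: ['CMD-SHELL', 'wget -q --spider http://127.0.0.1:5678/healthz || exit 1']
      interval: 30s
      timeout: 5s
      retries: 3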

Here is the proof:

root@user:~# docker ps -a
CONTAINER ID   IMAGE                            COMMAND                  CREATED       STATUS                      PORTS                                                                      NAMES
04b49fd7a88d   docker.n8n.io/n8nio/n8n:1.47.3   "tini -- /docker-ent…"   12 days ago   Up 48 minutes               0.0.0.0:5678->5678/tcp, :::5678->5678/tcp                                  n8n-infra-n8n-1
c7a5d6677707   postgres:16                      "docker-entrypoint.s…"   12 days ago   Exited (0) 31 minutes ago                                                                              n8n-infra-postgres-1
7140108c7903   traefik:v3.1                     "/entrypoint.sh --ap…"   12 days ago   Up 7 days                   0.0.0.0:80->80/tcp, :::80->80/tcp, 0.0.0.0:443->443/tcp, :::443->443/tcp   n8n-infra-traefik-1
root@user:~# docker exec n8n-infra-n8n-1 wget -S http://127.0.0.1:5678/healthz -O /dev/null
Connecting to 127.0.0.1:5678 (127.0.0.1:5678)
  HTTP/1.1 200 OK
  Content-Type: application/json; charset=utf-8
  Content-Length: 15
  ETag: W/"f-VaSQ4oDUiZblZNAEkkN+sX+q3Sg"
  Date: Sat, 31 Aug 2024 20:08:13 GMT
  Connection: close

saving to '/dev/null'
null                 100% |********************************|    15  0:00:00 ETA
'/dev/null' saved

These are the last log lines of the n8n container:

Editor is now accessible via:
https://n8n.octoflow.ru:5678/
DatabaseError: terminating connection due to administrator command
Error: Connection terminated unexpectedly
Error: Connection terminated unexpectedly

I tried to check what /rest/login returns in this situation, but somehow it works normally:

root@user:~# docker exec n8n-infra-n8n-1 wget -S http://127.0.0.1:5678/rest/login -O /dev/null
Connecting to 127.0.0.1:5678 (127.0.0.1:5678)
  HTTP/1.1 401 Unauthorized
wget: server returned error: HTTP/1.1 401 Unauthorized

In my browser it never finishes because the browser sends valid cookies: n8n parses them normally, then queries the database to resolve the session, gets stuck there, and the request never completes.
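Presumably the hang can be reproduced from the CLI by replaying the browser’s auth cookie (the cookie name n8n-auth is my assumption; copy the value from the browser dev tools):

root@user:~# docker exec n8n-infra-n8n-1 wget -S -O /dev/null \
    --header "Cookie: n8n-auth=<JWT value from the browser>" \
    http://127.0.0.1:5678/rest/login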

Now I have no idea what I can probe to tell whether n8n is working normally.

One more round of investigation (equivalent commands follow the list):

  • added more logging to n8n
  • docker-compose down/up
  • all started
  • stopped postgres container manually
  • opened n8n; my token was now invalid, so I was redirected to the sign-in page
  • got a bunch of db error logs
  • started postgres container manually
  • signed in, and everything works! Weird.
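In commands, that was roughly (container names as in the docker ps output above):

root@user:~# docker-compose down && docker-compose up -d
root@user:~# docker stop n8n-infra-postgres-1
# open n8n in the browser, collect the error logs
root@user:~# docker start n8n-infra-postgres-1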

n8n logs

Editor is now accessible via:
https://n8n.octoflow.ru:5678/
Error: getaddrinfo EAI_AGAIN postgres
    at /usr/local/lib/node_modules/n8n/node_modules/pg-pool/index.js:45:11
    at processTicksAndRejections (node:internal/process/task_queues:95:5)
    at PostgresDriver.obtainMasterConnection (/usr/local/lib/node_modules/n8n/node_modules/@n8n/typeorm/driver/postgres/PostgresDriver.js:883:28)
    at PostgresQueryRunner.query (/usr/local/lib/node_modules/n8n/node_modules/@n8n/typeorm/driver/postgres/PostgresQueryRunner.js:178:36)
    at SelectQueryBuilder.loadRawResults (/usr/local/lib/node_modules/n8n/node_modules/@n8n/typeorm/query-builder/SelectQueryBuilder.js:2192:25)
    at SelectQueryBuilder.executeEntitiesAndRawResults (/usr/local/lib/node_modules/n8n/node_modules/@n8n/typeorm/query-builder/SelectQueryBuilder.js:2040:26)
    at SelectQueryBuilder.getRawAndEntities (/usr/local/lib/node_modules/n8n/node_modules/@n8n/typeorm/query-builder/SelectQueryBuilder.js:684:29)
    at SelectQueryBuilder.getOne (/usr/local/lib/node_modules/n8n/node_modules/@n8n/typeorm/query-builder/SelectQueryBuilder.js:711:25)
    at AuthService.resolveJwt (/usr/local/lib/node_modules/n8n/dist/auth/auth.service.js:96:22)
    at AuthService.authMiddleware (/usr/local/lib/node_modules/n8n/dist/auth/auth.service.js:48:28)
2024-08-31T20:47:18.061Z | error    | Error: getaddrinfo EAI_AGAIN postgres "{ file: 'LoggerProxy.js', function: 'exports.error' }"
Error: getaddrinfo EAI_AGAIN postgres
    at /usr/local/lib/node_modules/n8n/node_modules/pg-pool/index.js:45:11
    at processTicksAndRejections (node:internal/process/task_queues:95:5)
    at PostgresDriver.obtainMasterConnection (/usr/local/lib/node_modules/n8n/node_modules/@n8n/typeorm/driver/postgres/PostgresDriver.js:883:28)
    at PostgresQueryRunner.query (/usr/local/lib/node_modules/n8n/node_modules/@n8n/typeorm/driver/postgres/PostgresQueryRunner.js:178:36)
    at SelectQueryBuilder.loadRawResults (/usr/local/lib/node_modules/n8n/node_modules/@n8n/typeorm/query-builder/SelectQueryBuilder.js:2192:25)
    at SelectQueryBuilder.executeEntitiesAndRawResults (/usr/local/lib/node_modules/n8n/node_modules/@n8n/typeorm/query-builder/SelectQueryBuilder.js:2040:26)
    at SelectQueryBuilder.getRawAndEntities (/usr/local/lib/node_modules/n8n/node_modules/@n8n/typeorm/query-builder/SelectQueryBuilder.js:684:29)
    at SelectQueryBuilder.getOne (/usr/local/lib/node_modules/n8n/node_modules/@n8n/typeorm/query-builder/SelectQueryBuilder.js:711:25)
    at AuthService.resolveJwt (/usr/local/lib/node_modules/n8n/dist/auth/auth.service.js:96:22)
    at AuthService.authMiddleware (/usr/local/lib/node_modules/n8n/dist/auth/auth.service.js:48:28)
2024-08-31T20:48:31.969Z | error    | DatabaseError: the database system is starting up "{ file: 'LoggerProxy.js', function: 'exports.error' }"
2024-08-31T20:52:02.173Z | verbose  | Execution for workflow REDACTED was assigned id 33884 "{\n  executionId: '33884',\n  file: 'WorkflowRunner.js',\n  function: 'runMainProcess'\n}"

So I don’t know how to reproduce it now, but this problem occurred several times over two weeks, and every time I had to restart the n8n container even though Postgres had already been back up for a long time.

Why Postgres crashes in the first place is a separate issue; I will investigate that later.

How to perform a healthcheck correctly is still an open question.

Hey @vkarbovnichy,

This looks like an interesting one. I suspect that if the database is offline for a long time, n8n just gives up.

At some point we will likely add an option to our /health endpoint to also check the database status, which may help, but for now the only thing I can think of is to monitor the logs as a health check.
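For example, something along these lines on the host, run from cron every minute or so (a rough sketch; the container name and error strings are taken from the logs above):

# Restart n8n if the recent logs contain the stale-connection errors
docker logs --since 2m n8n-infra-n8n-1 2>&1 \
  | grep -qE 'EAI_AGAIN postgres|Connection terminated unexpectedly' \
  && docker restart n8n-infra-n8n-1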

Does this PR solve the problem for non-worker nodes too? I’m not quite sure.

So basically, once the PR gets merged and published in a release, I can try to use /healthz/readiness for the original problem.
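If that works, the end state could look something like this (a sketch, assuming the readiness endpoint returns a non-200 status while the database is unreachable; willfarrell/autoheal restarts containers that report unhealthy):

  n8n:
    # ...
    labels:
      - autoheal=true
    healthcheck:
      test: ['CMD-SHELL', 'wget -q --spider http://127.0.0.1:5678/healthz/readiness || exit 1']
      interval: 30s
      timeout: 5s
      retries: 3

  autoheal:
    image: willfarrell/autoheal
    restart: always
    environment:
      - AUTOHEAL_CONTAINER_LABEL=autoheal
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock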