Self-hosted n8n does not recover automatically after the underlying Postgres instance restarts.
When I open n8n in the browser, it fetches all the JS/CSS as usual, but the /rest/login request never finishes (it waits indefinitely for a response), so only a blank screen is shown.
This has happened several times over the last few weeks.
I have to restart the n8n container manually.
What is the error message (if any)?
In the n8n container logs I see this (100% relevant: the postgres container is dead):
Error: getaddrinfo EAI_AGAIN postgres
or this (not sure whether it is relevant):
Error: Connection terminated unexpectedly
In the postgres container logs I see this:
2024-08-31 10:04:23.001 UTC [480778] LOG: checkpoint starting: time
2024-08-31 10:04:31.693 UTC [480778] LOG: checkpoint complete: wrote 52 buffers (0.3%); 0 WAL file(s) added, 0 removed, 0 recycled; write=8.424 s, sync=0.002 s, total=8.656 s; sync files=16, longest=0.001 s, average=0.001 s; distance=343 kB, estimate=713 kB; lsn=0/E5AC52D8, redo lsn=0/E5AC52A0
2024-08-31 10:07:34.843 UTC [1] LOG: server process (PID 901754) exited with exit code 2
2024-08-31 10:07:34.894 UTC [1] LOG: terminating any other active server processes
2024-08-31 10:07:35.882 UTC [1] LOG: all server processes terminated; reinitializing
2024-08-31 10:07:41.025 UTC [901757] LOG: database system was interrupted; last known up at 2024-08-31 10:04:31 UTC
2024-08-31 10:07:44.640 UTC [901757] LOG: database system was not properly shut down; automatic recovery in progress
2024-08-31 10:07:45.083 UTC [901757] LOG: redo starts at 0/E5AC52A0
2024-08-31 10:07:45.084 UTC [901757] LOG: invalid record length at 0/E5AC5388: expected at least 24, got 0
2024-08-31 10:07:45.084 UTC [901757] LOG: redo done at 0/E5AC5350 system usage: CPU: user: 0.00 s, system: 0.07 s, elapsed: 0.18 s
2024-08-31 10:07:45.210 UTC [901758] LOG: checkpoint starting: end-of-recovery immediate wait
2024-08-31 10:07:45.648 UTC [901758] LOG: checkpoint complete: wrote 2 buffers (0.0%); 0 WAL file(s) added, 0 removed, 0 recycled; write=0.025 s, sync=0.091 s, total=0.544 s; sync files=3, longest=0.090 s, average=0.030 s; distance=0 kB, estimate=0 kB; lsn=0/E5AC5388, redo lsn=0/E5AC5388
2024-08-31 10:07:48.478 UTC [1] LOG: database system is ready to accept connections
Please share your workflow
n/a
Share the output returned by the last node
n/a
Information on your n8n setup
n8n version: 1.47.3
Database (default: SQLite): postgres v16
n8n EXECUTIONS_PROCESS setting (default: own, main): not set (so, probably own)
Running n8n via (Docker, npm, n8n cloud, desktop app): Docker
I found a great thread about health checks for Docker containers.
I tried to implement that strategy, but the /healthz endpoint still returns HTTP 200 OK while my postgres container is stopped.
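For context, what I tried boils down to roughly the following docker-compose health check (the intervals and timeouts here are illustrative, not necessarily what the thread recommends):

```yaml
services:
  n8n:
    image: docker.n8n.io/n8nio/n8n:1.47.3
    healthcheck:
      # /healthz only confirms that the web server responds; it does not touch the database
      test: ["CMD-SHELL", "wget -q -O /dev/null http://127.0.0.1:5678/healthz || exit 1"]
      interval: 30s
      timeout: 5s
      retries: 3
      start_period: 30s
```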
Here is the proof that /healthz still returns 200 while postgres is stopped:
root@user:~# docker ps -a
CONTAINER ID   IMAGE                            COMMAND                  CREATED       STATUS                      PORTS                                                                      NAMES
04b49fd7a88d   docker.n8n.io/n8nio/n8n:1.47.3   "tini -- /docker-ent…"   12 days ago   Up 48 minutes               0.0.0.0:5678->5678/tcp, :::5678->5678/tcp                                  n8n-infra-n8n-1
c7a5d6677707   postgres:16                      "docker-entrypoint.s…"   12 days ago   Exited (0) 31 minutes ago                                                                              n8n-infra-postgres-1
7140108c7903   traefik:v3.1                     "/entrypoint.sh --ap…"   12 days ago   Up 7 days                   0.0.0.0:80->80/tcp, :::80->80/tcp, 0.0.0.0:443->443/tcp, :::443->443/tcp   n8n-infra-traefik-1
root@user:~# docker exec n8n-infra-n8n-1 wget -S http://127.0.0.1:5678/healthz -O /dev/null
Connecting to 127.0.0.1:5678 (127.0.0.1:5678)
HTTP/1.1 200 OK
Content-Type: application/json; charset=utf-8
Content-Length: 15
ETag: W/"f-VaSQ4oDUiZblZNAEkkN+sX+q3Sg"
Date: Sat, 31 Aug 2024 20:08:13 GMT
Connection: close
saving to '/dev/null'
null 100% |********************************| 15 0:00:00 ETA
'/dev/null' saved
These are the last log lines of the n8n container:
Editor is now accessible via:
https://n8n.octoflow.ru:5678/
DatabaseError: terminating connection due to administrator command
Error: Connection terminated unexpectedly
Error: Connection terminated unexpectedly
I tried to check what /rest/login returns in this situation, but somehow it responds normally:
root@user:~# docker exec n8n-infra-n8n-1 wget -S http://127.0.0.1:5678/rest/login -O /dev/null
Connecting to 127.0.0.1:5678 (127.0.0.1:5678)
HTTP/1.1 401 Unauthorized
wget: server returned error: HTTP/1.1 401 Unauthorized
In the browser the request never finishes because the browser sends the auth cookie, n8n parses it and then has to query the database to resolve the session, and that query never comes back.
So I am not sure what to probe to reliably tell whether n8n is healthy.
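One idea would be to replay an authenticated request, so that the JWT lookup actually hits the database; something like this (untested, and the cookie name "n8n-auth" is my assumption, so copy the real cookie from the browser dev tools):

```sh
# Replaying the browser's auth cookie forces the request through
# AuthService.resolveJwt, which queries the database (see the stack trace below).
docker exec n8n-infra-n8n-1 \
  wget -S -T 10 --header "Cookie: n8n-auth=<jwt-from-browser>" \
  -O /dev/null http://127.0.0.1:5678/rest/login
```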
Opened n8n; my token was now considered invalid, so it redirected me to the sign-in page.
Got a bunch of DB error logs.
Started the postgres container manually.
Signed in, and everything works. Weird.
n8n logs
Editor is now accessible via:
https://n8n.octoflow.ru:5678/
Error: getaddrinfo EAI_AGAIN postgres
at /usr/local/lib/node_modules/n8n/node_modules/pg-pool/index.js:45:11
at processTicksAndRejections (node:internal/process/task_queues:95:5)
at PostgresDriver.obtainMasterConnection (/usr/local/lib/node_modules/n8n/node_modules/@n8n/typeorm/driver/postgres/PostgresDriver.js:883:28)
at PostgresQueryRunner.query (/usr/local/lib/node_modules/n8n/node_modules/@n8n/typeorm/driver/postgres/PostgresQueryRunner.js:178:36)
at SelectQueryBuilder.loadRawResults (/usr/local/lib/node_modules/n8n/node_modules/@n8n/typeorm/query-builder/SelectQueryBuilder.js:2192:25)
at SelectQueryBuilder.executeEntitiesAndRawResults (/usr/local/lib/node_modules/n8n/node_modules/@n8n/typeorm/query-builder/SelectQueryBuilder.js:2040:26)
at SelectQueryBuilder.getRawAndEntities (/usr/local/lib/node_modules/n8n/node_modules/@n8n/typeorm/query-builder/SelectQueryBuilder.js:684:29)
at SelectQueryBuilder.getOne (/usr/local/lib/node_modules/n8n/node_modules/@n8n/typeorm/query-builder/SelectQueryBuilder.js:711:25)
at AuthService.resolveJwt (/usr/local/lib/node_modules/n8n/dist/auth/auth.service.js:96:22)
at AuthService.authMiddleware (/usr/local/lib/node_modules/n8n/dist/auth/auth.service.js:48:28)
2024-08-31T20:47:18.061Z | error | Error: getaddrinfo EAI_AGAIN postgres "{ file: 'LoggerProxy.js', function: 'exports.error' }"
Error: getaddrinfo EAI_AGAIN postgres
at /usr/local/lib/node_modules/n8n/node_modules/pg-pool/index.js:45:11
at processTicksAndRejections (node:internal/process/task_queues:95:5)
at PostgresDriver.obtainMasterConnection (/usr/local/lib/node_modules/n8n/node_modules/@n8n/typeorm/driver/postgres/PostgresDriver.js:883:28)
at PostgresQueryRunner.query (/usr/local/lib/node_modules/n8n/node_modules/@n8n/typeorm/driver/postgres/PostgresQueryRunner.js:178:36)
at SelectQueryBuilder.loadRawResults (/usr/local/lib/node_modules/n8n/node_modules/@n8n/typeorm/query-builder/SelectQueryBuilder.js:2192:25)
at SelectQueryBuilder.executeEntitiesAndRawResults (/usr/local/lib/node_modules/n8n/node_modules/@n8n/typeorm/query-builder/SelectQueryBuilder.js:2040:26)
at SelectQueryBuilder.getRawAndEntities (/usr/local/lib/node_modules/n8n/node_modules/@n8n/typeorm/query-builder/SelectQueryBuilder.js:684:29)
at SelectQueryBuilder.getOne (/usr/local/lib/node_modules/n8n/node_modules/@n8n/typeorm/query-builder/SelectQueryBuilder.js:711:25)
at AuthService.resolveJwt (/usr/local/lib/node_modules/n8n/dist/auth/auth.service.js:96:22)
at AuthService.authMiddleware (/usr/local/lib/node_modules/n8n/dist/auth/auth.service.js:48:28)
2024-08-31T20:48:31.969Z | error | DatabaseError: the database system is starting up "{ file: 'LoggerProxy.js', function: 'exports.error' }"
2024-08-31T20:52:02.173Z | verbose | Execution for workflow REDACTED was assigned id 33884 "{\n executionId: '33884',\n file: 'WorkflowRunner.js',\n function: 'runMainProcess'\n}"
So I don’t know how to reproduce this now, but the problem occurred several times over the last two weeks, and every time I had to restart the n8n container even though postgres had already been back up for a long time.
Why postgres crashes in the first place is a separate issue; I will investigate that later.
How to perform a correct health check is still an open question.
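The closest thing I have so far is a Postgres-side health check plus a startup dependency; note that this only tells Docker when the database itself is down, it does not make n8n reconnect on its own (a sketch, values are illustrative):

```yaml
services:
  postgres:
    image: postgres:16
    healthcheck:
      # pg_isready ships with the official postgres image
      test: ["CMD-SHELL", "pg_isready -U $${POSTGRES_USER:-postgres}"]
      interval: 10s
      timeout: 5s
      retries: 5
  n8n:
    depends_on:
      postgres:
        # Only gates start-up order; it does not restart n8n if postgres
        # dies later while both containers are already running.
        condition: service_healthy
```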
This looks like an interesting one; I suspect that if the database is offline for long enough, n8n just gives up on reconnecting.
At some point we will likely add an option to our /health endpoint to also check the database status, which may help, but for now the only thing I can think of would be to monitor the logs as a health check.
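A rough sketch of that log-monitoring idea, run from cron on the Docker host (the container name and error strings are taken from the logs in this thread; adjust the time window and patterns for your setup):

```sh
#!/bin/sh
# Restart n8n if its recent logs show database connection errors.
# While postgres is still down this will keep restarting n8n until the
# database is reachable again.
if docker logs --since 2m n8n-infra-n8n-1 2>&1 \
     | grep -Eq 'getaddrinfo EAI_AGAIN postgres|Connection terminated unexpectedly'; then
  docker restart n8n-infra-n8n-1
fi
```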