N8n crash simulation

I am running n8n in a Docker container and I am trying to simulate sudden crashes/failures of the n8n server. When is the execution status regarded as cancelled, and when is it regarded as error?

Subject: Understanding Execution States During n8n Crashes

Hi @mohamedelnady-406,

Great question! Understanding execution states during failures is crucial for building resilient workflows. Let me explain the difference between cancelled and error states in n8n.

Execution Status Types

1. Error State

An execution is marked as error when:

- A node throws an exception/error during execution
- The workflow encounters a runtime error (API timeout, invalid data, etc.)
- The workflow completes abnormally while n8n itself is still running

Errors can be caught and handled via the workflow's error settings (for example, a node's On Error setting or a dedicated error workflow).

Example scenarios:

- HTTP Request node receives a 404/500 response
- Code node throws an exception
- Database connection fails
- Validation errors

2. Cancelled State

An execution is marked as cancelled when:

- A user manually stops the execution from the UI
- The n8n server crashes/restarts during execution
- The Docker container stops abruptly
- The process is killed (SIGKILL, SIGTERM)
- The system runs out of resources mid-execution

Key difference: cancelled means the execution was interrupted externally, not by a logical error in the workflow.
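The external-interruption cases mostly reduce to POSIX signal semantics: SIGTERM can be trapped, so a process gets a chance to shut down cleanly, while SIGKILL cannot be caught at all. A minimal sketch you can run outside Docker (no n8n involved) to see the difference:

```shell
#!/bin/sh
# SIGTERM can be trapped, so the process gets a chance to shut down cleanly.
sh -c 'trap "echo graceful; exit 0" TERM; sleep 30 & wait' &
pid=$!
sleep 1                      # give the child time to install its trap
kill -TERM "$pid"
wait "$pid"
term_code=$?                 # 0: the trap ran and the child exited cleanly
echo "SIGTERM exit code: $term_code"

# SIGKILL cannot be trapped: the process is terminated immediately.
sh -c 'trap "echo never printed" TERM; sleep 30 & wait' &
pid=$!
sleep 1
kill -KILL "$pid"
wait "$pid"
kill_code=$?                 # 137 = 128 + 9 (killed by signal 9)
echo "SIGKILL exit code: $kill_code"
```

This is why `docker stop` (SIGTERM first) gives n8n a chance to wind down, while `docker kill` (SIGKILL) never does.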

Simulating Crashes for Testing

To simulate different failure scenarios:

Scenario 1: Simulate Server Crash

```bash
# Force kill the n8n container
docker kill n8n

# Or restart it abruptly
docker restart n8n
```

Result: Running executions → cancelled

Scenario 2: Simulate Graceful Shutdown

```bash
# Send SIGTERM (graceful shutdown)
docker stop n8n
```

Result: n8n attempts to finish the current executions; anything still running when the grace period expires → cancelled
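The grace period matters here: `docker stop` sends SIGTERM and, after a default 10-second grace period, follows up with SIGKILL; you can lengthen it with `docker stop -t <seconds>` or in compose. Recent n8n versions also have a shutdown timeout of their own. I'm assuming the `N8N_GRACEFUL_SHUTDOWN_TIMEOUT` variable here, so verify it exists in your version:

```yaml
# docker-compose.yml fragment (sketch): give n8n time to wind down
services:
  n8n:
    image: n8nio/n8n
    environment:
      # Seconds n8n waits for in-flight work before exiting (assumed, n8n 1.x)
      - N8N_GRACEFUL_SHUTDOWN_TIMEOUT=60
    # Docker-side grace period before SIGKILL; keep it >= the value above
    stop_grace_period: 90s
```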

Scenario 3: Simulate Resource Exhaustion

```bash
# Limit container memory and trigger OOM
docker run --memory="512m" --name n8n n8nio/n8n
```

Result: Executions interrupted by the OOM killer → cancelled

Scenario 4: Trigger Error State (Not Crash)

Create a workflow with intentional errors:

- HTTP Request to a non-existent endpoint
- Code node with `throw new Error('Test error')`
- Invalid credentials

Result: Execution completes with error status

Practical Testing Approach

Here’s a workflow to test crash handling:

Create a long-running workflow:

```javascript
// Code node - simulates a long execution
const start = Date.now();
while (Date.now() - start < 60000) {
  // Run for 60 seconds
  await new Promise(resolve => setTimeout(resolve, 1000));
}
// The Code node expects an array of items to be returned
return [{ json: { success: true } }];
```

Start the workflow execution.

During the execution, simulate a crash:

```bash
docker kill n8n
docker start n8n
```

Check the execution status:

- It should show as cancelled
- It can be found in the execution history
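To check statuses without clicking through the UI, you can also read them straight from n8n's database. This is a sketch that assumes the default SQLite backend and the `execution_entity` table used by n8n 1.x; the path and schema may differ in your setup:

```shell
#!/bin/sh
# Sketch: list the most recent execution statuses from n8n's SQLite database.
# Assumes the default SQLite backend and the execution_entity table (n8n 1.x).
list_executions() {
  # $1: path to database.sqlite (defaults to the usual n8n data directory)
  db="${1:-$HOME/.n8n/database.sqlite}"
  sqlite3 "$db" \
    "SELECT id, status FROM execution_entity ORDER BY id DESC LIMIT 10;"
}
```

After a `docker kill` test run, call `list_executions` against the mounted data directory and compare the rows marked `canceled` with those marked `error`.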

Important Behaviors to Note

Queue Mode (if enabled)

With EXECUTIONS_MODE=queue (note: this variable is not prefixed with N8N_):

- Executions waiting in the queue survive crashes
- They are picked up automatically after recovery
- The final status depends on how far the execution had progressed when it crashed
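Queue mode needs a Redis instance and at least one worker process. A minimal compose sketch (service names are illustrative; check the variable names against your n8n version's docs):

```yaml
# docker-compose.yml sketch for queue mode: main instance + one worker + Redis
services:
  redis:
    image: redis:7
  n8n:
    image: n8nio/n8n
    environment:
      - EXECUTIONS_MODE=queue
      - QUEUE_BULL_REDIS_HOST=redis
  n8n-worker:
    image: n8nio/n8n
    command: worker
    environment:
      - EXECUTIONS_MODE=queue
      - QUEUE_BULL_REDIS_HOST=redis
```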

Webhook Triggers

If a crash occurs during a webhook execution:

- The client receives a timeout/connection error
- The execution is marked as cancelled
- The webhook sender should implement retry logic
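On the sender's side, a small retry wrapper is often all the retry logic needed. A generic sketch in shell (the commented `curl` target is a placeholder, not a real endpoint):

```shell
#!/bin/sh
# Retry a command up to N times with a fixed delay between attempts.
# Usage: retry <max_attempts> <delay_seconds> <command...>
retry() {
  max=$1; delay=$2; shift 2
  attempt=1
  while ! "$@"; do
    if [ "$attempt" -ge "$max" ]; then
      echo "giving up after $attempt attempts" >&2
      return 1
    fi
    attempt=$((attempt + 1))
    sleep "$delay"
  done
}

# Example: retry a webhook call three times with a 2-second pause
# retry 3 2 curl -fsS -X POST https://your-n8n-host/webhook/my-hook
```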

Recovery Settings

Configure recovery behavior:

```bash
# In docker-compose.yml or the environment
# (these execution variables are not prefixed with N8N_)
EXECUTIONS_DATA_SAVE_ON_ERROR=all
EXECUTIONS_DATA_SAVE_ON_SUCCESS=all
EXECUTIONS_DATA_SAVE_MANUAL_EXECUTIONS=true
```

This ensures you can analyze what happened during crashes.

Testing Checklist

- Test graceful shutdown (docker stop)
- Test force kill (docker kill)
- Test OOM scenarios
- Test network isolation
- Verify webhook behavior during a crash
- Check execution recovery after restart
- Test with queue mode enabled and disabled

Monitoring Recommendations

For production resilience:

- Use Docker health checks
- Implement external monitoring (Prometheus/Grafana)
- Set up restart policies: `restart: unless-stopped`
- Enable execution data persistence
- Configure proper logging
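The first three recommendations can be combined in a compose file. A sketch, assuming n8n's `/healthz` endpoint (present in recent versions; verify for yours):

```yaml
# docker-compose.yml sketch: restart policy plus a container health check
services:
  n8n:
    image: n8nio/n8n
    restart: unless-stopped
    healthcheck:
      # Assumed endpoint; confirm /healthz is available in your n8n version
      test: ["CMD-SHELL", "wget -qO- http://localhost:5678/healthz || exit 1"]
      interval: 30s
      timeout: 5s
      retries: 3
```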

Summary:

- Error = workflow logic failure while n8n is running
- Cancelled = execution interrupted by an external force (crash, manual stop, etc.)

Let me know if you need help setting up specific crash recovery scenarios or have questions about production resilience!


Thanks for the insightful information.
I tried to simulate failures on a workflow that has about 20 HTTP nodes. I terminated the Docker container by sending SIGTERM (Ctrl+C); it entered the graceful shutdown period and then stopped.
The execution showed status=error, and the error at the node was "Execution stopped at this node".
In some other workflows I ran the same experiment but got status=canceled.
The point is that I was able to retry the executions with status=error, but I couldn't retry the cancelled executions, even though I have enabled saving execution progress in the workflow settings.
Is there any way to make the canceled workflow retryable so it can recover?