Describe the problem/error/question
Hi community,
I am trying to build a RAG workflow using the AI Agent node.
My setup is:
Backend: Local server running llama-cpp-python (hosting Qwen2.5-14B GGUF).
Endpoint: The server works perfectly. I verified it using a standard HTTP Request node (POST) to http://127.0.0.1:8000/v1/chat/completions. It returns the correct JSON response.
The Problem:
I cannot connect this backend to the AI Agent node.
I tried using the OpenAI Chat Model node (connected to the AI Agent’s model input).
I configured the Host as http://127.0.0.1:8000/v1 (and other variations).
I assume n8n appends extra path segments (like /chat/completions) to the configured host, which conflicts with llama-cpp-python's strict routing and results in a persistent 404 Not Found error.
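For context, this is the URL-joining behavior I believe is happening — a minimal sketch assuming the OpenAI Chat Model node appends the route the same way the official OpenAI client does (the concatenation shown here is my assumption, not confirmed n8n internals):

```shell
#!/bin/sh
# Sketch: how an OpenAI-compatible client builds the final request URL.
# Assumption: n8n's OpenAI Chat Model node follows the official client's
# convention of appending the route to the configured base URL.
BASE_URL="http://127.0.0.1:8000/v1"   # what goes in the node's Base URL field
ROUTE="/chat/completions"             # appended automatically by the client
echo "${BASE_URL}${ROUTE}"
# -> http://127.0.0.1:8000/v1/chat/completions
```

If the configured base already contains /chat/completions, or carries a trailing slash the server rejects, the concatenated path no longer matches llama-cpp-python's routes and a 404 results.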
What I need:
Since the HTTP Request node (which works) cannot be connected to the AI Agent (Model Input), I am looking for:
A way to force the OpenAI Chat Model node to use a raw/exact URL without auto-appending paths.
OR a “Generic HTTP Chat Model” node that implements the LangChain interface (circular output) but allows fully custom HTTP configuration (like the HTTP Request node).
Has anyone successfully connected llama-cpp-python (not Ollama) to the AI Agent node?
If so, please share your workflow.
My LLM deployment method
The contents of run_llama_server.sh:
#!/bin/bash
# 1. [Key fix] Lift the memory-lock limit
# Fixes: warning: failed to mlock ... Cannot allocate memory
ulimit -l unlimited
# Use the llama environment's Python directly (bypasses conda activate)
PYTHON_BIN="/root/data1/miniconda3/envs/llama/bin/python"
# CUDA library path (contains the libcudart.so.12 symlink)
export LD_LIBRARY_PATH=/usr/local/cuda-11.4/targets/x86_64-linux/lib:/usr/lib/x86_64-linux-gnu:$LD_LIBRARY_PATH
# Logging configuration
export LLAMA_CPP_LOG_LEVEL=INFO
# Model path (adjust to the model you actually downloaded)
MODEL_PATH="/root/data1/models/Qwen/Qwen2___5-14B-Instruct-GGUF/qwen2.5-14b-instruct-q4_k_m.gguf"
# Check that the model file exists
if [ ! -f "$MODEL_PATH" ]; then
echo "❌ Model file not found: $MODEL_PATH"
echo "Run the model download script first"
exit 1
fi
echo "=== Starting llama.cpp server ==="
echo "Model: $MODEL_PATH"
echo "Port: 11434"
echo "GPU acceleration: all layers offloaded to GPU"
echo "=============================="
# Start the server
# Key parameters:
# --model: path to the model file
# --n_gpu_layers -1: run all layers on the GPU (a must on a 4090)
# --n_ctx 8192: context length (8K is reasonable for a 14B model)
# --n_batch 512: batch size
# --n_threads 8: CPU threads (for the non-GPU parts)
# --host 0.0.0.0: allow external access
# --port 11434: reuse Ollama's port (easy to remember)
# --chat_format chatml: chat template (compatible with most models)
# --embedding: enable embedding vectors
exec "$PYTHON_BIN" -m llama_cpp.server \
--model "$MODEL_PATH" \
--n_gpu_layers -1 \
--n_ctx 8192 \
--n_batch 512 \
--n_threads 8 \
--host 0.0.0.0 \
--port 11434 \
--chat_format chatml \
--verbose True
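Once the server above is running, the endpoint can be smoke-tested with the same kind of request the HTTP Request node sends. A minimal sketch — the model name in the payload is an assumption (llama-cpp-python typically serves whichever model it was started with, regardless of this field):

```shell
#!/bin/sh
# Build the chat/completions payload the HTTP Request node would send.
# The "model" value is an assumption; llama-cpp-python serves the model
# it loaded at startup regardless of this field in typical setups.
cat > /tmp/payload.json <<'EOF'
{
  "model": "qwen2.5-14b-instruct",
  "messages": [{"role": "user", "content": "Hello"}],
  "max_tokens": 32
}
EOF
# Sanity-check that the payload is valid JSON before sending it:
python3 -c 'import json; json.load(open("/tmp/payload.json")); print("payload OK")'
# With run_llama_server.sh listening on port 11434, POST it:
#   curl -s http://127.0.0.1:11434/v1/chat/completions \
#        -H "Content-Type: application/json" \
#        -d @/tmp/payload.json
```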
Information on your n8n setup
- n8n version:
- Database (default: SQLite):
- n8n EXECUTIONS_PROCESS setting (default: own, main):
- Running n8n via (Docker, npm, n8n cloud, desktop app):
- Operating system: