Feature Request / Help: Need a “Generic HTTP Chat Model” node to connect llama-cpp-python to AI Agent (OpenAI node throws 404)

Describe the problem/error/question

Hi community,

I am trying to build a RAG workflow using the AI Agent node.
My setup is:

Backend: Local server running llama-cpp-python (hosting Qwen2.5-14B GGUF).
Endpoint: The server works perfectly. I verified it using a standard HTTP Request node (POST) to http://127.0.0.1:8000/v1/chat/completions. It returns the correct JSON response.
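For reference, the working direct request can be reproduced with curl. This is only a sketch of what the HTTP Request node sends; the model name below is an example (llama-cpp-python generally accepts any model string when a single model is loaded):

```shell
# Build the request body; validate it locally before sending.
read -r -d '' PAYLOAD <<'JSON' || true
{
  "model": "qwen2.5-14b-instruct",
  "messages": [{"role": "user", "content": "Say hello"}],
  "max_tokens": 32
}
JSON

# Verify the JSON is well-formed (no server needed for this step).
echo "$PAYLOAD" | python3 -c 'import json,sys; json.load(sys.stdin); print("payload OK")'

# With the server running, this mirrors the working HTTP Request node:
# curl -s http://127.0.0.1:8000/v1/chat/completions \
#   -H "Content-Type: application/json" -d "$PAYLOAD"
```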
The Problem:
I cannot connect this backend to the AI Agent node.

I tried using the OpenAI Chat Model node (connected to the AI Agent’s model input).
I configured the Host as http://127.0.0.1:8000/v1 (and other variations).
I assume n8n is appending extra paths (like /chat/completions) which conflicts with the strict routing of llama-cpp-python, resulting in a persistent 404 Not Found error.
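Assuming the node behaves like a standard OpenAI-compatible client (I have not checked n8n's source), it joins the configured base URL with the fixed route /chat/completions, which would explain the 404 when the full endpoint is entered as the base:

```shell
# Standard OpenAI-style clients append the route to the base URL:
BASE_URL="http://127.0.0.1:8000/v1"
echo "${BASE_URL%/}/chat/completions"
# -> http://127.0.0.1:8000/v1/chat/completions  (correct)

# Entering the full endpoint as the base duplicates the path segment:
WRONG_BASE="http://127.0.0.1:8000/v1/chat/completions"
echo "${WRONG_BASE%/}/chat/completions"
# -> .../v1/chat/completions/chat/completions  (404 on llama-cpp-python)
```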

What I need:
Since the HTTP Request node (which works) cannot be connected to the AI Agent (Model Input), I am looking for:

A way to force the OpenAI Chat Model node to use a raw/exact URL without auto-appending paths.
OR a “Generic HTTP Chat Model” node that implements the LangChain interface (circular output) but allows fully custom HTTP configuration (like the HTTP Request node).
Has anyone successfully connected llama-cpp-python (not Ollama) to the AI Agent node? If so, please share your workflow.

My LLM deployment method

The contents of run_llama_server.sh:

#!/bin/bash


# 1. [Key fix] Lift the memory-lock limit
# Resolves: warning: failed to mlock ... Cannot allocate memory
ulimit -l unlimited

# Use the llama environment's Python directly (bypassing conda activate)
PYTHON_BIN="/root/data1/miniconda3/envs/llama/bin/python"

# CUDA library path (contains the libcudart.so.12 symlink)
export LD_LIBRARY_PATH=/usr/local/cuda-11.4/targets/x86_64-linux/lib:/usr/lib/x86_64-linux-gnu:$LD_LIBRARY_PATH

# Logging configuration
export LLAMA_CPP_LOG_LEVEL=INFO

# Model path (adjust to match the model you actually downloaded)
MODEL_PATH="/root/data1/models/Qwen/Qwen2___5-14B-Instruct-GGUF/qwen2.5-14b-instruct-q4_k_m.gguf"

# Check that the model file exists
if [ ! -f "$MODEL_PATH" ]; then
    echo "❌ Model file not found: $MODEL_PATH"
    echo "Run the model download script first"
    exit 1
fi

echo "=== Starting llama.cpp server ==="
echo "Model: $MODEL_PATH"
echo "Port: 11434"
echo "GPU acceleration: all layers offloaded to GPU"
echo "=============================="

# Start the server
# Key parameters:
# --model: path to the model file
# --n_gpu_layers -1: run all layers on the GPU (a must on a 4090)
# --n_ctx 8192: context length (8K is reasonable for a 14B model)
# --n_batch 512: batch size
# --n_threads 8: CPU thread count (for the non-GPU parts)
# --host 0.0.0.0: allow external access
# --port 11434: Ollama's port (easy to remember)
# --chat_format chatml: chat template (compatible with most models)
# --embedding: enable embedding endpoints

exec "$PYTHON_BIN" -m llama_cpp.server \
    --model "$MODEL_PATH" \
    --n_gpu_layers -1 \
    --n_ctx 8192 \
    --n_batch 512 \
    --n_threads 8 \
    --host 0.0.0.0 \
    --port 11434 \
    --chat_format chatml \
    --verbose True
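Once the script is running, a quick way to confirm the server is reachable before wiring it into n8n (the host and port below match the script; adjust if you changed them):

```shell
# llama_cpp.server exposes OpenAI-compatible routes, including /v1/models.
# Compose the health-check URL from the script's settings:
HOST="127.0.0.1"
PORT=11434
MODELS_URL="http://${HOST}:${PORT}/v1/models"
echo "$MODELS_URL"

# With the server up, this should return the loaded model list as JSON:
# curl -s "$MODELS_URL"
```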


Environment:

  • n8n version: [latest]

  • Deployment: PM2

  • Local LLM: llama-cpp-python

In the OpenAI credentials for the model node, you can set the base URL (with or without /v1) and call the model. As the screenshot shows, the terminal logs requests both with /v1 and without.



Thank you very much. I tried again today and it works. It's amazing!