Evals: Set Metrics node does not output extended_reasoning, reasoning_summary

Describe the problem/error/question

Evals: the Set Metrics node does not output the extended_reasoning and reasoning_summary details. The OpenAI model does provide the information, but the Eval node outputs just the score.

I have tried various ways to make the system prompt clearer but have been unsuccessful. Does anyone have a prompt that works for them and outputs the following:

{
"extended_reasoning": "",
"reasoning_summary": "",
"score": <number: integer from 1 to 5>
}

What is the error message (if any)?

No error message

Please share your workflow

(Select the nodes on your canvas and use the keyboard shortcuts CMD+C/CTRL+C and CMD+V/CTRL+V to copy and paste the workflow.)

Share the output returned by the last node

My current prompt

You are an expert factual evaluator assessing the accuracy of answers compared to established ground truths.
Evaluate the factual correctness of a given output compared to the provided ground truth on a scale from 1 to 5. Use detailed reasoning to thoroughly analyze all claims before determining the final score, extended_reasoning, reasoning_summary.

DATA

Ground Truth : {{ $(‘When fetching a dataset row1’).item.json.expected_output }}
Output : {{ $(‘RAG AI Agent’).item.json.output }}

Scoring Criteria

  • 5: Highly similar - The output and ground truth are nearly identical, with only minor, insignificant differences.
  • 4: Somewhat similar - The output is largely similar to the ground truth but has few noticeable differences.
  • 3: Moderately similar - There are some evident differences, but the core essence is captured in the output.
  • 2: Slightly similar - The output only captures a few elements of the ground truth and contains several differences.
  • 1: Not similar - The output is significantly different from the ground truth, with few or no matching elements.

Evaluation Steps

  1. Identify and list the key elements present in both the output and the ground truth.
  2. Compare these key elements to evaluate their similarities and differences, considering both content and structure.
  3. Analyze the semantic meaning conveyed by both the output and the ground truth, noting any significant deviations.
  4. Consider factual accuracy of specific details, including names, dates, numbers, and relationships.
  5. Assess whether the output maintains the factual integrity of the ground truth, even if phrased differently.
  6. Determine the overall level of similarity and accuracy according to the defined criteria.

Output Format

Provide:

  • A detailed analysis of the comparison (extended reasoning)
  • A one-sentence summary highlighting key differences (not similarities)
  • The final similarity score as an integer (1, 2, 3, 4, or 5)
    Always follow the JSON format below and return nothing else:
    {
    “extended_reasoning”: “”,
    “reasoning_summary”: “”,
    “score”: <number: integer from 1 to 5>
    }

Examples

Example 1:
Input:

  • Output: “The cat sat on the mat.”
  • Ground Truth: “The feline is sitting on the rug.”
    Expected Output:
    {
    “extended_reasoning”: “I need to compare ‘The cat sat on the mat’ with ‘The feline is sitting on the rug.’ First, let me identify the key elements: both describe an animal (‘cat’ vs ‘feline’) in a position (‘sat’ vs ‘sitting’) on a surface (‘mat’ vs ‘rug’). The subject is semantically identical - ‘cat’ and ‘feline’ refer to the same animal. The action is also semantically equivalent - ‘sat’ and ‘sitting’ both describe the same position, though one is past tense and one is present continuous. The location differs in specific wording (‘mat’ vs ‘rug’) but both refer to floor coverings that serve the same function. The basic structure and meaning of both sentences are preserved, though they use different vocabulary and slightly different tense. The core information being conveyed is the same, but there are noticeable wording differences.”,
    “reasoning_summary”: “The sentences differ in vocabulary choice (‘cat’ vs ‘feline’, ‘mat’ vs ‘rug’) and verb tense (‘sat’ vs ‘is sitting’).”,
    “score”: 3
    }
    Example 2:
    Input:
  • Output: “The quick brown fox jumps over the lazy dog.”
  • Ground Truth: “A fast brown animal leaps over a sleeping canine.”
    Expected Output:
    {
    “extended_reasoning”: “I need to compare ‘The quick brown fox jumps over the lazy dog’ with ‘A fast brown animal leaps over a sleeping canine.’ Starting with the subjects: ‘quick brown fox’ vs ‘fast brown animal’. Both describe the same entity (a fox is a type of animal) with the same attributes (quick/fast and brown). The action is described as ‘jumps’ vs ‘leaps’, which are synonymous verbs describing the same motion. The object in both sentences is a dog, described as ‘lazy’ in one and ‘sleeping’ in the other, which are related concepts (a sleeping dog could be perceived as lazy). The structure follows the same pattern: subject + action + over + object. The sentences convey the same scene with slightly different word choices that maintain the core meaning. The level of specificity differs slightly (‘fox’ vs ‘animal’, ‘dog’ vs ‘canine’), but the underlying information and imagery remain very similar.”,
    “reasoning_summary”: “The sentences use different but synonymous terminology (‘quick’ vs ‘fast’, ‘jumps’ vs ‘leaps’, ‘lazy’ vs ‘sleeping’) and varying levels of specificity (‘fox’ vs ‘animal’, ‘dog’ vs ‘canine’).”,
    “score”: 4
    }

Notes

  • Focus primarily on factual accuracy and semantic similarity, not writing style or phrasing differences.
  • Identify specific differences rather than making general assessments.
  • Pay special attention to dates, numbers, names, locations, and causal relationships when present.
  • Consider the significance of each difference in the context of the overall information.
  • Be consistent in your scoring approach across different evaluations.

Information on your n8n setup

  • n8n version:
  • Database (default: SQLite):
  • n8n EXECUTIONS_PROCESS setting (default: own, main):
  • Running n8n via (Docker, npm, n8n cloud, desktop app):
  • Operating system:

It doesn’t look like there’s a runtime error message in your prompt — the structure itself is fine. The only issue that might trip you up is the use of curly quotes ‘ ’ instead of straight quotes ' ' around the node names (e.g., $(‘When fetching a dataset row1’) should be $('When fetching a dataset row1')). If not corrected, n8n will throw a parsing error because it won’t recognize the node reference.
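
The same rule applies to any JSON a strict parser has to read: curly quotes are not valid string delimiters. A minimal Python sketch (standard library only) showing the difference:

```python
import json

# Curly ("smart") quotes, as produced by rich-text editors, are not
# valid JSON string delimiters -- a strict parser rejects them.
curly = "{\u201cscore\u201d: 3}"       # {“score”: 3}
straight = '{"score": 3}'

try:
    json.loads(curly)
except json.JSONDecodeError as err:
    print("curly quotes rejected:", err.msg)

print(json.loads(straight))  # the straight-quoted version parses fine
```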

Corrected JSON

{
"extended_reasoning": "",
"reasoning_summary": "",
"score": 0,
"prompt": "You are an expert factual evaluator assessing the accuracy of answers compared to established ground truths.\nEvaluate the factual correctness of a given output compared to the provided ground truth on a scale from 1 to 5. Use detailed reasoning to thoroughly analyze all claims before determining the final score, extended_reasoning, reasoning_summary.\n\nDATA\nGround Truth : {{ $('When fetching a dataset row1').item.json.expected_output }}\nOutput : {{ $('RAG AI Agent').item.json.output }}\n\nScoring Criteria\n5: Highly similar - The output and ground truth are nearly identical, with only minor, insignificant differences.\n4: Somewhat similar - The output is largely similar to the ground truth but has few noticeable differences.\n3: Moderately similar - There are some evident differences, but the core essence is captured in the output.\n2: Slightly similar - The output only captures a few elements of the ground truth and contains several differences.\n1: Not similar - The output is significantly different from the ground truth, with few or no matching elements.\n\nEvaluation Steps\nIdentify and list the key elements present in both the output and the ground truth.\nCompare these key elements to evaluate their similarities and differences, considering both content and structure.\nAnalyze the semantic meaning conveyed by both the output and the ground truth, noting any significant deviations.\nConsider factual accuracy of specific details, including names, dates, numbers, and relationships.\nAssess whether the output maintains the factual integrity of the ground truth, even if phrased differently.\nDetermine the overall level of similarity and accuracy according to the defined criteria.\n\nOutput Format\nProvide:\n\nA detailed analysis of the comparison (extended reasoning)\nA one-sentence summary highlighting key differences (not similarities)\nThe final similarity score as an integer (1, 2, 3, 4, or 5)\nAlways follow the JSON format below and return nothing else:\n{\n \"extended_reasoning\": \"\",\n \"reasoning_summary\": \"\",\n \"score\": <number: integer from 1 to 5>\n}\n"
}

Changes made:

  • Replaced curly quotes ‘ ’ with proper straight quotes ' ' around node references.

  • Escaped the line breaks with \n to keep JSON valid if you paste it in as a string.
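
The quote replacement can also be done programmatically before pasting a prompt into n8n. A small sketch (the helper name is my own illustration, not an n8n function):

```python
# Hypothetical helper: normalize curly quotes to straight quotes so
# node references like $('...') stay valid n8n expressions.
def straighten_quotes(text: str) -> str:
    replacements = {
        "\u2018": "'",  # left single curly quote
        "\u2019": "'",  # right single curly quote
        "\u201c": '"',  # left double curly quote
        "\u201d": '"',  # right double curly quote
    }
    for curly, straight in replacements.items():
        text = text.replace(curly, straight)
    return text

print(straighten_quotes("$(\u2018RAG AI Agent\u2019)"))  # -> $('RAG AI Agent')
```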

Thanks for the reply. The thing is, n8n does not throw any error. I can see that the model tied to the Evaluation tool is producing the extra data, BUT the output of the Evaluation Agent is just the score.

My assumption was that the default system prompt for this Set Metrics node would have been validated by n8n, but it doesn't seem like it has been.

The challenge is to force Set Metrics to return all 3 variables.

Okay, feel free to connect if you need any help.

I have the same issue. Have you figured out a work around? I see in the model logs that it indeed outputted an “extended reasoning” but the output from the “set metrics” node is just a correctness score.

@orimoricori - No I haven’t and it is very frustrating. Expected more from n8n.

A workaround while waiting for this feature:
add an LLM Chain node and copy-paste the system prompt and user prompt into it. I had to add a Set node after it, but that might be due to my use case. Then set the metric.
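
If it helps, here is a rough sketch of what that post-processing step can do, written in Python (which the n8n Code node also supports). The field names follow the prompt template above; `extract_metrics` and the sample string are my own illustrations, not an n8n API:

```python
import json
import re

# Sketch of the step after the LLM Chain: pull the three fields out of the
# model's raw text, which may wrap the JSON in extra prose or markdown fences.
def extract_metrics(raw_output: str) -> dict:
    match = re.search(r"\{.*\}", raw_output, re.DOTALL)
    if not match:
        raise ValueError("no JSON object found in model output")
    data = json.loads(match.group(0))
    return {
        "extended_reasoning": data.get("extended_reasoning", ""),
        "reasoning_summary": data.get("reasoning_summary", ""),
        "score": int(data.get("score", 0)),
    }

sample = ('Here is my evaluation:\n'
          '{"extended_reasoning": "Both answers match on the key facts.", '
          '"reasoning_summary": "Minor wording differences.", "score": 4}')
print(extract_metrics(sample))
```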

This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.