Hi n8n Community,
I’m working on a project where I need to extract structured content from legal documents. These documents are in HTML format, and I’m trying to parse them in n8n using the Function node. The challenge is to accurately extract the text inside <p> tags with the class oj-normal, which are nested within articles, where each article is identified by a div with the class eli-subdivision.
Here’s what I’m trying to achieve with this file:
https://eur-lex.europa.eu/legal-content/EN/TXT/HTML/?uri=CELEX:32022R2554#d1e5239-1-1
- Extract the article title from <p> tags with class oj-ti-art.
- Extract the article subtitle from <p> tags with class oj-sti-art.
- Extract the text content from <p> tags with class oj-normal, and possibly from table cells that contain additional text.
The issue is that although I can match the articles and subtitles, I consistently get “No content available” for the article content.
Here’s the code I’m currently using:
javascript
// Load the full HTML from the 'data' property of the previous HTTP Request Node
const htmlContent = items[0].json.data || '';
if (!htmlContent) {
  return [
    {
      json: {
        error: "No HTML content found"
      }
    }
  ];
}

// Initialize the chapters and logs arrays
const chapters = [];
const logs = [];

// Use regex to match divs with "eli-subdivision" class for each article
const articleMatches = htmlContent.match(/<div[^>]*class="eli-subdivision"[^>]*>([\s\S]*?)<\/div>/g) || [];
logs.push(`Matched Articles: ${articleMatches.length}`);

articleMatches.forEach(articleDiv => {
  // Extract the article title and subtitle
  const articleTitleMatch = articleDiv.match(/<p[^>]*class="oj-ti-art"[^>]*>(.*?)<\/p>/);
  const articleSubtitleMatch = articleDiv.match(/<p[^>]*class="oj-sti-art"[^>]*>(.*?)<\/p>/);
  let articleContent = [];

  // Only proceed if there's a subtitle (oj-sti-art) in the article
  if (articleSubtitleMatch) {
    // Get the subtitle text
    const articleSubtitle = articleSubtitleMatch[1].replace(/<[^>]*>/g, '').trim();

    // Match all <p> tags with the class "oj-normal" inside the article div
    const paragraphMatches = articleDiv.match(/<p[^>]*class="oj-normal"[^>]*>([\s\S]*?)<\/p>/g) || [];

    // Clean up and extract the text from the matched paragraphs with "oj-normal"
    paragraphMatches.forEach(paragraph => {
      // Remove HTML tags and &nbsp; entities
      const cleanText = paragraph.replace(/<[^>]*>/g, '').replace(/&nbsp;/g, ' ').trim();
      if (cleanText) {
        articleContent.push(cleanText);
      }
    });

    // Match any table cells with content inside this article division
    const tableCellMatches = articleDiv.match(/<td[^>]*>([\s\S]*?)<\/td>/g) || [];
    tableCellMatches.forEach(td => {
      // Remove HTML tags and &nbsp; entities
      const cleanText = td.replace(/<[^>]*>/g, '').replace(/&nbsp;/g, ' ').trim();
      if (cleanText) {
        articleContent.push(cleanText);
      }
    });

    const combinedContent = articleContent.join(' ').replace(/\s+/g, ' ').trim();

    if (articleTitleMatch) {
      const articleTitle = articleTitleMatch[1].replace(/<[^>]*>/g, '').trim();
      chapters.push({
        title: articleTitle,
        subtitle: articleSubtitle,
        content: combinedContent || 'No content available'
      });
    }
  }
});

// Return the extracted chapters and logs
return {
  json: {
    chapters: chapters,
    logs: logs
  }
};
Even though the HTML structure looks correct, I’m still unable to extract the actual text content from the oj-normal paragraphs.
What I’ve Tried:
- Using regex to extract the oj-normal paragraphs.
- Checking whether nested tags might be causing the issue.
- Extracting text from table cells that might also contain important content.
Issue: I keep getting “No content available” even though the HTML contains the text in the oj-normal paragraphs. I believe this might be due to the way I’m handling the HTML structure, and regex may not be the best solution here.
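One possible explanation, offered as a sketch rather than a confirmed diagnosis: the lazy ([\s\S]*?)<\/div> part of the article regex stops at the first closing </div> it meets. If an article div contains nested divs (for example a wrapper around the subtitle), each match would end right after that wrapper and before any oj-normal paragraph. The markup below is a hypothetical, simplified sample, not copied from the EUR-Lex page; it only reproduces that effect.

```javascript
// Hypothetical sample: the nesting (subtitle wrapped in its own div) is an assumption.
const sampleHtml = [
  '<div class="eli-subdivision" id="art_1">',
  '<p class="oj-ti-art">Article 1</p>',
  '<div class="eli-title"><p class="oj-sti-art">Subject matter</p></div>',
  '<div><p class="oj-normal">1. First paragraph.</p></div>',
  '</div>',
].join('\n');

// Same article pattern as in the code above.
const lazyPattern = /<div[^>]*class="eli-subdivision"[^>]*>([\s\S]*?)<\/div>/g;
const [firstMatch] = sampleHtml.match(lazyPattern) || [''];

// Check which pieces survived inside the matched fragment.
const hasTitle = /class="oj-ti-art"/.test(firstMatch);
const hasSubtitle = /class="oj-sti-art"/.test(firstMatch);
const hasBody = /class="oj-normal"/.test(firstMatch);

console.log({ hasTitle, hasSubtitle, hasBody });
// → { hasTitle: true, hasSubtitle: true, hasBody: false }
```

With this sample the match ends at the </div> that closes the subtitle wrapper, which would be consistent with titles and subtitles being found while the content stays empty.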
Any help or suggestions on how I can better extract this text would be highly appreciated. Also, if there’s a way to use an HTML parser in n8n or if anyone has encountered similar issues, I’d love to hear your thoughts!
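For reference, here is a minimal dependency-free sketch of one alternative that avoids matching closing </div> tags at all: it records where each eli-subdivision div opens and slices the HTML from one opening to the next. The helper names (toText, paragraphsByClass, extractArticles) and the sample markup are hypothetical, and the sketch assumes that everything between two article openings belongs to the first article. A real HTML parser would likely be more robust.

```javascript
// Strip tags, decode non-breaking spaces and collapse whitespace.
function toText(fragment) {
  return fragment
    .replace(/<[^>]*>/g, '')
    .replace(/&nbsp;/g, ' ')
    .replace(/\s+/g, ' ')
    .trim();
}

// Text of every <p> whose class attribute is exactly `className`.
function paragraphsByClass(html, className) {
  const pattern = new RegExp(`<p[^>]*class="${className}"[^>]*>([\\s\\S]*?)</p>`, 'g');
  return [...html.matchAll(pattern)].map(m => toText(m[1])).filter(Boolean);
}

function extractArticles(html) {
  // Offsets where each article container opens.
  const opener = /<div[^>]*class="eli-subdivision"[^>]*>/g;
  const starts = [...html.matchAll(opener)].map(m => m.index);

  return starts
    .map((start, i) => html.slice(start, i + 1 < starts.length ? starts[i + 1] : html.length))
    .map(chunk => ({
      title: paragraphsByClass(chunk, 'oj-ti-art')[0] || '',
      subtitle: paragraphsByClass(chunk, 'oj-sti-art')[0] || '',
      content: paragraphsByClass(chunk, 'oj-normal').join(' '),
    }))
    // Keep only chunks that look like articles, as the original code does.
    .filter(article => article.title && article.subtitle);
}

// Hypothetical sample markup, only for illustration.
const sample = [
  '<div class="eli-subdivision" id="art_1">',
  '<p class="oj-ti-art">Article 1</p>',
  '<div class="eli-title"><p class="oj-sti-art">Subject matter</p></div>',
  '<div><p class="oj-normal">1.&nbsp;First <span>paragraph</span>.</p></div>',
  '</div>',
  '<div class="eli-subdivision" id="art_2">',
  '<p class="oj-ti-art">Article 2</p>',
  '<div class="eli-title"><p class="oj-sti-art">Scope</p></div>',
  '<p class="oj-normal">Second article text.</p>',
  '</div>',
].join('\n');

const articles = extractArticles(sample);
console.log(articles);
```

In the Function node this could presumably be called as extractArticles(items[0].json.data). Note that the last chunk runs to the end of the document, so trailing text after the final article may need trimming, and table cells are not handled here.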
Thanks in advance for your support!