Help Needed: Extracting Text from Complex HTML in Legal Documents

Hi n8n Community,

I’m working on a project where I need to extract structured content from legal documents. These documents are in HTML format, and I’m trying to parse them in n8n using the Function node. The challenge is accurately extracting the text inside <p> tags with the class oj-normal, which are nested within articles, where each article is identified by a div with the class eli-subdivision.

Here’s what I’m trying to achieve: from this file:
https://eur-lex.europa.eu/legal-content/EN/TXT/HTML/?uri=CELEX:32022R2554#d1e5239-1-1

  1. Extract the article title from <p> tags with class oj-ti-art.
  2. Extract the article subtitle from <p> tags with class oj-sti-art.
  3. Extract the text content from <p> tags with class oj-normal and possibly table cells that contain additional text.

The issue is that although I can match the articles and subtitles, I consistently get “No content available” for the article content.

Here’s the code I’m currently using:

javascript

// Load the full HTML from the 'data' property of the previous HTTP Request Node
const htmlContent = items[0].json.data || '';

if (!htmlContent) {
    return [
        {
            json: {
                error: "No HTML content found"
            }
        }
    ];
}

// Initialize the chapters and articles array
const chapters = [];
const logs = [];

// Use regex to match divs with "eli-subdivision" class for each article
const articleMatches = htmlContent.match(/<div[^>]*class="eli-subdivision"[^>]*>([\s\S]*?)<\/div>/g) || [];

logs.push(`Matched Articles: ${articleMatches.length}`);

articleMatches.forEach(articleDiv => {
    // Extract the article title and subtitle
    const articleTitleMatch = articleDiv.match(/<p[^>]*class="oj-ti-art"[^>]*>(.*?)<\/p>/);
    const articleSubtitleMatch = articleDiv.match(/<p[^>]*class="oj-sti-art"[^>]*>(.*?)<\/p>/);

    let articleContent = [];

    // Only proceed if there's a subtitle (oj-sti-art) in the article
    if (articleSubtitleMatch) {
        // Get the subtitle text
        const articleSubtitle = articleSubtitleMatch[1].replace(/<[^>]*>/g, '').trim();
        
        // Match all <p> tags with the class "oj-normal" inside the article div
        const paragraphMatches = articleDiv.match(/<p[^>]*class="oj-normal"[^>]*>([\s\S]*?)<\/p>/g) || [];
    
        // Clean up and extract the text from the matched paragraphs with "oj-normal"
        paragraphMatches.forEach(paragraph => {
            const cleanText = paragraph.replace(/<[^>]*>/g, '').replace(/&nbsp;/g, ' ').trim();  // Remove HTML tags and &nbsp;
            if (cleanText) {
                articleContent.push(cleanText);
            }
        });

        // Match any table cells with content inside this article division
        const tableCellMatches = articleDiv.match(/<td[^>]*>([\s\S]*?)<\/td>/g) || [];
        tableCellMatches.forEach(td => {
            const cleanText = td.replace(/<[^>]*>/g, '').replace(/&nbsp;/g, ' ').trim();  // Remove HTML tags and &nbsp;
            if (cleanText) {
                articleContent.push(cleanText);
            }
        });

        const combinedContent = articleContent.join(' ').replace(/\s+/g, ' ').trim();
    
        if (articleTitleMatch) {
            const articleTitle = articleTitleMatch[1].replace(/<[^>]*>/g, '').trim();
    
            chapters.push({
                title: articleTitle,
                subtitle: articleSubtitle,
                content: combinedContent || 'No content available'
            });
        }
    }
});

// Return the extracted chapters and logs
return [
    {
        json: {
            chapters: chapters,
            logs: logs
        }
    }
];

Even though the HTML has the structure I expect, I’m still unable to extract the actual text content from the oj-normal paragraphs.

What I’ve Tried:

  • Using regex to extract the oj-normal paragraphs.
  • Checking whether nested tags might be causing the issue.
  • Extracting text from table cells that might also contain important content.

Issue: I keep getting “No content available” even though the HTML contains the text in the oj-normal paragraphs. I suspect this is due to the way I’m handling the HTML structure, and that regex may not be the right tool here.
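To illustrate what I suspect is happening, here’s a minimal sketch with made-up, simplified markup (not the real EUR-Lex structure): the non-greedy match stops at the first </div> it meets, so any nested div inside an article cuts the captured content short.

javascript

// Hypothetical, simplified article markup for illustration only:
const sample =
    '<div class="eli-subdivision" id="art_1">' +
    '<p class="oj-ti-art">Article 1</p>' +
    '<p class="oj-sti-art">Subject matter</p>' +
    '<div class="wrapper">note</div>' +
    '<p class="oj-normal">The actual body text</p>' +
    '</div>';

// Same regex as in the Function node code above:
const match = sample.match(/<div[^>]*class="eli-subdivision"[^>]*>([\s\S]*?)<\/div>/);

// The capture stops at the wrapper's closing </div>, so the oj-normal
// paragraph never makes it into the match:
console.log(match[1]);
// -> '<p class="oj-ti-art">Article 1</p><p class="oj-sti-art">Subject matter</p><div class="wrapper">note'

That would explain why the titles and subtitles match while the content comes back empty.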

Any help or suggestions on how I can better extract this text would be highly appreciated. Also, if there’s a way to use an HTML parser in n8n or if anyone has encountered similar issues, I’d love to hear your thoughts!

Thanks in advance for your support!

It looks like your topic is missing some important information. Could you provide the following, if applicable?

  • n8n version:
  • Database (default: SQLite):
  • n8n EXECUTIONS_PROCESS setting (default: own, main):
  • Running n8n via (Docker, npm, n8n cloud, desktop app):
  • Operating system:

Hi @Christian_Visti_Lars, welcome to the n8n community! :tada:

Did you try using the HTML Extract node?

Here are a few examples: HTML Extract integrations | Workflow automation with n8n
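In case it helps, the extraction settings could look roughly like this, written out as the node’s key/selector pairs. The CSS selectors are assumptions based on the EUR-Lex markup, so adjust them as needed:

javascript

// Sketch of possible HTML Extract extraction values; the selectors are
// guesses about the EUR-Lex class names, not tested against the page:
const extractionValues = [
    { key: 'titles',     cssSelector: 'p.oj-ti-art',  returnValue: 'text', returnArray: true },
    { key: 'subtitles',  cssSelector: 'p.oj-sti-art', returnValue: 'text', returnArray: true },
    { key: 'paragraphs', cssSelector: 'div.eli-subdivision p.oj-normal', returnValue: 'text', returnArray: true }
];

A proper HTML parser also copes with the nested divs that tend to trip up a regex-based approach.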

Hi,

Try using XPath instead of regex:

e.g. //p[@class='oj-ti-art'] will return 64 results for the given URL (Article 1-64),
//p[@class='oj-sti-art'] will return the 64 corresponding subtitles, and
//p[@class='oj-normal'] will return all paragraphs with class oj-normal.

Or you can use //div[@class='eli-subdivision' and contains(@id, 'art_1')] or //div[@class='eli-subdivision' and contains(@id, 'art')]//following::p[@class='oj-normal']; this will give you the whole Article 1 including the headline, sub-headline, and content.
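If you want to evaluate XPath directly inside a Function node instead, a rough sketch is below. It assumes the xpath and @xmldom/xmldom npm packages are installed and allowed via NODE_FUNCTION_ALLOW_EXTERNAL, and keep in mind that real-world HTML is not always well-formed enough for an XML parser:

javascript

// Rough sketch only: assumes the 'xpath' and '@xmldom/xmldom' packages are
// available to the Function node (NODE_FUNCTION_ALLOW_EXTERNAL).
const xpath = require('xpath');
const { DOMParser } = require('@xmldom/xmldom');

const doc = new DOMParser().parseFromString(items[0].json.data, 'text/html');

// Collect the text nodes of every oj-normal paragraph:
const paragraphs = xpath
    .select("//p[@class='oj-normal']//text()", doc)
    .map(node => node.nodeValue.trim())
    .filter(Boolean);

return [{ json: { paragraphs } }];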

Just play around a little bit. You can test it here: Scrapfly | Web Scraping Tools | XPATH / CSS Selector online tester

XPath is powerful and yet can be a pain in the a**; a good starting point is the Chrome extension Ruto.


Yes, I have tried using the HTML Extract node, and it seems to be working well now! I’ve managed to extract the HTML content into a workable JSON format, which I’m now planning to refine further using the Function node.

The challenge I’m working on next is organizing the extracted data, like splitting sections into subsections and sub-subsections. But overall, I’m happy with the progress and appreciate the support!
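For anyone who finds this later, the refinement step in the Function node is roughly the sketch below. It assumes the HTML Extract node returned parallel arrays named titles and subtitles; those key names come from my extraction settings, so yours may differ:

javascript

// Pair the parallel arrays from the HTML Extract step into one object per
// article. The 'titles'/'subtitles' keys are placeholders for whatever keys
// were configured in the extraction values:
const { titles = [], subtitles = [] } = items[0].json;

const chapters = titles.map((title, i) => ({
    title: title.trim(),
    subtitle: (subtitles[i] || '').trim()
}));

return [{ json: { chapters } }];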

Thanks again for the help!