n8n server on Docker: how do I install pdf-parse to extract text from PDFs?

Hello, here is my configuration (self-hosted, on premises):

  • n8n version: 1.64.3
  • Database (default: SQLite): PostgreSQL
  • n8n EXECUTIONS_PROCESS setting (default: own, main): default
  • Running n8n via (Docker, npm, n8n cloud, desktop app): Docker
  • Operating system: Debian 12

I want to use this script in a Code node to extract data from PDF files:

const pdf = require('pdf-parse');

// Parse the PDF and split its text into one entry per page
const extractPdfTextPerPage = async (buffer) => {
  const data = await pdf(buffer);
  const pages = data.text.split('\n\n'); // Assumes '\n\n' separates pages (may vary depending on the PDF)
  return pages.map((pageText, index) => ({ page: index + 1, text: pageText }));
};

// Make sure the file is passed in as binary under the property name "data".
// $binary.data is a descriptor object, not a Buffer, so read the contents through the helper.
const pdfBuffer = await this.helpers.getBinaryDataBuffer(0, 'data');
const pages = await extractPdfTextPerPage(pdfBuffer);

// If you have a specific format to extract from each page
return pages.map(page => {
  // Example processing that extracts a particular format from each page
  const regex = /YourFormatRegExp/g; // placeholder pattern
  const result = page.text.match(regex);
  return {
    json: {
      page: page.page,
      extractedText: result ? result : 'No result found'
    }
  };
});

But I get an error when I run it.

So I want to install pdf-parse in my Docker installation. How do I do that, please?

Hey @Issa2024, see if the following video helps (you can start from the 13th minute): https://www.youtube.com/watch?v=hwN5qs0CmsE

In short, you need to install the module before you can use it.
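For a Docker install, one common way to do this is to build a small custom image on top of the official one. The following is a minimal sketch, not a tested recipe for 1.64.3: it assumes the official `n8nio/n8n` image, that a global npm install is visible to the Code node, and a hypothetical image name `n8n-pdf-parse`.

```shell
# Build a custom image that adds pdf-parse to the official n8n image.
# "n8n-pdf-parse" is a hypothetical image name; pick your own.
# The Dockerfile is passed on stdin, so no build context is needed.
docker build -t n8n-pdf-parse - <<'EOF'
FROM n8nio/n8n:1.64.3
# Switch to root to install the module, then drop back to the node user
USER root
RUN npm install -g pdf-parse
USER node
EOF
```

The container started from this image would still need NODE_FUNCTION_ALLOW_EXTERNAL=pdf-parse in its environment, otherwise the Code node refuses to load external modules.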


pdf-parse seems to be in the dependency tree of n8n’s langchain nodes, so it may already be present in the image. It’s possible that all you need to do is set the environment variable NODE_FUNCTION_ALLOW_EXTERNAL to pdf-parse, and you should then be able to require the package in the Code node.
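As a rough sketch of what that could look like with plain `docker run`: the container name `n8n` and the volume name `n8n_data` are assumptions, and the PostgreSQL settings from your existing setup are left out and would need to be added back.

```shell
# Start n8n with pdf-parse allowed in the Code node.
# Container name and volume name are assumptions; adapt them to your setup.
# NODE_FUNCTION_ALLOW_EXTERNAL takes a comma-separated list of module names.
docker run -d --name n8n \
  -p 5678:5678 \
  -e NODE_FUNCTION_ALLOW_EXTERNAL=pdf-parse \
  -v n8n_data:/home/node/.n8n \
  n8nio/n8n:1.64.3
```

If you use docker compose instead, the same variable goes under the `environment:` key of the n8n service. If the Code node still cannot find the module after that, the package is probably not resolvable from the Code node and has to be installed in the image.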

That said, pdf-parse doesn’t seem to be maintained, and that’s why n8n migrated to pdfjs instead.

Does the Extract from File node not work for you for extracting data from a pdf file?
