AWS Textract - run through data and get values?

bocaz · April 16, 2022, 12:47pm

Hi,

The data from Amazon Textract is not exactly straightforward. Here is an example:

{
  "DocumentMetadata": {
    "Pages": 1
  },
  "JobStatus": "SUCCEEDED",
  "ExpenseDocuments": [
    {
      "ExpenseIndex": 1,
      "SummaryFields": [
        {
          "Type": {
            "Text": "TOTAL",
            "Confidence": 89.66414642333984
          },
          "LabelDetection": {
            "Text": "TOTAL",
            "Confidence": 89.2239990234375
          },
          "ValueDetection": {
            "Text": "53900",
            "Confidence": 87.2876205444336
          },
          "PageNumber": 1
        },
        {
          "Type": {
            "Text": "INVOICE_RECEIPT_DATE",
            "Confidence": 95.58136749267578
          },
          "LabelDetection": {
            "Text": "INVOICE AND\nSUPPLY DATE",
            "Confidence": 95.3720474243164
          },
          "ValueDetection": {
            "Text": "04/04/2022",
            "Confidence": 94.0542984008789
          },
          "PageNumber": 1
        }
      ]
    }
  ]
}

What can I use to extract just the Text values from every “LabelDetection” and the “ValueDetection”?
(and possibly return them as JSON data)

I understand there is a ‘Simplify Response’ option but that is too simple, it doesn’t return the actual TOTAL and some other important data that is required.

There are over 30 ‘SummaryFields’. As you can see, they are not split into unique data sets but into multiple ‘groups’ within the SummaryFields block, and I need to run through them all. I tried the Item Lists node, but I wasn’t able to use it effectively.

(There was ‘Polygon’ and ‘Geometry’ data in the JSON that I removed for brevity.)

Thank you.

bocaz · April 16, 2022, 1:55pm

So far I’ve been using this post:

To try to get the specific data using a function and create JSON.

Original:

Transform Node from above Workflow:

let result = [];

$json.response.forEach(row => {
    row.stores.forEach(e => {
        result.push({
            json: {
                id: row.id,
                name: e.name,
                type: e.type,
            },
        });
    });
});

return result;

My version:

My transform node:

let result = [];

$json.response.forEach(row => {
row.ExpenseDocuments.SummaryFields.forEach(e => {
        result.push({
            json: {
                name: e.LabelDetection.Text,
                type: e.ValueDetection.Text
            },
        });
    });
});

return result;

However I get:

“Cannot read property ‘forEach’ of undefined [Line 4]”

I have also tried:

let result = [];

$json.response.forEach(row => {
row.ExpenseDocuments.forEach(e => {
        result.push({
            json: {
                name: e.SummaryFields.LabelDetection.Text,
                type: e.SummaryFields.ValueDetection.Text
            },
        });
    });
});

return result;

Not sure what I’m doing wrong.
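
Going by the sample JSON in the first post, both attempts seem to fail for the same reason: ExpenseDocuments and SummaryFields are both arrays, so each one needs its own loop before LabelDetection and ValueDetection can be read. Below is a minimal sketch that runs outside n8n, assuming the input has the shape shown above. The `sample` object and the `extractPairs` helper are made up for illustration; inside a Function node you would pass in `$json` (or each element of `$json.response`, depending on where your data sits) instead.

```javascript
// Hypothetical sample, trimmed from the Textract output in the first post.
const sample = {
  ExpenseDocuments: [
    {
      SummaryFields: [
        { LabelDetection: { Text: "TOTAL" }, ValueDetection: { Text: "53900" } },
        { LabelDetection: { Text: "INVOICE AND\nSUPPLY DATE" }, ValueDetection: { Text: "04/04/2022" } },
      ],
    },
  ],
};

// extractPairs is an illustrative helper, not part of n8n or Textract.
function extractPairs(doc) {
  const result = [];
  // ExpenseDocuments is an array, so loop over it first...
  doc.ExpenseDocuments.forEach((expense) => {
    // ...and SummaryFields is an array too, so it needs a second loop.
    expense.SummaryFields.forEach((field) => {
      result.push({
        json: {
          name: field.LabelDetection.Text,
          value: field.ValueDetection.Text,
        },
      });
    });
  });
  return result;
}

const pairs = extractPairs(sample);
console.log(JSON.stringify(pairs, null, 2));
```

Writing `row.ExpenseDocuments.SummaryFields` asks the array itself for a SummaryFields property, which is undefined, hence the forEach error.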

bocaz · April 16, 2022, 4:26pm

Okay so following this YT tutorial: https://www.youtube.com/watch?v=wGAEAcfwV8w&

I’ve used a Set node to get the ‘SummaryFields’ subset of data that I need (see OP) and set it as pdfdata.

[
  {
    "pdfdata": [
      {
        "Type": {
          "Text": "TOTAL",
          "Confidence": 89.66414642333984
        },
        "LabelDetection": {
          "Text": "TOTAL",
          "Confidence": 89.2239990234375
        },
        "ValueDetection": {
          "Text": "53900",
          "Confidence": 87.2876205444336
        },
        "PageNumber": 1
      },
      {
        "Type": {
          "Text": "INVOICE_RECEIPT_DATE",
          "Confidence": 95.58136749267578
        },
        "LabelDetection": {
          "Text": "INVOICE AND\nSUPPLY DATE",
          "Confidence": 95.3720474243164
        },
        "ValueDetection": {
          "Text": "04/04/2022",
          "Confidence": 94.0542984008789
        },
        "PageNumber": 1
      }
    ]
  }
]

I then created a new Function and used:


return [{
  json: {
    stores: items[0].json.stores.map(s => { return {
        name: s.name,
        type: s.type
      }
    })
  }
}];

from here: How can I select part of json with Function or Set node - #4 by MutedJam

Then I turned it into this:

return [{
  json: {
    pdfdata: items[0].json.pdfdata.map(s => { return {
        name: s.LabelDetection.Text,
        value: s.ValueDetection.Text
      }
    })
  }
}];

So my result is:

[
  {
    "pdfdata": [
      {
        "name": "TOTAL",
        "value": "53900"
      },
      {
        "name": "INVOICE AND SUPPLY DATE",
        "value": "04/04/2022"
      }
    ]
  }
]

Which is almost there!

But I want the key to be the actual label text rather than the literal “name” (as in “name”:“TOTAL”), i.e. I want to end up with this:

{"TOTAL":"53900"}

I tried:

pdfdata: items[0].json.pdfdata.map(s => { return {
        s.LabelDetection.Text: s.ValueDetection.Text
}})

But that doesn’t work…
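
The attempt above is a syntax error because an object literal key cannot be a dotted expression. JavaScript's computed property names, written with square brackets, allow an expression as the key. A small sketch, assuming the pdfdata array produced by the Set node above; `pdfdata`, `perField`, and `merged` are illustrative stand-ins:

```javascript
// Stand-in for items[0].json.pdfdata, trimmed to the relevant keys.
const pdfdata = [
  { LabelDetection: { Text: "TOTAL" }, ValueDetection: { Text: "53900" } },
  { LabelDetection: { Text: "INVOICE AND SUPPLY DATE" }, ValueDetection: { Text: "04/04/2022" } },
];

// One object per field; the square brackets evaluate the expression as the key.
const perField = pdfdata.map((s) => ({
  [s.LabelDetection.Text]: s.ValueDetection.Text,
}));

// Or merge everything into a single object keyed by label text.
const merged = Object.fromEntries(
  pdfdata.map((s) => [s.LabelDetection.Text, s.ValueDetection.Text])
);

console.log(perField);
console.log(merged);
```

In the Function node, the same bracket syntax would go inside the existing `items[0].json.pdfdata.map(...)` call.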

BillAlex · April 17, 2022, 7:57am

Thx for the example @bocaz. This helps a lot.

My solution addresses your first request:

What can I use to extract just the Text values from every “LabelDetection” and the “ValueDetection”?

The solution:


BillAlex · April 17, 2022, 8:59am

But perhaps this solution is more in line with the requirements:

Prerequisites:

The AWS output is wrapped in a response key, and its value is an array
All SummaryFields should be output as a single row

bocaz · April 17, 2022, 11:40am

@BillAlex

The first idea had an issue on line 7, but the second idea (AWS Textract - run through data and get values? - #5 by BillAlex) seems like it would work.

I might be doing something wrong but I get:

Cannot read property ‘forEach’ of undefined [Line 3]

There is only one ExpenseDocuments entry, so I don’t think it needs a ‘forEach’.

When I hover over the $json it pulls through the Textract data:

But when I hover over the ExpenseDocuments it does not:

Which seems to suggest it is not pulling the data past the initial entry point of $json ?

I also realised that not every piece of data has a LabelDetection so I had to add an if statement.

I ended up going with this:

Which works great!

bocaz · April 17, 2022, 11:50am

Final solution in code:

// bocaz and BillAlex - LabelDetection, ValueDetection
// https://community.n8n.io/t/aws-textract-run-through-data-and-get-values/13190/8
const result = [];
$json.ExpenseDocuments.forEach(({SummaryFields}) => {
  const row = {};
  SummaryFields.forEach(({LabelDetection, ValueDetection}) => {
    // Not every summary field has a LabelDetection; skip those that don't.
    if (LabelDetection == null) {
      return;
    }
    row[LabelDetection.Text] = ValueDetection.Text;
  });
  result.push(row);
});

// Map it for n8n
return result.map(row => ({json: row}));

The above code is generic and should work for anyone else looking to pull their Summary Fields from an AWS Textract result!

bocaz · April 17, 2022, 12:13pm

I do have one question, how do I go a level deeper?

Before SummaryFields there is ‘Line Items’. However, it is nested deeper than the Summary Fields.

[
  {
    "DocumentMetadata": {
      "Pages": 1
    },
    "ExpenseDocuments": [
      {
        "ExpenseIndex": 1,
        "LineItemGroups": [
          {
            "LineItemGroupIndex": 1,
            "LineItems": [
              {
                "LineItemExpenseFields": [
                  {
                    "LabelDetection": {
                      "Confidence": 93.6885986328125,
                      "Text": "PRODUCT CODE"
                    },
                    "PageNumber": 1,
                    "Type": {
                      "Confidence": 70,
                      "Text": "OTHER"
                    },
                    "ValueDetection": {
                      "Confidence": 99.73140716552734,
                      "Text": "834756"
                    }
                  }
                ]
              }
            ]
          }
        ]
      }
    ]
  }
]

So I need to get to: LineItemGroups > LineItems > LineItemExpenseFields > LabelDetection

I have tried:

$json.ExpenseDocuments.LineItemGroups.forEach(({LineItems}) => {

and

$json.ExpenseDocuments.forEach(({LineItemGroups}) => {
    $json.LineItemGroups.forEach(({LineItems}) => {

Neither of which works. Any suggestions?

BillAlex · April 21, 2022, 8:14am

I think what you need is a basic understanding of JavaScript - specifically arrays and objects. And you need to know the expressions and methods of n8n. Sorry

Links:

In particular: *.filter(), *.forEach(), *.map(), *.reduce()
https://docs.n8n.io/code-examples/expressions/methods/
https://docs.n8n.io/code-examples/expressions/variables/
https://docs.n8n.io/data/code/

$json.ExpenseDocuments.LineItemGroups.forEach(({LineItems}) => {

This can’t work because ExpenseDocuments is an array. You need to specify an index. If it is static, you can do this:

$json.ExpenseDocuments[0]

Or you go through all indexes - this makes it more generic:

$json.ExpenseDocuments.forEach(callbackFunction) // or use another array method, for example *.map()

$json.ExpenseDocuments.forEach(({LineItemGroups}) => {
    $json.LineItemGroups.forEach(({LineItems}) => {

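…does not work either, because $json is a special n8n object that refers to the input item’s data, not to the ExpenseDocuments element the loop is currently on. The destructured callback parameter LineItemGroups already holds the array you need.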

$json.LineItemGroups.forEach() // doesn't work
LineItemGroups.forEach() // works

A good way to see what a variable currently contains is console.log({varName}). console.log() prints the content to the console. The {varName} notation is shorthand for {varName: varName}. Logging an object this way makes it easier to tell which variable each value belongs to, and it lets you print several variables clearly at once.
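
As a small illustration of that shorthand, here is a sketch with two made-up variables (the values are placeholders, not real Textract data):

```javascript
// Placeholder values for illustration only.
const LineItemGroups = [{ LineItemGroupIndex: 1 }];
const LineItems = [];

// Without the braces, the output does not say which value is which.
console.log(LineItemGroups, LineItems);

// With shorthand properties, every value is labelled with its variable name.
console.log({ LineItemGroups, LineItems });

// The shorthand is equivalent to writing the keys out by hand.
const labelled = { LineItemGroups, LineItems };
const explicit = { LineItemGroups: LineItemGroups, LineItems: LineItems };
```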

In your specific case you could now write the following:

Function Code:

const data = items.map(e => e.json);
// `$json` has only the first row. That is why I have changed to items
// You can also use `$items()` instead of `items`
// If you have only one line as input, you can still use $json.
const result = [];

data.forEach(({ExpenseDocuments}) => {
/* this is the short form of 
 * data.forEach((el) => {
 *   const ExpenseDocuments = el.ExpenseDocuments;
 *   ...
 * })
 */

  try{
    if(!ExpenseDocuments)
      throw Error('`ExpenseDocuments` not set.')
    ExpenseDocuments.forEach(({LineItemGroups}) => {
      if(!LineItemGroups)
        throw Error('`LineItemGroups` not set.')
      LineItemGroups.forEach(({LineItems}) => {
        if(!LineItems)
          throw Error('`LineItems` not set.')
        LineItems.forEach(({LineItemExpenseFields}) => {
          if(!LineItemExpenseFields)
            throw Error('`LineItemExpenseFields` not set.')
          const row = {};
          LineItemExpenseFields.forEach(({LabelDetection, ValueDetection}) => {
            if(!LabelDetection)
              throw Error('`LabelDetection` not set.')
            if(!ValueDetection)
              throw Error('`ValueDetection` not set.')
            row[LabelDetection.Text] = ValueDetection.Text
          })
          result.push(row)
        })
      })
    })
  } catch(e) {
    result.push({
      errMsg: e.message
    })
  }
})

return result.map(el => { return {json: el}});

If objects or attributes are still missing, that of course has to be handled. If I had the option, though, I would always try to provide a uniform object. My error handling (try{ throw Error() } catch(){}) is quick and dirty.
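
One possible way to handle missing attributes without throwing is to fall back to an empty array and to skip incomplete fields. The sketch below is only an alternative to the try/catch above under stated assumptions: `lineItemRows` is a hypothetical helper, the `doc` sample is invented, and the runtime is assumed to support optional chaining and `??` (Node.js 14 or later).

```javascript
// lineItemRows is an illustrative helper, not part of n8n or Textract.
function lineItemRows(doc) {
  const rows = [];
  // Each `?? []` substitutes an empty array when a level is missing.
  for (const { LineItemGroups } of doc.ExpenseDocuments ?? []) {
    for (const { LineItems } of LineItemGroups ?? []) {
      for (const { LineItemExpenseFields } of LineItems ?? []) {
        const row = {};
        for (const { LabelDetection, ValueDetection } of LineItemExpenseFields ?? []) {
          // Skip fields without a label; store null when the value is missing.
          if (!LabelDetection) continue;
          row[LabelDetection.Text] = ValueDetection?.Text ?? null;
        }
        rows.push(row);
      }
    }
  }
  return rows;
}

// Invented sample with deliberately incomplete entries.
const doc = {
  ExpenseDocuments: [
    {
      LineItemGroups: [
        {
          LineItems: [
            {
              LineItemExpenseFields: [
                { LabelDetection: { Text: "PRODUCT CODE" }, ValueDetection: { Text: "834756" } },
                { ValueDetection: { Text: "no label" } },
                { LabelDetection: { Text: "PRICE" } },
              ],
            },
          ],
        },
      ],
    },
    { ExpenseIndex: 2 }, // no LineItemGroups at all
  ],
};

const rows = lineItemRows(doc);
console.log(rows);
```

Whether silently skipping is preferable to reporting an error depends on how much you trust the input; the try/catch version makes gaps visible, while this one keeps every row uniform.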

I hope these explanations help you further.

bocaz · April 22, 2022, 7:02pm

Thanks, I actually figured out the solution. It’s not too difficult once you realise what it does; it just took a bit of extended research and some trial and error.

Allwynpradip · April 25, 2022, 5:45am

@bocaz Could you share the solution here? It would be helpful.