AWS Textract - run through data and get values?

Hi,

The data returned by Amazon Textract is not exactly straightforward. Here is an example:

{
  "DocumentMetadata": {
    "Pages": 1
  },
  "JobStatus": "SUCCEEDED",
  "ExpenseDocuments": [
    {
      "ExpenseIndex": 1,
      "SummaryFields": [
        {
          "Type": {
            "Text": "TOTAL",
            "Confidence": 89.66414642333984
          },
          "LabelDetection": {
            "Text": "TOTAL",
            "Confidence": 89.2239990234375
          },
          "ValueDetection": {
            "Text": "53900",
            "Confidence": 87.2876205444336
          },
          "PageNumber": 1
        },
        {
          "Type": {
            "Text": "INVOICE_RECEIPT_DATE",
            "Confidence": 95.58136749267578
          },
          "LabelDetection": {
            "Text": "INVOICE AND\nSUPPLY DATE",
            "Confidence": 95.3720474243164
          },
          "ValueDetection": {
            "Text": "04/04/2022",
            "Confidence": 94.0542984008789
          },
          "PageNumber": 1
        }
      ]
    }
  ]
}


What can I use to extract just the Text values from every “LabelDetection” and the “ValueDetection”?
(and possibly return them as JSON data)

I understand there is a ‘Simplify Response’ option but that is too simple, it doesn’t return the actual TOTAL and some other important data that is required.

There are over 30 ‘SummaryFields’ entries. As you can see, they are not split into unique data sets but are multiple ‘groups’ within the SummaryFields block, and I need to run through them all. I tried the Item Lists node, but I wasn’t able to use it effectively.

(There was ‘Polygon’ and ‘Geometry’ data in the JSON that I removed for brevity.)

Thank you.

So far I’ve been using this post:

To try to get the specific data using a function and create JSON.

Original:

Transform Node from above Workflow:

let result = [];

$json.response.forEach(row => {
    row.stores.forEach(e => {
        result.push({
            json: {
                id: row.id,
                name: e.name,
                type: e.type,
            },
        });
    });
});

return result;

My version:

My transform node:

let result = [];

$json.response.forEach(row => {
row.ExpenseDocuments.SummaryFields.forEach(e => {
        result.push({
            json: {
                name: e.LabelDetection.Text,
                type: e.ValueDetection.Text
            },
        });
    });
});

return result;

However I get:

“Cannot read property ‘forEach’ of undefined [Line 4]”

I have also tried:

let result = [];

$json.response.forEach(row => {
row.ExpenseDocuments.forEach(e => {
        result.push({
            json: {
                name: e.SummaryFields.LabelDetection.Text,
                type: e.SummaryFields.ValueDetection.Text
            },
        });
    });
});

return result;

Not sure what I’m doing wrong :confused:
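
Both attempts fail for the same reason: in the sample above, `ExpenseDocuments` and `SummaryFields` are both arrays, so `row.ExpenseDocuments.SummaryFields` and `e.SummaryFields.LabelDetection` are `undefined`. Below is a minimal sketch with one loop per array level. The `textract` sample and the `extractSummaryPairs` helper are made up for illustration, and in an n8n Function node the input would come from `$json` or `items` rather than a local constant.

```javascript
// Trimmed sample shaped like the Textract output quoted above.
const textract = {
  ExpenseDocuments: [
    {
      ExpenseIndex: 1,
      SummaryFields: [
        { LabelDetection: { Text: "TOTAL" }, ValueDetection: { Text: "53900" } },
        {
          LabelDetection: { Text: "INVOICE AND\nSUPPLY DATE" },
          ValueDetection: { Text: "04/04/2022" },
        },
      ],
    },
  ],
};

// Hypothetical helper: ExpenseDocuments is an array and SummaryFields is an
// array, so each level gets its own forEach.
function extractSummaryPairs(doc) {
  const pairs = [];
  doc.ExpenseDocuments.forEach((expense) => {
    expense.SummaryFields.forEach((field) => {
      pairs.push({
        name: field.LabelDetection.Text,
        value: field.ValueDetection.Text,
      });
    });
  });
  return pairs;
}

const pairs = extractSummaryPairs(textract);
console.log(pairs);
```

In a Function node, each entry of `pairs` would still need to be wrapped as `{json: ...}` before being returned.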

Okay so following this YT tutorial: https://www.youtube.com/watch?v=wGAEAcfwV8w&

I’ve used Set node to get the ‘SummaryFields’ subset of data that I need (see OP) and set it as pdfdata.

[
  {
    "pdfdata": [
      {
        "Type": {
          "Text": "TOTAL",
          "Confidence": 89.66414642333984
        },
        "LabelDetection": {
          "Text": "TOTAL",
          "Confidence": 89.2239990234375
        },
        "ValueDetection": {
          "Text": "53900",
          "Confidence": 87.2876205444336
        },
        "PageNumber": 1
      },
      {
        "Type": {
          "Text": "INVOICE_RECEIPT_DATE",
          "Confidence": 95.58136749267578
        },
        "LabelDetection": {
          "Text": "INVOICE AND\nSUPPLY DATE",
          "Confidence": 95.3720474243164
        },
        "ValueDetection": {
          "Text": "04/04/2022",
          "Confidence": 94.0542984008789
        },
        "PageNumber": 1
      }
    ]
  }
]

I then created a new Function and used:

from here: How can I select part of json with Function or Set node - #4 by MutedJam

Then I turned it into this:

return [{
  json: {
    pdfdata: items[0].json.pdfdata.map(s => { return {
        name: s.LabelDetection.Text,
        value: s.ValueDetection.Text
      }
    })
  }
}];

So my result is:

[
  {
    "pdfdata": [
      {
        "name": "TOTAL",
        "value": "53900"
      },
      {
        "name": "INVOICE AND SUPPLY DATE",
        "value": "04/04/2022"
      }
    ]
  }
]

Which is almost there! :smiley:

But I want the ‘name’ to be the actual data, not “name”:“TOTAL”, i.e. I want to end up with this:

{"TOTAL":"53900"}

I tried:

pdfdata: items[0].json.pdfdata.map(s => { return {
        s.LabelDetection.Text: s.ValueDetection.Text
}})

But that doesn’t work…
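
The attempt above is a syntax error because an object literal key cannot be an arbitrary expression unless it is wrapped in square brackets (a computed property name). Below is a minimal sketch of two options, using a trimmed `pdfdata` sample in place of `items[0].json.pdfdata`. It assumes a Node.js version that supports `Object.fromEntries` (Node 12 or later).

```javascript
// Trimmed stand-in for items[0].json.pdfdata from the Set node output.
const pdfdata = [
  { LabelDetection: { Text: "TOTAL" }, ValueDetection: { Text: "53900" } },
  {
    LabelDetection: { Text: "INVOICE AND\nSUPPLY DATE" },
    ValueDetection: { Text: "04/04/2022" },
  },
];

// Option 1: one small object per field, keyed by the label text.
// The square brackets make the key a computed property name.
const perField = pdfdata.map((s) => ({
  [s.LabelDetection.Text]: s.ValueDetection.Text,
}));

// Option 2: a single object holding every label/value pair.
const merged = Object.fromEntries(
  pdfdata.map((s) => [s.LabelDetection.Text, s.ValueDetection.Text])
);

console.log(perField);
console.log(merged);
```

Option 1 keeps the array shape of the earlier result. Option 2 is probably closer to the desired `{"TOTAL":"53900"}` form when all fields belong to one document.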

Thx for the example @bocaz. This helps a lot.

My solution addresses the first request:

What can I use to extract just the Text values from every “LabelDetection” and the “ValueDetection”?

The solution:


But perhaps this solution is more in line with the requirements:

Assumptions:

  • The AWS result sits under a response key, and its value is an array
  • The SummaryFields of each document should be output as a single row

@BillAlex

The first idea had a line 7 issue but for the second idea (AWS Textract - run through data and get values? - #5 by BillAlex), that seems like it would work.

I might be doing something wrong but I get:

Cannot read property ‘forEach’ of undefined [Line 3]

There is only one ExpenseDocuments so I don’t think it needs a ‘forEach’

When I hover over the $json it pulls through the Textract data:

But when I hover over the ExpenseDocuments it does not:

Which seems to suggest it is not pulling the data past the initial entry point of $json ?

I also realised that not every piece of data has a LabelDetection so I had to add an if statement.

I ended up going with this:

Which works great!

Final solution in code:

// bocaz and BillAlex - LabelDetection, ValueDetection
// https://community.n8n.io/t/aws-textract-run-through-data-and-get-values/13190/8
const result = [];
$json.ExpenseDocuments.forEach(({SummaryFields}) => {
  // One output object per expense document: { label text: value text }
  const _r = {};
  SummaryFields.forEach(({LabelDetection, ValueDetection}) => {
    // Some fields have no LabelDetection; skip them
    if (LabelDetection == null) {
      return;
    }
    _r[LabelDetection.Text] = ValueDetection.Text;
  });
  result.push(_r);
});

// Map it for n8n
return result.map(_result => ({json: _result}));

The above code is generic and should work for anyone else looking to pull their Summary Fields from an AWS Textract result!

I do have one question, how do I go a level deeper?

Before SummaryFields there is ‘Line Items’. However it is deeper than the Summary Fields.

[
  {
    "DocumentMetadata": {
      "Pages": 1
    },
    "ExpenseDocuments": [
      {
        "ExpenseIndex": 1,
        "LineItemGroups": [
          {
            "LineItemGroupIndex": 1,
            "LineItems": [
              {
                "LineItemExpenseFields": [
                  {
                    "LabelDetection": {
                      "Confidence": 93.6885986328125,
                      "Text": "PRODUCT CODE"
                    },
                    "PageNumber": 1,
                    "Type": {
                      "Confidence": 70,
                      "Text": "OTHER"
                    },
                    "ValueDetection": {
                      "Confidence": 99.73140716552734,
                      "Text": "834756"
                    }
                  }
                ]
              }
            ]
          }
        ]
      }
    ]
  }
]

So I need to get to: LineItemGroups > LineItems > LineItemExpenseFields > LabelDetection

I have tried:

$json.ExpenseDocuments.LineItemGroups.forEach(({LineItems}) => {

and

$json.ExpenseDocuments.forEach(({LineItemGroups}) => {
    $json.LineItemGroups.forEach(({LineItems}) => {

Neither of which work, any suggestions?

I think what you need is a basic understanding of JavaScript - specifically arrays and objects. And you need to know the expressions and methods of n8n. Sorry :worried:

Links:

In particular: *.filter(), *.forEach(), *.map(), *.reduce()
https://docs.n8n.io/code-examples/expressions/methods/
https://docs.n8n.io/code-examples/expressions/variables/
https://docs.n8n.io/data/code/

$json.ExpenseDocuments.LineItemGroups.forEach(({LineItems}) => {

This can’t work, because ExpenseDocuments is an array. You need to specify an index. If the position is fixed, you can do this:

$json.ExpenseDocuments[0]

Or you go through all indexes - this makes it more generic:

$json.ExpenseDocuments.forEach(callback) // or use another array method, for example *.map()

$json.ExpenseDocuments.forEach(({LineItemGroups}) => {
    $json.LineItemGroups.forEach(({LineItems}) => {

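…does not work either, because $json is a special n8n object: it refers to the incoming item’s data, not to the element the loop is currently on. Inside the callback, use the destructured variable directly:

$json.LineItemGroups.forEach() // doesn't work
LineItemGroups.forEach() // works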

A good way to see what a variable currently contains is console.log({varName}). console.log() prints the content to the console. The {varName} notation is the short form of {varName: varName}. Logging an object this way labels each value with its variable name, so it is quicker to tell which variable is which, and several variables can be printed clearly at once.
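
As a small illustration of the shorthand described above, here is a sketch with two made-up variables standing in for values inside a loop. The names and values are hypothetical and not taken from the real Textract response.

```javascript
// Hypothetical values standing in for variables inside a nested loop.
const LineItemGroupIndex = 1;
const LineItems = [{ LineItemExpenseFields: [] }];

// Shorthand property names: { a, b } is the same as { a: a, b: b }.
const snapshot = { LineItemGroupIndex, LineItems };

// Logging the object labels every value with its variable name.
console.log(snapshot);

// JSON.stringify prints the full nested structure as indented text,
// which can be easier to read for deeply nested data.
console.log(JSON.stringify(snapshot, null, 2));
```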

In your specific case you could now write the following:

Function Code:

const data = items.map(e => e.json);
// `$json` has only the first row. That is why I have changed to items
// You can also use `$items()` instead of `items`
// If you have only one line as input, you can still use $json.
const result = [];

data.forEach(({ExpenseDocuments}) => {
/* this is the short form of 
 * data.forEach((el) => {
 *   const ExpenseDocuments = el.ExpenseDocuments;
 *   ...
 * })
 */

  try{
    if(!ExpenseDocuments)
      throw Error('`ExpenseDocuments` not set.')
    ExpenseDocuments.forEach(({LineItemGroups}) => {
      if(!LineItemGroups)
        throw Error('`LineItemGroups` not set.')
      LineItemGroups.forEach(({LineItems}) => {
        if(!LineItems)
          throw Error('`LineItems` not set.')
        LineItems.forEach(({LineItemExpenseFields}) => {
          if(!LineItemExpenseFields)
            throw Error('`LineItemExpenseFields` not set.')
          const row = {};
          LineItemExpenseFields.forEach(({LabelDetection, ValueDetection}) => {
            if(!LabelDetection)
              throw Error('`LabelDetection` not set.')
            if(!ValueDetection)
              throw Error('`ValueDetection` not set.')
            row[LabelDetection.Text] = ValueDetection.Text
          })
          result.push(row)
        })
      })
    })
  } catch(e) {
    result.push({
      errMsg: e.message
    })
  }
})

return result.map(el => { return {json: el}});

If objects or attributes can still be missing, that of course has to be handled. But given the choice, I would always try to produce a uniform object. My error handling (try{ throw Error() } catch(){}) is quick and dirty :sweat_smile:
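
For comparison, here is a more compact sketch of the same traversal that tolerates missing levels instead of throwing. It is not the code from this thread: the `extractLineItems` helper and the sample `data` are made up for illustration, and it assumes a Node.js version that supports `flatMap` and the `??` operator (Node 14 or later).

```javascript
// Trimmed sample shaped like the line-item JSON quoted earlier in the thread.
const data = [
  {
    ExpenseDocuments: [
      {
        LineItemGroups: [
          {
            LineItems: [
              {
                LineItemExpenseFields: [
                  {
                    LabelDetection: { Text: "PRODUCT CODE" },
                    ValueDetection: { Text: "834756" },
                  },
                  // No LabelDetection here, so this field is skipped.
                  { ValueDetection: { Text: "unlabelled" } },
                ],
              },
            ],
          },
        ],
      },
    ],
  },
];

// Hypothetical helper: flattens each array level in turn and returns
// one { label: value } object per line item.
function extractLineItems(rows) {
  return rows
    .flatMap((row) => row.ExpenseDocuments ?? [])
    .flatMap((doc) => doc.LineItemGroups ?? [])
    .flatMap((group) => group.LineItems ?? [])
    .map((item) => {
      const out = {};
      for (const field of item.LineItemExpenseFields ?? []) {
        // Only keep fields that have both a label and a value.
        if (field.LabelDetection && field.ValueDetection) {
          out[field.LabelDetection.Text] = field.ValueDetection.Text;
        }
      }
      return out;
    });
}

const lineItems = extractLineItems(data);
console.log(lineItems);
```

The trade-off is that missing levels are silently skipped rather than reported, so the explicit checks in the code above remain the better choice when a missing key should be visible in the output.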

I hope all the explanations help you further :wink:

Thanks I actually figured out the solution, it’s not too difficult once you realise what it does, just took a bit of extended research and some trial and error :smile: :vulcan_salute:

@bocaz Can you share the solution here? It would be helpful.