Document Anonymization Workflow
Hello all! I’m very happy to be getting to know n8n and all the possibilities it opens up.
I have a project I want to build and share, and I’m looking for suggestions on how to accomplish it with the lowest PC specs and the simplest flow possible. That said, consistency and reliability matter more to me than low PC specs or simplicity.
I want to make a document anonymizer that keeps some of the data in a set of documents anonymous before a larger online frontier LLM is used to produce reports and summaries from them (PDF, Word and Excel files).
So the idea is to have a flow where the documents get split for easier handling, and then the chunks are scanned for names of people, places, companies and brands, as well as phone numbers, coordinates and addresses.
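The split-and-scan step might look something like the sketch below. It is only an illustration: `split_into_chunks`, `find_pattern_entities` and both regular expressions are made-up names and patterns, not something n8n provides. They only cover the pattern-like items (phone numbers and coordinates); names of people, places, companies and brands would still need the local LLM or an NER model, and the patterns would need tuning for real documents (dates such as `15-01-2024` can look like phone numbers, for instance).

```python
import re

# Hypothetical patterns for illustration; real formats vary by country and source.
PHONE_RE = re.compile(r"\+?\d{1,3}[\s.-]\(?\d{2,4}\)?[\s.-]\d{3,4}(?:[\s.-]\d{3,4})?")
COORD_RE = re.compile(r"-?\d{1,2}\.\d{3,},\s*-?\d{1,3}\.\d{3,}")


def split_into_chunks(text, max_chars=1000):
    """Pack paragraphs into chunks of roughly max_chars characters.

    A single paragraph longer than max_chars is kept whole as its own chunk.
    """
    chunks, current = [], ""
    for para in text.split("\n\n"):
        if current and len(current) + len(para) + 2 > max_chars:
            chunks.append(current)
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks


def find_pattern_entities(chunk):
    """Return (kind, text) pairs for coordinates and phone numbers in the chunk."""
    found = [("coordinates", m.group()) for m in COORD_RE.finditer(chunk)]
    found += [("phone", m.group()) for m in PHONE_RE.finditer(chunk)]
    return found
```

Each chunk returned by `split_into_chunks` would be passed to `find_pattern_entities`, and the matches combined with whatever names the LLM finds before replacement.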
As the chunks get “scanned”, the LLM creates a CSV or Google Sheet where one column holds the real data (for example, a name or address) and the next column holds a new fake name or address that the LLM makes up.
As the LLM keeps scanning, every time it finds a new item to replace, it checks whether an entry for that item already exists. If it does, it uses the fake data already created; if it doesn’t, it creates the fake data, adds it to the table and replaces the item in the chunk.
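The check-then-create logic for the table could be sketched roughly as follows, using a plain two-column CSV file. Everything here is an assumption for illustration: `PseudonymTable`, `anonymize_chunk` and the `make_fake` callback are hypothetical names, and `make_fake` stands in for whatever the local LLM would do to invent a replacement.

```python
import csv
import os
import re


class PseudonymTable:
    """Maps each real value to one stable fake value, stored as a two-column CSV."""

    def __init__(self, path):
        self.path = path
        self.real_to_fake = {}
        if os.path.exists(path):
            with open(path, newline="", encoding="utf-8") as f:
                for real, fake in csv.reader(f):
                    self.real_to_fake[real] = fake

    def get_or_create(self, real, make_fake):
        """Reuse the existing fake for `real`, or create and store a new one."""
        if real not in self.real_to_fake:
            self.real_to_fake[real] = make_fake(real)
        return self.real_to_fake[real]

    def save(self):
        """Write the whole table back to the CSV file (real, fake per row)."""
        with open(self.path, "w", newline="", encoding="utf-8") as f:
            csv.writer(f).writerows(self.real_to_fake.items())


def anonymize_chunk(chunk, entities, table, make_fake):
    """Replace every detected entity in the chunk with its fake from the table."""
    if not entities:
        return chunk
    # Longest first, so "Ana Lopez" is matched before the shorter "Ana".
    ordered = sorted(set(entities), key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(e) for e in ordered))
    # A single pass avoids re-replacing text inside fakes that were just inserted.
    return pattern.sub(lambda m: table.get_or_create(m.group(), make_fake), chunk)
```

Because only the listed entities are matched, anything else in the chunk (amounts, results, dates) passes through unchanged. Calling `table.save()` after each document keeps the same table available for the rest of the project.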
Once each document is done, it is uploaded to the RAG vector database.
This same table will be used for all documents in the current project and saved so it can later be used to reverse the anonymization.
The idea is that the anonymization step is done locally before creating the vector database, and at the end, when we have a final summary or report that gathers all the documents, we can reverse the anonymization locally and put the original data back. It is important not to replace other data from the report, such as amounts, results of calculations or studies, and other quantitative and qualitative information from the source files, since that might affect the conclusions or findings in the reports created.
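The reversal step could be as simple as the sketch below, which assumes the table has been loaded into a dictionary of real value to fake value. The function name `deanonymize` is hypothetical. It only matches fake values exactly as they appear in the table, so if the online LLM rephrases one (for example, shortening a full name to a surname), that occurrence would not be restored.

```python
import re


def deanonymize(text, real_to_fake):
    """Put the original values back by replacing each fake with its real value."""
    fake_to_real = {fake: real for real, fake in real_to_fake.items()}
    if not fake_to_real:
        return text
    # Longest fakes first, so a longer fake is not broken by a shorter one inside it.
    ordered = sorted(fake_to_real, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(f) for f in ordered))
    # Only table entries are touched; amounts and other figures are left as they are.
    return pattern.sub(lambda m: fake_to_real[m.group()], text)


mapping = {"Ana Lopez": "Maria Stone", "Acme Corp": "Borel Ltd"}
report = "Maria Stone reported that Borel Ltd revenue grew 12% to 1,500,000 USD."
restored = deanonymize(report, mapping)
```

Here `restored` has the original names back, while the percentage and the amount stay exactly as the report stated them.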