Document Anonymization Workflow

Seb3D · December 11, 2024, 2:25pm

Hello all! I’m very happy to be getting to know n8n and all the possibilities it opens up.

I have a project I want to build and share, and I’m looking for suggestions on how to accomplish it with the lowest PC specs and the simplest flow possible. Consistency and reliability are the most important factors, though, ahead of PC specs and simplicity.

I want to make a document anonymizer that keeps some of the data in a document anonymous before I use a larger online frontier LLM to produce reports and summaries from different documents (PDF, Word, and Excel).

So the idea is to have a flow where the documents get split for easier handling, and then the chunks are scanned for names of people, places, companies, and brands, as well as phone numbers, coordinates, and addresses.
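
A rough sketch of the splitting and scanning step in plain Python, covering only the items a regex can catch (phone numbers and coordinates). The function names, the 1000-character chunk size, and both patterns are assumptions for illustration; real phone and coordinate formats vary a lot, and names of people, places, and companies would still need an LLM or NER pass.

```python
import re

# Hypothetical patterns; adjust them to the formats in your own documents.
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")
COORD_RE = re.compile(r"-?\d{1,2}\.\d+,\s*-?\d{1,3}\.\d+")


def split_into_chunks(text, max_chars=1000):
    """Split text on blank lines, packing paragraphs into chunks of up to max_chars."""
    chunks, current = [], ""
    for para in re.split(r"\n\s*\n", text):
        para = para.strip()
        if not para:
            continue
        if current and len(current) + len(para) + 2 > max_chars:
            # Adding this paragraph would overflow the chunk, so start a new one.
            chunks.append(current)
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks


def scan_chunk(chunk):
    """Return the regex-detectable items found in a chunk, grouped by type."""
    coords = COORD_RE.findall(chunk)
    # Blank out coordinates first so their digits are not also read as phone numbers.
    remainder = COORD_RE.sub(" ", chunk)
    phones = [match.strip() for match in PHONE_RE.findall(remainder)]
    return {"phone": phones, "coordinates": coords}
```

Splitting on paragraph boundaries rather than at a fixed character offset avoids cutting a name or number in half between two chunks.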

As the chunks get “scanned”, the LLM creates a CSV or Google Sheet where one column holds the real data, for example a name or address, and the next column holds a new fake name or address that the LLM makes up.

As the LLM keeps scanning, every time it finds a new item to replace, it checks whether an entry for that item has already been created. If it has, it reuses the fake data already created; if it hasn’t, it creates a new entry, adds it to the table, and replaces the item in the chunk.
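
The lookup-or-create logic does not need an LLM at all and could be done deterministically. A minimal sketch, assuming the detected items arrive as (real value, kind) pairs; the class name `PseudonymTable` and the numbered placeholders such as `PERSON_1` are made up for this example, and realistic fake names could be substituted where the placeholder is generated.

```python
import csv
import io


class PseudonymTable:
    """Maps each real value to one stable fake value."""

    def __init__(self):
        self.real_to_fake = {}
        self.counters = {}

    def get_or_create(self, real, kind):
        """Reuse the existing fake value for `real`, or create and store a new one."""
        if real not in self.real_to_fake:
            self.counters[kind] = self.counters.get(kind, 0) + 1
            # Placeholder fakes like PERSON_1; swap in generated names if preferred.
            self.real_to_fake[real] = f"{kind.upper()}_{self.counters[kind]}"
        return self.real_to_fake[real]

    def anonymize(self, chunk, found):
        """Replace every (real, kind) pair in `found` inside the chunk."""
        # Longest first, so "Anna Smith" is replaced before a bare "Anna".
        for real, kind in sorted(found, key=lambda item: len(item[0]), reverse=True):
            chunk = chunk.replace(real, self.get_or_create(real, kind))
        return chunk

    def to_csv(self):
        """Serialize the table as two CSV columns: real, fake."""
        buffer = io.StringIO()
        writer = csv.writer(buffer)
        writer.writerow(["real", "fake"])
        writer.writerows(self.real_to_fake.items())
        return buffer.getvalue()
```

Because the same table instance is reused across chunks and documents, an item seen a second time always gets the fake value it was given the first time.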

Once each document is done, it will be uploaded to the RAG vector database.

This same table will be used for all documents in the current project and saved so it can later be used to reverse the anonymization.

The idea is that the anonymization step is done locally before creating the vector database, and at the end, when we have a final summary or report that gathers all documents, we can reverse the anonymization locally and put the original data back. It is important not to replace other data from the report, such as amounts, results of calculations or studies, and other quantitative and qualitative information from the source files, since that might affect the conclusions or findings within the reports created.
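
The reverse step could look something like the sketch below, assuming the table has been loaded into a plain dict of real value to fake value and that the fake values are distinctive tokens (the `PERSON_1` style used here is an assumption). Only exact fake values are touched, so amounts and other figures in the report pass through unchanged.

```python
import re


def deanonymize(report, real_to_fake):
    """Swap every fake value in the report back to its real value."""
    fake_to_real = {fake: real for real, fake in real_to_fake.items()}
    if not fake_to_real:
        return report
    # Try longer fakes first so PERSON_1 never matches inside PERSON_10.
    fakes = sorted(fake_to_real, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(fake) for fake in fakes))
    # A single pass means restored real values are never re-scanned.
    return pattern.sub(lambda match: fake_to_real[match.group(0)], report)
```

One caveat: if the online LLM rephrases or inflects a fake value in its summary, an exact-match reversal like this will miss it, which is one argument for fake values that look like opaque tokens.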

n8n · December 11, 2024, 2:25pm

It looks like your topic is missing some important information. Could you provide the following if applicable.

n8n version:
Database (default: SQLite):
n8n EXECUTIONS_PROCESS setting (default: own, main):
Running n8n via (Docker, npm, n8n cloud, desktop app):
Operating system:

gualter · December 18, 2024, 10:52am

hi @Seb3D

Welcome to the community! Makes perfect sense to me, I was also brainstorming a similar workflow of anonymizing data so it’s a very valid use case.

Have you tried tinkering with n8n to make it happen? I suggest going through our AI docs that walk through the basics:

Hope this helps!

lollorosso · January 21, 2025, 8:54pm

I have the same need and have not managed to get to a working solution. Does anyone have any examples? My best attempt was a Code node with regex first, then letting an LLM node do the rest, but I could not get the LLM to output strictly JSON. Are there any working workflows for this?
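
One possible workaround for the JSON problem is to stop requiring strict output and instead pull the first valid JSON array out of whatever the model replies. A small sketch using only Python's standard `json` module; the expected item shape (`text` and `type` keys) and the function name are assumptions about how the prompt would be written.

```python
import json


def parse_llm_entities(reply):
    """Extract the first JSON array from an LLM reply and keep only well-formed items."""
    decoder = json.JSONDecoder()
    for start, char in enumerate(reply):
        if char != "[":
            continue
        try:
            # raw_decode parses one JSON value starting at `start` and ignores what follows.
            data, _ = decoder.raw_decode(reply, start)
        except json.JSONDecodeError:
            continue
        if isinstance(data, list):
            return [
                {"text": item["text"], "type": item["type"]}
                for item in data
                if isinstance(item, dict)
                and isinstance(item.get("text"), str)
                and isinstance(item.get("type"), str)
            ]
    # No parsable array found; the caller can retry the LLM call.
    return []
```

This tolerates chatty preambles and Markdown code fences around the JSON, and an empty result can be used as the signal to retry the request.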

Seb3D · January 21, 2025, 9:15pm

Hi, yes, I’m tinkering.
My thought right now is to split the job into many simple steps to reduce the chance of the AI making a mistake or hallucinating. The first step is to split the text into chunks and have the model output all the data to be replaced, but the model gets confused and doesn’t output the requested data. Maybe I will split the task so that it first looks only for names of people, then in a second pass for names of places, after that for company names, and so on…
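
The one-category-per-pass idea might be sketched like this. The prompt wording, the category list, and the `ask_llm` parameter are all assumptions; `ask_llm` stands for whatever function sends a prompt to the local model and returns its reply as text.

```python
CATEGORIES = ["person", "place", "company", "brand", "address"]


def build_prompt(category, chunk):
    """Build a narrow prompt that asks for a single entity category."""
    return (
        f"List every {category} mentioned in the text below, "
        "one per line, copied exactly as written. "
        "If there are none, reply with NONE.\n\n"
        f"{chunk}"
    )


def extract_by_category(chunk, ask_llm):
    """Run one pass per category; ask_llm is any callable mapping a prompt to reply text."""
    found = []
    for category in CATEGORIES:
        reply = ask_llm(build_prompt(category, chunk))
        for line in reply.splitlines():
            item = line.strip()
            # Keep only items that literally occur in the chunk, which drops hallucinated names.
            if item and item != "NONE" and item in chunk:
                found.append((item, category))
    return found
```

Checking each returned item against the chunk text is a cheap guard: anything the model invents cannot be replaced anyway, so it is safe to discard.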