Website extract plus AI analysis flow, help please?

Hello,

I would like to find out if this is possible or if you can see any issues I may run into here.

I have a list of domain names I need to categorize with AI: basically, write the niche name into a column in a Google Sheet (or CSV) at scale. Essentially I start with a Google Sheet of domains in column A and want to run a workflow that fills out column B with the high-level niche category of each website, like “plumbing business”. There are thousands of them, 10k to 50k per month.

The idea is that each domain goes to an n8n workflow, n8n scrapes basic info from the homepage of the site (readable text is enough for now, but for future use cases I want to scrape the whole page including the code), and returns that as unstructured data, like a text document. That then gets sent to the Anthropic API (n8n seems to be able to do advanced AI stuff, so this integration won’t be a problem, right?) for the wonderful little cheap Haiku (or Cohere) to answer questions about the page; in this case, what is the niche name (I’ve got a good prompt already). I already have high rate limits from them for this project, 5,000 per minute, so no issue there. Are there n8n rate limits?
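Roughly the per-domain logic I have in mind, as a plain Node/TypeScript sketch rather than actual n8n nodes (the model name, prompt wording, and API-key environment variable are just placeholder assumptions):

```typescript
// Sketch of the per-domain step: fetch the homepage, reduce it to readable text,
// and ask Haiku for the niche. Model id, prompt and env var are assumptions.

async function categorizeDomain(domain: string): Promise<string> {
  // 1. Scrape the homepage (simple GET; JavaScript-rendered sites won't work here).
  const html = await (await fetch(`https://${domain}`)).text();

  // 2. Crude HTML-to-text so fewer tokens get sent to the LLM.
  const text = html
    .replace(/<script[\s\S]*?<\/script>|<style[\s\S]*?<\/style>/gi, " ")
    .replace(/<[^>]+>/g, " ")
    .replace(/\s+/g, " ")
    .slice(0, 8000); // keep the payload small

  // 3. Ask the Anthropic Messages API (Haiku) for a short niche label.
  const res = await fetch("https://api.anthropic.com/v1/messages", {
    method: "POST",
    headers: {
      "x-api-key": process.env.ANTHROPIC_API_KEY!, // assumed env var
      "anthropic-version": "2023-06-01",
      "content-type": "application/json",
    },
    body: JSON.stringify({
      model: "claude-3-haiku-20240307", // example model id
      max_tokens: 50,
      messages: [
        {
          role: "user",
          content: `What is the high-level niche of this website, in 2-4 words (e.g. "plumbing business")?\n\n${text}`,
        },
      ],
    }),
  });
  const data = await res.json();
  return data.content?.[0]?.text?.trim() ?? "unknown";
}
```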

Then the LLM returns the answer to the question, which is then saved back to the Google Sheet. Quite simple in theory, but can I ask: is loading a sheet of 5,000 domains and then looping through all of them to get the answers one workflow execution?
So the question then is how long can workflows like this run, are there any limits? It seems there is an Extract HTML Content node I can use for the scraping?
They are all different websites, so I don’t think I need proxies. I am sure extracting HTML content from 5k to 50k websites will put load on the flow and the system, so are there any limits with this?
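To make the looping question concrete, this is the kind of batched loop I imagine, sketched in TypeScript (I assume n8n has its own way of splitting work into batches or sub-workflows; the batch size and pause here are arbitrary assumptions):

```typescript
// Sketch of batching the loop so memory use and request rate stay bounded.
// "handle" would be something like categorizeDomain from the earlier sketch.

async function processAll(
  domains: string[],
  handle: (d: string) => Promise<string>,
  batchSize = 25,
  pauseMs = 1000
): Promise<Record<string, string>> {
  const results: Record<string, string> = {};
  for (let i = 0; i < domains.length; i += batchSize) {
    const batch = domains.slice(i, i + batchSize);
    // Run one batch concurrently; a failed site becomes "error" instead of killing the run.
    const settled = await Promise.allSettled(batch.map((d) => handle(d)));
    settled.forEach((r, j) => {
      results[batch[j]] = r.status === "fulfilled" ? r.value : "error";
    });
    // Brief pause between batches to stay well under rate limits.
    await new Promise((resolve) => setTimeout(resolve, pauseMs));
  }
  return results;
}
```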

I don’t think I need to save the extracted data to a DB, as it can be discarded after the LLM reads it, but maybe I should to start with, for debugging the prompt and for later queries on the same site to test other data analysis with the AI for more use cases. Is there a recommended cheap way to set this up and save it to a database, even a Google Sheets cell, a simple DB, or Supabase, as a way to cache completed sites, if it’s affordable? What do you recommend? It seems a smarter way to go instead of scraping the same sites again later (I’m also considering a Google Sheets extension later for website scraping and analysis, so a DB would be required?).
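What I mean by caching is nothing fancier than this sketch, here using a local JSON file just to show the shape (a Supabase table, SQLite, or even an extra sheet column would follow the same check-before-scrape, write-after-answer pattern; the file name is a placeholder):

```typescript
// Sketch of a very cheap "already processed?" cache keyed by domain.
import { readFileSync, writeFileSync, existsSync } from "node:fs";

const CACHE_FILE = "scrape-cache.json"; // assumed path

type CacheEntry = { niche: string; scrapedAt: string };

function loadCache(): Record<string, CacheEntry> {
  return existsSync(CACHE_FILE)
    ? JSON.parse(readFileSync(CACHE_FILE, "utf8"))
    : {};
}

function saveCache(cache: Record<string, CacheEntry>): void {
  writeFileSync(CACHE_FILE, JSON.stringify(cache, null, 2));
}

async function categorizeWithCache(
  domain: string,
  categorize: (d: string) => Promise<string>
): Promise<string> {
  const cache = loadCache();
  if (cache[domain]) return cache[domain].niche; // skip re-scraping known sites
  const niche = await categorize(domain);
  cache[domain] = { niche, scrapedAt: new Date().toISOString() };
  saveCache(cache);
  return niche;
}
```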

Summary

  1. Set up a workflow that loops through a list of URLs, scraping a single page from each (for now)
  2. Feed the raw scraped HTML to Claude or another LLM API to extract the desired information as a structured response per domain (see the sketch below this list)
  3. Save the URL and extracted data to a spreadsheet or CSV file
  4. Scale up by running multiple scraping workflows in parallel, being mindful of rate limits and co.

Any advice?
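For steps 2 and 3, something like this is what I mean by a structured response saved per domain (TypeScript sketch; the field names are just assumptions about what I might extract):

```typescript
// Ask the model to reply with JSON only, parse it defensively,
// then append one CSV row per domain.
import { appendFileSync } from "node:fs";

interface SiteInfo {
  niche: string; // e.g. "plumbing business"
  confidence?: string;
}

function parseStructuredAnswer(raw: string): SiteInfo {
  try {
    return JSON.parse(raw) as SiteInfo;
  } catch {
    return { niche: raw.trim() }; // fall back to treating the reply as the plain niche text
  }
}

function appendRow(csvPath: string, domain: string, info: SiteInfo): void {
  // Quote values and escape embedded quotes so the CSV stays valid.
  const line = `"${domain}","${info.niche.replace(/"/g, '""')}"\n`;
  appendFileSync(csvPath, line);
}
```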

Thanks!

It looks like your topic is missing some important information. Could you provide the following, if applicable?

  • n8n version:
  • Database (default: SQLite):
  • n8n EXECUTIONS_PROCESS setting (default: own, main):
  • Running n8n via (Docker, npm, n8n cloud, desktop app):
  • Operating system:

Hi, those questions aren’t really relevant for me. Nothing is set up yet; I’m not using it and haven’t decided on the tech setup. I simply want to understand, and get a pro opinion before spending/wasting time setting things up, whether the platform is appropriate for my needs, or whether there are any roadblocks a non-user can’t see because of the scale and load. Do I need to self-host? Would cloud be too costly due to the scale of 50k scrapes, etc.? Do you see any high-level issues?

Is there a different forum for that? I just need an experienced opinion on whether my process is realistic and workable at scale and speed; I don’t need support, just guidance or a :+1: Is that fine?

Thanks!

This is all possible, and if you’re self-hosted you’re only limited by how powerful your server is, but you wouldn’t want to do it all in one workflow. It would be better to split it up.

If you really had issues with scaling, you could set up queue mode to split the work over multiple servers.

I’m afraid you won’t have much luck finding someone on the forum to make a workflow for you that’s that complicated; that is something you will need to hire a consultant for, unless you have more specific questions.

Thanks for the response. Could you point out specifically which aspect is complicated so I can understand better? Maybe I can do something about it or compromise. Do you mean at my scale specifically, or the workflow in general?

From my perspective it seems simple: for each website row in the sheet, I just scrape the website data using the available scraper node, send all the data on to the LLM node (which it can now accept thanks to large context limits; one page is very few tokens) with a predefined question prompt, get the answer back, and then save that answer back into the sheet or CSV etc. alongside the original domain?

Not sure which part I am misunderstanding in terms of complexity, or does that come from wanting to loop through 10k, i.e. if I wanted to do, say, 50 it would be simple? Would appreciate any clarity. Thanks

Each piece isn’t complicated, but putting it all together makes it so. Maybe complicated isn’t the word so much as it would be some work.

The biggest complication is your scale at 10k. If you run everything in one workflow you will need to set up sub-workflows to ensure you don’t run out of memory. And I don’t know if that means 10k ever, 10k per month, or 10k per day.

The workflow might not be super complicated, but making it tailored to work for you will usually take troubleshooting.

Also, the built-in “scraper” can’t get information from all pages, like pages built with JavaScript. Those will need to be scraped with an actual scraping tool like Puppeteer. There are also a lot of pages with anti-scraping security, which will kick in even more because services like Cloudflare will see you’re visiting tons of websites very quickly.
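For those JavaScript-rendered pages, the fallback looks roughly like this Puppeteer sketch (just an illustration; the launch options, timeout, and waitUntil setting are assumptions you would tune):

```typescript
// Use a headless browser to get the text of a page that only appears after
// client-side rendering, which a plain HTTP GET / HTML extract cannot see.
import puppeteer from "puppeteer";

async function scrapeRenderedText(url: string): Promise<string> {
  const browser = await puppeteer.launch({ headless: true });
  try {
    const page = await browser.newPage();
    await page.goto(url, { waitUntil: "networkidle2", timeout: 30000 });
    // Grab the visible text once rendering has settled.
    return await page.evaluate(() => document.body.innerText);
  } finally {
    await browser.close();
  }
}
```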

So yeah, it is pretty complicated and there are a lot of factors to account for.

If you are just testing, I would suggest getting it to work with a small group of sites. Then slowly try to scale up and solve one problem at a time. Once you run into a problem that you can’t solve on your own, you can come back to the forum and will get better help with a very specific question.

Very much appreciate the feedback, thank you, that helps me consider things more. I’ll do that last bit you said, run some simpler tests, and consider and document all the factors. It’s good to have some idea now of what the factors are.

Just to clarify: it’s mostly scraping unique business sites for research and categorization, not popular protected sites. Also, it’s about 50k per month, so I could easily get away with 2,000–5,000 across two days and then do more a few days later, so it’s not an ongoing large volume, just sporadic research. But 10k in a day, even if it took hours, would still be fine; it’s for internal enrichment data, not public use with people waiting for results.

I’m not sure what else I can tell you.

Use the HTTP Request node with a GET request to start experimenting with getting sites’ HTML, and go from there.
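Outside n8n, the equivalent first experiment is just a GET and a look at what comes back, something like this sketch (the URL and User-Agent are placeholders):

```typescript
// Minimal first experiment: GET a site's HTML and inspect it.
// The n8n HTTP Request node does the same job.
async function getHtml(url: string): Promise<void> {
  const res = await fetch(url, {
    headers: { "User-Agent": "Mozilla/5.0 (research bot)" }, // some sites reject empty user agents
  });
  const html = await res.text();
  console.log(res.status, html.slice(0, 500)); // peek at the status and the first bit of markup
}

getHtml("https://example.com");
```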
