Autonomous AI crawler

Hey :wave:

Some time ago, I came up with the idea of an autonomous crawler built with n8n AI agents. Even though it doesn’t always perform perfectly, in many cases it does the job very well. Feel free to watch the video below, where I dive a bit more into the details of its construction :hammer_and_wrench:

Also, please let me know what you think of the new video format. It’s a bit experimental for me, so your feedback will be much appreciated!


Nice work Oskar! I enjoyed the format, it feels more ‘personal’ and the slightly lower pace made it a lot easier to follow too.


Thank you Bart, appreciate the feedback!


Really well explained and a great use for agents! You mentioned hallucinations as one of the caveats; just curious, what types have you come across? Are these exclusively to do with the URLs?


Thanks for your kind feedback!

URLs are one thing, but I have also noticed that sometimes the agent confuses the data, especially when navigating deep into a page and trying to connect the information. For more basic data (like social media profile links or a company profile summary) this shouldn’t happen often, but for more specialized or niche information it can become an issue (speaking from experience here, of course).


Great tutorial Oskar! A quick question: will this work for websites that require login credentials or two-factor auth? I’d be fine with some kind of step that asks the user for credentials, or with a prerequisite that you need to be logged in before running the crawler…
Just checking whether you’ve had the chance to test it in this kind of environment.
Thanks,


Hey @gerfum, thanks for watching my tutorial, glad you like it!

This workflow is built to crawl only public sources (websites that don’t require logins, accounts, etc.). I’m pretty sure that with some adjustments and rebuilding it could work with closed sources as well, but to be honest, I haven’t tested it in such environments. It also matters what the login process looks like: in this tutorial I used simple HTTP requests, which can be somewhat limited in this area.

If you want to crawl only specific websites (rather than many websites with different layouts), you may want to build a script, e.g. with Puppeteer, that acts specifically on those chosen websites (I have a few tutorials on my channel about Puppeteer, so feel free to check them out). I’d also consider the ethics of scraping data behind a login wall.
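For what that Puppeteer approach could look like, here is a minimal sketch of logging into one specific site before extracting its content. Everything in it is an assumption for illustration: the `loginUrl` and the form selectors (`#username`, `#password`, `button[type=submit]`) are placeholders you would replace with the target site’s actual markup, and `puppeteer` must be installed separately (`npm i puppeteer`).

```javascript
// Hypothetical sketch: log into a single known site with Puppeteer,
// then return the page HTML for downstream extraction (e.g. by an AI agent).
async function crawlBehindLogin({ loginUrl, username, password }) {
  // Required lazily so the file parses even where puppeteer isn't installed.
  const puppeteer = require('puppeteer');

  const browser = await puppeteer.launch({ headless: true });
  try {
    const page = await browser.newPage();
    await page.goto(loginUrl, { waitUntil: 'networkidle2' });

    // Placeholder selectors -- adjust to the target site's login form.
    await page.type('#username', username);
    await page.type('#password', password);

    // Submit and wait for the post-login navigation to settle.
    await Promise.all([
      page.waitForNavigation({ waitUntil: 'networkidle2' }),
      page.click('button[type=submit]'),
    ]);

    // Markup of the logged-in page, ready for parsing/extraction.
    return await page.content();
  } finally {
    await browser.close();
  }
}
```

Note this only covers a plain username/password form; two-factor auth would still need a manual step (for example, pausing for the user to enter a code), which is one reason a site-specific script beats a generic crawler here.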

Thank you @oskar for your reply!
