Use-Case - Scrape Site with Login

Hello!

First wanted to say I have followed the n8n project for quite some time now and have always wanted to play around with it. Finally got around to doing it and I love it!

So the subject line says it all: is it possible to scrape a website that requires a login?

What I would like to do:

  • Scrape my work's website, behind my account, to get my e-schedule.
  • Organize that data into a Google Sheet.
  • Add the events to a Google Calendar.
  • Alert me when it runs and when it is done, and report what was added.

Now I know the last 3 are 100% easy peasy lemon squeezy, but the first has me stumped. I know I could probably use HTTP Requests to send the login information and then scrape the site (once logged in and holding the proper cookies) to get the schedule (still trying to learn how to even do that :sleepy:)… but would it be easier with something that acts like the Selenium project?

https://www.selenium.dev/

Integrating something like this could (if I’m not mistaken :hot_face: ?) enable web scraping on almost any web page, couldn’t it?
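
Just to make the idea concrete, here is a rough sketch of what I imagine the Selenium route would look like with the selenium-webdriver NodeJS package (the URL, credentials, and selectors are all made up for illustration):

```ts
// Hypothetical sketch: log in through a real browser and read the schedule text.
// 'https://work.example.com' and the selectors are placeholders, not a real site.
import { Builder, By, until } from 'selenium-webdriver';

async function fetchSchedule(): Promise<string> {
  const driver = await new Builder().forBrowser('chrome').build();
  try {
    await driver.get('https://work.example.com/login');
    await driver.findElement(By.name('username')).sendKeys('my-user');
    await driver.findElement(By.name('password')).sendKeys('my-password');
    await driver.findElement(By.css('button[type="submit"]')).click();

    // Wait for the schedule element to appear after the post-login redirect.
    await driver.wait(until.elementLocated(By.css('.schedule')), 10000);
    return await driver.findElement(By.css('.schedule')).getText();
  } finally {
    await driver.quit();
  }
}
```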

Food for thought! Again, love the project! Thank you all for the hard work!
-wired

Hey @wiredbrother,

This is one of those situations where I believe n8n would be better used to connect to a service and control it rather than being given the ability natively. There are some great NodeJS-based web scraping libraries that do this really well and could probably be integrated into n8n, but I believe it would be a lot of work for a fairly specific use case.

If you were to set up Jauntium or a similar web scraping system that exposes an API, a node to control that server could be created around the API. Creating that node would probably be a lot easier than building a whole new scraping system, or even integrating a NodeJS scraping library.

Thank you for the response @Tephlon!

Sadly, I don’t have enough coding knowledge as of yet to code anything into n8n :sweat_smile:

Is there any way to accomplish the scraping of data (with a login form) with what n8n has currently?

Depending on what your business uses for its website, it might be possible to use the HTML Extract node to pull the data from the site. The key would be whether you could create an authenticated session to the website in a browser and then reuse that same session with n8n. That way, it would not have to log in itself.

But, that would be a shot in the dark.
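
To make that concrete, here is a rough sketch of the same idea in code: reuse a session cookie copied from an already logged-in browser and pull values out of the HTML. The cookie name, URL, and CSS selector are placeholders from an imaginary site; in n8n the equivalent would be an HTTP Request node with a Cookie header followed by an HTML Extract node.

```ts
// Sketch only: reuse an existing authenticated browser session instead of logging in.
// Copy the session cookie from DevTools (Application > Cookies) of a logged-in tab.
import * as cheerio from 'cheerio';

const SESSION_COOKIE = 'sessionid=PASTE_VALUE_FROM_BROWSER'; // placeholder cookie name/value

async function scrapeSchedule(): Promise<string[]> {
  // Node 18+ has fetch built in.
  const res = await fetch('https://work.example.com/schedule', {
    headers: { Cookie: SESSION_COOKIE },
  });
  const html = await res.text();

  // Extract one text entry per schedule row; the selector is a guess.
  const $ = cheerio.load(html);
  return $('.schedule-row')
    .map((_, el) => $(el).text().trim())
    .get();
}
```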

The other thing you could do is take a look at whether your website provider offers an API. A lot of web-based collaboration software has an API, which you could potentially connect to with the HTTP Request node to pull in your calendar.
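
If an API does exist, pulling the calendar usually boils down to a single authenticated request. A purely hypothetical example (the endpoint and token header depend entirely on the provider):

```ts
// Hypothetical provider API; replace the URL and auth header with whatever
// your provider actually documents.
async function fetchCalendar(): Promise<unknown> {
  const res = await fetch('https://provider.example.com/api/v1/calendar', {
    headers: { Authorization: 'Bearer YOUR_API_TOKEN' },
  });
  if (!res.ok) throw new Error(`Calendar API request failed: ${res.status}`);
  // Typically JSON that can be fed straight into the Google Sheets/Calendar nodes.
  return res.json();
}
```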

You are correct. After playing around with it in Chrome, the authenticated session approach would work, but the “browsers” would need to look identical in pretty much every way for it to even come close to working.

The website does have an API, but it is restricted to very specific people. This is why a universal method of imitating a human using Jauntium or Selenium and logging in could be very useful, as it wouldn’t just work for one website but for the large majority of sites.

It would allow HTML scraping of virtually anything with ease, not just public pages.

It would be useful; I’m just not sure how to implement something like that.

Hi wiredbrother,
logging in with a form is usually done with a POST request. The response then typically sets a cookie on the client for authentication and authorization (which may not always be needed) and redirects to another page after the login. From there you may need to navigate to yet another page where the info you want to scrape lives, keeping the cookie in place the whole time.
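
A rough sketch of that flow (the URLs, form field names, and cookie handling are assumptions and will differ from site to site):

```ts
// Sketch of a form login: POST the credentials, keep the Set-Cookie value,
// then request the protected page with that cookie. All URLs/fields are placeholders.
async function loginAndFetchSchedule(): Promise<string> {
  const loginRes = await fetch('https://work.example.com/login', {
    method: 'POST',
    headers: { 'Content-Type': 'application/x-www-form-urlencoded' },
    body: new URLSearchParams({ username: 'my-user', password: 'my-password' }),
    redirect: 'manual', // don't follow the post-login redirect, we only need the cookie
  });
  const cookie = loginRes.headers.get('set-cookie') ?? '';

  const pageRes = await fetch('https://work.example.com/schedule', {
    headers: { Cookie: cookie },
  });
  return pageRes.text();
}
```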

Web scraping is not easy.

All these things differ from site to site. There are specialized frameworks in many languages that handle all the intricacies. You could get started with apify.com on the free plan - it would allow you to update your data from the site 10 times / month. After that it gets a bit expensive, but at least you know that it works, and you will have learned how to build a scraper without the infrastructure challenges.

If you have many sites like this that you would like to extract data from and they don’t have an API, a paid Apify subscription might become worthwhile.


Yes, I agree with everything that has been said.
Building something like that directly into n8n would for sure be possible, but it is probably out of scope. Using an existing external service like apify.com, which is specifically built to handle those cases, and then simply calling their API via an HTTP Request node (and at some point building a custom integration for their service) would be the best bet right now.
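
To give an idea of what that could look like, here is a rough sketch of starting an Apify actor run over HTTP and reading the results; the actor ID and input are placeholders and the exact endpoints should be checked against Apify's API docs. The same URLs, method, and JSON body would go straight into HTTP Request nodes.

```ts
// Sketch of driving Apify via its REST API. Token, actor ID, and input are placeholders.
const APIFY_TOKEN = 'YOUR_APIFY_TOKEN';

async function runScraper(): Promise<unknown> {
  // Start an actor run (here the generic web-scraper actor) and wait briefly for it to finish.
  const run = await fetch(
    `https://api.apify.com/v2/acts/apify~web-scraper/runs?token=${APIFY_TOKEN}&waitForFinish=60`,
    {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      // Real input for this actor needs more fields (e.g. a pageFunction); simplified here.
      body: JSON.stringify({ startUrls: [{ url: 'https://work.example.com/schedule' }] }),
    }
  ).then((r) => r.json());

  // Read the scraped items from the run's default dataset.
  const datasetId = run.data.defaultDatasetId;
  return fetch(
    `https://api.apify.com/v2/datasets/${datasetId}/items?token=${APIFY_TOKEN}`
  ).then((r) => r.json());
}
```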


Looks like I will have to add another project to my ever-growing list! :grin: