It’s a little difficult to explain, so I’ll explain what I was previously doing. I was using an httprequest node to scrape data from a webpage via a cloud function, by watching Oskar’s video. I used chatgpt to figure out how to pass parameters to the cloud function. I used a GSheets to store the many URLs and the css selectors so that I could pass them all in the URL, however there were issues with that process that were beyond my non-programming ability. So I switched to the puppeteer node to return the html as json data which is much easier and set the data. However I’d still like to store the many different URLs and now the many different element paths to each item in Sheets, but I can’t figure out how to get the “value of a value” I guess.
In the photo the set node correctly pulls in the name from the json provided by puppeteer. So I stored ‘{{ $json.pageContent[“pa-root”][“div”][“div”][1][“div”][“ng-component”][“pa-home”][“div”][“div”][2][“div”][0] [“pa-propertyinformation”][“div”][“table”][“tr”][4][“td”][“div”][“div”][“div”][“#text”]}}’ into a cell in my spreadsheet available 2 nodes back in the ownerEl column (the left side of the photo is my sheets node, 2 nodes back with the exact value as the ownerElfromPuppeteerJson on the right that pulls the correct info.
I’m sure I’m doing it wrong, but is it even possible to have ownerElfromGoogleSheets to return Maria Diaz, instead of the element that it’s returning?
According to the output, i see that you are trying to work with HTML directly, but that should be avoided due to the big number of possible issues that may occur if the page receive update from the author.
Unfortunately I’m unable to share the workflow, but essentially, the second item points to the first item, but instead of getting “MARIA A DIAZ” as my desired output, I of course get the element pointing to “MARIA A DIAZ” and think it’s probably not possible.
I believe you are correct regarding issues with page changes, that’s why I wanted to store the location of the element, in a spreadsheet to fill dynamically into n8n. Then I could make changes to the element on the spreadsheet if the webpage changes. As this is just one of many webpages I need to scrape, and the different pages has the elements in a different location on the page.
Is there a better approach then trying to work with HTML directly?
Thanks again barn4k for trying to help as I knew what I wanted would be difficult to explain, but I’ll give it another try. The HTML node I believe gives me the same results that I already have.
There’s not an issue that I’m experiencing with n8n, what I think I may be needing is a variable. Since what I want to do is replace the value with a variable that will give me the same results. The first parameter item in the pic is the json structure that gives me the correct result of “Maria A Diaz” (which I can use, but I have about 70 different webpages to scrape) and the second item in the photo was my attemp to replace it with a variable that would give me the same results, but instead, my variable yields the json structure, instead of the result of the json structure.
Ordinarily for example when you scrape data, you would scrape let’s say 70+ names off of a webpage, which is simple and straight forward, however, I’m attempting to scrape 1 name from 70+ different websites, all with different structures of course. I have the 70+ websites in a google sheet, so looping through the different URLs of each page from google sheets for n8n to navigate to, via http request node or puppeteer node is no problem, the problem I’m having is I was initially looping through 70+ css selectors, but thought that it would be a nightmare to manage, so was trying to find a better way that would be easier to update when website structure changed. But I’m beginning to think that looping through the css, though a possible nightmare, may be the best option.
If you are trying to extract the person’s name (and you already know that name from the 1st website), than it will be easier to look for the name with regex (like ctrl+f in the browser) in the subsequent sites. The issue here is that the name could be of various structure (name, name + surname, surname + name, name + middle name, name short surname and so on). But still it’s easier to accomplish than trying to deal with each site’s css selectors.
Unfortunately I don’t already know the name, I only know the URL of the page that contains the name, address, etc. That’s why I need to visit the URL to get the name and other info. Each URL is a different property with a different name and different structure. So for example:
URL1 → county 1 with name 1 with property address 1
URL2 → county 2 with owner 2 with site address 2
URL3–> county 3 with owners 3 with address 3
Only the URL and the county is known. So that’s why I think that I need to make css rules for each county.
The issue here is that the name could be of various structure (name, name + surname, surname + name, name + middle name, name short surname and so on)
You are exactly right about that, it definately is. I previously thought of regex, but I think that that would work only if all sites used one title like “Owner’s Name” but they don’t and sometimes called “owner” and sometimes there are multiple owners. And there’s no way for me to know ahead of time, but I can parse the name into the proper sequence later though.
I don’t think that I can get around having to make rules for each site and storing them somewhere.