Scrape site for page URLs, titles, and meta descriptions

Describe the problem/error/question

I'm new to n8n and trying to set up a workflow that crawls a site using its XML sitemap, then requests each page and extracts the URL, page title, and meta description, so the descriptions can be rewritten and sent to Google Sheets. The final output keeps repeating the same URL, title, and meta description. I could not get it to output all pages, even though the HTTP request returned this information for them.

Any help appreciated.

What is the error message (if any)?

Please share your workflow


Share the output returned by the last node

Information on your n8n setup

  • n8n version: ?
  • Database (default: SQLite): none
  • n8n EXECUTIONS_PROCESS setting (default: own, main): ?
  • Running n8n via (Docker, npm, n8n cloud, desktop app): Docker
  • Operating system: Latest Mac OS

Hey @jsquared, hope all is well. Welcome to the community.

Could you please:

  • Share your workflow instead of a screenshot.
  • Pinpoint/describe the exact problem you are having (e.g. “Node X is expected to do Y, but instead does Z”).

The problem is in the Code1 node.

It returns only one item, which causes the error.

You shouldn't use a Code node when the Edit Fields (Set) node can easily do the same job.
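To illustrate what "only returns 1 item" means, here is a minimal standalone Python sketch (not the OP's actual Code node, and not n8n's API). It assumes the HTTP step produced a list of pages, each with a `url` and its `html`; those field names, the `MetaParser` class, and the sample pages are made up for the example. The point is that the loop emits one record per input page instead of reading only the first one.

```python
from html.parser import HTMLParser


class MetaParser(HTMLParser):
    """Collects the <title> text and the meta description of one HTML page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.description = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and (attrs.get("name") or "").lower() == "description":
            self.description = attrs.get("content") or ""

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract_all(pages):
    """Return one output record per input page, not just the first page."""
    records = []
    for page in pages:
        parser = MetaParser()  # fresh parser per page so values do not leak
        parser.feed(page["html"])
        records.append({
            "url": page["url"],
            "title": parser.title.strip(),
            "description": parser.description,
        })
    return records


# Hypothetical sample input standing in for the HTTP request output.
pages = [
    {
        "url": "https://example.com/",
        "html": '<head><title>Home</title>'
                '<meta name="description" content="Welcome"></head>',
    },
    {
        "url": "https://example.com/about",
        "html": '<head><title>About</title>'
                '<meta name="description" content="Who we are"></head>',
    },
]
records = extract_all(pages)
```

If the code only ever looked at `pages[0]`, every downstream row would carry the same URL, title, and description, which matches the symptom described above.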

Thanks for responding.

I’m afraid I don’t quite understand. I am brand new to the platform and watching videos to learn it.

Where would I insert the nodes you suggested? Looking at my workflow, it looks like they might go before the Code1 node?

I appreciate any help in learning this.

Replace the portion of your workflow up to the OpenAI node with that.

OK, I tried that. It only writes one row, and the row does not include the URL.

It’s the same error again in the next Code node.

I just rewrote the whole workflow.

I just wonder which video teaches using a Code node that way.

Note that I changed the model to 4o-mini. If you still want to use 3.5 Turbo, you will need to change the model.

And here is an alternative with no Code node and no Set node.

Final result

| URL | DESC | DESC_LEN | NEW_DESC | NEW_DESC_LEN |
| --- | --- | --- | --- | --- |
| https://www.j2studio.com/ | J2 Studio is an award winning Tampa web design company and graphic design firm providing professional creative services and branding materials. | 143 | Elevate your brand with Tampa’s award-winning web design and graphic design firm, J2 Studio. Get professional creative services and stunning branding materials today. | 166 |
| About Tampa Web Design Company - J2 Studio | Learn more about J2 Studio, an award winning advertising, graphic design and web design firm in Tampa, FL providing professional creative services. | 147 | J2 Studio in Tampa, FL is an award-winning advertising, graphic design, and web design firm. Experts in creative services with a proven track record. | 149 |
| Tampa Graphic and Web Design Reviews - J2 Studio | J2 Studio, an award winning advertising, graphic design and web design firm in Tampa, has received many 5-star Google reviews for creative services. | 148 | Discover why J2 Studio is the top choice for award-winning creative services in Tampa. See our 5-star Google reviews and elevate your brand today! | 146 |

Hey, SEO veteran here.

I would like to add that GPT-4.1 has a context window of one million tokens.

The OP could probably read all of the metadata (titles and descriptions), ask the model to SEO-optimize hundreds of them at once, and ask for the answer to be structured as JSON, which can then be transformed into a list of rows in Google Sheets.
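As a rough sketch of that last step, here is how a structured JSON reply could be turned into sheet rows in plain Python. The reply shape (a JSON array of objects with `url` and `new_desc` keys), the `json_to_rows` helper, and the sample data are assumptions for illustration; the column names are the ones from the result table earlier in the thread.

```python
import json

# Column order taken from the result table shared earlier in this thread.
COLUMNS = ["URL", "DESC", "DESC_LEN", "NEW_DESC", "NEW_DESC_LEN"]


def json_to_rows(model_reply, originals):
    """Turn the model's JSON reply into one sheet row per page.

    model_reply: JSON text, assumed to be a list of
                 {"url": ..., "new_desc": ...} objects.
    originals:   dict mapping each URL to its current meta description.
    """
    rows = []
    for entry in json.loads(model_reply):
        url = entry["url"]
        desc = originals.get(url, "")
        new_desc = entry["new_desc"]
        rows.append([url, desc, len(desc), new_desc, len(new_desc)])
    return rows


# Hypothetical sample data, only to show the shape of the result.
originals = {"https://example.com/": "Old description."}
reply = '[{"url": "https://example.com/", "new_desc": "A sharper description."}]'
rows = json_to_rows(reply, originals)
```

Each inner list lines up with `COLUMNS`, so it can be appended to the sheet as one row. In practice the reply should be validated first, since a model can return malformed JSON or skip entries.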

I suppose the AI Agent node should work better for this kind of need.

Also, using Gemini 2.0 Flash should be faster, slightly cheaper, and more “SEO-oriented”.

Testing both models is simpler if you use OpenRouter as a gateway!

Thank you so much for this. As I said, I am brand new to this. I'm still going through the n8n tutorials and am on video 7 of 9 of the beginner videos. I will then move on to the next set.

I tried to use Gemini and ChatGPT to help me build the workflow, but it never worked. I then heard that Claude was better at helping with code(?), so I bought a subscription, and it did not work either.

I would be happy to buy you a few beers for the help. Is there a way to DM on this platform?

Also, if I were to take that workflow and use it on another site that has, say, 1,000 pages, would it be wise to add “Wait” nodes here and there? I'm still learning about those as well.

Thanks again!

Glad this was helpful (if it was, kindly mark the answer as solution or hit that like button).

As for the Wait nodes: depending on the service, of course, this could be a wise thing to do. Not that I think you will melt the server with traffic, but giving a server some time to breathe is a nice thing to do. Also, some servers will just plain block you if you send more than x requests per y units of time, so introducing some pauses is not a bad idea.
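For what it's worth, the idea behind those pauses can be sketched in a few lines of standalone Python (this is not an n8n node). The `fetch_in_batches` helper, the batch size, and the pause length are all made-up values for the example; the right numbers depend on the site being crawled.

```python
import time


def fetch_in_batches(urls, fetch, batch_size=10, pause_seconds=1.0,
                     sleep=time.sleep):
    """Call fetch(url) for every URL, pausing between batches.

    The sleep argument is injectable so the pacing can be checked
    without actually waiting.
    """
    results = []
    for start in range(0, len(urls), batch_size):
        if start > 0:
            sleep(pause_seconds)  # give the server a break between batches
        for url in urls[start:start + batch_size]:
            results.append(fetch(url))
    return results


# Demo with a stand-in fetch and a recording "sleep", so nothing really waits.
pauses = []
demo_urls = [f"https://example.com/page-{i}" for i in range(25)]
fetched = fetch_in_batches(
    demo_urls,
    fetch=lambda url: {"url": url},
    batch_size=10,
    pause_seconds=2,
    sleep=pauses.append,
)
```

With 25 URLs and batches of 10, this pauses twice: once before the second batch and once before the third. A Wait node placed inside a batching loop would play the same role in a workflow.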

@jsquared Keep in mind that the platform is evolving VERY quickly, so what the LLMs know about it might lag behind its current capabilities.

And since you are exploring SEO activities, my advice would be to use OpenRouter as the provider instead of OpenAI.

It provides a common interface to several kinds of LLMs from several companies, and it’s a pay-per-use tool, so you pay only for the tokens that you use.

I’m suggesting that option because other models might give you more interesting results: in my SEO tests, Gemini 2.0 Flash is pretty good even compared to GPT-4.1, but I mainly do SEO in Italian.

The results may vary depending on the language and the semantic context.

Have fun!
