Hi there. I'm trying to parse my own LinkedIn posts by pulling them from the web-accessible dashboard, Taplio, with the following workflow, but I can't really set the proper selector:
Thank you very much!
It looks like your topic is missing some important information. Could you provide the following, if applicable?
Have you tried `p.chakra-text`?
Thanks for your reply, @gleeballs! No, I hadn't, but the result is very similar to selecting several `div`s in a row. It parses the About text under the logo, but still not the last 30 days' posts.
@Dan_Burykin, JavaScript seems to play a big role on that page, preventing proper HTML parsing. As a result, I had to take an unorthodox approach. I noticed that the content of the posts is actually present in the JavaScript code itself as JSON. Thus, instead of parsing `p` and `div` elements, I extracted the content of the `script` tags, with some further cleanup to get to the JSON containing the posts.
Here's the unusual workflow that works with the specific URL you provided. It might not work under other circumstances.
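The gist of the cleanup is roughly the following (a sketch only, not the exact code from the workflow; the `posts` property name and the page structure are assumptions):

```js
// Rough sketch of the script-tag approach in a Code node.
// Assumes the fetched page HTML is in $json.data and that the posts
// JSON lives inside one of the <script> tags under a "posts" key.
const html = $input.first().json.data;

// Collect the body of every <script> tag
const scripts = [...html.matchAll(/<script[^>]*>([\s\S]*?)<\/script>/g)]
  .map(m => m[1]);

// Pick the one that mentions the posts data
const raw = scripts.find(s => s.includes('"posts"'));
if (!raw) throw new Error('No script tag with the posts JSON found');

// Trim down to the outermost JSON object and parse it
const data = JSON.parse(raw.slice(raw.indexOf('{'), raw.lastIndexOf('}') + 1));

return [{ json: { posts: data.posts } }];
```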
Thank you very much, @ihortom. Perfect, as always!
@ihortom, a super small mistake in the 'Split Out1' parameters: it should be 'post', not 'posts'. Since you've dived into this anyway, may I kindly ask you to add one more post parameter to the scraping: likes. Thank you!
That was done on purpose, as `posts` references the array of posts, not an individual post. If you want to rename it, feel free to do so. Just bear in mind that I introduced `posts` earlier, in the "Parse tweets" node.
Sure, that is straightforward. I also replaced `created_at` (which is the date the report was generated, i.e. when you run the workflow) with `posted`, which provides the date when the post was actually published (by you).
Super amazing, thanks, @ihortom! I mean, it worked for me once I changed `posts` to `post` here:
Some update here, @ihortom. To build up to the super monster workflow I have in mind, I also want to scrape influencers' last 30 days' posts from here: The 100 Best Linkedin Influencers. So I'd need to create a single Google Doc for each category and save all of its current-day influencers' posts to it. I promise I won't ask you how to create the archive, but should I still create a separate thread and tag you there? Thx!
I guess I will need to use a proxy for it, which is definitely beyond my knowledge…
Here’s the workflow that extracts all the influencers with the metadata presented on The 100 Best Linkedin Influencers.
You don't seem to be interested in all that metadata, though, only in the LinkedIn usernames, so that you can reuse the first workflow, substituting the Taplio URL, to get the influencers' posts.
If that is so, here’s an updated workflow to get the posts of the influencers.
Note that some influencers do not have any posts listed, so I had to take that into account to prevent the loop from breaking.
What is left for you to do is to add the node(s) updating your spreadsheet.
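For reference, the guard against missing posts amounts to something like this in a Code node (a sketch only; the `posts` property name is an assumption carried over from earlier):

```js
// Sketch of a guard for influencers with no posts, so the loop keeps going.
// Assumes the parsed profile JSON carries the posts under a "posts" key.
const posts = $input.first().json.posts ?? [];

// Return no items instead of failing, letting the loop move on
if (posts.length === 0) return [];

return posts.map(post => ({ json: post }));
```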
Amazing as always! Thanks, @ihortom. I was going to ask you initially: how do I change '\n' symbols to actual line breaks in the output? Also, I can imagine how to export it to Google Sheets, but I'm currently clarifying with the AI tool whether that would be as effective as having the data in a Google Doc. I was also going to use it, so I'd kindly ask what `join()` or similar function I could use to format all the output like:
[Name] [Posted on… ]
link: [post link (it's under the date bubble of each post, if it's possible to scrape it)]
[post content (with line breaks instead of \n)]
[Number of Likes] Likes (not critical, but it would be good to have the likes metadata back)
[Name2] […
Thank you.
That symbol is how n8n indicates that there is a new line there. That is, `\n` will become a new line when you place it in a spreadsheet/doc. No need to worry about that.
No need to use `join()` or any other function. What you need is an expression like this:
{{ $json.user }} {{ $json.created_at }}
link: {{ $json.post_link }}
{{ $json.post }}
. . .
Of course, you need to add the missing metadata first to be able to use those expressions. In practice, it means extending the list of properties in the Post node.
Sure, as said above, add it to the Post node. It is available as `{{ $json.numLikes }}` in there. Similarly, the post URL is available as `{{ $json.post_url }}`.
Hopefully, you can manage to do this on your own. Let me know if not.
Hey @ihortom. Well, at least I've beaten the Google Doc auth! Now I need to clear the previous day's data from the doc and paste the current day's data. I'm sure the approach below is not optimal, but in any case it only sends a single post's data, even though the node is executed as many times as there are posts.
@Dan_Burykin, indeed, your workflow does not work as one would expect. You are using the "Find and Replace Text" option. To resolve the issue, I suggest using "Clear" and "Insert" instead, as depicted below.
That is, we are going to delete (clear) the content first and then insert post by post.
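Under the hood, the "Clear" + "Insert" pair corresponds roughly to a Google Docs `batchUpdate` call like the one below (illustrative only; the n8n node builds this for you, and the placeholder values are hypothetical):

```js
// Illustrative Google Docs batchUpdate payload: clear the body, then insert.
// endOfBody must come from reading the document first (the body's endIndex);
// postText is the text to write. Both are placeholders here.
const endOfBody = 120;            // hypothetical endIndex of the current body
const postText = 'First post…\n'; // hypothetical content to insert

const requests = [
  // Clear: delete everything currently in the body
  { deleteContentRange: { range: { startIndex: 1, endIndex: endOfBody } } },
  // Insert: write the new content at the start of the document
  { insertText: { location: { index: 1 }, text: postText } },
];
// Sent as POST https://docs.googleapis.com/v1/documents/{documentId}:batchUpdate
// with body { requests }
```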
Perfect, now it's completely ready. Just in case you know how to make a hyperlink from the post link, @ihortom: I imagine it could be a formula in the case of Google Sheets, but I'm not really sure how it works for Google Docs.
I'm afraid it is not possible with the Google Docs node, and utilizing custom HTTP requests is far too complicated.
On a separate note, I would advise updating the `post_url` expression (in the "Posts" node) from `{{ $json.zFull.post_url }}` to `{{ $json.zFull.post_url ?? $json.zFull.socialContent.shareUrl }}`, as I noticed that sometimes `post_url` is empty while `shareUrl` is present instead (both point to the post, depending on the nature of that post).
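One caveat about `??` worth keeping in mind (plain JavaScript behaviour, not specific to this workflow): it only falls back when the left side is `null` or `undefined`:

```js
// Nullish coalescing falls back only on null/undefined
null ?? 'fallback';      // 'fallback'
undefined ?? 'fallback'; // 'fallback'
'' ?? 'fallback';        // '' (an empty string does NOT trigger the fallback)
```

So if `post_url` ever comes through as an empty string rather than missing entirely, `||` would be the safer operator there.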
Thanks so much for this update, @ihortom. Very timely! I'm just wondering: does it make sense to call the Google Docs API so many times? I mean, I wish I knew how to put all the posts (and the rest of each one's data) into a single text value and send it to the Google Doc with a single API call. Or is it fine? No concern about the Google Docs API quota?
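Something like this Code node sketch is what I'm imagining, assuming `$input.all()` and the field names from the expressions above (no idea if it's correct):

```js
// Hypothetical Code node merging every post into one text value, so the
// Google Docs node only needs a single Insert call. Field names (user,
// posted, post_url, post, numLikes) are taken from the earlier expressions.
const text = $input.all()
  .map(({ json }) =>
    `${json.user} ${json.posted}\nlink: ${json.post_url}\n${json.post}\n${json.numLikes} Likes`
  )
  .join('\n\n');

return [{ json: { text } }];
```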
@ihortom, also, I finally created the knowledge base of all the categories via the tags above the total influencers list; each category is likely to always contain at least one influencer and often at least one post (it apparently doesn't scrape properly yet, but it's enough at the moment). Is there an option on my self-hosted n8n community plan to check the rest of the categories daily for whether they contain at least one influencer and, if yes, replicate the workflow template for that category, scraping it daily along with creating a Google Doc for that category's posts import? I know it's an advanced task, and I don't yet know so many basic things, but I'm following my interest as always. Thank you!