Scrape n8n community solved questions into Airtable for LLM training and interview prep

Hey guys!

I’m sharing one of the simplest workflows I’ve built, which I literally just finished a few hours ago. It’s very specific, but genuinely useful for a niche use case.

If you’re applying for a support job at n8n, or just want to deeply understand how the community solves real problems through its growing knowledge base of real Q&As, this workflow is your secret weapon.

It scrapes the latest solved questions from the n8n community forum, grabs the accepted answers, and dumps everything cleanly into Airtable. It should be run once a month, letting new solved topics accumulate before the next run.

The workflow fetches up to 900 items per run, each with the question, the accepted answer as the final solution, and a few more fields.
It uses a manual trigger for now, since once a month is more than enough.

To use it you’ll need an Airtable base with Topic ID, Title, Question, Answer, Topic URL, Views, Reply Count, Solved Date, and Imported At fields.
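If it helps, here’s a rough Python sketch of how a fetched topic could map onto those Airtable fields. The Airtable field names are exactly the ones listed above; the lowercase keys on the input dict are my assumption about the workflow’s intermediate data, not its actual shape:

```python
from datetime import datetime, timezone

def to_airtable_record(topic: dict) -> dict:
    """Map one scraped topic onto the Airtable fields used by the base.

    The keys read from `topic` (id, title, question, ...) are assumed names
    for the workflow's intermediate data, not guaranteed to match it.
    """
    return {
        "Topic ID": topic["id"],
        "Title": topic["title"],
        "Question": topic["question"],
        "Answer": topic["answer"],
        "Topic URL": topic["url"],
        "Views": topic.get("views", 0),
        "Reply Count": topic.get("reply_count", 0),
        "Solved Date": topic.get("solved_date"),
        # Stamp the import time so monthly runs are distinguishable.
        "Imported At": datetime.now(timezone.utc).isoformat(),
    }
```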

You can then feed that data into your favorite LLM and ask it to generate a full training guide for an n8n support interview or use it for analysis. Or just flex on people in the forum by knowing everything :laughing:

Here is a sneak peek at the issue category breakdown it built on its own.


Hope it helps someone!

Community_solved_questions_clean.json (13.7 KB)

3 Likes

@houda_ben Nice! This is really cool!

While I’m here, @bartv, is scraping allowed? I know you said something about it back when I was doing AI detection tracking, but I don’t remember.

1 Like

This is a really clever use of the Discourse API — the n8n community has a ton of solved questions that are genuinely valuable for training and interview prep.

On the scraping-allowed question: Discourse has a public JSON API (just append .json to any topic or category URL) that’s explicitly designed for programmatic access. Fetching solved topics via the API at a reasonable rate is generally fine — it’s different from aggressive scraping of rendered HTML. The rate limits are built into the API itself.
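Getting at that JSON is just string handling; here’s a tiny sketch (the helper name is mine) that turns any topic or category URL into its `.json` endpoint, preserving any query string:

```python
def to_json_url(topic_url: str) -> str:
    """Append .json to a Discourse topic/category URL, keeping the query string."""
    base, sep, query = topic_url.partition("?")
    return base.rstrip("/") + ".json" + (sep + query if query else "")
```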

A few suggestions to make the workflow more robust:

Better solved-question detection: Instead of filtering after the fact, use the Discourse search API with the in:solved filter directly:

GET /search.json?q=in%3Asolved+category%3Aquestions&page=1

This gives you only solved topics from the start and is more efficient than fetching everything and filtering.

Incremental runs: For your monthly cadence, store the last run date in Airtable and pass after:YYYY-MM-DD to the search query so you’re only fetching new solved topics since the last import. This keeps API calls minimal and avoids re-importing duplicates.
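A sketch of building that incremental query (`in:solved` and `after:` are standard Discourse search filters; the function name and default category are my assumptions):

```python
from datetime import date
from urllib.parse import urlencode

def solved_since_query(last_run: date, category: str = "questions") -> str:
    """Build a Discourse search URL for topics solved since the last import."""
    q = f"in:solved category:{category} after:{last_run.isoformat()}"
    return "/search.json?" + urlencode({"q": q, "page": 1})
```

You’d read `last_run` from the Airtable field holding the previous run date, then paginate until the results run out.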

Accepted answer extraction: The accepted answer is marked with accepted_answer: true in the post stream JSON. You can fetch it with:
GET /t/{topic_id}.json and look for posts[].accepted_answer == true
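In Python, that lookup over the post stream might look like this (a sketch against the JSON shape described above):

```python
def accepted_answer(topic_json: dict):
    """Return the post flagged accepted_answer: true, or None if unsolved."""
    for post in topic_json.get("post_stream", {}).get("posts", []):
        if post.get("accepted_answer"):
            return post
    return None
```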

For LLM training quality: Consider also capturing the question post separately from the accepted answer post — many LLM training pipelines want them as a Q/A pair rather than a single document.

Nice workflow — the category breakdown analysis is a bonus I hadn’t thought of but makes a lot of sense for understanding where people struggle most.

1 Like

Hi @OMGItsDerek, thanks for the detailed feedback! I really appreciate it!

On the search API with in:solved, I actually tried that and hit two problems: it returned unsolved topics too, and it has a hard limit of 50 results with no real pagination. That’s why I ended up using the category endpoint with solved=yes and the “Update a Parameter in Each Request” pagination mode instead.
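For anyone curious, the pagination logic is roughly this (a Python sketch; `fetch_page` stands in for the HTTP Request node, whose “Update a Parameter in Each Request” mode bumps the `page` query parameter each call):

```python
def fetch_solved_topics(fetch_page, max_items: int = 900) -> list:
    """Page through a category listing until max_items or an empty page.

    fetch_page(page) -> list of topic dicts for that page; in the real
    workflow this is the HTTP Request node hitting the category endpoint
    with solved=yes and an incrementing page parameter.
    """
    topics, page = [], 0
    while len(topics) < max_items:
        batch = fetch_page(page)
        if not batch:  # empty page means we've run out of topics
            break
        topics.extend(batch)
        page += 1
    return topics[:max_items]
```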

The accepted answer extraction is already in there, I fetch each topic with /t/{topic_id}.json and look for accepted_answer: true in the post stream, exactly as you described.

The incremental runs idea is great though, will add that in the next version! :star_struck:

And yes the Q/A pair structure is already how we store it, Question and Answer are separate fields in Airtable so it’s ready for LLM training pipelines out of the box.

Thanks again !!

2 Likes

You can use this data for personal use, but not for republication/repurposing anywhere.

2 Likes

This is brilliant! I literally just came to the forum to do this manually and found your post. Love a good easter egg :egg: :slight_smile:

1 Like

When I ran the workflow I noticed it was pulling the 900 entries from 2019 forward, not the latest support items. Let me know if this was intentional for some reason. I found it was because the URL option in the Fetch Items node was set to ascending=true, so I changed it to descending=true and now it starts in 2026 and goes back.

@seanpcook Thanks for the comment !!

Just change ascending to false and it will work. I was using true for test purposes and forgot to change it back in the workflow.

1 Like