# 🆘 Help Needed: Extracting Real News Titles from Google Alerts HTML in n8n
## About the Project
I'm building an automated news aggregation system for **ANPILAC** (News Agency on Early Childhood Education in Latin America and the Caribbean), a nonprofit platform serving educators, researchers, and policymakers across Latin America.
**Mission:** Provide free, timely access to early childhood education news (ages 0-8) to strengthen educational policies and child development initiatives in LATAM.
**Current status:** Curation is entirely manual today: I gather news by hand, design content with NotebookLM, and post to our portal and social media myself. This workflow aims to automate the first step, news discovery and validation.
---
## Technical Setup (80% Complete)
**Environment:**
- n8n Desktop v1.121.3 (Windows 11, local execution)
- Gmail API with OAuth2 ✅ Working
- Google Sheets API with OAuth2 ✅ Working
- Schedule Trigger ✅ Working
**Workflow architecture:**
[Schedule: Daily 9AM] → [Gmail: Fetch 20 emails] → [Code Node: Parse HTML] → [Google Sheets: Write data]
**What's working:**
- Gmail authentication and email retrieval
- Google Sheets connection and data writing
- Workflow executes successfully (all nodes green)
**What's NOT working:**
- Code Node extracts wrong text from emails
---
## The Critical Problem
**Expected output:**
Title: “New early childhood education law approved in Argentina” Link: https://educacion.gob.ar/noticias/nueva-ley-educacion-inicial
**Actual output:**
Title: “update subscription preferences” Link: https://primeirainfancia.us10.list-manage.com/profile
The Code Node is extracting **interface text and settings links** instead of **actual news titles and article URLs**.
---
## Root Cause Analysis
Google Alerts emails have HTML with no semantic structure:
- No CSS classes like `class="news-title"`
- No `data-*` attributes
- News links and interface links both use identical `<a>` tags
- No positional or structural pattern to distinguish them
**Sample HTML structure:**
```html
<!-- Real news (what we want) -->
<a href="https://google.com/url?url=https://educacion.gob.ar/nueva-ley">
Ministerio aprueba nueva ley de educación inicial
</a>
<!-- Interface link (what we're getting instead) -->
<a href="https://primeirainfancia.us10.list-manage.com/profile">
update subscription preferences
</a>
<!-- More interface links -->
<a href="https://google.com/alerts/feedback">
Mark as not important
</a>
```
All three use identical `<a>` markup; the only visible difference is in the `href` values themselves, so a simple tag-matching regex can't tell them apart.
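One possible handle, judging only from the sample above: the real news `href` is wrapped in a `google.com/url?url=...` redirect, while the interface links point elsewhere. A rough sketch of filtering on that (the `esEnlaceDeNoticia` name is illustrative, and this assumes the redirect wrapper is consistent across alert emails, which I haven't verified):

```javascript
// Assumption from the sample HTML: news links are wrapped in a
// google.com/url redirect with the real article in the `url` (or `q`)
// query parameter; interface links (list-manage, /alerts/feedback) are not.
function esEnlaceDeNoticia(href) {
  try {
    const u = new URL(href);
    return u.hostname.endsWith('google.com') &&
           u.pathname === '/url' &&
           (u.searchParams.has('url') || u.searchParams.has('q'));
  } catch (e) {
    return false; // not a parseable absolute URL
  }
}
```

If this pattern holds, the filter flips from a blacklist ("reject known interface URLs") to a whitelist ("accept only redirect-wrapped URLs"), which is much less fragile against new interface links.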
---
## What We've Tried (8+ hours debugging with 2 AI assistants)
### Attempt 1: Basic regex pattern
```javascript
/<a[^>]*href=["']([^"']+)["'][^>]*>(.*?)<\/a>/gi
```
**Result:** Captures ALL links (news + interface) ❌
### Attempt 2: URL blacklist filtering
```javascript
function esURLNoticia(url) {
  const urlsInvalidas = ['google.com/alerts', 'support.google',
                         'unsubscribe', 'feedback', 'list-manage.com'];
  return !urlsInvalidas.some(inv => url.includes(inv));
}
```
**Result:** Improved but still captures interface text ❌
### Attempt 3: Title content validation
```javascript
function esTituloValido(titulo) {
  const textosFalsos = ['update subscription', 'ver más', 'marcar como',
                        'preferences', 'unsubscribe', 'manage'];
  return !textosFalsos.some(texto => titulo.toLowerCase().includes(texto));
}
```
**Result:** Better but many edge cases remain ❌
### Attempt 4: Length-based filtering
```javascript
if (titulo.length > 25 && titulo.length < 300)
```
**Result:** Interface text also meets length criteria ❌
### Attempt 5: Combined approach (current code)
```javascript
const regexEnlaces = /<a[^>]*href=["']([^"']+)["'][^>]*>([^<]{25,200})<\/a>/gi;
let match;
while ((match = regexEnlaces.exec(emailHTML)) !== null) {
  const url = match[1];
  const titulo = match[2].trim();
  if (esURLNoticia(url) && esTituloValido(titulo)) {
    items.push({ json: { titulo, link: url } });
  }
}
```
**Result:** Still extracting wrong content ❌
### Attempt 6: Claude API for AI parsing
Tried using Anthropic's Claude API to parse the HTML intelligently, but ran into authentication issues in the n8n environment.
---
## Current Output (What Google Sheets receives)
| fecha_escaneo | alerta | titulo | link | pais |
|---------------|--------|--------|------|------|
| 12/12/2025 08:05 | educacion inicial | update subscription preferences | https://list-manage.com/profile | Colombia |
| 12/12/2025 08:05 | educacion inicial | update subscription preferences | https://list-manage.com/profile | Colombia |
**This data is completely unusable** - it's not news content, just email interface elements.
---
## Questions for the Community
### Primary Question
**How can I reliably extract real news titles and URLs from Google Alerts HTML emails in n8n Code Node, distinguishing them from interface/navigation links?**
### Specific Sub-Questions
1. **Hidden HTML patterns:**
- Are there undocumented CSS classes in Google Alerts emails?
- Positional patterns (e.g., first N links are always news)?
- MIME structure markers we can use?
2. **Better parsing approaches:**
- Should we use Cheerio/JSDOM instead of regex in Code Node?
- HTML parsing libraries that work in n8n environment?
- DOM-based selection strategies?
3. **Has anyone solved this before?**
- Existing n8n templates for Google Alerts parsing?
- Community workflows we can adapt?
4. **Alternative approaches:**
- Different email parsing strategies?
- Should we abandon Google Alerts entirely?
- RSS feeds or NewsAPI alternatives for education news in LATAM?
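On the RSS sub-question: Google Alerts can be set to "Deliver to: RSS feed" instead of email, which sidesteps HTML-email parsing entirely, since feed entries carry clean titles and links. A minimal sketch of what consuming such a feed could look like (assumptions: the feed is Atom with per-entry `<title>` and `<link href>`; `parseAlertFeed` is an illustrative name, and in practice n8n's built-in RSS Read node would likely do this job):

```javascript
// Sketch: extract { titulo, link } pairs from an Atom feed string.
// Assumes each <entry> contains one <title> and one <link href="..."/>.
function parseAlertFeed(xml) {
  const entries = [];
  const entryRe = /<entry>([\s\S]*?)<\/entry>/g;
  let m;
  while ((m = entryRe.exec(xml)) !== null) {
    const body = m[1];
    const title = (body.match(/<title[^>]*>([\s\S]*?)<\/title>/) || [])[1] || '';
    const link = (body.match(/<link[^>]*href=["']([^"']+)["']/) || [])[1] || '';
    // Feed titles may contain markup; strip any tags before storing.
    entries.push({ titulo: title.replace(/<[^>]+>/g, '').trim(), link });
  }
  return entries;
}
```

Feed links may still be wrapped in the `google.com/url` redirect, so a `limpiarURL`-style unwrap step would still apply afterwards.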
---
## Complete Current Code (Code Node 3)
```javascript
const items = [];
function obtenerFechaHoraArgentina() {
  const ahora = new Date();
  const offsetBsAs = -3 * 60;
  const offsetLocal = ahora.getTimezoneOffset();
  const diff = offsetLocal + offsetBsAs;
  const fechaBsAs = new Date(ahora.getTime() + diff * 60 * 1000);
  const dd = String(fechaBsAs.getDate()).padStart(2, '0');
  const mm = String(fechaBsAs.getMonth() + 1).padStart(2, '0');
  const yyyy = fechaBsAs.getFullYear();
  const hh = String(fechaBsAs.getHours()).padStart(2, '0');
  const min = String(fechaBsAs.getMinutes()).padStart(2, '0');
  const ss = String(fechaBsAs.getSeconds()).padStart(2, '0');
  return `${dd}/${mm}/${yyyy} ${hh}:${min}:${ss}`;
}
function limpiarURL(url) {
  if (!url) return '';
  if (url.includes('google.com/url?')) {
    try {
      const urlObj = new URL(url);
      const urlReal = urlObj.searchParams.get('url') || urlObj.searchParams.get('q');
      if (urlReal) url = urlReal;
    } catch (e) {
      const match = url.match(/[?&](?:url|q)=([^&]+)/);
      if (match) url = decodeURIComponent(match[1]);
    }
  }
  const trackingParams = ['ved', 'usg', 'sa', 'ct', 'cad', 'cd', 'rct'];
  try {
    const urlObj = new URL(url);
    trackingParams.forEach(param => urlObj.searchParams.delete(param));
    return urlObj.toString();
  } catch (e) {
    return url.split('?')[0];
  }
}
function detectarPais(url) {
  if (!url) return 'Desconocido';
  const dominios = {
    '.ar': 'Argentina', '.mx': 'México', '.br': 'Brasil', '.cl': 'Chile',
    '.co': 'Colombia', '.pe': 'Perú', '.uy': 'Uruguay', '.py': 'Paraguay'
  };
  for (const [dom, pais] of Object.entries(dominios)) {
    if (url.includes(dom)) return pais;
  }
  return 'Internacional';
}
// Note: 'texto' is received but currently unused; only the subject is checked.
function detectarAlerta(texto, subject) {
  const palabras = ['educacion inicial', 'educacao infantil', 'primera infancia',
                    'educacion parvularia', 'educacion preescolar'];
  for (const palabra of palabras) {
    if (subject.toLowerCase().includes(palabra.toLowerCase())) return palabra;
  }
  return 'educacion inicial';
}
function generarID() {
  return Date.now().toString() + Math.random().toString(36).slice(2, 11);
}
function esURLNoticia(url) {
  if (!url || !url.startsWith('http')) return false;
  const urlsInvalidas = ['google.com/alerts', 'google.com/url', 'support.google',
                         'mail.google', 'unsubscribe', 'feedback', 'list-manage.com'];
  return !urlsInvalidas.some(inv => url.toLowerCase().includes(inv));
}
function esTituloValido(titulo) {
  if (!titulo || titulo.length < 25 || titulo.length > 300) return false;
  const textosFalsos = ['update subscription', 'ver más', 'marcar como',
                        'google alert', 'preferences', 'manage', 'unsubscribe'];
  const tituloLower = titulo.toLowerCase();
  return !textosFalsos.some(texto => tituloLower.includes(texto));
}
for (const item of $input.all()) {
  try {
    const emailHTML = item.json.html || '';
    const subject = item.json.subject || '';
    if (!emailHTML) continue;
    const palabraClave = detectarAlerta(emailHTML, subject);
    const regexEnlaces = /<a[^>]*href=["']([^"']+)["'][^>]*>([^<]{25,200})<\/a>/gi;
    let match;
    let contadorEmail = 0;
    while ((match = regexEnlaces.exec(emailHTML)) !== null && contadorEmail < 15) {
      const url = match[1];
      const titulo = match[2].trim().replace(/&nbsp;/g, ' ').replace(/\s+/g, ' ');
      if (esURLNoticia(url) && esTituloValido(titulo)) {
        const urlLimpia = limpiarURL(url);
        const pais = detectarPais(urlLimpia);
        items.push({
          json: {
            'id': generarID(),
            'fecha_escaneo': obtenerFechaHoraArgentina(),
            'alerta': palabraClave,
            'titulo': titulo.substring(0, 250),
            'link': urlLimpia,
            'pais': pais,
            'relevancia': 'Media',
            'razonamiento': 'Regex mejorado'
          }
        });
        contadorEmail++;
      }
    }
  } catch (error) {
    console.error('Error:', error.message);
  }
}
if (items.length > 0) {
  return items;
} else {
  return [{
    json: {
      'id': generarID(),
      'fecha_escaneo': obtenerFechaHoraArgentina(),
      'alerta': 'Sistema',
      'titulo': 'No se encontraron noticias',
      'link': '',
      'pais': 'N/A',
      'relevancia': 'Baja',
      'razonamiento': 'Sin resultados'
    }
  }];
}
```
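One detail that may be worth double-checking in the code above: `esURLNoticia` blacklists `'google.com/url'`, but in the sample HTML the real news hrefs are wrapped in exactly that redirect, and the filter runs *before* `limpiarURL` unwraps it, so the news links may be rejected before they can be cleaned. A sketch that unwraps first and filters second (assuming the redirect wrapper from the sample is consistent; `extraerEnlaces` is an illustrative name):

```javascript
// Sketch: unwrap the google.com/url redirect BEFORE filtering, so that
// real news links survive and anything still pointing at Google or
// list-manage after unwrapping really is interface chrome.
function extraerEnlaces(emailHTML) {
  const resultados = [];
  const regexEnlaces = /<a[^>]*href=["']([^"']+)["'][^>]*>([^<]{25,200})<\/a>/gi;
  let match;
  while ((match = regexEnlaces.exec(emailHTML)) !== null) {
    let url = match[1];
    // 1. Unwrap the redirect first (assumption based on the sample HTML).
    const redirect = url.match(/[?&](?:url|q)=([^&]+)/);
    if (url.includes('google.com/url') && redirect) {
      url = decodeURIComponent(redirect[1]);
    }
    // 2. Then filter: leftover Google/list-manage links are interface text.
    if (/google\.com|list-manage|unsubscribe/i.test(url)) continue;
    resultados.push({ titulo: match[2].trim(), link: url });
  }
  return resultados;
}
```

Not claiming this is the whole story, but the ordering alone could explain why only interface rows reach Google Sheets.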
---
## What I Can Provide
If helpful for debugging:
- ✅ Complete workflow JSON export
- ✅ Sample Google Alerts email HTML (anonymized)
- ✅ Screenshots of n8n workflow and current output
- ✅ Google Sheets structure
---
## Context: Why This Matters
**Keywords monitored (11 Google Alerts):**
- educacion inicial, educacion parvularia (Spanish)
- educacao infantil, primeira infancia (Portuguese)
- educacion preescolar, primera infancia
**Target region:** Argentina, Brazil, Chile, Colombia, Mexico, Peru, Uruguay, and all LATAM countries
**Impact:** ANPILAC serves hundreds of educators and researchers who rely on timely news to inform early childhood policies. Currently, I do ALL curation manually - spending hours daily reading emails, opening links, and hand-picking relevant content. This automation would allow me to focus on content quality and community building instead of data entry.
**Without this working:** The project remains 100% manual, unsustainable for a solo nonprofit operator.
---
## Constraints
- ✅ Must use n8n Desktop (free version, local execution)
- ✅ Free/open-source solutions only (nonprofit budget)
- ❌ Cannot change source (need Google Alerts for comprehensive coverage)
- ❌ No external servers or paid APIs
---
## Thank You
Any guidance - code improvements, alternative parsing strategies, or even "Google Alerts is unparseable, use RSS instead" - would be incredibly valuable. I've been stuck on this for days and would deeply appreciate the community's expertise.
**Project spirit:** Open source, education equity, LATAM solidarity 🌎
---
**Attachments (if requested):**
1. Screenshot: n8n workflow (all nodes green but wrong output)
2. Screenshot: Google Sheets with incorrect data
3. Workflow JSON export (if needed)