Help Needed: Extracting Real News Titles from Google Alerts HTML in n8n

# 🆘 Help Needed: Extracting Real News Titles from Google Alerts HTML in n8n

## About the Project

I'm building an automated news aggregation system for **ANPILAC** (News Agency on Early Childhood Education in Latin America and the Caribbean), a nonprofit platform serving educators, researchers, and policymakers across Latin America.

**Mission:** Provide free, timely access to early childhood education news (ages 0-8) to strengthen educational policies and child development initiatives in LATAM.

**Current status:** Entirely manual curation - I personally scrape news, design content with NotebookLM, and manually post to our portal and social media. This workflow aims to automate the first step: news discovery and validation.

---

## Technical Setup (80% Complete)

**Environment:**
- n8n Desktop v1.121.3 (Windows 11, local execution)
- Gmail API with OAuth2 ✅ Working
- Google Sheets API with OAuth2 ✅ Working
- Schedule Trigger ✅ Working

**Workflow architecture:**

[Schedule: Daily 9AM] → [Gmail: Fetch 20 emails] → [Code Node: Parse HTML] → [Google Sheets: Write data]


**What's working:**
- Gmail authentication and email retrieval
- Google Sheets connection and data writing
- Workflow executes successfully (all nodes green)

**What's NOT working:**
- Code Node extracts wrong text from emails

---

## The Critical Problem

**Expected output:**

Title: “New early childhood education law approved in Argentina” Link: https://educacion.gob.ar/noticias/nueva-ley-educacion-inicial


**Actual output:**

Title: “update subscription preferences” Link: https://primeirainfancia.us10.list-manage.com/profile


The Code Node is extracting **interface text and settings links** instead of **actual news titles and article URLs**.

---

## Root Cause Analysis

Google Alerts emails have HTML with no semantic structure:
- No CSS classes like `class="news-title"`
- No `data-*` attributes
- News links and interface links both use identical `<a>` tags
- No positional or structural pattern to distinguish them

**Sample HTML structure:**
```html
<!-- Real news (what we want) -->
<a href="https://google.com/url?url=https://educacion.gob.ar/nueva-ley">
  Ministerio aprueba nueva ley de educación inicial
</a>

<!-- Interface link (what we're getting instead) -->
<a href="https://primeirainfancia.us10.list-manage.com/profile">
  update subscription preferences
</a>

<!-- More interface links -->
<a href="https://google.com/alerts/feedback">
  Mark as not important
</a>
```

All three use the same HTML structure. **No way to distinguish them with simple regex.**

---

## What We've Tried (8+ hours debugging with 2 AI assistants)

### Attempt 1: Basic regex pattern
```javascript
/<a[^>]*href=["']([^"']+)["'][^>]*>(.*?)<\/a>/gi
```
**Result:** Captures ALL links (news + interface) ❌

### Attempt 2: URL blacklist filtering
```javascript
function esURLNoticia(url) {
  const urlsInvalidas = ['google.com/alerts', 'support.google', 
                         'unsubscribe', 'feedback', 'list-manage.com'];
  return !urlsInvalidas.some(inv => url.includes(inv));
}
```
**Result:** Improved but still captures interface text ❌

### Attempt 3: Title content validation
```javascript
function esTituloValido(titulo) {
  const textosFalsos = ['update subscription', 'ver más', 'marcar como', 
                        'preferences', 'unsubscribe', 'manage'];
  return !textosFalsos.some(texto => titulo.toLowerCase().includes(texto));
}
```
**Result:** Better but many edge cases remain ❌

### Attempt 4: Length-based filtering
```javascript
if (titulo.length > 25 && titulo.length < 300)
```
**Result:** Interface text also meets length criteria ❌

### Attempt 5: Combined approach (current code)
```javascript
const regexEnlaces = /<a[^>]*href=["']([^"']+)["'][^>]*>([^<]{25,200})<\/a>/gi;
let match;
while ((match = regexEnlaces.exec(emailHTML)) !== null) {
  const url = match[1];
  const titulo = match[2].trim();
  if (esURLNoticia(url) && esTituloValido(titulo)) {
    items.push({ json: { titulo, link: url } });
  }
}
```
**Result:** Still extracting wrong content ❌

### Attempt 6: Claude API for AI parsing
Tried using Anthropic's Claude API to intelligently parse HTML, but encountered authentication issues in n8n environment.

---

## Current Output (What Google Sheets receives)

| fecha_escaneo | alerta | titulo | link | pais |
|---------------|--------|--------|------|------|
| 12/12/2025 08:05 | educacion inicial | update subscription preferences | https://list-manage.com/profile | Colombia |
| 12/12/2025 08:05 | educacion inicial | update subscription preferences | https://list-manage.com/profile | Colombia |

**This data is completely unusable** - it's not news content, just email interface elements.

---

## Questions for the Community

### Primary Question
**How can I reliably extract real news titles and URLs from Google Alerts HTML emails in n8n Code Node, distinguishing them from interface/navigation links?**

### Specific Sub-Questions

1. **Hidden HTML patterns:**
   - Are there undocumented CSS classes in Google Alerts emails?
   - Positional patterns (e.g., first N links are always news)?
   - MIME structure markers we can use?

2. **Better parsing approaches:**
   - Should we use Cheerio/JSDOM instead of regex in Code Node?
   - HTML parsing libraries that work in n8n environment?
   - DOM-based selection strategies?

3. **Has anyone solved this before?**
   - Existing n8n templates for Google Alerts parsing?
   - Community workflows we can adapt?

4. **Alternative approaches:**
   - Different email parsing strategies?
   - Should we abandon Google Alerts entirely?
   - RSS feeds or NewsAPI alternatives for education news in LATAM?

---

## Complete Current Code (Code Node 3)
```javascript
const items = [];

function obtenerFechaHoraArgentina() {
  const ahora = new Date();
  const offsetBsAs = -3 * 60;
  const offsetLocal = ahora.getTimezoneOffset();
  const diff = offsetLocal + offsetBsAs;
  const fechaBsAs = new Date(ahora.getTime() + diff * 60 * 1000);
  const dd = String(fechaBsAs.getDate()).padStart(2, '0');
  const mm = String(fechaBsAs.getMonth() + 1).padStart(2, '0');
  const yyyy = fechaBsAs.getFullYear();
  const hh = String(fechaBsAs.getHours()).padStart(2, '0');
  const min = String(fechaBsAs.getMinutes()).padStart(2, '0');
  const ss = String(fechaBsAs.getSeconds()).padStart(2, '0');
  return `${dd}/${mm}/${yyyy} ${hh}:${min}:${ss}`;
}

function limpiarURL(url) {
  if (!url) return '';
  if (url.includes('google.com/url?')) {
    try {
      const urlObj = new URL(url);
      const urlReal = urlObj.searchParams.get('url') || urlObj.searchParams.get('q');
      if (urlReal) url = urlReal;
    } catch (e) {
      const match = url.match(/[?&](?:url|q)=([^&]+)/);
      if (match) url = decodeURIComponent(match[1]);
    }
  }
  const trackingParams = ['ved', 'usg', 'sa', 'ct', 'cad', 'cd', 'rct'];
  try {
    const urlObj = new URL(url);
    trackingParams.forEach(param => urlObj.searchParams.delete(param));
    return urlObj.toString();
  } catch (e) {
    return url.split('?')[0];
  }
}

function detectarPais(url) {
  if (!url) return 'Desconocido';
  const dominios = {
    '.ar': 'Argentina', '.mx': 'México', '.br': 'Brasil', '.cl': 'Chile',
    '.co': 'Colombia', '.pe': 'Perú', '.uy': 'Uruguay', '.py': 'Paraguay'
  };
  for (const [dom, pais] of Object.entries(dominios)) {
    if (url.includes(dom)) return pais;
  }
  return 'Internacional';
}

function detectarAlerta(texto, subject) {
  const palabras = ['educacion inicial', 'educacao infantil', 'primera infancia', 
                    'educacion parvularia', 'educacion preescolar'];
  for (const palabra of palabras) {
    if (subject.toLowerCase().includes(palabra.toLowerCase())) return palabra;
  }
  return 'educacion inicial';
}

function generarID() {
  return Date.now().toString() + Math.random().toString(36).substr(2, 9);
}

function esURLNoticia(url) {
  if (!url || !url.startsWith('http')) return false;
  const urlsInvalidas = ['google.com/alerts', 'google.com/url', 'support.google', 
                         'mail.google', 'unsubscribe', 'feedback', 'list-manage.com'];
  return !urlsInvalidas.some(inv => url.toLowerCase().includes(inv));
}

function esTituloValido(titulo) {
  if (!titulo || titulo.length < 25 || titulo.length > 300) return false;
  const textosFalsos = ['update subscription', 'ver más', 'marcar como', 
                        'google alert', 'preferences', 'manage', 'unsubscribe'];
  const tituloLower = titulo.toLowerCase();
  return !textosFalsos.some(texto => tituloLower.includes(texto));
}

for (const item of $input.all()) {
  try {
    const emailHTML = item.json.html || '';
    const subject = item.json.subject || '';
    if (!emailHTML) continue;
    
    const palabraClave = detectarAlerta(emailHTML, subject);
    const regexEnlaces = /<a[^>]*href=["']([^"']+)["'][^>]*>([^<]{25,200})<\/a>/gi;
    let match;
    let contadorEmail = 0;
    
    while ((match = regexEnlaces.exec(emailHTML)) !== null && contadorEmail < 15) {
      const url = match[1];
      const titulo = match[2].trim().replace(/&nbsp;/g, ' ').replace(/\s+/g, ' ');
      
      if (esURLNoticia(url) && esTituloValido(titulo)) {
        const urlLimpia = limpiarURL(url);
        const pais = detectarPais(urlLimpia);
        
        items.push({
          json: {
            'id': generarID(),
            'fecha_escaneo': obtenerFechaHoraArgentina(),
            'alerta': palabraClave,
            'titulo': titulo.substring(0, 250),
            'link': urlLimpia,
            'pais': pais,
            'relevancia': 'Media',
            'razonamiento': 'Regex mejorado'
          }
        });
        contadorEmail++;
      }
    }
  } catch (error) {
    console.error('Error:', error.message);
  }
}

if (items.length > 0) {
  return items;
} else {
  return [{ json: { 'id': generarID(), 'fecha_escaneo': obtenerFechaHoraArgentina(), 
            'alerta': 'Sistema', 'titulo': 'No se encontraron noticias', 'link': '', 
            'pais': 'N/A', 'relevancia': 'Baja', 'razonamiento': 'Sin resultados' } }];
}
```

---

## What I Can Provide

If helpful for debugging:
- ✅ Complete workflow JSON export
- ✅ Sample Google Alerts email HTML (anonymized)
- ✅ Screenshots of n8n workflow and current output
- ✅ Google Sheets structure

---

## Context: Why This Matters

**Keywords monitored (11 Google Alerts):**
- educacion inicial, educacion parvularia (Spanish)
- educacao infantil, primeira infancia (Portuguese)
- educacion preescolar, primera infancia

**Target region:** Argentina, Brazil, Chile, Colombia, Mexico, Peru, Uruguay, and all LATAM countries

**Impact:** ANPILAC serves hundreds of educators and researchers who rely on timely news to inform early childhood policies. Currently, I do ALL curation manually - spending hours daily reading emails, opening links, and hand-picking relevant content. This automation would allow me to focus on content quality and community building instead of data entry.

**Without this working:** The project remains 100% manual, unsustainable for a solo nonprofit operator.

---

## Constraints

- ✅ Must use n8n Desktop (free version, local execution)
- ✅ Free/open-source solutions only (nonprofit budget)
- ❌ Cannot change source (need Google Alerts for comprehensive coverage)
- ❌ No external servers or paid APIs

---

## Thank You

Any guidance - code improvements, alternative parsing strategies, or even "Google Alerts is unparseable, use RSS instead" - would be incredibly valuable. I've been stuck on this for days and would deeply appreciate the community's expertise.

**Project spirit:** Open source, education equity, LATAM solidarity 🌎

---

**Attachments (if requested):**
1. Screenshot: n8n workflow (all nodes green but wrong output)
2. Screenshot: Google Sheets with incorrect data
3. Workflow JSON export (if needed)

Please share a complete Email with the complete HTML structure here. Maybe there is DOM path that can be followed to identify the links you want to extract.