Extract from HTML

I suspect this is as complicated as I think it is, but I figured I’d ask here before engaging a developer. I’m currently getting HTML from an IMAP node and already using HTML Extract to pull out everything between the BODY tags. It would be even better if I could turn some of the content into variables, but the HTML doesn’t have any class or id attributes that we could use with HTML Extract, and we have no control over the HTML we receive. Here’s the HTML:

<h2>Client/Agent Information</h2>
<table width="600" border="1" cellspacing="0" cellpadding="4">
   <tbody>
      <tr>
         <td width="130"><strong>Client</strong></td>
         <td>Clever Admin</td>
      </tr>
      <tr>
         <td width="130"><strong>Logged By</strong></td>
         <td>Customer John [email protected]</td>
      </tr>
      <tr>
         <td width="130"><strong>Affected User</strong></td>
         <td>Customer John [email protected]</td>
      </tr>
      <tr>
         <td width="130"><strong>Agent Name</strong></td>
         <td>JOHNSPC01</td>
      </tr>
   </tbody>
</table>
<br> 
<h2>Ticket Header Information</h2>
<table width="600" border="1" cellspacing="0" cellpadding="4">
   <tbody>
      <tr>
         <td width="130"><strong>Ticket Type</strong></td>
         <td>Emergency</td>
      </tr>
      <tr>
         <td width="130"><strong>Subject</strong></td>
         <td>Computer Issue</td>
      </tr>
      <tr>
         <td width="130"><strong>Submitted By</strong></td>
         <td>Customer John</td>
      </tr>
      <tr>
         <td width="130"><strong>Affected User</strong></td>
         <td>Customer John</td>
      </tr>
   </tbody>
</table>
<h2>Issue Description</h2>
I'm having an issue<br><br> 
<h2>Smart Engineer (If Applicable)</h2>
<br><br> 
<h2>Form Answer Data (If Applicable)</h2>
<br><br> 
<h2>Diagnostic Information</h2>
System Information <br>Host Name: JOHNSPC01 <br>OS Name: Microsoft Windows 10 Pro <br>OS Version: Windows 10 Pro.2009.19041.1.amd64fre.vb_release.191206-1406 <br> <br>User Name: John <br>User Domain: JOHNSPC01

I’d like to separate out the contents of the second TD in each table row (the value next to each title), and then the issue description and the other sections that follow the tables. Using the example above, I’d end up with the following variables:

client: "Clever Admin"
loggedby: "Customer John [email protected]"
affecteduser: "Customer John [email protected]"
agentname: "JOHNSPC01"
tickettype: "Emergency"
subject: "Computer Issue"
submittedby: "Customer John"
affectedusername: "Customer John"
issuedescription: "I'm having an issue<br><br>"
smartengineer: "<br><br>"
formdata: "<br><br>"
diagnostic: "System Information <br>Host Name: JOHNSPC01 <br>OS Name: Microsoft Windows 10 Pro <br>OS Version: Windows 10 Pro.2009.19041.1.amd64fre.vb_release.191206-1406 <br> <br>User Name: John <br>User Domain: JOHNSPC01"

Is there an easy, non-JavaScript (or non-extensive-JavaScript) way to do this in n8n? I should also mention that all the titles like Client, Logged By, Affected User, etc. are constants and do not change in the source HTML.
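
For the sections after the tables, the text sits directly after each H2 and is not wrapped in an element of its own, so a CSS selector has nothing to target. A few lines of JavaScript may be the simplest route there. The following is a minimal sketch, not a tested n8n workflow: `SECTION_KEYS` and `extractSections` are hypothetical names, and it assumes the H2 titles are constant and carry no attributes, as in the sample above.

```javascript
// Hypothetical mapping from each constant <h2> title to the desired variable name.
const SECTION_KEYS = {
  'Issue Description': 'issuedescription',
  'Smart Engineer (If Applicable)': 'smartengineer',
  'Form Answer Data (If Applicable)': 'formdata',
  'Diagnostic Information': 'diagnostic',
};

// Splits the HTML on <h2> headings and returns the content that follows
// each heading listed in SECTION_KEYS. Other headings are ignored.
function extractSections(html) {
  const result = {};
  // Capture the title, then everything up to the next <h2> or the end of the input.
  const pattern = /<h2>(.*?)<\/h2>([\s\S]*?)(?=<h2>|$)/g;
  for (const [, title, content] of html.matchAll(pattern)) {
    const key = SECTION_KEYS[title.trim()];
    if (key) {
      result[key] = content.trim();
    }
  }
  return result;
}
```

In a Function node, this could be called on whichever field holds the extracted body HTML (the field name depends on how the HTML Extract node is configured) and the returned object merged into the item's JSON.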

Hey @cleveradmin,

I have come across a similar situation in the past, and I used the nth-child CSS selector. If the order of the values is fixed, then this selector might help.
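
As a rough illustration of the positional approach for the two tables, the sketch below builds one selector per variable from the row order in the sample HTML. `TABLE_FIELDS` and `buildSelectors` are hypothetical names. It also assumes that both tables are siblings under the same parent, so that `nth-of-type` counts them as table 1 and table 2, and that the HTML Extract node accepts `nth-of-type` and `nth-child` selectors. Neither assumption has been verified against a live workflow.

```javascript
// Variable names in row order for each table, as they appear in the sample HTML.
const TABLE_FIELDS = [
  ['client', 'loggedby', 'affecteduser', 'agentname'],
  ['tickettype', 'subject', 'submittedby', 'affectedusername'],
];

// Builds a map of variable name -> CSS selector for the value cell
// (the second <td>) of each row, based purely on position.
function buildSelectors(tables) {
  const selectors = {};
  tables.forEach((fields, tableIndex) => {
    fields.forEach((name, rowIndex) => {
      // CSS positions are 1-based, array indexes are 0-based.
      selectors[name] =
        `table:nth-of-type(${tableIndex + 1}) tr:nth-child(${rowIndex + 1}) td:nth-child(2)`;
    });
  });
  return selectors;
}
```

Each resulting string, for example the one stored under `client`, could then be pasted into the CSS Selector field of an HTML Extract entry with the matching key. If the sender ever adds or reorders rows, the positions would need to be updated by hand.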

Yes, I suspect that will work. We are no longer using this workflow, as we’ve implemented something different, but I will keep it in mind should I come across this requirement in the future. Thank you.
