HTTP retriever - regex matching multiple values

jeremyh

Hi

I am attempting to use the HTTP retriever to 'scrape' an HTML page for a number of values/datapoints. The page contains a table that looks like this, and my objective is to capture each of the numerical values as a separate datapoint:

<table>

	<td>Gewicht Volk 1</td>
        <td><b><p class="right">44.6</p></b></td>
        <td>kg</td>
        <td>Zu/Abnahme</td>          
                    <td><b><p class="right">44.6</p></b></td>
         <td>kg</td></tr>
	<tr>
		<td>Gewicht Volk 2</td>
		<td><b><p class="right">29.4</p></b></td>
        <td>kg</td>
        <td>ab 00.00 Uhr</td> 
                    <td><b><p class="right">29.4</p></b></td>
        <td>kg</td></tr>
	<tr>
		<td>Luftdruck</td>
		<td><b><p class="right">1015</p></b></td>
        <td>mbar</td>	
	</tr>
        <td>Temperatur Drucksensor</td> 
        <td><b><p class="right">1.7</p></b></td>
        <td>°C</td>
	<tr>
        <td>Temperatur</td>
        <td><b><p class="right">6.2</p></b></td>
        <td>°C</td>
        <td>Tagesmin.</td> 
        <td><b><p class="right">0.0</p></b></td>
        <td>°C</td>
        <td>, Tagesmax.</td> 
        <td><b><p class="right">0.0</p></b></td>
        <td>°C</td>
</tr>
        <td>Brutraumtemperatur</td>
        <td><b><p class="right">0.0</p></b></td>
        <td>°C</td>
</tr>
        <td>Regensensor</td>
        <td><b><p class="right">17</p></b></td>
        <td>mm</td>   
        <td>Tagesmenge</td> 
        <td><b><p class="right">0.0</p></b></td>
        <td>mm</td>
</tr>
        <td>Luftfeuchtigkeit</td>
        <td><b><p class="right">80.6</p></b></td>
        <td>%</td>
	<tr>
        <td>Akku</td>
        <td><b><p class="right">12.0</p></b></td>
        <td>V</td>
    </tr>
     <tr>
        <td>CSQ (Signalqualität Antenne)</td>
        <td><b><p class="right">-1</p></b></td>
        <td> </td>
	 </tr>

If I use a regex along the lines of (?<=)(.*)(?=<\/p) I can match the first value (44.6), but I cannot get the second value or nth value by incrementing the 'value capture group' value in the data point properties. Adding a {n} index to the end of the regex to get the nth match doesn't seem to help either.

If I use a regex like <td>Luftfeuchtigkeit<\/td> <td>(.*?)< (the forum has stripped the additional whitespace) or even (?<=Luftdruck<\/td> <td>)(.*)(?=<\/p>), I don't get any matches at all.

Attempts to 'learn' regex have come up short, so I am limited to copying examples from others and messing about by trial and error. Any suggestions about how to accomplish this would be very much appreciated!

phildunlap

Hi Jeremy,

Lookahead and lookbehind can definitely get complex (at least you didn't ask a backreference question!). My tactic would be to count how many times "right" appears before what we're interested in,

(?:.|\r|\n)*?(?:right"(?:.|\r|\n)*?){n}right">(\d+\.?\d*)(?:.|\r|\n)*

I'm using (?:.|\r|\n)* to match everything, and then we have the {n} parameter to determine how many right" to skip. The only capturing group should be the (n+1)th value, capturing at group index 1.

I didn't try this in Mango, but I did test out the regex a little.
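For anyone who wants to try the counting approach outside Mango, here is a minimal sketch in Python. It assumes Python's re module behaves like Mango's regex engine for this pattern (the syntax used is common to both, but that is an assumption, not something tested in Mango). The HTML sample is a trimmed copy of the table from the question, and nth_value is a hypothetical helper name.

```python
import re

# Trimmed sample of the page from the question (layout assumed to match).
HTML = """<td>Gewicht Volk 1</td>
<td><b><p class="right">44.6</p></b></td>
<td>kg</td>
<td>Zu/Abnahme</td>
<td><b><p class="right">44.6</p></b></td>
<td>kg</td></tr>
<tr>
<td>Gewicht Volk 2</td>
<td><b><p class="right">29.4</p></b></td>
<td>kg</td></tr>
<tr>
<td>Luftdruck</td>
<td><b><p class="right">1015</p></b></td>
<td>mbar</td>
</tr>
"""


def nth_value(html, n):
    """Skip n occurrences of right" and return the next number (group 1)."""
    pattern = (
        r'(?:.|\r|\n)*?'                          # lazily skip leading text
        + r'(?:right"(?:.|\r|\n)*?){%d}' % n      # skip n earlier right" markers
        + r'right">(\d+\.?\d*)'                   # capture the (n+1)th value
        + r'(?:.|\r|\n)*'                         # consume the rest of the page
    )
    match = re.match(pattern, html)
    return match.group(1) if match else None


if __name__ == "__main__":
    for n in range(4):
        print(n, nth_value(HTML, n))
```

With n = 0 the first value is returned, with n = 2 the third, and so on. One thing to watch: \d+\.?\d* does not accept a leading minus sign, so a value such as the CSQ reading of -1 would need something like -?\d+\.?\d* in the capturing group.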

phildunlap

Or, if you wanted to be explicit about the row labels, fearing the order may change or some such,

Luftdruck(?:.|\r|\n)*?(\d+\.?\d*)
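The label-anchored pattern can be checked the same way. This is a minimal Python sketch, again assuming Python's re module is a fair stand-in for Mango's regex engine. value_for is a hypothetical helper and the HTML is a trimmed copy of the table from the question.

```python
import re

# Trimmed sample of the page from the question (layout assumed to match).
HTML = """<tr>
<td>Luftdruck</td>
<td><b><p class="right">1015</p></b></td>
<td>mbar</td>
</tr>
<td>Temperatur Drucksensor</td>
<td><b><p class="right">1.7</p></b></td>
<td>°C</td>
<tr>
<td>Temperatur</td>
<td><b><p class="right">6.2</p></b></td>
<td>°C</td>
</tr>
"""


def value_for(label, html):
    """Return the first number that appears after the given label text."""
    # re.escape keeps any regex metacharacters in the label literal.
    pattern = re.escape(label) + r'(?:.|\r|\n)*?(\d+\.?\d*)'
    match = re.search(pattern, html)
    return match.group(1) if match else None


if __name__ == "__main__":
    print(value_for("Luftdruck", HTML))
    print(value_for("Temperatur</td>", HTML))
```

Because the pattern takes the first number after the label, a label that is a prefix of another one needs care: searching for Temperatur alone hits the earlier "Temperatur Drucksensor" row first, so including the closing tag (Temperatur</td>) is one way to pin down the intended row.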

jeremyh

Brilliant, thanks Phil. Appreciate your explanation. That second regex works perfectly!