HTTP retriever - regex matching multiple values
-
Hi
I am attempting to use the HTTP retriever to 'scrape' an HTML page for a number of values/datapoints. The page contains a table that looks like this, and my objective is to capture all of the numerical values as separate datapoints:
<table> <td>Gewicht Volk 1</td> <td><b><p class="right">44.6</p></b></td> <td>kg</td> <td>Zu/Abnahme</td> <td><b><p class="right">44.6</p></b></td> <td>kg</td></tr> <tr> <td>Gewicht Volk 2</td> <td><b><p class="right">29.4</p></b></td> <td>kg</td> <td>ab 00.00 Uhr</td> <td><b><p class="right">29.4</p></b></td> <td>kg</td></tr> <tr> <td>Luftdruck</td> <td><b><p class="right">1015</p></b></td> <td>mbar</td> </tr> <td>Temperatur Drucksensor</td> <td><b><p class="right">1.7</p></b></td> <td>°C</td> <tr> <td>Temperatur</td> <td><b><p class="right">6.2</p></b></td> <td>°C</td> <td>Tagesmin.</td> <td><b><p class="right">0.0</p></b></td> <td>°C</td> <td>, Tagesmax.</td> <td><b><p class="right">0.0</p></b></td> <td>°C</td> </tr> <td>Brutraumtemperatur</td> <td><b><p class="right">0.0</p></b></td> <td>°C</td> </tr> <td>Regensensor</td> <td><b><p class="right">17</p></b></td> <td>mm</td> <td>Tagesmenge</td> <td><b><p class="right">0.0</p></b></td> <td>mm</td> </tr> <td>Luftfeuchtigkeit</td> <td><b><p class="right">80.6</p></b></td> <td>%</td> <tr> <td>Akku</td> <td><b><p class="right">12.0</p></b></td> <td>V</td> </tr> <tr> <td>CSQ (Signalqualität Antenne)</td> <td><b><p class="right">-1</p></b></td> <td> </td> </tr>
If I use a regex along the lines of
(?<=<p class="right">)(.*)(?=<\/p)
I can match the first value (44.6), but I cannot get the second or nth value by incrementing the 'value capture group' value in the data point properties. Adding a {n} index to the end of the regex to get the nth match doesn't seem to help either.
If I use a regex like
<td>Luftfeuchtigkeit<\/td> <td><b><p class="right">(.*?)<
(the forum has stripped the additional whitespace) or even
(?<=Luftdruck<\/td> <td><b><p class="right">)(.*)(?=<\/p>)
I don't get any matches at all.
Attempts to 'learn' regex have come up short, so I am limited to copying examples from others and messing about by trial and error. Any suggestions about how to accomplish this would be very much appreciated!
-
Hi Jeremy,
Lookahead and lookbehind can definitely get complex (at least you didn't ask a backreference question!). My tactic would be to count how many times "right" appears before the value we're interested in:
(?:.|\r|\n)*?(?:right"(?:.|\r|\n)*?){n}right">(\d+\.?\d*)(?:.|\r|\n)*
I'm using
(?:.|\r|\n)*
to match everything, including line breaks, and then we have the {n} parameter to determine how many occurrences of right" to skip. The only capturing group should hold the (n+1)th value, at group index 1.
I didn't try this in Mango, but I did test out the regex a little.
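For anyone who wants to try the skip-count idea outside Mango, here is a minimal Python sketch. It is only an illustration: the HTML is a shortened copy of the table from the question, nth_value is a hypothetical helper name, and Python's re module stands in for whatever regex engine Mango uses, so behaviour there may differ.

```python
import re

# Shortened copy of the table from the question (units and later rows omitted).
HTML = (
    '<table> <td>Gewicht Volk 1</td> <td><b><p class="right">44.6</p></b></td> '
    '<td>Zu/Abnahme</td> <td><b><p class="right">44.6</p></b></td> '
    '<tr> <td>Gewicht Volk 2</td> <td><b><p class="right">29.4</p></b></td> '
    '<td>ab 00.00 Uhr</td> <td><b><p class="right">29.4</p></b></td> '
    '<tr> <td>Luftdruck</td> <td><b><p class="right">1015</p></b></td> </tr>'
)


def nth_value(html, n):
    """Skip n occurrences of right" and capture the number after the next one.

    Hypothetical helper: returns the (n+1)th value as a string, or None.
    """
    pattern = (
        r'(?:.|\r|\n)*?(?:right"(?:.|\r|\n)*?){%d}'
        r'right">(\d+\.?\d*)(?:.|\r|\n)*' % n
    )
    # re.match anchors at the start; the pattern already begins with a
    # match-everything group, so nothing is lost by anchoring.
    m = re.match(pattern, html)
    return m.group(1) if m else None


print(nth_value(HTML, 0))  # first value
print(nth_value(HTML, 4))  # fifth value
```

Note that \d+ does not accept a minus sign, so a value such as the CSQ reading of -1 would need a leading -? inside the capturing group.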
-
Or, if you wanted to be explicit about the labels, fearing the order may change or some such,
Luftdruck(?:.|\r|\n)*?(\d+\.?\d*)
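A rough Python sketch of the label-anchored variant, again just for experimenting outside Mango. The HTML is a shortened copy of the table from the question and value_after is a hypothetical helper name; Python's re module is assumed to behave like Mango's engine for this simple pattern.

```python
import re

# Shortened copy of three rows from the question (units omitted).
HTML = (
    '<tr> <td>Luftdruck</td> <td><b><p class="right">1015</p></b></td> </tr> '
    '<td>Temperatur Drucksensor</td> <td><b><p class="right">1.7</p></b></td> '
    '<tr> <td>Temperatur</td> <td><b><p class="right">6.2</p></b></td> </tr>'
)


def value_after(label, html):
    """Return the first number that follows `label`, or None if not found.

    Hypothetical helper: the label is escaped so it is matched literally.
    """
    pattern = re.escape(label) + r'(?:.|\r|\n)*?(\d+\.?\d*)'
    m = re.search(pattern, html)
    return m.group(1) if m else None


print(value_after('Luftdruck', HTML))
print(value_after('Temperatur</td>', HTML))
```

One thing to watch: a label that is a prefix of another label matches the earlier row. Here 'Temperatur' on its own would hit 'Temperatur Drucksensor' first, which is why the example includes the closing </td>. Likewise, a label cut short before a digit (such as 'Gewicht Volk') would capture the digit from the label itself.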
-
Brilliant, thanks Phil. Appreciate your explanation. That second regex works perfectly!