Monday, March 13, 2017

Regex Match HTML/XML with Laziness to get Tag Contents

Abstract

Regular expressions are extremely powerful. Figuring out how to get them to match what you want, though, can be a challenge. One of the tougher matches is with HTML/XML content. Often you get more matched than you want; that’s because your quantifier is being greedy. Be lazy! You’ll get a better match.

Disclaimer

This post is solely informative. Critically think before using any information presented. Learn from it but ultimately make your own decisions at your own risk.

Problem

Suppose you have the following bit of HTML.

<p> this is a <span>very</span> <b>cool</b> regex tip </p>

You want to use a capturing group to get the contents of the <span> tag. So you put together a regular expression that looks like this:

<span>(.+)<

But unfortunately this doesn’t get you the contents of the tag. The regular expression is too greedy and matches too much of the string: it matches all the way to the start of the closing paragraph tag, so the capturing group holds "very</span> <b>cool</b> regex tip " rather than just "very".

<p> this is a <span>very</span> <b>cool</b> regex tip </p>

Matched text: <span>very</span> <b>cool</b> regex tip <
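To see the greedy behavior for yourself, here is a minimal sketch in Python. Python's built-in re module is my choice for illustration (the post itself is not tied to any language), and the variable names are made up for this example.

```python
import re

# The sample HTML from this post.
html = "<p> this is a <span>very</span> <b>cool</b> regex tip </p>"

# Greedy: .+ consumes as much as it can, then backs off
# only as far as the LAST "<" in the string.
greedy = re.search(r"<span>(.+)<", html)

print(greedy.group(0))  # the whole match
print(greedy.group(1))  # the capturing group
```

The capturing group comes back as "very</span> <b>cool</b> regex tip " (note the trailing space), which is far more than the contents of the <span> tag.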

So what’s the problem here? The + quantifier is greedy by default: it consumes as many characters as it can and only backs off as far as the last < in the string. Let’s make it less greedy and a bit lazier.

Solution

The solution is to make the quantifier lazy by adding a ? after the +. A lazy quantifier matches as few characters as possible, so the match stops at the first < after the tag’s contents instead of running on to the last < in the string. Here is the lazier regular expression.

<span>(.+?)<

Now this will match to the start of the closing </span> tag like you might expect it to. Plus, now that the matching is working more as expected, the capturing group can easily get the contents of the tag. Here is how the regular expression matches now.

<p> this is a <span>very</span> <b>cool</b> regex tip </p>

Matched text: <span>very<
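The lazy version can be checked the same way. This is again only a sketch using Python's re module, with illustrative variable names that are not part of the original post.

```python
import re

# The sample HTML from this post.
html = "<p> this is a <span>very</span> <b>cool</b> regex tip </p>"

# Lazy: .+? consumes as little as it can, so the match
# stops at the FIRST "<" after the opening <span> tag.
lazy = re.search(r"<span>(.+?)<", html)

print(lazy.group(0))  # <span>very<
print(lazy.group(1))  # very
```

With the single added ?, the capturing group holds exactly the text between <span> and </span>.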

Summary

That’s it. Be a little more lazy and a little less greedy. I hope this has helped you figure out your regular expression matching problem.

References

Goyvaerts, J. (2016, December 08). Laziness Instead of Greediness. Regular-Expressions.info. Retrieved from http://www.regular-expressions.info/repeat.html
