How to easily parse HTML without RegEx

I recently discovered an absolutely amazing HTML parsing library for .NET called HtmlAgilityPack. It completely takes away the pain of parsing complicated HTML with regular expressions.

Here’s a very simple example of what you could do with it - I’m just extracting inner HTML from any element inside a HTML file which has a css class called “scrape” assigned to it:

using HtmlAgilityPack;

public partial class _Default : System.Web.UI.Page
{
    protected void Page_Load(object sender, EventArgs e)
    {
        HtmlDocument doc = new HtmlDocument();
        doc.Load(Server.MapPath(filePath));
        Parse(doc.DocumentNode);
    }
    private void Parse(HtmlNode n)
    {
        foreach (HtmlAttribute atr in n.Attributes)
        {
            if (atr.Name == "class" && atr.Value == "scrape")
            {
                Response.Write(n.InnerHtml);
            }
        }

        if (n.HasChildNodes)
        {
            foreach (HtmlNode cn in n.ChildNodes)
            {
                Parse(cn);
            }
        }
    }
}

That’s just a very small part of what it could do. I’ll expand upon this and post a few more examples in the future showing some interesting things you could do with this.

If you have any questions or comments, please post them below. If you liked this post, you can share it with your followers or follow me on Twitter!