How to easily parse HTML without RegEx

I recently discovered HtmlAgilityPack, an HTML parsing library for .NET. It takes away the pain of picking apart complicated HTML with regular expressions by giving you a document tree you can walk instead.

Here’s a very simple example of what you can do with it. I’m extracting the inner HTML of every element in an HTML file that has the CSS class “scrape” assigned to it:

using System;
using HtmlAgilityPack;

public partial class _Default : System.Web.UI.Page
{
    protected void Page_Load(object sender, EventArgs e)
    {
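        // filePath is not defined in this snippet; it is assumed to hold the
        // virtual path of the HTML file to parse.
        HtmlDocument doc = new HtmlDocument();
        doc.Load(Server.MapPath(filePath));

        // Walk the tree, starting from the root node.
        Parse(doc.DocumentNode);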
    }
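    // Recursively visit n and its descendants, writing the inner HTML of
    // every element that has the "scrape" class.
    private void Parse(HtmlNode n)
    {
        // A class attribute can hold several space-separated names, so check
        // each name instead of comparing the whole attribute value.
        foreach (string cls in n.GetAttributeValue("class", "").Split(' '))
        {
            if (cls == "scrape")
            {
                Response.Write(n.InnerHtml);
                break;
            }
        }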

        if (n.HasChildNodes)
        {
            foreach (HtmlNode cn in n.ChildNodes)
            {
                Parse(cn);
            }
        }
    }
}
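
If you don’t need to visit every node yourself, the same lookup can probably be written more compactly with the library’s XPath support. The following is only a minimal sketch, not a drop-in replacement for the page above: it is a console program rather than a Web Forms page, the class name ScrapeExample and the inline markup are made up for illustration, and the XPath shown matches only elements whose class attribute is exactly “scrape”.

```csharp
using System;
using HtmlAgilityPack;

public static class ScrapeExample
{
    public static void Main()
    {
        // Hypothetical inline markup, used here instead of a file on disk.
        string html =
            "<div>" +
            "<p class=\"scrape\">first</p>" +
            "<span class=\"other\">skip</span>" +
            "<p class=\"scrape\">second</p>" +
            "</div>";

        HtmlDocument doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Select every element whose class attribute is exactly "scrape".
        HtmlNodeCollection nodes =
            doc.DocumentNode.SelectNodes("//*[@class='scrape']");

        // SelectNodes returns null (not an empty collection) when nothing matches.
        if (nodes == null)
        {
            Console.WriteLine("No matching elements.");
            return;
        }

        foreach (HtmlNode node in nodes)
        {
            Console.WriteLine(node.InnerHtml);
        }
    }
}
```

With the sample markup, this should print the inner HTML of the two paragraphs and skip the span. To also match elements that carry several classes, the usual XPath 1.0 idiom is //*[contains(concat(' ', normalize-space(@class), ' '), ' scrape ')].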

That’s just a small part of what it can do. I’ll expand on this and post a few more examples in the future showing other interesting things you can do with it.
