How to easily parse HTML without RegEx

06 May 2008

I recently discovered a fantastic HTML parsing library for .NET called HtmlAgilityPack. It takes away the pain of parsing complicated HTML with regular expressions.

Here’s a very simple example of what you can do with it. I’m just extracting the inner HTML of every element in an HTML file that has the CSS class “scrape” assigned to it:

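using System;
using HtmlAgilityPack;

public partial class _Default : System.Web.UI.Page
{
    protected void Page_Load(object sender, EventArgs e)
    {
        // filePath is assumed to be defined elsewhere in the page
        // (the virtual path of the HTML file to parse).
        HtmlDocument doc = new HtmlDocument();
        doc.Load(Server.MapPath(filePath));

        // Start the recursive walk at the root of the document.
        Parse(doc.DocumentNode);
    }

    // Visits a node and all of its descendants, writing out the inner HTML
    // of every element whose class attribute is exactly "scrape".
    private void Parse(HtmlNode n)
    {
        foreach (HtmlAttribute atr in n.Attributes)
        {
            // Exact match only: class="scrape other" would not match.
            if (atr.Name == "class" && atr.Value == "scrape")
            {
                Response.Write(n.InnerHtml);
            }
        }

        // Recurse into the children, if there are any.
        if (n.HasChildNodes)
        {
            foreach (HtmlNode cn in n.ChildNodes)
            {
                Parse(cn);
            }
        }
    }
}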
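
As a variation on the recursive walk above, the same extraction can probably be written more compactly with the library's XPath support. This is only a sketch: the ScrapeExample class and the ExtractScraped helper are names I made up for illustration, and it parses a string with LoadHtml instead of loading a file, so it can be tried outside a web page.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static class ScrapeExample
{
    // Hypothetical helper (not part of HtmlAgilityPack): returns the inner HTML
    // of every element whose class attribute is exactly "scrape".
    public static List<string> ExtractScraped(string html)
    {
        HtmlDocument doc = new HtmlDocument();
        doc.LoadHtml(html); // parse from a string rather than a file

        List<string> results = new List<string>();

        // "//*[@class='scrape']" selects any element, at any depth, whose
        // class attribute equals "scrape" (same exact match as the loop version).
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//*[@class='scrape']");

        // SelectNodes returns null, not an empty collection, when nothing matches.
        if (nodes == null)
        {
            return results;
        }

        foreach (HtmlNode node in nodes)
        {
            results.Add(node.InnerHtml);
        }

        return results;
    }
}
```

Inside the page from the first example, you could pass each returned string to Response.Write. The null check matters: looping over the result of SelectNodes directly would throw when the document has no matching elements.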

That’s just a very small part of what it can do. I’ll expand upon this and post a few more examples in the future showing some interesting things you can do with it.

Jesal Gadhia
