How to easily parse HTML without RegEx
06 May 2008

I recently discovered an absolutely amazing HTML parsing library for .NET called HtmlAgilityPack. It takes away the pain of parsing complicated HTML with regular expressions.
Here’s a very simple example of what you can do with it. I’m just extracting the inner HTML of every element in an HTML file whose class attribute is set to “scrape”:
    using System;
    using HtmlAgilityPack;

    public partial class _Default : System.Web.UI.Page
    {
        protected void Page_Load(object sender, EventArgs e)
        {
            // filePath is the virtual path of the HTML file to parse (defined elsewhere).
            HtmlDocument doc = new HtmlDocument();
            doc.Load(Server.MapPath(filePath));
            Parse(doc.DocumentNode);
        }

        // Walks the tree depth-first and writes out the inner HTML of every
        // node whose class attribute is exactly "scrape".
        private void Parse(HtmlNode n)
        {
            foreach (HtmlAttribute atr in n.Attributes)
            {
                if (atr.Name == "class" && atr.Value == "scrape")
                {
                    Response.Write(n.InnerHtml);
                }
            }

            if (n.HasChildNodes)
            {
                foreach (HtmlNode cn in n.ChildNodes)
                {
                    Parse(cn);
                }
            }
        }
    }
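For comparison, the recursive walk above can probably be replaced by a single XPath query, since HtmlAgilityPack nodes expose a SelectNodes method. What follows is only a minimal sketch, not the approach used in this post: the ScrapeExample class and the sample markup are made up for illustration, and it runs as a console program against an inline string rather than a file on the server.

```csharp
using System;
using HtmlAgilityPack;

// Hypothetical console program; the class name and markup are illustrative only.
public static class ScrapeExample
{
    public static void Main()
    {
        // Assumed sample markup, used here instead of loading a file from disk.
        const string html =
            "<html><body>" +
            "<div class=\"scrape\"><b>first</b></div>" +
            "<p class=\"other\">skipped</p>" +
            "<span class=\"scrape\">second</span>" +
            "</body></html>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Same exact-match rule as the recursive version:
        // the class attribute must equal "scrape".
        HtmlNodeCollection matches =
            doc.DocumentNode.SelectNodes("//*[@class='scrape']");

        // SelectNodes returns null, not an empty collection, when nothing matches.
        if (matches == null)
        {
            return;
        }

        foreach (HtmlNode node in matches)
        {
            Console.WriteLine(node.InnerHtml);
        }
    }
}
```

Like the original, this only matches elements whose class attribute is exactly “scrape”; an element with several classes (for example class="scrape big") would need a different XPath expression.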
That’s just a very small part of what it can do. I’ll expand on this in future posts with a few more examples showing some of the interesting things you can do with it.