Flatten HTML Document to List of Tags, Attributes, and Values

I needed to flatten a set of HTML documents into a list of the tags in their head sections, with one row per attribute of each tag. I thought this bit of code might be useful to someone in the future.

This uses the CsQuery library, which is a C# port of jQuery: https://github.com/jamietre/CsQuery
CsQuery is also available as a NuGet package: https://www.nuget.org/packages/CsQuery


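    using System;       // Guid
    using System.Linq;  // query syntax, SelectMany, ToList

    // Note: Get HTML from somewhere...
    var html = "";

    // Parse the document and select its <head> element.
    var cq = CsQuery.CQ.Create(html);

    var head = cq["head"];

    // Keep every direct child of <head> except <script> and <link>.
    // NodeName is compared in upper case, and each tag gets its own id
    // so that its attribute rows can be related to each other later.
    var nonScriptHeadTagsQuery =
        from t in head.Children()
        where
            t.NodeName != "SCRIPT"
            && t.NodeName != "LINK"
        select new { Tag = t, TagId = Guid.NewGuid() };

    var nonScriptHeadTags = nonScriptHeadTagsQuery.ToList();

    // Flatten to one row per (tag, attribute) pair.
    // Tags that have no attributes (a bare <title>, for example) produce no rows.
    var htmlTags =
        nonScriptHeadTags
            .SelectMany(tagInfo => tagInfo.Tag.Attributes, (tagInfo, attribute) => new { TagInfo = tagInfo, Attribute = attribute })
            .Select(x => new
            {
                TagId = x.TagInfo.TagId,
                TagType = x.TagInfo.Tag.NodeName,
                AttributeName = x.Attribute.Key,
                AttributeValue = x.Attribute.Value,
            })
            .ToList();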
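Since every row carries the TagId of the tag it came from, the flat list can be grouped back into one entry per tag when needed. Here is a minimal sketch of that idea. It does not use CsQuery at all: TagRow, TagRows.Describe, and the sample values are hypothetical names and data I made up for illustration, with TagRow standing in for the anonymous type in htmlTags.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for the anonymous row type in htmlTags.
public sealed class TagRow
{
    public TagRow(Guid tagId, string tagType, string attributeName, string attributeValue)
    {
        TagId = tagId;
        TagType = tagType;
        AttributeName = attributeName;
        AttributeValue = attributeValue;
    }

    public Guid TagId { get; }
    public string TagType { get; }
    public string AttributeName { get; }
    public string AttributeValue { get; }
}

public static class TagRows
{
    // Groups the flat rows by TagId and builds one line of text per tag.
    // GroupBy yields groups in order of first appearance, so tag order is kept.
    public static List<string> Describe(IEnumerable<TagRow> rows)
    {
        return rows
            .GroupBy(r => r.TagId)
            .Select(g => g.First().TagType + ": " +
                         string.Join(", ", g.Select(r => r.AttributeName + "=" + r.AttributeValue)))
            .ToList();
    }
}

public static class Program
{
    public static void Main()
    {
        // Made-up sample rows: one <meta> tag with two attributes, one <base> tag with one.
        var metaId = Guid.NewGuid();
        var baseId = Guid.NewGuid();

        var rows = new List<TagRow>
        {
            new TagRow(metaId, "META", "name", "description"),
            new TagRow(metaId, "META", "content", "Sample page"),
            new TagRow(baseId, "BASE", "href", "https://example.com/"),
        };

        foreach (var line in TagRows.Describe(rows))
        {
            Console.WriteLine(line);
        }
    }
}
```

With the sample rows above, this prints one line for the META tag with both of its attributes and one line for the BASE tag. Grouping on TagId rather than TagType matters because a head section usually contains several tags of the same type.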


Hope this helps,
Aaron
