Flatten HTML Document to List of Tags, Attributes, and Values
I needed to flatten a set of HTML documents into a list of the tags in their head sections, with one row per tag attribute. I thought this bit of code might be useful to someone in the future.
This uses the CsQuery library, which is a C# port of jQuery: https://github.com/jamietre/CsQuery
CsQuery is also available as a NuGet package: https://www.nuget.org/packages/CsQuery
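using System;
using System.Linq;

// Note: get the HTML from somewhere (a file, an HTTP response, etc.)
var html = "";
var cq = CsQuery.CQ.Create(html);
var head = cq["head"];

// Take the direct children of <head>, skipping script and link elements.
// NodeName is compared in upper case, as in the browser DOM.
var nonScriptHeadTagsQuery =
    from t in head.Children()
    where
        t.NodeName != "SCRIPT"
        && t.NodeName != "LINK"
    select new { Tag = t, TagId = Guid.NewGuid() };

// Materialize the query so each tag keeps a single TagId;
// re-enumerating the deferred query would generate new GUIDs.
var nonScriptHeadTags = nonScriptHeadTagsQuery.ToList();

// Produce one row per (tag, attribute) pair.
// Tags that have no attributes produce no rows.
var htmlTags =
    nonScriptHeadTags
        .SelectMany(tagInfo => tagInfo.Tag.Attributes, (tagInfo, attribute) => new { TagInfo = tagInfo, Attribute = attribute })
        .Select(x => new
        {
            TagId = x.TagInfo.TagId,
            TagType = x.TagInfo.Tag.NodeName,
            AttributeName = x.Attribute.Key,
            AttributeValue = x.Attribute.Value,
        })
        .ToList();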
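To see what the SelectMany step does on its own, here is a minimal sketch that uses plain LINQ and no CsQuery at all. The FakeTag and Row classes, the SampleTags data, and the FlattenSketch class are hypothetical stand-ins I made up for illustration; they are not CsQuery types, and the sample values are not from a real page.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class FlattenSketch
{
    // Hypothetical stand-in for a parsed element (NOT a CsQuery type).
    public sealed class FakeTag
    {
        public string NodeName { get; set; }
        public Dictionary<string, string> Attributes { get; set; }
    }

    // One output row per (tag, attribute) pair.
    public sealed class Row
    {
        public Guid TagId { get; set; }
        public string TagType { get; set; }
        public string AttributeName { get; set; }
        public string AttributeValue { get; set; }
    }

    // Made-up sample data standing in for the children of <head>.
    public static List<FakeTag> SampleTags()
    {
        return new List<FakeTag>
        {
            new FakeTag { NodeName = "META", Attributes = new Dictionary<string, string> { { "charset", "utf-8" } } },
            new FakeTag { NodeName = "META", Attributes = new Dictionary<string, string> { { "name", "description" }, { "content", "Example" } } },
            new FakeTag { NodeName = "TITLE", Attributes = new Dictionary<string, string>() },
            new FakeTag { NodeName = "SCRIPT", Attributes = new Dictionary<string, string> { { "src", "app.js" } } },
        };
    }

    public static List<Row> Flatten(IEnumerable<FakeTag> tags)
    {
        return tags
            // Same filter as the CsQuery version.
            .Where(t => t.NodeName != "SCRIPT" && t.NodeName != "LINK")
            // Give each tag one id; every attribute row of that tag shares it.
            .Select(t => new { Tag = t, TagId = Guid.NewGuid() })
            // Expand each tag into one row per attribute.
            .SelectMany(
                info => info.Tag.Attributes,
                (info, attribute) => new Row
                {
                    TagId = info.TagId,
                    TagType = info.Tag.NodeName,
                    AttributeName = attribute.Key,
                    AttributeValue = attribute.Value,
                })
            .ToList();
    }

    public static void Main()
    {
        foreach (var row in Flatten(SampleTags()))
        {
            Console.WriteLine("{0} {1} {2}={3}", row.TagId, row.TagType, row.AttributeName, row.AttributeValue);
        }
    }
}
```

With this sample data the result has three rows: one for the first META tag and two for the second, which share a TagId. The TITLE tag has no attributes, so it contributes nothing, and the SCRIPT tag is filtered out. If you need attribute-less tags to show up in the list, you would have to handle them separately.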
Hope this helps,
Aaron