





Parses HTML site from given Template


Parser Template


Builds a profile of the HTML source for parsing


Xml Serializable Class for Section Match data


Parses a section of HTML source for elements from a given template


Xml Serializable Class for SectionTemplate data


Interface for Storing the Parser Data


Simple Parser Data class. Stores any Element tag and value in a dictionary.
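
The dictionary-backed storage described for ParserData can be pictured with a small sketch. This is only an illustration: the interface ISimpleParserData, the class DictionaryParserData, and their SetElement/GetElement members are hypothetical stand-ins, since the real IParserData members are not shown on this page.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for IParserData (the real interface is not shown here).
interface ISimpleParserData
{
    void SetElement(string tag, string value);
}

// Hypothetical dictionary-backed store: keeps the latest value for each tag.
class DictionaryParserData : ISimpleParserData
{
    private readonly Dictionary<string, string> _elements =
        new Dictionary<string, string>();

    public void SetElement(string tag, string value)
    {
        _elements[tag] = value;
    }

    // Returns the stored value, or an empty string if the tag was never set.
    public string GetElement(string tag)
    {
        string value;
        return _elements.TryGetValue(tag, out value) ? value : string.Empty;
    }
}

static class ParserDataDemo
{
    static void Main()
    {
        DictionaryParserData data = new DictionaryParserData();
        data.SetElement("<#DATA>", "some cell text");
        Console.WriteLine(data.GetElement("<#DATA>"));
        Console.WriteLine(data.GetElement("<#MISSING>").Length);
    }
}
```

Keeping the store behind an interface means a caller can supply its own data class with typed fields instead of a generic dictionary.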


With HtmlParser, it is simple to parse a web site and extract whatever data you need from it.

// Create a template - can be loaded from xml config file
HtmlParserTemplate template = new HtmlParserTemplate();
template.SectionTemplate = new HtmlSectionTemplate();

// setup the template
template.SectionTemplate.Tags = "T";
template.SectionTemplate.Template = "<table><tr><td><#DATA></td></tr></table>";

HtmlParser parser = new HtmlParser(template, typeof(ParserData), null);

// Build a request of the site to parse
HTTPRequest request = new HTTPRequest("");

// Load the site and see how many times the template occurs
int count = parser.ParseUrl(request);

// now we can get the data for each occurrence
for (int i = 0; i < count; i++)
{
  IParserData data = parser.GetData(i);
  // and here we do something with the data - display it, store it, etc
}

If we need more than the template can provide, we can run regex searches on the source of each section using the method:

string SearchRegex(int index, string regex, bool remove)
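
The exact behaviour of SearchRegex is not documented on this page. As a rough illustration of what a regex search with a remove flag might do, the sketch below applies the standard .NET Regex class to a plain string. The helper SearchAndRemove and the sample HTML are hypothetical and are not part of the HtmlParser API.

```csharp
using System;
using System.Text.RegularExpressions;

static class RegexSearchSketch
{
    // Hypothetical helper: returns the first match of `regex` in `source`
    // and, when `remove` is true, cuts the matched text out of `source`.
    public static string SearchAndRemove(ref string source, string regex, bool remove)
    {
        Match match = Regex.Match(source, regex, RegexOptions.IgnoreCase);
        if (!match.Success)
        {
            return string.Empty;
        }
        if (remove)
        {
            source = source.Remove(match.Index, match.Length);
        }
        return match.Value;
    }

    static void Main()
    {
        // Sample section source, made up for this example.
        string section = "<td>Title <b>(repeat)</b></td>";
        string found = SearchAndRemove(ref section, @"<b>\(repeat\)</b>", true);
        Console.WriteLine(found);   // the matched marker
        Console.WriteLine(section); // the section with the marker removed
    }
}
```

Removing the matched text first can be useful when a marker sits inside a cell and would otherwise end up in the value the template captures.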

Designing the template is the hardest part of the whole process, and even that is not very hard (see Parser Template).

A number of other features are available from the underlying code that downloads the HTML source from the internet, including:

  • HTTP authentication
  • Using Internet Explorer instead of .NET to download the source (runs extra JavaScript in the source)
  • Caching the HTML pages to disk
  • Site statistics: Total number of pages and bytes downloaded, total time used, average transfer rate

If you want more control, or the HTML source comes from somewhere else (i.e. not the web), you can use the lower-level classes directly.

Here we will still get the source from the web:

// Again we build a request with site URL
HTTPRequest request = new HTTPRequest("");

// but instead of calling HtmlParser we will first get the source
HTMLPage page = new HTMLPage(request);
string source = page.GetPage();

// now that we have the source we can work on it before going to the parser

// To parse the source we again need a template
// This time only a section template and not a parser template
HtmlSectionTemplate template = new HtmlSectionTemplate();

template.Tags = "T";
template.Template = "<table><tr><td><#DATA></td></tr></table>";

// With the template we create a profiler
HtmlProfiler profiler = new HtmlProfiler(template);

// and use this to get the number of times the template occurs in our source
int count = profiler.MatchCount(source);

// to parse each section we use a section parser
HtmlSectionParser parser = new HtmlSectionParser(template);

for (int i = 0; i < count; i++)
{
  // Here we can get the source of each section
  string sectionSource = profiler.GetSource(i);

  // we must also create a place for the parsed data
  ParserData data = new ParserData();
  IParserData iData = data;

  // Finally we can parse the section source
  parser.ParseSection(sectionSource, ref iData);
}


