Parser

Class                Description
HtmlParser           Parses HTML site from given Template
HtmlParserTemplate   Parser Template
HtmlProfiler         Builds a profile of the HTML source for parsing
HtmlSectionMatch     Xml Serializable Class for Section Match data
HtmlSectionParser    Parses a section of HTML source for elements from a given template
HtmlSectionTemplate  Xml Serializable Class for SectionTemplate data
IParserData          Interface for Storing the Parser Data
ParserData           Simple Parser Data class. Stores any Element tag and value in a dictionary.

Usage:

With HtmlParser it is simple to parse a web site and extract whatever data you need from it.

// Create a template - can be loaded from xml config file
HtmlParserTemplate template = new HtmlParserTemplate();
template.SectionTemplate = new HtmlSectionTemplate();

// setup the template
template.SectionTemplate.Tags = "T";
template.SectionTemplate.Template = "<table><tr><td><#DATA></td></tr></table>";

HtmlParser parser = new HtmlParser(template, typeof(ParserData), null);

// Build a request of the site to parse
HTTPRequest request = new HTTPRequest("http://some.site.com/page_of_interest");

// Load the site and see how many times the template occurs
int count = parser.ParseUrl(request);

// now we can get the data for each occurrence
for (int i = 0; i < count; i++)
{
  IParserData data = parser.GetData(i);
  // and here we do something with the data - display it, store it, etc
}
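
The class list above describes ParserData as storing each element tag and its value in a dictionary. As a rough illustration of what such a store looks like, here is a minimal, self-contained sketch. All names in it (ITagStore, DictionaryTagStore, SetElement, GetElement) are hypothetical stand-ins chosen for this example; the real IParserData and ParserData members may differ.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for an interface like IParserData (names are assumptions).
public interface ITagStore
{
    void SetElement(string tag, string value);
}

// Hypothetical stand-in for a class like ParserData: keeps tag/value pairs in a dictionary.
public class DictionaryTagStore : ITagStore
{
    private readonly Dictionary<string, string> _elements = new Dictionary<string, string>();

    // Stores the value for a tag, overwriting any earlier value for the same tag.
    public void SetElement(string tag, string value)
    {
        _elements[tag] = value;
    }

    // Returns the stored value, or an empty string if the tag was never set.
    public string GetElement(string tag)
    {
        string value;
        return _elements.TryGetValue(tag, out value) ? value : string.Empty;
    }
}

public static class TagStoreDemo
{
    public static void Main()
    {
        var store = new DictionaryTagStore();
        store.SetElement("<#DATA>", "some cell text");
        Console.WriteLine(store.GetElement("<#DATA>"));
    }
}
```

A store like this lets the parser fill in values through the interface, while the calling code reads them back through the concrete class.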

If we need something more than the template can provide, it is possible to perform regex searches on the source of each section using the method:

string SearchRegex(int index, string regex, bool remove)
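
The page does not spell out what SearchRegex returns or what the remove flag does. The sketch below assumes it returns the first match and, when remove is true, cuts the matched text out of the section source. It works on a plain string with the standard .NET Regex class rather than on the MediaPortal parser, so the SectionSearch class and its ref-string signature are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical stand-in, not the MediaPortal implementation.
public static class SectionSearch
{
    // Returns the first match of `regex` in `source`, or an empty string if there is none.
    // When `remove` is true, the matched text is also removed from `source`.
    public static string SearchRegex(ref string source, string regex, bool remove)
    {
        Match match = Regex.Match(source, regex);
        if (!match.Success)
        {
            return string.Empty;
        }
        if (remove)
        {
            source = source.Remove(match.Index, match.Length);
        }
        return match.Value;
    }

    public static void Main()
    {
        string section = "<td>News 19:30 (repeat)</td>";

        // Pull out the time without changing the section source.
        string time = SearchRegex(ref section, @"\d{2}:\d{2}", false);

        // Pull out the repeat marker and strip it from the section source.
        string note = SearchRegex(ref section, @"\s*\(repeat\)", true);

        Console.WriteLine(time);
        Console.WriteLine(note.Trim());
        Console.WriteLine(section);
    }
}
```

Removing a match before the section is parsed can be useful when extra text would otherwise stop the template from matching.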

Designing the template is the hardest part of the whole process, though it is not difficult either (see Parser Template).

A number of other features are also available from the underlying code that downloads the HTML source from the internet, including:

  • Http Authentication
  • Using Internet Explorer to download the Source instead of .NET (Runs extra Javascript in the source)
  • Caching the HTML pages to disk
  • Site statistics: Total number of pages and bytes downloaded, total time used, average transfer rate

If you want more control, or the HTML source comes from somewhere else (i.e. not the web), you can use the lower-level classes directly instead of HtmlParser.

Here we will still get the source from the web:

// Again we build a request with site URL
HTTPRequest request = new HTTPRequest("http://some.site.com/page_of_interest");

// but instead of calling HtmlParser we will first get the source
HTMLPage page = new HTMLPage(request);
string source = page.GetPage();

// now that we have the source we can work on it before going to the parser

// To parse the source we again need a template
// This time only a section template and not a parser template
HtmlSectionTemplate template = new HtmlSectionTemplate();

template.Tags = "T";
template.Template = "<table><tr><td><#DATA></td></tr></table>";

// With the template we create a profiler
HtmlProfiler profiler = new HtmlProfiler(template);

// and use this to get the number of times the template occurs in our source
int count = profiler.MatchCount(source);

// to parse each section we use a section parser
HtmlSectionParser parser = new HtmlSectionParser(template);

for (int i = 0; i < count; i++)
{
  // Here we can get the source of each section
  string sectionSource = profiler.GetSource(i);

  // we must also create a place for the parsed data
  ParserData data = new ParserData();
  IParserData iData = data;

  // Finally we can parse the section source
  parser.ParseSection(sectionSource, ref iData);
}

   

 
