Parser
Class | Description |
---|---|
HtmlParser | Parses HTML site from given Template |
HtmlParserTemplate | Parser Template |
HtmlProfiler | Builds a profile of the HTML source for parsing |
HtmlSectionMatch | Xml Serializable Class for Section Match data |
HtmlSectionParser | Parses a section of HTML source for elements from a given template |
HtmlSectionTemplate | Xml Serializable Class for SectionTemplate data |
IParserData | Interface for Storing the Parser Data |
ParserData | Simple Parser Data class. Stores any Element tag and value in a dictionary. |
Usage:
With HtmlParser it is very simple to parse a web site and extract whatever data you need from it.
    // Create a template - can be loaded from xml config file
    HtmlParserTemplate template = new HtmlParserTemplate();
    template.SectionTemplate = new HtmlSectionTemplate();

    // setup the template
    template.SectionTemplate.Tags = "T";
    template.SectionTemplate.Template = "<table><tr><td><#DATA></td></tr></table>";

    HtmlParser parser = new HtmlParser(template, typeof(ParserData), null);

    // Build a request of the site to parse
    HTTPRequest request = new HTTPRequest("http://some.site.com/page_of_interest");

    // Load the site and see how many times the template occurs
    int count = parser.ParseUrl(request);

    // now we can get the data for each occurrence
    for (int i = 0; i < count; i++)
    {
        IParserData data = parser.GetData(i);
        // and here we do something with the data - display it, store it, etc
    }
If we need something more than the template can provide, it is possible to perform regex searches on the source of each section using the method:
string SearchRegex(int index, string regex, bool remove)
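This page does not describe what `SearchRegex` returns or what `remove` does. The sketch below is therefore only an illustration, built on an assumption: the method returns the first match found in the section at `index` and, when `remove` is true, cuts the matched text out of that section's source. `SectionSearchSketch` and its `GetSource` helper are hypothetical stand-ins written for this example and are not part of the library.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical stand-in, NOT the library's implementation: it only mimics
// the signature string SearchRegex(int index, string regex, bool remove).
class SectionSearchSketch
{
    private readonly List<string> sections;

    public SectionSearchSketch(IEnumerable<string> sectionSources)
    {
        sections = new List<string>(sectionSources);
    }

    // Assumed behaviour: return the first match in section `index`
    // (empty string if there is none) and, when `remove` is true,
    // cut the matched text out of the stored section source.
    public string SearchRegex(int index, string regex, bool remove)
    {
        Match match = Regex.Match(sections[index], regex);
        if (!match.Success)
        {
            return string.Empty;
        }
        if (remove)
        {
            sections[index] = sections[index].Remove(match.Index, match.Length);
        }
        return match.Value;
    }

    // Hypothetical helper so the effect of `remove` can be inspected.
    public string GetSource(int index)
    {
        return sections[index];
    }
}

class Program
{
    static void Main()
    {
        var sketch = new SectionSearchSketch(new[]
        {
            "<td>Episode 12 <b>aired 2009-05-01</b></td>"
        });

        // Pull a date out of the first section without altering it.
        string date = sketch.SearchRegex(0, @"\d{4}-\d{2}-\d{2}", false);
        Console.WriteLine(date);

        // Same search, but this time strip the match from the section.
        sketch.SearchRegex(0, @"\d{4}-\d{2}-\d{2}", true);
        Console.WriteLine(sketch.GetSource(0));
    }
}
```

With the real parser the call would presumably look like `parser.SearchRegex(i, pattern, false)` inside the loop over sections. Check the library source before relying on the `remove` semantics assumed here.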
Designing the template is the hardest part of the whole process, and even that is not very hard (see Parser Template).
A number of other features are also available from the underlying code that downloads the HTML source from the internet, including:
- Http Authentication
- Using Internet Explorer to download the source instead of .NET (runs extra JavaScript in the source)
- Caching the HTML pages to disk
- Site statistics: Total number of pages and bytes downloaded, total time used, average transfer rate
If you want more control, or the HTML source comes from somewhere else (i.e. not the web), the lower-level classes can be used directly.
Here we will still get the source from the web:
    // Again we build a request with the site URL
    HTTPRequest request = new HTTPRequest("http://some.site.com/page_of_interest");

    // but instead of calling HtmlParser we will first get the source
    HTMLPage page = new HTMLPage(request);
    string source = page.GetPage();

    // now that we have the source we can work on it before going to the parser

    // To parse the source we again need a template
    // This time only a section template and not a parser template
    HtmlSectionTemplate template = new HtmlSectionTemplate();
    template.Tags = "T";
    template.Template = "<table><tr><td><#DATA></td></tr></table>";

    // With the template we create a profiler
    HtmlProfiler profiler = new HtmlProfiler(template);

    // and use this to get the number of times the template occurs in our source
    int count = profiler.MatchCount(source);

    // to parse each section we use a section parser
    HtmlSectionParser parser = new HtmlSectionParser(template);

    // loop over every occurrence of the template
    for (int i = 0; i < count; i++)
    {
        // Here we can get the source of each section
        string sectionSource = profiler.GetSource(i);

        // we must also create a place for the parsed data
        ParserData data = new ParserData();
        IParserData iData = data;

        // Finally we can parse the section source
        parser.ParseSection(sectionSource, ref iData);
    }