[c#] String to HtmlDocument

I'm fetching the html document by URL using WebClient.DownloadString(url) but then its very hard to find the element content that I'm looking for. Whilst reading around I've spotted HtmlDocument and that it has neat things like GetElementById. How can I populate an HtmlDocument with the html returned by url?

This question is related to c# html

The answer is


For those who don't want to use HTML agility pack and want to get HtmlDocument from string using native .net code only here is a good article on how to convert string to HtmlDocument

Here is the code block to use

public System.Windows.Forms.HtmlDocument GetHtmlDocument(string html)
        {
            WebBrowser browser = new WebBrowser();
            browser.ScriptErrorsSuppressed = true;
            browser.DocumentText = html;
            browser.Document.OpenNew(true);
            browser.Document.Write(html);
            browser.Refresh();
            return browser.Document;
        }

To answer the original question:

HTMLDocument doc = new HTMLDocument();
IHTMLDocument2 doc2 = (IHTMLDocument2)doc;
doc2.write(fileText);
// now use doc

Then to convert back to a string:

doc.documentElement.outerHTML;

you could get a htmldocument by:

 System.Net.WebClient wc = new System.Net.WebClient();

 System.IO.Stream stream = wc.OpenRead(url);
 System.IO.StreamReader reader = new System.IO.StreamReader(stream);
 string s = reader.ReadToEnd();

 HtmlDocument doc = new HtmlDocument();
 doc.LoadHtml(s);

so you have getbiyid and getbyname ... but any further you'd better of with
HTML Agility Pack as suggested . f.e you can do: doc.DocumentNode.SelectNodes(xpathselector) or regex to parse the doc ..

btw: why not regex ? . its soo cool if you can use it right... but xpath is also very mighty ... so "choose your poison"

cu


You could try with OpenNew and then with Write but that's a bit strange use of that class. More info on MSDN.


I've adapted Nikhil's answer somewhat to simplify it. Admittedly, I have not run it through a .net compiler and there are likely very good reasons for the lines Nikhil put in which I have omitted. However, at least in my use case (a very simple page) they were unnecessary.

My use case was for a quick powershell script:

$htmlText = $(New-Object 
System.Net.WebClient).DownloadString("<URI HERE>") #Get the HTML document from a webserver
$browser = New-Object System.Windows.Forms.WebBrowser
$browser.DocumentText = $htmlText
$browser.Document.Write($htmlText)
$response = $browser.document

For my case, this returned an HTMLDocument object with HTMLElement objects in it, instead of __ComObject object types (which are a challenge to use in powershell class code) returned by a call to Invoke-WebRequest in PS 5.1.14393.1944

I believe the equivalent C# code is:

public System.Windows.Forms.HtmlDocument GetHtmlDocument(string html)
{
    WebBrowser browser = new WebBrowser();
    browser.DocumentText = html;
    browser.Document.Write(html);
    return browser.Document;
}

Using Html Agility Pack as suggested by SLaks, this becomes very easy:

string html = webClient.DownloadString(url);
var doc = new HtmlDocument();
doc.LoadHtml(html);

HtmlNode specificNode = doc.GetElementById("nodeId");
HtmlNodeCollection nodesMatchingXPath = doc.DocumentNode.SelectNodes("x/path/nodes");