Removing namespaces from XHTML 

Last week we needed to get some HTML fragments out of an XHTML document. A straightforward process, one might say, but we ran into some namespace problems along the way.

Unlike HTML, XHTML is in essence an XML document, so you can use the XmlDocument class to load an in-memory representation of the document.

Our document:

    <?xml version="1.0" encoding="utf-8"?>
    <!DOCTYPE html
      PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
    <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="nl">
       <head>
          <title>title something</title>
       </head>
       <body>
          <div id="broodtekst">
             <h1>get this part</h1>
          </div>
       </body>
    </html>

Our objective is to retrieve the HTML inside <div id="broodtekst">, in this case <h1>get this part</h1>.

Retrieving the HTML:

    string result = string.Empty;
    XmlDocument doc = new XmlDocument();

    // bestand ("file") holds the XHTML source
    doc.Load(bestand);

    // Map the XHTML namespace to the prefix "d" so XPath can address the elements
    XmlNamespaceManager man = new XmlNamespaceManager(doc.NameTable);
    man.AddNamespace("d", "http://www.w3.org/1999/xhtml");

    XmlNode node = doc.SelectSingleNode("//d:div[@id='broodtekst']", man);
    result = node.InnerXml;

 

Unfortunately, the result was not the expected string "<h1>get this part</h1>"; instead it was "<h1 xmlns="http://www.w3.org/1999/xhtml">get this part</h1>". It turns out that XmlDocument keeps track of the namespace each element belongs to and, when a fragment is serialized on its own, writes the namespace declaration on the fragment's outermost tags. A good feature, but not what we wanted. After a lot of searching and asking around, our colleague Keren came up with the answer (thanks, Keren): load the document without namespaces, so that none end up in the XmlDocument, before proceeding with the XPath query.

Removing all namespaces and retrieving the HTML:
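The namespace bookkeeping can be made visible with a small probe. This is only a sketch: the class and method names (NamespaceProbe, Inspect) are made up for illustration, and it assumes the XHTML is passed in as a string without a DOCTYPE so nothing has to be fetched. It returns the namespace URI of the first child of the div together with the div's InnerXml.

```csharp
using System;
using System.Xml;

public static class NamespaceProbe
{
    // Hypothetical helper: loads XHTML with namespace support on (the default)
    // and returns { namespace URI of the div's first child, InnerXml of the div }.
    public static string[] Inspect(string xhtml)
    {
        XmlDocument doc = new XmlDocument();
        doc.LoadXml(xhtml);

        // Same prefix mapping as in the listing above
        XmlNamespaceManager man = new XmlNamespaceManager(doc.NameTable);
        man.AddNamespace("d", "http://www.w3.org/1999/xhtml");

        XmlNode div = doc.SelectSingleNode("//d:div[@id='broodtekst']", man);
        if (div == null || div.FirstChild == null)
        {
            return null; // div not found or empty
        }

        // The child inherits the default namespace declared on <html>,
        // which is why InnerXml has to redeclare it on the fragment.
        return new string[] { div.FirstChild.NamespaceURI, div.InnerXml };
    }
}
```

For the sample document, the first entry should be the XHTML namespace URI, which is exactly what shows up as the xmlns attribute in the second entry.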

   1:  string result = null;
   2:  System.Xml.XmlDocument doc = new System.Xml.XmlDocument();
   3:  using (MemoryStream ms = new MemoryStream(bestand))
   4:  {
   5:       using (XmlTextReader tr = new XmlTextReader(ms))
   6:       {
   7:            tr.Namespaces = false;
   8:            tr.ProhibitDtd = false;
   9:            doc.Load(tr);
  10:       }
  11:  }
  12:  //Extract content div
  13:  XmlNode node2 = doc.SelectSingleNode("//div[@id='broodtekst']");
  14:  if (node2 != null)
  15:  {
  16:       result = node2.InnerXml;
  17:  }

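Instead of loading the stream directly into the XmlDocument, you wrap it in an XmlTextReader, which has a couple of sweet properties. Namespaces = false (line 7) turns off namespace support, so xmlns is treated as an ordinary attribute and the elements can be queried without a prefix. ProhibitDtd = false (line 8, thanks, William) makes the reader accept the DOCTYPE declaration instead of rejecting it. Note that this setting on its own does not stop the external DTD file from being retrieved; the usual way to prevent that download is to set XmlResolver to null.

Hopefully this helps some people in the .NET community who are struggling with XML/XHTML and namespaces.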
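The same idea can be wrapped in a small helper. Again a sketch under assumptions rather than the code we actually run: the names NamespaceFreeLoader and GetInnerXml are invented, the input is a string instead of a byte array, and DtdProcessing.Parse is used as the newer (.NET 4.0 and later) replacement for ProhibitDtd = false. Setting XmlResolver to null is meant to keep the DTD from being downloaded.

```csharp
using System;
using System.IO;
using System.Xml;

public static class NamespaceFreeLoader
{
    // Hypothetical helper: loads XHTML without namespace support and returns
    // the InnerXml of the first node matching the XPath, or null if none matches.
    public static string GetInnerXml(string xhtml, string xpath)
    {
        XmlDocument doc = new XmlDocument();
        doc.XmlResolver = null; // assumption: no external resources wanted

        using (StringReader sr = new StringReader(xhtml))
        using (XmlTextReader tr = new XmlTextReader(sr))
        {
            tr.Namespaces = false;                  // treat xmlns as a plain attribute
            tr.DtdProcessing = DtdProcessing.Parse; // accept a DOCTYPE declaration
            tr.XmlResolver = null;                  // do not resolve external resources
            doc.Load(tr);
        }

        // No namespace manager and no prefix needed any more
        XmlNode node = doc.SelectSingleNode(xpath);
        return node != null ? node.InnerXml : null;
    }
}
```

A call such as NamespaceFreeLoader.GetInnerXml(xhtml, "//div[@id='broodtekst']") would then return the fragment without any xmlns attribute. One thing to keep in mind: without the DTD, entities such as &nbsp; are not defined, so documents that use them may fail to load.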

Posted on 22-09-2008 by Hans ter Wal
Tags: .NET
