xpathing

Have you ever worked with external XML documents where you don't know the whole structure or the exact set of namespaces in use? It's a pain. One of our user group members ran into exactly this situation when he had to process a complex RDF document that mixes several different XML vocabularies.

The following listing shows a modified and simplified version of this RDF document:

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <rdf:Description rdf:about="resource" xmlns:br="http://binary.resource/">
    <br:length>1024</br:length>
  </rdf:Description>
  <rdf:Description rdf:about="meta" xmlns:bm="http://binary.meta/">
    <bm:author>Manuel Pichler</bm:author>
    <bm:published>2007-07-03</bm:published>
  </rdf:Description>
  <rdf:Description rdf:about="more" xmlns:bmore="http://binary.more/">
    <bmore:more>more and more and more namespaces</bmore:more>
  </rdf:Description>
</rdf:RDF>

How would you handle such a structure, where you know all or almost all of the possible vocabularies, but each one is optional? The Document Object Model (DOM) API is designed on the assumption that you know which namespaces a document uses. You can, however, call getElementsByTagNameNS() with the asterisk wildcard for both the namespace and the local name to fetch every namespaced element, which leads to the following code for finding all namespaces in use.

<?php
$dom = new DOMDocument();
$dom->load( "xpath.xml" );

$foundNS = array();

$nodes = $dom->getElementsByTagNameNS( "*", "*" );

foreach ( $nodes as $node )
{
    // Skip elements of the RDF vocabulary itself.
    if ( $node->namespaceURI === "http://www.w3.org/1999/02/22-rdf-syntax-ns#" )
    {
        continue;
    }
    // Skip namespaces that have already been reported.
    if ( isset( $foundNS[$node->namespaceURI] ) )
    {
        continue;
    }

    $foundNS[$node->namespaceURI] = true;

    print "ns-prefix: {$node->prefix}, ns-uri: {$node->namespaceURI}\n";
}
?>

This may be a good solution for small documents, but consider really large documents with thousands of elements, each of them in a namespace. This solution pulls all of these nodes into userland code and checks each one there for a namespace that has not been seen yet, which is very expensive. So we need another solution for our problem, and this is where XPath comes into play. XPath provides a large set of functions and constructs that help solve this problem.

<?php
$dom = new DOMDocument();
$dom->load( "xpath.xml" );

$xp = new DOMXPath( $dom );
$xp->registerNamespace( "rdf", "http://www.w3.org/1999/02/22-rdf-syntax-ns#" );

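// Use the registered "rdf" prefix so the match does not depend on the
// prefix the document itself happens to use for the RDF namespace.
$nodes = $xp->evaluate(
    "//rdf:Description/*[
          namespace-uri(.) != namespace-uri(..) and
          namespace-uri(.) != '' and
          namespace-uri(.) != namespace-uri(preceding-sibling::*)]" );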

foreach ( $nodes as $node )
{
    print "ns-prefix: {$node->prefix}, ns-uri: {$node->namespaceURI}\n";
}
?>

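Here is what happens. We look for the rdf:Description elements, which are the parents of all elements in unknown namespaces. Then we select every child element that has a namespace, namespace-uri(.) != '', and whose namespace is not the rdf namespace of its parent, namespace-uri(.) != namespace-uri(..). Because several elements can share the same namespace, we also try to filter out the duplicates with namespace-uri(.) != namespace-uri(preceding-sibling::*). Note that this last test is only a partial filter: according to XPath 1.0, namespace-uri() applied to a node-set uses only the node that comes first in document order, and the comparison never looks beyond the children of a single rdf:Description, so the same namespace is reported again when it occurs under another rdf:Description.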
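If every namespace should show up exactly once for the whole document, one possible compromise is to let XPath narrow down the candidates and to remove the duplicates in PHP. The following is only a sketch under a few assumptions: the function name findForeignNamespaces() and the third rdf:Description in the inline sample are made up for illustration, and the loop still runs in userland, although only over the elements that XPath has already filtered.

```php
<?php
// Hypothetical sample: the same namespace shows up under two different
// rdf:Description elements, which a sibling-only check cannot detect.
$xml = <<<'XML'
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <rdf:Description rdf:about="resource" xmlns:br="http://binary.resource/">
    <br:length>1024</br:length>
  </rdf:Description>
  <rdf:Description rdf:about="meta" xmlns:bm="http://binary.meta/">
    <bm:author>Manuel Pichler</bm:author>
    <bm:published>2007-07-03</bm:published>
  </rdf:Description>
  <rdf:Description rdf:about="again" xmlns:bm="http://binary.meta/">
    <bm:author>Somebody Else</bm:author>
  </rdf:Description>
</rdf:RDF>
XML;

// Hypothetical helper, not part of the original listings.
// Returns an array that maps namespace URI => first prefix seen.
function findForeignNamespaces( DOMDocument $dom )
{
    $rdfUri = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";

    $xp = new DOMXPath( $dom );
    $xp->registerNamespace( "rdf", $rdfUri );

    // XPath narrows the candidates: namespaced children of rdf:Description
    // whose namespace differs from the namespace of their parent.
    $nodes = $xp->query(
        "//rdf:Description/*[ namespace-uri(.) != '' and
                              namespace-uri(.) != namespace-uri(..) ]"
    );

    // PHP removes the duplicates; the first prefix seen for a URI wins.
    $found = array();
    foreach ( $nodes as $node )
    {
        if ( !isset( $found[$node->namespaceURI] ) )
        {
            $found[$node->namespaceURI] = $node->prefix;
        }
    }
    return $found;
}

$dom = new DOMDocument();
$dom->loadXML( $xml );

foreach ( findForeignNamespaces( $dom ) as $uri => $prefix )
{
    print "ns-prefix: {$prefix}, ns-uri: {$uri}\n";
}
```

For the sample above this prints one line for http://binary.resource/ and one for http://binary.meta/, even though the bm namespace is used under two different rdf:Description elements.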

Comments and optimizations are very welcome.

Published: July 3, 2007