xpathing
Have you ever worked with external XML documents where you don't know the whole structure or exactly which namespaces are used? It's a pain. Such a situation occurred for one of our user group members, when he had to process a complex RDF document with multiple different XML markups.
The following listing shows a modified and simplified version of this RDF document:
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
<rdf:Description rdf:about="resource" xmlns:br="http://binary.resource/">
<br:length>1024</br:length>
</rdf:Description>
<rdf:Description rdf:about="meta" xmlns:bm="http://binary.meta/">
<bm:author>Manuel Pichler</bm:author>
<bm:published>2007-07-03</bm:published>
</rdf:Description>
<rdf:Description rdf:about="more" xmlns:bmore="http://binary.more/">
<bmore:more>more and more and more namespaces</bmore:more>
</rdf:Description>
</rdf:RDF>
How would you handle such a structure where you know all or almost all possible markups, but each is optional? The design of the Document Object Model (DOM) assumes that you know all namespaces used in a document. Granted, you can call getElementsByTagNameNS() with the asterisk character for both arguments to fetch every element regardless of its namespace, which leads to the following code to find all used namespaces.
<?php
$dom = new DOMDocument();
$dom->load( "xpath.xml" );

// Namespace URIs that were already reported.
$foundNS = array();

// "*" as namespace and local name matches every element.
$nodes = $dom->getElementsByTagNameNS( "*", "*" );
foreach ( $nodes as $node )
{
    // The rdf namespace itself is known, skip it.
    if ( $node->namespaceURI === "http://www.w3.org/1999/02/22-rdf-syntax-ns#" )
    {
        continue;
    }
    // Report every namespace only once.
    if ( isset( $foundNS[$node->namespaceURI] ) )
    {
        continue;
    }
    $foundNS[$node->namespaceURI] = true;
    print "ns-prefix: {$node->prefix}, ns-uri: {$node->namespaceURI}\n";
}
?>
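To try the loop without an xpath.xml file on disk, the same idea can be wrapped in a function and fed a shortened copy of the sample document as a string. This is only a sketch: the function name collectNamespaces() and the inline $xml string are assumptions made for illustration and are not part of the original code.

```php
<?php
// Hypothetical helper (not from the original listing): walks every element
// and returns a map of namespace URI => first prefix seen for it.
function collectNamespaces( DOMDocument $dom, string $skipUri ): array
{
    $found = array();
    foreach ( $dom->getElementsByTagNameNS( "*", "*" ) as $node )
    {
        $uri = (string) $node->namespaceURI;
        // Ignore elements without a namespace, the skipped namespace
        // and namespaces that were already recorded.
        if ( $uri === "" || $uri === $skipUri || isset( $found[$uri] ) )
        {
            continue;
        }
        $found[$uri] = $node->prefix;
    }
    return $found;
}

// Shortened inline copy of the sample document, so no file is needed.
$xml = <<<XML
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <rdf:Description rdf:about="resource" xmlns:br="http://binary.resource/">
    <br:length>1024</br:length>
  </rdf:Description>
  <rdf:Description rdf:about="meta" xmlns:bm="http://binary.meta/">
    <bm:author>Manuel Pichler</bm:author>
    <bm:published>2007-07-03</bm:published>
  </rdf:Description>
</rdf:RDF>
XML;

$dom = new DOMDocument();
$dom->loadXML( $xml );

$namespaces = collectNamespaces( $dom, "http://www.w3.org/1999/02/22-rdf-syntax-ns#" );
foreach ( $namespaces as $uri => $prefix )
{
    print "ns-prefix: {$prefix}, ns-uri: {$uri}\n";
}
```

Returning the map instead of printing inside the loop keeps the traversal reusable, but it does not change the cost: every element still passes through PHP code.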
Iterating over every element like this is fine for small documents, but consider really large documents with thousands of elements where each element is part of a namespace. The loop pulls all of these nodes into userland code and checks each one there for an unknown or unvisited namespace, which is very expensive. So we have to find another solution for our problem, and this is where XPath comes into play. XPath provides a large set of functions and constructs that help to solve this problem.
<?php
$dom = new DOMDocument();
$dom->load( "xpath.xml" );

$xp = new DOMXPath( $dom );
// Bind the rdf prefix used in the expression to its namespace URI.
$xp->registerNamespace( "rdf", "http://www.w3.org/1999/02/22-rdf-syntax-ns#" );

// Select children of rdf:Description that are namespaced, leave the
// namespace of their parent and differ from their preceding sibling.
$nodes = $xp->evaluate(
    "//rdf:Description/*[ namespace-uri(.) != namespace-uri(..) and
        namespace-uri(.) != '' and
        namespace-uri(.) != namespace-uri(preceding-sibling::*) ]" );

foreach ( $nodes as $node )
{
    print "ns-prefix: {$node->prefix}, ns-uri: {$node->namespaceURI}\n";
}
?>
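Here is what the expression does. We look for the rdf:Description elements, which are the parents of all elements in unknown namespaces. Then we search for all child elements that have a namespace (namespace-uri(.) != '') which is not equal to the rdf namespace of their parent (namespace-uri(.) != namespace-uri(..)). But there are multiple elements with the same namespace, so we have to deduplicate the result, which is done with namespace-uri(.) != namespace-uri(preceding-sibling::*). Note that namespace-uri() applied to a node set only looks at the first node in document order, so this check compares each element with a single sibling. That is sufficient here because every rdf:Description holds elements from just one namespace.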
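The preceding-sibling check only covers the simple case where each rdf:Description holds a single foreign namespace. One possible way to get a duplicate-free list in the general case is to let XPath do the filtering and leave the uniqueness check to a small PHP array, which then only sees the pre-filtered elements. The following is a sketch under that assumption; the function name findForeignNamespaces(), the example.org URIs and the inline sample are made up for illustration.

```php
<?php
// Hypothetical helper: XPath pre-filters the candidates, PHP deduplicates.
function findForeignNamespaces( DOMDocument $dom ): array
{
    $xp = new DOMXPath( $dom );
    $xp->registerNamespace( "rdf", "http://www.w3.org/1999/02/22-rdf-syntax-ns#" );

    // Children of rdf:Description that have a namespace which differs
    // from the namespace of their parent.
    $nodes = $xp->query(
        "//rdf:Description/*[ namespace-uri(.) != '' and
                              namespace-uri(.) != namespace-uri(..) ]"
    );

    $found = array();
    foreach ( $nodes as $node )
    {
        // Keep the first prefix seen for every namespace URI.
        if ( !isset( $found[$node->namespaceURI] ) )
        {
            $found[$node->namespaceURI] = $node->prefix;
        }
    }
    return $found;
}

// Made-up sample in which two namespaces alternate below one parent.
$xml = <<<XML
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:a="http://example.org/a" xmlns:b="http://example.org/b">
  <rdf:Description rdf:about="mixed">
    <a:one/><b:two/><a:three/><b:four/>
  </rdf:Description>
</rdf:RDF>
XML;

$dom = new DOMDocument();
$dom->loadXML( $xml );

$namespaces = findForeignNamespaces( $dom );
foreach ( $namespaces as $uri => $prefix )
{
    print "ns-prefix: {$prefix}, ns-uri: {$uri}\n";
}
```

The trade-off is that duplicates still cross into PHP, but only for elements that already passed the XPath filter, not for every node in the document.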
Comments and optimizations are very welcome.