Using PHP 5's SimpleXML
XML Namespaces
SimpleXML even makes processing RSS 1.0 feeds easy. RSS 1.0 uses XML
namespaces, which can present a bit of a headache during parsing. With XML
namespaces, each element lives under a URL, which acts as a package name. This
allows you to distinguish between, say, the HTML <title>
element and the RSS <title> element.
All of a sudden, things become more complex. You can no longer refer to
title, since an unadorned title doesn't let the processor know
which <title> you mean. You could be thinking of the RSS
item <title>, but there's also an HTML
<title> in the document.
As a result, there's now {http://www.w3.org/1999/xhtml}:title
and also {http://purl.org/rss/1.0/}:title instead.
Since URLs are long, you can map a short word to the URL. So, you frequently
end up referring to these elements as <xhtml:title> and
<rss:title>. These short names are known as namespace
prefixes. XML uses the colon (:) as a demarcation character between the
prefix and the plain tag name. In technical language, the prefixed name is
called the qualified name, or the qname for short. (Really!)
However, it's the URL that's important, so prefixes like
xhtml and rss are conventions, not actual namespaces.
(It's important to mention that the URL doesn't have to resolve to a web page; it's just an easy way for people to create non-conflicting namespaces.)
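To make the point about prefixes concrete, here is a minimal sketch. It assumes a PHP release that provides SimpleXMLElement::getNamespaces(), and the two one-line documents and the feed prefix are made up for illustration. Both elements use different prefixes yet live in the same namespace.

```php
<?php
// Two hypothetical documents that bind different prefixes to the same URL.
$a = simplexml_load_string(
    '<rss:title xmlns:rss="http://purl.org/rss/1.0/">Hello</rss:title>'
);
$b = simplexml_load_string(
    '<feed:title xmlns:feed="http://purl.org/rss/1.0/">Hello</feed:title>'
);

// getNamespaces() returns a prefix => URL map for the namespaces the element uses.
$nsA = $a->getNamespaces();
$nsB = $b->getNamespaces();

// The prefixes differ, but the namespace URLs are identical.
$samePrefix    = (key($nsA) === key($nsB));
$sameNamespace = (reset($nsA) === reset($nsB));

print 'Same prefix: ' . ($samePrefix ? 'yes' : 'no') . "\n";
print 'Same namespace: ' . ($sameNamespace ? 'yes' : 'no') . "\n";
```

A namespace-aware processor treats these two title elements as the same kind of element, because only the URL counts.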
SimpleXML likes the world to be simple, so it pretends the namespaces don't exist. (I know a whole crowd of readers feel this cure is worse than the disease. Remember, however, this is SimpleXML. If you're worried about namespace clashes, use DOM.)
Here's the same data as before, encoded as RSS 1.0 and saved as
rss-1.0.xml:
<?xml version="1.0" encoding="UTF-8"?>
<rdf:RDF
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns="http://purl.org/rss/1.0/"
>
<channel rdf:about="http://www.php.net/">
<title>PHP: Hypertext Preprocessor</title>
<link>http://www.php.net/</link>
<description>The PHP scripting language web site</description>
</channel>
<item rdf:about="http://www.php.net/downloads.php">
<title>PHP 5.0.0 Beta 3 Released</title>
<link>http://www.php.net/downloads.php</link>
<description>
PHP 5.0 Beta 3 has been released. The third beta of PHP is
also scheduled to be the last one (barring unexpected surprises).
</description>
<dc:date>2004-01-02</dc:date>
</item>
<item rdf:about="http://shiflett.org/archive/19">
<title>PHP Community Site Project Announced</title>
<link>http://shiflett.org/archive/19</link>
<description>
Members of the PHP community are seeking volunteers to help
develop the first web site that is created both by the community and for
the community.
</description>
<dc:date>2003-12-18</dc:date>
</item>
</rdf:RDF>
This XML document has three different namespaces. Looking at the top of the
file, two namespaces have explicit namespace prefix mappings. That's what
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" and the
following line do. They associate those URLs with rdf and
dc. You can see rdf:RDF, rdf:about, and
dc:date elements and attributes within the document.
RDF is "Yet Another XML Spec" (YAXMLS). I won't go into it here, but you can learn more on the W3 RDF site and in Tim Bray's article, What is RDF?, on XML.com. O'Reilly also has a book on RDF titled, Practical RDF.
There's also one namespace declaration without a prefix,
xmlns="http://purl.org/rss/1.0/". That's the default namespace,
since there's no colon after xmlns. Elements without a prefix,
like item and title, live in the default namespace.
This is different from RSS 0.91, where elements do not live in any
namespace.
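If you want to check which namespaces a document declares, the following sketch may help. It assumes a PHP release that provides SimpleXMLElement::getDocNamespaces(), and it uses a trimmed inline copy of the feed so it runs on its own. The default namespace shows up under an empty-string prefix.

```php
<?php
// Trimmed, inline copy of the root of rss-1.0.xml.
$xml = <<<XML
<rdf:RDF
 xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
 xmlns:dc="http://purl.org/dc/elements/1.1/"
 xmlns="http://purl.org/rss/1.0/">
 <channel rdf:about="http://www.php.net/">
  <title>PHP: Hypertext Preprocessor</title>
 </channel>
</rdf:RDF>
XML;

$s = simplexml_load_string($xml);

// Map of prefix => URL for the namespaces declared on the root element.
$namespaces = $s->getDocNamespaces();

foreach ($namespaces as $prefix => $url) {
    // The default namespace has no prefix, so label it for readability.
    $label = ($prefix === '') ? '(default)' : $prefix;
    print "$label => $url\n";
}
```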
To search for elements in a namespace under DOM, you need to switch to a new set of methods, where you pass in the tag and the namespace. As I said earlier, SimpleXML just barges forward with its head down. You can use the exact same syntax with RSS 1.0 as earlier:
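$s = simplexml_load_file('rss-1.0.xml');
// Plain property access works even though item lives in a namespace
foreach ($s->item as $item) {
    print $item->title . "\n";
}
PHP 5.0.0 Beta 3 Released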
PHP Community Site Project Announced
This is not a problem because, despite all the namespace vigilance, there are no name clashes in the document.
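Plain property access only reaches the unprefixed elements, though. As a rough sketch, assuming a released version of PHP where SimpleXMLElement::children() accepts a namespace URL, a prefixed element such as dc:date can be read without XPath like this. The inline feed is a trimmed copy of rss-1.0.xml, and the $dcUrl and $lines names are just for illustration.

```php
<?php
// Trimmed, inline copy of rss-1.0.xml so the sketch runs on its own.
$xml = <<<XML
<rdf:RDF
 xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
 xmlns:dc="http://purl.org/dc/elements/1.1/"
 xmlns="http://purl.org/rss/1.0/">
 <item rdf:about="http://www.php.net/downloads.php">
  <title>PHP 5.0.0 Beta 3 Released</title>
  <dc:date>2004-01-02</dc:date>
 </item>
 <item rdf:about="http://shiflett.org/archive/19">
  <title>PHP Community Site Project Announced</title>
  <dc:date>2003-12-18</dc:date>
 </item>
</rdf:RDF>
XML;

$s = simplexml_load_string($xml);
$dcUrl = 'http://purl.org/dc/elements/1.1/';

$lines = array();
foreach ($s->item as $item) {
    // Ask for the children that live in the Dublin Core namespace.
    $dc = $item->children($dcUrl);
    $lines[] = $item->title . ' (' . $dc->date . ')';
}

print implode("\n", $lines) . "\n";
```

Passing the URL rather than the prefix keeps the code working even if a feed picks a different prefix for the same namespace.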
XML Namespaces and XPath
However, SimpleXML is not completely naive. It recognizes the potential for problems with this attitude. Therefore, you can distinguish between two namespaced elements with XPath, but you need to use namespace prefixes.
SimpleXML automatically registers all the non-default namespace prefixes, but you need to handle the default namespace. (This lack of default namespace mapping is a deficit in XPath 1.0, not SimpleXML.)
To find and print all rss:title entries:
$s = simplexml_load_file('rss-1.0.xml');
// In current PHP these methods are registerXPathNamespace() and xpath()
$s->registerXPathNamespace('rss', 'http://purl.org/rss/1.0/');
$titles = $s->xpath('//rss:item/rss:title');
foreach ($titles as $title) {
print "$title\n";
}
PHP 5.0.0 Beta 3 Released
PHP Community Site Project Announced
After loading the file, manually register a namespace prefix to go with
http://purl.org/rss/1.0/. You're free to select any prefix you
want, but rss is a natural choice.
The new XPath query now looks for //rss:item/rss:title instead
of plain old //item/title, since it needs namespace prefixes.
It's a little funny that there's no way to define a default namespace prefix
for an XPath search, but that's how it is. Even though these elements don't
have explicit prefixes in the document, they need prefixes in the XPath
query.
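To see why the prefix is required, this small sketch runs the unprefixed and the prefixed query side by side. It assumes a released PHP version with SimpleXMLElement::xpath() and registerXPathNamespace(), and it uses a trimmed inline copy of the feed. The "?: array()" fallback is there because xpath() can return a non-array value on failure.

```php
<?php
// Trimmed, inline copy of rss-1.0.xml so the sketch runs on its own.
$xml = <<<XML
<rdf:RDF
 xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
 xmlns:dc="http://purl.org/dc/elements/1.1/"
 xmlns="http://purl.org/rss/1.0/">
 <item rdf:about="http://www.php.net/downloads.php">
  <title>PHP 5.0.0 Beta 3 Released</title>
 </item>
 <item rdf:about="http://shiflett.org/archive/19">
  <title>PHP Community Site Project Announced</title>
 </item>
</rdf:RDF>
XML;

$s = simplexml_load_string($xml);

// Unprefixed names in XPath 1.0 only match elements in no namespace at all.
$plain = $s->xpath('//item/title') ?: array();

// Bind a prefix of our choosing to the default namespace URL, then query again.
$s->registerXPathNamespace('rss', 'http://purl.org/rss/1.0/');
$prefixed = $s->xpath('//rss:item/rss:title') ?: array();

print count($plain) . " matches without a prefix\n";
print count($prefixed) . " matches with the rss prefix\n";
```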
You can use XPath to take advantage of the additional data in the RSS feed. For instance, to find and print all the entries from January 2004:
$s = simplexml_load_file('rss-1.0.xml');
$s->registerXPathNamespace('rss', 'http://purl.org/rss/1.0/');
$titles = $s->xpath('//rss:item[
    starts-with(dc:date, "2004-01-")]/rss:title');
foreach ($titles as $title) {
print "$title\n";
}
PHP 5.0.0 Beta 3 Released
The first two lines are the same, but I've modified the XPath query to
filter the results. In XPath, you can request a subset of elements in a level
by requiring them to match a test inside of square brackets ([]).
This test requires the dc:date element under the current
rss:item to begin with the string 2004-01-. If so,
starts-with() returns true, and XPath knows to include it in the
results. (These dates are part of the Dublin Core Metadata specification, hence
the prefix of dc.)
This prints only one title because the Community Site item was posted in December, while Beta 3 came out in January. (Actually, Beta 3 came out at the end of December, but changing the date makes the example easier to explain.)
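The same kind of predicate can select whole items instead of just their titles. The sketch below is one possible variation, assuming a released PHP version with xpath() and registerXPathNamespace(). It picks the December 2003 entries from a trimmed inline copy of the feed and prints each title with its link. Registering the dc prefix explicitly is a precaution here, not something the article requires.

```php
<?php
// Trimmed, inline copy of rss-1.0.xml so the sketch runs on its own.
$xml = <<<XML
<rdf:RDF
 xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
 xmlns:dc="http://purl.org/dc/elements/1.1/"
 xmlns="http://purl.org/rss/1.0/">
 <item rdf:about="http://www.php.net/downloads.php">
  <title>PHP 5.0.0 Beta 3 Released</title>
  <link>http://www.php.net/downloads.php</link>
  <dc:date>2004-01-02</dc:date>
 </item>
 <item rdf:about="http://shiflett.org/archive/19">
  <title>PHP Community Site Project Announced</title>
  <link>http://shiflett.org/archive/19</link>
  <dc:date>2003-12-18</dc:date>
 </item>
</rdf:RDF>
XML;

$s = simplexml_load_string($xml);
$s->registerXPathNamespace('rss', 'http://purl.org/rss/1.0/');
$s->registerXPathNamespace('dc', 'http://purl.org/dc/elements/1.1/');

// Select the item elements themselves, filtered by their dc:date child.
$items = $s->xpath('//rss:item[starts-with(dc:date, "2003-12-")]') ?: array();

$lines = array();
foreach ($items as $item) {
    // Each result is a SimpleXML element, so normal property access works.
    $lines[] = $item->title . ' <' . $item->link . '>';
}

print implode("\n", $lines) . "\n";
```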
Other Features
SimpleXML has a few more features: you can edit elements and attributes in place by assigning them a new value. Then, you can save the modified XML document to a file or store it in a PHP variable. Additionally, you can validate XML documents using XML Schema.
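As a minimal sketch of editing and saving, assuming a released PHP version where the save method is SimpleXMLElement::asXML(): the small namespace-free document, the new values, and the temporary file name below are all made up for illustration.

```php
<?php
// A small hypothetical document without namespaces.
$s = simplexml_load_string(
    '<rss version="0.91"><channel><title>PHP: Hypertext Preprocessor</title></channel></rss>'
);

// Assign new values to an element and to an attribute in place.
$s->channel->title = 'PHP News';
$s['version'] = '0.92';

// With no argument, asXML() returns the document as a string.
$out = $s->asXML();
print $out;

// With a filename, asXML() writes the document to that file instead.
$path = tempnam(sys_get_temp_dir(), 'sxml');
$saved = $s->asXML($path);
$roundTrip = simplexml_load_file($path);
unlink($path);
```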
Besides RSS, SimpleXML is also perfect for parsing configuration files and consuming web services with REST. Additionally, I'm sure that as PHP 5 evolves, SimpleXML will gain even more functionality. Keep an eye peeled for the announcements and enjoy playing with SimpleXML.
Adam Trachtenberg is the manager of technical evangelism for eBay and is the author of two O'Reilly books, "Upgrading to PHP 5" and "PHP Cookbook." In February he will be speaking at Web Services Edge 2005 on "Developing E-Commerce Applications with Web Services" and at the O'Reilly booth at LinuxWorld on "Writing eBay Web Services Applications with PHP 5."
-
Great tutorial!
2009-05-12 15:45:26 mattcass
Nicely done, and nearly 5 years ago. I was looking for some simple examples of using PHP and XML and this got me going on the right track. Very handy and thanks a lot!
-
Any tips for working with HTML and simplexml?
2005-06-16 08:16:25 kjwebguy
Hi,
Thanks for the article. Any tips for working with HTML and simplexml? Any pitfalls to look out for? Thanks.
-
xsearch -> xpath
2005-05-09 04:34:22 KvdnBerg
I'm not sure about this but I think you need to use $s->xpath instead of $s->xsearch, at least in my setup (PHP5 on Apache2 on Linux) xsearch gives an undefined method error while xpath gives the correct results.
-
Supported tagnames
2004-02-15 10:54:05 vladimir-shapiro
What happens if I have tag names with the '-' symbol? For example: <first-name> or <last-name>?
Or when I use Russian or German tag names? Is that covered by SimpleXML?
wbr, Vladimir
-
Supported tagnames
2004-03-12 02:27:31 riffraff
It seems to me that SimpleXML will just work like JavaScript hashes, where you can safely write
myHash.my_key just as long as my_key is a valid identifier.
I suppose that everyone that does not use ascii-7 alphanumerics will be forced to use something else (I hope I'm wrong).
I'd suggest to the writers to try out ruby's REXML library, cause that is a kick ass, powerful and free library. And SimpleXML behaviour can be reproduced in REXML with 5 lines of code ;)
-
SimpleXML and XSLT in PHP5
2004-01-16 10:32:47 anonymous2
I am getting ready to move a codebase over to XSLT for template rendering. Will SimpleXML make that process easier for PHP developers or should I look into something like the Apache Project's XSLT parser?
-
SimpleXML and XSLT in PHP5
2004-01-16 12:27:44 Adam Trachtenberg
There are many new XML features in PHP5, including rewritten DOM, XSLT, and XPath classes. Unfortunately, I could not cover them all here.
PHP 5's XSLT class uses libxslt, the sister library to libxml2. You pass the class DOM objects and it transforms files for you. This interface is cleaner than the PHP 4 XSLT extension, which used the Sablotron XSLT parser and didn't integrate with other PHP XML extensions.
The current issue of PHP Magazine's Digital Edition (http://www.php-mag.net/) contains an article by me that gives a few examples of how to use XSLT with PHP 5. (Unfortunately, this article is not available for free.)
So, to answer your question, if you're using PHP 4 and want to do more XSLT, then PHP 5 is definitely a step up. I hope that helps. Let me know if you have other questions.
-
Neat, but...
2004-01-16 03:54:16 anonymous2
I imagine the overhead of parsing the whole document and loading it into memory can be tremendous for sufficiently-large XML documents.
In spite of that, it's a very cool extension. Makes me look forward to the release of PHP 5 that much more.
-
Neat, but...
2004-01-16 09:54:22 Adam Trachtenberg
SimpleXML uses the same parsing functions libxml2 uses to create DOM documents. Yes, you do have the overhead of needing to contain the entire document in memory; however, this is all handled in C, so the process is fast and efficient. (As compared to doing this in PHP.)
The only alternative from a memory perspective is SAX, but SAX can be quite painful to program when your schema is complex. (Try parsing the Apple pList format, for example.)


