PHP TIPS & TRICKS - XML in PHP 4

This article is intended for the experienced PHP programmer, interested in writing applications using XML. It assumes that you are familiar with XML's syntax and advantages.

(For more about Zend Technologies, please visit www.zend.com)

Readers interested in learning more about XML before reading this article should refer to the following sites:

What is XML? - An introduction by Norman Walsh.
XML FAQ – Frequently Asked Questions about XML
Project Cool: XML resources - Tutorials and references on XML
XML.com - Portal Site for XML
XML.org - The XML site, with many more links
Annotated XML specification - Annotated version of the original W3C's XML-spec

Introduction

I will be the first to admit that I love computing standards. If every vendor adhered to industry standards, I think the web would be a better medium for everyone. Standardized data exchange formats make open, platform-independent computing models feasible. It is for this reason that I am a big fan of XML.

Fortunately for me, my favorite scripting language supports XML and is continuing to strengthen this support. PHP has enabled me to quickly bring XML documents to the web, to gather statistical information about them, and to transform them to other formats. For example, I regularly use PHP's XML processing capabilities to manage my articles and books written using XML.

In this article, I will discuss how you can process XML documents with PHP, using the built-in Expat parser. By way of an example, I will demonstrate Expat's processing methods. The example will also show you how you can:

Set up your own handler functions.
Map XML documents to your own PHP data structures.

Introducing Expat

XML parsers, known also as XML processors, enable applications to access the structure and contents of an XML document. Expat is the parser used with PHP scripts. It has also been used in other projects, such as Mozilla, Apache, and Perl.

What is an Event-Based Parser?

There are two basic types of XML parsers:

Tree-based parser: Transforms an XML document into a tree structure. This kind of parser analyzes a document in full, and provides an API to access the elements of the generated tree. A common standard is the Document Object Model (DOM).
Event-based parser: Views an XML document as a series of events. When a specific event occurs, it calls a developer-provided function to handle it.

Event-based parsers have a data-centric view of XML documents, meaning they focus on the data parts of a document, and not on its structure. These parsers process the document from top to bottom and report events (such as the start of an element, the end of an element, or the start of character data) to the application, usually through callback functions.

Consider the following "Hello-World" XML document example:

<greeting>
Hello World
</greeting>

An event-based parser would report it as a series of three events:

Start Element: greeting
Character data, value: Hello World
Close Element: greeting

Unlike a tree-based parser, an event-based parser does not create a structural representation of the document. When it reports the character data (above), an event-based parser does not give you any information about the parent element, greeting.

It does, however, provide you with lower level access. This enables faster access times and makes better use of system resources. Under this approach, then, there is no need to hold an entire document in memory; indeed, documents can even exceed your system's memory limits.

Expat is such an event-based parser. Of course, when using Expat, it is still possible to create a complete native tree structure in PHP, if necessary.
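
The events listed earlier can be observed with a few lines of PHP. The following is a minimal sketch rather than part of the sample script: the handler names trace_start(), trace_end() and trace_chars() and the $events log are made up for illustration, and the document is passed on a single line so that no extra whitespace is reported as character data.

```php
<?php
// Hypothetical event log; each handler appends one line to it.
$events = array();

function trace_start($parser, $name, $attrs)
{
    global $events;
    $events[] = "Start Element: $name";
}

function trace_end($parser, $name)
{
    global $events;
    $events[] = "Close Element: $name";
}

function trace_chars($parser, $data)
{
    global $events;
    $events[] = "Character data, value: $data";
}

$parser = xml_parser_create();
// Report element names exactly as they are written in the document.
xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0);
xml_set_element_handler($parser, "trace_start", "trace_end");
xml_set_character_data_handler($parser, "trace_chars");

// The third argument tells the parser that this is the last chunk of data.
xml_parse($parser, "<greeting>Hello World</greeting>", true);

echo implode("\n", $events), "\n";
```

Note that a parser is free to deliver character data in more than one piece, so a robust handler should append to a buffer instead of assuming one call per text block.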

The Hello-World example above consists of well-formed XML. It is not valid, meaning there is no Document Type Definition (DTD) associated with it, nor does it have an inline DTD.

For Expat, this makes no difference: Expat is a non-validating parser, and would therefore ignore any DTDs linked to a document, anyway. But, beware! The document still needs to be well-formed – otherwise Expat (like any other XML parser adhering to the XML standard) will die with an error message.

Being a non-validating parser, Expat is fast and small - well suited for web applications.

Compiling Expat

Expat can be compiled into PHP for PHP versions 3.0.6 and above. It has been part of the official Apache distribution since Apache version 1.3.9. To enable it on Unix systems, all you need to do is to configure and compile PHP --with-xml.

If you're compiling PHP as an Apache module, the Expat library will be found in the Apache distribution (by default). Using Windows, however, you will need to load the XML DLL file either at runtime or as part of the PHP configuration.

An XML Parsing Example: XMLstats

One way of understanding Expat's functionality is by example. The sample application we'll discuss uses Expat to collect statistical data about an XML document.

For every document element, the following information will be printed:

The number of times the element is used in the document.
The amount of character data within this element.
The element's parent.
The element's children.

Note: As will be demonstrated, we will have to create a native data structure in PHP to hold the element's parent and children.

Preparing The Parser

An XML parser instance is created using the function xml_parser_create(). The instance is then used for all subsequent functions. The concept is similar to using the result identifier for PHP's MySQL functions.

Prior to parsing a document, event-based parsers generally require you to register callback functions that are invoked when a certain event occurs. Expat is no exception. It defines seven possible events:

Target	XML Parser Function	Description
Elements	xml_set_element_handler()	starting and closing of elements
Character data	xml_set_character_data_handler()	beginning of character data
External entities	xml_set_external_entity_ref_handler()	occurrence of an external entity
Unparsed external entities	xml_set_unparsed_entity_decl_handler()	occurrence of an unparsed external entity
Processing instructions	xml_set_processing_instruction_handler()	occurrence of a processing instruction
Notation declarations	xml_set_notation_decl_handler()	occurrence of a notation declaration
Default	xml_set_default_handler()	all events that have no assigned handler

All callback functions must take the parser instance as a first argument (in addition to its other arguments).

Refer to the sample script appended at the end of this article. You'll notice that it uses both element and character data handlers. The element handler callback function is registered using xml_set_element_handler(). This function needs three arguments:

The parser instance
The name of the callback function that handles opening elements
The name of the callback function that handles closing elements

Callback functions must exist when you start parsing the XML document. Their definitions need to match exactly with the prototypes outlined in the PHP manual.

For example, Expat passes three arguments to the start element handler. In the sample script, the definition appears as follows:

function start_element($parser, $name, $attrs)

The first argument is the parser identifier, the second argument is the name of the starting element, and the third argument is an associative array containing all attribute names and values of the element.

Once you start parsing the XML source document, Expat will call your start_element() function every time it sees a starting element and pass the appropriate arguments to it.
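
To make the three handler arguments concrete, here is a small sketch under stated assumptions: the handler names attr_start() and attr_end(), the $seen array and the inline test document are all hypothetical and do not appear in the sample script. It records the name and the attribute array that the start-element handler receives.

```php
<?php
// Hypothetical record of every start tag and its attributes.
$seen = array();

function attr_start($parser, $name, $attrs)
{
    global $seen;
    // $attrs is an associative array: attribute name => attribute value.
    $seen[] = array("name" => $name, "attrs" => $attrs);
}

function attr_end($parser, $name)
{
    // Nothing to do for closing tags in this sketch.
}

$parser = xml_parser_create();
xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0);
xml_set_element_handler($parser, "attr_start", "attr_end");
xml_parse($parser, '<book id="42" lang="en"><title/></book>', true);

foreach ($seen as $entry) {
    echo $entry["name"], ":";
    foreach ($entry["attrs"] as $key => $value) {
        echo " $key=\"$value\"";
    }
    echo "\n";
}
```

An element without attributes, such as title here, is reported with an empty array.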

Case Folding for XML

Case folding is controlled through the xml_parser_set_option() function. The option is enabled by default, and causes element names passed to the handler functions to be converted to upper case. Because XML is case sensitive (and case therefore matters for statistics about XML documents), case folding must be disabled for our example.
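
The effect of the option is easy to see by parsing the same document twice. This is an illustrative sketch only; the handler names fold_start() and fold_end() and the $names array are assumptions made for the example.

```php
<?php
// Hypothetical list of element names as the handlers receive them.
$names = array();

function fold_start($parser, $name, $attrs)
{
    global $names;
    $names[] = $name;
}

function fold_end($parser, $name)
{
}

$xml = "<Book><title/></Book>";

// Default settings: case folding is on.
$folded = xml_parser_create();
xml_set_element_handler($folded, "fold_start", "fold_end");
xml_parse($folded, $xml, true);

// Same document with case folding switched off.
$exact = xml_parser_create();
xml_parser_set_option($exact, XML_OPTION_CASE_FOLDING, 0);
xml_set_element_handler($exact, "fold_start", "fold_end");
xml_parse($exact, $xml, true);

echo implode(", ", $names), "\n";   // BOOK, TITLE, Book, title
```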

Parsing The Document

After all of this preparation, the script can now start to actually parse the XML document as follows:

xml_parse_from_file(), a custom function, opens the file specified as argument and parses it in blocks of 4 KB.
Both xml_parse() and xml_parse_from_file() signal an error through their return value: xml_parse() returns 0 and xml_parse_from_file() returns false. This will happen, for example, when the XML document is not well-formed.
You can use xml_get_error_code() to retrieve the numeric error code of the last error. To get the textual message for this specific error code, pass the code to the xml_error_string() function.
Printing the current line number of the XML document makes debugging easier: xml_get_current_line_number().
Over the course of the parsing, the callback functions are called.
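
The error-reporting calls named above can be combined as in the following sketch. The helper describe_parse_error() is hypothetical and parses a string instead of a file to keep the example short; the exact wording of the message depends on the underlying XML library.

```php
<?php
// Hypothetical helper: returns null if $xml is well-formed,
// otherwise a one-line description of the first error.
function describe_parse_error($xml)
{
    $parser = xml_parser_create();

    // xml_parse() returns 1 on success and 0 on failure.
    if (xml_parse($parser, $xml, true)) {
        return null;
    }

    return sprintf(
        "XML error: %s at line %d",
        xml_error_string(xml_get_error_code($parser)),
        xml_get_current_line_number($parser)
    );
}

// The <b> element is never closed, so the document is not well-formed.
$result = describe_parse_error("<a>\n<b>\n</a>");
echo $result === null ? "Document is well-formed\n" : "$result\n";
```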

Representing The Document Structure

When parsing the document, one important issue must be addressed when working with Expat: How to keep a basic representation of the document structure.

As was explained earlier, event-based parsers don't make any structure information available by themselves.

However, tag structure is an important aspect of XML. The element series <book><title>, for example, will mean something very different from <figure><title>. That is, as any editor will tell you, a book's title and a figure's title will have nothing in common, in spite of the fact that both make use of the term "title". As such, to effectively process XML using an event-based parser, you'll want to make use of your own stacks or lists to maintain the document's structure information.

To create a mirror of the document's structure, the script needs to know, at the very least, the parent element for the current element. This is not possible when using the plain Expat API; it reports only events for the current element, and does not record any contextual information. Therefore, you need to set up your own stack structure.

The sample script employs a FILO (First In, Last Out) stack structure. The stack, a normal array, holds the set of start elements. In the start-element handler, the current element is pushed to the top of the stack using array_push(). Accordingly, the close-element handler function removes the top-most element using array_pop().

For a series of <book><title></title></book>, the stack would be filled like this:

Start element book: Assign "book" to the first element of the stack ($stack[0]).
Start element title: Assign "title" to the top of the stack ($stack[1]).
Close element title: Remove the top-most element of the stack ($stack[1]).
Close element book: Remove the top-most element of the stack ($stack[0]).
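
The push and pop steps above can be sketched as follows. This is an illustration under assumptions, not an excerpt from the sample script: the handler names path_start() and path_end(), the $paths log and the test document are invented for the example. Joining the stack at every start tag shows how the two title elements can be told apart by their context.

```php
<?php
$stack = array();   // names of the currently open elements
$paths = array();   // hypothetical log of the path to each element

function path_start($parser, $name, $attrs)
{
    global $stack, $paths;
    // Push the new element, then record the full path down to it.
    array_push($stack, $name);
    $paths[] = implode("/", $stack);
}

function path_end($parser, $name)
{
    global $stack;
    // The element is closed, so drop it from the top of the stack.
    array_pop($stack);
}

$parser = xml_parser_create();
xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0);
xml_set_element_handler($parser, "path_start", "path_end");
xml_parse(
    $parser,
    "<doc><book><title/></book><figure><title/></figure></doc>",
    true
);

echo implode("\n", $paths), "\n";
```

Once parsing has finished, the stack is empty again, because every start tag in a well-formed document has a matching end tag.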

PHP 3.0's implementation of our example used a $depth variable to manually keep track of the element nesting. It worked, but the script was less elegant. PHP 4.0 introduced the array_pop() and the array_push() functions, enabling a cleaner implementation of the script.

Collecting the Data

To collect information about each element, the script needs to remember every occurrence of the element. It uses a global array variable, $elements, to hold all distinct elements of the document. The array entries are instances of the element class, which has four properties (class variables):

$count - the number of times the element was found in the document
$chars - bytes of character data within this element
$parents - parent elements
$childs - child elements

As you can see, it's no problem to keep class instances within an array.

Note: A peculiar language feature of PHP is that you can traverse class structures just like you would traverse associative arrays, by using a while(list() = each()) loop. It will show you all class variables (and method names when using PHP 3.0) as strings.

When an element is found, we need to increment its associated counter to track how many times it has occurred in the document. The $count property of the corresponding $elements array item is incremented by one.

We also want to let the parent element know that the current element is its child. To do this, the current element's name is added to the parent's $childs array. Finally, the current element should remember who its parents are. Therefore, the parent is added to the current element's $parents array.

Displaying Statistics

The rest of the code loops through the array $elements and its sub-arrays to display the statistics. This is a simple series of nested loops. While it produces nice output, the code is neither particularly elegant nor full of clever tricks: it is the kind of loop you probably write every day simply to get the job done.

The example script is designed to be invoked from the command line, using the CGI binary of the PHP interpreter. Therefore, it outputs the statistics formatted as plain ASCII text. If you want to run this script over the Web, you'll have to modify the output functions a bit to produce proper HTML.

Summary

Expat is PHP's XML parser. As an event-based parser, it does not create a structural representation of the document. By providing you with lower level access, however, it does enable faster access times and makes better use of system resources.

As a non-validating parser, Expat ignores DTDs linked to the XML document, but it will die with an error message if the document is not well-formed. When using Expat, be sure to remember that you will need to:

Provide the event handlers to process the document;
Set up your own data structures like stacks and trees in order to take advantage of XML's structured information markup.

New XML applications are emerging daily and support for XML continues to be enhanced in PHP (for example, by adding support for the DOM-based XML-parser LibXML).

With PHP and Expat, you will be well prepared for a future of efficient, open, and platform-independent standards.

A Sample Script

<?php

/*************************************************************
 * $Title: XML parsing example: Collect statistics about an XML document $
 * $Description:
 *   This example uses PHP's Expat parser to collect statistical information
 *   (like number of distinct elements, children and parents of elements)
 *   about an XML document.
 *   Call it with the XML file to process as argument:
 *     ./xmlstats_PHP4.php3 test.xml
 *
 * $Requires: Expat
 *            PHP 4.0 built as CGI binary $
 *************************************************************/

// The first argument is the file to process
$file = $argv[1];

// Initialize variables
$elements = $stack = array();
$total_elements = $total_chars = 0;

// The base class for an element
class element
{
    var $count = 0;
    var $chars = 0;
    var $parents = array();
    var $childs = array();
}

// Utility function to parse an XML document from a file
function xml_parse_from_file($parser, $file)
{
    if(!file_exists($file))
    {
        die("Can't find file \"$file\".");
    }

    if(!($fp = @fopen($file, "r")))
    {
        die("Can't open file \"$file\".");
    }

    // Feed the file to the parser in blocks of 4 KB; the third
    // argument of xml_parse() marks the final block
    while($data = fread($fp, 4096))
    {
        if(!xml_parse($parser, $data, feof($fp)))
        {
            return(false);
        }
    }

    fclose($fp);

    return(true);
}

// Utility function to print a message in a box
function print_box($title, $value)
{
    printf("\n+%'-60s+\n", "");
    printf("|%20s", "$title:");
    printf("%14s", $value);
    printf("%26s|\n", "");
    printf("+%'-60s+\n", "");
}

// Utility function to print a line
function print_line($title, $value)
{
    printf("%20s", "$title:");
    printf("%15s\n", $value);
}

// Sort function for uasort(): orders elements by descending count
function my_sort($a, $b)
{
    return(is_object($a) && is_object($b) ? $b->count - $a->count : 0);
}

function start_element($parser, $name, $attrs)
{
    global $elements, $stack;

    // Does this element already exist in the global $elements array?
    if(!isset($elements[$name]))
    {
        // No - add a new instance of class element
        $element = new element;
        $elements[$name] = $element;
    }

    // Increase this element's count
    $elements[$name]->count++;

    // Is there a parent element?
    if(isset($stack[count($stack)-1]))
    {
        // Yes - set $last_element to the parent
        $last_element = $stack[count($stack)-1];

        // If there is no entry for the parent element in the current
        // element's parents array, initialize it to 0
        if(!isset($elements[$name]->parents[$last_element]))
        {
            $elements[$name]->parents[$last_element] = 0;
        }

        // Increase the count for this element's parent
        $elements[$name]->parents[$last_element]++;

        // If there is no entry for