Fall 1999

   XML and Structured Storage


Making the Case for XML

XML revolutionizes structured file storage by reducing development time and improving performance.

by Nachi Sendowski

  In the good ol' days, you had to invent your own data structure, storage, grammar, and file format when implementing structured file storage for an app. You also had to code the read and write procedures for this data and—on occasion—write your own parser and possibly an editor to administer the stored data.

What you need:
Visual Basic 5.0 or 6.0 (Professional or Enterprise Edition)
Internet Information Server (IIS) 4.0

Extensible Markup Language (XML) eliminates the need to jump through such hoops ever again. It provides a flexible and easy-to-use object model for accessing structured data programmatically. An application can easily traverse an XML document using the provided parser and object model and have sequential or random access to any well-formed structured information. The data kept in XML is text-based, readable, easy to maintain, potentially self-validating, and accessible with a variety of tools. In this article, I'll explain the basic steps for putting XML to use, walking you through the steps necessary to implement a hypothetical repository. Along the way, I'll show you a few tricks for using XML either to maximize performance in VB or compensate for missing VB functionality, using benchmarks to illustrate the relative performance difference between various approaches. At the same time, I'll try to punch through the hype, making the case for what XML can do today, as I cover its strengths and weaknesses in the context of other present-day technologies (see Table 1).

Before exploring how XML can help you, let's look at the problems you need to solve when implementing structured storage. Many applications need to keep some form of persistent values. These values can consist of application settings, runtime-resource locations, data dictionary items, saved state, or any information the app might need at run time. It's also common for this information to require some structure, such as values with associated subvalues. You probably need to save these values to a file, manage or edit them with an external program, and read them back quickly at run time. Examples of these files include the VB project file, VB form files, and Internet Explorer (IE) Channel Definition Format (CDF) files. The solutions described in this article apply to almost any structured application data that doesn't require a full relational database.

Consider the VB form file, which is saved as text using a proprietary format. This file's first part holds hierarchical information about the controls and their properties, while the rest of the file holds the VB code. The VB Integrated Development Environment (IDE) uses the VB Form editor to read and save the file at design time, and the VB compiler uses the file at run time to create the Form resources and code. In the past, you had to invent a new file format and write all the software to handle reading, writing, and interpreting the stored values for each implementation. XML gives you a standard format for storing structured information in a plain text file. You can name values stored in an XML file, keep them in a hierarchical structure, and describe them with any desired attributes.

And that's not all: Microsoft also provides a standard XML parser with IE, versions 4.0 and later. This parser can read any well-formed XML file and hand you back an object model (Document Object Model, or DOM) to use at run time. Data retrieval from the XML file is fast, and the object model is easy to work with. You can also find many XML editors that let you load, edit, and save your XML files (see Table 2).

Persist Your App Values
You have many options to choose from when persisting application values. For example, assume you want to develop a repository of application-runtime metadata. Begin by reviewing your options for persisting this metadata. The repository should meet several requirements. First, a file should hold runtime information describing the application data tables, as well as data-entry form definitions and query or search definitions. Tables, forms, and searches include fields, actions, and some relationships. The repository must also read and access the metadata quickly at run time and let you edit it easily during development. Obviously, you want a well-structured text file format that you can edit easily and keep under version control.

The available options for persisting application values fall into two categories: those you must implement yourself and those provided by available tools. Implementing your own is often a good choice because you get complete control. You can make it as simple or robust as you want, running the gamut from simple VB file input/output (I/O) to a full implementation of a proprietary file format, grammar, and parser. Writing your own grammar and parser proves the most flexible, least limiting, and by far the most rewarding option, but it's also a lot of work. A simple solution might consist of using Input # statements to read in a plain comma-delimited text file.
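As a rough illustration of the simple end of that range, the sketch below reads a comma-delimited metadata file with Input #. The file layout (table name, field name, data type, and size on each line), the procedure name LoadFieldsFromCsv, and the "TABLE.FIELD" key scheme are assumptions made for this example; they are not taken from the article's online source.

```vb
Option Explicit

' Hypothetical layout, one field per line: "CUSTOMER","NAME","Text",50
Public Sub LoadFieldsFromCsv(ByVal FilePath As String, _
                             ByVal Fields As Collection)
    Dim FileNum As Integer
    Dim TableName As String
    Dim FieldName As String
    Dim DataType As String
    Dim FieldSize As Long

    FileNum = FreeFile
    Open FilePath For Input As #FileNum
    Do While Not EOF(FileNum)
        ' Input # reads the next four comma-delimited values
        ' and strips the surrounding quotes from strings
        Input #FileNum, TableName, FieldName, DataType, FieldSize
        ' Keep the values under a "TABLE.FIELD" key for later lookup
        Fields.Add Array(TableName, FieldName, DataType, FieldSize), _
                   TableName & "." & FieldName
    Loop
    Close #FileNum
End Sub
```

Because Input # assigns values strictly by position, every line must supply all four values in the same order, which is what makes this format fragile to edit by hand.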

One option is to use the VB Property Bag object, which ships with VB6, to persist values. A property bag lets you enable a VB object to persist itself to the bag, properties and all, then save the bag to a file. However, the Property Bag object is a bit of a black box. It uses a proprietary VB format of byte arrays, which might change in some future version of VB. The storage is also somewhat flat in structure, using key-value pairs. Finally, a property bag has no design-time interface, and it's too closely coupled with VB objects. I believe you can use it if you need to deal with only one object at a time, but covering a structure of objects seems unreasonably complex. The ADO Recordset object also has some options to persist values to text. However, ADO's proprietary format and lack of support for depth or complex hierarchies also make the Recordset object an inadequate solution for a repository.

This leaves you with two options: Employ a simple VB implementation for reading comma-delimited values, or use someone else's grammar/parser by taking advantage of XML. Let's compare them. First, look at the data format and structure (see Table 3). XML's syntax is structured, easy to read, self-describing, clear, and easy to validate. A comma-delimited file is dense, unreadable, inflexible, and a bit intimidating to work with. You get the feeling the next time you edit the file something will "go out of line" that takes hours to fix. You must account for every possible value in this format, such as ensuring that commas sit in the right place even when the value is empty. A missing comma causes a file to get out of alignment quickly. In contrast, XML's grammar lets you exclude elements that have no values.

Next, compare the coding for these options (see Table 4). It's easy to use VB's input functions, but it's even easier to make a single call and let the XML parser read in the entire file for you.
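For comparison, loading the XML version might look roughly like the sketch below. It assumes a project reference to the MSXML type library installed with IE 4.0, and that the library exposes an XMLDocument class whose URL property loads the file and whose root property returns the top-level element. The function name LoadMetadata and the file path are hypothetical.

```vb
Option Explicit

' Load an XML file and hand back the parsed document.
' Assumes the IE 4.0 MSXML library (XMLDocument, URL, root).
Public Function LoadMetadata(ByVal FileUrl As String) _
        As MSXML.XMLDocument
    Dim XMLDoc As MSXML.XMLDocument
    Set XMLDoc = New MSXML.XMLDocument
    ' Setting URL asks the parser to read and parse the whole file
    XMLDoc.URL = FileUrl
    Set LoadMetadata = XMLDoc
End Function

Public Sub ShowRootTag()
    Dim XMLDoc As MSXML.XMLDocument
    ' Hypothetical location of the metadata file
    Set XMLDoc = LoadMetadata("file:///C:/Repository/metadata.xml")
    Debug.Print XMLDoc.root.tagName
End Sub
```

The caller keeps a reference to the document for as long as it uses the elements, rather than holding only the root element.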

XML Outpaces VB File I/O
Speed is another important consideration. It probably won't surprise you to learn that an XML parser reads in a file much faster than simple VB file I/O functions. For example, I used VB's Input # statement to read the comma-delimited values directly into local variables, with no string manipulation functions, special parsing, or concatenations. The XML parser easily outpaced even this subset of VB's file I/O capabilities. VB took about 200 ms to read in a 15K file, while the XML parser in IE 4.0 took 20 ms to read in a file with the same amount of data. That's an order of magnitude faster. Note that the XML file is about twice as large as the comma-delimited file for the same amount of data, because it includes all the tags.

Admittedly, this is a small case of comparing oranges to tangerines. I create small VB objects to hold the metadata values as I read the comma-delimited file into local member variables, and creating these objects takes some of the time. However, the XML parser also builds up and hands back an object model populated with the retrieved data, so it might be fair to compare the two approaches after all. Adding in the time it takes to create the same VB objects after reading in the XML file brings the total execution time to 100 ms. In other words, you can read in the file and assemble the object model with XML in about half the time it takes to read in the file using VB's native file I/O. The online source code shows you how to implement both approaches and includes timing code. It also includes sample data files.

These results convince me that XML is the best tool for implementing such a repository. However, using it to implement a solution is another matter. Next, you need to find the best way to work with it. Developers typically load an XML file and immediately process the data for display or some other use. In this case, you need to read in the file up front and access elements in the data randomly at run time, as necessary. The XML object model is simple and easy to work with, but it doesn't have good search capabilities. For example, it lacks an easy way to locate and access a particular element quickly, and you must find the element each time you want to reach its attributes.

This means you need to create your own small objects to hold property values and collections of child objects keyed by name. These objects and collections permit fast key access to elements to get their properties, but XML's DOM is better suited for sequential access than for random access. The first sequential access presents a few implementation options. When processing the XML tree of elements, one approach is to process everything there, then determine what each element child is and either ignore it or process it further. You begin at the root element, getting all child elements and checking each child. If the child's an element, process it; if it's a collection of child elements, process all children recursively. Doing this requires processing every type of element, determining what it is, and handling it somehow. This approach works for certain applications, especially those that let you display or edit XML data. Such applications process the entire file and format whatever is there for display. You display an element one way and a collection another way.
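One way to picture the keyed lookup described above is with nested VB Collection objects, as in the minimal sketch below. The article's metadata component uses its own small classes; the function names AddTable and FieldType, and the choice to store only a data type per field, are assumptions made to keep the example in a single module.

```vb
Option Explicit

' Add a table entry: a Collection that holds the table's
' fields, keyed by field name.
Public Function AddTable(ByVal Tables As Collection, _
                         ByVal TableName As String) As Collection
    Dim Fields As New Collection
    Tables.Add Fields, TableName
    Set AddTable = Fields
End Function

' Fetch a field's stored data type directly by table and field
' name, with no need to walk the XML tree again.
Public Function FieldType(ByVal Tables As Collection, _
                          ByVal TableName As String, _
                          ByVal FieldName As String) As String
    FieldType = Tables.Item(TableName).Item(FieldName)
End Function

Public Sub KeyedLookupDemo()
    Dim Tables As New Collection
    ' Hypothetical table and field names
    AddTable(Tables, "CUSTOMER").Add "Text", "NAME"
    Debug.Print FieldType(Tables, "CUSTOMER", "NAME")   ' prints Text
End Sub
```

You would fill these collections once, during the first sequential pass over the XML tree, and use the keys for every later lookup.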

Find Only What You Want
A second approach altogether is to process the XML tree by looking only for what interests you, ignoring the elements meaningless to your application. This makes more sense in the sample scenario. You know ahead of time what interests you in the metadata file: the TABLE, FIELD, SEARCH, and similar tags. More specifically, you want only FIELD tags under TABLE tags and SEARCHFIELD tags under SEARCH tags. This means you can ignore FOO tags anywhere in the file or any FIELD tags that don't occur in the right place. Note that you can also use a Document Type Definition (DTD) at design time with a validating parser to verify the file contains no errant tags.

The first approach requires accessing every element in the tree, getting its name, and using Case or a similar statement to compare the name string to the possible tags you know about. You still must validate whether a tag is in the proper place once you determine what a tag is. This means more work for you from the outset, and VB's string comparisons aren't fast. The second approach requires only that you ask for the allowed child elements; you know a tag is in the correct place when you get a reference to it. This part of the second approach is less efficient because you must always check for the existence of each possible element, whether or not it's in the file. Your metadata file might have TABLE tags with no SEARCH tags, yet this implementation checks for SEARCH tags under each table.

I wanted to know how this would affect performance, so I benchmarked both approaches, testing for situations where only a few or all of the possible elements existed in a file. The second approach proves about 20 to 30 percent faster on average. The online source associated with this article shows you how to implement the second approach in each of the LoadXML*(...) functions.

Processing the XML tree by looking for known elements presents one last challenge. When you request a child element by its tag name from another element on the XML tree, you get one type of object (MSXML.IXMLElement) when there is only one occurrence of this child. However, you get another kind of object (MSXML.IXMLElementCollection) when more than one child exists. I conducted several benchmarks to determine the best way to handle this situation. One general approach is to assign the returned object to an untyped VB Object variable, check what you receive, and make the proper calls once you know the object type. Remember you might receive nothing, one element, or a collection:

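Dim XMLElement As Object
Set XMLElement = _
   XMLTableElem.children.Item("FIELD")
If XMLElement Is Nothing Then
   ' No FIELD children; nothing to process
ElseIf TypeOf XMLElement Is _
   MSXML.IXMLElement Then
   ' Exactly one FIELD element; process it directly
ElseIf TypeOf XMLElement Is _
   MSXML.IXMLElementCollection Then
   ' Several FIELD elements; loop through them with For Each
End If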
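If you would rather hide this three-way result from the rest of your code, a small wrapper can normalize it, as in the sketch below. The function name ChildrenByTag is hypothetical, and the code assumes the IE 4.0 MSXML behavior described above: the children collection's Item method returns nothing, a single element, or a collection. It always hands back a VB Collection holding zero, one, or many elements.

```vb
Option Explicit

' Return the named children of Parent as a VB Collection,
' whether the parser hands back nothing, one element, or many.
Public Function ChildrenByTag(ByVal Parent As MSXML.IXMLElement, _
                              ByVal TagName As String) As Collection
    Dim Found As Object
    Dim Child As Object
    Dim Result As New Collection

    ' Guard in case an element without children exposes no collection
    If Not Parent.children Is Nothing Then
        Set Found = Parent.children.Item(TagName)
    End If

    If Found Is Nothing Then
        ' No matching children; return the empty collection
    ElseIf TypeOf Found Is MSXML.IXMLElementCollection Then
        For Each Child In Found
            Result.Add Child
        Next
    Else
        ' A single matching element
        Result.Add Found
    End If

    Set ChildrenByTag = Result
End Function
```

A caller can then write one For Each loop over ChildrenByTag(XMLTableElem, "FIELD") in every case, at the cost of building an extra collection on each call.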

Turn Off Error Handling Temporarily
Another approach is to turn off error handling temporarily, make the call, then check what error, if any, is raised (see Listing 1). A variant of this approach proved fastest in my tests. Start by turning off errors with this statement:

On Error Resume Next

Next, request a child element, assuming this will return a collection, and handle any errors that take place. If you encounter no errors, you have either a collection or nothing. If the collection is not Nothing, use a For Each loop to proceed with processing. You have a single element to process if you receive Error 13 (type mismatch):

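Dim XMLElementColl As _
   MSXML.IXMLElementCollection
On Error Resume Next
Set XMLElementColl = _
   XMLTableElem.children.Item("FIELD")
If Err.Number = 0 Then
   ' Collection or Nothing; if not Nothing, loop with For Each
ElseIf Err.Number = 13 Then
   ' Type mismatch: a single FIELD element came back
End If
On Error GoTo 0   ' restore normal error handling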

This example and the handful of others in this article illustrate how simple it can be to take advantage of XML. XML generates an unbelievable amount of hype, and no, it won't toast your bread in the morning. On the other hand, standards-based structured storage can save you a lot of time writing and maintaining your own methodologies while simultaneously giving you better performance. The online source includes two samples that illustrate how you might put this technology to practical use. The first, a "getting started" sample, consists of a set of statements for interacting with the XML object model. The second sample shows you how to implement a metadata component that loads an XML file containing information about an application with database tables, fields, actions, and search definitions. You can use this sample to review or test the different implementation options and performance when loading an XML file vs. a simple text file—the online source includes both. You can also use this sample as a starting point for creating your own XML structured storage files. For example, you might take a cue from many present-day industries and define your own set of tags to describe the information you want to store as XML. You could then modify the metadata component to read your specific XML files.

XML as a technology is also moving forward. The Document Object Model Level 2 recommendation from the W3C (World Wide Web Consortium) is on its way (see Links). This recommendation builds on the Document Object Model Level 1, the version discussed in this article. Level 2 adds interfaces for associating style sheets with a document, the Cascading Style Sheets object model, the Range object model, filters and iterators, as well as events. IE 5.0 already includes some of these enhancements. These features augment how you search, access, and interact with the IE 5.0 XML DOM. You probably know this is only the beginning; I, for one, can hardly wait.


Nachi Sendowski lives in the San Francisco Bay Area and is a principal partner in The Enticy Group, a consulting LLC, and the director of software engineering for Healinx Corp, an Internet startup company. He is responsible for the architecture, design, and development of software frameworks that provide the necessary tools to build scalable, multitier Web applications efficiently. Reach him at nachi@enticy.com, nachi@healinx.com, or visit his company's Web site at http://www.enticy.com/.

 
Resources
"Roll Your Own XML Generator," Ash Rofail and Yasser Shohoud, VBPJ June 1999
Links
• Microsoft XML Notepad: msdn.microsoft.com/xml/notepad/intro.asp
• XML Zone: http://www.xml-zone.com/
• The Document Object Model Level 2 recommendation from the World Wide Web Consortium (W3C): www.w3.org/xml