In the good ol'
days, you had to invent your own data structure, storage, grammar,
and file format when implementing structured file storage for an
app. You also had to code the read and write procedures for this
data and—on occasion—write your own parser and possibly an editor to
administer the stored data.
What you need: Visual Basic 5.0
or 6.0 (Professional or Enterprise Edition); Internet
Information Server (IIS) 4.0
Extensible
Markup Language (XML) eliminates the need to jump through such hoops
ever again. It provides a flexible and easy-to-use object model for
accessing structured data programmatically. An application can
easily traverse an XML document using the provided parser and object
model and have sequential or random access to any well-formed
structured information. The data kept in XML is text-based,
readable, maintained easily, potentially self-validating, and
accessible with a variety of tools. In this article, I'll explain
the basic steps for putting XML to use, walking you through the
steps necessary to implement a hypothetical repository. Along the
way, I'll show you a few tricks for using XML either to maximize
performance in VB or compensate for missing VB functionality, using
benchmarks to illustrate the relative performance difference between
various approaches. At the same time, I'll try to punch through the
hype, making the case for what XML can do today, as I cover its
strengths and weaknesses in the context of other present-day
technologies (see Table 1).
Before exploring how XML can help you, let's look at the problems
you need to solve when implementing structured storage. Many
applications need to keep some form of persistent values. These
values can consist of application settings, runtime-resource
locations, data dictionary items, saved state, or any information
the app might need at run time. It's also common for this
information to require some structure, such as values with
associated subvalues. You probably need to save these values to a
file, manage or edit them with an external program, and read them
back quickly at run time. Examples of these files include the VB
project file, VB form files, and Internet Explorer (IE) Channel
Definition Format (CDF) files. The solutions described in this
article apply to almost any structured application data that doesn't
require a full relational database.
Consider the VB form file, which is saved as text using a
proprietary format. This file's first part holds hierarchical
information about the controls and their properties, while the rest
of the file holds the VB code. The VB Integrated Development
Environment (IDE) uses the VB Form editor to read and save the file
at design time, and the VB compiler uses the file at run time to
create the Form resources and code. In the past, you had to invent a
new file format and write all the software to handle reading,
writing, and interpreting the stored values for each implementation.
XML gives you a standard format for storing structured information
in a plain text file. You can name values stored in an XML file,
keep them in a hierarchical structure, and describe them with any
desired attributes.
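To make this concrete, here is a minimal sketch that writes a small metadata file in XML. The TABLE, FIELD, SEARCH, and SEARCHFIELD tag names come from the repository scenario discussed later in this article; the METADATA root tag, the NAME and TYPE attributes, the sample values, and the WriteSampleMetadata name are assumptions made up for illustration.

```vb
' Sketch: write a small, well-formed XML metadata file with plain VB file I/O.
' The root tag, attribute names, and values are hypothetical.
Public Sub WriteSampleMetadata(ByVal FilePath As String)
    Dim FileNum As Integer

    FileNum = FreeFile
    Open FilePath For Output As #FileNum
    ' Doubled quotes ("") embed a literal quote in a VB string.
    Print #FileNum, "<METADATA>"
    Print #FileNum, "  <TABLE NAME=""Customer"">"
    Print #FileNum, "    <FIELD NAME=""CustomerID"" TYPE=""Long""/>"
    Print #FileNum, "    <FIELD NAME=""LastName"" TYPE=""Text""/>"
    Print #FileNum, "    <SEARCH NAME=""ByLastName"">"
    Print #FileNum, "      <SEARCHFIELD NAME=""LastName""/>"
    Print #FileNum, "    </SEARCH>"
    Print #FileNum, "  </TABLE>"
    Print #FileNum, "</METADATA>"
    Close #FileNum
End Sub
```

Each value is named by its tag, nested under its parent, and described by attributes, which are the three properties the paragraph above lists.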
And that's not all: Microsoft also provides a standard XML parser
with IE, versions 4.0 and later. This parser can read any
well-formed XML file and hand you back an object model (Document
Object Model, or DOM) to use at run time. Data retrieval from the
XML file is fast, and the object model is easy to work with. You can
also find many XML editors that let you load, edit, and save your
XML files (see Table 2).
Persist Your App Values
You have many options to choose
from when persisting application values. For example, assume you
want to develop a repository of application-runtime metadata. Begin
by reviewing your options for persisting this metadata. The
repository should meet several requirements. First, a file should
hold runtime information describing the application data tables, as
well as data-entry form definitions and query or search
definitions. Tables, forms, and searches include fields, actions,
and some relationships. The repository must also read and access the
metadata quickly at run time and let you edit it easily during
development. Obviously, you want a well-structured text file format
that you can edit easily and keep under version control.
The available options for persisting application values fall into
two categories: those you must implement yourself and those provided
by available tools. Implementing your own is often a good choice
because you get complete control. You can make it as simple or
robust as you want, running the gamut from a simple VB file
input/output (I/O) to a full implementation of a proprietary file
format, grammar, and parser. Writing your own grammar/parser proves
the most flexible, least limiting, and by far the most rewarding
option—it's also a lot of work. A simple solution might consist of
using Input # statements to read in a plain comma-delimited text
file.
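As a rough sketch of that simple solution, the routine below reads a comma-delimited file with Input #. The three-column layout (table name, field name, field type) and the ReadDelimitedFile name are assumptions for illustration, not the format used in the online source.

```vb
' Sketch: read a comma-delimited file with Input #.
' Assumes every line holds exactly three values: table name, field name,
' and field type (a hypothetical layout).
Public Sub ReadDelimitedFile(ByVal FilePath As String)
    Dim FileNum As Integer
    Dim TableName As String
    Dim FieldName As String
    Dim FieldType As String

    FileNum = FreeFile
    Open FilePath For Input As #FileNum
    Do While Not EOF(FileNum)
        ' Input # splits on commas and strips surrounding quotes.
        Input #FileNum, TableName, FieldName, FieldType
        Debug.Print TableName; "."; FieldName; " ("; FieldType; ")"
    Loop
    Close #FileNum
End Sub
```

The code is short, but it depends entirely on every line carrying the same number of values in the same order.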
One option is to use the VB Property Bag object, which ships with
VB6, to persist values. A property bag lets you enable a VB object
to persist itself to the bag, properties and all, then save the bag
to a file. However, the Property Bag object is a bit of a black box.
It uses a proprietary VB format of byte arrays, which might change
in some future version of VB. The storage is also somewhat flat in
structure, using key-value pairs. Finally, a property bag has no
design-time interface, and it's too closely coupled with VB objects.
I believe you can use it if you need to deal with only one object at
a time, but covering a structure of objects seems unreasonably
complex. The ADO Recordset object also has some options to persist
values to text. However, ADO's proprietary format and lack of support
for depth or complex hierarchies also make the Recordset object an
inadequate solution for a repository.
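For comparison, here is a minimal sketch of the property bag route in VB6. The property names, values, and the SaveSettingsBag routine are hypothetical; the point is only to show the flat key-value style and the opaque byte array that ends up on disk.

```vb
' Sketch: persist two values through a VB6 PropertyBag and save it to disk.
' The property names and values are hypothetical.
Public Sub SaveSettingsBag(ByVal FilePath As String)
    Dim Bag As PropertyBag
    Dim Contents() As Byte
    Dim FileNum As Integer

    Set Bag = New PropertyBag
    Bag.WriteProperty "TableName", "Customer"
    Bag.WriteProperty "FieldCount", 2

    ' Contents exposes the whole bag as an opaque byte array.
    Contents = Bag.Contents

    FileNum = FreeFile
    Open FilePath For Binary Access Write As #FileNum
    Put #FileNum, , Contents
    Close #FileNum
End Sub
```

The resulting file is binary, so you can't review or edit it in a text editor the way you can an XML file.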
This leaves you with two options: Employ a simple VB
implementation for reading comma-delimited values, or use someone
else's grammar/parser by taking advantage of XML. Let's compare
them. First, look at the data format and structure (see Table 3). XML's syntax is
structured, easy to read, self-describing, clear, and easy to
validate. A comma-delimited file is dense, unreadable, inflexible,
and a bit intimidating to work with. You get the feeling the next
time you edit the file something will "go out of line" that takes
hours to fix. You must account for every possible value in this
format, such as ensuring that commas sit in the right place even
when the value is empty. A missing comma causes a file to get out of
alignment quickly. In contrast, XML's grammar lets you exclude
elements that have no values.
Next, compare the coding for these options (see Table 4). It's easy to use VB's
input functions, but it's even easier to make a single call and let
the XML parser read in the entire file for you.
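The single call might look like the sketch below. It assumes a project reference to the Microsoft XML type library that installs with IE 4.0 and a file URL you supply; the ShowRoot name is made up for illustration.

```vb
' Sketch: load an XML file with the MSXML parser that ships with IE 4.0.
' Assumes a project reference to the Microsoft XML type library.
Public Sub ShowRoot(ByVal FileUrl As String)
    Dim XMLDoc As MSXML.XMLDocument
    Dim Root As MSXML.IXMLElement

    Set XMLDoc = New MSXML.XMLDocument

    ' Setting URL makes the parser read and parse the whole document.
    XMLDoc.URL = FileUrl

    ' root is the top-level element of the parsed tree.
    Set Root = XMLDoc.root
    Debug.Print "Root tag: "; Root.tagName

    ' Guard against an element that has no children at all.
    If Not Root.children Is Nothing Then
        Debug.Print "Child nodes: "; Root.children.length
    End If
End Sub
```

Everything after the URL assignment works against the object model in memory; no further file handling is needed.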
XML Outpaces VB File I/O
Speed is another important
consideration. It probably won't surprise you to learn that an XML
parser reads in a file much faster than simple VB file I/O
functions. For example, I used VB's Input # statement to read the
comma-delimited values directly into local variables, with no string
manipulation functions, special parsing, or concatenations. The XML
parser easily outpaced even this subset of VB's file I/O capabilities.
VB took about 200 ms to read in a 15K file, while the XML parser in
IE 4.0 took 20 ms to read in a file with the same amount of data.
That's an order of magnitude faster. Note that including all the
tags makes the file size twice as large for the same amount of data.
It's also a small case of comparing oranges to tangerines.
However, the comma-delimited test does create small VB objects to hold
the metadata values as it reads the file, and creating these objects
takes some of the time. The objects read the comma-delimited values into
local member variables and hold them. The XML parser also
builds up and hands back an object model populated with the
retrieved data, so it might be fair to compare the two approaches,
after all. Adding in the time it takes to create the same VB objects
after reading in the XML file brings the total execution time to
100 ms. In other words, you can read in the file and assemble the
object model with XML in about half the time it takes to read in the
file using VB's native file I/O. The online
source code shows you how to implement both approaches and
includes timing code. It also includes sample data files.
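If you want to take similar measurements yourself, a minimal timing sketch follows. It is not the timing code from the online source; it assumes the Win32 GetTickCount API and a hypothetical TimeXMLLoad routine.

```vb
' Sketch: time an XML load in milliseconds with the Win32 GetTickCount API.
' Place the Declare in the declarations section of a standard module.
Private Declare Function GetTickCount Lib "kernel32" () As Long

Public Sub TimeXMLLoad(ByVal FileUrl As String)
    Dim XMLDoc As MSXML.XMLDocument
    Dim StartTicks As Long
    Dim Elapsed As Long

    StartTicks = GetTickCount()
    Set XMLDoc = New MSXML.XMLDocument
    XMLDoc.URL = FileUrl          ' parse the whole file
    Elapsed = GetTickCount() - StartTicks

    Debug.Print "XML load took"; Elapsed; "ms"
End Sub
```

GetTickCount has a coarse resolution (on the order of 10 ms or more), so for files this small it is safer to repeat the load in a loop and average the result.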
These results convince me that XML is the best tool for
implementing such a repository. However, using it to implement a
solution is another matter. Next, you need to find the best way to
work with it. Developers typically load an XML file and immediately
process the data for display or some other use. In this case, you
need to read in the file up front and access elements in the data
randomly at run time, as necessary. The XML object model is simple
and easy to work with, but it doesn't have good search capabilities.
For example, it lacks an easy way to locate and access a particular
element quickly, and you must find the element each time you want to
reach its attributes.
This means you need to create your own small objects to hold
property values and collections of child objects keyed by name.
These objects and collections permit fast keyed access to elements and
their properties, which matters because XML's DOM is better suited to
sequential access than to random access. The initial sequential pass
over the tree presents a few implementation options. When processing the XML tree of
elements, one approach is to process everything there, then
determine what each element child is and either ignore it or process
it further. You begin at the root element, getting all child
elements and checking each child. If the child's an element, process
it; if it's a collection of child elements, process all children
recursively. Doing this requires processing every type of element,
determining what it is, and handling it somehow. This approach works
for certain applications, especially those that let you display or
edit XML data. Such applications process the entire file and format
whatever is there for display. You display an element one way and a
collection another way.
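A minimal sketch of this first approach appears below. It assumes the IE 4.0 MSXML object model and that element nodes report a type value of 0 (the XMLELEMTYPE_ELEMENT value as I understand the type library); the WalkElement name and ELEMENT_NODE constant are hypothetical.

```vb
' Sketch of the first approach: visit every node in the tree recursively.
' ELEMENT_NODE is the assumed value of XMLELEMTYPE_ELEMENT in the MSXML
' type library.
Private Const ELEMENT_NODE As Long = 0

Public Sub WalkElement(ByVal Elem As MSXML.IXMLElement, ByVal Depth As Long)
    Dim Children As MSXML.IXMLElementCollection
    Dim i As Long

    ' Skip text, comment, and other non-element nodes.
    If Elem.Type <> ELEMENT_NODE Then Exit Sub

    ' Indent by depth so the hierarchy is visible in the output.
    Debug.Print Space$(Depth * 2); Elem.tagName

    Set Children = Elem.children
    If Children Is Nothing Then Exit Sub

    ' The collection is indexed from zero.
    For i = 0 To Children.length - 1
        WalkElement Children.Item(i), Depth + 1
    Next i
End Sub
```

Calling WalkElement with the document's root and a depth of 0 prints the whole tag hierarchy, whatever tags the file happens to contain.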
Find Only What You Want
A second approach altogether is
to process the XML tree by looking only for what interests you,
ignoring the elements meaningless to your application. This makes
more sense in the sample scenario. You know ahead of time what
interests you in the metadata file: the TABLE, FIELD, SEARCH, and
similar tags. More specifically, you want only FIELD tags under
TABLE tags and SEARCHFIELD tags under SEARCH tags. This means you
can ignore FOO tags anywhere in the file or any FIELD tags that
don't occur in the right place. Note that you can also use a
Document Type Definition (DTD) at design time with a validating
parser to verify the file contains no errant tags.
The first approach requires accessing every element in the tree,
getting its name, and using Case or a similar statement to compare
the name string to the possible tags you know about. You still must
validate whether a tag is in the proper place once you determine
what a tag is. This means more work for you from the outset, and
VB's string comparisons aren't fast. The second approach requires
only that you ask for the allowed child elements; you know a tag is
in the correct place when you get a reference to it. One part of
the second approach is less efficient, however: you must always check
for the existence of each possible element, whether or not it's in
the file. Your metadata file might have TABLE tags with no SEARCH
tags, yet this implementation checks for SEARCH tags under each
table.
I wanted to know how this would affect performance, so I
benchmarked both approaches, testing for situations where only a few
or all of the possible elements existed in a file. The second
approach proves about 20 to 30 percent faster on average. The online
source associated with this article shows you how to implement
the second approach in each of the LoadXML*(...) functions.
Processing the XML tree by looking for known elements presents
one last challenge. When you request a child element by its tag name
from another element on the XML tree, you get one type of
object (MSXML.IXMLElement) when there is only one occurrence of
this child. However, you get another kind of object
(MSXML.IXMLElementCollection) when more than one child
exists. I conducted several benchmarks to determine the best way to
handle this situation. One general approach is to assign the
returned object to an untyped VB Object variable, check what you receive,
and make the proper calls once you know the object type. Remember
you might receive nothing, one element, or a collection:
Dim XMLElement As Object
Set XMLElement = _
    XMLTableElem.children.Item("FIELD")
If XMLElement Is Nothing Then
    ' No FIELD child exists.
ElseIf TypeOf XMLElement Is _
    MSXML.IXMLElement Then
    ' Exactly one FIELD child exists.
ElseIf TypeOf XMLElement Is _
    MSXML.IXMLElementCollection Then
    ' Several FIELD children exist.
End If
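Putting that type check to work, the sketch below shows one way to request only the tags you care about. GetNamedChildren and ListTableFields are hypothetical helpers, not the LoadXML* functions from the online source, and the NAME attribute is an assumption about the metadata file.

```vb
' Sketch: request only known tags, normalizing the three possible results
' (nothing, one element, or a collection) into a VB Collection.
Public Function GetNamedChildren(ByVal Parent As MSXML.IXMLElement, _
        ByVal TagName As String) As Collection
    Dim Result As New Collection
    Dim Found As Object
    Dim Child As Object

    If Not Parent.children Is Nothing Then
        Set Found = Parent.children.Item(TagName)
        If Found Is Nothing Then
            ' No matching child: return the empty collection.
        ElseIf TypeOf Found Is MSXML.IXMLElementCollection Then
            For Each Child In Found
                Result.Add Child
            Next Child
        Else
            ' A single matching element came back.
            Result.Add Found
        End If
    End If
    Set GetNamedChildren = Result
End Function

Public Sub ListTableFields(ByVal Root As MSXML.IXMLElement)
    Dim TableElem As Object
    Dim FieldElem As Object

    ' Only FIELD tags directly under TABLE tags are ever visited.
    For Each TableElem In GetNamedChildren(Root, "TABLE")
        Debug.Print "Table: "; TableElem.getAttribute("NAME")
        For Each FieldElem In GetNamedChildren(TableElem, "FIELD")
            Debug.Print "  Field: "; FieldElem.getAttribute("NAME")
        Next FieldElem
    Next TableElem
End Sub
```

Because the code asks only for TABLE and FIELD, an errant FOO tag or a FIELD tag in the wrong place is never touched.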
Turn Off Error Handling Temporarily
Another approach is
to turn off error handling temporarily, make the call, then check
what error, if any, is raised (see Listing 1). A variant of this
approach proved fastest in my tests. Start by turning off errors
with this statement:
On Error Resume Next
Next, request a child element, assuming this will return a
collection, and handle any errors that occur. If you encounter
no errors, you have either a collection or nothing. If the
collection is not Nothing, use a For Each loop to proceed with
processing. You have a single element to process if you receive
Error 13 (type mismatch):
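Dim XMLElementColl As _
    MSXML.IXMLElementCollection
On Error Resume Next
Set XMLElementColl = _
    XMLTableElem.children.Item("FIELD")
If Err.Number = 0 Then
    ' A collection or Nothing came back.
ElseIf Err.Number = 13 Then
    ' Type mismatch: a single element came back.
End If
' Restore normal error handling.
On Error GoTo 0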
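Filled out, the pattern might look like the sketch below. This is not Listing 1; the ProcessFields name and the NAME attribute are assumptions for illustration. Note that the error number is copied to a local variable first, because any On Error statement clears the Err object.

```vb
' Sketch: assume a collection, and fall back to a single element on Error 13.
Public Sub ProcessFields(ByVal XMLTableElem As MSXML.IXMLElement)
    Dim XMLElementColl As MSXML.IXMLElementCollection
    Dim XMLElement As MSXML.IXMLElement
    Dim FieldElem As Object
    Dim ErrNum As Long

    On Error Resume Next
    Set XMLElementColl = XMLTableElem.children.Item("FIELD")
    ErrNum = Err.Number     ' capture before On Error GoTo 0 clears Err
    On Error GoTo 0         ' restore normal error handling

    Select Case ErrNum
    Case 0
        ' Either a collection or Nothing came back.
        If Not XMLElementColl Is Nothing Then
            For Each FieldElem In XMLElementColl
                Debug.Print FieldElem.getAttribute("NAME")
            Next FieldElem
        End If
    Case 13
        ' Type mismatch: exactly one FIELD element exists.
        Set XMLElement = XMLTableElem.children.Item("FIELD")
        Debug.Print XMLElement.getAttribute("NAME")
    Case Else
        ' Anything else is unexpected, so pass it to the caller.
        Err.Raise ErrNum
    End Select
End Sub
```

Keeping the On Error Resume Next window to a single statement limits the risk of silently swallowing unrelated errors.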
This example and the handful of others in this article illustrate
how simple it can be to take advantage of XML. XML generates an
unbelievable amount of hype, and no, it won't toast your bread in
the morning. On the other hand, standards-based structured storage
can save you a lot of time writing and maintaining your own
methodologies while simultaneously giving you better performance.
The online
source includes two samples that illustrate how you might put
this technology to practical use. The first, a "getting started"
sample, consists of a set of statements for interacting with the XML
object model. The second sample shows you how to implement a
metadata component that loads an XML file containing information
about an application with database tables, fields, actions, and
search definitions. You can use this sample to review or test the
different implementation options and performance when loading an XML
file vs. a simple text file—the online source includes both. You can
also use this sample as a starting point for creating your own XML
structured storage files. For example, you might take a cue from
many present-day industries and define your own set of tags to
describe the information you want to store as XML. You could then
modify the metadata component to read your specific XML files.
XML as a technology is also moving forward. The Document Object
Model Level 2 recommendation from the W3C (World Wide Web
Consortium) is on its way (see Links). This recommendation builds on
the Document Object Model Level 1, the version discussed in this
article. Level 2 adds interfaces for associating style sheets with a
document, the Cascading Style Sheets object model, the Range object
model, filters and iterators, as well as events. IE 5.0 already
includes some of these enhancements. These features augment how you
search, access, and interact with the IE 5.0 XML DOM. You probably
know this is only the beginning; I, for one, can hardly wait.
Nachi Sendowski lives in the San
Francisco Bay Area and is a principal partner in The Enticy Group, a
consulting LLC, and the director of software engineering for Healinx
Corp, an Internet startup company. He is responsible for the
architecture, design, and development of software frameworks that
provide the necessary tools to build scalable, multitier Web
applications efficiently. Reach him at nachi@enticy.com, nachi@healinx.com, or visit his
company's Web site at http://www.enticy.com/.