Wednesday, March 19, 2008

So you'd like to parse GML...

I am a big fan of standards.

However, I'd say that the Open Geospatial Consortium has a serious illness when it comes to defining standards for the industry: W3C-itis. The scope of the GML standard started out huge and became gigantic; the result is plagued with ambiguity and multiple ways of doing things; and is only half-correct on technical issues. (The worst example that I can think of is their botching of spatial references; e.g. axis order.) The OGC has W3C-itis far worse than the W3C themselves.

In short (as far as I can tell by looking at the result and not being involved in the process): the people writing the GML standards and those informing its development were the producers of data, and they took it up as their creed to ensure the standard could represent any and all data they could imagine and do so in a way that deviates as little as possible from their current practices. This has the effect of giving the shaft to any consumer of data, who needs to implement the entire standard in order to be able to meet the ideal of understanding all data produced under it. Which is the entire point of having a data format standard in the first place -- to decouple producers and consumers.

None of this is news and has been complained about loudly elsewhere. And there are now lots of bandaids emerging, e.g. the OGC's "profiles."

What I do have to contribute that's new is the following visualization of the GML schema for a single polygon. This is a railroad diagram, pretty straightforwardly interpreted by following the tracks. (Note that the tool I used sometimes puts sections right-to-left, so follow the arrows.) You'll need to click on it and zoom to 100% for it to be even remotely comprehensible.


A few simple observations and notes:
  • Everything is optional. You can have a completely empty polygon with no interior or exterior as far as GML is concerned.
  • Yes, there are five different ways to represent the points that make up a LineString for the boundary of a polygon's interior. Oh, and five different ways to represent a polygon boundary itself: as a LinearRing, or as a Ring with Curve, a CompositeCurve, an OrientableCurve or just a plain ol' LineString curveMembers. Nice.
  • The solid blue boxes are nonterminals ('complex types' in the XML Schema lingo): there are many more tags and choices hidden under them. I couldn't fully expand them without making a ridiculously large image.
  • I left off all XML attributes, in particular the grab-bag of attributes that sits on most every element. (srsName, arcRole, ref, etc)
  • The tools I used to create the image aren't perfect, so don't use this as a reference to GML. Instead think of it as a nice demonstration of the complexity you face.
I produced the image using ebnf2ps, which is awesome; although this is a bit of an abuse of its powers. I used an unfinished homebrew tool, which at some point may be released as open source, to process the schema's XSD files into EBNF. (Its actual purpose is to produce RELAX-NG schemas, which are very close cousins to EBNF.)