Wednesday, May 7, 2008

On sharing geospatial data

Blogger James Fee put up a post this monday that's generated a lot of interesting discussion.

Whenever geospatial data sharing comes up there's a pretty strong tendency to gloss over the fact that "sharing data" is actually a two-parter:
  • There is the sort of sharing you do with co-workers or associates: getting your data with them in, to borrow a phrase from the GPL, the "preferred form of the work for making modifications to it."
  • Then there is publishing GIS data (internally or externally) once it's "finalized."
As everyone (or at least everyone commenting on James' article) recognizes, the current state of the art sucks. We are lost in a maze of twisty file formats, standards, and protocols, none alike; even the simple task of emailing a shapefile to a co-worker is often the cause of bitter enmity between GIS and IT technicians.

Status quo for simple sharing: Zip your data up, making sure not to leave off any of the many files it is spread across, put a password on the zip file (the attachment will be stripped if it contains a personal geodatabase mdb visible to the spam filter), change the extension to something not '.zip', attach and pray. Gosh forbid you are sending anything larger than a few megabytes, or working with whole map documents at once. (Did you remember to include every related layer in that zip file with relative paths in the mxd?)

As for map data publishing, there is an embarrassment of options, none of them really mature (or rather, dominant) yet. Most require some programming in order to set up a publisher side, and on the consumer side you'll need to jump through some awkward conversion hoops as soon as you want to do something more involved than look at a pretty picture in IE or Google Earth. If you are like me, and you prefer or need to get your hands dirty, you breath a deep sigh of relief when there are plain old shapefiles or GeoTiffs available for download over HTTP.

Then there's discovering published data in the first place.... with a few exceptions, if you don't know in advance that it exists and have a good idea of where to look for it, it might as well not exist.


I like the idea of a SQLite data format, at least as a "preferred working format" for vector data. As a full relational database in a file, it supports the rich data model that was shoehorned into GML, but in a format designed especially for working with relational data and a full set of mature tools and libraries. And it doesn't have the drawbacks of being flaky and proprietary the way personal geodatabases are. What would be sweeten the pot for its use as a working data format, besides vendor support in the tools everyone uses: basic versioning or a 'track changes' equivalent. Way, way better would be supporting distributed branches and merging the way us programmers get to treat source code!

(Consider if the folks working on the WDPA had the GIS equivalent of git or darcs to do their jobs: branches for published versions where error corrections can still be made; the ability to accept or reject edits via email from anyone; local experts with their own working copies; larger datasets that contain other information on protected areas from other sources, but which can still automatically merge in the latest WDPA changes...)

That is the direction I think things will eventually move, but it might take a very long time. Don't hold your breath.

Footnote: I've put a reasonable amount of thought into these topics; without saying too much, they are extremely relevant to the product I am building. (ETA: inside of a couple of months!) (It's not decentralized version control for GIS, either: alack and alas.)

1 comment:

Ed Corkery said...

Why not let the users decide what file format they want to use, in all cases?

That's the approach we're taking with Koordinates (http://koordinates.com).

Because mapping crosses many professional boundaries where the end-uses of the data can be divergent, map data can (and will continue to be) stored in a variety of file formats. So the active sharers and more passive users really need better translation tools. Both sides of the sharing experience can then create/use the data in their professional software of choice.

By the way, I fully agree with the division of geodata sharing into two broad categories of publication (one to many) and collaboration (some to some, in a permissions-controlled environment). However, both are tightly intertwined, and to a degree the present method of "publishing" geodata as distinct products at specific points in time is driven by the lack of any other way of doing things (ie - the lack of iterative and quick collaboration tools for professional mappers and their customers).