Saturday, May 17, 2008

Weekend listen: The Giant Pool of Money

It's an hour long, off topic, and I'm behind on work. But it's worth a listen if you're a non-finance-whiz, and interested in understanding the housing crisis:

http://www.thisamericanlife.org/Radio_Episode.aspx?episode=355

("Free download" link is on the left side for a full mp3.)

For a geospatial take, the NY Fed has been putting together some very nice mortgage maps with quite recent data.

Wednesday, May 7, 2008

On sharing geospatial data

Blogger James Fee put up a post this monday that's generated a lot of interesting discussion.

Whenever geospatial data sharing comes up there's a pretty strong tendency to gloss over the fact that "sharing data" is actually a two-parter:
  • There is the sort of sharing you do with co-workers or associates: getting your data with them in, to borrow a phrase from the GPL, the "preferred form of the work for making modifications to it."
  • Then there is publishing GIS data (internally or externally) once it's "finalized."
As everyone (or at least everyone commenting on James' article) recognizes, the current state of the art sucks. We are lost in a maze of twisty file formats, standards, and protocols, none alike; even the simple task of emailing a shapefile to a co-worker is often the cause of bitter enmity between GIS and IT technicians.

Status quo for simple sharing: Zip your data up, making sure not to leave off any of the many files it is spread across, put a password on the zip file (the attachment will be stripped if it contains a personal geodatabase mdb visible to the spam filter), change the extension to something not '.zip', attach and pray. Gosh forbid you are sending anything larger than a few megabytes, or working with whole map documents at once. (Did you remember to include every related layer in that zip file with relative paths in the mxd?)

As for map data publishing, there is an embarrassment of options, none of them really mature (or rather, dominant) yet. Most require some programming in order to set up a publisher side, and on the consumer side you'll need to jump through some awkward conversion hoops as soon as you want to do something more involved than look at a pretty picture in IE or Google Earth. If you are like me, and you prefer or need to get your hands dirty, you breath a deep sigh of relief when there are plain old shapefiles or GeoTiffs available for download over HTTP.

Then there's discovering published data in the first place.... with a few exceptions, if you don't know in advance that it exists and have a good idea of where to look for it, it might as well not exist.


I like the idea of a SQLite data format, at least as a "preferred working format" for vector data. As a full relational database in a file, it supports the rich data model that was shoehorned into GML, but in a format designed especially for working with relational data and a full set of mature tools and libraries. And it doesn't have the drawbacks of being flaky and proprietary the way personal geodatabases are. What would be sweeten the pot for its use as a working data format, besides vendor support in the tools everyone uses: basic versioning or a 'track changes' equivalent. Way, way better would be supporting distributed branches and merging the way us programmers get to treat source code!

(Consider if the folks working on the WDPA had the GIS equivalent of git or darcs to do their jobs: branches for published versions where error corrections can still be made; the ability to accept or reject edits via email from anyone; local experts with their own working copies; larger datasets that contain other information on protected areas from other sources, but which can still automatically merge in the latest WDPA changes...)

That is the direction I think things will eventually move, but it might take a very long time. Don't hold your breath.

Footnote: I've put a reasonable amount of thought into these topics; without saying too much, they are extremely relevant to the product I am building. (ETA: inside of a couple of months!) (It's not decentralized version control for GIS, either: alack and alas.)

Tuesday, April 29, 2008

Where do people find the time?

I recently stumbled across this excellent, optimistic, little talk by Clay Shirky about user participation from the Web 2.0 conference. (Transcript.)

Wednesday, March 19, 2008

So you'd like to parse GML...

I am a big fan of standards.

However, I'd say that the Open Geospatial Consortium has a serious illness when it comes to defining standards for the industry: W3C-itis. The scope of the GML standard started out huge and became gigantic; the result is plagued with ambiguity and multiple ways of doing things; and is only half-correct on technical issues. (The worst example that I can think of is their botching of spatial references; e.g. axis order.) The OGC has W3C-itis far worse than the W3C themselves.

In short (as far as I can tell by looking at the result and not being involved in the process): the people writing the GML standards and those informing its development were the producers of data, and they took it up as their creed to ensure the standard could represent any and all data they could imagine and do so in a way that deviates as little as possible from their current practices. This has the effect of giving the shaft to any consumer of data, who needs to implement the entire standard in order to be able to meet the ideal of understanding all data produced under it. Which is the entire point of having a data format standard in the first place -- to decouple producers and consumers.

None of this is news and has been complained about loudly elsewhere. And there are now lots of bandaids emerging, e.g. the OGC's "profiles."

What I do have to contribute that's new is the following visualization of the GML schema for a single polygon. This is a railroad diagram, pretty straightforwardly interpreted by following the tracks. (Note that the tool I used sometimes puts sections right-to-left, so follow the arrows.) You'll need to click on it and zoom to 100% for it to be even remotely comprehensible.


A few simple observations and notes:
  • Everything is optional. You can have a completely empty polygon with no interior or exterior as far as GML is concerned.
  • Yes, there are five different ways to represent the points that make up a LineString for the boundary of a polygon's interior. Oh, and five different ways to represent a polygon boundary itself: as a LinearRing, or as a Ring with Curve, a CompositeCurve, an OrientableCurve or just a plain ol' LineString curveMembers. Nice.
  • The solid blue boxes are nonterminals ('complex types' in the XML Schema lingo): there are many more tags and choices hidden under them. I couldn't fully expand them without making a ridiculously large image.
  • I left off all XML attributes, in particular the grab-bag of attributes that sits on most every element. (srsName, arcRole, ref, etc)
  • The tools I used to create the image aren't perfect, so don't use this as a reference to GML. Instead think of it as a nice demonstration of the complexity you face.
I produced the image using ebnf2ps, which is awesome; although this is a bit of an abuse of its powers. I used an unfinished homebrew tool, which at some point may be released as open source, to process the schema's XSD files into EBNF. (Its actual purpose is to produce RELAX-NG schemas, which are very close cousins to EBNF.)

Thursday, February 7, 2008

Programming functionally: A proto-XPath implementation in Javascript

This has been sitting on my hard drive for a while. I thought I'd clean it up and post it! Well ... at least I've posted it.


If you don't know what "functional programming" is, wikipedia has a fairly technical overview. Loosely, "object" is to "object-oriented programming" as "function" is to "functional programming".

I wrote this to give a taste of the kind of power functional programming can bring to bear on a problem, as well as to document the basics of my psuedo-xpath implementation for myself. Some love to hate Javascript, but I think this comes more from the hostile browser environment than the language, which is quite nice. Note that, while I'm heavily using Javascript's functional features, I'm not making any effort to be 'purely functional'. (If I were using a purist style, I would not be using variable assignments and avoiding for/while loops, instead using recursion and if statements to the same effect.)

JS makes it quite easy to pass functions to as arguments, one of the core distinctions of programming in a functional style:
function doSomethingWithPi(arg) {
arg(3.1415);
}
function eat(x) {
alert(x + " sure was yummy!");
}
doSomethingWithPi(eat); // ------> "3.1415 sure was yummy!"

Simple enough. Now consider:
function siblingAxis(f,node) {
var v = node.nextSibling;
while (v != null) {
f(v);
v = v.nextSibling;
}
}

This function, when given a function & a DOM node, calls that function for every sibling of that node (until out of siblings). If you're familiar with the XPath spec, you already know why the function is called "siblingAxis". Here's another, very similar function -- it calls the provided function once for each child of the given node:
function childAxis(f,node) {
var v = node.firstChild;
if (v != null) {
f(v); // call the function for the first child node
siblingAxis(f,v); // do the same for all its siblings...
}
}

And here's one that takes two functions; the first acts as a test for the given DOM node and the second is only called if the test succeeds with the given node.
function test(test_f, f, node) {
if (test_f(node)) {
f(node);
}
}

Now, if we could wire our test function up to our childAxis function, we would have the beginnings of something fairly powerful. The trick is to curry the function test. It's easier to show what this means rather than try to explain it.
function test(test_f, f) {
return function(node) {
if (test_f(node)) {
f(node);
}
}
}

Now, test(test_f,action_f) does not perform its test or action right away. Instead it returns a new function, which expects a node and in turn runs the test against the node, and the other function if it's successful. Javascript implicitly keeps test_f and action_f arguments intact and available in the returned function. This is called a closure in FP jargon. By currying test, we can wire it up with e.g., childAxis:
function isLink(node) {
return (node.nodeType == 1 && node.localName.toLowerCase() == "a");
}

function doSomethingWithAllLinkChildren(f, node) {
childAxis(test(isLink,f), node); // easy peasy
}

It only makes sense to do the exact same thing with the axis functions themselves. This means that each axis function takes a "do something with nodes on this axis" function, and returns a function that, from a starting node, actually traverses the given axis and does the something.
function siblingAxis(f) {
return function(node) {
var v = node.nextSibling;
while (v != null) {
f(v);
v = v.nextSibling;
}
}
}

function childAxis(f) {
return function(node) {
var v = node.firstChild;
if (v != null) {
f(v); // call the function for the first child node
siblingAxis(f)(v); // siblingAxis is curried: siblingAxis(f,v) becomes siblingAxis(f)(v)
}
}
}

What have we gained from this? Well, here's a function findem that alerts with the children of each span found within a div child of the given node. In xpath terms, we're after "/div/span/*" (with the starting node treated as the root):
function nameTest(nm) {
return function(node) {
return (node.nodeType == 1 && node.localName.toLowerCase() == nm.toLowerCase());
}
}

function nodealert(node) {
alert("found " + node.localName);
}

function findem(node) {
var path = childAxis(test(nameTest("div"),(childAxis(test(nameTest("span"),childAxis(nodealert)))));
path(node);
}

Let's make another small change to test, currying it again:
function test(test_f) {
return function(f) {
return function(node) {
if (test_f(node)) {
f(node);
}
}
}
}

Now, we can use a handy bit of non-javascript notation and say that f ⋅ g is shorthand for function(x) { f(g(x)) }, and rewrite findem in this style:
function findem(node) {
var path = childAxis⋅test(nameTest("div"))⋅childAxis⋅test(nameTest("span"))⋅childAxis⋅nodealert;
path(node);
}

Observe the wonderful correspondence between our functions and the various components of the path they represent ("/div/span/*"):
/     ---> childAxis
div ---> test(nameTest("div"))
/ ---> childAxis
span ---> test(nameTest("span"))
/ ---> childAxis
* ---> (n/a)


At this point, having XPath (or something like it) up and running in javascript is a very straightforward job.

  • Each axis (child, sibling, descendant, etc) can be represented by : function someAxis(f) { return function(node) { /* traverse the axis from the node, calling 'f' along the way */ } }

  • Each test (both "node tests" and other predicates) can represented as a function that takes a node and returns a boolean, then gets wrapped by test. We'd need to write helper functions such as nameTest above to construct the actual test functions.

  • The input path is parsed and translated into a list of axis and test functions. Note that all of them have the form function(f) { return function(node) { /*...*/ } }

  • Combine the list of functions into the form a(b(c(d(ff))) where ff is the 'action' function that you want to be called for any nodes returned by the path. This has the effect of calling the outer function(f) { /*...*/} for each node step with f as the next node step, and results in a function awaiting a node...

  • Call it with the root node; your ff will be called for every succeeding node!


Click here to see the complete example of the proto-XPath implementation that I've been using along the way. It adds parent,self, and descendant axes; it also insert an implicit test at each step of evaluation to ensure that each node is visited only once. (This prevents repeat visits during paths like "//div//div"). The parsing code is hokey but works.


Notes:

  • I'm using the term "currying" a bit loosely to refer both to the transformation of a function f(a,b,c) { /* ... */ } into f(a) { return f(b) { return f(c) { /*...*/ } } }, and to calling a function written in this style with partial arguments to return a new function, as in f(a_argument)(b_argument).


  • The code I've written is not purely functional, nor even close to it: However, it makes good use of some functional programming features, and is a nice functional/imperative compromise for javascript.


  • There is still quite a bit of work to get a full implementation of XPath up and running, as XPath is a large language. However, I believe the remaing work -- if you were interested in a full implementation of XPath -- consists mostly of straightforward extension of the sketch laid out here.

Saturday, January 5, 2008

Starting up

Here I am, embarking on the development of my new software business! Verdict so far: busy, but good busy.

I've stepped out on my own to develop software that works with and within the growing ecosystem of geographic information systems. This is a very exciting time in the geospatial world, with lots going on -- an opportune time for the birth of a new company. I'm eager to share what I can along the journey.

This blog will be a mixed announcement platform, dumping-ground, soap-box, and entry point into the many conversations happening out there on the web. Here's what to expect, especially while I'm focusing most intently on brewing up new software:
  • Code: I've been the beneficiary of open source for a very, very, long time now. Where it makes sense to do so, I plan on giving back.
  • Articles: I enjoy (over-) explaining things, particularly programming-related. Perhaps someone might find it useful; almost everything I've learned about coding in the last five years has been from the web.
  • "Blog" material: Pointers to useful resources, off-the-top-of-my-head pontification, product reviews, links to offbeat YouTube videos...
Here's to the interesting times ahead!