Saturday, October 11, 2008

Inside Boundary Generator: Computational geometry for the GIS programmer

I'm going to take a post or two to dive into the internals of Boundary Generator, my GPL geoprocessing tool that had its "alpha" release last week. The nice thing about open source is that I can point directly at the code as I go along. Up this dreary Saturday afternoon: The underlying computational geometry (some code but mostly math!) behind finding shared boundaries. In my experience, it's fairly rare to get your hands dirty with this when doing GIS development, so hopefully others will find it as interesting as I do. (This is also fairly elementary as far as computational geometry goes; this blog post has nothing that wouldn't be taught in the first few sessions of an undergraduate computer graphics course, or even an expected prerequisite.)

The problem that we need to solve is that of finding overlapping line segments. In addition, our solution should have well-defined tolerances and not be subject to (too much) numerical instability.

Representing geometry

When sitting down to code, the first problem is that of representation: How do we store a line segment? Most peoples' instinctive representation is as a pair of endpoints -- (x0,y0),(x1,y1). But a bit of thought shows that there are many other options. For example, we could store line segments in a more algebraic form, such as the values m, b, x0 and x1 from the following description of a line segment:
  • y = mx + b where x0 <= x <= x1
Although this isn't as natural as simply working with endpoints, it has some big advantages over simply storing endpoints for finding overlaps:
  • If two line segments have the same slope (m) and y-intercept (b), we can see if they overlap just by checking their x0..x1 ranges for an overlap.
  • We can also test for line crossings by solving a system of two equations (y = (m0)(x) + b0 and y = (m1)(x) + b1) for x and y. Boundary Generator doesn't need this, but other geoprocessing functions in the future will.
However, this approach is plagued with numerical instability and special cases -- what do you do with lines that are vertical or near-vertical? -- that will quickly turn code into spaghetti as you add special cases to address these problems. It is also difficult to specify overlap tolerances in a meaningful way.

Instead, boundary generator uses a more abstract (and much better) representation that dives a little further (but not too far) past basic algebra: It is a parametric representation that consists of an origin point (o), a vector (v), and a scalar parameter (t). Just as an infinite line may be defined by a slope and y-intercept value, it may instead be defined as an origin and a vector; and just as a line segment can be defined upon an infinite line as a range of x values, a line segment can be defined as a range of t values. The line is all points p which satisfy p = vt + o; the line segment is the portion of the line where 0 <= t <= 1. If the preceeding made sense you may skip the next section or two. :)

Vectors (Go to the code; go to a more in-depth Wikipedia article.)

What is a vector? It is this: a magnitude (length), and a direction. It is typically and most conveniently represented as an (x,y) pair; this defines both direction and length. (The direction is the direction from the origin (0,0) through the point (x,y) and the magnitude is the length of a line segment from the origin to (x,y) ).

Vectors can be added to each other; if we have two vectors a and b, a + b is the vector (ax + bx, ay + by). (They can also be subtracted.) You can also scale a vector by an amount s, changing its length without changing its direction, simply by multiplying both x and y by the scale amount: (sx, sy).

We can also (obviously, since a vector is just an x,y pair) choose to represent a point via a vector simply by interpreting it as the magnitude and direction from the origin.

Line segments (Go to the code.)

Our parametric representation of an infinite line (repeating from before) is: a starting point o, a direction/magnitude v, and a parameter t: vt + o. By varying t, we get all points on the line. When t is 0 it gives us the starting point. We can also test to see if a point p is on the line by trying to solve p = vt +o for t.

And to turn our infinite line into a line segment: simply restrict t to lie between zero and one. Since v already includes a length as well as a direction, the line segment is nicely defined -- the start point is o; the end point is o + v (by setting t to 1.)

This also is a very, very good representation for finding overlapping segments as well as crossing segments. But first, I need to introduce...

The vector dot product (Go to the code; go to the Wikipedia article)

This is the swiss-army-knife of vector operations. It takes in two vectors, and returns (the length of the first vector) * (the length of the second vector) * (the cosine of the angle between them). It is also very easy to calculate. This has a ton of uses, to name a few:
  • The dot product of a vector with itself is equal to the square of its length.
  • The angle between two vectors a and b, both of magnitude 1, is acos(dot(a,b)).
  • If two vectors are perpendicular, their dot product will be zero.
The code uses the dot product in all these ways. For testing whether line segments overlap, we need to ensure they are parallel within a specific tolerance. This can be done by normalizing the vectors (multiplying each by 1/length to preserve its direction while making its length 1), then taking the dot product and inverse cosine; the two are parallel if the angle is within the tolerance to zero or 180 degrees.

The actual boundary generator implementation of AreParallel() uses a bit of algebra during parallel testing to eliminate all the slower and numerically unstable bits. (E.g., replacing divisions w/ multiplications; taking the cos() of the tolerance up front rather than acos() of the dot product).

The distance from a point to a line (Go to the code)

The dot product is also used to help determine the distance from a point to a line. It requires a bit of algebraic reasoning (I put a nicer version of this derivation on, but I don't know how long the link will last):
  • The line l is defined by vectors o and v as above; the point is the p.
  • If we draw a line from p to the closest point of l, it will be perpendicular to l. Call this closest point q.
  • Since q is on l, there is a t value such that q = tv + o.
  • Since the line from p to q is perpendicular to l, dot(v, p-q) = 0.
  • Need to introduce a pair of identities: dot(a, b+c) = dot(a,b) + dot(a,c) and dot(a, sb) = (s)(dot(a,b))
  • Thus, (dot(v,p) - dot(v,q)) = 0
  • dot(v,p) - dot(v,(vt+o)) = 0
  • dot(v,p) = dot(v,vt+o))
  • dot(v,p) = dot(v,vt) + dot(v,o)
  • dot(v,p) = t * dot(v,v) + dot(v,o)
  • t = (dot(v,p) - dot(v,o)) / dot(v,v)
Once we have t, we can find q easily, then the distance between p and q.

Finally: Finding the overlaps between line segments (Go to the code.)

With all these preliminaries out of the way, finding out if and how two line segments a and b overlap is quite straightforward, and the implementation of LineSeg::Overlay should be easy to read. First, check that angle between a and b's direction vectors is within the angle tolerance. Then, check to see that each endpoint of line b is within the distance tolerance of line b, saving the t values along the way. Finally, if the (min_t_val..max_t_val) range overlaps (0..1), there is an overlap! (If not, b lies upon the same infinite line as a but they don't have anything shared.) Finally, the implementation constructs the overlapping segment as well as the clipped portions of a and/or b that were not overlapping.

There are a few tweaks to keep the implementation is fast and reasonably numerically stable -- it uses squared distances everywhere instead of distances to avoid sqrt(), for example, and ends up only having two division operations involved.

There is also fairly extensive unit testing for all the above code -- essential to keep this type of code from having nasty corner-case errors!

This overlay operation alone would be enough to implement a very slow version of boundary generator: Test every line segment of every polygon against every other line segment of every polygon and record the overlaps. Unfortunately its performance is unacceptable.

A final interesting side note

The representation we chose for lines, and line segments means that everything -- all of the math and all of the code -- remains correct and true, even if we are working in three (or more!) dimensions. All that needs to change is to start using three-dimensional vectors rather than two; everything else is built atop them using vector operations that exist in any number of dimensions.

Next time

Next, when I find the time, will be a description of the quadtree spatial index that Boundary Generator uses to keep things at least somewhat snappy.


I see this sort of tutorial content all over the place in programming communities, but less often in the geospatial developer world. If you're finding this interesting or helpful, please comment or drop me an email so I keep doing it!

Monday, October 6, 2008

Announcement: Boundary Generator, v. 0.1 (An open source geoprocessing tool.)

The time has come to introduce my weekends-and-evenings project for the last month to the world. World, meet Boundary Generator; Boundary Generator, meet world! It's an open source (GPL) geoprocessing tool. There are still some rough edges and the functionality is primitive, but it's good enough for people to download and give it a whirl.

What it does: Given an input layer of polygons, it creates lines that represent the shared borders between those polygons, attributing it with the IDs of the polygons on each side.

Why that is useful:
  • The output is an adjacency table, useful for all sorts of analyses. (For an example I'm familiar with, it is a required input to the MARXAN family of tools for automated conservation reserve design).
  • The output is also a normal shapefile, allowing you view it with interesting symbology or perform further processing. (Examples: Highlight antagonistic county borders by doing a join then symbolizing on the difference in political party identification between the sides; buffer international borders and measure populations w/in given distances, etc.)
  • For the programmers, it is a nice chunk of sample code for building a geoprocessing tool in C#. It is also an example of working directly with the underlying methods of computational geometry rather than relying on a pre-built API.
Quality and performance:
  • Boundary Generator is extremely alpha, so expect bugs and use with caution. I have done some testing, but far from enough.
  • The performance is not horrible; on my setup it runs at about 0.25 ms/line segment, or about three minutes to find the borders between the 3,488 counties in the US 2000 census. (My environment is WinXP under VirtualBox and linux, on an older 1.8Ghz laptop; the counties shapefile has ~803k line segments total.)
What the GPL open source license, as I understand it (obligatory: I am not a lawyer, so check first with yours!), lets you do:
  • You can run the tool and use the results however you like.
  • You can copy and paste code from the tool, include it wholesale in a bigger project of yours, or modify it to fit your needs, so long as you never distribute the larger project to another entity (it will forever be internal-only), or you distribute your project under the GPL as well.
  • If you'd like to include any code from this tool in a project of your own under other terms, you can contact me to discuss them.
Finally, why I took the time to build and release this:
  • Altruistically, to give GIS practitioners a useful tool -- for free! -- and GIS developers a larger example of ArcObjects geoprocessing to examine.
  • Less altruistically, my company will be releasing commercial products for GIS professionals in a fairly short timeframe; this is a great opportunity to attract potential beta users and measure the audience.
  • And finally: I have some availability for low-commitment consulting on top of ongoing product development. If you are reading this, like what you see in Boundary Generator, and have a small-yet-interesting geospatial problem to solve, please contact me.
I hope people find this interesting; I am planning on spending some more time cleaning up the rough edges over the next month or so. Feedback is extremely welcome!

Monday, September 15, 2008

You are in a maze of twisty passages, all alike. (An ArcObjects rant)

So, it's approaching midnight and the weekends-and-evenings side project I've been working on for the last three days or so suddenly stopped being fun and started being unpleasant.

Once again I find myself shaking my fist in the general direction of Redlands, CA. Can't they ever get anything right?

Hint: When your API contains the following interfaces;
  • IFeatureLayer
  • IFeatureClass
  • IGeoFeatureLayer
  • IDEFeatureClass
  • IGPFeatureLayer
; all of which are wrappers around pretty much the exact same concept... you're doing it wrong.

And to push things over the edge, the design wizards in Redlands have decided that it takes a dozen or or more twisty lines of under-documented property accesses and mystery casts to get from any one of them to another.

(snagged from the Fail blog)

For ESRI/ArcObjects virgins, you can get a glimpse of the horror in a conveniently visual form here: (warning: huge PDF; also an older version but what google found quickly).

And for those of you who have shared my pain, and might be interested as to what spurred this particular outburst (I'm probably into the hundreds of these, but this is the first I've published outside of a nasty email or two):

I'm implementing a geoprocessing tool, and trying to do it properly and robustly. ((Which means, at the moment, somehow getting from an IGPValue which contains a catalog path to an (output) feature layer, to the appropriate IFeatureWorkspace that can create a feature class in that location -- even if, for example, it's inside a personal geodatabase instead of a good ol' shapefile.))

Other things not working right yet:
  • Having my IGPFunction::Validate magically create an IDEFeatureLayer that appropriately describes its output. Necessary, I think, for people to be able to wire my tool up in Model Builder. (Side rant for another day -- Model Builder's value as a scripting/programming/automation tool: nearly worthless. Model Builder's value as a PowerPoint slide generator: $$$$.)
  • Getting IGPUtilities::GenerateDefaultOutputValue do something more interesting than E_FAIL.
Good news: Once things are working right, I'm planning to post and GPL the whole thing: that way others can possibly learn a bit from my struggles and/or help me overcome any lingering ArcObjects battles.

The tool itself is a reincarnation of Boundary Maker, an old Avenue script I wrote ages ago.

Saturday, May 17, 2008

Weekend listen: The Giant Pool of Money

It's an hour long, off topic, and I'm behind on work. But it's worth a listen if you're a non-finance-whiz, and interested in understanding the housing crisis:

("Free download" link is on the left side for a full mp3.)

For a geospatial take, the NY Fed has been putting together some very nice mortgage maps with quite recent data.

Wednesday, May 7, 2008

On sharing geospatial data

Blogger James Fee put up a post this monday that's generated a lot of interesting discussion.

Whenever geospatial data sharing comes up there's a pretty strong tendency to gloss over the fact that "sharing data" is actually a two-parter:
  • There is the sort of sharing you do with co-workers or associates: getting your data with them in, to borrow a phrase from the GPL, the "preferred form of the work for making modifications to it."
  • Then there is publishing GIS data (internally or externally) once it's "finalized."
As everyone (or at least everyone commenting on James' article) recognizes, the current state of the art sucks. We are lost in a maze of twisty file formats, standards, and protocols, none alike; even the simple task of emailing a shapefile to a co-worker is often the cause of bitter enmity between GIS and IT technicians.

Status quo for simple sharing: Zip your data up, making sure not to leave off any of the many files it is spread across, put a password on the zip file (the attachment will be stripped if it contains a personal geodatabase mdb visible to the spam filter), change the extension to something not '.zip', attach and pray. Gosh forbid you are sending anything larger than a few megabytes, or working with whole map documents at once. (Did you remember to include every related layer in that zip file with relative paths in the mxd?)

As for map data publishing, there is an embarrassment of options, none of them really mature (or rather, dominant) yet. Most require some programming in order to set up a publisher side, and on the consumer side you'll need to jump through some awkward conversion hoops as soon as you want to do something more involved than look at a pretty picture in IE or Google Earth. If you are like me, and you prefer or need to get your hands dirty, you breath a deep sigh of relief when there are plain old shapefiles or GeoTiffs available for download over HTTP.

Then there's discovering published data in the first place.... with a few exceptions, if you don't know in advance that it exists and have a good idea of where to look for it, it might as well not exist.

I like the idea of a SQLite data format, at least as a "preferred working format" for vector data. As a full relational database in a file, it supports the rich data model that was shoehorned into GML, but in a format designed especially for working with relational data and a full set of mature tools and libraries. And it doesn't have the drawbacks of being flaky and proprietary the way personal geodatabases are. What would be sweeten the pot for its use as a working data format, besides vendor support in the tools everyone uses: basic versioning or a 'track changes' equivalent. Way, way better would be supporting distributed branches and merging the way us programmers get to treat source code!

(Consider if the folks working on the WDPA had the GIS equivalent of git or darcs to do their jobs: branches for published versions where error corrections can still be made; the ability to accept or reject edits via email from anyone; local experts with their own working copies; larger datasets that contain other information on protected areas from other sources, but which can still automatically merge in the latest WDPA changes...)

That is the direction I think things will eventually move, but it might take a very long time. Don't hold your breath.

Footnote: I've put a reasonable amount of thought into these topics; without saying too much, they are extremely relevant to the product I am building. (ETA: inside of a couple of months!) (It's not decentralized version control for GIS, either: alack and alas.)

Tuesday, April 29, 2008

Where do people find the time?

I recently stumbled across this excellent, optimistic, little talk by Clay Shirky about user participation from the Web 2.0 conference. (Transcript.)

Wednesday, March 19, 2008

So you'd like to parse GML...

I am a big fan of standards.

However, I'd say that the Open Geospatial Consortium has a serious illness when it comes to defining standards for the industry: W3C-itis. The scope of the GML standard started out huge and became gigantic; the result is plagued with ambiguity and multiple ways of doing things; and is only half-correct on technical issues. (The worst example that I can think of is their botching of spatial references; e.g. axis order.) The OGC has W3C-itis far worse than the W3C themselves.

In short (as far as I can tell by looking at the result and not being involved in the process): the people writing the GML standards and those informing its development were the producers of data, and they took it up as their creed to ensure the standard could represent any and all data they could imagine and do so in a way that deviates as little as possible from their current practices. This has the effect of giving the shaft to any consumer of data, who needs to implement the entire standard in order to be able to meet the ideal of understanding all data produced under it. Which is the entire point of having a data format standard in the first place -- to decouple producers and consumers.

None of this is news and has been complained about loudly elsewhere. And there are now lots of bandaids emerging, e.g. the OGC's "profiles."

What I do have to contribute that's new is the following visualization of the GML schema for a single polygon. This is a railroad diagram, pretty straightforwardly interpreted by following the tracks. (Note that the tool I used sometimes puts sections right-to-left, so follow the arrows.) You'll need to click on it and zoom to 100% for it to be even remotely comprehensible.

A few simple observations and notes:
  • Everything is optional. You can have a completely empty polygon with no interior or exterior as far as GML is concerned.
  • Yes, there are five different ways to represent the points that make up a LineString for the boundary of a polygon's interior. Oh, and five different ways to represent a polygon boundary itself: as a LinearRing, or as a Ring with Curve, a CompositeCurve, an OrientableCurve or just a plain ol' LineString curveMembers. Nice.
  • The solid blue boxes are nonterminals ('complex types' in the XML Schema lingo): there are many more tags and choices hidden under them. I couldn't fully expand them without making a ridiculously large image.
  • I left off all XML attributes, in particular the grab-bag of attributes that sits on most every element. (srsName, arcRole, ref, etc)
  • The tools I used to create the image aren't perfect, so don't use this as a reference to GML. Instead think of it as a nice demonstration of the complexity you face.
I produced the image using ebnf2ps, which is awesome; although this is a bit of an abuse of its powers. I used an unfinished homebrew tool, which at some point may be released as open source, to process the schema's XSD files into EBNF. (Its actual purpose is to produce RELAX-NG schemas, which are very close cousins to EBNF.)

Thursday, February 7, 2008

Programming functionally: A proto-XPath implementation in Javascript

This has been sitting on my hard drive for a while. I thought I'd clean it up and post it! Well ... at least I've posted it.

If you don't know what "functional programming" is, wikipedia has a fairly technical overview. Loosely, "object" is to "object-oriented programming" as "function" is to "functional programming".

I wrote this to give a taste of the kind of power functional programming can bring to bear on a problem, as well as to document the basics of my psuedo-xpath implementation for myself. Some love to hate Javascript, but I think this comes more from the hostile browser environment than the language, which is quite nice. Note that, while I'm heavily using Javascript's functional features, I'm not making any effort to be 'purely functional'. (If I were using a purist style, I would not be using variable assignments and avoiding for/while loops, instead using recursion and if statements to the same effect.)

JS makes it quite easy to pass functions to as arguments, one of the core distinctions of programming in a functional style:
function doSomethingWithPi(arg) {
function eat(x) {
alert(x + " sure was yummy!");
doSomethingWithPi(eat); // ------> "3.1415 sure was yummy!"

Simple enough. Now consider:
function siblingAxis(f,node) {
var v = node.nextSibling;
while (v != null) {
v = v.nextSibling;

This function, when given a function & a DOM node, calls that function for every sibling of that node (until out of siblings). If you're familiar with the XPath spec, you already know why the function is called "siblingAxis". Here's another, very similar function -- it calls the provided function once for each child of the given node:
function childAxis(f,node) {
var v = node.firstChild;
if (v != null) {
f(v); // call the function for the first child node
siblingAxis(f,v); // do the same for all its siblings...

And here's one that takes two functions; the first acts as a test for the given DOM node and the second is only called if the test succeeds with the given node.
function test(test_f, f, node) {
if (test_f(node)) {

Now, if we could wire our test function up to our childAxis function, we would have the beginnings of something fairly powerful. The trick is to curry the function test. It's easier to show what this means rather than try to explain it.
function test(test_f, f) {
return function(node) {
if (test_f(node)) {

Now, test(test_f,action_f) does not perform its test or action right away. Instead it returns a new function, which expects a node and in turn runs the test against the node, and the other function if it's successful. Javascript implicitly keeps test_f and action_f arguments intact and available in the returned function. This is called a closure in FP jargon. By currying test, we can wire it up with e.g., childAxis:
function isLink(node) {
return (node.nodeType == 1 && node.localName.toLowerCase() == "a");

function doSomethingWithAllLinkChildren(f, node) {
childAxis(test(isLink,f), node); // easy peasy

It only makes sense to do the exact same thing with the axis functions themselves. This means that each axis function takes a "do something with nodes on this axis" function, and returns a function that, from a starting node, actually traverses the given axis and does the something.
function siblingAxis(f) {
return function(node) {
var v = node.nextSibling;
while (v != null) {
v = v.nextSibling;

function childAxis(f) {
return function(node) {
var v = node.firstChild;
if (v != null) {
f(v); // call the function for the first child node
siblingAxis(f)(v); // siblingAxis is curried: siblingAxis(f,v) becomes siblingAxis(f)(v)

What have we gained from this? Well, here's a function findem that alerts with the children of each span found within a div child of the given node. In xpath terms, we're after "/div/span/*" (with the starting node treated as the root):
function nameTest(nm) {
return function(node) {
return (node.nodeType == 1 && node.localName.toLowerCase() == nm.toLowerCase());

function nodealert(node) {
alert("found " + node.localName);

function findem(node) {
var path = childAxis(test(nameTest("div"),(childAxis(test(nameTest("span"),childAxis(nodealert)))));

Let's make another small change to test, currying it again:
function test(test_f) {
return function(f) {
return function(node) {
if (test_f(node)) {

Now, we can use a handy bit of non-javascript notation and say that f ⋅ g is shorthand for function(x) { f(g(x)) }, and rewrite findem in this style:
function findem(node) {
var path = childAxis⋅test(nameTest("div"))⋅childAxis⋅test(nameTest("span"))⋅childAxis⋅nodealert;

Observe the wonderful correspondence between our functions and the various components of the path they represent ("/div/span/*"):
/     ---> childAxis
div ---> test(nameTest("div"))
/ ---> childAxis
span ---> test(nameTest("span"))
/ ---> childAxis
* ---> (n/a)

At this point, having XPath (or something like it) up and running in javascript is a very straightforward job.

  • Each axis (child, sibling, descendant, etc) can be represented by : function someAxis(f) { return function(node) { /* traverse the axis from the node, calling 'f' along the way */ } }

  • Each test (both "node tests" and other predicates) can represented as a function that takes a node and returns a boolean, then gets wrapped by test. We'd need to write helper functions such as nameTest above to construct the actual test functions.

  • The input path is parsed and translated into a list of axis and test functions. Note that all of them have the form function(f) { return function(node) { /*...*/ } }

  • Combine the list of functions into the form a(b(c(d(ff))) where ff is the 'action' function that you want to be called for any nodes returned by the path. This has the effect of calling the outer function(f) { /*...*/} for each node step with f as the next node step, and results in a function awaiting a node...

  • Call it with the root node; your ff will be called for every succeeding node!

Click here to see the complete example of the proto-XPath implementation that I've been using along the way. It adds parent,self, and descendant axes; it also insert an implicit test at each step of evaluation to ensure that each node is visited only once. (This prevents repeat visits during paths like "//div//div"). The parsing code is hokey but works.


  • I'm using the term "currying" a bit loosely to refer both to the transformation of a function f(a,b,c) { /* ... */ } into f(a) { return f(b) { return f(c) { /*...*/ } } }, and to calling a function written in this style with partial arguments to return a new function, as in f(a_argument)(b_argument).

  • The code I've written is not purely functional, nor even close to it: However, it makes good use of some functional programming features, and is a nice functional/imperative compromise for javascript.

  • There is still quite a bit of work to get a full implementation of XPath up and running, as XPath is a large language. However, I believe the remaing work -- if you were interested in a full implementation of XPath -- consists mostly of straightforward extension of the sketch laid out here.

Saturday, January 5, 2008

Starting up

Here I am, embarking on the development of my new software business! Verdict so far: busy, but good busy.

I've stepped out on my own to develop software that works with and within the growing ecosystem of geographic information systems. This is a very exciting time in the geospatial world, with lots going on -- an opportune time for the birth of a new company. I'm eager to share what I can along the journey.

This blog will be a mixed announcement platform, dumping-ground, soap-box, and entry point into the many conversations happening out there on the web. Here's what to expect, especially while I'm focusing most intently on brewing up new software:
  • Code: I've been the beneficiary of open source for a very, very, long time now. Where it makes sense to do so, I plan on giving back.
  • Articles: I enjoy (over-) explaining things, particularly programming-related. Perhaps someone might find it useful; almost everything I've learned about coding in the last five years has been from the web.
  • "Blog" material: Pointers to useful resources, off-the-top-of-my-head pontification, product reviews, links to offbeat YouTube videos...
Here's to the interesting times ahead!