Sunday, January 25, 2009

Open Street Map 2008: A Year of Edits

I didn't see this video up on Planet GS yet, and I thought it was pretty neat. It's a well-done animation of OSM's evolution during 2008: both pretty to watch and a compelling testament to the power of allowing users to create and meddle with data.




OSM 2008: A Year of Edits.

Unrelated: I've been quiet on the blogging front, but that's because I've been rather busy finishing out some contract work. Expect some updates to Boundary Generator soon!

Saturday, October 11, 2008

Inside Boundary Generator: Computational geometry for the GIS programmer

I'm going to take a post or two to dive into the internals of Boundary Generator, my GPL geoprocessing tool that had its "alpha" release last week. The nice thing about open source is that I can point directly at the code as I go along. Up this dreary Saturday afternoon: The underlying computational geometry (some code but mostly math!) behind finding shared boundaries. In my experience, it's fairly rare to get your hands dirty with this when doing GIS development, so hopefully others will find it as interesting as I do. (This is also fairly elementary as far as computational geometry goes; this blog post has nothing that wouldn't be taught in the first few sessions of an undergraduate computer graphics course, or even an expected prerequisite.)

The problem that we need to solve is that of finding overlapping line segments. In addition, our solution should have well-defined tolerances and not be subject to (too much) numerical instability.

Representing geometry

When sitting down to code, the first problem is that of representation: How do we store a line segment? Most peoples' instinctive representation is as a pair of endpoints -- (x0,y0),(x1,y1). But a bit of thought shows that there are many other options. For example, we could store line segments in a more algebraic form, such as the values m, b, x0 and x1 from the following description of a line segment:
  • y = mx + b where x0 <= x <= x1
Although this isn't as natural as simply working with endpoints, it has some big advantages over simply storing endpoints for finding overlaps:
  • If two line segments have the same slope (m) and y-intercept (b), we can see if they overlap just by checking their x0..x1 ranges for an overlap.
  • We can also test for line crossings by solving a system of two equations (y = (m0)(x) + b0 and y = (m1)(x) + b1) for x and y. Boundary Generator doesn't need this, but other geoprocessing functions in the future will.
However, this approach is plagued with numerical instability and special cases -- what do you do with lines that are vertical or near-vertical? -- that will quickly turn code into spaghetti as you add special cases to address these problems. It is also difficult to specify overlap tolerances in a meaningful way.

Instead, boundary generator uses a more abstract (and much better) representation that dives a little further (but not too far) past basic algebra: It is a parametric representation that consists of an origin point (o), a vector (v), and a scalar parameter (t). Just as an infinite line may be defined by a slope and y-intercept value, it may instead be defined as an origin and a vector; and just as a line segment can be defined upon an infinite line as a range of x values, a line segment can be defined as a range of t values. The line is all points p which satisfy p = vt + o; the line segment is the portion of the line where 0 <= t <= 1. If the preceeding made sense you may skip the next section or two. :)

Vectors (Go to the code; go to a more in-depth Wikipedia article.)

What is a vector? It is this: a magnitude (length), and a direction. It is typically and most conveniently represented as an (x,y) pair; this defines both direction and length. (The direction is the direction from the origin (0,0) through the point (x,y) and the magnitude is the length of a line segment from the origin to (x,y) ).

Vectors can be added to each other; if we have two vectors a and b, a + b is the vector (ax + bx, ay + by). (They can also be subtracted.) You can also scale a vector by an amount s, changing its length without changing its direction, simply by multiplying both x and y by the scale amount: (sx, sy).

We can also (obviously, since a vector is just an x,y pair) choose to represent a point via a vector simply by interpreting it as the magnitude and direction from the origin.

Line segments (Go to the code.)

Our parametric representation of an infinite line (repeating from before) is: a starting point o, a direction/magnitude v, and a parameter t: vt + o. By varying t, we get all points on the line. When t is 0 it gives us the starting point. We can also test to see if a point p is on the line by trying to solve p = vt +o for t.

And to turn our infinite line into a line segment: simply restrict t to lie between zero and one. Since v already includes a length as well as a direction, the line segment is nicely defined -- the start point is o; the end point is o + v (by setting t to 1.)

This also is a very, very good representation for finding overlapping segments as well as crossing segments. But first, I need to introduce...

The vector dot product (Go to the code; go to the Wikipedia article)

This is the swiss-army-knife of vector operations. It takes in two vectors, and returns (the length of the first vector) * (the length of the second vector) * (the cosine of the angle between them). It is also very easy to calculate. This has a ton of uses, to name a few:
  • The dot product of a vector with itself is equal to the square of its length.
  • The angle between two vectors a and b, both of magnitude 1, is acos(dot(a,b)).
  • If two vectors are perpendicular, their dot product will be zero.
The code uses the dot product in all these ways. For testing whether line segments overlap, we need to ensure they are parallel within a specific tolerance. This can be done by normalizing the vectors (multiplying each by 1/length to preserve its direction while making its length 1), then taking the dot product and inverse cosine; the two are parallel if the angle is within the tolerance to zero or 180 degrees.

The actual boundary generator implementation of AreParallel() uses a bit of algebra during parallel testing to eliminate all the slower and numerically unstable bits. (E.g., replacing divisions w/ multiplications; taking the cos() of the tolerance up front rather than acos() of the dot product).

The distance from a point to a line (Go to the code)

The dot product is also used to help determine the distance from a point to a line. It requires a bit of algebraic reasoning (I put a nicer version of this derivation on mathbin.net, but I don't know how long the link will last):
  • The line l is defined by vectors o and v as above; the point is the p.
  • If we draw a line from p to the closest point of l, it will be perpendicular to l. Call this closest point q.
  • Since q is on l, there is a t value such that q = tv + o.
  • Since the line from p to q is perpendicular to l, dot(v, p-q) = 0.
  • Need to introduce a pair of identities: dot(a, b+c) = dot(a,b) + dot(a,c) and dot(a, sb) = (s)(dot(a,b))
  • Thus, (dot(v,p) - dot(v,q)) = 0
  • dot(v,p) - dot(v,(vt+o)) = 0
  • dot(v,p) = dot(v,vt+o))
  • dot(v,p) = dot(v,vt) + dot(v,o)
  • dot(v,p) = t * dot(v,v) + dot(v,o)
  • t = (dot(v,p) - dot(v,o)) / dot(v,v)
Once we have t, we can find q easily, then the distance between p and q.

Finally: Finding the overlaps between line segments (Go to the code.)

With all these preliminaries out of the way, finding out if and how two line segments a and b overlap is quite straightforward, and the implementation of LineSeg::Overlay should be easy to read. First, check that angle between a and b's direction vectors is within the angle tolerance. Then, check to see that each endpoint of line b is within the distance tolerance of line b, saving the t values along the way. Finally, if the (min_t_val..max_t_val) range overlaps (0..1), there is an overlap! (If not, b lies upon the same infinite line as a but they don't have anything shared.) Finally, the implementation constructs the overlapping segment as well as the clipped portions of a and/or b that were not overlapping.

There are a few tweaks to keep the implementation is fast and reasonably numerically stable -- it uses squared distances everywhere instead of distances to avoid sqrt(), for example, and ends up only having two division operations involved.

There is also fairly extensive unit testing for all the above code -- essential to keep this type of code from having nasty corner-case errors!

This overlay operation alone would be enough to implement a very slow version of boundary generator: Test every line segment of every polygon against every other line segment of every polygon and record the overlaps. Unfortunately its performance is unacceptable.

A final interesting side note

The representation we chose for lines, and line segments means that everything -- all of the math and all of the code -- remains correct and true, even if we are working in three (or more!) dimensions. All that needs to change is to start using three-dimensional vectors rather than two; everything else is built atop them using vector operations that exist in any number of dimensions.

Next time

Next, when I find the time, will be a description of the quadtree spatial index that Boundary Generator uses to keep things at least somewhat snappy.

Thoughts?


I see this sort of tutorial content all over the place in programming communities, but less often in the geospatial developer world. If you're finding this interesting or helpful, please comment or drop me an email so I keep doing it!

Monday, October 6, 2008

Announcement: Boundary Generator, v. 0.1 (An open source geoprocessing tool.)

The time has come to introduce my weekends-and-evenings project for the last month to the world. World, meet Boundary Generator; Boundary Generator, meet world! It's an open source (GPL) geoprocessing tool. There are still some rough edges and the functionality is primitive, but it's good enough for people to download and give it a whirl.


What it does: Given an input layer of polygons, it creates lines that represent the shared borders between those polygons, attributing it with the IDs of the polygons on each side.

Why that is useful:
  • The output is an adjacency table, useful for all sorts of analyses. (For an example I'm familiar with, it is a required input to the MARXAN family of tools for automated conservation reserve design).
  • The output is also a normal shapefile, allowing you view it with interesting symbology or perform further processing. (Examples: Highlight antagonistic county borders by doing a join then symbolizing on the difference in political party identification between the sides; buffer international borders and measure populations w/in given distances, etc.)
  • For the programmers, it is a nice chunk of sample code for building a geoprocessing tool in C#. It is also an example of working directly with the underlying methods of computational geometry rather than relying on a pre-built API.
Quality and performance:
  • Boundary Generator is extremely alpha, so expect bugs and use with caution. I have done some testing, but far from enough.
  • The performance is not horrible; on my setup it runs at about 0.25 ms/line segment, or about three minutes to find the borders between the 3,488 counties in the US 2000 census. (My environment is WinXP under VirtualBox and linux, on an older 1.8Ghz laptop; the counties shapefile has ~803k line segments total.)
What the GPL open source license, as I understand it (obligatory: I am not a lawyer, so check first with yours!), lets you do:
  • You can run the tool and use the results however you like.
  • You can copy and paste code from the tool, include it wholesale in a bigger project of yours, or modify it to fit your needs, so long as you never distribute the larger project to another entity (it will forever be internal-only), or you distribute your project under the GPL as well.
  • If you'd like to include any code from this tool in a project of your own under other terms, you can contact me to discuss them.
Finally, why I took the time to build and release this:
  • Altruistically, to give GIS practitioners a useful tool -- for free! -- and GIS developers a larger example of ArcObjects geoprocessing to examine.
  • Less altruistically, my company will be releasing commercial products for GIS professionals in a fairly short timeframe; this is a great opportunity to attract potential beta users and measure the audience.
  • And finally: I have some availability for low-commitment consulting on top of ongoing product development. If you are reading this, like what you see in Boundary Generator, and have a small-yet-interesting geospatial problem to solve, please contact me.
I hope people find this interesting; I am planning on spending some more time cleaning up the rough edges over the next month or so. Feedback is extremely welcome!

Monday, September 15, 2008

You are in a maze of twisty passages, all alike. (An ArcObjects rant)

So, it's approaching midnight and the weekends-and-evenings side project I've been working on for the last three days or so suddenly stopped being fun and started being unpleasant.

Once again I find myself shaking my fist in the general direction of Redlands, CA. Can't they ever get anything right?

Hint: When your API contains the following interfaces;
  • IFeatureLayer
  • IFeatureClass
  • IGeoFeatureLayer
  • IDEFeatureClass
  • IGPFeatureLayer
; all of which are wrappers around pretty much the exact same concept... you're doing it wrong.

And to push things over the edge, the design wizards in Redlands have decided that it takes a dozen or or more twisty lines of under-documented property accesses and mystery casts to get from any one of them to another.

(snagged from the Fail blog)

For ESRI/ArcObjects virgins, you can get a glimpse of the horror in a conveniently visual form here: http://edndoc.esri.com/Arcobjectsonline/Diagrams/AllOMDs.pdf (warning: huge PDF; also an older version but what google found quickly).

And for those of you who have shared my pain, and might be interested as to what spurred this particular outburst (I'm probably into the hundreds of these, but this is the first I've published outside of a nasty email or two):

I'm implementing a geoprocessing tool, and trying to do it properly and robustly. ((Which means, at the moment, somehow getting from an IGPValue which contains a catalog path to an (output) feature layer, to the appropriate IFeatureWorkspace that can create a feature class in that location -- even if, for example, it's inside a personal geodatabase instead of a good ol' shapefile.))

Other things not working right yet:
  • Having my IGPFunction::Validate magically create an IDEFeatureLayer that appropriately describes its output. Necessary, I think, for people to be able to wire my tool up in Model Builder. (Side rant for another day -- Model Builder's value as a scripting/programming/automation tool: nearly worthless. Model Builder's value as a PowerPoint slide generator: $$$$.)
  • Getting IGPUtilities::GenerateDefaultOutputValue do something more interesting than E_FAIL.
Good news: Once things are working right, I'm planning to post and GPL the whole thing: that way others can possibly learn a bit from my struggles and/or help me overcome any lingering ArcObjects battles.

The tool itself is a reincarnation of Boundary Maker, an old Avenue script I wrote ages ago.

Saturday, May 17, 2008

Weekend listen: The Giant Pool of Money

It's an hour long, off topic, and I'm behind on work. But it's worth a listen if you're a non-finance-whiz, and interested in understanding the housing crisis:

http://www.thisamericanlife.org/Radio_Episode.aspx?episode=355

("Free download" link is on the left side for a full mp3.)

For a geospatial take, the NY Fed has been putting together some very nice mortgage maps with quite recent data.

Wednesday, May 7, 2008

On sharing geospatial data

Blogger James Fee put up a post this monday that's generated a lot of interesting discussion.

Whenever geospatial data sharing comes up there's a pretty strong tendency to gloss over the fact that "sharing data" is actually a two-parter:
  • There is the sort of sharing you do with co-workers or associates: getting your data with them in, to borrow a phrase from the GPL, the "preferred form of the work for making modifications to it."
  • Then there is publishing GIS data (internally or externally) once it's "finalized."
As everyone (or at least everyone commenting on James' article) recognizes, the current state of the art sucks. We are lost in a maze of twisty file formats, standards, and protocols, none alike; even the simple task of emailing a shapefile to a co-worker is often the cause of bitter enmity between GIS and IT technicians.

Status quo for simple sharing: Zip your data up, making sure not to leave off any of the many files it is spread across, put a password on the zip file (the attachment will be stripped if it contains a personal geodatabase mdb visible to the spam filter), change the extension to something not '.zip', attach and pray. Gosh forbid you are sending anything larger than a few megabytes, or working with whole map documents at once. (Did you remember to include every related layer in that zip file with relative paths in the mxd?)

As for map data publishing, there is an embarrassment of options, none of them really mature (or rather, dominant) yet. Most require some programming in order to set up a publisher side, and on the consumer side you'll need to jump through some awkward conversion hoops as soon as you want to do something more involved than look at a pretty picture in IE or Google Earth. If you are like me, and you prefer or need to get your hands dirty, you breath a deep sigh of relief when there are plain old shapefiles or GeoTiffs available for download over HTTP.

Then there's discovering published data in the first place.... with a few exceptions, if you don't know in advance that it exists and have a good idea of where to look for it, it might as well not exist.


I like the idea of a SQLite data format, at least as a "preferred working format" for vector data. As a full relational database in a file, it supports the rich data model that was shoehorned into GML, but in a format designed especially for working with relational data and a full set of mature tools and libraries. And it doesn't have the drawbacks of being flaky and proprietary the way personal geodatabases are. What would be sweeten the pot for its use as a working data format, besides vendor support in the tools everyone uses: basic versioning or a 'track changes' equivalent. Way, way better would be supporting distributed branches and merging the way us programmers get to treat source code!

(Consider if the folks working on the WDPA had the GIS equivalent of git or darcs to do their jobs: branches for published versions where error corrections can still be made; the ability to accept or reject edits via email from anyone; local experts with their own working copies; larger datasets that contain other information on protected areas from other sources, but which can still automatically merge in the latest WDPA changes...)

That is the direction I think things will eventually move, but it might take a very long time. Don't hold your breath.

Footnote: I've put a reasonable amount of thought into these topics; without saying too much, they are extremely relevant to the product I am building. (ETA: inside of a couple of months!) (It's not decentralized version control for GIS, either: alack and alas.)

Tuesday, April 29, 2008

Where do people find the time?

I recently stumbled across this excellent, optimistic, little talk by Clay Shirky about user participation from the Web 2.0 conference. (Transcript.)