Saturday, October 11, 2008

Inside Boundary Generator: Computational geometry for the GIS programmer

I'm going to take a post or two to dive into the internals of Boundary Generator, my GPL geoprocessing tool that had its "alpha" release last week. The nice thing about open source is that I can point directly at the code as I go along. Up this dreary Saturday afternoon: The underlying computational geometry (some code but mostly math!) behind finding shared boundaries. In my experience, it's fairly rare to get your hands dirty with this when doing GIS development, so hopefully others will find it as interesting as I do. (This is also fairly elementary as far as computational geometry goes; this blog post has nothing that wouldn't be taught in the first few sessions of an undergraduate computer graphics course, or even an expected prerequisite.)

The problem that we need to solve is that of finding overlapping line segments. In addition, our solution should have well-defined tolerances and not be subject to (too much) numerical instability.

Representing geometry

When sitting down to code, the first problem is that of representation: How do we store a line segment? Most peoples' instinctive representation is as a pair of endpoints -- (x0,y0),(x1,y1). But a bit of thought shows that there are many other options. For example, we could store line segments in a more algebraic form, such as the values m, b, x0 and x1 from the following description of a line segment:
  • y = mx + b where x0 <= x <= x1
Although this isn't as natural as simply working with endpoints, it has some big advantages over simply storing endpoints for finding overlaps:
  • If two line segments have the same slope (m) and y-intercept (b), we can see if they overlap just by checking their x0..x1 ranges for an overlap.
  • We can also test for line crossings by solving a system of two equations (y = (m0)(x) + b0 and y = (m1)(x) + b1) for x and y. Boundary Generator doesn't need this, but other geoprocessing functions in the future will.
However, this approach is plagued with numerical instability and special cases -- what do you do with lines that are vertical or near-vertical? -- that will quickly turn code into spaghetti as you add special cases to address these problems. It is also difficult to specify overlap tolerances in a meaningful way.

Instead, boundary generator uses a more abstract (and much better) representation that dives a little further (but not too far) past basic algebra: It is a parametric representation that consists of an origin point (o), a vector (v), and a scalar parameter (t). Just as an infinite line may be defined by a slope and y-intercept value, it may instead be defined as an origin and a vector; and just as a line segment can be defined upon an infinite line as a range of x values, a line segment can be defined as a range of t values. The line is all points p which satisfy p = vt + o; the line segment is the portion of the line where 0 <= t <= 1. If the preceeding made sense you may skip the next section or two. :)

Vectors (Go to the code; go to a more in-depth Wikipedia article.)

What is a vector? It is this: a magnitude (length), and a direction. It is typically and most conveniently represented as an (x,y) pair; this defines both direction and length. (The direction is the direction from the origin (0,0) through the point (x,y) and the magnitude is the length of a line segment from the origin to (x,y) ).

Vectors can be added to each other; if we have two vectors a and b, a + b is the vector (ax + bx, ay + by). (They can also be subtracted.) You can also scale a vector by an amount s, changing its length without changing its direction, simply by multiplying both x and y by the scale amount: (sx, sy).

We can also (obviously, since a vector is just an x,y pair) choose to represent a point via a vector simply by interpreting it as the magnitude and direction from the origin.

Line segments (Go to the code.)

Our parametric representation of an infinite line (repeating from before) is: a starting point o, a direction/magnitude v, and a parameter t: vt + o. By varying t, we get all points on the line. When t is 0 it gives us the starting point. We can also test to see if a point p is on the line by trying to solve p = vt +o for t.

And to turn our infinite line into a line segment: simply restrict t to lie between zero and one. Since v already includes a length as well as a direction, the line segment is nicely defined -- the start point is o; the end point is o + v (by setting t to 1.)

This also is a very, very good representation for finding overlapping segments as well as crossing segments. But first, I need to introduce...

The vector dot product (Go to the code; go to the Wikipedia article)

This is the swiss-army-knife of vector operations. It takes in two vectors, and returns (the length of the first vector) * (the length of the second vector) * (the cosine of the angle between them). It is also very easy to calculate. This has a ton of uses, to name a few:
  • The dot product of a vector with itself is equal to the square of its length.
  • The angle between two vectors a and b, both of magnitude 1, is acos(dot(a,b)).
  • If two vectors are perpendicular, their dot product will be zero.
The code uses the dot product in all these ways. For testing whether line segments overlap, we need to ensure they are parallel within a specific tolerance. This can be done by normalizing the vectors (multiplying each by 1/length to preserve its direction while making its length 1), then taking the dot product and inverse cosine; the two are parallel if the angle is within the tolerance to zero or 180 degrees.

The actual boundary generator implementation of AreParallel() uses a bit of algebra during parallel testing to eliminate all the slower and numerically unstable bits. (E.g., replacing divisions w/ multiplications; taking the cos() of the tolerance up front rather than acos() of the dot product).

The distance from a point to a line (Go to the code)

The dot product is also used to help determine the distance from a point to a line. It requires a bit of algebraic reasoning (I put a nicer version of this derivation on mathbin.net, but I don't know how long the link will last):
  • The line l is defined by vectors o and v as above; the point is the p.
  • If we draw a line from p to the closest point of l, it will be perpendicular to l. Call this closest point q.
  • Since q is on l, there is a t value such that q = tv + o.
  • Since the line from p to q is perpendicular to l, dot(v, p-q) = 0.
  • Need to introduce a pair of identities: dot(a, b+c) = dot(a,b) + dot(a,c) and dot(a, sb) = (s)(dot(a,b))
  • Thus, (dot(v,p) - dot(v,q)) = 0
  • dot(v,p) - dot(v,(vt+o)) = 0
  • dot(v,p) = dot(v,vt+o))
  • dot(v,p) = dot(v,vt) + dot(v,o)
  • dot(v,p) = t * dot(v,v) + dot(v,o)
  • t = (dot(v,p) - dot(v,o)) / dot(v,v)
Once we have t, we can find q easily, then the distance between p and q.

Finally: Finding the overlaps between line segments (Go to the code.)

With all these preliminaries out of the way, finding out if and how two line segments a and b overlap is quite straightforward, and the implementation of LineSeg::Overlay should be easy to read. First, check that angle between a and b's direction vectors is within the angle tolerance. Then, check to see that each endpoint of line b is within the distance tolerance of line b, saving the t values along the way. Finally, if the (min_t_val..max_t_val) range overlaps (0..1), there is an overlap! (If not, b lies upon the same infinite line as a but they don't have anything shared.) Finally, the implementation constructs the overlapping segment as well as the clipped portions of a and/or b that were not overlapping.

There are a few tweaks to keep the implementation is fast and reasonably numerically stable -- it uses squared distances everywhere instead of distances to avoid sqrt(), for example, and ends up only having two division operations involved.

There is also fairly extensive unit testing for all the above code -- essential to keep this type of code from having nasty corner-case errors!

This overlay operation alone would be enough to implement a very slow version of boundary generator: Test every line segment of every polygon against every other line segment of every polygon and record the overlaps. Unfortunately its performance is unacceptable.

A final interesting side note

The representation we chose for lines, and line segments means that everything -- all of the math and all of the code -- remains correct and true, even if we are working in three (or more!) dimensions. All that needs to change is to start using three-dimensional vectors rather than two; everything else is built atop them using vector operations that exist in any number of dimensions.

Next time

Next, when I find the time, will be a description of the quadtree spatial index that Boundary Generator uses to keep things at least somewhat snappy.

Thoughts?


I see this sort of tutorial content all over the place in programming communities, but less often in the geospatial developer world. If you're finding this interesting or helpful, please comment or drop me an email so I keep doing it!

Monday, October 6, 2008

Announcement: Boundary Generator, v. 0.1 (An open source geoprocessing tool.)

The time has come to introduce my weekends-and-evenings project for the last month to the world. World, meet Boundary Generator; Boundary Generator, meet world! It's an open source (GPL) geoprocessing tool. There are still some rough edges and the functionality is primitive, but it's good enough for people to download and give it a whirl.


What it does: Given an input layer of polygons, it creates lines that represent the shared borders between those polygons, attributing it with the IDs of the polygons on each side.

Why that is useful:
  • The output is an adjacency table, useful for all sorts of analyses. (For an example I'm familiar with, it is a required input to the MARXAN family of tools for automated conservation reserve design).
  • The output is also a normal shapefile, allowing you view it with interesting symbology or perform further processing. (Examples: Highlight antagonistic county borders by doing a join then symbolizing on the difference in political party identification between the sides; buffer international borders and measure populations w/in given distances, etc.)
  • For the programmers, it is a nice chunk of sample code for building a geoprocessing tool in C#. It is also an example of working directly with the underlying methods of computational geometry rather than relying on a pre-built API.
Quality and performance:
  • Boundary Generator is extremely alpha, so expect bugs and use with caution. I have done some testing, but far from enough.
  • The performance is not horrible; on my setup it runs at about 0.25 ms/line segment, or about three minutes to find the borders between the 3,488 counties in the US 2000 census. (My environment is WinXP under VirtualBox and linux, on an older 1.8Ghz laptop; the counties shapefile has ~803k line segments total.)
What the GPL open source license, as I understand it (obligatory: I am not a lawyer, so check first with yours!), lets you do:
  • You can run the tool and use the results however you like.
  • You can copy and paste code from the tool, include it wholesale in a bigger project of yours, or modify it to fit your needs, so long as you never distribute the larger project to another entity (it will forever be internal-only), or you distribute your project under the GPL as well.
  • If you'd like to include any code from this tool in a project of your own under other terms, you can contact me to discuss them.
Finally, why I took the time to build and release this:
  • Altruistically, to give GIS practitioners a useful tool -- for free! -- and GIS developers a larger example of ArcObjects geoprocessing to examine.
  • Less altruistically, my company will be releasing commercial products for GIS professionals in a fairly short timeframe; this is a great opportunity to attract potential beta users and measure the audience.
  • And finally: I have some availability for low-commitment consulting on top of ongoing product development. If you are reading this, like what you see in Boundary Generator, and have a small-yet-interesting geospatial problem to solve, please contact me.
I hope people find this interesting; I am planning on spending some more time cleaning up the rough edges over the next month or so. Feedback is extremely welcome!