Thursday, September 23, 2010
Monday, June 21, 2010
So I'm happy to announce the first public release of PyRtree, a pure python spatial index. It is quite bare bones, and alpha quality, at the moment, but has proven very useful to me nonetheless.
It isn't as speedy as the Rtree library, which links to a C library, but it has the benefits of simplicity, code flexibility, and ease of use: no compilation or library version headaches!
For those who haven't worked with this corner of computer science before, expect a post in the next week (or three) explaining what an R-Tree is and how to use it.
Wednesday, June 16, 2010
Check it out and please let me know here what you think! You'll need a fairly recent web browser for it work.
Sunday, May 2, 2010
I do the same thing via RSS, using Google Reader, and it works much better than email. Unfortunately, most of the mailing lists I'm interested in don't have RSS-ified archives. So I wrote a short script that downloads email messages directly into an RSS feed; this is just a quick post to share it with the world: http://github.com/danshoutis/mail-to-rss
Here's the short rundown on how I've set things up:
- I set up a filter in gmail that automatically labels mailing list traffic and has it get archived immediately.
- The email-to-rss script downloads new messages from that gmail label, converts them to HTML, and builds an RSS index. I have it save the results into a public dropbox folder to save me the step of uploading them to a server somewhere.
- Then, I just give the public address of the final rss file to Google Reader, and schedule the script to run automatically.
Sunday, January 25, 2009
I didn't see this video up on Planet GS yet, and I thought it was pretty neat. It's a well-done animation of OSM's evolution during 2008: both pretty to watch and a compelling testament to the power of allowing users to create and meddle with data.
OSM 2008: A Year of Edits.
Unrelated: I've been quiet on the blogging front, but that's because I've been rather busy finishing out some contract work. Expect some updates to Boundary Generator soon!
Saturday, October 11, 2008
The problem that we need to solve is that of finding overlapping line segments. In addition, our solution should have well-defined tolerances and not be subject to (too much) numerical instability.
When sitting down to code, the first problem is that of representation: How do we store a line segment? Most peoples' instinctive representation is as a pair of endpoints -- (x0,y0),(x1,y1). But a bit of thought shows that there are many other options. For example, we could store line segments in a more algebraic form, such as the values m, b, x0 and x1 from the following description of a line segment:
- y = mx + b where x0 <= x <= x1
- If two line segments have the same slope (m) and y-intercept (b), we can see if they overlap just by checking their x0..x1 ranges for an overlap.
- We can also test for line crossings by solving a system of two equations (y = (m0)(x) + b0 and y = (m1)(x) + b1) for x and y. Boundary Generator doesn't need this, but other geoprocessing functions in the future will.
Instead, boundary generator uses a more abstract (and much better) representation that dives a little further (but not too far) past basic algebra: It is a parametric representation that consists of an origin point (o), a vector (v), and a scalar parameter (t). Just as an infinite line may be defined by a slope and y-intercept value, it may instead be defined as an origin and a vector; and just as a line segment can be defined upon an infinite line as a range of x values, a line segment can be defined as a range of t values. The line is all points p which satisfy p = vt + o; the line segment is the portion of the line where 0 <= t <= 1. If the preceeding made sense you may skip the next section or two. :)
Vectors (Go to the code; go to a more in-depth Wikipedia article.)
What is a vector? It is this: a magnitude (length), and a direction. It is typically and most conveniently represented as an (x,y) pair; this defines both direction and length. (The direction is the direction from the origin (0,0) through the point (x,y) and the magnitude is the length of a line segment from the origin to (x,y) ).
Vectors can be added to each other; if we have two vectors a and b, a + b is the vector (ax + bx, ay + by). (They can also be subtracted.) You can also scale a vector by an amount s, changing its length without changing its direction, simply by multiplying both x and y by the scale amount: (sx, sy).
We can also (obviously, since a vector is just an x,y pair) choose to represent a point via a vector simply by interpreting it as the magnitude and direction from the origin.
Line segments (Go to the code.)
Our parametric representation of an infinite line (repeating from before) is: a starting point o, a direction/magnitude v, and a parameter t: vt + o. By varying t, we get all points on the line. When t is 0 it gives us the starting point. We can also test to see if a point p is on the line by trying to solve p = vt +o for t.
And to turn our infinite line into a line segment: simply restrict t to lie between zero and one. Since v already includes a length as well as a direction, the line segment is nicely defined -- the start point is o; the end point is o + v (by setting t to 1.)
This also is a very, very good representation for finding overlapping segments as well as crossing segments. But first, I need to introduce...
The vector dot product (Go to the code; go to the Wikipedia article)
This is the swiss-army-knife of vector operations. It takes in two vectors, and returns (the length of the first vector) * (the length of the second vector) * (the cosine of the angle between them). It is also very easy to calculate. This has a ton of uses, to name a few:
- The dot product of a vector with itself is equal to the square of its length.
- The angle between two vectors a and b, both of magnitude 1, is acos(dot(a,b)).
- If two vectors are perpendicular, their dot product will be zero.
The actual boundary generator implementation of AreParallel() uses a bit of algebra during parallel testing to eliminate all the slower and numerically unstable bits. (E.g., replacing divisions w/ multiplications; taking the cos() of the tolerance up front rather than acos() of the dot product).
The distance from a point to a line (Go to the code)
The dot product is also used to help determine the distance from a point to a line. It requires a bit of algebraic reasoning (I put a nicer version of this derivation on mathbin.net, but I don't know how long the link will last):
- The line l is defined by vectors o and v as above; the point is the p.
- If we draw a line from p to the closest point of l, it will be perpendicular to l. Call this closest point q.
- Since q is on l, there is a t value such that q = tv + o.
- Since the line from p to q is perpendicular to l, dot(v, p-q) = 0.
- Need to introduce a pair of identities: dot(a, b+c) = dot(a,b) + dot(a,c) and dot(a, sb) = (s)(dot(a,b))
- Thus, (dot(v,p) - dot(v,q)) = 0
- dot(v,p) - dot(v,(vt+o)) = 0
- dot(v,p) = dot(v,vt+o))
- dot(v,p) = dot(v,vt) + dot(v,o)
- dot(v,p) = t * dot(v,v) + dot(v,o)
- t = (dot(v,p) - dot(v,o)) / dot(v,v)
Finally: Finding the overlaps between line segments (Go to the code.)
With all these preliminaries out of the way, finding out if and how two line segments a and b overlap is quite straightforward, and the implementation of LineSeg::Overlay should be easy to read. First, check that angle between a and b's direction vectors is within the angle tolerance. Then, check to see that each endpoint of line b is within the distance tolerance of line b, saving the t values along the way. Finally, if the (min_t_val..max_t_val) range overlaps (0..1), there is an overlap! (If not, b lies upon the same infinite line as a but they don't have anything shared.) Finally, the implementation constructs the overlapping segment as well as the clipped portions of a and/or b that were not overlapping.
There are a few tweaks to keep the implementation is fast and reasonably numerically stable -- it uses squared distances everywhere instead of distances to avoid sqrt(), for example, and ends up only having two division operations involved.
There is also fairly extensive unit testing for all the above code -- essential to keep this type of code from having nasty corner-case errors!
This overlay operation alone would be enough to implement a very slow version of boundary generator: Test every line segment of every polygon against every other line segment of every polygon and record the overlaps. Unfortunately its performance is unacceptable.
A final interesting side note
The representation we chose for lines, and line segments means that everything -- all of the math and all of the code -- remains correct and true, even if we are working in three (or more!) dimensions. All that needs to change is to start using three-dimensional vectors rather than two; everything else is built atop them using vector operations that exist in any number of dimensions.
Next, when I find the time, will be a description of the quadtree spatial index that Boundary Generator uses to keep things at least somewhat snappy.
I see this sort of tutorial content all over the place in programming communities, but less often in the geospatial developer world. If you're finding this interesting or helpful, please comment or drop me an email so I keep doing it!
Monday, October 6, 2008
What it does: Given an input layer of polygons, it creates lines that represent the shared borders between those polygons, attributing it with the IDs of the polygons on each side.
Why that is useful:
- The output is an adjacency table, useful for all sorts of analyses. (For an example I'm familiar with, it is a required input to the MARXAN family of tools for automated conservation reserve design).
- The output is also a normal shapefile, allowing you view it with interesting symbology or perform further processing. (Examples: Highlight antagonistic county borders by doing a join then symbolizing on the difference in political party identification between the sides; buffer international borders and measure populations w/in given distances, etc.)
- For the programmers, it is a nice chunk of sample code for building a geoprocessing tool in C#. It is also an example of working directly with the underlying methods of computational geometry rather than relying on a pre-built API.
- Boundary Generator is extremely alpha, so expect bugs and use with caution. I have done some testing, but far from enough.
- The performance is not horrible; on my setup it runs at about 0.25 ms/line segment, or about three minutes to find the borders between the 3,488 counties in the US 2000 census. (My environment is WinXP under VirtualBox and linux, on an older 1.8Ghz laptop; the counties shapefile has ~803k line segments total.)
- You can run the tool and use the results however you like.
- You can copy and paste code from the tool, include it wholesale in a bigger project of yours, or modify it to fit your needs, so long as you never distribute the larger project to another entity (it will forever be internal-only), or you distribute your project under the GPL as well.
- If you'd like to include any code from this tool in a project of your own under other terms, you can contact me to discuss them.
- Altruistically, to give GIS practitioners a useful tool -- for free! -- and GIS developers a larger example of ArcObjects geoprocessing to examine.
- Less altruistically, my company will be releasing commercial products for GIS professionals in a fairly short timeframe; this is a great opportunity to attract potential beta users and measure the audience.
- And finally: I have some availability for low-commitment consulting on top of ongoing product development. If you are reading this, like what you see in Boundary Generator, and have a small-yet-interesting geospatial problem to solve, please contact me.