Wednesday, 22 April 2009

Some results


I have plotted distance against a metric that compares counts in the BNC against counts in Geograph (modified from the ACM GIS paper). I am not sure I can see a relationship here. Distance is on the x axis and the "common textness" measure is on the y axis, with negative numbers meaning a word is more common in general text than in geographic text. The points clearly cluster at shorter distances, and tend to sit above the 0 line on the other metric.
I also checked how much the "stopword" sets overlapped for the BNC-derived top 1000 and the longest-distance top 1000. They do not overlap more than random, although when you look at the words in each list they frequently seem to be plausible common English words. This suggests that the two lists do not validate each other, but should be combined. This, however, leaves the problem of showing that the words are valid to exclude. In particular, I can hardly exclude all the furthest-away places and then say "look, the points cluster round the centroid better"!
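To make "do not overlap more than random" concrete, here is a rough sketch of the comparison. It assumes both lists are drawn from one shared vocabulary of known size. The function names and the vocabulary size in the usage note are made up for illustration and are not taken from my data. For two random lists of sizes a and b drawn from N words, the expected overlap is a*b/N.

```python
def overlap(list_a, list_b):
    """Number of distinct words that appear in both lists."""
    return len(set(list_a) & set(list_b))


def expected_random_overlap(size_a, size_b, vocab_size):
    """Expected overlap of two random word lists drawn without
    replacement from the same vocabulary (hypergeometric mean).
    vocab_size is an assumption the caller has to supply."""
    return size_a * size_b / vocab_size
```

With two lists of 1000 words and a hypothetical vocabulary of 50,000 words, `expected_random_overlap(1000, 1000, 50000)` gives 20.0, so an observed overlap near 20 would be about what chance predicts.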
I am working today on bucketing the numbers so it might be easier to see what is going on, and on creating a graph showing the distribution of distances, which I will then run controlling for resource, region size and stopwordness. Hopefully these distributions will look different.
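The bucketing could look something like the sketch below. It is only an illustration: the bucket width, the distance unit, and the (group, distance) record layout are all assumptions, and "group" stands in for whichever control I end up using (resource, region size band, or stopword flag).

```python
from collections import Counter, defaultdict


def bucket_distances(distances, width):
    """Count distances in fixed-width buckets.
    Returns {bucket_lower_edge: count}, ordered by bucket."""
    counts = Counter(int(d // width) for d in distances)
    return {b * width: counts[b] for b in sorted(counts)}


def bucket_by_group(records, width):
    """Build one distance histogram per group.
    records is an iterable of (group, distance) pairs; the group
    label is a hypothetical stand-in for a control variable."""
    grouped = defaultdict(list)
    for group, distance in records:
        grouped[group].append(distance)
    return {g: bucket_distances(ds, width) for g, ds in grouped.items()}
```

Comparing the per-group histograms side by side is then a matter of plotting each dictionary as its own series.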
