Tuesday, 28 April 2009

Stopwording the terms seems to decrease the distance from the centroid. This is important! (at least in my little fish-bowl).
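The check above can be sketched in a few lines. This is a minimal sketch with invented placenames, coordinates and stopword list, just to show the comparison: mean distance to the centroid with and without the stopworded terms.

```python
# Sketch (hypothetical data): does removing "stopworded" placenames
# tighten the cluster of georefs around its centroid?
from math import hypot

def centroid(points):
    xs, ys = zip(*points)
    return sum(xs) / len(xs), sum(ys) / len(ys)

def mean_dist_from_centroid(points):
    cx, cy = centroid(points)
    return sum(hypot(x - cx, y - cy) for x, y in points) / len(points)

# Invented example data: "bath" stands in for an ambiguous geo/non-geo term.
georefs = {"york": (4.6, 5.4), "leeds": (4.4, 5.5), "bath": (2.0, 1.0)}
stopwords = {"bath"}

all_pts = list(georefs.values())
kept_pts = [p for name, p in georefs.items() if name not in stopwords]
print(mean_dist_from_centroid(all_pts) > mean_dist_from_centroid(kept_pts))
```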

Sunday, 26 April 2009

The download of the first 250 pages for each os50k settlement name in the UK has finished. CICS are probably still trying to find their missing bandwidth, and the uni network is probably running twice as fast now (only joking: the latency of each batch of 50 has been the main cause of the program's slowness).
Done now, and I won't have to do it again! Now to index it and see whether the results from this corpus are better (it certainly has more placenames in it, judging by my indexing of the first 150 pages of each).

Wednesday, 22 April 2009


This is resource plotted against distance for all regions from the region crawl (direct from Yahoo!). The pointy one is osl, the twin-peaked one is oscp and the other one is os50k. I thought postcodes would tend to be closer to the centroid than street names.

Some results


I have plotted distance against a metric that compares counts in the BNC against counts in Geograph (modified from the ACM GIS paper). I am not sure I can see a relationship here: distance is on x, the "common textness" measure on y, with negative numbers meaning more common-text than geographic. Clearly it clusters at shorter distances, and tends to sit above the zero line on the other metric.
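One plausible form of such a measure is a log ratio of a term's relative frequencies in the two corpora; the actual modification from the ACM GIS paper may differ, and the counts below are invented. The sign convention matches the plot: negative means relatively more common-text than geographic.

```python
# Hypothetical sketch of a "common textness" score: log ratio of a term's
# relative frequency in Geograph (geographic text) vs the BNC (common text).
# Negative => the term is relatively more frequent in common text.
from math import log

def common_textness(term, bnc_counts, geo_counts, bnc_total, geo_total):
    # Add-one smoothing avoids log(0) for unseen terms.
    p_bnc = (bnc_counts.get(term, 0) + 1) / bnc_total
    p_geo = (geo_counts.get(term, 0) + 1) / geo_total
    return log(p_geo / p_bnc)

# Invented counts: "bath" is common English, "snowdon" mostly geographic.
bnc = {"bath": 900, "snowdon": 3}
geo = {"bath": 120, "snowdon": 80}
print(common_textness("bath", bnc, geo, 10_000, 10_000))     # negative
print(common_textness("snowdon", bnc, geo, 10_000, 10_000))  # positive
```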
I also checked how much the "stopword" sets overlapped for the BNC-derived top 1000 and the longest-distance top 1000. They do not overlap more than random, although when you look at the words in each list they frequently seem plausible common English words. This suggests that the two lists do not validate each other, but should be combined. That, however, leaves the problem of showing that they are valid to exclude. In particular I can hardly exclude all the furthest-away places and then say "look, the points cluster round the centroid better"!
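The "more than random" check can be made concrete: for two random size-k samples from a vocabulary of N terms, the expected intersection is about k*k/N, so the observed intersection can be compared against that. A minimal sketch, with toy lists standing in for the two top-1000 sets:

```python
# Sketch: is the overlap between two top-k "stopword" lists bigger than
# chance? Expected overlap of two random size-k draws from N terms ~ k*k/N.
def overlap_vs_random(list_a, list_b, vocab_size):
    observed = len(set(list_a) & set(list_b))
    expected = len(list_a) * len(list_b) / vocab_size
    return observed, expected

# Toy stand-ins for the BNC-derived and longest-distance lists.
obs, exp = overlap_vs_random(["a", "b", "c"], ["b", "c", "d"], vocab_size=10)
print(obs, exp)  # 2 0.9
```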
I am working today on bucketing the numbers so it might be easier to see what is going on, and on creating a graph showing the distribution of distances, which I will then run controlling for resource, region size and stopwordness. Hopefully these distributions will look different.
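The bucketing step itself is simple; a minimal sketch with made-up distances and a fixed bucket width:

```python
# Sketch of the bucketing step: count distances into fixed-width buckets so
# the shape of the distance distribution is easier to eyeball per resource.
from collections import Counter

def bucket_distances(distances, width):
    return Counter(int(d // width) for d in distances)

dists = [0.2, 0.4, 1.1, 1.3, 5.2]  # invented example distances
print(sorted(bucket_distances(dists, width=1.0).items()))
# [(0, 2), (1, 2), (5, 1)]
```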

Tuesday, 14 April 2009


This is the non-random set of geocodes (genuinely from the webpages), stopworded.
Now indexing the first 3 directories of os50kcorpus. I am still collecting another 2, but have run out of patience and want to see exactly what I have in the first 3, which is pages 1-150 of each settlement name. Will it be skewed by the way it was created?

The other thing that can happen with this corpus is a search for the 2500 regions, which I can geocode against a random selection of pages; are there any differences in scopes etc.? Since this process is slow, I might choose a smaller set of "regions". The middle set is UK settlements, which are bounded by rural areas anyway and seem different to snis and counties. There are also many more of them, and it seems to me there are too many. It would probably have been better to select about 100-500 of them, but to put more effort into finding neighbourhood data (like snis) for different cities. Probably a bit late now.

Thursday, 9 April 2009

Not much variation with size of region. This does not support the GIR2007 paper, but the methodology was different.
Bug fixed. The number of items in each resource is extremely evenly distributed. Building the stopword list now to see what that does.

Perhaps the point should be that although the georefs are evenly distributed, for smaller regions you need smaller georefs, so focusing on those is the most important thing. Don't know; will have to have a think.

Wednesday, 8 April 2009

Found a "bugette" in the counting algorithm. Basically I counted settlements separately and excluded them from the os50k count, then did not put them on the report. Fixing...
Still trying to get a report of resource totals to tally. Reading at the same time. Have read through the Monty Hall problem (Bayes) again. It is strange that most new ideas have that "oh yes, of course" quality to them, but this one just seems too counter-intuitive. Hopefully I will be able to believe in it one day...
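One way to believe in it without the Bayes algebra is brute force: simulate many games and compare the win rates for sticking versus switching (switching should win about 2/3 of the time, since it wins exactly when the first pick was wrong).

```python
# Quick sanity check on Monty Hall by simulation.
import random

def play(switch, trials=100_000, seed=0):
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        car = rng.randrange(3)
        pick = rng.randrange(3)
        # The host always opens a goat door that is neither the pick nor
        # the car, so switching wins exactly when the first pick was wrong.
        wins += (pick != car) if switch else (pick == car)
    return wins / trials

print(play(switch=False), play(switch=True))  # stick ~ 1/3, switch ~ 2/3
```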

Monday, 6 April 2009

Reading again while the machine builds a "stopword" list of geo/non-geo ambiguous placenames.
Also rebuilding counts of resource membership in regioncrawl and in a randomised copy of regioncrawl. They should be the same. Will then check the distance stats between the two, which will differ. Will try looking at regioncrawl resource counts when stopworded; I suspect most stopwords appear in os50k. A more marked difference between levels may be obvious when the stopworded placenames are removed.
I am still downloading pages 150-200 for the os50kcorpus, and when the machine is less busy, pages 200-250. When I have these I will probably stop downloading pages; time is getting short and this takes too long.

Friday, 3 April 2009

Reading again today (and keeping the machine working).

Trying to find out how people have done geoparsing in the past, whether they try to use context, and if so how. I am hoping to take the view that the other georefs in a page are all the context needed and to use gazetteers to work out the relationships between them. This has been done before, but I am not sure anyone has investigated why certain assumptions should hold. Most of it has been done with implementation in mind and evaluated by testing the results.
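The view above, that the other georefs in a page are the context, can be sketched very crudely: disambiguate an ambiguous placename by picking the gazetteer candidate nearest to the georefs already grounded on the page. The candidates and coordinates below are invented, and this ignores the gazetteer relationships (containment etc.) that the real approach would use.

```python
# Crude sketch: pick the gazetteer candidate closest (summed distance) to
# the other georefs on the page. All names and coordinates are invented.
from math import hypot

def disambiguate(candidates, page_georefs):
    def dist_to_page(pt):
        return sum(hypot(pt[0] - x, pt[1] - y) for x, y in page_georefs)
    return min(candidates, key=lambda c: dist_to_page(c[1]))

# Hypothetical gazetteer entries for "Newport": (label, (easting, northing))
candidates = [("Newport, Wales", (33.0, 18.0)), ("Newport, IoW", (45.0, 9.0))]
page = [(32.0, 17.0), (34.0, 19.0)]  # other georefs found on the page
print(disambiguate(candidates, page)[0])  # Newport, Wales
```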

Thursday, 2 April 2009

Doing a bit of reading in order to write today.

Wednesday, 1 April 2009

Trying to build Xaira on Linux, or to find a way of running it with multiple names on Windows, has not been successful. Now using the XSLT script found on the BNC installation disk to strip it back to text. Now I can use Lucene to index it and get those lovely tf/df counts.
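Lucene provides these frequencies directly from its index; purely as a reminder of what the two numbers mean, here is a tiny stand-in that computes tf (total occurrences across the corpus) and df (number of documents containing the term) for a toy corpus.

```python
# Minimal stand-in for the tf/df bookkeeping a Lucene index provides:
# tf = total term occurrences, df = number of documents containing the term.
from collections import Counter

def tf_df(docs):
    tf, df = Counter(), Counter()
    for doc in docs:
        tokens = doc.lower().split()
        tf.update(tokens)        # every occurrence counts
        df.update(set(tokens))   # at most once per document
    return tf, df

tf, df = tf_df(["the cat sat", "the dog"])
print(tf["the"], df["the"])  # 2 2
```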