Tuesday, 31 March 2009

Still backing up! It's crucial that I do this before the next step because I cannot afford to lose my data. It's boring and it's not progress, but it has to be done. I have nearly filled the 1 TB partition I put aside for crawls and downloads. I have a 1 TB USB drive on order, and am currently backing up to the 500 GB one I got earlier in the year. It is laborious zipping each directory (each has to be under 2 GB), but worth doing because text compresses really well.
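A rough sketch of how the zipping could be batched so each archive stays under the limit. It groups files by their uncompressed size, which is conservative because the text will compress further. The function name and the exact 2 GiB ceiling are my own assumptions, not something I have running.

```python
from typing import Iterable, List, Tuple

LIMIT_BYTES = 2 * 1024**3  # assumed ceiling per archive (2 GiB)


def batch_by_size(files: Iterable[Tuple[str, int]],
                  limit: int = LIMIT_BYTES) -> List[List[str]]:
    """Greedily group (path, size) pairs so each group totals less than `limit`."""
    batches: List[List[str]] = []
    current: List[str] = []
    current_size = 0
    for path, size in files:
        if size >= limit:
            raise ValueError(f"{path} alone reaches the limit")
        # Start a new batch when adding this file would reach the limit.
        if current and current_size + size >= limit:
            batches.append(current)
            current, current_size = [], 0
        current.append(path)
        current_size += size
    if current:
        batches.append(current)
    return batches
```

Each batch could then be written to its own archive with the standard `zipfile` module.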

I have a set of pages from a direct search of the Yahoo! index, and a second set in which all the geo references have been replaced with random ones. I will show that one is more focused than the other. However, this is not really a helpful experiment: web pages each have a scope, and my randomised ones now effectively have UK-wide scope. What I need to do instead is search for a random set of web pages and compare that against the direct search. I have only got the first 150 pages of the corpus derived from the os50k settlements. I will try indexing this with Lucene and see what I get for the regions; there may not even be 50 pages for each region. I cannot download more until the admin is done on my machine. I have the first 1000 pages for the region list, although I am currently repairing one of the directories where most of it was missing (it never got downloaded in the first place).

I cannot install Xaira, the indexing system for the BNC, on my Linux box because apt-get is broken: I am on an old version of Ubuntu which has been left behind by the upgrade path. There are ways to upgrade, but they risk losing the machine for a while, and since everything I currently do on it still works I am not willing to risk it. I will have to download a text file and run it on my PC, which has Xaira on it. Not insurmountable, but annoying.

I gave up on the idea of running two PCs; I am not sure it would save any time in the end.

Monday, 30 March 2009

Something boring today. I have to back up my crawls. After all, it took weeks (months?) to collect them and I would be stuffed if they went missing. The only problem is I didn't realise how much stuff I had collected. 20 * 50 * 34000 is a lot of files and runs to nearly 300 GB. That's just one of them too.
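To put numbers on that, here is the arithmetic spelled out. It assumes every slot in the 20 * 50 * 34000 grid actually holds a file and treats "nearly 300 GB" as an upper bound in decimal gigabytes, so the per-file figure is only a rough ceiling.

```python
# Factors taken straight from the entry above.
factors = (20, 50, 34_000)

total_files = 1
for f in factors:
    total_files *= f

# "Nearly 300 GB", treated as an upper bound (decimal gigabytes assumed).
total_bytes = 300 * 10**9

# Rough mean size per file if every slot is filled.
mean_bytes_per_file = total_bytes / total_files

print(f"{total_files:,} files, about {mean_bytes_per_file / 1000:.1f} kB each")
```

So that is 34 million files at somewhere under 9 kB each on average.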

Friday, 27 March 2009

I have been looking for patterns. Does something in the stats I have collected correlate with better definition of regions? I cannot find anything yet; everything is surprisingly randomly distributed.

So I am going back a step. In Geo-Tagging For Imprecise Regions of Different Sizes we found that the resources from which georeferences came altered depending on the size of the region being searched for. We did this for a very small sample of a short list of region names. All the same it was a reasonable effort. The reason the sample was so small was that manual geo-tagging was employed to provide a ground truth. Thus it was possible to say where the error was. I now have a list of regions (NOT imprecise), and the boundaries for them. I am going to count the resources now for each region and see if the counts change depending on the size of the region.
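The counting itself might look something like the sketch below. It turns raw counts into shares per region so that regions with different numbers of georeferences can be compared. The input shape (pairs of region and resource name) and the function name are assumptions for illustration, not my actual data format.

```python
from collections import Counter, defaultdict


def resource_shares(georefs):
    """Map each region to the share of its georeferences coming from each resource.

    `georefs` is an iterable of (region, resource) pairs (hypothetical format).
    """
    counts = defaultdict(Counter)
    for region, resource in georefs:
        counts[region][resource] += 1

    shares = {}
    for region, counter in counts.items():
        total = sum(counter.values())
        shares[region] = {res: n / total for res, n in counter.items()}
    return shares
```

With the region areas to hand, the regions can then be sorted by size and the shares read down the list to see whether the mix of resources drifts.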

Additionally, the resource rows are of various sizes within the resources (and they overlap in size). I wonder if there is a better way to characterise the sizes of the resource items? In Mapping Geographic Coverage of the Web we found Yahoo! document count to be a good surrogate (though certain places were very ambiguous and needed to be excluded). Maybe that will work?

Tuesday, 24 March 2009

stop words

In previous work ambiguity was examined. Geo/non-geo ambiguity seemed to cause more problems than geo/geo ambiguity. A stop word list was created by comparing counts of occurrences in a corpus of everyday English and in a geographical corpus (derived from geograph.org). When ranked by count, things that were higher in the non-geo corpus were assumed to be non-geo names on average, and vice versa. This rough and ready technique was used to prune placenames in an experiment that estimated web coverage from various sources. It improved the results, making our estimates of web coverage correlate better.
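One reading of that technique, sketched in code: rank the candidate names by count within each corpus, then flag a name as a stop word if it ranks higher in the everyday corpus than in the geographical one. This is an illustration of the idea rather than the exact procedure used; the function names and the tie-breaking rule are my assumptions.

```python
def rank_by_count(counts):
    """Rank terms 1..n by descending count, breaking ties alphabetically."""
    ordered = sorted(counts, key=lambda term: (-counts[term], term))
    return {term: rank for rank, term in enumerate(ordered, start=1)}


def geo_stop_words(candidates, everyday_counts, geo_counts):
    """Return candidates that rank higher in the everyday corpus than the geo one."""
    names = sorted(candidates)
    everyday_rank = rank_by_count({n: everyday_counts.get(n, 0) for n in names})
    geo_rank = rank_by_count({n: geo_counts.get(n, 0) for n in names})
    return [n for n in names if everyday_rank[n] < geo_rank[n]]


# Counts below are invented purely to show the mechanics.
everyday = {"links": 900, "login": 800, "bath": 300, "truro": 2}
geo = {"bath": 150, "truro": 120, "links": 5, "login": 1}
print(geo_stop_words(everyday.keys(), everyday, geo))
```

Comparing ranks rather than raw counts sidesteps the two corpora being different sizes, at the cost of ignoring how large the gap is.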

There may be other ways to do the same thing, perhaps based on spatial measures. I am looking at this, and hope to show that names which usually resolve to places far away from the region are often stop words (links, login etc).

I currently have two corpora. One is derived from a set of administrative regions and one from all the settlement names that appear on the Ordnance Survey 1:50,000 (os50k) maps. I have the top 1000 hits from Yahoo! for each. I can index this in Lucene and I have geocoded some of it. I have downloaded the corpus 50 web pages at a time for each query. It takes about a week to geoparse and geocode one set of pages like this. I do this using GATE from Sheffield University.

Currently I am looking at ranking terms that co-occur with region names. The expectation is that places near to the region will occur more than places far from it. I am interested in terms that are single words (such as steel, fishing etc) and placenames (which can be "high street", "truro" etc). I am not quite sure where this is going, but it tests an assumption made in earlier work.
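A minimal sketch of the placename half of this: count how many documents each gazetteer name appears in, and attach its great-circle distance from the region centre so the two can be compared. The data structures (documents as term lists, a gazetteer as a name to latitude/longitude dictionary) and the function names are assumptions, and the coordinates in the example are only approximate.

```python
import math
from collections import Counter


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two lat/lon points (spherical Earth)."""
    radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))


def rank_cooccurring(region_centre, documents, gazetteer):
    """Return (name, document count, distance km), most frequent name first."""
    counts = Counter()
    for doc in documents:
        for term in set(doc):  # count each name once per document
            if term in gazetteer:
                counts[term] += 1
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [(name, n, haversine_km(*region_centre, *gazetteer[name]))
            for name, n in ordered]


# Toy example; coordinates are approximate and the documents are invented.
sheffield = (53.38, -1.47)
gazetteer = {"rotherham": (53.43, -1.36), "truro": (50.26, -5.05)}
docs = [["sheffield", "rotherham", "steel"],
        ["sheffield", "rotherham"],
        ["sheffield", "truro"]]
ranking = rank_cooccurring(sheffield, docs, gazetteer)
```

If the assumption holds, distance should tend to rise as the count falls. Names that are frequent but consistently far away would be candidates for the spatial stop word idea.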

There is always an element of scale in this work. Previous work has looked at "The Midlands" and county-sized regions, but maybe it is possible to define them at a much smaller size, such as Hunter's Bar (a place in Sheffield, 1km x 1km) etc.

Ambiguity is the other ever-present aspect. Think for example of Sheffield and we think of the one in South Yorkshire; however, there is another one in Cornwall (very small) as well as many reasonably large Sheffields in the US. There are also 37 places called Norton in the UK, all about the same size. There is a place called Bath, one called Rugby and many small places with names such as flood, links, login etc. All of these places exist in the OS resources as a string and some co-ordinates; there is no indication of size, population etc.
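Since each gazetteer entry is just a name and a pair of co-ordinates, the geo/geo side of the problem can at least be measured by grouping entries on the name. A small sketch, assuming rows of (name, easting, northing); the row format and function name are mine and the sample values are made up.

```python
from collections import defaultdict


def ambiguous_names(gazetteer_rows):
    """Group (name, easting, northing) rows by name; keep names with several locations."""
    by_name = defaultdict(list)
    for name, easting, northing in gazetteer_rows:
        by_name[name.lower()].append((easting, northing))
    return {name: coords for name, coords in by_name.items() if len(coords) > 1}


# Placeholder rows: the co-ordinates are invented, not real OS values.
rows = [("Norton", 1, 1), ("Norton", 2, 2), ("norton", 3, 3), ("Truro", 4, 4)]
clashes = ambiguous_names(rows)
```

This says nothing about which candidate is meant, or about the geo/non-geo cases like links and login, but it shows how widespread the clashes are.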

Communications

Those who know me, or are connected in some way with my study, will know that I communicate sporadically. I have created this blog in the hope that I will use it to communicate what I am up to and how things are progressing.

I am doing a PhD. This investigates how to mine definitions of imprecise regions from the web. Imprecise regions are regions such as The Midlands, or the rough area around the docks. People use these as if they were placenames, yet no official definition of their extent exists.