rob's PhD

I currently have two corpora. One is derived from a set of administrative regions and the other from all the settlement names that appear on the Ordnance Survey 1:50,000 (os50k) maps. I have the top 1000 hits from Yahoo! for each. I can index these in Lucene, and I have geocoded some of them. I downloaded the corpus 50 web pages at a time for each query. It takes about a week to geoparse and geocode one set of pages like this. I do this using GATE from Sheffield University.

Currently I am looking at ranking terms that co-occur with region names; the expectation is that places near the region will occur more often than places far from it. I am interested in terms that are single words (such as steel or fishing) and placenames (such as "high street" or "truro"). I am not quite sure where this is going, but it tests an assumption made in earlier work.
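As a rough illustration of the ranking idea, here is a minimal Python sketch that counts how many pages mentioning a region also mention each candidate term. It is not the pipeline described above (which uses Lucene and GATE); the function name `rank_cooccurring_terms`, the candidate list and the example pages are all made up for illustration.

```python
import re
from collections import Counter


def rank_cooccurring_terms(pages, region, candidates):
    """Rank candidate terms by the number of pages in which they
    appear together with the region name (hypothetical helper)."""
    region = region.lower()
    counts = Counter()
    for text in pages:
        lowered = text.lower()
        # Only pages that mention the region can contribute co-occurrences.
        if region not in lowered:
            continue
        for term in candidates:
            # Whole-word match, so "steel" does not match "steelworks".
            pattern = r"\b" + re.escape(term.lower()) + r"\b"
            if re.search(pattern, lowered):
                counts[term] += 1
    # Most frequent co-occurring terms first; terms never seen are omitted.
    return counts.most_common()


# Toy pages, invented for this example.
pages = [
    "Sheffield is known for steel. Rotherham is nearby.",
    "Steel from Sheffield was exported widely.",
    "Truro has a fishing tradition.",
]
candidates = ["steel", "rotherham", "fishing", "truro", "high street"]
print(rank_cooccurring_terms(pages, "Sheffield", candidates))
```

Counting pages rather than raw occurrences is one design choice among several; it stops a single long page from dominating the ranking.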

There is always an element of scale in this work. Previous work has looked at "The Midlands" and county-sized regions, but maybe it is possible to define regions at a much smaller size, such as Hunter's Bar (a place in Sheffield, 1km x 1km).

Ambiguity is the other ever-present aspect. Think, for example, of Sheffield and we think of the one in South Yorkshire; however, there is another one in Cornwall (very small), as well as many reasonably large Sheffields in the US. There are also 37 places called Norton in the UK, all about the same size. There is a place called Bath, one called Rugby, and many small places with names such as flood, links, login etc. All of these places exist in the OS resources as a string and some co-ordinates; there is no indication of size, population etc.
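To make the ambiguity problem concrete, here is a small Python sketch that groups gazetteer entries by name and keeps the names that map to more than one location. It assumes each entry is just a name plus a pair of co-ordinates, as described above; the function name `ambiguous_names` and the co-ordinate values are invented for illustration and are not real OS data.

```python
from collections import defaultdict


def ambiguous_names(gazetteer):
    """Return names that refer to more than one location.

    `gazetteer` is an iterable of (name, easting, northing) tuples,
    a hypothetical stand-in for the OS name-plus-co-ordinates records.
    """
    by_name = defaultdict(list)
    for name, easting, northing in gazetteer:
        # Compare names case-insensitively.
        by_name[name.lower()].append((easting, northing))
    # A name is ambiguous if it has two or more sets of co-ordinates.
    return {name: coords for name, coords in by_name.items() if len(coords) > 1}


# Made-up co-ordinates, for illustration only.
gazetteer = [
    ("Norton", 100, 200),
    ("Norton", 300, 400),
    ("Sheffield", 500, 600),
    ("Sheffield", 700, 800),
    ("Truro", 900, 1000),
]
for name, coords in ambiguous_names(gazetteer).items():
    print(name, len(coords))
```

With nothing but a string and co-ordinates per entry, this only detects that a name is ambiguous; choosing between the candidates needs other evidence, such as the co-occurring places in the text.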

Tuesday, 24 March 2009
