Reading again while the machine builds a "stopword" list of geo/non-geo ambiguous placenames.
Also rebuilding counts of resource membership in regioncrawl and in a randomised copy of regioncrawl. The counts should be the same. Will then check distance stats between the two, which will differ. Will also try looking at regioncrawl resource counts when stopworded; I suspect most stopwords appear in os50k. A more marked difference between levels may become obvious once the stopworded placenames are removed.
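A rough sketch of why the counts should match while the distance stats differ. Everything here is an assumption made for illustration: the toy gazetteer, the `crawl` layout (resource mapped to a list of placenames), the function names, and the idea that the randomised copy swaps each placename for a randomly drawn one. It is not the actual regioncrawl code.

```python
import math
import random

# Hypothetical gazetteer: placename -> (easting, northing) in km.
GAZETTEER = {
    "Leith": (327.0, 676.0),
    "Portobello": (330.0, 673.0),
    "Penzance": (147.0, 30.0),
    "Wick": (336.0, 950.0),
}

# Hypothetical crawl: resource -> placenames found in it.
CRAWL = {
    "page1": ["Leith", "Portobello"],
    "page2": ["Leith", "Portobello", "Leith"],
}


def randomise(crawl, gazetteer, seed=0):
    """Replace every placename with a randomly drawn one, keeping list lengths."""
    rng = random.Random(seed)
    names = sorted(gazetteer)
    return {res: [rng.choice(names) for _ in places]
            for res, places in crawl.items()}


def counts(crawl):
    """Number of place references per resource."""
    return {res: len(places) for res, places in crawl.items()}


def mean_pairwise_distance(places, gazetteer):
    """Mean Euclidean distance over all pairs of places in one resource."""
    pts = [gazetteer[p] for p in places]
    dists = [math.dist(a, b) for i, a in enumerate(pts) for b in pts[i + 1:]]
    return sum(dists) / len(dists) if dists else 0.0


random_crawl = randomise(CRAWL, GAZETTEER)
same_counts = counts(random_crawl) == counts(CRAWL)
```

The per-resource counts are preserved by construction, so any difference in `mean_pairwise_distance` between the two copies comes from where the places are, not how many there are.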
I am still downloading the 150-200th web pages for the os50kcorpus, and will fetch the 200-250th when the machine is less busy. Once I have these I will probably stop downloading pages; time is getting short and this takes too long.
Lucene's standard analyser has built-in stopwording, so "A avenue", "avenue", "beach, the" and "beach" were all coming out the same. Fixed and recounted. The stoplist is therefore nearly ready.
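To illustrate the collapse described here, a small Python stand-in (not Lucene itself). The `analyse` function and the three-word stop set are assumptions for the sketch; Lucene's own default English stop set is larger.

```python
import re

# Assumption: a tiny stand-in for an analyser's built-in English stop set.
STOPWORDS = frozenset({"a", "an", "the"})


def analyse(placename, stopwords=STOPWORDS):
    """Lowercase, split on non-alphanumerics, then drop stopwords."""
    tokens = re.findall(r"[a-z0-9]+", placename.lower())
    return [t for t in tokens if t not in stopwords]


names = ["A avenue", "avenue", "beach, the", "beach"]

# With stopwording on, distinct placenames collapse to the same tokens.
with_stops = {n: analyse(n) for n in names}

# Passing an empty stop set keeps them distinct.
without_stops = {n: analyse(n, stopwords=frozenset()) for n in names}
```

With stopwording enabled, "A avenue" and "avenue" both reduce to the single token `avenue`, so their counts merge; with an empty stop set they stay apart.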
Error in random place assignment also fixed: the RE for picking up postcodes did not work on despaced postcodes, and the resource from which I drew the postcodes is despaced (doh).
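A minimal sketch of a postcode RE that tolerates both spaced and despaced forms. The pattern is a simplified approximation of the UK postcode shape, not the RE actually used here, and `find_postcodes` is a hypothetical helper name.

```python
import re

# Simplified UK postcode shape (an approximation, not the full official rules):
# outward code = 1-2 letters, a digit, optional digit or letter;
# inward code  = a digit followed by 2 letters.
# \s* makes the separating space optional, so despaced postcodes match too.
POSTCODE_RE = re.compile(
    r"\b[A-Z]{1,2}[0-9][0-9A-Z]?\s*[0-9][A-Z]{2}\b",
    re.IGNORECASE,
)


def find_postcodes(text):
    """Return postcodes found in text, normalised to uppercase despaced form."""
    return [re.sub(r"\s+", "", m).upper() for m in POSTCODE_RE.findall(text)]


found = find_postcodes("Offices at EH8 9AB and SW1A1AA, also m1 1ae.")
```

Normalising every match to the despaced form means the spaced text and the despaced source resource can be compared directly.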