Friday, 30 October 2009

error!

I have been sorting out the graphs for the document. Not as easy as it could be, using Excel and knowing it will have to be b&w for the thesis. Anyway, as I have been outputting the results, they have looked different to before, with lovely smooth curves from the web and knobbly ones from the corpus, and far fewer georefs in the corpus. I was going to output all the graphs and then try to understand what had happened.

Turns out I got them round the wrong way. An early error, plus the fact that I reference the data sets by a code rather than anything descriptive, meant I have been disproving my theory all week. When I saw the smooth graphs I could not believe they were my corpus ones, because that would suggest that my corpus is better than the web, not just the same. I also used the wrong corpus set: I used the first 100 docs, whereas I have another set that matches the file quantities for the web crawl. The correct set is building now. It is taking an age because, as I first thought, it has 5x the georefs in it (about 5 million). Someone recently called me "the stupidest clever person she knows"; Mmmmm.

I think the results will be quite good, when I get them.

I am going to have to work most of the weekend, because I spent more time than I should have going to meetings that did not happen because the person who called them did not turn up. If they had only told me, I would have saved 5 hours of my time in travel and sitting about (no fun when you are not paid bth and are already working every Saturday). Oh, and apparently I didn't need to be there anyway.

Wednesday, 14 October 2009

issues

There were some issues with the last post. The number of pages for each region name from each source was different, and I counted the total number, not the mean per file. I have now created a new set from the corpus that mirrors the numbers of files from the web. The index is still different because I use Lucene, and who knows how Yahoo! do it. Thus even when a region name is in the settlement set, the pages retrieved by my index differ from the web ones.
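
To show why the total was the wrong thing to compare, here is a minimal sketch in Python. The numbers are made up for illustration and are not my results; the names `counts_per_file` and `summarise` are just placeholders. With different numbers of files per source, the totals can match while the mean per file does not.

```python
from statistics import mean

# Hypothetical georef counts per file for one region name.
# Illustrative values only, not real results.
counts_per_file = {
    "web": [12, 8, 10],              # 3 files
    "corpus": [4, 6, 5, 5, 4, 6],    # 6 files
}

def summarise(counts):
    """Return file count, total georefs and mean georefs per file."""
    return {
        "files": len(counts),
        "total": sum(counts),
        "mean_per_file": mean(counts),
    }

summary = {source: summarise(c) for source, c in counts_per_file.items()}

for source, stats in summary.items():
    print(source, stats)
```

Here both sources have the same total, but the web has twice the mean per file, which is the figure that should have been compared when the file counts differ.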

Tuesday, 13 October 2009

ambiguity and frequency

Strange. There are many more unique references in the geocode of the corpus compared to straight from the web, but as a raw count the difference is not nearly so pronounced. Must now check distance to centroid and other such stats.
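
A rough sketch of the sort of check I mean, in Python. Everything here is an assumption for illustration: the sample georefs are invented, "unique" is taken to mean unique place names, and the centroid is a plain average of latitude and longitude, which is only reasonable for a small area away from the poles and the date line.

```python
import math
from statistics import mean

EARTH_RADIUS_KM = 6371.0

def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    h = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))

def georef_stats(georefs):
    """georefs is a list of (name, lat, lon) tuples."""
    points = [(lat, lon) for _, lat, lon in georefs]
    # Naive centroid: arithmetic mean of the coordinates (an approximation).
    centroid = (mean(p[0] for p in points), mean(p[1] for p in points))
    return {
        "raw": len(georefs),
        "unique": len({name for name, _, _ in georefs}),
        "centroid": centroid,
        "mean_dist_km": mean(haversine_km(p, centroid) for p in points),
    }

# Invented sample data, not taken from the corpus or the web crawl.
sample = [
    ("Leeds", 53.80, -1.55),
    ("Leeds", 53.80, -1.55),
    ("York", 53.96, -1.08),
    ("Hull", 53.74, -0.33),
]

stats = georef_stats(sample)
print(stats)
```

Running this once per source gives the raw count, the unique count and the mean distance to centroid side by side, so the two sets can be compared on more than the raw count.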