Friday, 29 May 2009

TR

I wonder if it is OK to evaluate Leidner's method on the documents where it is practical, and just exclude the ones where the complexity is too great. Maybe I can find some other feature that points to the likelihood of high complexity and blame it on web pages being too broad in scope (i.e. multiple subjects, threads, or passages per HTML file). This is probably the cause, but it will need further investigation. I think a "hotels in the midlands" page could also bust the complexity limit (lots of addresses), and that is just the sort of page we do want.

survey has 578 responses

Wow!

Wednesday, 27 May 2009

Some documents want to iterate over e.g. 8.40E+197 combinations. Obviously out of the question. The literature suggests splitting documents up into passages, but then the one-reference-per-discourse assumption is (potentially) violated. It would reduce the complexity though. The algorithm works on references at present, not occurrences of references, which it would need to do to make passages work.

Other ways to reduce complexity would be to reduce the ambiguity (though how would you select which candidates to drop?), or something else?...
To do one directory took 7 hours. One directory is about 50 files and takes much less than that using centroids.
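To see why passage splitting helps, note that resolving a whole document jointly costs the product of the per-toponym ambiguities, while resolving each passage independently costs only the sum of per-passage products. A toy sketch (the candidate counts here are invented, not taken from real documents):

```python
from math import prod

def combinations_whole(doc_counts):
    """Joint resolution: one candidate per toponym, so the number of
    complete interpretations is the product of the ambiguity counts."""
    return prod(doc_counts)

def combinations_by_passage(passages):
    """Per-passage resolution: each passage is resolved on its own,
    so the costs add instead of multiply."""
    return sum(prod(p) for p in passages)

# hypothetical document: 9 toponyms, each with 50 candidate locations
whole = combinations_whole([50] * 9)             # 50^9, about 2e15
split = combinations_by_passage([[50] * 3] * 3)  # 3 * 50^3 = 375,000
print(whole, split)
```

The catch, as noted above, is that splitting risks violating one-reference-per-discourse when the same name recurs across passages.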

so long

I let the file mentioned in the previous post process to completion (on my not very efficient version of Leidner's TR). It took about 7 hours. Not good. The algorithm is exponential; Leidner used some other heuristic first that would have reduced the complexity, but it probably only works on documents with, say, 5 locations.

Tuesday, 26 May 2009

Implementing TR from Leidner

I thought I would see how Leidner's proposed TR works with the data and documents I have. Since I use street-level data, ambiguity can be much worse. This is a problem because there is a stage which tests all possible combinations of candidate locations, builds an MBR for each, and selects the combination whose MBR area is minimal.
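The combination-testing stage described above can be sketched as a brute-force search; this is my reading of the idea, not Leidner's actual implementation, and the coordinates below are made up for illustration:

```python
from itertools import product

def mbr_area(points):
    """Area of the minimum bounding rectangle of (lat, lon) points,
    in squared degrees -- a rough proxy for geographic spread."""
    lats = [p[0] for p in points]
    lons = [p[1] for p in points]
    return (max(lats) - min(lats)) * (max(lons) - min(lons))

def resolve_by_min_mbr(candidates):
    """candidates: {toponym: [(lat, lon), ...]}.
    Tries every combination of one candidate per toponym and keeps
    the combination whose MBR has the smallest area."""
    names = list(candidates)
    best, best_area = None, float("inf")
    for combo in product(*(candidates[n] for n in names)):
        area = mbr_area(combo)
        if area < best_area:
            best, best_area = dict(zip(names, combo)), area
    return best

# toy example: Norton near Sheffield vs. a Norton in Ohio
toy = {
    "Sheffield": [(53.38, -1.47)],
    "Norton": [(53.33, -1.45), (41.03, -81.64)],
}
print(resolve_by_min_mbr(toy))
```

The `product` call is exactly where the blow-up happens: the loop runs once per combination, so the cost is the product of the candidate counts.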

I have a document with only 34 placenames in it, which results in a matrix with roughly 4 × 10^16 elements. "Union Road", for example, appears 90 times in the resource, Norton 38 times, and so on.

Since I also use the web to find documents, the chances are that there are documents with many more distinct place names in them. Some of these will also have high ambiguity. I think this makes the method unworkable in the proposed form. Another win for the apparently simplistic centroid method.

Thursday, 21 May 2009

Zipping through the responses, the photos of Hunters Bar have caused people problems.
1/ the photos are dark and unclear - I took them in winter
2/ they are too small - hampered by survey gizmo
3/ they could be anywhere - Interesting, this is a difference between what I did and street surveys
4/ many people did not know Hunters Bar anyway
5/ one suggestion of using street names
6/ another suggestion of using a map for people to draw on

It is interesting. I wondered if people would see things in the photos that they could say are or are not in the region. I wonder whether people questioned in the area could tell you if they were in HB or not. Mmmm, maybe I need to actually try that? Probably do not have time.

The difficulty of the last question was the most frequently commented-on aspect.

Also: the email address I gave was incomplete (now corrected as at date of this post).

survey gets 250 responses in a day

The survey to find people's perceptions of 4 specific imprecise regions went live yesterday to all students and staff at uni. It has already got over 250 completed responses. Great! I will have to upgrade my survey gizmo account in order to be able to read them all, but I will not do that until I know whether I need to view 1000 or 5000, which are different options at different costs per month. I seem to have created a survey that interests people anyway.

Tuesday, 19 May 2009

784,041 unique domains.
top 10:
wikipedia 54k
bbc 34k
local yahoo
estateangels
geograph
nestoria
yell
francisfrith
streetmap
bbc news
Unique URLs number 3,212,305. That's not so surprising, since many pages will be highly ranked for numerous places.
The corpus has 4,359,305 pages. I had not realised the failure rate must be quite high: I was expecting about double that (34k * 250). I may need to investigate.
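The shortfall is easy to put a number on, assuming (as the 34k * 250 figure suggests) roughly 34k queries at 250 results requested each:

```python
# Assumed from the post: ~34k queries, 250 results requested per query.
expected = 34_000 * 250          # 8,500,000 pages expected
retrieved = 4_359_305            # pages actually in the corpus
failure_rate = 1 - retrieved / expected
print(f"expected {expected:,}, got {retrieved:,} -> {failure_rate:.0%} shortfall")
```

So nearly half of the requested results never made it into the corpus, which is worth investigating.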

Tuesday, 12 May 2009

Typically, things were not as I had expected. I am now running results for a limited number of regions (203), carefully selected to give a range of sizes and to not be ambiguous. I am also looking at the error (distance from region centre) rather than distance to centroid (which I suppose would be spread independent of error).
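The error measure is just a great-circle distance from the predicted point to the known region centre. A minimal sketch using the haversine formula (the coordinates in the example are hypothetical, not from the 203 test regions):

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points,
    using the haversine formula and a mean Earth radius of 6371 km."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371 * asin(sqrt(a))

# hypothetical: predicted location vs. the centre of a test region
error_km = haversine_km(53.381, -1.470, 53.367, -1.500)
print(f"error: {error_km:.1f} km")
```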

Tomorrow is the day I am supposed to stop experimenting and start writing! I have some results, but I think I will need to keep at it whilst writing. The results are probably enough to know what the story is, though.