Monday, 9 November 2009

63,613 words

and most of them in the right order.

today's error

I am losing the will to live! All I want to do is write, but I keep finding minor errors in my data. So frustrating, but hopefully I am catching all the errors.

Today's blooper: in the distance-to-region-centre calculation I got the centroid wrong for Sheffield City Centre (fat fingers), thus all the refs were about 200 miles away in a big hump on the graph. Great, there is a hump, but why over there? Duly corrected, and now I must run off 4 sets of data, accumulate them, run the reports and perform all the changes to get an Excel chart that can actually be read (esp in b&w), and put it in the document. Then I can start thinking about what it means.
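For the record, the "distance to region centre" numbers are just great-circle distances from each georef to the region centroid, so a fat-fingered centroid shifts the whole distribution. A minimal sketch of the check (the centroid and refs below are illustrative values, not my real data):

from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    # great-circle distance in km between two lat/lon points
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))

# roughly Sheffield City Centre (illustrative, not the project's actual centroid)
centre_lat, centre_lon = 53.3811, -1.4701

# hypothetical georefs pulled from documents: (name, lat, lon)
refs = [("Hunters Bar", 53.3667, -1.5000), ("Norton", 53.3333, -1.4500)]

for name, lat, lon in refs:
    print(name, round(haversine_km(centre_lat, centre_lon, lat, lon), 1), "km")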

At least there are only 4 vernacular regions, not 200 like with the administrative regions, so it is quicker to correct.

Rob.

Friday, 30 October 2009

error!

I have been sorting out the graphs for the document. Not as easy as it could be, using Excel and knowing it will have to be b&w for the thesis. Anyway, as I have been outputting the results they have looked different to before, with lovely smooth curves from the web and knobbly ones from the corpus, and also far fewer georefs in the corpus. I was going to output all the graphs and then try to understand what had happened.

Turns out I got them round the wrong way; an early error, and the fact that I reference the data sets by a code rather than anything descriptive, meant I have been disproving my theory all week. When I saw the smooth graphs I could not believe they were my corpus ones, because that would suggest that my corpus is better, not just the same as the web. I also used the wrong corpus set: I used the first 100 docs, whereas I have another set that matches the file quantities for the web crawl. The correct set is building now; it is taking an age because, as I first thought, it has 5x the georefs in it (about 5 million). Someone recently called me "the stupidest clever person she knows"; Mmmmm.

I think the results will be quite good, when I get them.

I am going to have to work most of the weekend, because I spent more time than I should have going to meetings that did not happen because the person who called them did not turn up themselves. If they could only have told me, I would have saved 5 hours of my time in travel and sitting about (no fun when you are not being paid and are already working every Saturday). Oh, and apparently I didn't need to be there anyway.

Wednesday, 14 October 2009

issues

There were some issues with the last post. The number of pages for each region name from each source was different, and I counted the total number, not the mean per file. I have now created a new set from the corpus that mirrors the numbers of files from the web. The index is still different because I use Lucene, and who knows how Yahoo! do it. Thus even when a region name is in the settlement set, the pages retrieved by my index differ from the web ones.
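The fix amounts to dividing by the number of files retrieved per region name rather than comparing raw totals; a trivial sketch with invented numbers:

# hypothetical tallies: (region name, total georefs counted, files retrieved)
tallies = [("Barnet", 12000, 250), ("Bedfordshire", 4000, 100)]

for region, total, files in tallies:
    print(region, "mean georefs per file:", total / files)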

Tuesday, 13 October 2009

ambiguity and frequency

Strange. There are many more unique references in the geocode of the corpus compared to straight from the web, but as a raw count the difference is not nearly so pronounced. Must now check distance to centroid and other such stats.

Thursday, 17 September 2009

Don't follow the GPS too slavishly!

http://www.telegraph.co.uk/motoring/news/6197826/Driver-followed-satnav-to-edge-of-100ft-drop.html

Wednesday, 16 September 2009

Head down writing

48,485

I've been a bit quiet here.

I had a few family problems, but those are improving. They occupied me whilst I was supposed to be on holiday, so only 1/2 a holiday.

I am producing results, and statistics that allow me to assess them. I am also looking at the theory behind KDE surfaces; a bit mind-boggling.

Friday, 24 July 2009

better Barnet


This seems to overlap better. I have developed a program for counting the overlap, and the distance from the centroid of the given admin region to the red point in the diagram, which is the "peak".

new adaptive method

I had problems with the adaptive KDE just mentioned. It did not work well on the results of the questionnaire. I have changed it further and will now re-run it on the regions. The older one seems to have worked well on the regions, probably because the density of points is fairly uniform. The questionnaire results are based on very few points (because we had to ask people whether places were in or out of a region). Not sure whether to use the new or the nearly-new KDE. It is questionable whether to adapt the bandwidth or the "height" of the points.
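For my own notes, this is the sort of thing I mean by adapting the bandwidth (as opposed to the "height"): a generic, textbook-style sketch of a sample-point adaptive Gaussian KDE, where a pilot estimate scales each point's bandwidth so sparse areas get wider kernels. It is a sketch of the idea, not my actual code, and the points are made up:

import numpy as np

def gaussian_kde_2d(points, eval_pts, bandwidths, weights=None):
    # sum of 2D Gaussian kernels, allowing a different bandwidth (and weight) per point
    if weights is None:
        weights = np.ones(len(points))
    dens = np.zeros(len(eval_pts))
    for (x, y), h, w in zip(points, bandwidths, weights):
        d2 = (eval_pts[:, 0] - x) ** 2 + (eval_pts[:, 1] - y) ** 2
        dens += w * np.exp(-d2 / (2 * h * h)) / (2 * np.pi * h * h)
    return dens / weights.sum()

def adaptive_bandwidths(points, h0):
    # Abramson-style: scale a global bandwidth h0 by local density from a pilot estimate
    pilot = gaussian_kde_2d(points, points, np.full(len(points), h0))
    g = np.exp(np.mean(np.log(pilot)))        # geometric mean of the pilot densities
    return h0 * np.sqrt(g / pilot)            # sparse areas get larger h, dense areas smaller

pts = np.array([[0.0, 0.0], [1.0, 0.2], [0.5, 0.8], [5.0, 5.0]])   # made-up points
hs = adaptive_bandwidths(pts, h0=1.0)
grid = np.array([[x, y] for x in np.linspace(-1, 6, 8) for y in np.linspace(-1, 6, 8)])
surface = gaussian_kde_2d(pts, grid, hs)

The "height" option would instead keep h fixed and pass frequencies in via the weights argument.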

Also today will be looking at the relationship between distance and size of region in more detail.

Thursday, 16 July 2009

A county


I have created an adaptive KDE that takes account of frequency. I have thresholded this so that a region of the same area as the known region is created. For Bedfordshire it is pretty close. I am now investigating measures of overlap, which I expect will be similar to those used for showing potential error in the circle model and in using street bounding boxes to define regions.

A region!
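The thresholding step above is simple once the surface is on a grid: sort the cell densities and keep the densest cells until their combined area matches the known region's area. A rough sketch of that idea (the grid values and target area here are invented; roughly Bedfordshire-sized):

import numpy as np

def threshold_for_area(density_grid, cell_area_km2, target_area_km2):
    # find the density value whose super-level set covers about target_area_km2
    vals = np.sort(density_grid.ravel())[::-1]            # cell densities, highest first
    n_cells = int(round(target_area_km2 / cell_area_km2))
    n_cells = max(1, min(n_cells, vals.size))
    return vals[n_cells - 1]

rng = np.random.default_rng(0)
grid = rng.random((100, 100))                             # stand-in for the KDE surface on a grid
thr = threshold_for_area(grid, cell_area_km2=1.0, target_area_km2=1200.0)
region_mask = grid >= thr                                 # cells inside the derived region
print(region_mask.sum(), "cells selected")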

Friday, 19 June 2009

% of overlap between region and circle model of same region


There is quite a difference in the simple model. I'm sure it's not as bad as with BBs, but in the regions I have selected the overlap can be as low as 23%. You can see why when looking at Abercanaid/Troedyrhiw, shown here. The smaller the region, the worse this effect is (settlements are "odder" shapes than counties?).
The flattening of the circle is due to some sort of datum/projection problem; there have been lots of conversions to get this graphic (probably due to the creation of the tif, and the space it covers not being square while the tif is square). It has no effect on the ratio of overlap area to polygon/circle area.
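For anyone following along, the overlap figure is just (area of intersection) / (area of the region). My version works on rasterised tifs, but the same number can be sketched with shapely on a toy polygon in projected (metre) coordinates; the long thin rectangle below gives a suitably poor overlap:

from math import pi, sqrt
from shapely.geometry import Polygon

# toy "region": a long thin polygon, the kind that overlaps badly with a circle
region = Polygon([(0, 0), (10000, 0), (10000, 1500), (0, 1500)])

# circle of the same area, centred on the region's centroid
radius = sqrt(region.area / pi)
circle = region.centroid.buffer(radius, 64)

overlap = region.intersection(circle).area / region.area
print(f"overlap: {overlap:.0%}")    # about 43% for this shape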

Wednesday, 17 June 2009

GIS

Note to self: when thinking about "just quickly doing x, using GIS", think a bit harder. The conversion routines such as gdal and ogr can seem simple and then turn out to be quite hard to deal with. E.g. gdal_rasterize does not actually create a tif file; you have to do it yourself first, and it has to be for the right area of the earth, the right size (granularity), with the right bands etc etc. What might initially seem easy can often not be, and will involve you in projections and datums.
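As a reminder to my future self, the dance (via the Python bindings) is roughly: create the target GeoTIFF yourself with the right extent, pixel size and projection, and only then burn the polygons in. This is from memory with made-up paths and extents, so treat it as a note of the steps rather than working code:

from osgeo import gdal, ogr, osr

shp_path, tif_path = "region.shp", "region.tif"          # hypothetical paths
xmin, ymin, xmax, ymax, pixel = 400000, 300000, 460000, 360000, 100.0   # invented BNG extent

vec = ogr.Open(shp_path)                                  # the vector layer to be burned in
layer = vec.GetLayer()

# create the target tif ourselves: size, geotransform, projection, band
cols = int((xmax - xmin) / pixel)
rows = int((ymax - ymin) / pixel)
target = gdal.GetDriverByName("GTiff").Create(tif_path, cols, rows, 1, gdal.GDT_Byte)
target.SetGeoTransform((xmin, pixel, 0, ymax, 0, -pixel))
srs = osr.SpatialReference()
srs.ImportFromEPSG(27700)                                 # British National Grid
target.SetProjection(srs.ExportToWkt())

# only now can the polygons be rasterised into band 1
gdal.RasterizeLayer(target, [1], layer, burn_values=[1])
target.FlushCache()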

Friday, 12 June 2009

Norm factor


This is what it is normalised by. The orange spike is just due to only having 7 items, I guess.

normed on new regions


Normed by the underlying distribution. Smoothed by a 21-point MA. The strange spike has gone, as all the small places are now randomly selected, rather than just using all SNIS.
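(The 21-point MA is nothing clever; for completeness, a sketch of the smoothing on a made-up series:)

import numpy as np

def moving_average(values, window=21):
    # centred moving average; the ends use however much of the window fits
    v = np.asarray(values, dtype=float)
    half = window // 2
    return np.array([v[max(0, i - half):i + half + 1].mean() for i in range(len(v))])

counts = np.random.default_rng(1).poisson(5, size=200)    # stand-in for the distance buckets
smoothed = moving_average(counts, 21)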

Calculating the area of the intersection of a polygon and a circle of the same area

How hard could that be? Quite hard it seems!

Thursday, 11 June 2009

region selection

I am reselecting the regions to use for these tests. I now randomly select from the population within each band up to a maximum. Some bands have only a few members, in which case I end up selecting them all in that band.
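In code terms the selection is no more than a capped random sample per band; a small sketch of the idea (the banding function, cap and regions here are invented):

import random

def select_regions(regions, band_of, cap=25, seed=42):
    # pick up to `cap` regions at random from each size band; keep all of a small band
    rng = random.Random(seed)
    by_band = {}
    for r in regions:
        by_band.setdefault(band_of(r), []).append(r)
    chosen = []
    for band, members in by_band.items():
        chosen.extend(members if len(members) <= cap else rng.sample(members, cap))
    return chosen

# hypothetical (name, area in km2) pairs, banded by order of magnitude of area
regions = [("Hunters Bar", 1.0), ("Norton", 4.0), ("Bedfordshire", 1235.0)]
picked = select_regions(regions, band_of=lambda r: len(str(int(r[1]))))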

Wednesday, 10 June 2009

Underlying

Interesting. I suppose the spike is because all the small places are neighbourhoods of Sheffield, whereas in the other region sets there is a random spread.

normalised by region size



Monday, 8 June 2009

results

I appear to have some results to look at. We had a good supervision meeting where we discussed how to present them to best effect.

Slight change of plan

It is now unlikely that I will go away for any length of time over the summer. This is better in many ways because I will have all the resources I need to make a good job of this PhD to hand.

Tuesday, 2 June 2009

Writing begins in earnest

I have started to structure my PhD. This is involving lots of reading because I forgot why certain citations were used in the transfer report. The thesis document looks nothing like that now and I must be certain that the cites are relevant and not misquoted. Doing well so far, but re-reading takes time. There are also later/more relevant ones (eg Garbin and Mani's investigation of explicit discriminators etc).

I am preparing to move the whole process to the other end of the country on Sunday. Actually this might be a good time for a change of scene, these 4 walls here are getting boring!

Friday, 29 May 2009

TR

I wonder if it is OK to evaluate Leidner's method on those where it is practical, and just exclude the ones where the complexity is too much. Maybe I can find some other feature that points to the likelihood of high complexity and blame it on web pages being too broad in scope (ie multiple subjects, threads, passages per html file). This is probably the cause, but will need further investigation. I think a "hotels in the midlands" page could also bust the complexity limit (lots of addresses) and that is just the sort of page we do want.

survey has 578 responses

Wow!

Wednesday, 27 May 2009

Some documents want to iterate over e.g. 8.40E+197 combinations. Obviously out of the question. The literature suggests splitting up into passages, but then the one-reference-per-discourse assumption is (potentially) violated. This would reduce the complexity though. The algorithm works on references at present, not occurrences of references, which it would need in order to make passages.

Other ways to reduce complexity would be to reduce ambiguity (how would you select?), or something else?...
To do one directory took 7 hours. One directory is about 50 files and takes much less than that using centroids.

so long

I let the file mentioned in the previous post process to completion (on my not very efficient version of Leidner's TR). It took about 7 hours. Not good. It is exponential, and Leidner used some other heuristic first that would have reduced the complexity; however, it probably only works on documents with, say, 5 locations.

Tuesday, 26 May 2009

Implementing TR from Leidner

I thought I would see how Leidner's proposed TR works with the data and documents I have. Since I use street-level data, ambiguity can be much worse. This is a problem because there is a stage which tests all possible combinations of locations and builds an MBR for each; the area of this is then minimised.

I have a document with only 34 placenames in it, which results in a matrix with more or less 4 followed by 16 zeros elements. "Union Road", for example, appears 90 times in the resource, Norton 38 times, and so on.
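To make the scale concrete: the brute-force stage multiplies together the ambiguity of every placename and evaluates an MBR per combination, so 34 names with only a handful of candidate senses each is already hopeless. A toy sketch of that inner loop (the candidate coordinates are invented):

from itertools import product
from math import prod

def mbr_area(points):
    # area of the minimum bounding rectangle of a set of (lat, lon) points
    lats, lons = zip(*points)
    return (max(lats) - min(lats)) * (max(lons) - min(lons))

# hypothetical candidate senses for each placename (invented coordinates)
candidates = {
    "Union Road": [(53.37, -1.47), (51.45, -2.59), (52.20, 0.13)],
    "Norton":     [(53.33, -1.45), (54.60, -1.30)],
    "Sheffield":  [(53.38, -1.47), (50.20, -5.10)],
}

# the number of combinations is the product of the ambiguity counts;
# 34 names averaging ~3 senses each already gives ~10^16 of them
print("combinations:", prod(len(v) for v in candidates.values()))

best = min(product(*candidates.values()), key=mbr_area)   # exhaustive, hence the 7-hour runs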

Since I also use the web to find documents, the chances are that there are documents with many more distinct place names in them. Some of these will also have big ambiguity. I think this makes things unworkable in the proposed form. Another win for the apparently simplistic centroid method.

Thursday, 21 May 2009

Zipping through the responses, the photos of Hunters Bar have caused people problems.
1/ the photos are dark and unclear - I took them in winter
2/ they are too small - hampered by survey gizmo
3/ they could be anywhere - interesting, this is a difference between what I did and street surveys
4/ many people did not know Hunters Bar anyway
5/ one suggestion of using street names
6/ another suggestion of using a map for people to draw on

It is interesting. I wondered if people would see things in the photos that they could say are or aren't in the region. I wonder if people questioned in the area could tell you if they were in HB or not. Mmmm, maybe I need to actually try that? Probably do not have time.

The difficulty of the last question was the most frequently commented aspect.

Also: the email address I gave was incomplete (now corrected as at date of this post).

survey gets 250 responses in a day

The survey to find people's perceptions of 4 specific imprecise regions went live yesterday to all students and staff at uni. It has already got over 250 completed responses. Great! I will have to upgrade my survey gizmo account in order to be able to read them all, but I will not do that until I know whether I need to view 1000 or 5000, which are different options at different costs per month. I seem to have created a survey that interests people anyway.

Tuesday, 19 May 2009

784,041 unique domains.
top 10:
wikipedia 54k
bbc 34k
local yahoo
estateangels
geograph
nestoria
yell
francisfrith
streetmap
bbc news
Unique urls number 3,212,305. That's not so surprising since many pages will be highly ranked for numerous places.
The corpus has 4,359,305 pages. I did not realise the failure rate must be quite high; I was expecting roughly double that (34k * 250 is about 8.5 million). I may need to investigate.

Tuesday, 12 May 2009

Typically, things were not as I had expected. I am now running results for a limited number of regions (203), carefully selected to give a range of sizes and to not be ambiguous. I am also looking at the error (distance from region centre) rather than distance to centroid (which I suppose would be spread independent of error).

Tomorrow is the day I am supposed to stop experimenting and start writing! I have some results, but I think I will need to keep at it whilst writing. The results are probably enough to know what the story is, though.

Tuesday, 28 April 2009

Stopwording the terms seems to decrease the distance from the centroid. This is important! (at least in my little fish-bowl).

Sunday, 26 April 2009

The download of the first 250 pages for each os50k settlement name in the UK has finished. CICS are probably still trying to find their missing bandwidth, and the uni network is probably running twice as fast now (only joking; the latency of each batch of 50 has been the main cause of the slowness of the program).
Done now, and I won't have to do it again! Now to index it and see if the results are better from this corpus (it certainly has more placenames in it, judging by my indexing of the first 150 of each).

Wednesday, 22 April 2009


This is resource plotted against distance for all regions from the region crawl (direct from Yahoo!). The pointy one is osl, the twin peaks one is oscp, and the other one is os50k. I thought postcodes would tend to be closer to the centroid than street names.

Some results


I have plotted distance against a metric that compares counts in the BNC against counts in Geograph (modified from the ACM GIS paper). Not sure I can see a relationship here. Distance is x, the "common textness" measure is y, negative numbers being more common-text than geographic. Clearly it clusters at shorter distances, and tends to be above the 0 line on the other metric.
I also checked how much the "stopword" sets overlapped for the BNC-derived top 1000 and the longest-distance top 1000; they do not overlap more than random, although when you look at the words in each list they frequently seem plausible common English words. This suggests that the two lists do not validate each other, but should be combined. This however then leaves the problem of showing that they are valid to exclude. In particular I can hardly exclude all the furthest-away places and then say "look, the points cluster round the centroid better"!
I am working today on bucketing the numbers so it might be easier to see what is going on, and creating a graph showing the distribution of distances, which I will then run controlling for resource, region size and stopwordness. Hopefully these distributions will look different.
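The bucketing is just a histogram over distance; a sketch of the sort of thing I mean (bucket width and data are placeholders), after which the same counts can be re-run per resource, per region-size band, or with stopwords removed:

import numpy as np

rng = np.random.default_rng(2)
distances_km = rng.exponential(scale=40, size=5000)       # stand-in for georef-to-centroid distances

bins = np.arange(0, 410, 10)                              # 10 km buckets out to 400 km
counts, edges = np.histogram(distances_km, bins=bins)
proportions = counts / counts.sum()                       # comparable across data sets of different sizes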

Tuesday, 14 April 2009


This is the non-random set of geocodes (genuinely from the webpages), stopworded.
Now indexing the first 3 directories of os50kcorpus. I am still collecting another 2, but have run out of patience to see what exactly I have in the first 3, which is pages 1-150 of each settlement name. Will it be skewed by the way it was created?

The other thing I can do with this corpus is a search for the 2500 regions, which I can then geocode against a random selection of pages; are there any differences in scopes etc? Since this process is slow, I might choose a smaller set of "regions". The middle set is uk settlements, which are bounded by rural areas anyway and seem different to snis and counties. There are also many more of them, and it seems to me there are too many. It would probably have been better to select about 100 - 500 of them, but to put more effort into finding neighbourhood data (like snis) for different cities. Probably a bit late now.

Thursday, 9 April 2009

Not much variation with size of region. This does not support the gir2007 paper, but the methodology was different.
Bug fixed. The number of items in each resource is extremely evenly distributed. Building the stopword list now to see what that does.

Perhaps the point should be that although the georefs are evenly distributed, for smaller regions you need smaller georefs, so focusing on those is the most important thing. Don't know, will have to have a think.

Wednesday, 8 April 2009

Found a "bugette" in the counting algorithm. Basically I counted settlements separately and excluded them from the os50k count, then did not put them on the report. Fixing...
Still trying to get a report of resource totals to tally. Reading at the same time. Have read through the Monty Hall problem (Bayes) again. It is strange that most new ideas have that "oh yes, of course" feel to them, but this one just seems too counter-intuitive. Hopefully I will be able to believe in it one day...

Monday, 6 April 2009

Reading again while the machine builds a "stopword" list of geo/non-geo ambiguous placenames.
Also rebuilding counts of resource membership in regioncrawl and a randomised copy of regioncrawl. They should be the same. Will then check distance stats between the two, which will differ. Will try looking at regioncrawl resource counts when stopworded; I suspect most stopwords appear in os50k. A more marked difference between levels may be obvious when the stopworded placenames are removed.
I am still downloading the 150th-200th web pages for the os50kcorpus, and, when the machine is less busy, the 200th-250th. When I have these I will probably stop downloading pages; time is getting short and this takes too long.

Friday, 3 April 2009

Reading again today (and keeping the machine working).

Trying to find out how people have done geoparsing in the past, whether they try to use context, and if so how. I am hoping to take the view that the other georefs in a page are all the context needed and to use gazetteers to work out the relationships between them. This has been done before, but I am not sure anyone has investigated why certain assumptions should hold. Most of it has been done with implementation in mind and evaluated by testing the results.

Thursday, 2 April 2009

Doing a bit of reading in order to write today.

Wednesday, 1 April 2009

Trying to build Xaira on linux, or find a way of running it with multiple names on windows, has not been successful. Now using the xslt script found on the bnc installation disk to strip it back to text. Now I can use Lucene to index it and get those lovely tf/df counts.

Tuesday, 31 March 2009

Still backing up! It's crucial that I do this before the next step because I cannot afford to lose my data. It's boring and it's not progress, but it has to be done. I have nearly filled the 1tb partition I put aside for crawls/downloads. I have a 1tb usb drive on order, and am currently backing up to the 500gb one I got earlier in the year; it is laborious zipping each directory (each has to be <2gb). Worth doing though because text compresses really well.

I have a set of pages from a direct search of the Yahoo! index; I also have a set where all the geo references have been replaced with random ones. I will show that one is more focused than the other. However, this is not really a helpful experiment, because web pages each have scopes and now effectively my randomised ones have UK-wide scope. This does not help. What I need to do is search for a random set of web pages and compare that against the direct search. I have only got the first 150 pages of the corpus derived from the os50k settlements. I will try indexing this with Lucene and see what I get for the regions; it may be that there are not even 50 pages for each region. I cannot download more until the admin is done on my machine. I have the first 1000 pages for the region list, although I am currently repairing one of the directories where most of it was missing (never got downloaded in the first place).

I cannot install Xaira, the indexing system for the bnc, on my linux box because apt-get is broken, as I am on an old version of ubuntu which has been left behind by the upgrade path. There are ways to upgrade, but they risk losing the machine for a while, so since all the things I currently do on it still work I am not willing to risk it. I will have to download a text file and run it on my pc which has Xaira on it. Not insurmountable, but annoying.

I gave up on the idea of running 2 pcs, not sure it would save any time in the end.

Monday, 30 March 2009

Something boring today. I have to back up my crawls. After all it took weeks (months?) to collect them and I would be stuffed if they go missing. Only problem is I didn't realise how much stuff I had collected. 20 * 50 * 34000 is a lot of files and runs into nearly 300 Gb. That's just one of them too.

Friday, 27 March 2009

I have been looking for patterns. Does something in the stats I have collected correlate with better definition of regions? I cannot find anything yet; it is all surprisingly randomly distributed.

So I am going back a step. In Geo-Tagging For Imprecise Regions of Different Sizes we found that the resources from which georeferences came altered depending on the size of the region being searched for. We did this for a very small sample of a short list of region names. All the same it was a reasonable effort. The reason the sample was so small was that manual geo-tagging was employed to provide a ground truth. Thus it was possible to say where the error was. I now have a list of regions (NOT imprecise), and the boundaries for them. I am going to count the resources now for each region and see if the counts change dependent on the size of the region.

Additionally the resource rows are of various sizes within the resources (and they overlap in size). I wonder if there is a better way to characterise the sizes of the resource items? In Mapping Geographic Coverage of the Web we found Yahoo! document count a good surrogate (though certain places were very ambiguous and needed to be excluded). Maybe that will work?

Tuesday, 24 March 2009

stop words

In previous work, ambiguity was examined. Geo/non-geo ambiguity seemed to cause more problems than geo-geo. A stop word list was created by comparing counts of occurrences in a corpus of everyday English and in a geographical corpus (derived from geograph.org). When ranked by count, things that were higher in the non-geo corpus were assumed, on average, to be non-geo names, and vice versa. This rough-and-ready technique was used to prune placenames in an experiment that estimated web coverage from various sources. It improved the results, making our estimates of web coverage correlate better.
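A minimal sketch of that rough-and-ready ranking, assuming two term-count dictionaries (one from the everyday-English corpus, one from the Geograph-derived corpus); the function name, smoothing constant and cut-off are mine for illustration, not the original code:

from math import log

def geo_stopwords(general_counts, geo_counts, cutoff=1000, smoothing=1.0):
    # rank placename strings by log(general freq / geo freq); the top of the list
    # is treated as "stop words": strings more at home in everyday English than
    # in geographic text (links, login, bath, ...)
    gen_total = sum(general_counts.values()) + smoothing
    geo_total = sum(geo_counts.values()) + smoothing
    def score(term):
        p_gen = (general_counts.get(term, 0) + smoothing) / gen_total
        p_geo = (geo_counts.get(term, 0) + smoothing) / geo_total
        return log(p_gen / p_geo)
    terms = set(general_counts) | set(geo_counts)
    return sorted(terms, key=score, reverse=True)[:cutoff]

# toy counts (invented), everyday-English corpus vs geographic corpus
general = {"links": 900, "bath": 700, "sheffield": 120, "troedyrhiw": 0}
geograph = {"links": 150, "bath": 400, "sheffield": 2000, "troedyrhiw": 40}
print(geo_stopwords(general, geograph, cutoff=2))   # the common-text strings come out on top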

There may be other ways to do the same thing, perhaps based on spatial measures. I am looking at this, and hope to show that places that are often further away from the region are often stop words (links, login etc).
I currently have two corpora. One is derived from a set of administrative regions and one from all the settlement names that appear on the Ordnance Survey 1:50,000 (os50k) maps. I have the top 1000 hits from Yahoo! for each. I can index this in Lucene and I have geocoded some of it. I have downloaded the corpus 50 web pages at a time for each query. It takes about a week to geoparse and geocode one set of pages like this. I do this using GATE from Sheffield University.

Currently I am looking at ranking terms that co-occur with region names, the expectation is that places near to the region will occur more than places far from it. I am interested in terms that are single words (maybe such as steel, fishing etc) and placenames (which can be "high street", "truro" etc). Not quite sure where this is going, but it tests an assumption made in earlier work.

There is always an element of scale in this work. Previous work has looked at "The Midlands" and county-sized regions, but maybe it is possible to define them at a much smaller size, such as Hunter's Bar (a place in Sheffield, 1km x 1km) etc.

Ambiguity is the other ever-present aspect. Think for example of Sheffield: we think of the one in South Yorkshire, but there is another (very small) one in Cornwall, as well as many reasonably large Sheffields in the US. There are also 37 places called Norton in the UK, all about the same size. There is a place called Bath, one called Rugby, and many small places with names such as flood, links, login etc. All of these places exist in the OS resources as a string and some co-ordinates; there is no indication of size, population etc.

Communications

Those that know me, or are connected in some way with my study will know that I communicate sporadically. I have created this blog in the hope that I will use it to communicate what I am up to and how things are progressing.
I am doing a PhD. This investigates how to mine definitions of imprecise regions from the web. Imprecise regions are regions such as The Midlands, or The rough area around the docks. People use these as if they were placenames, yet no official definition of the extent exists.