Still backing up! It's crucial that I do this before the next step because I cannot afford to lose my data. It's boring and it's not progress, but it has to be done. I have nearly filled the 1 TB partition I set aside for crawls and downloads. I have a 1 TB USB drive on order, and am currently backing up to the 500 GB one I got earlier in the year. Zipping each directory is laborious (each has to be under 2 GB), but it is worth doing because text compresses really well.
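The zipping step could be scripted rather than done by hand. The following is only a minimal sketch, assuming the 2 GB ceiling applies to each archive and that uncompressed file sizes are a safe upper bound for compressible text. The names `plan_batches`, `zip_directory`, `LIMIT_BYTES` and the `.partNNN.zip` naming are all hypothetical, not part of any existing tool.

```python
import os
import zipfile

LIMIT_BYTES = 2 * 1024**3  # assumed ceiling of 2 GB per archive


def plan_batches(sizes, limit=LIMIT_BYTES):
    """Group (path, size) pairs, in order, so each batch totals less than limit.

    A single file that is itself at or over the limit ends up alone in its
    own batch; this sketch does not try to split individual files.
    """
    batches, current, total = [], [], 0
    for path, size in sizes:
        if current and total + size >= limit:
            batches.append(current)
            current, total = [], 0
        current.append(path)
        total += size
    if current:
        batches.append(current)
    return batches


def zip_directory(src_dir, out_prefix, limit=LIMIT_BYTES):
    """Write src_dir into one or more zip archives and return their paths."""
    sizes = []
    for root, _dirs, files in os.walk(src_dir):
        for name in sorted(files):
            full = os.path.join(root, name)
            sizes.append((full, os.path.getsize(full)))

    archives = []
    for i, batch in enumerate(plan_batches(sizes, limit), start=1):
        archive = f"{out_prefix}.part{i:03d}.zip"
        with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
            for full in batch:
                # Store paths relative to the directory being backed up.
                zf.write(full, os.path.relpath(full, src_dir))
        archives.append(archive)
    return archives
```

Batching on uncompressed size is deliberately conservative: for text the resulting archives should come out well under the limit, at the cost of producing more archives than strictly necessary.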
I have a set of pages from a direct search of the Yahoo! index, and a second set in which all the geo references have been replaced with random ones. I will show that one is more focused than the other. However, this is not really a helpful experiment: each web page has its own scope, and my randomised pages now effectively have UK-wide scope. What I need to do instead is search for a random set of web pages and compare that against the direct search. I have only got the first 150 pages of the corpus derived from the os50k settlements. I will try indexing this with Lucene and see what I get for the regions; there may not even be 50 pages for each region. I cannot download more until the admin is done on my machine. I have the first 1000 pages for the region list, although I am currently repairing one of the directories where most of the content was missing (it never got downloaded in the first place).
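Before setting up a full index, the per-region coverage question could be checked roughly. This is a minimal sketch only: it is not a Lucene query but a naive stand-in that counts pages containing each region name as a case-insensitive substring. The function names, the `MIN_PAGES` constant, and the assumption that pages are already loaded as plain-text strings are all hypothetical.

```python
from collections import Counter

MIN_PAGES = 50  # the per-region threshold mentioned above


def pages_per_region(pages, regions):
    """Count how many pages mention each region name at least once.

    Matching is a case-insensitive substring test, so it will over-count
    ambiguous names and miss variant spellings.
    """
    counts = Counter({region: 0 for region in regions})
    for text in pages:
        lowered = text.lower()
        for region in regions:
            if region.lower() in lowered:
                counts[region] += 1
    return counts


def under_covered(counts, minimum=MIN_PAGES):
    """Return, in sorted order, the regions with fewer than minimum pages."""
    return sorted(region for region, n in counts.items() if n < minimum)
```

Any region returned by `under_covered` would be a candidate for further downloading once that is possible again.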
I cannot install Xaira, the indexing system for the BNC, on my Linux box because apt-get is broken: I am on an old version of Ubuntu which has been left behind by the upgrade path. There are ways to upgrade, but they risk losing the machine for a while, and since everything I currently do on it still works I am not willing to risk it. I will have to download a text file and run it on my PC, which has Xaira on it. Not insurmountable, but annoying.
I gave up on the idea of running two PCs; I am not sure it would save any time in the end.
Missed another directory. Hopefully the two missed ones will be there in the morning.
I only have 12 directories out of 20 for regioncrawl. I had collected all the BOSS hits but not the actual HTML. I suppose I have made this process less than easy to follow. It might take a few days to collect the HTML, not that I am sure I need it. Collecting the os50k stuff is probably more important.