Monday, 30 March 2009

Something boring today. I have to back up my crawls. After all, it took weeks (months?) to collect them and I would be stuffed if they went missing. The only problem is I didn't realise how much stuff I had collected. 20 * 50 * 34000 is a lot of files and runs to nearly 300 GB. That's just one of them too.

3 comments:

  1. zip has a 2 GB limit, which is not helpful. It means I have to split up my directories first and then zip, or find something without that limit. Backups should be simple, so I am just going to buy a big external disk and copy everything onto it. A pity, because zip reduced the size by 80%! S'pose I could do it directory by directory...

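The directory-by-directory idea could look something like this (a minimal Python sketch; the function name, paths, and directory layout are all made up for illustration — it just makes one archive per top-level subdirectory so no single zip has to swallow the whole crawl):

```python
import zipfile
from pathlib import Path

def zip_each_subdir(root, out_dir):
    """Create one zip archive per top-level subdirectory of `root`.

    Splitting the crawl at directory granularity keeps each archive
    well under the classic 2 GB zip limit (assuming no single
    subdirectory is that big).
    """
    root = Path(root)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    for sub in sorted(p for p in root.iterdir() if p.is_dir()):
        archive = out_dir / (sub.name + ".zip")
        with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
            for f in sorted(sub.rglob("*")):
                if f.is_file():
                    # Store paths relative to the crawl root so the
                    # archives unpack back into the same layout.
                    zf.write(f, f.relative_to(root))
    return sorted(p.name for p in out_dir.glob("*.zip"))
```

Modern `zipfile` also supports ZIP64 for archives over 2 GB, but per-directory archives are easier to copy around and re-burn piecemeal anyway.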
  2. It's been slower than I thought.

  3. Of course, I only have the first 3 directories of full html for the os50kcorpus, but every page up to the 1000th for the regioncrawl. There are "only" 2560 regions, so 20 * 50 * 2560 of full html and 3 * 50 * 2560 of the oscorpus settlement-derived corpus.
    I wonder how useful this corpus is? I can compare techniques of searching within the os50kcorpus to get a set of pages equivalent to the regioncrawl, so it's useful to me.

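For a sense of scale, the file counts mentioned in the post and the comments multiply out like this (the numbers are straight from the text; the variable names are just labels I've invented):

```python
# File counts implied by the crawl dimensions quoted above.
regioncrawl_full_html = 20 * 50 * 2560   # full html across 2560 regions
oscorpus_derived      = 3 * 50 * 2560    # settlement-derived corpus
os50kcorpus_files     = 20 * 50 * 34000  # the count from the main post

print(regioncrawl_full_html)  # 2560000
print(oscorpus_derived)       # 384000
print(os50kcorpus_files)      # 34000000
```

So the regioncrawl's full html alone is over 2.5 million files, which explains why zipping it in one go was never going to work.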