Monday, 30 March 2009

Something boring today. I have to back up my crawls. After all, it took weeks (months?) to collect them and I would be stuffed if they went missing. The only problem is I didn't realise how much stuff I had collected. 20 * 50 * 34000 is a lot of files and runs to nearly 300 GB. That's just one of them too.

3 comments:

  1. zip has a 2 GB limit, which is not helpful. It means I have to split up my directories first and then zip, or find something without that limit. Backups should be simple, so I am just going to buy a big external disk and copy everything onto it. A pity, because zip reduced the size by 80%! S'pose I could do it directory by directory...

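The directory-by-directory idea could look something like this (a minimal Python sketch; the function name, paths, and directory layout are all made up for illustration — it just makes one archive per top-level subdirectory so no single zip has to swallow the whole crawl):

```python
import zipfile
from pathlib import Path

def zip_each_subdir(root, out_dir):
    """Create one zip archive per top-level subdirectory of `root`.

    Splitting the crawl at directory granularity keeps each archive
    well under the classic 2 GB zip limit (assuming no single
    subdirectory is that big).
    """
    root = Path(root)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    for sub in sorted(p for p in root.iterdir() if p.is_dir()):
        archive = out_dir / (sub.name + ".zip")
        with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
            for f in sorted(sub.rglob("*")):
                if f.is_file():
                    # Store paths relative to the crawl root so the
                    # archives unpack back into the same layout.
                    zf.write(f, f.relative_to(root))
    return sorted(p.name for p in out_dir.glob("*.zip"))
```

Modern `zipfile` also supports ZIP64 for archives over 2 GB, but per-directory archives are easier to copy around and re-burn piecemeal anyway.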
  2. It's been slower than I thought.

  3. Of course, I only have the first 3 directories of full html for the os50kcorpus, but every page up to the 1000th for the regioncrawl. There are "only" 2560 regions, so 20 * 50 * 2560 of full html and 3 * 50 * 2560 of the oscorpus settlement-derived corpus.
    I wonder how useful this corpus is? I can compare techniques of searching within the os50kcorpus to get a set of pages equivalent to the regioncrawl, so it's useful to me.

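For a sense of scale, the file counts mentioned in the post and the comments multiply out like this (the numbers are straight from the text; the variable names are just labels I've invented):

```python
# File counts implied by the crawl dimensions quoted above.
regioncrawl_full_html = 20 * 50 * 2560   # full html across 2560 regions
oscorpus_derived      = 3 * 50 * 2560    # settlement-derived corpus
os50kcorpus_files     = 20 * 50 * 34000  # the count from the main post

print(regioncrawl_full_html)  # 2560000
print(oscorpus_derived)       # 384000
print(os50kcorpus_files)      # 34000000
```

So the regioncrawl's full html alone is over 2.5 million files, which explains why zipping it in one go was never going to work.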