Wednesday, 1 April 2009

Trying to build Xaira on Linux, or to find a way of running it with multiple names on Windows, has not been successful. Now using the XSLT script found on the BNC installation disk to strip the corpus back to text. Can now use Lucene to index it and get those lovely tf/df counts.
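For reference, a minimal sketch of what the two counts mean, in plain Python rather than Lucene (so none of this is Lucene's API). The function name `tf_df`, the whitespace tokenisation and the toy documents are all assumptions made for illustration: tf here is the total number of occurrences of a term across the collection, df is the number of documents that contain it at least once.

```python
from collections import Counter


def tf_df(docs):
    """Return (tf, df) Counters for a dict of {doc_id: text}.

    Hypothetical helper: tokenises by lowercasing and splitting on
    whitespace, which is far cruder than a real analyser.
    """
    tf = Counter()  # total occurrences of each term across all documents
    df = Counter()  # number of documents containing each term
    for text in docs.values():
        tokens = text.lower().split()
        tf.update(tokens)       # every occurrence counts
        df.update(set(tokens))  # each document counts once per term
    return tf, df


# Toy data, invented for this sketch.
docs = {
    "long_text": "weem hill weem road weem",
    "short_text": "the road",
}
tf, df = tf_df(docs)
print(tf["weem"], df["weem"])
print(tf["road"], df["road"])
```

A term repeated many times in one document gets a high tf but a df of 1, which is why the two counts can disagree.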

4 comments:

  1. BNC is not one text per file (too easy). So although I can get the raw text, it is not split up as it should be. Might need to find the XML tag for document separation and split on that, thus being able to get df. Otherwise tf is available, and in fact is what we used for the ACM GIS paper. Might have to do it this way for now at least.

  2. Actually maybe it is one text per file. In which case df will be meaningless because the files vary a lot in size (they are different media too). Can probably justify using tf then.

  3. If the BNC is made up of files of widely varying size then placenames will not be randomly distributed in it, because they will feature more in stories depending on the geo focus of the story. Would Google 5-gram be better?
    If I use Yahoo counts is there some sort of recursive loop?
    BNC on the whole is still a measure of "normal English wordiness" though. Just that, say, Weem (a placename occurring in a long text) will be there too much.

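A rough sketch of the "split on the document tag" idea from the first comment, using Python's standard `xml.etree.ElementTree`. The tag names `corpus`, `doc` and `w`, and the helper `texts_by_tag`, are placeholders invented for this example and are not taken from the BNC schema; the real separator tag would have to be looked up in the corpus files.

```python
import xml.etree.ElementTree as ET


def texts_by_tag(xml_string, doc_tag):
    """Return one plain-text string per <doc_tag> element.

    Hypothetical helper: assumes doc_tag elements are not nested
    inside one another, otherwise text would be counted twice.
    """
    root = ET.fromstring(xml_string)
    texts = []
    for element in root.iter(doc_tag):
        raw = "".join(element.itertext())  # all text inside, markup dropped
        texts.append(" ".join(raw.split()))  # collapse runs of whitespace
    return texts


# Invented sample, only to show the shape of the input.
sample = (
    "<corpus>"
    "<doc><w>Weem</w> <w>is</w> <w>small</w></doc>"
    "<doc><w>Another</w> <w>text</w></doc>"
    "</corpus>"
)
print(texts_by_tag(sample, "doc"))
```

Each returned string can then be treated as one document for df purposes, whatever the file boundaries happen to be.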