rob's PhD

Wednesday, 1 April 2009

Trying to build Xaira on Linux, or to find a way of running it with multiple names on Windows, has not been successful. Now using the XSLT script found on the BNC installation disk to strip the corpus back to text. Now I can use Lucene to index it and get those lovely tf/df counts.
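
For reference, a minimal sketch of what the two counts mean, assuming the stripped text is held as one string per document. This is plain Python rather than Lucene, and the `tokenize` helper, the function name and the two sample documents are made up for illustration.

```python
import re
from collections import Counter

def tokenize(text):
    # Lowercase and keep runs of letters/apostrophes; a stand-in for a real analyzer.
    return re.findall(r"[a-z']+", text.lower())

def term_and_doc_freq(docs):
    """docs maps a document name to its plain text. Returns (tf, df) Counters."""
    tf, df = Counter(), Counter()
    for text in docs.values():
        tokens = tokenize(text)
        tf.update(tokens)        # every occurrence counts towards tf
        df.update(set(tokens))   # each document counts at most once towards df
    return tf, df

# Hypothetical sample documents, not taken from the BNC.
docs = {
    "a.txt": "The road to Weem passes Weem church.",
    "b.txt": "The church is old.",
}
tf, df = term_and_doc_freq(docs)
print(tf["weem"], df["weem"])      # 2 1
print(tf["church"], df["church"])  # 2 2
```

tf counts every occurrence across the collection, while df counts the number of documents containing the term, so df only makes sense once the text is split into documents.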

4 comments:

rob, 1 April 2009 at 06:17
Correction to the post: "Not using" should read "Now using".
rob, 1 April 2009 at 06:37
BNC is not one text per file (too easy). So although I can get the raw text, it is not split up as it should be. Might need to find the XML tag for document separation and split the files on that, thus being able to get df. Otherwise tf is available, and in fact is what we used for the ACM GIS paper. Might have to do it this way for now at least.
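
A rough sketch of the splitting step, assuming each document sits inside a single non-nested element. The tag name `doc` and the sample string are placeholders, since the actual BNC separator tag has not been identified here.

```python
import re

def split_on_tag(xml_text, tag):
    """Return the text inside every <tag ...>...</tag> element, markup stripped."""
    name = re.escape(tag)
    pattern = re.compile(r"<%s\b[^>]*>(.*?)</%s>" % (name, name), re.DOTALL)
    chunks = []
    for match in pattern.finditer(xml_text):
        inner = re.sub(r"<[^>]+>", " ", match.group(1))  # drop nested markup
        chunks.append(" ".join(inner.split()))           # normalise whitespace
    return chunks

# Hypothetical input; "doc" is an assumed tag name, not the real BNC one.
sample = '<doc id="1"><p>Weem is a placename.</p></doc><doc id="2"><p>Another text.</p></doc>'
parts = split_on_tag(sample, "doc")
print(len(parts))  # 2
print(parts[0])    # Weem is a placename.
```

A regular expression is enough for a quick look, but it does not handle an element nested inside another of the same name; a proper XML parser would be the safer choice for the full corpus.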
rob, 1 April 2009 at 06:42
Actually maybe it is one text per file. In which case df will be meaningless, because the files vary a lot in size (they are different media too). Can probably justify using tf then.
rob, 1 April 2009 at 06:55
If the BNC is made up of files of widely varying sizes, then placenames will not be randomly distributed in it, because they will feature more in some stories depending on the geographic focus of the story. Would Google 5-gram be better?
If I use Yahoo counts, is there some sort of recursive loop?
BNC on the whole is still a measure of "normal English wordiness" though. It is just that, say, Weem (a placename occurring in a long text) will be there too much.
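
One way to check how much a single long text inflates a placename's tf is to look at each document's share of the total count. This is only a sketch: the function name, the document names and the counts are invented, not BNC figures.

```python
import re
from collections import Counter

def share_by_doc(docs, term):
    """Fraction of a term's total occurrences contributed by each document."""
    counts = {
        name: Counter(re.findall(r"[a-z']+", text.lower()))[term]
        for name, text in docs.items()
    }
    total = sum(counts.values())
    if total == 0:
        return {}
    # Only report documents that actually contain the term.
    return {name: c / total for name, c in counts.items() if c}

# Made-up documents: one long story about Weem and two short texts.
docs = {
    "long_story": "weem " * 40 + "river " * 10,
    "news_1": "weem river",
    "news_2": "river river",
}
shares = share_by_doc(docs, "weem")
for name, share in sorted(shares.items()):
    print(name, round(share, 3))
```

If one document accounts for nearly all of a term's occurrences, its collection-wide tf says more about that one text than about general usage.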