#!/usr/bin/python
# Program to read a file (corpus) and find the frequency of each token
import re

# dictionary to save tokens as keys and their frequencies as values
wordcount = {}

# read the file; the with-statement closes it automatically
with open("raw_corpus.txt", "r") as file:
    # split() splits on whitespace, which includes ' ', '\t' and '\n',
    # and returns all the tokens in one list
    for word in file.read().split():
        # print(word)
        # cleaning corpus
        word = word.lower()             # convert to lowercase
        word = re.sub(r'\.', "", word)  # substitute . with empty
        # check if current token already exists in dictionary
        if word not in wordcount:
            wordcount[word] = 1
        else:
            wordcount[word] += 1

# print the dictionary with keys and values
# for k, v in wordcount.items():
#     print(k, v)

# print the dictionary with sorted keys (tokens) and values
for k in sorted(wordcount):
    print(k, wordcount[k])
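This script opens the file 'raw_corpus.txt', reads its whole contents, and splits them into words on whitespace. Each word is converted to lowercase, stripped of periods, and stored in a dictionary with the word as the key and its frequency as the value. When the same word (key) is encountered again, its value is incremented by 1. Finally, the tokens are printed in sorted order together with their counts.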
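The counting step described above can also be sketched with `collections.Counter` from the standard library, which handles the "new key or increment" check by itself. This is only a minimal sketch, not the script from this post: the function name `count_tokens` and the sample sentence are assumptions made for illustration, and the text is passed in as a string so the example runs without 'raw_corpus.txt'.

```python
import re
from collections import Counter


def count_tokens(text):
    """Return a Counter mapping each cleaned token to its frequency.

    count_tokens is a hypothetical helper name used for this sketch.
    """
    counts = Counter()
    for word in text.split():  # split on any whitespace
        # same cleaning as in the script: lowercase, then remove periods
        word = re.sub(r"\.", "", word.lower())
        counts[word] += 1  # missing keys start at 0 in a Counter
    return counts


# sample text invented for illustration
sample = "The cat sat. The dog sat on the mat."
counts = count_tokens(sample)

# print tokens in sorted order with their frequencies
for token in sorted(counts):
    print(token, counts[token])
```

To run it on a real corpus, the file contents could be read with `open("raw_corpus.txt").read()` and passed to `count_tokens` in place of `sample`.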