Tuesday, 2 October 2018

Finding word frequency in Python - Dictionary


 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
#Program to read a file(corpus) and find frequency of each token


#!/usr/bin/python
import re

#read file 
file=open("raw_corpus.txt","r+")


#dictionary to save tokens as keys and values fruquency as values
wordcount={}

for word in file.read().split():
#split() will split according to whitespace that includes ' ' '\t' and '\n'. It will split all the values into one list.
    #print (word)

    #cleaning corpus
    word = word.lower() #convert to lowercase
    word = re.sub('\.', "", word) #substitute . with empty

    #check if current token already exists in dictionary
    if word not in wordcount:
        wordcount[word] = 1
    else:
        wordcount[word] += 1


#print the dictionary with keys and values
#for k,v in wordcount.items():
    #print (k, v)

#print the dictionary with sorted keys(tokens) and values
for k in sorted(wordcount):
    print (k, wordcount[k])

This script opens the file 'raw_corpus.txt', each lines is split into words. Each word is stored in dictionary, with key as word and value as the frequency. When the same word(key) is encountered again value is incremented by 1.


No comments:

Post a Comment