#!/usr/bin/env python3
# Open a file, clean its contents, tokenize it, and identify sentence boundaries using '.'.
import re

with open('raw_corpus.txt') as fp:
    lines = fp.read().split("\n")  # lines now holds the entire file, one entry per line

i = 1  # sentence counter

# process the file contents line by line
for line in lines:
    # skip empty lines and move on to the next one
    if line == "":
        continue

    # convert to lowercase (disabled)
    # line = line.lower()

    # cleaning: put a space before each punctuation mark so it becomes its own token
    line = re.sub(r'\.', " .", line)  # substitute . with space .
    line = re.sub(r',', " ,", line)   # substitute , with space ,
    line = re.sub(r'\?', " ?", line)  # substitute ? with space ?
    line = re.sub(r'!', " !", line)   # substitute ! with space !

    # collapse runs of whitespace into a single space
    line = re.sub(r'\s+', " ", line)

    # get the sentences and words in the current line
    if line != "" and line != " ":
        sentences = line.split('.')
        for sentence in sentences:
            # print("Iam|", sentence, "|", sep='')  # debugging statement
            if sentence != "" and sentence != " ":
                words = sentence.split(' ')
                print("<Sentence Id='", i, "'>", sep='')  # sep='' suppresses the spaces print adds
                j = 1  # token counter
                for word in words:
                    if word != "" and word != " ":
                        print(j, "\t", word)
                        j = j + 1
                print(j, "\t", ".")  # the sentence-final period is the last token
                print("</Sentence>", sep='')
                i = i + 1  # increment the sentence counter
This script opens 'raw_corpus.txt' and reads its contents line by line.
It then splits each line on '.', which is treated as the sentence boundary. Each sentence is then split into tokens on spaces. The tokens are numbered from 1 within each sentence and printed inside that sentence's <Sentence> element, whose Id is the running sentence number.
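The same steps can also be sketched as two small functions, which makes the behaviour easier to check on a short string instead of a file. This is only an illustrative sketch, not the script itself: the names tokenize_sentences and format_sentences are hypothetical, the four substitutions are folded into one regular expression, and the output uses a bare tab between the token number and the token.

```python
import re


def tokenize_sentences(text):
    """Return a list of sentences, each a list of tokens, splitting sentences on '.'."""
    sentences = []
    for line in text.split("\n"):
        # Put a space before each punctuation mark so it becomes its own token.
        line = re.sub(r"([.,?!])", r" \1", line)
        for chunk in line.split("."):
            # split() with no argument splits on any whitespace and drops empty strings.
            tokens = chunk.split()
            if tokens:
                # Re-attach the period that split('.') consumed, as the final token.
                sentences.append(tokens + ["."])
    return sentences


def format_sentences(sentences):
    """Render the token lists in the <Sentence> layout (assumed: bare tab separator)."""
    out = []
    for i, tokens in enumerate(sentences, start=1):
        out.append(f"<Sentence Id='{i}'>")
        for j, token in enumerate(tokens, start=1):
            out.append(f"{j}\t{token}")
        out.append("</Sentence>")
    return "\n".join(out)


if __name__ == "__main__":
    sample = "The cat sat. It purred, loudly."
    print(format_sentences(tokenize_sentences(sample)))
```

For the sample string this yields two sentences: "The cat sat ." with four tokens and "It purred , loudly ." with five. As in the script, only '.' ends a sentence, so text ending in '?' or '!' is not split at that mark.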