Showing posts with label tokenization. Show all posts

Tuesday, 2 October 2018

Identifying sentence boundaries in a paragraph using only full stops - Python


#!/usr/bin/env python3
#open a file, clean its contents, tokenize it and mark sentence boundaries at each '.'
import re

with open('raw_corpus.txt') as fp:
    lines = fp.read().split("\n")   #here lines contains entire file contents

#sentence incremental variable
i = 1

#to access file contents line by line
for line in lines:

#if the line is empty, skip it and move on to the next line
    if line == "":
        continue

#convert to lowercase
   # line = line.lower()

#cleaning: put a space before each punctuation mark so it becomes a separate token
    line = re.sub(r'\.', " .", line) #substitute . with space .
    line = re.sub(r',', " ,", line) #substitute , with space ,
    line = re.sub(r'\?', " ?", line) #substitute ? with space ?
    line = re.sub(r'!', " !", line)  #substitute ! with space !

#replace multiple spaces into single spaces
    line = re.sub(r'\s+', " ", line)

#split the current line into sentences at each '.'
    if line != "" and line != " ":
        sentences = line.split('.')
        
        for sentence in sentences:
            #print ("Iam|",sentence,"|",sep='') #debugging statement
            if sentence !="" and sentence !=" ":

                words = sentence.split(' ')

                print ("<Sentence Id='",i,"'>",sep='')  #use sep='' to suppress white space while printing

                j=1   #token counter
                for word in words:
                    if word != "" and word !=" ":
                        print (j,"\t",word)
                        j=j+1
                print (j,"\t",".")        #print the full stop as the last token of the sentence
                print ("</Sentence>",sep='')
                i = i + 1                 #increment i 


This script opens 'raw_corpus.txt' and reads its contents line by line.

It then splits each line at '.', which is treated as the sentence boundary. Each sentence is then split into tokens on spaces. Tokens are numbered from 1 within each sentence and printed inside that sentence's <Sentence> element.
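
The same idea can be sketched as a small reusable function. This is only an illustration, not the script above: the name split_sentences is made up for this example, and unlike the script it puts a space on both sides of each punctuation mark and closes a sentence only when it actually sees a '.' token. It still treats every '.' as a boundary, so abbreviations and decimals such as "3.14" would be split incorrectly.

```python
import re

def split_sentences(paragraph):
    """Return a list of token lists, one per sentence, using '.' as the only boundary.

    Hypothetical helper for illustration; it is not part of the original script.
    """
    # Surround . , ? ! with spaces so each mark becomes its own token.
    spaced = re.sub(r'([.,?!])', r' \1 ', paragraph)

    sentences = []
    current = []
    for token in spaced.split():      # split() with no argument drops empty strings
        current.append(token)
        if token == '.':              # a full stop closes the current sentence
            sentences.append(current)
            current = []
    if current:                       # keep trailing text that has no final full stop
        sentences.append(current)
    return sentences

# Print the result in the same <Sentence> layout as the script above.
for sentence_id, tokens in enumerate(split_sentences("Hello, world. How are you?"), start=1):
    print("<Sentence Id='", sentence_id, "'>", sep='')
    for token_id, token in enumerate(tokens, start=1):
        print(token_id, "\t", token)
    print("</Sentence>")
```

Returning the tokens as lists instead of printing them directly keeps the splitting logic separate from the output format, so the same function could feed a different writer later.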

Tokenization and sentence boundaries, assuming each sentence is on a new line


#!/usr/bin/env python3
#open a file, clean its contents and tokenize it, treating each line as one sentence
import re

with open('raw_corpus.txt') as fp:
    lines = fp.read().split("\n")   #here lines contains entire file contents

#incremental variable
i = 1

#to access file contents line by line
for line in lines:

#if the line is empty, skip it and move on to the next line
    if line == "":
        continue

#convert to lowercase
    line = line.lower()

#cleaning: put a space before each punctuation mark so it becomes a separate token
    line = re.sub(r'\.', " .", line) #substitute . with space .
    line = re.sub(r',', " ,", line) #substitute , with space ,
    line = re.sub(r'\?', " ?", line) #substitute ? with space ?
    line = re.sub(r'!', " !", line)  #substitute ! with space !

#replace multiple spaces into single spaces
    line = re.sub(r'\s+', " ", line)

#get words in current line; split() with no argument drops empty strings
    words = line.split()

    print ("<Sentence Id='",i,"'>",sep='')  #use sep='' to suppress white space while printing

    j=1   #token counter
    for word in words:
        print (j,"\t",word)
        j=j+1
        
    
    print ("</Sentence>",sep='')
    
    i = i + 1                 #increment i 

This script opens 'raw_corpus.txt', lowercases each line, and puts a space before punctuation marks so they become separate tokens.

It also prints sentence boundaries, treating each line as one sentence, and tokenizes each sentence into words.
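
As a rough sketch of the one-sentence-per-line approach, the cleaning and tokenizing steps can be pulled into functions that work on a string instead of a file. The names tokenize_line and tokenize_lines are invented for this example, and the sample text stands in for the contents of 'raw_corpus.txt'. Blank lines are skipped here rather than ending the loop.

```python
import re

def tokenize_line(line):
    """Lowercase one line and split it into word and punctuation tokens.

    Hypothetical helper for illustration; it is not part of the original script.
    """
    line = line.lower()
    # Put a space before . , ? ! so each mark is split off as its own token.
    line = re.sub(r'([.,?!])', r' \1', line)
    return line.split()               # split() with no argument drops empty strings

def tokenize_lines(text):
    """Treat every non-blank line of text as one sentence and tokenize it."""
    return [tokenize_line(line) for line in text.split("\n") if line.strip()]

# Sample input standing in for the file contents (an assumption for this sketch).
sample = "Hello, World!\n\nThis is the second sentence."

for sentence_id, tokens in enumerate(tokenize_lines(sample), start=1):
    print("<Sentence Id='", sentence_id, "'>", sep='')
    for token_id, token in enumerate(tokens, start=1):
        print(token_id, "\t", token)
    print("</Sentence>")
```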