Tuesday, 2 October 2018

Identifying sentence boundaries in a paragraph using only full stops - Python


#open a file, clean its contents, tokenize, and identify sentence boundaries using .

#!/usr/bin/python
import re

with open('raw_corpus.txt') as fp:
    lines = fp.read().split("\n")   #here lines contains entire file contents

#sentence counter
i = 1

#to access file contents line by line
for line in lines:

#stop at the first empty line
    if line == "":
        break

#convert to lowercase
   # line = line.lower()

#cleaning
    line = re.sub(r'\.', " .", line) #substitute . with space .
    line = re.sub(r',', " ,", line) #substitute , with space ,
    line = re.sub(r'\?', " ?", line) #substitute ? with space ?
    line = re.sub(r'!', " !", line)  #substitute ! with space !

#replace multiple spaces into single spaces
    line = re.sub(r'\s+', " ", line)

#get words in current line
    if line != "" and line != " ":
        sentences = line.split('.')
        
        for sentence in sentences:
            #print ("Iam|",sentence,"|",sep='') #debugging statement
            if sentence !="" and sentence !=" ":

                words = sentence.split(' ')

                print ("<Sentence Id='",i,"'>",sep='')  #use sep='' to suppress white space while printing

                j=1   #token counter
                for word in words:
                    if word != "" and word !=" ":
                        print (j,"\t",word)
                        j=j+1
                print (j,"\t.")           #re-add the full stop as the final token
                print ("</Sentence>",sep='')
                i = i + 1                 #increment i 


This script opens 'raw_corpus.txt' and reads its contents line by line.

Each line is then split on '.', which is treated as the sentence boundary. Each sentence is split into tokens on spaces. The tokens are numbered within each sentence and printed along with the current sentence ID.
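The same idea can be written as a small reusable function; this is a minimal sketch (the function name and sample text are illustrative) that mirrors the tag format printed by the script above:

```python
def tag_sentences(text):
    """Split text on '.' and emit numbered tokens per sentence."""
    out = []
    sent_id = 1
    for sentence in text.split('.'):
        tokens = sentence.split()
        if not tokens:
            continue
        out.append("<Sentence Id='%d'>" % sent_id)
        for j, tok in enumerate(tokens, start=1):
            out.append("%d\t%s" % (j, tok))
        out.append("%d\t." % (len(tokens) + 1))  # re-add the full stop as a token
        out.append("</Sentence>")
        sent_id += 1
    return out

print("\n".join(tag_sentences("Hello world. How are you.")))
```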

Finding word frequency in Python - Dictionary


#Program to read a file(corpus) and find frequency of each token


#!/usr/bin/python
import re

#read file
fp = open("raw_corpus.txt", "r")


#dictionary to save tokens as keys and their frequencies as values
wordcount={}

for word in fp.read().split():
#split() splits on any whitespace (' ', '\t', '\n') and returns all tokens in one list
    #print (word)

    #cleaning corpus
    word = word.lower() #convert to lowercase
    word = re.sub(r'\.', "", word) #substitute . with empty string

    #check if current token already exists in dictionary
    if word not in wordcount:
        wordcount[word] = 1
    else:
        wordcount[word] += 1


#print the dictionary with keys and values
#for k,v in wordcount.items():
    #print (k, v)

#print the dictionary with sorted keys(tokens) and values
for k in sorted(wordcount):
    print (k, wordcount[k])

This script opens the file 'raw_corpus.txt' and splits each line into words. Each word is stored in a dictionary, with the word as key and its frequency as value. When the same word (key) is encountered again, the value is incremented by 1.
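The standard library's collections.Counter does the same counting in one step; here is a sketch on an in-memory sample string rather than the file:

```python
import re
from collections import Counter

text = "The cat sat. The cat ran."

# normalise: lowercase and drop full stops, as in the script above
words = [re.sub(r'\.', "", w.lower()) for w in text.split()]

wordcount = Counter(words)
for k in sorted(wordcount):
    print(k, wordcount[k])
```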


Finding character frequency using Python - Dictionary


#open a file and find its character frequency
with open('raw_corpus.txt') as fp:
    lines = fp.read().split("\n")   #here lines contains entire file contents

#line counter
i = 1

#dictionary to save characters as keys and frequencies as values
charcount={}


#to access file contents line by line
for line in lines:

    #convert to lowercase
    lower_line = line.lower()

    chars = lower_line

    #for loop to access current line characters 
    for char in chars:
        if char not in charcount:
            charcount[char] = 1
        else:
            charcount[char] += 1
    
    #print (i,"\t",lower_line)
    
    i = i + 1                 #increment i 


#print the dictionary with sorted keys(tokens) and values
for k in sorted(charcount):
    print (k, charcount[k])

This script opens the file 'raw_corpus.txt', reads its contents line by line, then finds the frequency of each character and stores it in a dictionary.

A dictionary in Python is similar to a hash in Perl. It stores a value for each corresponding key; when the same key is encountered again while storing, the previous value is overwritten.
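A quick illustration of that overriding behaviour:

```python
d = {}
d['a'] = 1
d['a'] = 2   # same key: the earlier value is overwritten, not duplicated
print(d)     # {'a': 2}
print(len(d))  # 1
```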

Tokenization and sentence boundary assuming each sentence is in new line


#open a file and clean its contents

#!/usr/bin/python
import re

with open('raw_corpus.txt') as fp:
    lines = fp.read().split("\n")   #here lines contains entire file contents

#sentence counter
i = 1

#to access file contents line by line
for line in lines:

#stop at the first empty line
    if line == "":
        break

#convert to lowercase
    line = line.lower()

#cleaning
    line = re.sub(r'\.', " .", line) #substitute . with space .
    line = re.sub(r',', " ,", line) #substitute , with space ,
    line = re.sub(r'\?', " ?", line) #substitute ? with space ?
    line = re.sub(r'!', " !", line)  #substitute ! with space !

#replace multiple spaces into single spaces
    line = re.sub(r'\s+', " ", line)

#get words in current line
    words = line.split(' ')

    print ("<Sentence Id='",i,"'>",sep='')  #use sep='' to suppress white space while printing

    j=1   #token counter
    for word in words:
        print (j,"\t",word)
        j=j+1
        
    
    print ("</Sentence>",sep='')
    
    i = i + 1                 #increment i 

This script will open 'raw_corpus.txt' and remove junk characters.

It will also print sentence boundaries and tokenize each sentence into words.

Opening a file using 'open'


#open file using open file mode
fp = open('raw_corpus.txt') # Open file on read mode
lines = fp.read().split("\n") # Create a list containing all lines
fp.close() # Close file


#read file line by line
for line in lines:
    print (line)


This script will print contents of file 'raw_corpus.txt' line by line.
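A file object can also be iterated directly, one line at a time, without reading the whole file into memory first. A sketch ('sample.txt' is a hypothetical file name, created here so the example is self-contained):

```python
# write a small sample file first so the example is self-contained
with open('sample.txt', 'w') as fp:
    fp.write("first line\nsecond line\n")

# iterating the file object yields one line at a time (newline included)
fp = open('sample.txt')
for line in fp:
    print(line.rstrip('\n'))
fp.close()
```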

Reading a file using "with"


#file open example using "with" (recommended)
with open('raw_corpus.txt') as fp:
    lines = fp.read().split("\n")   #here lines contains entire file contents


#to access file contents line by line
for line in lines:
    print (line)

When you run this python script, the contents of the file 'raw_corpus.txt' are printed line by line.
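The advantage of "with" is that the file is closed automatically when the block ends, even if an error occurs. A small check ('sample.txt' is a hypothetical file name, created here so the example is self-contained):

```python
# create a small sample file
with open('sample.txt', 'w') as fp:
    fp.write("hello\n")

# read it back inside a with block
with open('sample.txt') as fp:
    data = fp.read()

print(fp.closed)   # True: the file was closed when the with block ended
print(data)
```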

Monday, 17 September 2018

Variables in Python

In Python, a variable is created as in the following example:

x = 5 
y = "John"
print(x)
print(y)

Rules for variable names:

1. It should start with either a letter or the underscore character.
2. A variable name cannot start with a number.
3. It can only contain alphanumeric characters and underscores (A-Z, a-z, 0-9, and _).
4. Uppercase and lowercase names are treated differently (variable names are case-sensitive).
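These rules can be checked with the built-in str.isidentifier() method:

```python
print("my_var".isidentifier())   # True: letters and an underscore
print("_hidden".isidentifier())  # True: may start with an underscore
print("2fast".isidentifier())    # False: cannot start with a number
print("my-var".isidentifier())   # False: hyphens are not allowed
```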

Python has five standard data types:
  • Numbers
  • String
  • List
  • Tuple
  • Dictionary
In the example above x is a number type and y is a string type variable.

A list contains items separated by commas and enclosed within square brackets []. To some extent, lists are similar to arrays in C. One difference between them is that the items belonging to a list can be of different data types.

Example: 

#!/usr/bin/python

mylist = [ 'abcd', 786 , 2.23, 'john', 70.2 ]
smalllist = [123, 'john']

print (mylist)             # Prints complete list
print (mylist[0])          # Prints first element of the list
print (mylist[1:3])        # Prints elements starting from 2nd till 3rd
print (mylist[2:])         # Prints elements starting from 3rd element
print (smalllist * 2)      # Prints list two times
print (mylist + smalllist) # Prints concatenated lists
  

Tuples are like lists except that they are read-only. They are enclosed in parentheses instead of square brackets (used for lists). While lists can be updated, tuples cannot.

Example:

#!/usr/bin/python

mytuple = ( 'abcd', 786 , 2.23, 'john', 70.2 )
smalltuple = (123, 'john')

print (mytuple)              # Prints complete tuple
print (mytuple[0])           # Prints first element of the tuple
print (mytuple[1:3])         # Prints elements starting from 2nd till 3rd
print (mytuple[2:])          # Prints elements starting from 3rd element
print (smalltuple * 2)       # Prints tuple two times
print (mytuple + smalltuple) # Prints concatenated tuples
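Attempting to update a tuple raises a TypeError, which demonstrates the read-only behaviour:

```python
mytuple = ('abcd', 786, 2.23)
try:
    mytuple[0] = 'xyz'   # not allowed: tuples are immutable
except TypeError as e:
    print("cannot update tuple:", e)

mylist = ['abcd', 786, 2.23]
mylist[0] = 'xyz'        # allowed: lists are mutable
print(mylist)
```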

Dictionary:

 In Python, dictionaries are a kind of hash table. They work like hashes in Perl and consist of key-value pairs. Dictionaries are enclosed in curly braces { }, and values are assigned and accessed using square brackets [].

 Example : 
 
#!/usr/bin/python

mydict = {}              # avoid the name 'dict', which shadows the built-in type
mydict['one'] = "This is one"
mydict[2]     = "This is two"

dicts = {'name': 'john', 'code': 6734, 'dept': 'sales'}


print (mydict['one'])    # Prints value for 'one' key
print (mydict[2])        # Prints value for 2 key
print (dicts)            # Prints complete dictionary
print (dicts.keys())     # Prints all the keys
print (dicts.values())   # Prints all the values

References:

https://www.tutorialspoint.com/python/python_variable_types.htm

https://www.w3schools.com/python/python_variables.asp

Monday, 10 September 2018

How to remove 'whitespace' in the variable while 'print'ing?

Python3: 

The separator argument sep is used to suppress the whitespace that print() inserts between its arguments:


  print (name, ",How are you?",sep='')

To suppress the line terminator (\n) that print() adds at the end, use the end argument:


print (name, ",How are you?",end='')
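Both arguments can be combined; a small example (the name variable is illustrative):

```python
name = "John"

# sep='' joins the arguments with no separator in between
print(name, ",How are you?", sep='')   # John,How are you?

# end='' suppresses the trailing newline, so the next print continues the line
print("first", end='')
print("second")                        # firstsecond
```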

Saturday, 8 September 2018

Useful and Important commands for Fedora 28

After fresh installation:

sudo dnf update  

Enable and start SSH:

sudo systemctl start sshd.service

sudo systemctl enable sshd.service

RPM Fusion:

sudo dnf install https://download1.rpmfusion.org/free/fedora/rpmfusion-free-release-$(rpm -E %fedora).noarch.rpm

Install Vlc:

sudo dnf install vlc

Restart apache:   

systemctl restart httpd  
                       or
service httpd restart
  
Enable apache on startup:

systemctl enable httpd

Allow http connections:

firewall-cmd --add-service=http --permanent
firewall-cmd --reload

Install  php and Mysql(mariadb):

dnf install php-cli
dnf install mariadb mariadb-server
systemctl restart mariadb

finalize mariadb installation

/usr/bin/mysql_secure_installation
dnf install php-mysqlnd (For php-mysql driver)

After doing all these restart apache

systemctl restart httpd

Make a bootable USB DRIVE(pendrive) in Linux

Since Ubuntu 12, making a bootable pendrive using Startup Disk Creator has not been smooth. Of course, the method described here can be used on other Linux platforms too.

There is a new cross-platform, open-source tool for burning images to SD cards and USB drives. It's called Etcher.


Download the Etcher AppImage from the link below:

Once downloaded, you need to make it executable. Right-click on the downloaded file and go to Properties.


In there, check the “Allow executing file as program” option.

Then double-click Etcher, click on Select image, and browse to the location where you downloaded the ISO. Etcher automatically recognizes the USB drive; you can change it if you have multiple USB drives plugged in. Once the ISO and USB drive are selected, it gives you the option to flash the ISO to the USB drive. Click on Flash to start flashing the drive with the selected ISO.

By the end of this process you will have a bootable USB drive with your ISO. 

 Reference and Credits: https://itsfoss.com/create-fedora-live-usb-ubuntu/