Labels

Showing posts with label python. Show all posts

Monday, 18 March 2019

Finding last element in array - Check if current element is last element in Python arrays

When looping in Python, we may run into a situation where we need to know whether we have reached the end of the array. For example, we iterate over each element, check whether it exists in a hash/dictionary/database, and even after a match we still want to continue until the end of the array is reached.

There are multiple ways of achieving this, but I am going to give a simple solution that is easy to understand.

for suffix in suffixes:
    if(re.search(suffix+'$', my_string)):
        pass  #do something
    elif(suffix == suffixes[-1]):
        #this branch runs only when the last element did not match
        print("reached end of array")

So the line that checks for the last element of array is

        elif(suffix == suffixes[-1]):

Similarly, to access the second-to-last element use suffixes[-2], and so on.

Hope it's useful.
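One caveat with comparing against suffixes[-1]: the comparison is by value, so if the same value appears earlier in the list it will also look like the last element. A minimal sketch of an index-based check using enumerate is below; the function name flag_last and the sample list are made up for illustration.

```python
def flag_last(items):
    """Pair every item with True/False telling whether it is the last one."""
    last_index = len(items) - 1
    flagged = []
    for index, item in enumerate(items):
        # compare positions, not values, so duplicates are handled correctly
        flagged.append((item, index == last_index))
    return flagged


suffixes = ['en', 'ren', 'en']  # 'en' appears twice on purpose
flags = flag_last(suffixes)
print(flags)

# the value comparison is already true for the first element here
value_check_on_first = (suffixes[0] == suffixes[-1])
print(value_check_on_first)
```

With duplicates in the list, the value comparison reports the end too early, while the index comparison is true only once.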

Thursday, 14 March 2019

python capture regex groups in variables



#python script to capture regex matched groups into variables
import re

suffix = 'en'
word = 'children'

#print(word,suffix)

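m = re.search(r'(.*)' + re.escape(suffix) + '$', word)

#re.search returns None when there is no match, so check before reading the group
if m:
    print(m.group(1))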
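If more than one group is needed, the groups can be unpacked into separate variables in one step. The sketch below is only an illustration; the function name split_suffix is my own and not part of the re module.

```python
import re


def split_suffix(word, suffix):
    """Return (stem, suffix) when word ends with suffix, otherwise None."""
    # re.escape keeps characters such as '.' in the suffix from acting as regex syntax
    m = re.search(r'^(.*)(' + re.escape(suffix) + r')$', word)
    if m is None:
        return None
    stem, ending = m.groups()  # one variable per captured group
    return stem, ending


print(split_suffix('children', 'en'))
print(split_suffix('child', 'en'))
```

The first call gives the stem and the suffix as two values; the second gives None because 'child' does not end in 'en'.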

Friday, 8 March 2019

What is a hash or a dictionary in Python? Understanding hashes.

Hash tables, or dictionaries as they are called in Python, are associative arrays. Wikipedia defines an associative array as a collection of (key, value) pairs in which each key appears at most once.

Question: Why can't we use arrays?

Answer: Because finding an element in an array is slow. The search loops through the elements one by one until the element is found, which hurts efficiency when the array is large. A hash table solves this problem, since an element can be looked up by its key quickly without looping through the entire collection.

Okay, let's dig deeper...

I am going to explain this with an everyday example: we are going to store the relations in a family and their names in a hash.

The hash is named family_dict and starts out empty: family_dict = {}

Here are the elements I am going to store in it: wife, son, daughter, friend, father, mother...

All these relations have a name that we can call them by. To build a hash we need keys and values. Identifying which is which is the important part, because that choice decides whether the hash actually serves our need.

In our hash we are going to store relations and their names. One thing to keep in mind while building a hash is that keys must be unique; values have no such restriction.

So our family hash needs something unique as keys. Names can't be used, since many people can share the same name. Therefore, our hash is going to contain relations as keys and names as the corresponding values.

Now there is a question: what if relations repeat too, as when we have several brothers or sisters? We simply number the keys, such as brother1 and sister2, to make them unique. Enough theory; let's start the implementation.

family_dict = {
  "me": "Mr.x",
  "father": "Mr.y",
  "mother": "Mrs.z",
  "son":"kid1",
  "daughter1":"d1",
  "daughter2":"d2",
  "wife":"w1"
}

This seems simple: each key is mapped to its corresponding value. Now think of a situation where we need one key to point to many values. Let's assume Mr.x is a bit cheeky and has another wife, w2. One just can't put two wives in the same house. Likewise, when you add the same key again with a different value, such as "wife":"w2", the hash keeps only one entry for that key, holding the value added last. That is, the previous value is overwritten when the same key is added again.

To solve this, Mr.x would compromise with his family and come to an agreement to put the two of them in the same hut. But how do we do it here? No delay, just scroll down.

family_dict = {
  "me": "Mr.x",
  "father": "Mr.y",
  "mother": "Mrs.z",
  "son":"kid1",
  "daughter1":"d1",
  "daughter2":"d2",
  "wife":"w1, w2"
}

If you look closely, we just updated the value of the key so that it holds the previous value too. So in real-world programming we should always check whether a key already exists in the hash before storing, to make sure all our values are kept and none is overwritten by a duplicate entry.

And this is how it is done. Now think of a hash that stores how many times each word occurs in a given text. For this we store each word as a key and its count as the value. Every time a word is seen again, we look up its value and increment it by one.
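Instead of gluing the names into one string like "w1, w2", the value can also be a list, which keeps every name separate. This is only a sketch of that idea; the helper name add_relation is made up for illustration.

```python
def add_relation(family, relation, name):
    """Store name under relation without losing names stored earlier."""
    # setdefault returns the existing list, or creates an empty one for a new key
    family.setdefault(relation, []).append(name)


family_dict = {}
add_relation(family_dict, "wife", "w1")
add_relation(family_dict, "wife", "w2")  # same key again: appended, not overwritten
add_relation(family_dict, "son", "kid1")
print(family_dict)
```

Here setdefault does the "check if the key already exists" step for us, so nothing is overwritten when the same key comes again.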

    if word not in wordcount:
        wordcount[word] = 1
    else:
        wordcount[word] += 1

And finally we will print our hash:

#print the dictionary with sorted keys(words) and count as values
for k in sorted(wordcount):
    print (k, wordcount[k])

This has been a bit of a long post, but thanks for coming here.
Bonus: Finding word frequency using Python

Wednesday, 6 March 2019

PDF Splitter using PyPDF2 module of Python - Split PDF into multiple pages

Often when working with a large PDF we stumble upon a need to have each page of the PDF in a separate PDF file.

In this article we are going to do exactly that, not with a Linux command but with Python.

First let's get our dependencies installed. Just run the command below and you are all set.

pip3 install pypdf2

Now open an editor and save the following code.

from PyPDF2 import PdfReader, PdfWriter

# PyPDF2 3.x names; releases before 3.0 used PdfFileReader, PdfFileWriter,
# numPages, getPage and addPage for the same steps.
with open("largefile.pdf", "rb") as inputStream:
    inputpdf = PdfReader(inputStream)

    # write every page into its own single-page PDF
    for i in range(len(inputpdf.pages)):
        output = PdfWriter()
        output.add_page(inputpdf.pages[i])
        with open("largefile-page%s.pdf" % i, "wb") as outputStream:
            output.write(outputStream)

Thanks to StackOverflow for this part. That's it; now you are ready to split a PDF file into multiple files, each containing one page.

Thursday, 28 February 2019

OpenNMT installation using PyTorch

Reference from OpenNMT Official Site

This manual will guide you through installing OpenNMT with PyTorch.

We assume that you already have python3 and 'pip3', the python3 package installer, and that your OS is Linux (ideally Ubuntu).

Step1 Install PyTorch

pip3 install torch torchvision

If you are on Python 2, use the command below instead.

pip install torch torchvision

While this package downloads and installs you may have a cup of tea, as this will take a while. Remember that you need a good internet connection, as this package is about 582.5MB.

Step2 Clone the OpenNMT-py repository

git clone https://github.com/OpenNMT/OpenNMT-py
cd OpenNMT-py

Step3 Install required libraries

pip3 install -r requirements.txt

For python2 use

pip install -r requirements.txt

That's it; now you are ready to take off. To get familiar with how to use OpenNMT, see the OpenNMT official site referenced above.

Monday, 25 February 2019

Python Mysql connection and extracting sample data

In this tutorial we are going to learn how to connect to MySQL using Python.

First of all we need a config.py that will act as the configuration file our Python script reads. It holds the MySQL username and password, the database name and the host where MySQL is running.

This is what config.py looks like:

server = dict(
    #serverip = 'localhost',
    dbhost = 'localhost',
    dbname = 'userdb',
    dbuser = 'root',
    dbpassword = 'root123'
)

Then comes the actual Python script, which connects to our db using these parameters and fetches results as needed.
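The script itself is not shown here, so what follows is only a rough sketch of how such a script could look, not the original one. It assumes the mysql-connector-python driver (pip3 install mysql-connector-python); the function names and the table name users are hypothetical, and the server dict is repeated inline as a stand-in for "from config import server" so that the sketch is self-contained.

```python
# Stand-in for "from config import server" so the sketch is self-contained
server = dict(
    dbhost='localhost',
    dbname='userdb',
    dbuser='root',
    dbpassword='root123',
)


def connection_params(cfg):
    """Map the config.py keys to the keyword names mysql.connector.connect() expects."""
    return {
        'host': cfg['dbhost'],
        'database': cfg['dbname'],
        'user': cfg['dbuser'],
        'password': cfg['dbpassword'],
    }


def fetch_rows(cfg, query):
    """Run query and return all rows (needs a running MySQL server)."""
    import mysql.connector  # assumed driver, imported here so the rest works without it

    conn = mysql.connector.connect(**connection_params(cfg))
    try:
        cursor = conn.cursor()
        cursor.execute(query)
        return cursor.fetchall()
    finally:
        conn.close()  # always release the connection


# Example call; the table name "users" is hypothetical:
# rows = fetch_rows(server, "SELECT * FROM users LIMIT 5")
print(connection_params(server))
```

Keeping the credentials in config.py and only mapping them here means the script does not change when the database moves to another host.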

Tuesday, 2 October 2018

Identifying sentence boundary in a paragraph for only fullstops - Python


#!/usr/bin/python
#open a file, clean its contents, tokenize and identify its sentence boundaries using .
import re

with open('raw_corpus.txt') as fp:
    lines = fp.read().split("\n")   #here lines contains entire file contents

#sentence incremental variable
i=1;

#to access file contents line by line
for line in lines:

#if empty skip the current iteration
    if line == "":
        continue

#convert to lowercase
   # line = line.lower()

#cleaning
    line = re.sub(r'\.', " .", line) #substitute . with space .
    line = re.sub(r',', " ,", line) #substitute , with space ,
    line = re.sub(r'\?', " ?", line) #substitute ? with space ?
    line = re.sub(r'!', " !", line)  #substitute ! with space !

#replace multiple spaces into single spaces
    line = re.sub(r'\s+', " ", line)

#get words in current line
    if line != "" and line != " ":
        sentences = line.split('.')
        
        for sentence in sentences:
            #print ("Iam|",sentence,"|",sep='') #debugging statement
            if sentence !="" and sentence !=" ":

                words = sentence.split(' ')

                print ("<Sentence Id='",i,"'>",sep='')  #use sep='' to suppress white space while printing

                j=1   #token counter
                for word in words:
                    if word != "" and word !=" ":
                        print (j,"\t",word)
                        j=j+1
                print (j,"\t",".")     #the full stop is the last token of the sentence
                print ("</Sentence>",sep='')
                i = i + 1                 #increment i 


This script opens 'raw_corpus.txt' and reads its contents line by line.

It then splits each line on '.', which is treated as the sentence boundary. Each sentence is then split into tokens on spaces. The tokens are numbered within each sentence and printed inside that sentence's <Sentence> tag.
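The same steps can be packed into a small function that works on a string instead of a file, which makes them easier to try out. This is a simplified sketch and not a replacement for the script; the function name split_sentences is made up, and like the script it treats only the full stop as a sentence boundary.

```python
import re


def split_sentences(line):
    """Return a list of sentences, each one a list of tokens ending in '.'."""
    # put a space before punctuation so it becomes a token of its own
    line = re.sub(r'([.,?!])', r' \1', line)
    sentences = []
    for chunk in line.split('.'):
        tokens = chunk.split()  # split() with no argument drops empty tokens
        if tokens:
            sentences.append(tokens + ['.'])  # the full stop closes every sentence
    return sentences


result = split_sentences("Hello, world. How are you.")
for number, tokens in enumerate(result, start=1):
    print(number, tokens)
```

Because split() without arguments ignores repeated spaces, the separate "replace multiple spaces" step is not needed in this version.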

Finding word frequency in Python - Dictionary


#!/usr/bin/python
#Program to read a file(corpus) and find frequency of each token
import re

#dictionary to save tokens as keys and their frequency as values
wordcount={}

#read file (read-only mode is enough); "with" closes it automatically
with open("raw_corpus.txt","r") as file:
    #split() splits on whitespace, which includes ' ', '\t' and '\n', and returns all tokens in one list
    words = file.read().split()

for word in words:
    #print (word)

    #cleaning corpus
    word = word.lower() #convert to lowercase
    word = re.sub(r'\.', "", word) #substitute . with empty

    #skip tokens that were only full stops
    if word == "":
        continue

    #check if current token already exists in dictionary
    if word not in wordcount:
        wordcount[word] = 1
    else:
        wordcount[word] += 1


#print the dictionary with keys and values
#for k,v in wordcount.items():
    #print (k, v)

#print the dictionary with sorted keys(tokens) and values
for k in sorted(wordcount):
    print (k, wordcount[k])

This script opens the file 'raw_corpus.txt' and splits its contents into words. Each word is stored in a dictionary, with the word as key and its frequency as value. When the same word (key) is encountered again, its value is incremented by 1.
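For comparison, the standard library has collections.Counter, which does the "check and increment" part by itself. The sketch below counts words in a string rather than a file so it can be run as it is; the function name word_frequency and the sample sentence are made up.

```python
from collections import Counter


def word_frequency(text):
    """Count lowercased words in text, ignoring full stops."""
    words = (w.lower().replace('.', '') for w in text.split())
    # skip tokens that become empty after removing the full stops
    return Counter(w for w in words if w)


counts = word_frequency("The cat sat. The dog sat.")
for word in sorted(counts):
    print(word, counts[word])
```

A Counter is still a dictionary underneath, so sorting and printing work the same way as with the plain dict above.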


Finding character frequency using Python - Dictionary


#open a file and find its character frequency
with open('raw_corpus.txt') as fp:
    lines = fp.read().split("\n")   #here lines contains entire file contents

#incremental variable (line counter)
i=1

#dictionary to save characters as keys and their frequency as values
charcount={}


#to access file contents line by line
for line in lines:

    #convert to lowercase
    lower_line = line.lower()

    chars = lower_line

    #for loop to access current line characters 
    for char in chars:
        if char not in charcount:
            charcount[char] = 1
        else:
            charcount[char] += 1
    
    #print (i,"\t",lower_line)
    
    i = i + 1                 #increment i 


#print the dictionary with sorted keys(tokens) and values
for k in sorted(charcount):
    print (k, charcount[k])

This script opens the file 'raw_corpus.txt', reads its contents line by line, then counts the frequency of each character and stores it in a dictionary.

A dictionary in Python is similar to a hash in Perl. It stores a value for each key; when the same key is stored again, the earlier value is overwritten.
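The if/else counting pattern used in these scripts can be shortened with dict.get, which returns a default when the key is missing. A small sketch on a string instead of a file; the function name char_frequency is made up for illustration.

```python
def char_frequency(text):
    """Count how often each character occurs in text, ignoring case."""
    charcount = {}
    for char in text.lower():
        # get() returns 0 for a character we have not seen yet
        charcount[char] = charcount.get(char, 0) + 1
    return charcount


frequencies = char_frequency("Hello")
for k in sorted(frequencies):
    print(k, frequencies[k])
```

Both forms give the same counts; get just saves the explicit "not in" check.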

Tokenization and sentence boundary assuming each sentence is in new line


#!/usr/bin/python
#open a file and clean its contents
import re

with open('raw_corpus.txt') as fp:
    lines = fp.read().split("\n")   #here lines contains entire file contents

#incremental variable
i=1;

#to access file contents line by line
for line in lines:

#if empty skip the current iteration
    if line == "":
        continue

#convert to lowercase
    line = line.lower()

#cleaning
    line = re.sub(r'\.', " .", line) #substitute . with space .
    line = re.sub(r',', " ,", line) #substitute , with space ,
    line = re.sub(r'\?', " ?", line) #substitute ? with space ?
    line = re.sub(r'!', " !", line)  #substitute ! with space !

#replace multiple spaces into single spaces
    line = re.sub(r'\s+', " ", line)

#get words in current line
    words = line.split(' ')

    print ("<Sentence Id='",i,"'>",sep='')  #use sep='' to suppress white space while printing

    j=1   #token counter
    for word in words:
        print (j,"\t",word)
        j=j+1
        
    
    print ("</Sentence>",sep='')
    
    i = i + 1                 #increment i 

This script opens 'raw_corpus.txt' and cleans up its contents: it lowercases each line, separates punctuation from words and collapses repeated spaces.

It also prints sentence boundaries and tokenizes each sentence into words.

open file using 'open mode'


#open file using open file mode
fp = open('raw_corpus.txt') # Open file on read mode
lines = fp.read().split("\n") # Create a list containing all lines
fp.close() # Close file


#read file line by line
for line in lines:
    print (line)


This script will print contents of file 'raw_corpus.txt' line by line.

Reading a file using "with"


#file open example using "with" (recommended)
with open('raw_corpus.txt') as fp:
    lines = fp.read().split("\n")   #here lines contains entire file contents


#to access file contents line by line
for line in lines:
    print (line)

When you run this python script, the contents of the file 'raw_corpus.txt' are printed line by line.
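A file object can also be looped over directly, which reads one line at a time instead of loading the whole file first. The sketch below writes a small temporary file so it can run anywhere; the function name read_lines and the file contents are made up.

```python
import os
import tempfile


def read_lines(path):
    """Return the lines of the file at path without their trailing newline."""
    with open(path) as fp:
        # iterating over fp yields one line at a time
        return [line.rstrip("\n") for line in fp]


# create a throwaway file standing in for raw_corpus.txt
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("first line\nsecond line\n")
    path = tmp.name

lines = read_lines(path)
os.remove(path)  # clean up the temporary file

for line in lines:
    print(line)
```

One small difference: read().split("\n") leaves an empty string at the end of the list when the file ends with a newline, while looping over the file does not.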

Monday, 17 September 2018

Variables in Python

In Python a variable is created by assigning a value to it, as in the following example:

x = 5 
y = "John"
print(x)
print(y)

Rules for variable names:

1. It should start with either a letter or the underscore character.
2. It cannot start with a number.
3. It can only contain alphanumeric characters and underscores (A-Z, a-z, 0-9, and _).
4. Names are case-sensitive: uppercase and lowercase names are treated as different variables.
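These rules can be checked from Python itself with the string method isidentifier. A small sketch follows; the helper name is_valid_name is made up. Two things go beyond the list above: reserved words such as for are rejected as well, and Python 3 also accepts non-ASCII letters in names.

```python
import keyword


def is_valid_name(name):
    """True when name can be used as a variable name."""
    # isidentifier() checks the character rules; keywords like "for" are excluded separately
    return name.isidentifier() and not keyword.iskeyword(name)


for candidate in ["my_var", "_hidden", "2fast", "my-var", "for"]:
    print(candidate, is_valid_name(candidate))

# rule 4: names differing only in case are two separate variables
age = 5
Age = 6
print(age, Age)
```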

Python has five standard data types −
  • Numbers
  • String
  • List
  • Tuple
  • Dictionary
In the example above x is a number type and y is a string type variable.

A list contains items separated by commas and enclosed within square brackets []. To some extent, lists are similar to arrays in C. One difference is that the items in a list can be of different data types.

Example: 

#!/usr/bin/python3

mylist = [ 'abcd', 786 , 2.23, 'john', 70.2 ]
smalllist = [123, 'john']

print(mylist)          # Prints complete list
print(mylist[0])       # Prints first element of the list
print(mylist[1:3])     # Prints the 2nd and 3rd elements
print(mylist[2:])      # Prints elements starting from 3rd element
print(smalllist * 2)  # Prints list two times
print(mylist + smalllist) # Prints concatenated lists

Tuples are like lists except that they are read-only. They are enclosed in parentheses instead of the square brackets used for lists. While lists can be updated, tuples cannot.

Example:

#!/usr/bin/python3

mytuple = ( 'abcd', 786 , 2.23, 'john', 70.2 )
smalltuple = (123, 'john')

print(mytuple)          # Prints complete tuple
print(mytuple[0])       # Prints first element of the tuple
print(mytuple[1:3])     # Prints the 2nd and 3rd elements
print(mytuple[2:])      # Prints elements starting from 3rd element
print(smalltuple * 2)  # Prints tuple two times
print(mytuple + smalltuple) # Prints concatenated tuples
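To see the read-only part in action, here is a small sketch that tries the same update on a list and on a tuple. The variable names are made up for illustration.

```python
mylist = ['abcd', 786]
mytuple = ('abcd', 786)

mylist[0] = 'changed'  # lists can be updated in place

error_message = None
try:
    mytuple[0] = 'changed'  # tuples cannot, this raises TypeError
except TypeError as err:
    error_message = str(err)

print(mylist)
print(mytuple)
print(error_message)
```

The list shows the new value, the tuple is unchanged, and the caught error explains why.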

Dictionary:

Python's dictionaries are a kind of hash table. They work like hashes in Perl and consist of key-value pairs. Dictionaries are enclosed in curly braces { }, and values can be assigned and accessed using square brackets [].

 Example : 
 
#!/usr/bin/python3

mydict = {}   # named mydict so the built-in name dict is not overwritten
mydict['one'] = "This is one"
mydict[2]     = "This is two"

dicts = {'name': 'john','code':6734, 'dept': 'sales'}


print(mydict['one'])    # Prints value for 'one' key
print(mydict[2])        # Prints value for 2 key
print(dicts)          # Prints complete dictionary
print(dicts.keys())   # Prints all the keys
print(dicts.values()) # Prints all the values

References:

https://www.tutorialspoint.com/python/python_variable_types.htm

https://www.w3schools.com/python/python_variables.asp

Monday, 10 September 2018

How to remove 'whitespace' in the variable while 'print'ing?

Python3: 

The sep argument (separator) is used to suppress the whitespace that print puts between its arguments


print (name, ",How are you?",sep='')

To suppress the line terminator (the newline \n that print adds at the end), use the end argument


print (name, ",How are you?",end='')
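To compare the variants side by side, the sketch below sends the output of three print calls into a string buffer and then shows it with repr, so the spaces and newlines become visible. The name John is just sample data.

```python
import io

name = "John"
buffer = io.StringIO()

print(name, ",How are you?", file=buffer)          # default: space between arguments
print(name, ",How are you?", sep='', file=buffer)  # sep='' removes that space
print("no newline", end='', file=buffer)           # end='' removes the trailing newline

output = buffer.getvalue()
print(repr(output))
```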

Saturday, 1 September 2018

Python Tutorial - Python Installation

From wikibooks.org, Python is an interpreted programming language. For those who don't know, a programming language is what you write down(instructions) to tell a computer what to do. However, the computer doesn't read the language directly—there are hundreds of programming languages, and it couldn't understand them all. So, when someone writes a program, they will write it in their language of choice, and then compile it—that is, turn it into lots of 0s and 1s, that the computer can easily and quickly understand. A Windows program that you buy is already compiled for Windows—if you opened the program file up, you'd just get a mass of weird characters and rectangles. Give it a go—find a small Windows program, and open it up in Notepad or Wordpad. See what garbled mess you get. 

Python Installation:

Most Linux distributions come with Python installed by default. To confirm, just type the following command and see what happens.

python --version

If Python is installed you will see something like "Python 2.7", which means you have Python installed and are using version 2.7.

Many Linux distributions install version 2.7, but 3.x.x is the newest version and is not backward compatible. So I recommend installing Python 3.x.x, especially for beginners.

Installing Python 3.x.x: 

sudo apt-get install python3

This will install the Python 3 release packaged for your distribution; you can confirm it using

python3 --version

Now you have two versions of Python: Python2 and Python3.

By default python2 will be used when you run the python command. So to use python3 we have to alias python to python3.

To make the alias:

In a terminal, open the file .bashrc located in your home directory with vi or any editor

vi ~/.bashrc

Add the line alias python='python3' to the file and save it, then run

source ~/.bashrc

That's it; now you are all ready to use python3 as the default version on your machine.