Natural Language Processing in Python

dexter · September 28, 2019, 6:01pm

I am using NLTK to process the messages coming from Slack, mostly so that commands can be written in a human-readable way.

To install it, run pip install nltk.
Then run python in the terminal to open the Python interpreter, and enter the following to download the data packages we need.

import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

Then here is the code snippet to extract all nouns from a sentence:

import nltk

lines = "what's of banknifty"
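# Split the sentence into tokens, then tag each token with its part of speech.
tokenized = nltk.word_tokenize(lines)
tagged = nltk.pos_tag(tokenized)
# Keep only the tokens whose tag starts with 'NN' (the noun tags).
nouns = [word for (word, pos) in tagged if pos[:2] == 'NN']
print(nouns)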

It returns just banknifty. Luckily, no FNO stock symbol gets tagged as a non-noun.
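The pos[:2] == 'NN' test works because every Penn Treebank noun tag (NN, NNS, NNP, NNPS) starts with NN. Here is a minimal sketch of the same filter as a standalone function. The name extract_nouns is hypothetical, and the tagged pairs are hand-written for illustration, not real tagger output.

```python
# Illustrative (word, tag) pairs in the shape nltk.pos_tag returns.
# The tags are hand-written for this example, not real tagger output.
SAMPLE_TAGGED = [
    ("price", "NN"),
    ("of", "IN"),
    ("stocks", "NNS"),
    ("in", "IN"),
    ("India", "NNP"),
]


def extract_nouns(tagged):
    """Keep words whose Penn Treebank tag starts with 'NN'."""
    return [word for word, pos in tagged if pos.startswith("NN")]


print(extract_nouns(SAMPLE_TAGGED))
```

Keeping the filter separate from the tagging step makes it easy to check on its own, without the NLTK data packages.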

For further reading: Penn Treebank P.O.S. Tags

dexter · September 28, 2019, 6:09pm

Actually, here is how I checked that all FNO stock symbols are tagged as nouns.

import nltk
import requests

positions = requests.get('https://www.nseindia.com/live_market/dynaContent/live_watch/stock_watch/foSecStockWatch.json').json()
# Tag each FNO symbol and print the tokens recognised as nouns.
for row in positions['data']:
    lines = row['symbol']
    tokenized = nltk.word_tokenize(lines)
    tagged = nltk.pos_tag(tokenized)
    nouns = [word for (word, pos) in tagged if pos[:2] == 'NN']
    print(nouns)

It printed:

['IDEA']

['MANAPPURAM'] ['MFSL'] ['BHARTIARTL'] ['COLPAL']

['SUNTV'] ['ICICIPRULI'] ['BATAINDIA'] ['EQUITAS'] ['BAJFINANCE'] ['MINDTREE'] ['BERGEPAINT']

['GODREJCP'] ['SIEMENS']

['PAGEIND'] ['ITC'] ['MUTHOOTFIN'] ['BAJAJFINSV'] ['PIDILITIND'] ['BIOCON'] ['KOTAKBANK'] ['RELIANCE'] ['IOC']

['TATAGLOBAL']

['VOLTAS'] ['HEXAWARE'] ['CIPLA'] ['MARICO'] ['CASTROLIND'] ['PETRONET'] ['NTPC'] ['NMDC'] ['AXISBANK'] ['NIITTECH']

['NATIONALUM'] ['HINDPETRO'] ['HDFCBANK'] ['LICHSGFIN']

['FEDERALBNK'] ['PNB']

['TORNTPOWER'] ['GMRINFRA']

['DABUR'] ['INFY'] ['UBL']

['CHOLAFIN']

['WIPRO']

['ULTRACEMCO']

['BANKBARODA']

['UPL'] ['HCLTECH'] ['SBIN'] ['TORNTPHARM']

['ASIANPAINT'] ['ICICIBANK']

['TATAELXSI']

['SHREECEM']

['DIVISLAB']

['MRF'] ['TITAN']

['BAJAJ-AUTO'] ['EICHERMOT'] ['ACC'] ['SRF']

['POWERGRID']

['INFRATEL'] ['EXIDEIND']

['OIL']

['BHARATFORG'] ['SRTRANSFIN']

['BOSCHLTD'] ['PFC'] ['PVR']

['HAVELLS'] ['TATACHEM']

['MGL']

['UJJIVAN']

['HINDUNILVR']

['HEROMOTOCO'] ['LT'] ['CADILAHC'] ['TECHM'] ['MCDOWELL-N']

['MARUTI']

['JUSTDIAL']

['TCS']

['HDFC']

['UNIONBANK'] ['DRREDDY']

['NESTLEIND'] ['LUPIN'] ['CANBK']

['TVSMOTOR']

['AMBUJACEM']

['IGL']

['GAIL']

['CESC'] ['APOLLOHOSP']

['DLF'] ['APOLLOTYRE'] ['JSWSTEEL'] ['MOTHERSUMI']

['COALINDIA']

['ADANIPOWER'] ['ASHOKLEY']

['RECLTD']

['CENTURYTEX'] ['RAMCOCEM']

['INDIGO'] ['M', 'MFIN'] ['BALKRISIND'] ['TATAPOWER'] ['JUBLFOOD']

['BANKINDIA']

['SUNPHARMA']

['AUROPHARMA']

['M', 'M']

['ADANIENT'] ['HINDALCO'] ['GRASIM']

['BEL'] ['ESCORTS']

['BRITANNIA'] ['CUMMINSIND']

['BPCL']

['IDFCFIRSTB'] ['ADANIPORTS']

['AMARAJABAT'] ['L', 'TFH'] ['BHEL'] ['SAIL']

['TATAMOTORS'] ['GLENMARK']

['TATAMTRDVR']

['NBCC']

['ONGC']

['CONCOR'] ['ZEEL'] ['TATASTEEL'] ['INDUSINDBK']

['YESBANK']

['NCC'] ['VEDL']

['RBLBANK']

['JINDALSTEL']

['IBULHSGFIN'] ['DISHTV'] ['PEL'] ['STAR']

Apart from the & sign in a few symbols (M&MFIN, M&M, L&TFH, which get split in two), there is no problem, so we can write a separate classifier for it.

dexter · September 28, 2019, 6:21pm

Here goes the classifier too.

import nltk
import requests

positions = requests.get('https://www.nseindia.com/live_market/dynaContent/live_watch/stock_watch/foSecStockWatch.json').json()
for row in positions['data']:
    lines = row['symbol']
    tokenized = nltk.word_tokenize(lines)
    tagged = nltk.pos_tag(tokenized)
    nouns = [word for (word, pos) in tagged if pos[:2] == 'NN']
    # Symbols such as M&M are split into two nouns; rejoin them with '&'.
    if len(nouns) > 1:
        var = nouns[0] + '&' + nouns[1]
    else:
        var = nouns[0]

    # The rebuilt string should match the original symbol exactly.
    if var == lines:
        print("true")

It confirms that our hypothesis is correct: it prints true for every symbol.
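A different approach, sketched here as an assumption and not what the code above does: since the list of FNO symbols is already known, a message can be split on whitespace and each word looked up in that set, so symbols containing & never get split. The name find_symbols and the small KNOWN_SYMBOLS set are hypothetical. In practice the set would be built from the symbol field fetched above.

```python
import string

# Hypothetical sample; in practice, build this from the fetched symbol list.
KNOWN_SYMBOLS = {"M&M", "L&TFH", "BAJAJ-AUTO", "ITC"}


def find_symbols(message, known_symbols):
    """Return the known symbols mentioned in a message, in order."""
    found = []
    for raw in message.split():
        # Strip punctuation only at the ends, so 'm&m?' becomes 'M&M'.
        candidate = raw.strip(string.punctuation).upper()
        if candidate in known_symbols:
            found.append(candidate)
    return found


print(find_symbols("what's the ltp of m&m?", KNOWN_SYMBOLS))
```

This trades the generality of POS tagging for an exact lookup, so it only finds symbols that are already in the set.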

dexter · September 28, 2019, 6:45pm

Terms like ltp and price are also tagged as nouns, so we need another filter to strip them out.

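tags = ["ltp", "price"]

# str2 is assumed to hold the incoming message text.
for tag in tags:
    # replace() leaves the string unchanged when the tag is absent.
    str2 = str2.replace(tag, '')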
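One thing to watch: str.replace works on substrings, so it would also cut ltp or price out of the middle of a longer word. Here is a minimal sketch of a word-level variant, assuming the message is plain whitespace-separated text. The name strip_command_words is hypothetical.

```python
COMMAND_WORDS = {"ltp", "price"}


def strip_command_words(message):
    """Drop whole words that are command terms, leaving other words intact."""
    kept = [w for w in message.split() if w.lower() not in COMMAND_WORDS]
    return " ".join(kept)


print(strip_command_words("ltp of banknifty"))
```

The remaining text can then go through the tokenizer and tagger as before.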