By: Anagha S

What are the Key Features of the Natural Language Toolkit (NLTK) in Python?

Natural Language Processing (NLP) is a field that makes it possible for computers to understand and interact with human languages. NLTK, the Natural Language Toolkit, is a powerful Python library for NLP and a prominent tool for building Python applications that work with human language data. It offers a wide range of text-processing libraries for tasks such as classification, tokenization, stemming, tagging, and parsing, and it supplies the building blocks for higher-level tasks such as sentiment analysis, machine translation, and text generation.
Whether you're looking to analyze text, perform linguistic tasks, or build applications like chatbots, NLTK offers a comprehensive set of tools and resources to help you get started. In this blog post, we'll introduce NLTK, explore its features, and demonstrate a few basic examples to get you started.

Getting Started with NLTK

To begin utilizing NLTK, you'll need to install it via pip:
pip install nltk
Once installed, you can begin by importing the library and downloading the necessary datasets:
import nltk
nltk.download('all')
This command downloads all the datasets and resources provided by NLTK. You can also download specific datasets as needed.
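Alternatively, you can fetch just the resources that the examples in this post rely on:
import nltk
nltk.download('punkt')                      # tokenizer models
nltk.download('averaged_perceptron_tagger') # part-of-speech tagger
nltk.download('wordnet')                    # dictionary used by the lemmatizer
nltk.download('vader_lexicon')              # lexicon used for sentiment analysis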

Key Features of NLTK

1. Tokenization: Tokenization is the process of splitting a text into smaller units, called tokens, such as words or sentences. NLTK provides simple functions to tokenize text, which is often the first step in processing language data.
Example:
from nltk.tokenize import word_tokenize
text = "Three generations with six decades of life experience."
tokens = word_tokenize(text)
print(tokens)
# Output: ['Three', 'generations', 'with', 'six', 'decades', 'of', 'life', 'experience', '.']
Here we use word_tokenize to divide the text into individual words. NLTK offers other tokenizers as well: sent_tokenize segments a text into a list of sentences, WordPunctTokenizer splits tokens on punctuation boundaries, subword tokenization breaks words into smaller units, and character tokenization divides text into individual characters.
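As a quick sketch of the two most common alternatives (assuming the punkt data has been downloaded):
from nltk.tokenize import sent_tokenize, WordPunctTokenizer
text = "Don't stop. Keep going!"
print(sent_tokenize(text))
# Output: ["Don't stop.", 'Keep going!']
print(WordPunctTokenizer().tokenize(text))
# Output: ['Don', "'", 't', 'stop', '.', 'Keep', 'going', '!']
Note how WordPunctTokenizer splits "Don't" into three tokens, while word_tokenize would keep the contraction together as "Do" and "n't".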
2. Stemming and Lemmatization: These techniques reduce words to their base or root form. While stemming simply chops off word endings, lemmatization transforms the word into its canonical dictionary form.
Stemming: Stemming is a straightforward, rule-based procedure that removes or substitutes word endings to reduce a word to its stem. It is fast, but it occasionally produces stems that are not real words, as 'gener' in the example below shows.
Example:
from nltk.stem import PorterStemmer
ps = PorterStemmer()
words = ["Running", "generations", "Walk", "Taking", "Talked"]
stemmed_words = [ps.stem(word) for word in words]
print(stemmed_words)
#Output : ['run', 'gener', 'walk', 'take', 'talk']
Lemmatization: Lemmatization is a more advanced approach that applies normalization rules based on a word's part of speech and context. Its goal is to return the word's base dictionary form (the lemma), which is always a legitimate word.
Example:
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
print("rocks :", lemmatizer.lemmatize("rocks"))  # Output: "rock"
print("corpora :", lemmatizer.lemmatize("corpora"))  # Output: "corpus"
# pos="v" tells the lemmatizer to treat the word as a verb
print("running :", lemmatizer.lemmatize("running", pos="v"))  # Output: "run"
# pos="a" tells the lemmatizer to treat the word as an adjective
print("better :", lemmatizer.lemmatize("better", pos="a"))  # Output: "good"
3. Text Classification: NLTK offers tools for text classification, enabling you to build models that can categorize text into predefined classes. This is especially useful for tasks like spam detection or sentiment analysis.
Example:
from nltk.classify import NaiveBayesClassifier
from nltk.tokenize import word_tokenize
# Sample data
sentences = [
    ("What time is it?", "question"),
    ("Where are you?", "question"),
    ("How old are you?", "question"),
    ("The sky is blue.", "Statement"),
    ("I like pizza.", "Statement"),
    ("Dogs are loyal animals.", "Statement"),
]
# Feature extractor
def get_features(sentence):
    words = word_tokenize(sentence.lower())
    return {'first_word': words[0], 'last_word': words[-1]}
# Prepare feature sets
feature_sets = [(get_features(sent), category) for (sent, category) in sentences]
# Train the classifier
classifier = NaiveBayesClassifier.train(feature_sets)
# Test the classifier
test_sentences = [
    "Who is the president?",
    "Cats are independent.",
    "Do you like coffee?",
    "The Earth is round.",
]
for sentence in test_sentences:
    features = get_features(sentence)
    category = classifier.classify(features)
    print(f"Sentence: {sentence}")
    print(f"Category: {category}\n")
# Output:
Sentence: Who is the president?
Category: question
Sentence: Cats are independent.
Category: statement
Sentence: Do you like coffee?
Category: question
Sentence: The Earth is round.
Category: statement
In this example, a Naive Bayes classifier is trained on sentences labeled as "question" or "statement." The get_features function extracts each sentence's first and last words as features, and the classifier learns to associate those features with a category. Once trained, it can predict the category of new sentences from their first and last words, as the printed predictions for the test sentences show.
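If you want to peek inside the trained model, NaiveBayesClassifier can also report which features were most informative; on this toy dataset you would expect last_word = '?' to rank highly:
# Show the three features that best separate the two labels
classifier.show_most_informative_features(3)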
4. Parsing and Syntax Analysis: The library provides functions for parsing sentences to understand their grammatical structure, which is essential for tasks like part-of-speech tagging and dependency parsing.
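Here is a minimal sketch combining part-of-speech tagging with a simple hand-written chunk grammar (the NP rule is just one illustrative pattern, not the only way to define a noun phrase):
import nltk
from nltk import word_tokenize, pos_tag
sentence = "The quick brown fox jumps over the lazy dog."
tags = pos_tag(word_tokenize(sentence))
print(tags)
# e.g. [('The', 'DT'), ('quick', 'JJ'), ('brown', 'NN'), ('fox', 'NN'), ('jumps', 'VBZ'), ...]
# A toy chunk grammar: a noun phrase (NP) is an optional determiner,
# followed by any number of adjectives, followed by a noun
grammar = "NP: {<DT>?<JJ>*<NN>}"
chunker = nltk.RegexpParser(grammar)
print(chunker.parse(tags))
# e.g. (S (NP The/DT quick/JJ brown/NN) (NP fox/NN) jumps/VBZ over/IN (NP the/DT lazy/JJ dog/NN) ./.)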
5. Natural Language Understanding: NLTK supports various NLP tasks, such as named entity recognition (identifying names of people, places, etc.), sentiment analysis (determining the sentiment of a text), and question answering.
Named Entity Recognition: NER identifies and classifies named entities in text into predefined categories like persons, organizations, and locations.
Here is an example of NER:
import nltk
from nltk import word_tokenize, pos_tag, ne_chunk
# Download necessary data
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('words')
# Sample text
text = "Einstein was born in Germany, on March 14, 1879.He began his schooling at the Luitpold Gymnasium "
# Tokenize the text
tokens = word_tokenize(text)
# Perform part-of-speech tagging
pos_tags = pos_tag(tokens)
# Perform named entity recognition
ner_result = ne_chunk(pos_tags)
# Extract named entities
named_entities = []
for chunk in ner_result:
    if hasattr(chunk, 'label'):
        entity = ' '.join(c[0] for c in chunk)
        entity_type = chunk.label()
        named_entities.append((entity, entity_type))
# Print the results
print("Named Entities:")
for entity, entity_type in named_entities:
    print(f"{entity}: {entity_type}")
# Output
Named Entities:
Einstein: PERSON
Germany: GPE
Luitpold Gymnasium: ORGANIZATION
This example demonstrates NLTK's built-in capability to recognize common types of named entities:
* PERSON: Names of people
* ORGANIZATION: Names of companies, institutions, etc.
* GPE (Geo-Political Entity): Names of countries, cities, states, etc.
Sentiment Analysis: Sentiment analysis determines the emotional tone behind a piece of text. NLTK ships with the VADER analyzer, which works well for short, informal text.
Example:
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer
# Download the VADER lexicon
nltk.download('vader_lexicon')
# Initialize the VADER sentiment analyzer
sia = SentimentIntensityAnalyzer()
# Function to classify sentiment
def classify_sentiment(text):
    # Get sentiment scores
    scores = sia.polarity_scores(text)
    # Classify based on the compound score
    if scores['compound'] >= 0.05:
        return 'Positive'
    elif scores['compound'] <= -0.05:
        return 'Negative'
    else:
        return 'Neutral'
# Example sentences
sentences = [
    "I love this product! It's amazing.",
    "This movie was terrible. I hated it.",
    "The weather is okay today.",
    "I'm feeling great about the upcoming vacation!"
]
# Analyze sentiment for each sentence
for sentence in sentences:
    sentiment = classify_sentiment(sentence)
    print(f"Sentence: {sentence}")
    print(f"Sentiment: {sentiment}\n")
Output: 
Sentence: I love this product! It's amazing.
Sentiment: Positive
Sentence: This movie was terrible. I hated it.
Sentiment: Negative
Sentence: The weather is okay today.
Sentiment: Neutral
Sentence: I'm feeling great about the upcoming vacation!
Sentiment: Positive
NLTK is a robust and versatile library that opens up the world of NLP to Python developers. Its extensive documentation, active community, and wide range of features make it an excellent choice for anyone looking to work with natural language data.