Preparing Text for Analysis with Natural Language Toolkit (NLTK)
Note: This is the second in a series of blog posts about the computational analysis of open-ended survey questions. Read part one, “Writing Open-Ended Survey Questions for Computational Analysis.”
In the first post, we explained that Evolytics was asked to analyze approximately 68,000 open-ended responses to nine survey questions. The questions asked respondents to list competitor brands they had tried, rate those competitors, and describe the rationale for each rating.
In this post, we’ll discuss preparing text for analysis. The techniques discussed here are general and apply to any form of text, not just survey responses.
Getting Started with Natural Language Processing
Prior to preparing our text for analysis, it’s important that we define three common terms used in natural language processing. First, a corpus is a collection of documents on which we conduct analysis. A document is any text that is subject to analysis. This could be a set of reports, social media posts, or, in this case, open-ended survey responses. Finally, tokens are meaningful groupings of characters. Tokens are often words or parts of words.
When preparing a document for analysis we tokenize it, or break it apart into discrete tokens. In many models, the presence and frequency of tokens characterize a document. However, not all tokens are informative. For example, some tokens (e.g., “a”, “of”, “the”) are extremely common and have little purpose other than grammatically tying the sentence together. In linguistics, these are referred to as function words. Typically, function words are included in a stopword list that contains words to exclude from tokenization because they have little value in statistical models.
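To make the role of token frequency concrete, the short sketch below counts tokens in a single toy document using only the standard library. The whitespace split and the three-word `function_words` set are assumptions made for illustration; they stand in for a real tokenizer and a real stopword list.

```python
from collections import Counter

# Toy document; splitting on whitespace is a stand-in for a real tokenizer.
document = "the cat sat on the mat and the dog sat on the rug"
tokens = document.split()

# Raw counts: the function word "the" dominates the document.
counts = Counter(tokens)
print(counts.most_common(3))

# Hypothetical, tiny stopword list used only for this illustration.
function_words = {"the", "on", "and"}
content_counts = Counter(t for t in tokens if t not in function_words)
print(content_counts.most_common(3))
```

Once the function words are excluded, the remaining counts describe what the document is actually about.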
It is common to stem or lemmatize tokens for further standardization. Stemming removes the end of a word to get its root. For example, “smarter” and “smartest” become “smart.” A drawback to stemming is that it may not return proper words. For instance, “accelerating” becomes “acceler.” In contrast, lemmatization attempts to return the base form of a word as it might be found in the dictionary. Using lemmatization, “women,” “Woman’s,” and “Womanly” become “woman” while “is” and “are” become “be.” However, you must tag a token’s part of speech (i.e., noun, adjective, verb, etc.) to know how a given word should be lemmatized. Since stemming and lemmatization serve the same purpose, you should choose only one. Furthermore, you shouldn’t assume that you have to stem or lemmatize. Try your models without either and see how they perform.
Finally, you may wish to remove items such as numbers, punctuation, or URLs. Doing this is especially helpful for web data that may not conform to standard grammar. However, we have two warnings about applying these functions indiscriminately. First, altering your text can affect how a parts-of-speech classifier tags a given token, which in turn can impact lemmatization. As a result, be cautious about aggressively filtering text before lemmatization, although you can certainly do so afterwards. Second, depending upon your corpus, numbers, punctuation, and URLs may be informative. For example, while numbers by themselves are often not informative, stripping them from your tokens may prevent identification of things such as the Unicode code points for emojis.
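The second warning is easy to reproduce. The sketch below strips digits from a made-up response in which an emoji was exported as its Unicode code point. The response text and the use of a regular expression are assumptions for illustration; any digit-stripping approach would behave the same way.

```python
import re

# Hypothetical response in which an emoji was exported as its code point.
response = "Love this brand U+1F600 would buy again"

# Pattern for a code point written as "U+" followed by 4 to 6 hex digits.
code_point_pattern = r"U\+[0-9A-F]{4,6}"

# Before filtering, the code point is easy to find.
found_before = re.findall(code_point_pattern, response)
print(found_before)

# Stripping every digit mangles the code point, so it can no longer be found.
no_digits = re.sub(r"\d", "", response)
found_after = re.findall(code_point_pattern, no_digits)
print(no_digits)
print(found_after)
```

If emoji or similar codes matter to your analysis, extract them before applying the number filter.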
Below we’ve defined several useful functions for text cleaning. We’re using Python’s Natural Language Toolkit (NLTK) which contains a number of tools and models for text processing. If you’ve never used NLTK before, run the first cell below.
# Only run if you haven't previously installed these NLTK tools.
# Depending on your NLTK version, you may also be prompted to download
# 'punkt_tab' and 'averaged_perceptron_tagger_eng'.
import nltk

nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('words')
Stemming and Tokenization
Now let’s talk about how to tokenize and filter our text. First, use Python’s built-in string methods to put text in lowercase. We can also filter out punctuation and numbers using str.maketrans() and str.translate(). translate() substitutes one character for another, and maketrans() creates the character map that translate() uses. Here we delete digits outright and replace each punctuation character with a space.
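As a small, self-contained illustration of these two string methods, the sketch below applies both forms of `str.maketrans()` to a made-up sentence. The sample text and variable names are ours and are not part of the survey data.

```python
import string

text = "Hello, World! It's 2024."

# Three-argument form: every character in the third argument is deleted.
digit_table = str.maketrans("", "", string.digits)

# Dict form: each punctuation character is mapped to a single space.
punct_table = str.maketrans({ch: " " for ch in string.punctuation})

cleaned = text.lower().translate(digit_table).translate(punct_table)
print(repr(cleaned))
print(cleaned.split())
```

Replacing punctuation with a space rather than deleting it keeps contractions such as "it's" from collapsing into a single unrecognizable token.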
Second, we’re using regular expressions (regex) to find and remove URLs. If you’re reading this post, you’re likely familiar with regex, but if not, all you need to know is that it is a way of defining patterns. Here we’re using it to remove URLs by searching for any run of non-whitespace characters beginning with “http” and substituting an empty string in its place.
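The pattern can be tried on its own before it is wired into a tokenizer. In the sketch below, the example sentence and the extra whitespace cleanup are our own additions for illustration.

```python
import re

# Matches "http" followed by any run of non-whitespace characters.
URL_PATTERN = re.compile(r"http\S+")

text = "Details at https://example.com/pricing?id=7 and http://example.org today"
without_urls = URL_PATTERN.sub("", text)

# Collapse the double spaces left behind by the removed URLs.
tidy = " ".join(without_urls.split())
print(tidy)
```

Note that this simple pattern will miss URLs written without a scheme (for example, ones starting with "www."), so check your own corpus before relying on it.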
Third, we’re using the WordPunctTokenizer from NLTK to tokenize text, creating a Python list of alphabetic and non-alphabetic tokens. Then, use Python list comprehensions to filter the token list to remove single character tokens and tokens that are found in our stopword list.
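For readers who want to see this step without installing NLTK, the sketch below approximates it with the standard library. The regular expression `\w+|[^\w\s]+` is, to our understanding, the pattern WordPunctTokenizer is built on, but treat that as an assumption; the short stopword set is hypothetical and far smaller than NLTK's list.

```python
import re

# Hypothetical stopword list for illustration; NLTK's list is much longer.
stopwords_demo = {"we", "a", "to", "in", "the", "and"}

text = "we shipped a fix to the on-call team in 3 days."

# Runs of word characters, or runs of punctuation, become separate tokens.
raw_tokens = re.findall(r"\w+|[^\w\s]+", text)
print(raw_tokens)

# Drop single-character tokens, then drop stopwords.
filtered = [t for t in raw_tokens if len(t) > 1]
filtered = [t for t in filtered if t not in stopwords_demo]
print(filtered)
```

The length filter removes the stray hyphen, digit, and period as well as the one-letter word.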
Finally, use the SnowballStemmer from NLTK to stem each token as discussed above.
import re
import string

from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer
from nltk.tokenize.regexp import WordPunctTokenizer

tokenizer = WordPunctTokenizer()

# Creates stopword list from NLTK. The empty string is added so that tokens
# emptied by punctuation stripping are filtered out too.
sw = stopwords.words("english") + ['']


# Remove urls.
def remove_urls(text):
    return re.sub(r"http\S+", "", text)


# Remove punctuation. Note: this leaves a space so it plays nicely with NLTK's stopword list.
def remove_punctuation(s):
    # This line determines what the punctuation is replaced with.
    table = str.maketrans({ch: ' ' for ch in string.punctuation})
    return s.translate(table)


# Remove numbers.
def remove_numbers(s):
    remove_digits = str.maketrans('', '', string.digits)
    return s.translate(remove_digits)


# Stems tokens.
def stem_tokens(tokens, stemmer=SnowballStemmer("english", ignore_stopwords=True)):
    return [stemmer.stem(tkn) for tkn in tokens]


# Tokenize texts. Note: it is possible to comment out steps to change how tokenization occurs.
def tokenize(text, stem=False):
    text = remove_urls(text)         # removes urls
    text = remove_numbers(text)      # removes numbers
    text = text.lower()              # sets to lowercase
    text = text.replace('-', '')     # removes hyphens
    tkns = tokenizer.tokenize(text)  # tokenizes text
    tkns = [remove_punctuation(tkn).strip() for tkn in tkns]  # strips punctuation
    tkns = [tkn for tkn in tkns if tkn not in sw]  # filters using stopwords
    tkns = [tkn for tkn in tkns if len(tkn) > 1]   # no single character tkns
    if stem:
        tkns = stem_tokens(tkns)                       # stems tkns
        tkns = [tkn for tkn in tkns if tkn not in sw]  # a stem may itself be a stopword
    return tkns
Now we’ll do the actual tokenization. After tokenization, our corpus will be a list of documents, with each document being a list of tokens. Note that we also create a map of our indices. This enables us to link tokenized documents back to the original, untokenized versions. If we need to remove a particular document during model building, we remove the entries at the same position from every list so that they stay aligned. Keeping a copy of the original documents lets us inspect them after creating a topic model, which assists in interpretation. The indices allow us to join the results of any text analysis back to the parent survey dataset for other analyses, such as slicing by demographics.
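The following sketch shows how the index map might be used in practice. Everything in it is hypothetical: the `survey_rows` records, the whitespace split that stands in for the tokenizer, and the `labels` list that stands in for model output.

```python
# Hypothetical parent dataset: one row per survey response.
survey_rows = [
    {"respondent_id": 101, "age_group": "18-34", "response": "Great taste and fair price"},
    {"respondent_id": 102, "age_group": "35-54", "response": "   "},
    {"respondent_id": 103, "age_group": "55+", "response": "Too expensive for the quality"},
]

docs, orig_docs, doc_index = [], [], []
for i, row in enumerate(survey_rows):
    tokens = row["response"].lower().split()  # stand-in for a real tokenizer
    if not tokens:
        continue  # skip empty documents; the three lists stay aligned
    docs.append(tokens)
    orig_docs.append(row["response"])
    doc_index.append(i)

# Pretend a model assigned one label per remaining document.
labels = ["positive", "negative"]  # hypothetical model output

# Join the results back to the parent rows through the saved indices.
joined = [
    {
        "respondent_id": survey_rows[i]["respondent_id"],
        "age_group": survey_rows[i]["age_group"],
        "label": label,
    }
    for i, label in zip(doc_index, labels)
]
print(joined)
```

Because the blank response was skipped, position 1 in `docs` corresponds to row 2 of the parent data, which is exactly the mismatch the index list resolves.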
# Toy corpus for our example
doc_examples = [
    'We use data, statistical algorithms, and machine learning to help you make business decisions and targeted digital marketing efforts based on potential outcomes.',
    'With propensity modeling, you can predict the likelihood of a visitor, lead, or current customer to perform a certain action on your website (i.e. browse your site, click a CTA, pick up their phone to call).',
    'Once you can anticipate future customer and user behavior, you can plan for possible challenges and obstacles you’ll need to help that customer or user overcome.',
    # The next document is one string split across several literals.
    'We believe in the power of data to affect change and help make a difference in the world. We focus on web analytics and marketing optimization for business evolution '
    'and brand growth. As a full-service data analytics consulting company, we partner with clients to activate their data with best-in-class analytics tool implementation, '
    'meaningful insights, and expert training. Founded in 2005, we serve clients across different industries. We get to know your business right away so the consulting help '
    'we offer you is catered to your specific needs and goals. We don’t just sell you a service; we engage with your business as a partner. We care about your success, and we '
    'want to help your business thrive. ',
    "Evolytics can help you with your data science needs.",
]

print("Number of Documents: ", len(doc_examples))
Number of Documents: 5
docs = []
orig_docs = []
doc_index = []
for i, d in enumerate(doc_examples):
    orig_docs.append(d)
    docs.append(tokenize(d, stem=True))
    doc_index.append(i)

i = 0
print("Original Document: ", orig_docs[i])
print('\n')
print("Tokenized & Stemmed Document: ", docs[i])
print('\n')
print("Document Index: ", doc_index[i])
Original Document: We use data, statistical algorithms, and machine learning to help you make business decisions and targeted digital marketing efforts based on potential outcomes.
Tokenized & Stemmed Document: ['use', 'data', 'statist', 'algorithm', 'machin', 'learn', 'help', 'make', 'busi', 'decis', 'target', 'digit', 'market', 'effort', 'base', 'potenti', 'outcom']
Document Index: 0
Parts-of-Speech Tagging, Lemmatization, and Tokenization
The above approach works well if you wish to stem tokens, but lemmatization requires slightly different handling. Specifically, we need to tag each token’s part of speech using NLTK’s pos_tag function. Unfortunately, the Penn Treebank part-of-speech tags returned by nltk.pos_tag are not those accepted by the WordNet lemmatizer. For example, NLTK returns a tag of “PRP” (i.e., personal pronoun) for the word “We,” but the WordNet lemmatizer expects a simpler tag such as “n” (i.e., noun). Therefore, we define a helper function, wordnet_pos, that converts each tag to the correct format.
import nltk
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

i = 0


def wordnet_pos(pos):
    """Converts Penn Treebank PoS to Wordnet PoS

    Arg:
        pos(str): Penn treebank PoS tag
    Returns:
        str: wordnet PoS tag
    """
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}
    # Looks up the first letter of the tag. Returns noun if not found
    # to avoid a lemmatization error.
    return tag_dict.get(pos[0], wordnet.NOUN)


tokenized_sentence = nltk.word_tokenize(doc_examples[i])  # tokenizing sentence
print("Tokenized Sentence: ", tokenized_sentence)
tag_sent = nltk.pos_tag(tokenized_sentence)  # tagging pos
print('\n')
print("Penn Treebank PoS: ", tag_sent)
words = [(word, wordnet_pos(tag)) for word, tag in tag_sent]  # converting to wordnet pos
print('\n')
print("Wordnet PoS: ", words)
Tokenized Sentence: ['We', 'use', 'data', ',', 'statistical', 'algorithms', ',', 'and', 'machine', 'learning', 'to', 'help', 'you', 'make', 'business', 'decisions', 'and', 'targeted', 'digital', 'marketing', 'efforts', 'based', 'on', 'potential', 'outcomes', '.']
Penn Treebank PoS: [('We', 'PRP'), ('use', 'VBP'), ('data', 'NNS'), (',', ','), ('statistical', 'JJ'), ('algorithms', 'NN'), (',', ','), ('and', 'CC'), ('machine', 'NN'), ('learning', 'NN'), ('to', 'TO'), ('help', 'VB'), ('you', 'PRP'), ('make', 'VB'), ('business', 'NN'), ('decisions', 'NNS'), ('and', 'CC'), ('targeted', 'JJ'), ('digital', 'JJ'), ('marketing', 'NN'), ('efforts', 'NNS'), ('based', 'VBN'), ('on', 'IN'), ('potential', 'JJ'), ('outcomes', 'NNS'), ('.', '.')]
Wordnet PoS: [('We', 'n'), ('use', 'v'), ('data', 'n'), (',', 'n'), ('statistical', 'a'), ('algorithms', 'n'), (',', 'n'), ('and', 'n'), ('machine', 'n'), ('learning', 'n'), ('to', 'n'), ('help', 'v'), ('you', 'n'), ('make', 'v'), ('business', 'n'), ('decisions', 'n'), ('and', 'n'), ('targeted', 'a'), ('digital', 'a'), ('marketing', 'n'), ('efforts', 'n'), ('based', 'v'), ('on', 'n'), ('potential', 'a'), ('outcomes', 'n'), ('.', 'n')]
Note how the above output contains punctuation. After we lemmatize our tokens, use some of the general filtering techniques or functions discussed above to clean documents and remove punctuation and stopwords. The code below demonstrates this and presents us with a clean set of tokens. However, depending upon your analysis you may wish to retain your PoS tags. For example, named entity recognition (i.e., the identification of a person, location, organization, or product in unstructured text) requires PoS tags.
import string

import nltk
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer


def wordnet_pos(pos):
    """Converts Penn Treebank PoS to Wordnet PoS

    Arg:
        pos(str): Penn treebank PoS tag
    Returns:
        str: wordnet PoS tag
    """
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}
    # Looks up the first letter of the tag. Returns noun if not found
    # to avoid a lemmatization error.
    return tag_dict.get(pos[0], wordnet.NOUN)


lemmatizer = WordNetLemmatizer()  # initializing lemmatizer
tokenized_docs = []
for doc in doc_examples:
    d = []
    sentences = nltk.sent_tokenize(doc)  # creates a list of sentences
    for sentence in sentences:
        tokenized_sentence = nltk.word_tokenize(sentence)  # tokenizes sentence
        tagged = nltk.pos_tag(tokenized_sentence)  # pos tagging
        for word, tag in tagged:
            # Filtering punct & stopwords (sw is the stopword list built earlier).
            # The check is case-sensitive, so a capitalized stopword such as "We" is kept.
            if word not in sw and word not in string.punctuation:
                lemma_tkn = lemmatizer.lemmatize(word, pos=wordnet_pos(tag))  # lemmatization
                d.append(lemma_tkn)
    tokenized_docs.append(d)

print("Lemmatized Tokens: ", tokenized_docs[0])  # showing the first document only
Lemmatized Tokens: ['We', 'use', 'data', 'statistical', 'algorithm', 'machine', 'learning', 'help', 'make', 'business', 'decision', 'targeted', 'digital', 'marketing', 'effort', 'base', 'potential', 'outcome']
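If your analysis needs the PoS tags, as noted above, one option is to filter on the token while keeping each (token, tag) pair intact. The sketch below is a minimal illustration of that idea using tagged pairs copied from the tagger output shown earlier; the two-word stopword set is hypothetical.

```python
import string

# (token, Penn Treebank tag) pairs copied from the tagger output shown earlier.
tagged = [("We", "PRP"), ("use", "VBP"), ("data", "NNS"), (",", ","),
          ("statistical", "JJ"), ("algorithms", "NN"), (",", ","),
          ("and", "CC"), ("machine", "NN"), ("learning", "NN")]

# Hypothetical mini stopword list for this illustration.
stopwords_demo = {"we", "and"}

# Filter on the token but keep the (token, tag) pair intact.
kept = [(tok, tag) for tok, tag in tagged
        if tok.lower() not in stopwords_demo and tok not in string.punctuation]

# The tags remain available, for example to pull out only the nouns.
nouns = [tok for tok, tag in kept if tag.startswith("NN")]
print(kept)
print(nouns)
```

This version lowercases each token only for the stopword comparison, so the original casing is preserved for any later step that depends on it.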
In this post, we’ve defined some basic terms for natural language processing and discussed how to prepare text for analysis using Python’s Natural Language Toolkit. Specifically, we discussed how to tokenize and filter text, stem and lemmatize tokens, and tag the parts of speech of tokens for subsequent analysis. Although our context is the analysis of open-ended survey questions, the above techniques will work with any body of text.
In our next post, we’ll discuss how to detect duplicate texts and perform named entity recognition to identify people, locations, and products in text.