
Natural Language Processing in Python with NLTK: Tokenizers

Following sentdex's 4-hr NLTK with Python 3 for Natural Language Processing playlist as an introduction to NLP.


  Garrett Mayock posted 2019-03-17 19:13:25 UTC

Natural Language Processing (NLP) in Python with Natural Language Toolkit (NLTK)

This blog post will be about Natural Language Processing, or NLP (be careful not to confuse natural language processing with neuro-linguistic programming, which shares the acronym). The toolkit I'll use for this blog post is NLTK, the "Natural Language Toolkit", which is one of the leading NLP toolkits for Python, if not the leading one.

Since it's the weekend, I'm taking some time away from the supply chain textbook. I'm very interested in NLP for a cool personal project of mine (still in stealth mode, of course), so I'm going to start learning. NLP is a very large field, with people earning MS and PhD degrees in computational linguistics and the like, so obviously this isn't going to be an overnight study. In fact, the NLTK website hosts a book, Natural Language Processing with Python, which ran over 500 pages in its printed Python 2 edition (the online version is updated for Python 3 and NLTK 3, but is not printed). Eventually I'll go through it.

Nevertheless, every journey begins with a single step, so here are my first steps with NLP. I'm going to be following along with sentdex's NLTK with Python 3 for Natural Language Processing playlist, which has 21 videos totaling just under four hours. As I follow the videos, I'll take notes here on the blog, do my own little exercises alongside the video exercises, and link to the reference docs I read.

Tokenizing

To begin, we’ll discuss tokenizing. In NLP, tokenizing is the process of breaking up bodies of text (called corpora) into individual linguistic units. These can be paragraphs, sentences, words, etc.  

Tokenizing is done with something called a "tokenizer". There are two main kinds: a word tokenizer, which separates text into words, and a sentence tokenizer, which separates it into sentences.

Paragraph tokenizers aren't very common because paragraphs are generally easy to split apart – plain text by line breaks, web pages by <p></p> tags, etc. Sentences and words get more complex, with rules like abbreviations ("Hello Mr. Smith" is one sentence, so naively breaking on "." is out, for example) and so on.
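To see how simple the paragraph case can be, here's a minimal sketch (assuming paragraphs are separated by blank lines – the raw_text string is just a made-up example):

raw_text = "First paragraph.\n\nSecond paragraph, still going.\n\nThird."

# Split on blank lines and drop any empty leftovers:
paragraphs = [p.strip() for p in raw_text.split("\n\n") if p.strip()]
print(paragraphs)
# ['First paragraph.', 'Second paragraph, still going.', 'Third.']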

To show this, we'll take a quick look at NLTK's sentence tokenizer (sent_tokenize) and word tokenizer (word_tokenize). I'll grab a paragraph from a recent article on medicalxpress.com that I found linked from phys.org.

If we use the sentence and word tokenizers, we can see the text broken up into a list of the respective linguistic units. The code:

from nltk.tokenize import sent_tokenize, word_tokenize

# If needed, download the Punkt sentence models first:
# import nltk; nltk.download('punkt')

# Stand-in for the paragraph quoted from the article (illustrative text;
# the original paragraph isn't reproduced in this post):
example_text_1 = ("The trial enrolled over 1,000 patients (all at low "
                  "surgical risk). Most were walking within a day of surgery.")

sent_tokens = sent_tokenize(example_text_1)
word_tokens = word_tokenize(example_text_1)
print(sent_tokens, "\n\n", word_tokens, "\n\n")

Which results in:
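With the stand-in text above, the output is along these lines:

['The trial enrolled over 1,000 patients (all at low surgical risk).', 'Most were walking within a day of surgery.']

['The', 'trial', 'enrolled', 'over', '1,000', 'patients', '(', 'all', 'at', 'low', 'surgical', 'risk', ')', '.', 'Most', 'were', 'walking', 'within', 'a', 'day', 'of', 'surgery', '.']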

Notice how the word tokenizer makes punctuation like parentheses and periods their own items in the list. This is helpful – it prevents a word like "surgery" at the end of a sentence from becoming the token "surgery." and breaking exact-match searches – but blindly applying that rule to every "." would cause problems. Take, for example, a sample containing the name Martin B. Leon.

A naive tokenizer that split on every "." would break the sentence at the middle initial. The sentence tokenizer understands this, however, and keeps it together as one sentence:
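Here's a minimal sketch (the sample sentence is an illustrative stand-in for the article text, not a quote from it):

# Stand-in sample containing a middle initial:
example_text_2 = ("Martin B. Leon presented the trial results at the "
                  "conference. The audience applauded.")

print(sent_tokenize(example_text_2))
# ['Martin B. Leon presented the trial results at the conference.',
#  'The audience applauded.']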

However, the "." is still part of the token "B." in this case – it's an abbreviation, and we'd lose information about the word if we stripped the period from it. Luckily, the word tokenizer knows this too:
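Continuing with the same stand-in text:

print(word_tokenize(example_text_2))
# ['Martin', 'B.', 'Leon', 'presented', 'the', 'trial', 'results', 'at',
#  'the', 'conference', '.', 'The', 'audience', 'applauded', '.']

Note that the abbreviation period stays attached to 'B.' while the sentence-final periods become their own tokens.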

That's it for the first post - I'll try to keep these NLP blogs a tad shorter than the recent supply chain posts, as each will cover a small subject area.

Appendix A: additional definitions

Lexicon

A lexicon is a dictionary – words and their meanings. It can also refer to the specialized vocabulary of a profession, field of study, or author – for example, a financier's definition of "bull" refers to the stock market, whereas a farmer's refers to the animal.

Corpora

Corpora (the plural of corpus, owing to its Latin roots) are bodies of text. Generally, the term refers to a collection of texts that are all roughly about the same thing – a body of medical journals, presidential speeches, or even a language (such as American English, British English, German, etc.).
