As anyone who has been to couples counseling knows, communication is key. How we communicate our thoughts, feelings, and intentions is just as important as the words themselves. As difficult as it can be to successfully communicate with another person, it is even more challenging to teach the subtleties of human communication to a computer.
Natural language processing (NLP) is the theory and technology that lays the foundation for building artificial systems that are able to understand, analyze, manipulate, and generate human language. In other words, it is a machine learning approach to the intelligent analysis of language.
NLP applications include:
- Sentiment analysis: Classifying the emotion or intent behind text content, for example, whether a restaurant review is good or bad. Algorithms and humans both look to adjectives like “yummy” or “disgusting” to clue them in on intent.
- Information extraction: Extracting structured data from text, for example, the relationship between a state and its capital, or a specific person and their occupation.
- Information retrieval or search: Think of Google: you type a few words into its search window (i.e., make a query), and the algorithm tries to retrieve the document you are looking for.
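As a toy illustration of the sentiment analysis idea above, here is a minimal lexicon-based scorer. The word lists and the reviews are invented for this sketch; real sentiment systems learn these associations from labeled data rather than from hand-written lists.

```python
# Toy lexicon-based sentiment scorer (illustrative only; real systems
# use trained models, not hand-written word lists).
POSITIVE = {"yummy", "delicious", "great", "friendly"}
NEGATIVE = {"disgusting", "bland", "rude", "slow"}

def sentiment(review: str) -> str:
    words = review.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "good"
    if score < 0:
        return "bad"
    return "neutral"

print(sentiment("The food was yummy and the staff friendly"))  # good
print(sentiment("Disgusting soup and rude service"))           # bad
```

Even this crude counting of adjectives captures the intuition that words like “yummy” and “disgusting” carry most of the signal.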
NLP is already incorporated in our everyday life through:
- Email filters: categorizing emails as spam or promotions
- Smart assistants: Amazon’s Alexa, Apple’s Siri, and Google Assistant
- Predictive text: type in “How are” and it guesses “you?”
- Language translation: Wie geht’s? → What’s up?
- Digital phone calls: Not just robocalls! You’ve heard “this call may be recorded for training purposes.” If you’re like me, you probably never gave it much thought and assumed that those being trained with said recordings were human, but in reality they are mostly used to train models.
- Chatbots: i.e., the gatekeepers between you and the customer service representative you want to talk to.
Despite all of these advances, anyone who has used a smart assistant or tried to troubleshoot with a chatbot knows there is still a long road from parlor trick to genuine usefulness.
Why is NLP so challenging?
First of all, there are the technical aspects of pre-processing data. You can think of it as a translation process. Think about learning a new language, especially one with a different alphabet or characters. At first a passage appears as just a blur of meaningless shapes; it is difficult to even distinguish individual words in the mass. Then you begin to learn the alphabet, notice patterns, and start learning words, then phrases and idioms, and finally true fluency comes when you can play with the language and make it your own. Let’s take a quick look at some of the steps machines take to learn human language.
- Remove punctuation: While punctuation provides grammatical context for human understanding, a machine using a vectorizer counts words rather than context, so punctuation does not add value and we remove all special characters. E.g.: What is your name? → What is your name
- Tokenization: Tokenizing separates text into units, e.g.: He is tall → “he”, “is”, “tall”. This adds structure to the text.
- Remove stopwords: Stopwords are common words, such as “the”, “a”, or “is”, that appear throughout any normal text but don’t actually give us much information. They are the glue rather than the substance. So a model would take the sentence “Ana goes to the store to buy some bread” and keep only Ana, store, buy, bread.
- Stemming: Stemming reduces words to their stems by removing suffixes like “ing”, “ly”, “s”, etc. Lighting, lightly, and lights are all truncated to “light”.
- Lemmatizing: Lemmatizing seeks out the root (lemma) of a word. There is a little more finesse to lemmatizing than to stemming, which just chops off the ends of words. Lemmatizing would group “lighting” and “lights” together, but place “lightly” in its own category.
- Vectorizing data: This is the process of translating text into numbers. ML models work with numbers rather than letters, so our language must be “translated” into a form they can understand.
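The steps above can be sketched as a tiny pipeline in plain Python. This is a deliberately crude sketch: the stopword list is invented (and padded so the output matches the “Ana goes to the store” example), the stemmer is naive suffix chopping, and the vectorizer is a simple bag-of-words count. Real pipelines use libraries such as NLTK, spaCy, or scikit-learn.

```python
import re
from collections import Counter

# Tiny invented stopword list, chosen so the example below matches the text.
STOPWORDS = {"the", "a", "is", "to", "some", "goes"}

def remove_punctuation(text):
    # Strip everything that is not a letter, digit, or whitespace.
    return re.sub(r"[^\w\s]", "", text)

def tokenize(text):
    # Lowercase and split on whitespace: one token per word.
    return text.lower().split()

def remove_stopwords(tokens):
    return [t for t in tokens if t not in STOPWORDS]

def stem(token):
    # Crude suffix chopping; a real stemmer (e.g. Porter) is far subtler.
    for suffix in ("ing", "ly", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def vectorize(tokens):
    # Bag-of-words counts: the "translation" of text into numbers.
    return Counter(tokens)

text = "Ana goes to the store to buy some bread."
tokens = remove_stopwords(tokenize(remove_punctuation(text)))
print(tokens)                                            # ['ana', 'store', 'buy', 'bread']
print([stem(w) for w in ["lighting", "lightly", "lights"]])  # ['light', 'light', 'light']
print(vectorize(tokens))
```

Note how the naive stemmer collapses “lighting”, “lightly”, and “lights” into the same stem, exactly the behavior the lemmatizing step above is designed to refine.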
Once the data is preprocessed, a little more art and finesse come in through feature engineering and model selection:
- Feature engineering: This process requires domain knowledge of the data to create features. Features are attributes or properties shared by all independent units that give a model context to make a prediction or analysis. This is what makes machine learning actionable.
- Model selection: Different models handle different types of tasks, so it is important to choose the right model, in the right order, to get the results you are aiming for. A favored tactic is to create an ensemble of models that work in concert to properly classify and make sense of the data.
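To make the ensemble idea concrete, here is a minimal majority-vote sketch. The three “models” are invented rule-based stand-ins (a keyword check, an all-caps check, and a link check for the spam-filter example from earlier); a real ensemble would combine trained classifiers instead.

```python
from collections import Counter

# Three toy rule-based "models" for spam filtering (illustrative only).
def model_keywords(text):
    return "spam" if "winner" in text.lower() else "ham"

def model_shouting(text):
    # Treat mostly-uppercase messages as spam.
    letters = [c for c in text if c.isalpha()]
    upper = sum(c.isupper() for c in letters)
    return "spam" if letters and upper / len(letters) > 0.5 else "ham"

def model_links(text):
    return "spam" if "http" in text.lower() else "ham"

def ensemble(text):
    # Majority vote across the individual models.
    votes = [m(text) for m in (model_keywords, model_shouting, model_links)]
    return Counter(votes).most_common(1)[0][0]

print(ensemble("CONGRATULATIONS WINNER click http://example.com"))  # spam
print(ensemble("Lunch at noon tomorrow?"))                          # ham
```

The point of the vote is robustness: any single rule is easy to fool, but a message has to trip a majority of them to be classified as spam.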
If what is described above is the engine, labeled data is the fuel. Excellent architecture, domain knowledge, and high-quality training data are all needed to get any NLP project off the ground. Next week in Part II: Understanding the challenges of NLP through Wittgenstein, we will look at what we can learn about communication from the philosopher Ludwig Wittgenstein, a central figure in the linguistic turn, and how it applies to NLP.