NSLinguisticTagger, which has been available since the iOS 5 SDK, paved the way for Apple’s announcement of the Natural Language framework at WWDC 2018. Everything from language identification to lemmatization and part-of-speech tagging, all of which was present in NSLinguisticTagger, is now part of the Natural Language framework, with an API that’s been completely redesigned in Swift.
The added benefit that the Natural Language framework has over NSLinguisticTagger is the ability to use custom NLP models.
During WWDC 2019, Apple announced several powerful new tools for its Natural Language framework. Three notable ones were introduced:
- Built-in sentiment analysis
- Word embedding
- Text catalog
Before we delve deeper into each of these features, let’s quickly walk through the important APIs that are already present in Apple’s Natural Language framework.
Exploring Apple’s NLP APIs
Natural language processing is responsible for taking unstructured text data as input and inferring a number of possible observations on it. The following are some of the key APIs that are used for NLP in iOS to process text in intelligent ways.
The NLLanguageRecognizer helps us determine a piece of text’s language from a string, as shown below:
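The original listing isn’t reproduced here, so the following is a minimal sketch rather than the author’s code. The helper name `dominantLanguage(of:)` and the sample sentence are assumptions made for illustration.

```swift
import NaturalLanguage

// Hypothetical helper (not from the article): returns the most likely
// language of `text`, or nil if the recognizer can't decide.
func dominantLanguage(of text: String) -> NLLanguage? {
    let recognizer = NLLanguageRecognizer()
    recognizer.processString(text)
    return recognizer.dominantLanguage
}

// Made-up sample text.
let sample = "Bonjour tout le monde, comment allez-vous aujourd'hui ?"
if let language = dominantLanguage(of: sample) {
    print(language.rawValue) // prints the detected language code
}
```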
The NLLanguageRecognizer class makes it possible to determine the dominant language code from the text’s context. Additionally, the API provides functionality to return the confidence level of the language that’s predicted.
The following function returns the top 5 languages predicted as a dictionary of the language code and a probability value:
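A function along those lines could look like the sketch below. The name `languageHypotheses(for:maximum:)` and the sample sentence are assumptions; the probabilities depend on the OS version, so none are shown.

```swift
import NaturalLanguage

// Hypothetical helper: returns up to `maximum` candidate languages
// mapped to the probability the recognizer assigns to each.
func languageHypotheses(for text: String, maximum: Int = 5) -> [NLLanguage: Double] {
    let recognizer = NLLanguageRecognizer()
    recognizer.processString(text)
    return recognizer.languageHypotheses(withMaximum: maximum)
}

let hypotheses = languageHypotheses(for: "This is a short English sentence.")
// Print the candidates from most to least likely.
for (language, probability) in hypotheses.sorted(by: { $0.value > $1.value }) {
    print(language.rawValue, probability)
}
```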
In my testing, Apple’s language identification API was less accurate than its Firebase counterpart. Languages such as Hindi (the one I tested on a wide variety of texts) currently aren’t identified from context by NLLanguageRecognizer.
Tokenization is the process of splitting a string into chunks of words, sentences, characters, or paragraphs. These segmented texts can then be processed together or separately, depending on the use case. To tokenize a string, the NLTokenizer class is used.
We need to specify the unit type on which the string should be tokenized. The text is then segmented based on the unit assigned. In the code below, each of the questions is split into a separate token:
Lemmatization is the process of converting a word into its base form. Oftentimes you’ll come across use cases in your NLP applications where the only difference between a few words is the tense in which they’re used. For example, “assumed” and “assuming” are different forms of the word “assume” and possess the same core meaning.
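Since the original listing is missing, here is a hedged sketch of sentence-level tokenization. The helper `tokens(in:unit:)` and the sample questions are made up for illustration.

```swift
import NaturalLanguage

// Hypothetical helper: splits `text` into tokens of the given unit.
func tokens(in text: String, unit: NLTokenUnit) -> [String] {
    let tokenizer = NLTokenizer(unit: unit)
    tokenizer.string = text
    var result: [String] = []
    tokenizer.enumerateTokens(in: text.startIndex..<text.endIndex) { range, _ in
        result.append(String(text[range]))
        return true // keep enumerating
    }
    return result
}

// Made-up sample text with three questions.
let questions = "What is your name? Where do you live? How old are you?"
for sentence in tokens(in: questions, unit: .sentence) {
    print(sentence)
}
```

Passing `.word`, `.paragraph`, or `.document` as the unit changes the granularity without any other change to the helper.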
Lemmatization is commonly used in word tagging and fuzzy matching (which identifies misspelled words, like what you’d see in a Google search). The following code shows how lemmatization is implemented with NLTagger:
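What follows is a sketch under stated assumptions, not the article’s original listing: the helper `lemmas(in:)` and the sample sentence are hypothetical, and the lemma is nil for words the tagger can’t resolve.

```swift
import NaturalLanguage

// Hypothetical helper: pairs each word with its lemma (nil if unknown).
func lemmas(in text: String) -> [(word: String, lemma: String?)] {
    let tagger = NLTagger(tagSchemes: [.lemma])
    tagger.string = text
    var result: [(word: String, lemma: String?)] = []
    tagger.enumerateTags(in: text.startIndex..<text.endIndex,
                         unit: .word,
                         scheme: .lemma,
                         options: [.omitWhitespace, .omitPunctuation]) { tag, range in
        result.append((word: String(text[range]), lemma: tag?.rawValue))
        return true // keep enumerating
    }
    return result
}

// Made-up sample sentence.
for (word, lemma) in lemmas(in: "He assumed nothing while assuming everything.") {
    print(word, "->", lemma ?? "(no lemma)")
}
```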
Parts-of-Speech and Named Entity Recognition
The Natural Language framework can also classify the words in a sentence or transcribed speech as nouns, pronouns, verbs, adjectives, prepositions, idioms, etc. Part-of-speech tagging is done using NLTagger.
This isn’t as straightforward as it sounds, since the same word can be tagged as a verb or a noun based on the semantic context. The Natural Language framework determines the appropriate lexical class.
In the following code, we’ve used the same NLTagger as previously, but with a different scheme specified:
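As a rough sketch (the original listing isn’t available), the only change from lemmatization is the `lexicalClass` scheme. The helper name and the sample sentence are assumptions; the tags the framework returns aren’t reproduced here.

```swift
import NaturalLanguage

// Hypothetical helper: pairs each word with its lexical class, if any.
func lexicalClasses(in text: String) -> [(word: String, tag: String?)] {
    let tagger = NLTagger(tagSchemes: [.lexicalClass])
    tagger.string = text
    var result: [(word: String, tag: String?)] = []
    tagger.enumerateTags(in: text.startIndex..<text.endIndex,
                         unit: .word,
                         scheme: .lexicalClass,
                         options: [.omitWhitespace, .omitPunctuation]) { tag, range in
        result.append((word: String(text[range]), tag: tag?.rawValue))
        return true // keep enumerating
    }
    return result
}

// Made-up sentence in which "laugh" plays two grammatical roles.
for (word, tag) in lexicalClasses(in: "They laugh at his loud laugh.") {
    print(word, tag ?? "unknown")
}
```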
In the result below, you can see that the word “laugh” was tagged as a noun and a verb in different parts of the text.
Named entity recognition works much like PoS tagging. We can identify names of people, places, and organizations in similar fashion by setting the tag scheme to nameType and looping through the tags. Moreover, for our specific use cases, we can set our custom Core ML models on the NLTagger class, as shown below:
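The sketch below covers both ideas and is not the author’s code. The helper names, the `"CustomScheme"` scheme name, and the model URL parameter are all hypothetical; the custom model is assumed to be a compiled Create ML word tagger.

```swift
import NaturalLanguage

// Hypothetical helper: collects people, places, and organizations.
func namedEntities(in text: String) -> [(entity: String, type: NLTag)] {
    let tagger = NLTagger(tagSchemes: [.nameType])
    tagger.string = text
    let wanted: Set<NLTag> = [.personalName, .placeName, .organizationName]
    var result: [(entity: String, type: NLTag)] = []
    tagger.enumerateTags(in: text.startIndex..<text.endIndex,
                         unit: .word,
                         scheme: .nameType,
                         // joinNames keeps multi-word names as one token
                         options: [.omitWhitespace, .omitPunctuation, .joinNames]) { tag, range in
        if let tag = tag, wanted.contains(tag) {
            result.append((entity: String(text[range]), type: tag))
        }
        return true
    }
    return result
}

// Hypothetical scheme name for a custom word-tagging model.
let customScheme = NLTagScheme(rawValue: "CustomScheme")

// Attaches a custom model (assumed to live at `compiledModelURL`) to a tagger.
func makeCustomTagger(compiledModelURL: URL) throws -> NLTagger {
    let model = try NLModel(contentsOf: compiledModelURL)
    let tagger = NLTagger(tagSchemes: [customScheme])
    tagger.setModels([model], forTagScheme: customScheme)
    return tagger
}
```

Tags are then enumerated with `customScheme` in place of a built-in scheme.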
Now that we’ve had a good look at the older Natural Language framework tools, let’s dig deep into the newer ones.
Built-in Sentiment Analysis
Text classification got a boost with the inclusion of the Sentiment Analysis API. It analyses the degree of sentiment in the text, and based on that gives a score that ranges from -1 (highly negative) to 1 (very positive). Currently, the Natural Language framework supports 7 languages for sentiment analysis.
To use built-in sentiment analysis on a piece of text, simply pass the sentimentScore tag scheme to the NLTagger instance, as shown below:
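A minimal sketch, assuming iOS 13 or macOS 10.15 and later. The helper `sentimentScore(of:)` and the sample review are made up; the score comes back as a string tag, so it’s parsed into a Double.

```swift
import NaturalLanguage

// Hypothetical helper: returns the paragraph-level sentiment score,
// or nil if no score is available.
func sentimentScore(of text: String) -> Double? {
    guard !text.isEmpty else { return nil }
    let tagger = NLTagger(tagSchemes: [.sentimentScore])
    tagger.string = text
    let (tag, _) = tagger.tag(at: text.startIndex,
                              unit: .paragraph,
                              scheme: .sentimentScore)
    // The tag's raw value is the score encoded as a string.
    return tag.flatMap { Double($0.rawValue) }
}

// Made-up sample review.
if let score = sentimentScore(of: "The battery life is wonderful and I love the screen.") {
    print(score)
}
```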
Changing the unit to word or sentence would not work. For that, we’ll need to enumerate over the text as we did before. The sentiment score that gets assigned to each word or sentence is the same as the whole text; hence, it’s recommended to specify paragraph as the unit.
Word tagging is the other important aspect of NLP and word embedding is a part of it. Word embedding basically maps strings to their vector counterparts. In doing so, strings that have small vector distances are deemed similar.
The following diagram showcases a few random strings placed in the coordinate space. You can see that the semantically similar ones are clustered together:
Word embedding is a crucial part of search engines and indexing search, as it’s pretty common to search terms that are not directly present in the search index. For such cases, by using word embedding we can retrieve the closest possible matches.
Currently, the Natural Language framework supports built-in OS embeddings in 7 languages: English, Spanish, French, Italian, German, Portuguese, and Simplified Chinese. We can also create our own custom word embeddings, as we’ll see shortly.
An NLEmbedding class instance is instantiated as follows:
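A short sketch, in place of the missing listing. The factory method returns an optional, since not every language has a built-in embedding.

```swift
import NaturalLanguage

// nil when the OS has no built-in word embedding for the language.
let embedding = NLEmbedding.wordEmbedding(for: .english)

if let embedding = embedding {
    // Vector length and number of words the embedding knows about.
    print(embedding.dimension, embedding.vocabularySize)
}
```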
Currently, OS embeddings require the word to be in lowercase only. Not doing so returns no result. The following method is used to retrieve the vector representation of a word:
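One way to guard against the casing issue is to lowercase the word before the lookup. The helper `vector(for:in:)` is a hypothetical wrapper, not part of the framework.

```swift
import NaturalLanguage

// Hypothetical wrapper: lowercases the word before asking for its vector.
func vector(for word: String, in embedding: NLEmbedding) -> [Double]? {
    return embedding.vector(for: word.lowercased())
}

if let embedding = NLEmbedding.wordEmbedding(for: .english),
   let values = vector(for: "Cheese", in: embedding) {
    print(values.count) // number of components in the vector
}
```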
The distance that’s computed between two words is a cosine distance, and in cases where it can’t be computed (for example, when a word doesn’t exist in the built-in OS embedding), the value returned is 2.0.
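For illustration, a distance lookup might be sketched as follows. The helper name and the word pair are assumptions, and no particular distance value is claimed.

```swift
import NaturalLanguage

// Hypothetical wrapper: cosine distance between two lowercased words.
func cosineDistance(_ first: String, _ second: String, in embedding: NLEmbedding) -> NLDistance {
    return embedding.distance(between: first.lowercased(),
                              and: second.lowercased(),
                              distanceType: .cosine)
}

if let embedding = NLEmbedding.wordEmbedding(for: .english) {
    print(cosineDistance("movie", "film", in: embedding))
}
```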
To find the top K most similar words, we can enumerate this request in the following way:
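A sketch of the nearest-neighbor query, with a made-up query word and K set to 5. The closure returns true to continue and false to stop early.

```swift
import NaturalLanguage

if let embedding = NLEmbedding.wordEmbedding(for: .english) {
    // Visit up to 5 neighbors of the (made-up) query word.
    embedding.enumerateNeighbors(for: "cheese", maximumCount: 5) { neighbor, distance in
        print(neighbor, distance)
        return true // keep enumerating
    }

    // The same query returned as an array of (word, distance) pairs.
    let nearest = embedding.neighbors(for: "cheese", maximumCount: 5)
    print(nearest.count)
}
```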
For specific use cases, custom word embeddings can be built using GloVe, Word2Vec, BERT, and FastText datasets. For demonstration purposes, we create a vector dictionary, as shown below:
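The sketch below assumes macOS 10.15 or later, where Create ML is available (for example, in a macOS playground). The vectors are toy values invented for illustration, and the function name and output file name are hypothetical.

```swift
import CreateML
import Foundation

// Hypothetical helper: builds an embedding from toy vectors and
// writes it to a temporary .mlmodel file.
func writeToyEmbedding() throws -> URL {
    // Made-up 4-dimensional vectors; real ones would come from a
    // trained model such as GloVe or Word2Vec.
    let vectors: [String: [Double]] = [
        "hello":   [0.0, 1.2, 5.0, 0.0],
        "hi":      [0.0, 1.3, 4.8, 0.1],
        "goodbye": [2.1, -0.4, 0.3, 3.0]
    ]
    let embedding = try MLWordEmbedding(dictionary: vectors)
    let outputURL = FileManager.default.temporaryDirectory
        .appendingPathComponent("ToyEmbedding.mlmodel")
    try embedding.write(to: outputURL)
    return outputURL
}
```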
MLWordEmbedding uses an automatic compression technique that can compress gigabytes of data into a very tiny Core ML model.
To use the Core ML model with NLEmbedding, we need to pass in the URL of the compiled model, as shown below:
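A hedged sketch of the loading step. It assumes the .mlmodel file still needs compiling at runtime; when the model is bundled through Xcode, it’s already compiled and that step can be skipped. The function name is hypothetical.

```swift
import CoreML
import NaturalLanguage

// Hypothetical helper: compiles a .mlmodel file and loads it as an embedding.
func loadCustomEmbedding(modelURL: URL) throws -> NLEmbedding {
    // Compilation produces a .mlmodelc directory at a temporary location.
    let compiledURL = try MLModel.compileModel(at: modelURL)
    return try NLEmbedding(contentsOf: compiledURL)
}
```

The returned embedding supports the same vector, distance, and neighbor queries as the built-in OS embeddings.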
Text Catalog is a newly added functionality in the Natural Language Framework that allows us to customize word tagging models without the need to create a new word tagging model.
Instead, we create a dictionary that maps each label to the terms it covers and pass it to Create ML’s MLGazetteer model type. The output model that’s generated is an efficient (in terms of space and speed) form of the input dictionary that can then be used with an NLTagger:
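As with custom embeddings, this step needs Create ML on macOS 10.15 or later. The labels, terms, function name, and file name below are all invented for illustration.

```swift
import CreateML
import Foundation

// Hypothetical helper: builds a gazetteer from a toy catalog and
// writes it to a temporary .mlmodel file.
func writeToyGazetteer() throws -> URL {
    // Made-up catalog: each label maps to the terms that should receive it.
    let catalog: [String: [String]] = [
        "Movie": ["Inception", "The Matrix"],
        "Band":  ["Daft Punk", "Radiohead"]
    ]
    let gazetteer = try MLGazetteer(dictionary: catalog)
    let outputURL = FileManager.default.temporaryDirectory
        .appendingPathComponent("ToyGazetteer.mlmodel")
    try gazetteer.write(to: outputURL)
    return outputURL
}
```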
Finally, we pass the compiled model URL to the NLGazetteer and set it in the NLTagger:
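This step might be sketched as follows. The function name and the URL parameter are hypothetical, and the URL is assumed to point at an already compiled gazetteer model.

```swift
import NaturalLanguage

// Hypothetical helper: loads a compiled gazetteer and attaches it to a
// tagger for the nameType scheme.
func makeGazetteerTagger(compiledGazetteerURL: URL) throws -> NLTagger {
    let gazetteer = try NLGazetteer(contentsOf: compiledGazetteerURL)
    let tagger = NLTagger(tagSchemes: [.nameType])
    tagger.setGazetteers([gazetteer], for: .nameType)
    return tagger
}
```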
The NLTagger tags the name types with the tags specified in the Gazetteer:
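To illustrate without a model file, the sketch below builds a small in-memory gazetteer with `NLGazetteer(dictionary:language:)` instead of loading a compiled one. The catalog, sample sentence, and helper name are assumptions, and the tagger’s output isn’t reproduced here.

```swift
import NaturalLanguage

// Hypothetical helper: returns the spans that `tagger` labels under
// the nameType scheme, together with each label's raw value.
func nameTags(in text: String, using tagger: NLTagger) -> [(text: String, label: String)] {
    tagger.string = text
    var result: [(text: String, label: String)] = []
    tagger.enumerateTags(in: text.startIndex..<text.endIndex,
                         unit: .word,
                         scheme: .nameType,
                         options: [.omitWhitespace, .omitPunctuation, .joinNames]) { tag, range in
        // Skip ordinary words that carry no name label.
        if let tag = tag, tag != .otherWord {
            result.append((text: String(text[range]), label: tag.rawValue))
        }
        return true
    }
    return result
}

// Made-up in-memory catalog standing in for a compiled gazetteer.
let gazetteer = try NLGazetteer(dictionary: ["Movie": ["Inception"]], language: .english)
let tagger = NLTagger(tagSchemes: [.nameType])
tagger.setGazetteers([gazetteer], for: .nameType)

for (span, label) in nameTags(in: "We watched Inception last night.", using: tagger) {
    print(span, label)
}
```

The gazetteer can also be queried directly with `label(for:)`, which is handy for checking that a term made it into the catalog.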
We explored the different functionalities that the Natural Language framework provides and looked at their use case examples side-by-side.
Natural language processing is a complex yet powerful field of study and application. It’s complex in the sense that we’re dealing with unstructured data, namely human language, where the same word can be labeled differently depending on the context it appears in.
Create ML allows us to build our own text classifiers and word taggers, and customize them by using word embeddings and text catalogs. Now that we’ve had a good look at the framework, you can go ahead and try using NLP to build intelligent text processing applications. Something like determining similar sentences using word embeddings would be a really interesting thing to implement.
That’s it for this one. I hope you enjoyed reading.