Quick tip on how to split text into words with the Natural Language framework in Swift.
Learn how the tokenizer of the Natural Language framework works.
19 Jun 2023 · 2 min read
When it comes to splitting a text into words, the first solution that might come to mind is to simply split the string by whitespace. But this solution doesn't work for all languages. For example, Chinese and Japanese don't use spaces to delimit words.
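To make the problem concrete, here is a minimal sketch of the naive approach. The helper name naiveWords(for:) is made up for this illustration, and the Chinese sample sentence is just an example input.

```swift
// Naive approach: split on whitespace only.
// `naiveWords(for:)` is a hypothetical helper used for illustration.
func naiveWords(for text: String) -> [String] {
    text.split(whereSeparator: \.isWhitespace).map(String.init)
}

let english = naiveWords(for: "Hello world")
// The Chinese sentence contains no whitespace, so it comes back as one chunk.
let chinese = naiveWords(for: "我喜欢苹果")
print(english.count, chinese.count)
```

The English sentence is split into two words, while the Chinese sentence stays a single element because there is nothing to split on.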

The Natural Language framework can tokenize a string in a language-aware way, so it also handles languages that don't separate words with spaces. For that, the framework provides the NLTokenizer type:
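import NaturalLanguage

func words(for text: String) -> [String] {
    // Create a tokenizer that splits on word boundaries.
    let tokenizer = NLTokenizer(unit: .word)
    tokenizer.string = text
    // tokens(for:) returns ranges into the text; map each range to its substring.
    return tokenizer.tokens(for: text.startIndex..<text.endIndex).map { String(text[$0]) }
}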
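As shown above, we initialize the tokenizer with .word as its unit, assign the text to its string property, and then simply call the tokens(for:) method, which returns the ranges corresponding to the tokens. Mapping each range back onto the text gives us the words as strings.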
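If you don't need all tokens at once, NLTokenizer also offers enumerateTokens(in:using:), which hands over one range at a time and stops as soon as the closure returns false. The following is only a sketch of that idea; the helper firstWords(in:limit:) and its limit parameter are invented for this example.

```swift
import NaturalLanguage

// Hypothetical helper: collect at most `limit` words from the start of the text.
func firstWords(in text: String, limit: Int) -> [String] {
    guard limit > 0 else { return [] }

    let tokenizer = NLTokenizer(unit: .word)
    tokenizer.string = text

    var result: [String] = []
    tokenizer.enumerateTokens(in: text.startIndex..<text.endIndex) { range, _ in
        result.append(String(text[range]))
        // Returning false stops the enumeration early.
        return result.count < limit
    }
    return result
}
```

This can be handy for long texts where only the first few words matter, for example when building a short preview.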
And that's basically it. The tokenizer also accepts units other than words, for example .sentence to split a text into sentences or .paragraph to split it into paragraphs.
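Switching units only changes the value passed to the initializer. Here is a small sketch that takes the unit as a parameter; the helper name tokens(in:unit:) is an assumption made for this example and not part of the framework.

```swift
import NaturalLanguage

// Hypothetical generalization of words(for:): the caller picks the unit.
func tokens(in text: String, unit: NLTokenUnit) -> [String] {
    let tokenizer = NLTokenizer(unit: unit)
    tokenizer.string = text
    return tokenizer.tokens(for: text.startIndex..<text.endIndex).map { String(text[$0]) }
}

let sample = "Hello there. How are you?"
let sentences = tokens(in: sample, unit: .sentence)
let words = tokens(in: sample, unit: .word)
print(sentences, words)
```

Depending on the unit, the returned ranges may include trailing whitespace, so you might want to trim the resulting strings before displaying them.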
