Splitting RTL texts into sentences, part 1

In this article, we’ll outline a solution to a problem we faced when splitting a text into sentences in RTL languages. These are languages whose script is written from right to left, such as Arabic, Farsi, Urdu, and Hebrew. You’ll learn about the technical implementation of sentence tokenization in the Yoast text analysis, and how we expanded it to cover these RTL languages, starting with Arabic in Yoast SEO 14.8. Spoiler alert: it didn’t actually have anything to do with the writing direction! If you’re interested in this specific natural language processing problem, this article is for you!

But wait, there’s more! This article comes with a second part in which we talk about the process behind the search for a solution. So if you also want to improve your practices as a developer – and who wouldn’t? – make sure to also read part 2!

Sentence tokenization - the basics

The Yoast SEO content analysis consists of multiple assessments that give you information about the SEO-friendliness and readability of your post. Many of these assessments operate on sentences. For example, we tell you whether your sentences are too long. Also, when counting keyphrases or transition words, we do that per sentence. This means we need to split texts into sentences, which isn’t as simple to do adequately as it might sound. Yet, for most languages we’ve had this capability since the inception of the Yoast content analysis.

However, we ran into issues when we looked into expanding our analysis to RTL languages. With our existing sentence splitting mechanism, sentences in these languages weren’t split correctly. Compare the following two examples: one in an LTR script (the Latin alphabet, which is used to write English) and one in an RTL script (Hebrew). For each script, there’s an input text and a tokenized version. Note the incorrect tokenization in the Hebrew text.

Latin alphabet input text:

Lorem ipsum dolor sit amet. Sea an debet conceptam, in sit exerci vidisse, quo at paulo aperiam corrumpit. Ei everti.

Tokenized text:

  1. Lorem ipsum dolor sit amet.
  2. Sea an debet conceptam, in sit exerci vidisse, quo at paulo aperiam corrumpit.
  3. Ei everti.

Hebrew input text:

.נפלו ברית חפש בה. כלל עסקים בקרבת של, והוא האטמוספירה מדע אל, צ'ט תורת הגרפים תקשורת על. זאת מה הארץ

Tokenized text:

1) .נפלו ברית חפש בה. כלל עסקים בקרבת של, והוא האטמוספירה מדע אל, צ'ט תורת הגרפים תקשורת על. זאת מה הארץ

As you can see, the LTR text is split correctly: as we’d expect, sentences are split on all full stops. The RTL text is a different story, because it isn’t split at all. You might not understand Hebrew (and the text in the example isn’t real Hebrew anyway, but a Hebrew equivalent of a meaningless Lorem ipsum text), but you’ll be able to spot some full stops in the original. Just like in an LTR language, sentences should be split on those full stops. So it should look like this:

  1. .נפלו ברית חפש בה
  2. .כלל עסקים בקרבת של, והוא האטמוספירה מדע אל, צ'ט תורת הגרפים תקשורת על
  3. .זאת מה הארץ

So why did that not work as expected? To answer that question, we'll first dive a bit into how we split sentences.

Sentence tokenizing in LTR languages

The raw input for our text analysis is an HTML document. This means the problem we need to solve isn’t only how to split a text into sentences, but we also need to separate out all HTML elements from that document.

To achieve this, we process the text in two rounds. In the first round, we tokenize the whole HTML document using the external library tokenizer2. We feed rules into that tokenizer, each of which singles out a matching string as a specific type of token. Rules are mostly constructed as regular expressions. For example, we have a regular expression that identifies an opening HTML block. The end result of the first round is a list of tokens, each identified as an HTML start token, an HTML end token, or a sentence token. Here’s an example of an HTML-formatted input text and the result of the first round of tokenization for that text:

Input text:

<p>Lorem ipsum dolor <b>sit</b> amet. Sea an www.yoast.com debet conceptam, in sit exerci vidisse, quo at (paulo aperiam) corrumpit? Ei everti.</p>

Tokens:

[Figure: Results of the first round of sentence tokenization]

This representation already resembles something we can work with. We see that the text has been split up into both HTML elements as well as textual elements. From these, we need to puzzle together the sentences that we want to use for our analysis. This happens in the second round of sentence processing.
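To make the first round more concrete, here’s a minimal sketch of a rule-based tokenizer. It doesn’t use tokenizer2 and it isn’t our actual rule set: the rules, the token type names and the `tokenize` function are all simplified assumptions for illustration.

```javascript
// Hypothetical, simplified rules: each one pairs a token type with a sticky
// regular expression. With the "y" flag, exec() only matches at lastIndex.
const rules = [
  { type: "html-end", regex: /<\/[a-z][a-z0-9]*>/iy },
  { type: "html-start", regex: /<[a-z][a-z0-9]*[^>]*>/iy },
  { type: "full-stop", regex: /\./y },
  { type: "sentence-delimiter", regex: /[?!]/y },
  { type: "sentence", regex: /[^<.?!]+/y },
];

function tokenize(html) {
  const tokens = [];
  let position = 0;

  while (position < html.length) {
    let matched = false;

    // Try the rules in order; the first rule that matches here wins.
    for (const { type, regex } of rules) {
      regex.lastIndex = position;
      const match = regex.exec(html);
      if (match) {
        tokens.push({ type, src: match[0] });
        position += match[0].length;
        matched = true;
        break;
      }
    }

    // No rule matched (for example a stray "<"): keep the character as text.
    if (!matched) {
      tokens.push({ type: "sentence", src: html[position] });
      position += 1;
    }
  }

  return tokens;
}

console.log(tokenize("<p>Lorem <b>sit</b> amet. Ei everti.</p>"));
```

Note that the tokens still contain every character of the input, in order. Nothing is thrown away in this round. The text is only cut into labeled pieces.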

In the second round, we go over the tokens one by one and decide whether we should include them in sentences. We do this again following a set of rules. For instance, a sentence in its most basic form will be a sentence token starting with a capital letter. To this, we’ll add all following sentence tokens until we again encounter a full stop followed by a sentence token starting with a capital letter. When that happens, a new sentence will be started. With this and some other rules, we arrive at a final result which looks something like this:

[Figure: Final result of sentence tokenization]

Here, we see that sentences have been split as we’d expect. Note, for example, that the full stops in the URL aren’t treated as sentence boundaries, as they’re followed by a letter rather than white space. There’s still some HTML within sentences, but that’s something we deal with at a later stage. For the purposes of our analysis, this is suitable material.
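The second round can be sketched in a similar way. The code below is a simplified illustration, not our actual implementation: the token shape, the `full-stop` type name and the helper functions are assumptions, and it only implements the basic rule described above (a full stop ends a sentence when it’s followed by white space and a capital letter).

```javascript
// A character counts as a capital letter if it differs from its lower-case form.
function isCapitalLetter(character) {
  return character !== character.toLowerCase();
}

// The text after a full stop starts a new sentence only if it begins with
// white space followed by a capital letter.
function startsNewSentence(text) {
  const trimmed = text.trimStart();
  const hasLeadingWhitespace = trimmed.length < text.length;
  return hasLeadingWhitespace && trimmed.length > 0 && isCapitalLetter(trimmed[0]);
}

function assembleSentences(tokens) {
  const sentences = [];
  let current = "";

  tokens.forEach((token, index) => {
    current += token.src;
    if (token.type !== "full-stop") {
      return;
    }
    const next = tokens[index + 1];
    // Close the sentence at the end of the input, or when the next token
    // looks like a valid sentence start.
    if (!next || (next.type === "sentence" && startsNewSentence(next.src))) {
      sentences.push(current.trim());
      current = "";
    }
  });

  // Keep any trailing text that didn't end in a full stop.
  if (current.trim() !== "") {
    sentences.push(current.trim());
  }
  return sentences;
}

// Hand-written tokens for "Lorem ipsum. Sea an www.yoast.com debet. Ei everti."
const exampleTokens = [
  { type: "sentence", src: "Lorem ipsum" }, { type: "full-stop", src: "." },
  { type: "sentence", src: " Sea an www" }, { type: "full-stop", src: "." },
  { type: "sentence", src: "yoast" }, { type: "full-stop", src: "." },
  { type: "sentence", src: "com debet" }, { type: "full-stop", src: "." },
  { type: "sentence", src: " Ei everti" }, { type: "full-stop", src: "." },
];

console.log(assembleSentences(exampleTokens));
```

With these tokens, the sketch yields three sentences, and the URL stays in one piece inside the second one.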

So that’s the working sentence processing mechanism for LTR scripts. What about RTL scripts, where we saw that it doesn’t work? We'll outline this in the next section.

Sentence tokenizing in RTL languages

Now that we’ve seen how correct sentence processing works in LTR languages, let’s return to our problematic RTL example:

Hebrew text:

.נפלו ברית חפש בה. כלל עסקים בקרבת של, והוא האטמוספירה מדע אל, צ'ט תורת הגרפים תקשורת על. זאת מה הארץ

Incorrectly tokenized sentences:

1) .נפלו ברית חפש בה. כלל עסקים בקרבת של, והוא האטמוספירה מדע אל, צ'ט תורת הגרפים תקשורת על. זאת מה הארץ

Looking at the text above (and don’t forget to read right to left!) we see some Hebrew words and some full stops. Based on what we learned about sentence processing for LTR languages, we would expect the text to be split on those full stops. So we’d expect to get a number of split sentences like this:

  1. .נפלו ברית חפש בה 
  2. .כלל עסקים בקרבת של, והוא האטמוספירה מדע אל, צ'ט תורת הגרפים תקשורת על
  3. .זאת מה הארץ

Why isn’t that the case? Initially, we had a few hypotheses. It could be that there’s a problem with the sentence tokenizer somehow tokenizing RTL text the wrong way. It could even be that there’s a fundamental problem with JavaScript parsing of RTL text. In the end, the problem – and also the solution – turned out much simpler than that. In fact, it didn’t even have anything to do with writing direction. The answer to the problem was capital letters.

Remember what we mentioned above? A sentence in its most basic form will be a sentence token starting with a capital letter. The reason why this rule isn’t working in RTL languages is simply that Arabic and Hebrew don’t distinguish between capital letters and lower-case letters. So our check for a sentence start, which was a seemingly innocent check for a character that’s different from its lower case form, would always fail. Hence, no new sentence would be started. Instead, all tokens were continuously appended to one single initial sentence.
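You can reproduce the failing check in a few lines. This is a sketch of the kind of check described above rather than a copy of our code, and the function name `looksLikeSentenceStart` is made up for this example.

```javascript
// The seemingly innocent check: a character starts a sentence if it
// differs from its lower-case form.
function looksLikeSentenceStart(character) {
  return character !== character.toLowerCase();
}

// Hebrew and Arabic letters have no lower-case form, so toLowerCase()
// returns them unchanged and the check can never succeed.
for (const character of ["L", "l", "נ", "ع"]) {
  console.log(character, looksLikeSentenceStart(character));
}
// L true
// l false
// נ false
// ع false
```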

The fix turned out to be as simple as the problem: we just had to make sure that all letters from the scripts of Arabic, Hebrew, Farsi etc. were also considered as valid sentence beginnings. Of course, we also added extensive tests to make sure that there were no problems with other types of sentences, for example sentences ending in question marks or exclamation marks. We also added a few language-specific sentence endings, such as the dash-like Urdu full stop (۔) and the Arabic question mark (؟), which is a mirrored version of the Latin one. And that’s it: no fundamental JavaScript problems to be circumvented, no tokenization library problems to be solved.
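Here’s one way such a fix could look. It’s a sketch under assumptions, not our actual code: it uses Unicode script property escapes to recognize Hebrew and Arabic letters, and the names `isValidSentenceStart` and `isSentenceEnding` are made up for this example.

```javascript
// Assumption: Unicode script properties stand in for "all letters from these
// scripts". Farsi and Urdu are written in the Arabic script, so Script=Arabic
// covers them too. The lookahead restricts the match to letters, which
// excludes digits and punctuation that belong to the same scripts.
const caselessLetter = /^(?=\p{L})[\p{Script=Hebrew}\p{Script=Arabic}]/u;

function isValidSentenceStart(character) {
  const isCapitalLetter = character !== character.toLowerCase();
  return isCapitalLetter || caselessLetter.test(character);
}

// Sentence endings, including the Urdu full stop (U+06D4) and the
// Arabic question mark (U+061F).
const sentenceEndings = new Set([".", "?", "!", "\u06D4", "\u061F"]);

function isSentenceEnding(character) {
  return sentenceEndings.has(character);
}

console.log(isValidSentenceStart("נ")); // true
console.log(isValidSentenceStart("ع")); // true
console.log(isValidSentenceStart("a")); // false
console.log(isSentenceEnding("؟")); // true
```

The existing capital letter check stays in place, so the behavior for LTR languages doesn’t change. Letters from scripts without case are simply accepted in addition.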

Conclusion

In this article, we explained the technical aspects of how we implemented correct sentence tokenization for RTL languages, starting with Arabic in Yoast SEO 14.8. We showed you how we process sentences in general, and how we tweaked this approach to make it work for RTL languages as well. But that’s not quite the end of the story. While the fix we described was indeed simple, it was a long journey to get there. If you want to know about all the pitfalls – mostly in terms of mindset – that had to be overcome to get to this solution, proceed to part 2 of this series!