Splitting RTL texts into sentences, part 2

In a previous article, I told you the story of how we fixed the problem of sentence tokenization in RTL (right-to-left) languages. If it’s only this specific technical problem you’re interested in, read no further! I’m not going into more detail about the technical solution. But there’s also a bigger lesson to be learned here, a lesson about how to approach problems, how your hastily formed preconceptions can block you, and how you need to overcome them. If you’re interested in that broader lesson, this article is for you!

The history of the fix

I explained the technical details of the original problem and the solution in a previous article. I also told you that it was a simple fix. But the fact that a fix is simple doesn’t always mean it’s easy to spot. So the part I haven’t told you yet is that it took us a really long time to come up with a solution to the problem. I think there are two reasons for that. First, I was too quick to jump to the conclusion that it was actually a big problem. Second, this was one of those cases where I thought I needed to fix the whole world, not just one small problem. (I know this sounds overly dramatic, but it’ll make sense soon!)

In what follows, I want to point out a few mistakes I made when tackling the sentence tokenizer issue. The main issue here wasn't a technical oversight, but rather the mindset with which I initially approached the problem.

Challenge your assumptions

For a long time, I thought there was some bigger, underlying issue with the sentence tokenizer. Based on the output of the sentence tokenizer, I always assumed that it was incorrectly reading RTL languages as LTR languages, that is, from left to right. As I mentioned earlier, I imagined it was some really big problem, potentially not just with our custom rules for the tokenizer, but maybe even with the tokenizer library itself, or even with JavaScript in general. I drew these conclusions based on some relatively superficial debugging. As you’ve learned, in the end it turned out that the problem was not some big, horrible bug hidden in the depths of a library or even a whole programming language, but something much more mundane.

The danger of fossilized beliefs

Granted, when I first looked into the problem back in the day, I was really just a rookie, relatively new to our text analysis library as well as programming in general. So it’s not too crazy that I drew the wrong conclusions at the time.

What is crazy, though, is that this conclusion, once established, really fossilized into a firmly held belief. “We can’t process RTL languages because there’s some big, underlying problem with sentence tokenization”, I’d repeat to people, because that’s what I actually believed. This way, a personal belief ended up as a bit of a myth that other people around me came to believe as well.

How to challenge your own assumptions

Fast-forward a year or two. I’d love to be able to say, “Oh, I learned a lot in the meantime, took another look at the problem and finally realized that I made a mistake.” Alas, that’s not quite how it went. It took someone with more experience and, more importantly, a less biased view of the issue to look into it and find the simple fix that eventually solved the problem.

What can we learn from this story? The obvious lesson is: keep challenging your own assumptions. This statement is very true, but also very broad and therefore hard to put into practice. So here are a few more practical tips for how to apply this lesson to your work as a developer.

Tip 1: be explicit about what you know

The first tip is: be very explicit about what conclusions you’ve drawn based on what evidence. For example, I had done some debugging of the final output of the sentence tokenizer. Based on this, I concluded that the output was incorrect for languages like Arabic and Hebrew. On the other hand, it worked just fine for LTR scripts like the Latin characters we use for English or the Cyrillic characters used in Russian. As a consequence, I thought that writing direction was the problem (which it wasn’t) and that there was some fundamental problem with parsing RTL scripts (also wrong).

Jumping to such unwarranted conclusions probably becomes even more likely when you’re faced with unfamiliar or complex data. For example, I was faced with these scripts that I can’t read. It’s probably the same when you’re facing an unfamiliar file format, programming language, or programming style. The more unfamiliar the material you’re working with, the more cautious you should be.

Writing down clearly what you’ve observed gives you a much better view of the actual facts. It’s not always possible to dive deep into a problem, thoroughly investigate it, and fix it on the spot. But by documenting very clearly what you have investigated – and forgoing any unwarranted conclusions – you create a much more accurate picture of the status quo. That way, it becomes less likely for fossilized beliefs or myths to emerge and take on a life of their own.
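One way to make an assumption like “RTL text is read backwards by the code” explicit is to turn it into a small check and keep the result next to the evidence. The following is only a sketch of that habit, not something taken from the original investigation; the `observe` helper and the variable names are made up for illustration. It relies on the fact that JavaScript strings hold text in logical (reading) order, with the display direction only being resolved when the text is rendered.

```javascript
// Assumption to test: "RTL text is stored right to left, so string operations see it backwards."
// Hebrew "shalom", spelled with escapes so the order in the source is unambiguous:
// shin (U+05E9), lamed (U+05DC), vav (U+05D5), final mem (U+05DD).
const shalom = "\u05E9\u05DC\u05D5\u05DD";

// Hypothetical helper: records an observation together with the evidence for it.
function observe(description, actual, expected) {
  return { description, actual, expected, holds: actual === expected };
}

const observations = [
  observe("index 0 holds the letter that is read first (shin)", shalom[0], "\u05E9"),
  observe(
    "the last index holds the letter that is read last (final mem)",
    shalom[shalom.length - 1],
    "\u05DD"
  ),
];

// Print each observation with its status, so facts and conclusions stay separate.
for (const observation of observations) {
  console.log(`${observation.holds ? "holds" : "fails"}: ${observation.description}`);
}
```

A list like this doesn’t fix anything by itself, but it separates “what I checked” from “what I concluded”, which is exactly the distinction I failed to make.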

Tip 2: get a second pair of eyes

The second tip is: make sure to share your findings with others and let them critically examine your conclusions. If there’s time and capacity, it’s even better if you can ask a colleague, a friend, or another kind-hearted soul to pair-program with you on the debugging process after you’ve established your hypothesis about the problem. It’s easy to fall down a rabbit hole when debugging a technical issue. A second, unbiased pair of eyes is always a great asset, potentially helping you to find other solutions.

So this section was a reminder to keep an open mindset about the conclusions you draw and the assumptions you make when investigating a problem. The next section is about another factor that can cloud your judgment and push you toward premature conclusions: trying to fix problems in an imperfect world.

Solving problems in an imperfect world

This section is about looking for a small fix in an imperfect world. What’s the imperfect world, you might ask? For me, that was how our tokenizer works.

Let me give you a little reminder of its functionality: first, it takes an HTML document and segments both the HTML and the text embedded in it. After that, it puzzles together sentences. It skips some HTML but also leaves in some to be dealt with at a later point. It all works and there probably were some good, practical reasons for why it works the way it does. But as someone looking into it for the first time, I mostly got the impression that we should refactor the whole thing.

I was thinking: wouldn’t it be more elegant and easier to work with if we first took out all the HTML? This would leave us with just a text representation. We could take this “pure” text and tokenize it into sentences, and not have to think about HTML anymore. In an ideal world, that’s probably how it would work. But the world doesn't have to be perfect for you to make some changes in it.
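To make that idealized “strip first, tokenize second” idea concrete, here is a deliberately naive sketch. It is not how our tokenizer works, and it is not the fix from the previous article. The function names `stripHtml`, `splitIntoSentences`, and `tokenizeHtml` are hypothetical, and a regular expression is no substitute for a real HTML parser.

```javascript
// Hypothetical helper: remove anything that looks like a tag.
// Limitation: text in adjacent block elements ("<p>A.</p><p>B.</p>") gets glued together.
function stripHtml(html) {
  return html.replace(/<[^>]*>/g, "").replace(/\s+/g, " ").trim();
}

// Hypothetical helper: split after sentence-ending punctuation that is followed by whitespace.
// U+061F is the Arabic question mark. Abbreviations, quotes, etc. are not handled.
function splitIntoSentences(text) {
  return text
    .split(/(?<=[.!?\u061F])\s+/)
    .map((sentence) => sentence.trim())
    .filter((sentence) => sentence.length > 0);
}

// The "ideal world" pipeline: first get rid of the HTML, then only deal with plain text.
function tokenizeHtml(html) {
  return splitIntoSentences(stripHtml(html));
}

const sentences = tokenizeHtml("<p>Hello <strong>world</strong>. How are you?</p>");
console.log(sentences); // [ 'Hello world.', 'How are you?' ]
```

Even this toy version hints at why the real thing is less tidy: once the tags are gone, you also lose the information about where block elements end, which a real tokenizer presumably needs.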

Why the imperfect world shouldn't block you

Here’s the point: the fact that the world is imperfect doesn’t mean we can’t fix actual problems within the imperfect world, like the RTL problem. Chances are that the world will never be perfect. Striving for this perfection shouldn't stop you from fixing real-world problems that you're facing right now.

Don’t get me wrong, I’m not advocating for never doing any large-scale refactoring. But that should be a conscious choice on its own. You should make this choice by weighing up all the pros and cons it entails. If you can approach it like that, as a separate problem, you’re fine. But that’s not what I was doing. I ended up being mentally blocked by the fact that the status quo didn’t represent an ideal situation. For me, that was a situation where we process HTML first and then go about sentence tokenization. While I still believe that it's an ideal we should strive for in the future, this shouldn’t block me. It shouldn't hinder me from implementing what’s needed right now as a concrete improvement for users of our software.

Taken together, my assumptions and preconceptions reinforced each other. On the one hand, I thought we had a big problem that required in-depth changes. On the other hand, I thought we had an imperfect world that was potentially causing that problem and needed a big revamp anyway. Together, these beliefs made me see one huge task rather than individual problems to be solved one at a time. And that’s not a good way to approach a problem.

Disentangle individual problems

So what lessons can we draw from this? For one, separate your problems and examine and prioritize them one by one. For example, looking in-depth for an RTL fix was one problem. Refactoring the tokenizer to make it easier to deal with was another. These problems might be related, but you shouldn’t assume from the beginning that they necessarily depend on each other. Again, clearly documenting separate issues and outlining their individual scopes can help. You'll get a better idea of the various problems you’re facing.

Of course, it might turn out to be the case that multiple problems should be solved together. Or the solution to one problem might depend on another. However, these things should always be conscious choices you make based on hard evidence. Also, if you’re working together with other people, there are usually multiple stakeholders involved in that decision-making process. And you, as a developer, should make sure the facts are spelled out as clearly as possible. That way, it's possible to make sound decisions based on these facts.

Conclusion

In this article, I’ve talked about how important it is, as a developer, to be aware of your own preconceptions and assumptions. I used one problem and its simple, yet long-drawn-out fix as an example. With this example, I've shown how perfectionist impulses and premature conclusions can seriously cloud your judgment. My take-home message, therefore, is this: be clear about what you know and what you don’t know, document everything well, let your assumptions be challenged by others, and let yourself be guided by conscious choices.