How to label text for sentiment analysis — good practices

Have you ever started a sentiment analysis or other text classification task only to find that you are not getting good results? The list of possible problems is long, but there are two aspects worth singling out: one you probably examine too little, and one you probably examine too much.

The first aspect is the quality of the labels in your training dataset; the second is the model itself. We tend to spend a lot of time tweaking the model because, well, that is how we learn to do things. When you start your first projects, you usually get a dataset that is already curated and cleaned. There is nothing, or very little, to do in terms of preprocessing.

When I say preprocessing, forget the usual pipeline of removing accents, tokenizing, removing stop words and so on, and focus on the very first step of the whole thing: the quality of the data.

Labelling is hard even for humans

Have you ever tried labelling things only to discover that you are terrible at it? If you haven't, here is a great chance to discover how hard the task is. I am sure that if you started your machine learning journey with a sentiment analysis problem, you most likely downloaded a dataset with a lot of pre-labelled comments about hotels/movies/songs. My question is: did you ever stop to read some of them?

If you did, you will have found that some of the labels are not exactly the ones you would have given in the first place. You may disagree that some comments are really positive or negative. This happens because positive/negative labelling is very subjective.

If you are unable to tell what's positive or negative in there, your computer will surely perform as badly as you do. That's why I insist: labelling data is an art and should be done by someone with very deep knowledge of the problem you are trying to solve from a human standpoint. But you can train yourself to get better at it.

Define clear rules

A good approach to labelling text is to define clear rules about what should receive which label. Once you have a list of rules, be consistent. If you classify profanity as negative, don't label comments containing profanity in the other half of the dataset as positive.
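The "clear rules" idea can be sketched as code: each rule is a small function applied in a fixed order, so the same rule always wins and labelling stays consistent across the whole dataset. The word lists and cue words below are made-up illustrations, not a real lexicon.

```python
# Toy rule-based labelling sketch. 0 = negative, 1 = positive.
# Word lists are illustrative assumptions, not a real lexicon.
PROFANITY = {"awful", "damn"}
POSITIVE_CUES = {"great", "love", "excellent"}

def label_by_rules(text: str):
    words = set(text.lower().split())
    if words & PROFANITY:        # rule 1: profanity is always negative
        return 0
    if words & POSITIVE_CUES:    # rule 2: clear positive cues
        return 1
    return None                  # no rule fired: leave for manual review

examples = ["I love this hotel", "The food was awful", "It was fine"]
labels = [label_by_rules(t) for t in examples]
```

Examples no rule can decide come back as `None`, which keeps the "hard" cases explicitly separated from the ones the rules handle.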

But this won't always work. Depending on the problem, even irony can be a problem and a sign of negativity. So the second rule of thumb for labelling text is to label the easiest examples first. The obvious positive/negative examples should be labelled as soon as possible, and the hardest ones left to the end, when you have a better comprehension of the problem.

Another possibility is to pre-label the easiest examples and build a first model with only those. Then, you can submit the remaining examples to this model and check the model's 'opinion' about the hardest examples.
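The pre-label-then-predict loop might look something like this sketch, assuming scikit-learn is available; the texts, the TF-IDF features and the logistic regression model are illustrative choices, not the article's actual setup.

```python
# Train on the obvious examples, then ask the model for its 'opinion'
# (predicted probabilities) on the hard ones.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

easy_texts = [
    "great movie, loved it",
    "awesome hotel, excellent staff",
    "terrible food, awful place",
    "worst stay ever, terrible",
]
easy_labels = [1, 1, 0, 0]          # obvious positives and negatives

hard_texts = ["well, that was something", "not bad, I guess"]

model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(easy_texts, easy_labels)

# One probability pair per hard example: [P(negative), P(positive)]
probs = model.predict_proba(hard_texts)
for text, p in zip(hard_texts, probs):
    print(text, "-> P(positive) =", round(float(p[1]), 2))
```

Low-confidence predictions (probabilities near 0.5) flag the examples that most need a human decision.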

Test randomness

If you did all of the above and are still not sure about the quality of your classification or of your model, you can try to test randomness. Do the following: take the examples you are using to build your model and assign random labels to them. You can do this using Aruana.

When you randomly label your examples, you can check how important the labels are for the predictions. In other words, you check whether the text has good labels. If you are unsure about the correctness of the labels (say you suspect the examples received bad labels in the first place), you can assign random labels and see how the model performs.

Another possibility is that the model itself is broken. To test whether the model always gives the same predictions regardless of the examples it receives, we can feed it random text. In this case, instead of changing only the labels, you also create blobs of text with no meaning and see how the model performs.
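Generating meaningless blobs is straightforward; here is one possible sketch, where the word length, the length range and the lowercase-ASCII alphabet are arbitrary choices made for illustration.

```python
# Build random "blobs" of nonsense words with roughly the shape of real
# comments. If the model's scores barely change on this garbage input,
# the pipeline (not the labels) is probably at fault.
import random
import string

random.seed(42)  # reproducible nonsense

def random_blob(n_words: int, word_len: int = 6) -> str:
    return " ".join(
        "".join(random.choices(string.ascii_lowercase, k=word_len))
        for _ in range(n_words)
    )

# Three fake "comments" of varying length
blobs = [random_blob(random.randint(5, 20)) for _ in range(3)]
```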


I used the two approaches above to test a model I was working on. I was not sure whether my model was broken or whether the examples I was working with were poorly labelled.

Here's how I conducted the experiment: using the same examples, I trained a model three times with the same configuration (though a little randomness will always exist). On the first run, I tested the model with random labels. On the second run, I used text blobs, and on the third run, I used the correct examples. It's important to say that I worked with a balanced dataset.

I loaded the data into a pandas DataFrame with two columns: 'text' and 'sentiment'. The sentiment column holds the text classification.
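As a minimal sketch, the layout described above could be built like this (the example rows are invented; in practice the data would come from a file):

```python
# Two-column layout: raw text plus its sentiment label (1 = positive,
# 0 = negative), matching the structure described in the article.
import pandas as pd

data = pd.DataFrame({
    "text": ["adorei o hotel", "péssimo atendimento"],  # example pt-br comments
    "sentiment": [1, 0],
})
```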

First run

from aruana import Aruana

# Aruana text-preprocessing helper, set up for Brazilian Portuguese
aruana = Aruana('pt-br')

# Replace the real labels with random (but balanced) 0/1 classes
sentiment = aruana.random_classification(data['text'], classes=[0, 1], balanced=True)
data['sentiment'] = sentiment

The results:
