Free Text to Binary Rating, or Why Sentiment Analysis Can Be Trusted

In this blog post we share findings on the rating system we proposed in our earlier "Binary Vs Star Rating Systems" blog post, and test that system on a dataset we extracted from Skytrax.


Our first goal is to prove that by using Sentiment Analysis on written feedback we can reach the same high-level conclusion/rating produced by a star rating system. We then aim to use Natural Language Processing to extract more granular feedback about the user experience, feedback that star or binary rating systems cannot capture.


Skytrax was the perfect place to put our proposal to the test. People tend to be generous with the text they share per review, and they give not one but several ratings. This is what a typical review looks like:


We selected a sample of twelve airlines, seven economy/regional ones:

  • Aer Lingus
  • Aegean Airlines
  • easyJet
  • Flybe
  • Olympic Air
  • Ryanair
  • Vueling


And five larger air carriers:

  • American Airlines
  • British Airways
  • Iberia
  • KLM Royal Dutch Airlines
  • Lufthansa


Our Objectives:

  • What is the Net Promoter Score per airline from Skytrax customers?
  • Does the overall sentiment of a review agree with the direct feedback given by a client?

For our analysis we used the Vader Sentiment Analysis implementation in Python, which has created a lot of buzz lately, especially for its high accuracy on social media datasets.

Net promoter score

Net promoter score is an endorsement metric, measuring customers’ willingness to recommend a brand to others. We calculated NPS per airline using the Yes/No “Recommended” flag on each review.
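
As a rough illustration of the calculation, the sketch below adapts NPS to a binary flag by treating every "Yes" as a promoter and every "No" as a detractor, so the score is the percentage of Yes reviews minus the percentage of No reviews. This mapping is an assumption made for illustration (classic NPS is built on a 0 to 10 scale), and the function name and sample data are hypothetical.

```python
from collections import defaultdict


def nps_per_airline(reviews):
    """Return {airline: NPS} from (airline, recommended) pairs.

    Assumption: "Yes" counts as a promoter and "No" as a detractor, so
    NPS = %Yes - %No, on a scale from -100 to 100.
    """
    counts = defaultdict(lambda: {"yes": 0, "no": 0})
    for airline, recommended in reviews:
        key = "yes" if recommended else "no"
        counts[airline][key] += 1

    scores = {}
    for airline, c in counts.items():
        total = c["yes"] + c["no"]
        scores[airline] = 100.0 * (c["yes"] - c["no"]) / total
    return scores


# Made-up sample, not Skytrax data.
sample = [
    ("Airline A", True), ("Airline A", True), ("Airline A", False),
    ("Airline B", False), ("Airline B", False), ("Airline B", True),
    ("Airline B", False),
]
nps = nps_per_airline(sample)
print(nps)
```

A positive score means more reviewers recommended the airline than did not; a negative score means the opposite.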




Sentiment Score Vs Customers’ Direct Feedback Towards an Airline

This task has a twofold purpose:

  • To validate Vader’s high accuracy in text sentiment scoring
  • To demonstrate that we can efficiently extract binary feedback from a continuous scale: sentiment scores range from -1 (extremely negative) to 1 (extremely positive).

For each text it scores, Vader returns a dictionary of four numbers:

{'neg': 0.097, 'neu': 0.647, 'pos': 0.256, 'compound': 0.4404}
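
To make the output above concrete, here is a minimal sketch of pulling the compound value out of each review. With the vaderSentiment package installed, the scorer would be `SentimentIntensityAnalyzer().polarity_scores`; the helper name `compound_scores` and the stub scorer used for the demo are hypothetical, included so that the snippet runs without the package.

```python
def compound_scores(texts, scorer):
    """Score each text and keep only the 'compound' value.

    `scorer` is any callable that returns a Vader-style dictionary with
    'neg', 'neu', 'pos' and 'compound' keys.
    """
    return [scorer(text)["compound"] for text in texts]


def stub_scorer(text):
    # Stand-in for Vader so the example runs anywhere; it always returns
    # the dictionary shown above instead of analysing the text.
    return {"neg": 0.097, "neu": 0.647, "pos": 0.256, "compound": 0.4404}


# With vaderSentiment installed, swap the stub for the real analyzer:
#   from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
#   scorer = SentimentIntensityAnalyzer().polarity_scores
scores = compound_scores(["Friendly crew, but the flight was delayed."], stub_scorer)
print(scores)
```

The neg, neu and pos values are proportions of the text and add up to roughly 1, while compound is a single normalized score between -1 and 1.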

Instead of sticking to Vader’s own classification, it is considered good practice to use compound as the main measure and segment it into meaningful groups (for instance extremely negative, negative, neutral, positive, extremely positive), an approach that our own experience supports.
In our case we needed to take neutral out of the equation and map sentiments to Yes and No, thereby approximating a customer’s binary recommendation flag.
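
One way to segment compound into the five groups mentioned above is sketched below. The ±0.05 band for neutral follows the thresholds commonly suggested in Vader's documentation; the ±0.5 cutoffs for the "extremely" groups and the function name are assumptions chosen purely for illustration, not necessarily the values used in our analysis.

```python
def sentiment_group(compound, neutral_band=0.05, extreme_cutoff=0.5):
    """Map a compound score in [-1, 1] to one of five labels.

    The cutoffs are illustrative defaults and should be tuned per dataset.
    """
    if compound >= extreme_cutoff:
        return "extremely positive"
    if compound >= neutral_band:
        return "positive"
    if compound > -neutral_band:
        return "neutral"
    if compound > -extreme_cutoff:
        return "negative"
    return "extremely negative"


for score in (0.92, 0.4404, 0.0, -0.3, -0.8):
    print(score, "->", sentiment_group(score))
```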

We came up with a simple yet effective rule of “eliminating” neutral scores:


With some agreement percentages narrowly missing the 80% mark and others approaching or even exceeding 90%, I believe the verdict is clear: the sentiment of what customers write aligns with the rating they give, which validates sentiment analysis as a stand-in for it.
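
For reference, agreement percentages of this kind can be computed as sketched below. The rule for eliminating neutral scores is not spelled out in the text, so the cutoff used here (a compound of 0 or more maps to Yes, anything below to No) is a hypothetical stand-in, as are the function names and the sample data.

```python
def to_binary(compound, cutoff=0.0):
    """Hypothetical rule: compound >= cutoff -> True (Yes), else False (No)."""
    return compound >= cutoff


def agreement_rate(reviews, cutoff=0.0):
    """Percentage of reviews whose predicted flag matches the real one.

    `reviews` is a list of (compound, recommended) pairs, where
    `recommended` is the customer's own Yes/No flag as a bool.
    """
    if not reviews:
        raise ValueError("no reviews to compare")
    matches = sum(to_binary(c, cutoff) == rec for c, rec in reviews)
    return 100.0 * matches / len(reviews)


# Made-up sample, not Skytrax data.
sample = [(0.91, True), (0.44, True), (-0.62, False), (0.12, False), (-0.30, False)]
rate = agreement_rate(sample)
print(f"{rate:.1f}%")
```

Running the same function on each airline's reviews separately gives one agreement percentage per airline.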


  • What are the common positive words and phrases per Skytrax category and airline?
  • How do sentiment scores relate to the ratings given per Skytrax category and airline over time?

Sounds interesting?

Get in touch now and we can schedule a free one-hour consultation. No strings attached.