This week, I’ll be coordinating a technology course for law students in a collaboration between Técnico and Universidade Católica. In the module dedicated to artificial intelligence, students will train a system that assesses whether text messages convey positive or negative emotions. For that, we’ll use a set of 5,000 tweets that were classified by humans as positive and another 5,000 as negative. Among other applications, this analysis helps companies to identify customer complaints that, due to their negativity, may require further attention.
The pervasiveness of text messaging is a surprising phenomenon. Science fiction predicted mobile devices for communicating by audio and video but not by text. In 1984, when the GSM protocol was defined in Europe, it included the SMS service to send text messages to customers about billing or new voice mail. It was only in 1993 that the first mobile phone able to send such messages appeared. Adoption of text messaging was slow and led by the younger generations, who saw in it a new form of ubiquitous communication, non-intrusive and, above all, without the pressure of an immediate response. Exchanging text messages has become a way to carry on a conversation, leading to the emergence of many instant messaging applications such as WhatsApp.
Face-to-face interaction includes many non-verbal elements, such as facial expressions, gestures, intonation and type of voice, which allow you to confirm that the message was well received. Text messages have a greater risk of being misunderstood. The book “Digital Body Language”, by Erica Dhawan, presents many examples of problems created in companies by digital communication. With the communication channel reduced to text, people needed to compensate creatively for the lack of these elements. We use punctuation excessively (“!!!”), repeat vowels to lengthen a word (“Nooo”), add emotion symbols (“:)”), and resort to acronyms (“LOL”).
Contrary to the general idea that this form of communication did not use punctuation or that it did so at random, Baron and Ling (2011) conclude that new generations follow coherent strategies in the use of punctuation if we include the repetition of symbols and emojis in it. At school, we were taught to use the exclamation mark sparingly. We were told that it serves to mark exclamatory phrases (“Good morning!”), imperatives (“Get out of here!”) or interjections (“My God!”). In digital communication, a single exclamation mark has become an indicator of friendliness (“Thank you!”). Its use has become so common that its absence now signals too much formality. While a single exclamation mark is read as favourable (“Excellent!”), a sequence of exclamation marks is more difficult to interpret (“Excellent!!!!!”), as the repetition can be understood as either enthusiasm or irony. In contrast, the full stop almost disappears in digital communication. When used, it shows a lack of interest in continuing the conversation or that the statement is final, like saying “and that’s it” at the end of a sentence. The ellipsis, in addition to leaving the continuation of the text to the reader’s interpretation (“The early bird…”), has also come to represent pauses in oral discourse (“I went there… it was empty”). Baron and Ling identified gender differences in the use of this form of communication. Girls write more and use more punctuation than boys, who prefer succinct answers; girls also more often use the exclamation mark as a gentle way to finish a message, along with emojis.
Punctuation was first used to help read a text aloud: the rhetorical principle. The popularization of silent reading changed its role to a guide for decoding complex syntactic structures: the grammatical principle. Some researchers see the new use of punctuation as a return to the rhetorical principle. However, Busch (2021) proposes a new interactional principle, in which punctuation serves as a mechanism to organize the sequence of written interactions, especially when it occurs at the beginning or end of a message.
In our course, we will use sentiment analysis to introduce students to some of the natural language processing methods. The process begins with the preparation of the training material, the tweets classified as positive or negative. We first perform a morphosyntactic analysis, which adds a label to each word indicating whether it is a noun, verb, pronoun, etc. Certain forms, such as “lead”, can be ambiguous: noun (“heavy as lead”) or verb form (“I lead the team”). Using the neighbouring words, it is possible to choose the correct label. In a simple system, you can eliminate all words that are not verbs or nouns and convert the verb forms into the infinitive (“to lead”) and the nouns into the masculine singular, in a process called lemmatization. After reducing each word to its lemma, the training dataset is used to estimate the probability of each lemma occurring in positive and in negative messages. Applying the same processing to the words of a new message, we combine the probability estimates of its lemmas to classify it as positive or negative. The simple system we use is 99.5% accurate in classifying the 3,000 tweets that we randomly reserve for testing. However, it is unable to deal with irony and sarcasm, which even more sophisticated classifiers have difficulty identifying. So the next time you make a complaint, use strong negative words and avoid irony and sarcasm: you’re more likely to have your problem solved.
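For readers who want to see the idea in code, the following is a minimal sketch of a classifier that combines per-word probability estimates, here written as a naive Bayes model with add-one smoothing. It is an illustration under stated assumptions, not the system used in the course: the six training messages are invented, the names `TRAIN`, `tokenize`, `train` and `classify` are hypothetical, and plain lowercasing and splitting stands in for the morphosyntactic analysis and lemmatization described above.

```python
import math
from collections import Counter

# Hypothetical toy data standing in for the human-labelled tweets.
TRAIN = [
    ("love this phone great service", "pos"),
    ("great support happy customer", "pos"),
    ("thank you love the fast delivery", "pos"),
    ("terrible service hate waiting", "neg"),
    ("awful delay bad support", "neg"),
    ("hate this broken phone", "neg"),
]


def tokenize(text):
    """Lowercase and split on whitespace (a stand-in for tagging and lemmatization)."""
    return text.lower().split()


def train(examples):
    """Count word occurrences per class and the number of messages per class."""
    word_counts = {"pos": Counter(), "neg": Counter()}
    doc_counts = Counter()
    for text, label in examples:
        doc_counts[label] += 1
        word_counts[label].update(tokenize(text))
    vocab = set(word_counts["pos"]) | set(word_counts["neg"])
    return word_counts, doc_counts, vocab


def classify(text, model):
    """Return the class with the highest combined log-probability."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    scores = {}
    for label, counts in word_counts.items():
        # Start from the share of training messages in this class.
        score = math.log(doc_counts[label] / total_docs)
        total_words = sum(counts.values())
        for word in tokenize(text):
            if word not in vocab:
                continue  # words never seen in training carry no evidence
            # Add-one smoothing keeps unseen word/class pairs from zeroing the score.
            score += math.log((counts[word] + 1) / (total_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)


model = train(TRAIN)
print(classify("love the great support", model))
print(classify("hate the awful delay", model))
print(classify("great another delay love it", model))  # an ironic complaint
```

On this toy data, the first message is classified as positive and the second as negative. The third, an ironic complaint, is also classified as positive, because the words “great” and “love” outweigh “delay”: a small-scale version of the limitation with irony noted above.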
Adapted from my column in Jornal i of September 21st, 2021