First stage: Comprehension
Comprehending written texts on the net takes considerable effort: we must not only know and apply formal language rules but also account for the usage habits that prevail in online communities, such as misspellings, grammatical deformations, abbreviations, colloquial expressions, words broken up by characters standing in for letters, and so on.
Our main purpose is to handle each and every language level, colloquial expressions included. And we do it well! Below we show what this is about, what our robot’s different capabilities are, and how they are applied to text comprehension.
About this level: Phonemes and graphemes
What does our robot do? It decomposes each word into its most basic graphemes and, from these, obtains its constitutive phonemes.
What for?
- We recognize written terms with incorrect graphemes that represent the same phoneme:
  Ztupid —> Stupid
  Laf —> Laugh
  Idyot —> Idiot
- We incorporate new graphemes typical of net language.
- We recognize keyboard mashing:
  Sdfsdf wq —> Nonsense
- We recognize arbitrary letter repetitions:
  Foooool —> Fool
- We recognize typing mistakes:
  Therefre —> Therefore
- We recognize alternative characters, such as numbers and symbols, used in place of letters:
  4rt —> Art
  Stup1d —> Stupid
  Møm —> Mom
- We recognize special characters used to trick filters:
  i-d.i/o.t —> Idiot
- We recognize when capital letters hide words or convey some kind of aggression:
  Can you HEAR what I’m saying?

About this level: Units with meaning and the relations among them
What does our robot do? It recognizes the internal structure of each word (that is, its root and the rest of its constitutive morphemes).
What for?
- We write instructions that specify only a word’s simplest form.
  By typing only the word “stupid”, our robot detects any of the 38 million possible combinations (stupidly, stupidish, etc.).
- We rely on additional information for each word (such as tense, gender and number), which helps in comprehending the idea.
  They drank —> from the infinitive “drink”: simple past tense, second/third person plural.
- We distinguish hidden words:
  You are anidi ot, or Youareanidiot
- We recognize the syntactic function of written words, configuring complex classification rules with morpho-syntactic information, e.g. accepting self-addressed aggressions (“how foolish I am”) while rejecting aggressions addressed to others (“how foolish you are”).
- We disambiguate the meaning of terms:
  He beat the child —> Disapproved
  The drum’s beat —> Approved

About this level: The meaning a sentence acquires from the meanings of the words composing it
What does our robot do? It recognizes formal and informal expressions of:
- like and dislike,
- aggression and hostility (insults, sexual references, racism, etc.), separated by region,
- contact data, etc.
What for?
- We detect positive and negative statements (sentiment analysis):
  “I love this hotel” —> positive statement about the object “hotel”.
  “I didn’t like the movie at all” —> negative statement about the object “movie”.

* We did not include the pragmatic level because correctly interpreting a text at that level involves controlling elements beyond language itself (such as culture, context, and so on).
This level is about the phonemes and graphemes that compose a word. Even though the graphic system is not strictly a language level, we will treat it as one here because of the importance graphic forms take on in the interpretation of written text.
A phoneme is the simplest linguistic unit that allows us to encode a message, and it can branch into multiple “phones”. For example, one person may pronounce the word “shape” emphasizing the “sh” sound, while another may use a lower tone. In both cases there is no difference in meaning: these are two different “phones” (ways of pronouncing the “sh” sound) representing the same phoneme, and we understand the word itself beyond its specific pronunciation features.
Written language introduces another dimension: graphemes, the minimal graphic units composing a term. In Spanish, for example, the letters “c” and “h” form the grapheme “ch”, which works as a single particle or letter. Different graphemes can also represent the same phoneme: in Spanish, “v” and “b” both stand for the phoneme /b/.
Our robot decomposes each word into its basic graphemes, taking into account the language of the text, and from these obtains its constitutive phonemes. This makes it possible to recognize a term even when the person did not write it correctly (incorrect graphemes representing the same phoneme).
Let’s look at the word “love”, which could potentially be written in any of the following ways:
Keepcon’s robot decomposes each spelling into basic graphemes…
  l o v e
  l u v
  l o v
…and turns them into the equivalent phonemes, recognizing all three as the same word.
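As a rough illustration of this idea (a toy sketch, not Keepcon’s actual implementation), variant grapheme sequences can be mapped to a shared phoneme key; the mapping table below is invented for this example:

```python
# Toy sketch: map variant grapheme sequences to a canonical phoneme symbol,
# so "love", "luv", and "lov" all normalize to the same key.
# The mapping table is illustrative, not a real phoneme inventory.
GRAPHEME_TO_PHONEME = {
    "ove": "UV",  # "love" -> l + UV
    "uv": "UV",   # "luv"  -> l + UV
    "ov": "UV",   # "lov"  -> l + UV
}

def phoneme_key(word: str) -> str:
    """Replace known grapheme sequences with phoneme symbols (longest first)."""
    result = word.lower()
    for graphemes in sorted(GRAPHEME_TO_PHONEME, key=len, reverse=True):
        result = result.replace(graphemes, GRAPHEME_TO_PHONEME[graphemes])
    return result

# All three spellings collapse to the same phoneme key.
assert phoneme_key("love") == phoneme_key("luv") == phoneme_key("lov")
```

A real system would derive such mappings per language from a full grapheme-to-phoneme model rather than a hand-written table.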
This capacity also allows us to perceive when a group of letters does not actually make sense: when it is the product of keyboard mashing (e.g. sdflkh iuvcsdnfkf), when it contains arbitrary letter repetitions meant to avoid filters (e.g. hooooooooorror), and when it contains typing mistakes.
We also incorporate new graphemes that arise from the internet’s and messaging’s own ways of speaking and writing, such as replacing letters with numbers or symbols. More examples with the word “love”:
Decomposing into basic graphemes —> Conversion to equivalent phonemes
  looove (ooo = o) —> l o v e
  lov3 (3 = e) —> l o v e
  løve (ø = o) —> l o v e
In conclusion, we are able to recognize “love” no matter how the word is written (looove, lov3, løve, luv, lov).
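Character substitutions of this kind can be undone with a translation table; the table below is a small illustrative sample, since real systems maintain much larger, language-specific inventories:

```python
# Illustrative substitution table for number/symbol "leet-speak" spellings.
LEET_MAP = str.maketrans({"3": "e", "1": "i", "4": "a", "0": "o", "ø": "o", "@": "a"})

def unleet(word: str) -> str:
    """Lowercase the word and map substituted characters back to letters."""
    return word.lower().translate(LEET_MAP)

assert unleet("lov3") == "love"
assert unleet("løve") == "love"
assert unleet("Stup1d") == "stupid"
assert unleet("4rt") == "art"
```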
It is also possible to recognize when certain characters (such as spaces, hyphens, periods, etc.) are used to trick filters by splitting words apart (e.g. i-diot, i.d.i.o.t.).
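A minimal sketch of this check simply strips the separator characters before matching (the separator set is an assumption for the example):

```python
import re

def strip_separators(text: str) -> str:
    """Remove characters commonly inserted to split a word past a filter,
    e.g. 'i-d.i/o.t' -> 'idiot'."""
    return re.sub(r"[-./_*\s]+", "", text.lower())

assert strip_separators("i-d.i/o.t") == "idiot"
assert strip_separators("i.d.i.o.t.") == "idiot"
```

In practice this must be combined with dictionary checks so that legitimately hyphenated or spaced text is not collapsed into false positives.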
And finally, the robot recognizes when capital letters are being used to hide words or to express aggression, e.g. “FantASStic” (hidden word) or “can you HEAR me?” (aggressive tone conveyed by the capitals).
In sum, Keepcon’s robot’s command of this language level allows us to recognize the specific term behind a group of letters, beyond the graphic deformations used to express it.
This level, by contrast, is about units with meaning. We’ll briefly introduce its two sublevels:
At the morphological level, understanding a language implies recognizing each word’s internal structure: perceiving its lemma or root along with the other morphemes that shape its meaning (suffixes, prefixes, etc.).
The word “butterflies”, for example, comes from the root “butterfly”; by adding a plural suffix we indicate its number.
The syntactic level analyzes the relations between words in order to create complex structures such as syntagmas (groups of words that share a function within a statement) and phrases. For a statement’s syntax to be accurate, the words forming it must show concordance at the morphological level.
In natural language processing it is vital to recognize the meaning of a statement. Our command of the morpho-syntactic level allows us to:
- Write instructions for root words only, without specifying every possible inflection or derived word.
For example, say we want to detect mentions of the word “stupid” in our community. It isn’t necessary to register every derivation of this adjective; all we have to do is specify the root word. The possible forms and derived words add up to more than 38 million (stupid, stupidish, stupidized, stupidly, etc.), so registering them all exhaustively is simply impossible. Learn more about our linguistic blacklist.
- Rely on additional information about the words, such as tense, gender and number. This feature is essential for comprehending syntax, which in turn lets us understand the meaning of a text.
- Perceive words hidden inside unseparated or erroneously separated phrases (strategies also known as “merging” and “splitting”) intended to trick filters. For example: “you are anidio t” or “youareanidiot”.
- Classify words at a morpho-syntactic level. Say we want to detect only aggressions addressed to others, not self-addressed ones; our technology makes it possible to classify the phrase “You are an idiot!” as an aggression while allowing the self-addressed “I’m an idiot”.
- Disambiguate terms using their morpho-syntactic features. Take the word “beat”: it can convey the idea of injuring someone or, in a different context, an instrumental rhythm.
In the first case the verb “beat” indicates that one person physically injures another, while in the second the noun refers to the rhythm. As a result we reject the first case, since the use of the verb is aggressive, and allow the second, since the idea is appropriate.
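The root-matching and self-versus-other rules above can be sketched as follows; this is a deliberately crude toy (hand-written suffix stripping, a first-token pronoun check, a three-word insult list), where a production system would use a real morphological analyzer and parser:

```python
FIRST_PERSON = {"i", "i'm", "im", "me", "myself"}
INSULT_ROOTS = {"stupid", "idiot", "fool"}  # root words only, per the blacklist idea

def stem(word: str) -> str:
    """Crude suffix stripping so 'stupidly' or 'idiots' match their roots."""
    w = word.lower().strip(".,!?")
    for suffix in ("ly", "ish", "s"):
        if w.endswith(suffix) and w[: -len(suffix)] in INSULT_ROOTS:
            return w[: -len(suffix)]
    return w

def classify(sentence: str) -> str:
    """Reject insults aimed at others; tolerate self-addressed ones."""
    tokens = [stem(t) for t in sentence.split()]
    if not any(t in INSULT_ROOTS for t in tokens):
        return "approved"
    # Naive subject check: a first-person opener marks it as self-addressed.
    return "approved" if tokens[0] in FIRST_PERSON else "rejected"

assert classify("You are an idiot!") == "rejected"
assert classify("I'm an idiot") == "approved"
assert classify("Nice hotel") == "approved"
```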
Semantic and lexical level
This level refers to the meaning a phrase acquires, which is in turn given by the words composing it. Controlling this stage is a great technological challenge: we need to understand not only each word’s meaning and the statement’s structure, but also the meaning of the terms taken as a whole.
To achieve this, we developed our “Ontology manager” tool, which allows us to add information about words and the relations between them. These relations vary by country, region, age range, social class, and so on (e.g. a word may be an aggression in some countries but not in others).
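A miniature, hypothetical version of such an ontology entry can be modeled as a term with per-region senses (the data structure and field names are invented for this sketch; the Spanish verb “coger”, neutral in Spain but vulgar in much of Latin America, is a classic real-world case of regional variation):

```python
# Hypothetical miniature ontology: each term carries one or more senses,
# and each sense lists the regions where that reading applies.
ONTOLOGY = {
    "coger": [
        {"category": "neutral", "regions": {"ES"}},       # "to take" in Spain
        {"category": "sexual", "regions": {"AR", "MX"}},  # vulgar in Argentina/Mexico
    ],
}

def category_for(term: str, region: str) -> str:
    """Return the category of a term as read in a given region."""
    for sense in ONTOLOGY.get(term.lower(), []):
        if region in sense["regions"]:
            return sense["category"]
    return "unknown"

assert category_for("coger", "ES") == "neutral"
assert category_for("coger", "AR") == "sexual"
```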
We are a pioneering team in the research of vulgar and informal expressions (such as regional insults and sexual, scatological, aggressive or intolerant content). Our knowledge covers expressions from all over the Americas, always contemplating regional words and phrases. To stay up to date, we keep daily track of net-speak trends and immediately register newly emerging coinages in our database.
Our knowledge also covers contact-data infringement (such as new, non-standardized and creative ways of conveying phone numbers), spam and fraud, public and controversial figures, and expressions associated with illegality (from pedophilia to incitement of a coup d’état).
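One creative-phone-number trick is spelling digits out as words; a crude detector (illustrative only, with an assumed seven-digit threshold and a deliberately naive word-replacement step) could look like this:

```python
import re

# Spelled-out digits; a real system covers many more words, languages,
# and formats, and avoids false hits like the "one" inside "phone".
NUMBER_WORDS = {"zero": "0", "one": "1", "two": "2", "three": "3", "four": "4",
                "five": "5", "six": "6", "seven": "7", "eight": "8", "nine": "9"}

def extract_digits(text: str) -> str:
    """Normalize spelled-out digits to numerals, then keep only digits."""
    lowered = text.lower()
    for word, digit in NUMBER_WORDS.items():
        lowered = lowered.replace(word, digit)
    return "".join(re.findall(r"\d", lowered))

def looks_like_phone(text: str) -> bool:
    """Crude threshold: seven or more digits suggests a phone number."""
    return len(extract_digits(text)) >= 7

assert looks_like_phone("call me at five5five one2three4") is True
assert looks_like_phone("I love this hotel") is False
```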
Within the semantic universe, we have mainly focused on the detection of statements expressing acceptance or rejection of certain objects, brands and people in general.
For example, if we receive the following content, we are able to know what is being criticized and how:
“I love this hotel” –> positive statement about the object “hotel”
“I didn’t like the movie at all” —> negative statement about the object “movie”
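A minimal lexicon-based sketch of this kind of polarity detection (the word lists and the single-negator flip are assumptions for the example, not Keepcon’s engine):

```python
POSITIVE = {"love", "like", "great"}
NEGATIVE = {"hate", "awful", "bad"}
NEGATORS = {"not", "didn't", "never", "no"}

def sentiment(sentence: str) -> str:
    """Score a sentence by lexicon hits; a negator flips the polarity."""
    tokens = [t.strip(".,!?").lower() for t in sentence.split()]
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    if any(t in NEGATORS for t in tokens):
        score = -score  # "didn't like" reads as negative despite "like"
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

assert sentiment("I love this hotel") == "positive"
assert sentiment("I didn't like the movie at all") == "negative"
```

Linking the polarity to its object (“hotel”, “movie”), as in the examples above, additionally requires the syntactic analysis described earlier.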
The pragmatic level specializes in the principles that regulate language use in communication. This means it is always influenced by extra-linguistic factors (such as culture, context, the life stories of the people involved in the communication, etc.) that determine which statements can be produced and how they are interpreted.
Therefore, the correct interpretation of a text at this level requires total control of certain elements that exist beyond language itself.
For example, when someone in the United States asks “Can you pass the salt?”, they are clearly not asking about the listener’s physical capability; it is a request from the speaker.
So here at Keepcon, we aim to keep developing text comprehension across the different language levels, particularly the syntactic, semantic and pragmatic levels, where there is ample room for improvement.