Technology: The science behind Keepcon

This page explains in detail how we work at every language level in order to comprehend and classify unstructured text.

From a very early age we take the ability to speak and understand other people for granted, so we never stop to think about how complex the functions are that our brain performs to make it happen. Creating a robot that understands natural language is one of the biggest challenges in science today.

And we love that!

That’s why here at Keepcon we are dedicated to developing advanced technology for natural language comprehension, and we create applications that make corporate work much easier.

Our technology allows us to comprehend user-generated content on web 2.0 sites, in order to moderate and control comments on the web. However, there is still a lot to do, and our knowledge can be applied in many other ways.

Let’s take a look at Keepcon’s technology

Our job is divided into two parts: understanding the meaning of a text, and classifying it according to the criteria and the needs of each client.

First stage: Comprehension

Comprehending written text on the net takes a lot of effort: we must not only know and apply the formal rules of language but also account for the usage and writing habits that prevail in each community, such as misspellings, grammatical distortions, abbreviations, colloquial expressions, words split by separator characters, symbols standing in for letters, and so on.

Our main goal is to handle every language level, colloquial expressions included. And we do it well! Below we show what each level is about, what our robot can do at that level, and how those capabilities are applied in text comprehension.

For each level* we list what the level is about, what our robot does, what for, and examples.

Level: Phonetic–Phonological-Graphic
About this level: Phonemes and graphemes.

What does our robot do? Decomposes each word into its most basic graphemes and obtains its constitutive phonemes from them.
  • What for? We recognize written terms with incorrect graphemes that represent the same phoneme.
    Examples: Ztupid -> Stupid; Laf -> Laugh; Idyot -> Idiot

What does our robot do? Incorporates the new graphemes typical of net language.
  • What for? We recognize keyboarding.
    Example: Sdfsdf wq -> Nonsense
  • What for? We recognize arbitrary letter repetitions.
    Example: Foooool -> Fool
  • What for? We recognize typing mistakes.
    Example: Therefre -> Therefore
  • What for? We recognize the use of alternative characters, such as numbers and symbols, in place of letters.
    Examples: 4rt -> Art; Stup1d -> Stupid; Møm -> Mom
  • What for? We recognize the use of special characters intended to trick filters.
    Example: i-d.i/o.t -> Idiot
  • What for? We recognize when capital letters are hiding words or showing some kind of aggression.
    Examples: FantASStic; Can you HEAR what I’m saying?

Level: Morpho-Syntactic
About this level: Units with meaning and the relations among them.

What does our robot do? Recognizes the internal structure of each word (that is, its root and the rest of its constitutive morphemes).
  • What for? We program instructions which specify only the word in its simplest form.
    Example: By only typing the word “stupid”, our robot detects any of the 38 million possible combinations (stupidly, stupidish, etc.)
  • What for? We count on additional information for each word (such as tense, gender and number), which helps in comprehending the idea.
    Example: They drank -> from the verb infinitive “drink”, simple past tense, 2nd/3rd person, plural.
  • What for? We distinguish hidden words.
    Examples: You are anidi ot; Youareanidiot

What does our robot do? Recognizes the syntactic function of the written words.
  • What for? We are able to configure complex classification rules using morpho-syntactic information.
    Example: The acceptance of self-addressed aggressions (for example: “how foolish I am”) and the rejection of aggressions addressed to others (“how foolish you are”)
  • What for? We disambiguate the meaning of the terms.
    Examples: He beat the child -> Disapproved; The drum’s beat -> Approved

Level: Semantic
About this level: The meaning acquired by a sentence, which comes from the meaning of the words composing it.

What does our robot do? Recognizes the formal and informal expressions of:
  • like and dislike,
  • aggression and hostility (insults, sexual references, racism, etc.) separated by region,
  • contact data, etc.
What for? We detect positive and negative statements (Sentiment Analysis).
  Examples:
  “I love this hotel” -> positive statement about the object “hotel”.
  “I didn’t like the movie at all” -> negative statement about the object “movie”.

*We didn’t include the pragmatic level because the correct interpretation of a text at this level involves controlling elements which are beyond language (such as culture, context, and so on).

Phonetic–Phonological-Graphic Level

This level is about phonemes and graphemes composing a word. Even if the graphic system doesn’t strictly shape a language level, here we will treat it as such due to the importance that the graphic forms get in the interpretation of written text.

A phoneme is the simplest linguistic unit that allows us to encode a message, and it can be realized as several different “phones”. For example, one person may pronounce the word “shape” stressing the “sh” sound, while another may pronounce it more softly. In both cases the meaning is the same. This is because the two “phones” (ways of pronouncing the “sh” sound) represent the same phoneme, so we understand the word regardless of its specific pronunciation.

Written language adds another dimension: graphemes, the minimal written units that make up a term. In Spanish, for example, the letters “c” and “h” together represent the phoneme “ch”, which works as a single unit. Different graphemes can also represent the same phoneme: in Spanish, “v” and “b” both represent the phoneme /b/.

Our robot decomposes each word into its basic graphemes, taking into account the language of the text, and from them obtains the constitutive phonemes. This makes it possible to recognize a term even when it is misspelled (incorrect graphemes that represent the same phoneme).

Let’s take a look at the word “love”, which could potentially be written in the following ways:

Word   Basic graphemes (decomposed by Keepcon’s robot)   Equivalent phonemes
Love   l  o  v  e                                        \’ləv\
Luv    l  u  v                                           \’ləv\
Lov    l  o  v                                           \’ləv\
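
As a rough illustration of the idea in the table above (a minimal sketch, not Keepcon’s actual implementation), the snippet below reduces each word to a crude phonetic key so that spellings that sound alike collide. The rule table, the silent final “e” heuristic and the names `phonetic_key`, `LEXICON` and `recognize` are assumptions made for this example only.

```python
# Illustrative grapheme rules only; these are assumptions, not Keepcon's real tables.
GRAPHEME_RULES = {"au": "a", "gh": "f", "ph": "f", "z": "s", "y": "i", "u": "o"}
MAX_GRAPHEME = max(len(g) for g in GRAPHEME_RULES)


def phonetic_key(word):
    """Reduce a word to a crude phonetic key so that sound-alike spellings collide."""
    w = word.lower()
    if len(w) > 3 and w.endswith("e"):
        w = w[:-1]  # treat a final "e" as silent (rough heuristic)
    key = []
    i = 0
    while i < len(w):
        for size in range(MAX_GRAPHEME, 0, -1):  # try the longest grapheme first
            chunk = w[i:i + size]
            if chunk in GRAPHEME_RULES:
                key.append(GRAPHEME_RULES[chunk])
                i += len(chunk)
                break
        else:
            key.append(w[i])  # no rule applies: the letter stands for itself
            i += 1
    return "".join(key)


LEXICON = ["love", "stupid", "laugh", "idiot"]  # hypothetical list of known terms
INDEX = {phonetic_key(term): term for term in LEXICON}


def recognize(word):
    """Return the known term that sounds like `word`, or None if there is none."""
    return INDEX.get(phonetic_key(word))
```

With these toy rules, `recognize("Luv")` and `recognize("Lov")` both resolve to “love”. A real system would need per-language rules that depend on context, which this sketch ignores.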

This capability also lets us tell whether a group of letters actually makes sense or is just random keyboarding (e.g.: sdflkh iuvcsdnfkf), and to recognize arbitrary repetitions of letters used to avoid filters (e.g.: hooooooooorror) as well as typing mistakes.

We also incorporate the new graphemes that arise from the way people speak and write on the internet and in messaging, such as replacing letters with numbers or symbols. Here are more examples with the word “love”:
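
One possible way to resolve arbitrary letter repetitions is sketched below, under the assumption that a lexicon of known words is available. Every run of repeated letters is tried both as a single and as a double letter, and the candidates are checked against the lexicon. `VOCABULARY` is a tiny stand-in and both function names are hypothetical.

```python
from itertools import groupby, product

VOCABULARY = {"fool", "horror"}  # assumption: tiny stand-in for a real lexicon


def repetition_candidates(word):
    """Return every spelling obtained by shrinking each letter run to one or two letters."""
    runs = [(ch, len(list(group))) for ch, group in groupby(word.lower())]
    # A single letter stays as it is; a repeated letter may be single or double.
    options = [(ch,) if count == 1 else (ch, ch * 2) for ch, count in runs]
    return {"".join(parts) for parts in product(*options)}


def resolve_repetitions(word):
    """Return a known word hidden behind repeated letters, or None."""
    matches = sorted(repetition_candidates(word) & VOCABULARY)
    return matches[0] if matches else None
```

A string that matches nothing in the lexicon, such as “sdfsdf”, returns `None`, which a moderation rule could treat as a hint of random keyboarding.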

Text     Grapheme recognition   Decomposition into basic graphemes   Conversion to equivalent phonemes
Looove   ooo = o                l  o  v  e                           \’ləv\
Lov3     3 = e                  l  o  v  e                           \’ləv\
Løve     ø = o                  l  o  v  e                           \’ləv\

In conclusion, we are able to recognize “love” no matter how the word is written (looove, lov3, løve, luv, lov).
It is also possible to recognize when certain characters (such as spaces, hyphens, periods, etc.) are used to trick filters by splitting words apart (e.g.: i-diot, i.d.i.o.t.).
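
The substitutions and separators described above can be sketched as a simple normalization step. This is only an illustration: the substitution table and the separator set are assumptions, and a real filter would have to check the result against a lexicon so that legitimate hyphenated words are not mangled.

```python
import re

# Assumed look-alike characters; a production table would be far larger.
SUBSTITUTIONS = str.maketrans({"4": "a", "1": "i", "3": "e", "0": "o", "ø": "o", "@": "a"})
# Assumed separator characters that may be inserted to split a word.
SEPARATORS = re.compile(r"[-./_*]")


def normalize_token(token):
    """Map look-alike characters to letters and drop separators inside a token."""
    text = token.lower().translate(SUBSTITUTIONS)
    return SEPARATORS.sub("", text)
```

For instance, `normalize_token("Stup1d")` gives “stupid” and `normalize_token("i-d.i/o.t")` gives “idiot”.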

Finally, the robot recognizes when capital letters are used to hide words or to express aggression, e.g.: “FantASStic” (hidden word) or “can you HEAR me?” (aggressive tone conveyed by capital letters).

In short, our robot’s command of this language level lets us recognize the term behind a group of letters, regardless of the graphic distortions used to write it.
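
Both uses of capital letters can be approximated with two small checks, shown here as a sketch. The blacklist `BLOCKED` and the function names are hypothetical, and the “shouting” test is a plain heuristic based on fully capitalized words.

```python
import re

BLOCKED = {"ass"}  # hypothetical blacklist entry, for illustration only


def hidden_caps_words(token):
    """Return blocked words spelled by uppercase runs inside a mixed-case token."""
    if not any(ch.islower() for ch in token):
        return []  # an all-caps token hides nothing; it is handled as shouting
    runs = re.findall(r"[A-Z]{2,}", token)
    return [run.lower() for run in runs if run.lower() in BLOCKED]


def shouted_words(sentence):
    """Return the words of two or more letters written entirely in capitals."""
    words = re.findall(r"[A-Za-z']+", sentence)
    return [word for word in words if len(word) > 1 and word.isupper()]
```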

Morpho-syntactic Level

Unlike the previous level, this level deals with units that carry meaning. We’ll briefly introduce its two sublevels:

Morphologic

At this level, understanding a language means recognizing each word’s internal structure: its lemma or root, plus the other morphemes (suffixes, prefixes, etc.) that determine its meaning.

The word “butterflies”, for example, comes from the root-word “butterfly”, and by adding the suffix “ies”, we are indicating its number (plural).
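
A very small sketch of this kind of analysis, assuming a handful of English suffix rules and a list of known roots, could look as follows. `SUFFIX_RULES`, `ROOTS` and `find_root` are illustrative names; real morphological analyzers use much richer rule sets.

```python
# Assumed suffix rules: (suffix to remove, text to put back in its place).
SUFFIX_RULES = [("ies", "y"), ("ized", ""), ("ish", ""), ("ly", ""), ("s", "")]
ROOTS = {"butterfly", "stupid"}  # hypothetical list of known root-words


def find_root(word):
    """Return the known root-word behind an inflected or derived form, or None."""
    w = word.lower()
    if w in ROOTS:
        return w
    for suffix, replacement in SUFFIX_RULES:
        if w.endswith(suffix):
            candidate = w[: len(w) - len(suffix)] + replacement
            if candidate in ROOTS:
                return candidate
    return None
```

Here `find_root("butterflies")` returns “butterfly”, so a rule written once for the root-word also covers the plural.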

Syntactic

The syntactic level analyzes how words relate to each other to form complex structures such as syntagmas (groups of words that share the same function in a statement) and phrases. For a statement’s syntax to be correct, the words forming it must agree at the morphological level.

When processing natural language, it is vital to recognize the meaning of a statement.

When it comes to language, handling the morpho-syntactic level allows us to:

  1. Create instructions for root-words only, without specifying every possible inflection or derived form of the word.
    For example, let’s say we want to detect when the word “stupid” is mentioned in our community. There is no need to check every registered derivation of this adjective; all we have to do is find the root-word. The possible inflected and derived forms add up to more than 38 million (stupid, stupidish, stupidized, stupidly, etc.), so registering all of them exhaustively is not practical. Learn more about our linguistic blacklist.
  2. Rely on additional information about each word, such as tense, gender, number, etc. This is essential for comprehending syntax, which in turn lets us understand the meaning of a text.
  3. Detect hidden words inside phrases that are run together or wrongly separated (strategies also called “merging” and “splitting”) in order to trick filters. For example: “you are anidio t” or “youareanidiot”.
  4. Classify words at a morpho-syntactic level. Let’s say we want to detect only aggressions addressed to others, not self-addressed ones; our technology makes it possible to classify the phrase “You are an idiot!” as an aggression while allowing the self-addressed “I’m an idiot”.
  5. Disambiguate terms by using their morpho-syntactic features. For example, the word “beat” can express the idea of injuring someone or, in a different context, refer to an instrumental rhythm.

In “He beat the child”, the verb “beat” conveys that one person physically injures another, while in “the drum’s beat” the noun refers to an instrumental rhythm. As a result, we reject the first case because the verb is used aggressively, and we allow the second because its meaning is appropriate.
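
Two of the capabilities listed above, recovering merged or split words and telling aggressions addressed to others from self-addressed ones, can be sketched together. This is a simplification: `segment` uses a dictionary-based word-break search, and `targets_others` is a purely lexical rule (an insult plus a second-person pronoun) standing in for real syntactic analysis. The vocabulary, the insult list and both function names are assumptions.

```python
import re

# Assumption: tiny stand-in vocabulary and insult list, for illustration only.
VOCABULARY = {"you", "are", "an", "a", "i", "am", "idiot"}
INSULTS = {"idiot"}
SECOND_PERSON = {"you"}


def segment(text):
    """Rebuild the word sequence from merged or wrongly split text, or return None."""
    letters = "".join(re.findall(r"[a-z]+", text.lower()))
    # best[k] holds one segmentation of the first k letters, or None if there is none.
    best = [None] * (len(letters) + 1)
    best[0] = []
    for end in range(1, len(letters) + 1):
        for start in range(end):
            piece = letters[start:end]
            if best[start] is not None and piece in VOCABULARY:
                best[end] = best[start] + [piece]
                break
    return best[-1]


def targets_others(text):
    """Return True when an insult appears together with a second-person pronoun."""
    words = segment(text)
    if words is None:
        return False
    found = set(words)
    return bool(found & INSULTS) and bool(found & SECOND_PERSON)
```

Both “you are anidio t” and “youareanidiot” are segmented into the same four words and flagged, while “I am an idiot” is not.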

Semantic and lexical level

This level refers to the meaning a phrase acquires from the words composing it. Handling this stage is a great technological challenge because we need to understand not only the meaning of each word and the structure of the statement, but also the meaning of the statement as a whole.

To achieve this, we have developed the “Ontology manager” tool, which lets us add information about words and the relations between them. These relations vary by country, region, age range, social class, and so on (e.g.: a word may be aggressive in some countries but not in others).

We are a pioneering team in the research of vulgar and informal expressions (such as regional insults and sexual, scatological or aggressive content that is considered intolerant). Our knowledge covers expressions from all over America, always taking regional words and phrases into account. To stay up to date, we track net-speak trends daily and immediately register newly coined expressions in our database.

Our knowledge also covers contact-data infringement (such as new, non-standardized and creative ways of conveying phone numbers), spam and fraud, public and controversial figures, and expressions associated with illegality (ranging from pedophilia to the encouragement of a coup d’état).
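
To make the idea of region-dependent meaning concrete, here is a minimal sketch of how such information could be stored and queried. It does not describe the actual “Ontology manager”; the `Entry` structure, the region codes and the category labels are assumptions. The sample term is the Spanish verb “coger”, which is an everyday word in Spain but vulgar in Argentina and Mexico.

```python
from dataclasses import dataclass, field


@dataclass
class Entry:
    """Hypothetical ontology entry: a default category plus regional overrides."""
    default: str
    by_region: dict = field(default_factory=dict)


ONTOLOGY = {
    "coger": Entry(default="neutral", by_region={"AR": "vulgar", "MX": "vulgar"}),
}


def category(term, region):
    """Return the category of a term for a region, falling back to its default."""
    entry = ONTOLOGY.get(term.lower())
    if entry is None:
        return "unknown"
    return entry.by_region.get(region, entry.default)
```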

Within the semantic universe, we have mainly focused on the detection of statements expressing acceptance or rejection of certain objects, brands and people in general.

For example, if we receive the following content, we can tell what is being judged and how:

“I love this hotel” -> positive statement about the object “hotel”

“I didn’t like the movie at all” -> negative statement about the object “movie”
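
A toy version of this kind of detection, using a small opinion lexicon and a simple negation flag, is sketched below. The word lists and the function `analyze` are assumptions for illustration; real sentiment analysis relies on the syntactic and semantic information described earlier.

```python
import re

# Assumed mini-lexicons, for illustration only.
POSITIVE = {"love", "like"}
NEGATIVE = {"hate"}
NEGATIONS = {"not", "didn't", "don't", "never"}
OBJECTS = {"hotel", "movie"}


def analyze(text):
    """Return (polarity, object) for a short opinion; either value may be None."""
    # Normalize typographic apostrophes so that "didn’t" matches "didn't".
    tokens = re.findall(r"[a-z']+", text.lower().replace("’", "'"))
    polarity = None
    target = None
    negated = False
    for token in tokens:
        if token in NEGATIONS:
            negated = True
        elif token in POSITIVE or token in NEGATIVE:
            is_positive = (token in POSITIVE) != negated  # a negation flips the polarity
            polarity = "positive" if is_positive else "negative"
            negated = False
        elif token in OBJECTS and target is None:
            target = token
    return polarity, target
```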

Pragmatic level

This level of language deals with the principles that regulate its use in communication. It is therefore always influenced by extra-linguistic factors (such as culture, context, the life stories of the people involved in the communication process, etc.), which determine what kinds of statements can be produced and how they are interpreted.

Therefore, the correct interpretation of a text at this level requires full command of elements that exist beyond the language itself.

For example, when someone in the United States asks “Can you pass the salt?”, they are clearly not asking about physical capability; they are making a request.

So here at Keepcon, we aim to keep developing text comprehension at its different language levels, specifically the syntactic, semantic and pragmatic levels, where there is a lot of room for improvement.

Second stage: Classification

Once we understand the text, we must classify it in order to make a decision about it. This classification goes beyond the purely linguistic and depends on the requirements of each client.

In order to accomplish this, we combine two technologies based on artificial intelligence.

  1. Symbolic technology: we enter classification rules, based on what each client needs, using a tool called “Configuration Manager”. This way we establish linguistic patterns that allow us, for example, to detect the sale of illegal objects, find when someone is criticizing a brand, detect whether a user might be a pedophile, etc. Get to know more about our linguistic patterns.
  2. Statistical technology or Machine Learning: we train statistical algorithms on a “corpus”, a collection of texts that have already been classified. Although this technology is within reach of any technology company, at Keepcon we have an advantage: we apply our comprehension and classification capabilities to remove words that carry no value (such as articles, conjunctions and prepositions) and to homogenize words, which significantly improves the performance of standard algorithms. We continually test these algorithms with real cases to expand the level of recognition and improve precision.
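
The statistical side can be illustrated with a small sketch: words with no value are removed, the remaining words are lowercased, and a classifier is trained on a labeled corpus. The source does not say which algorithm Keepcon uses; a multinomial naive Bayes classifier with add-one smoothing is used here only as a familiar example, and the stopword list, the corpus and the labels are invented for the illustration.

```python
import math
import re
from collections import Counter, defaultdict

# Assumed stopword list: articles, conjunctions, prepositions and similar words.
STOPWORDS = {"the", "a", "an", "and", "or", "of", "in", "to", "for", "is", "this", "i"}


def features(text):
    """Lowercase the text and drop the words that carry no value."""
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]


def train(corpus):
    """Count words per label over a corpus of (text, label) pairs."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for text, label in corpus:
        label_counts[label] += 1
        word_counts[label].update(features(text))
    return word_counts, label_counts


def classify(text, model):
    """Return the most probable label under naive Bayes with add-one smoothing."""
    word_counts, label_counts = model
    vocabulary = {w for counts in word_counts.values() for w in counts}
    total_docs = sum(label_counts.values())
    scores = {}
    for label in label_counts:
        total_words = sum(word_counts[label].values())
        score = math.log(label_counts[label] / total_docs)
        for word in features(text):
            score += math.log((word_counts[label][word] + 1) / (total_words + len(vocabulary)))
        scores[label] = score
    return max(scores, key=scores.get)


# Hypothetical, hand-made training corpus.
CORPUS = [
    ("cheap watches for sale", "spam"),
    ("buy cheap pills", "spam"),
    ("great hotel and friendly staff", "ok"),
    ("the movie was great", "ok"),
]
MODEL = train(CORPUS)
```

With this toy corpus, `classify("cheap pills for sale", MODEL)` returns “spam” and `classify("a great movie", MODEL)` returns “ok”.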

What do we accomplish by combining these two technologies? The highest precision in automatic classification in the market! Verify this using our technology… contact us!

Use beyond moderation

The semantic technology we developed has many uses besides automatic content moderation. Imagine everything you can do by automatically comprehending unstructured written text! We have already advanced in this direction by automating key business processes such as customer service management, brand monitoring on the internet, and the classification of high volumes of text, among others. We explain these other applications in more detail below:

  • Sentiment Analysis: classification of a user’s opinion or sentiment about a subject. The sentiment is usually classified as “positive”, “negative” or “neutral”, or according to any other categorization the client requires (angry, furious, happy, sad, etc.).
  • Positive Tagging: categorization of content according to criteria defined beforehand by the client. For example, detecting the reason for a complaint in the texts a call center receives through its digital channels (such as Facebook, Twitter, YouTube, etc.).
  • Parental Control: detection of inappropriate content and dangerous behavior patterns (such as invitations to meet up, sexual language, among others) on sites for children and teenagers, whether virtual worlds or content sites.
  • Customer Care: general classification of inquiries made by current or prospective clients, routing according to this classification, and automation of the answers when possible.