Friday 25 April 2014

Industry voice: That's sick! Text Mining and words with multiple definitions

When you read the title of this article, you must wonder what I'm talking about when I say, "That's sick!"


It makes sense if I just witnessed a car accident so heinous that it made me feel sick to my stomach. However, it also makes sense if I just saw Sidney Crosby score the game-winning goal in the gold medal game at the 2014 Sochi Olympics. A difficulty with linguistics is that the same word can have multiple meanings.


In the English language, the word "sick" is defined by the Oxford dictionary as follows: "affected by physical or mental illness". What you won't find in the Oxford dictionary is the slang meaning of "sick", which Urban Dictionary defines as: "crazy; cool; insane".


Good or bad?


How can a machine decipher whether we are talking about the "good sick" or the "bad sick"?


Let's take a step back: how can humans tell which "sick" we are talking about? Humans get help from things like body language, the tone of the communicator's voice, eye contact and facial expression, as well as cultural symbols like clothing, hair style, and location.


Natural language processing technology like text mining can't use the aforementioned methods of communication. It's just not possible… Yet. In about 5-10 years, when image recognition and emotion analytics become more advanced, we may start to get cues from body language and voice tone.


Text mining must rely on the contextual understanding of the sentence to tell the difference between the two meanings of the same word.


The words that surround "sick", and the order of those words, contribute to the contextual understanding of a sentence. Let's take a look at a couple of examples:


Example 1 – "Looking at that car accident made me feel sick"


A text mining engine knows that when the word "feel" comes directly before the word "sick", "sick" should be tagged with negative sentiment. The engine knows that feeling sick is bad.
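To make the idea concrete, here is a minimal sketch of a "preceding word" rule in Python. It is not how any particular engine (Lexalytics' included) works internally; the cue-word list, the function names, and the "unknown" fallback are all assumptions made up for illustration.

```python
import re

# Hypothetical cue words (an assumption for this sketch); a real engine
# would rely on a far richer lexicon and grammar model.
FEEL_WORDS = {"feel", "feels", "feeling", "felt"}


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def sick_sentiment_by_preceding_word(text):
    """Tag the first "sick" in the text.

    Returns "negative" if the word directly before "sick" is a feel-word,
    "unknown" if "sick" appears without that cue, and None if the text
    does not contain "sick" at all.
    """
    tokens = tokenize(text)
    for i, token in enumerate(tokens):
        if token == "sick":
            if i > 0 and tokens[i - 1] in FEEL_WORDS:
                return "negative"
            return "unknown"
    return None


print(sick_sentiment_by_preceding_word(
    "Looking at that car accident made me feel sick"))
```

With this rule alone, Example 1 is tagged negative, while a sentence that has no feel-word in front of "sick" is left undecided and needs some other kind of context.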


Example 2 – "Wow, Crosby's goal was sick!"


A text mining engine will know that a "goal" can't be "sick" by definition. A goal isn't a living thing and can't be affected by illness; therefore, a goal can't be sick in the literal sense. (Most text mining engines draw their knowledge from some sort of semantic ontology. Here is an example of Lexalytics' text mining concept matrix.)
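As a rough illustration of that kind of lookup, the sketch below uses a tiny hand-written "is a" hierarchy. The concepts, the category names, and both functions are hypothetical; a real semantic ontology is vastly larger and this is not Lexalytics' concept matrix.

```python
# Toy ontology (an assumption for this sketch): each concept points to
# its parent category.
IS_A = {
    "goal": "event",
    "event": "abstract_thing",
    "player": "person",
    "person": "living_thing",
    "dog": "animal",
    "animal": "living_thing",
}


def is_living(concept):
    """Walk up the IS_A chain and report whether it ends at 'living_thing'."""
    seen = set()
    while concept in IS_A and concept not in seen:
        seen.add(concept)  # guard against accidental cycles in the table
        concept = IS_A[concept]
    return concept == "living_thing"


def can_be_literally_sick(concept):
    """Only living things can be affected by illness."""
    return is_living(concept)


print(can_be_literally_sick("goal"))    # a goal is an event, not a living thing
print(can_be_literally_sick("player"))  # a player is a person
```

When the thing being described cannot be literally sick, the dictionary meaning is ruled out, and the engine has to look for another reading of the word.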


However, if you are working with a dataset about sports, you can train the engine to carry a positive sentiment for the word "sick" whenever it appears in a sentence near the word "goal."
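A domain rule of this kind could look roughly like the sketch below. The context-word set, the window size of four tokens, and the function name are assumptions chosen for illustration, and real engines are trained rather than hard-coded this way.

```python
import re

# Hypothetical context words for a sports dataset (an assumption).
CONTEXT_WORDS = {"goal", "goals"}


def sick_sentiment_in_sports(text, window=4):
    """Tag the first "sick" in the text for a sports dataset.

    Returns "positive" if a context word appears within `window` tokens
    on either side of "sick", "negative" otherwise (the dictionary sense),
    and None if the text does not contain "sick".
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    for i, token in enumerate(tokens):
        if token != "sick":
            continue
        nearby = tokens[max(0, i - window): i + window + 1]
        if any(word in CONTEXT_WORDS for word in nearby):
            return "positive"
        return "negative"
    return None


print(sick_sentiment_in_sports("Wow, Crosby's goal was sick!"))
print(sick_sentiment_in_sports("Looking at that car accident made me feel sick"))
```

A proximity rule like this is easy to fool: a sentence such as "He was sick, so the goal went in" also puts "goal" close to "sick" and would be tagged positive by this sketch.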


This is not the "be-all and end-all" solution. Words with multiple meanings, double entendres and sarcasm are very tricky things to work around when dealing with text mining. One day, we will have a flawless machine that is programmed with every known dialect, language and slang term; literally everything that encompasses language!


But for the time being, it's really cool that we have the ability to train a machine to understand context like a human.



  • Scott Van Boeyen is the community manager for Lexalytics and Semantria. He aims to help journalists and reporters with content related to big data and analytics, and he writes, blogs and provides thought leadership through social media.















