Last month I finished the openSAP course "Text Analytics with SAP HANA Platform". It was a nice opportunity to get a grip on the concepts involved, so in this blog I give a short overview of what I learned about Text Analytics and Text Mining.
Text Analytics can be used to:
- search text-related content, e.g. CRM-related documents
- extract meaningful, structured information from unstructured text
- combine unstructured data with structured data
As an example, when a survey asks for both a score and answers to open questions, it is useful to combine the two to see whether there is a correlation.
Also, for consumer-oriented companies, it can be used to give structure to information from social media.
SAP HANA executes the following steps to analyze text:
1. Structuring the data
In this step, all types of documents are taken apart and stored in fact tables in a structured way.
- File format filtering; converting any binary document format to text/HTML
- Language detection; identifying the language so the appropriate tokenization and stemming can be applied
- Tokenization; decomposing word sequences e.g. “card-based payment systems” becomes “card” “based” “payment” “systems”
- Stemming; normalizing tokens to linguistic base form e.g. houses -> house; ran -> run
- Identify part-of-speech; tagging word categories, e.g. quick: Adjective; houses: Noun-Plural
- Identify noun groups; identifying concepts e.g. text data; global piracy
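The structuring steps above can be sketched in a few lines of Python. This is my own toy illustration of tokenization and stemming, not the engine SAP HANA actually uses; the suffix rules are deliberate oversimplifications.

```python
import re

def tokenize(text):
    """Split a phrase into tokens, breaking on hyphens and whitespace."""
    return re.findall(r"[A-Za-z]+", text)

def stem(token):
    """Very naive suffix stripping; real stemmers also use dictionaries
    for irregular forms such as 'ran' -> 'run'."""
    for suffix in ("ing", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

tokens = tokenize("card-based payment systems")
print(tokens)                     # ['card', 'based', 'payment', 'systems']
print([stem(t) for t in tokens])  # ['card', 'based', 'payment', 'system']
```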
2. Entity determination
In this step, pre-defined entity types are used to classify the data, e.g. Winston Churchill: PERSON; U.K.: COUNTRY.
Possible entity types are:
- Who; people, job title, and national identification numbers
- What; companies, organizations, financial indexes, and products
- When; dates, days, holidays, months, years, times, and time periods
- Where; addresses, cities, states, countries, facilities, Internet addresses, and phone numbers
- How much; currencies and units of measure
- Generic concepts; big data, text data, global piracy, and so on
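A dictionary lookup gives the simplest mental model of entity determination. The entity list below is made up for illustration; the real product ships large curated dictionaries plus rules rather than a hand-written table.

```python
# Toy entity dictionary; purely illustrative.
ENTITY_TYPES = {
    "winston churchill": "PERSON",
    "u.k.": "COUNTRY",
    "big data": "CONCEPT",
}

def classify_entities(candidates):
    """Look up each candidate phrase and return (phrase, entity type) pairs."""
    return [(c, ENTITY_TYPES.get(c.lower(), "UNKNOWN")) for c in candidates]

print(classify_entities(["Winston Churchill", "U.K.", "big data"]))
# [('Winston Churchill', 'PERSON'), ('U.K.', 'COUNTRY'), ('big data', 'CONCEPT')]
```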
3. Fact extraction
Fact extraction is realized through rules that detect relationships between entities, such as sentiments.
Example: I love your product.
In this context, "love" is classified as a strong positive sentiment.
Another example is known as Voice of the Customer, with typical classifications:
- Sentiments: strong positive, weak positive, neutral, weak negative, strong negative, and problems
- Requests: general and contact info
- Emoticons: strong positive, weak positive, weak negative, strong negative
- Profanity: ambiguous and unambiguous
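A rule that scans a sentence for lexicon words gives a feel for how such sentiment facts are extracted. The lexicon below is a tiny stand-in of my own, loosely mirroring the Voice of the Customer categories; the real configuration is far richer.

```python
# Toy sentiment lexicon (illustrative, not the shipped configuration).
LEXICON = {
    "love": "StrongPositiveSentiment",
    "like": "WeakPositiveSentiment",
    "dislike": "WeakNegativeSentiment",
    "hate": "StrongNegativeSentiment",
    "broken": "Problem",
}

def extract_sentiments(sentence):
    """Return (token, category) pairs for every token found in the lexicon."""
    tokens = sentence.lower().strip(".!?").split()
    return [(t, LEXICON[t]) for t in tokens if t in LEXICON]

print(extract_sentiments("I love your product."))
# [('love', 'StrongPositiveSentiment')]
```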
With these steps, text analysis gives 'structure' to two sorts of elements from unstructured text: entities and facts. Counts of entities and facts can then be combined with structured information.
The second topic in the course is Text Mining.
Text Mining works at the document level: it makes semantic determinations about the overall content of documents relative to other documents.
This differs from text analysis, which performs linguistic analysis and extracts information embedded within documents.
With text mining you can:
- identify similar documents;
- identify key terms of a document;
- identify related terms;
- categorize new documents based on a training corpus.
Text Mining works by representing a document collection as a huge terms/documents matrix, in which each element represents the weight of a term in a document.
Based on the elements of this matrix, the Vector Space Model is used to calculate the similarity between documents.
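To make this concrete, the sketch below builds a tiny terms/documents matrix with TF-IDF weights (one common weighting choice; the course does not tie itself to a specific scheme) and compares documents by cosine similarity, the standard Vector Space Model measure. Documents and terms are invented for the example.

```python
import math
from collections import Counter

docs = {
    "d1": "global piracy threatens shipping",
    "d2": "piracy of shipping routes grows",
    "d3": "text data grows rapidly",
}

# Term frequency per document: one row of the terms/documents matrix each.
tf = {d: Counter(text.split()) for d, text in docs.items()}

# Inverse document frequency: terms occurring in fewer documents weigh more.
vocab = {t for counts in tf.values() for t in counts}
idf = {t: math.log(len(docs) / sum(t in c for c in tf.values())) for t in vocab}

def vector(d):
    """TF-IDF weight vector of document d (sparse, as a dict)."""
    return {t: n * idf[t] for t, n in tf[d].items()}

def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = lambda vec: math.sqrt(sum(w * w for w in vec.values()))
    return dot / (norm(u) * norm(v))

# d1 and d2 share 'piracy' and 'shipping'; d1 and d3 share no terms.
print(cosine(vector("d1"), vector("d2")))
print(cosine(vector("d1"), vector("d3")))
```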
To categorize documents, a "reference set" of previously classified documents is used. By comparing an input document to the documents in the reference set, the most likely categories are returned.
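This categorization step can be pictured as a nearest-neighbour lookup against the reference set. The documents, category labels, and similarity measure below (simple Jaccard token overlap, in place of a weighted matrix) are my own placeholders for illustration.

```python
# Toy reference set: documents with known categories.
reference = [
    ("printer jams when feeding paper", "hardware"),
    ("screen flickers after update", "hardware"),
    ("invoice total computed incorrectly", "billing"),
]

def jaccard(a, b):
    """Token-set overlap between two documents."""
    a, b = set(a.split()), set(b.split())
    return len(a & b) / len(a | b)

def categorize(doc, k=2):
    """Categories of the k reference documents most similar to doc."""
    ranked = sorted(reference, key=lambda rc: jaccard(doc, rc[0]), reverse=True)
    return [category for _, category in ranked[:k]]

print(categorize("paper jams in the printer"))  # 'hardware' ranks first
```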
Text Mining is used for example to:
- Highlight the key terms when viewing a patent document
- Identify similar incidents for faster problem solving
- Categorize new scientific papers along a hierarchy of topics
To summarize: Text Analytics and Text Mining are very interesting tools for dealing with unstructured data.