Last month I finished the openSAP course "Text Analytics with SAP HANA Platform". It was a nice opportunity to get a grip on the concepts involved, so in this blog I give a short overview of what I learned about Text Analytics and Text Mining.
Text Analytics can be used to:
- search text-related content, e.g. CRM-related documents
- extract meaningful, structured information from unstructured text
- combine unstructured data with structured data
As an example, when a survey asks for both a score and answers to open questions, it is useful to combine the two to see whether there is a correlation.
Also, for consumer-oriented companies, it can be used to give structure to information from social media.
SAP HANA executes the following steps to analyze text:
1. Structuring the data
In this step, all types of documents are taken apart and stored in fact tables in a structured way.
- File format filtering; converting any binary document format to text/HTML
- Language detection; identifying the language so the appropriate tokenization and stemming can be applied
- Tokenization; decomposing word sequences e.g. “card-based payment systems” becomes “card” “based” “payment” “systems”
- Stemming; normalizing tokens to linguistic base form e.g. houses -> house; ran -> run
- Identify part-of-speech; tagging word categories, e.g. quick: Adjective; houses: Noun-Plural
- Identify noun groups; identifying concepts e.g. text data; global piracy
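The structuring steps above can be sketched in a few lines of Python. This is my own toy illustration of tokenization and stemming, not the engine SAP HANA actually uses; the suffix rules are deliberate oversimplifications.

```python
import re

def tokenize(text):
    """Split a phrase into tokens, breaking on hyphens and whitespace."""
    return re.findall(r"[A-Za-z]+", text)

def stem(token):
    """Very naive suffix stripping; real stemmers also use dictionaries
    for irregular forms such as 'ran' -> 'run'."""
    for suffix in ("ing", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

tokens = tokenize("card-based payment systems")
print(tokens)                     # ['card', 'based', 'payment', 'systems']
print([stem(t) for t in tokens])  # ['card', 'based', 'payment', 'system']
```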
2. Entity determination
In this step, pre-defined entity types are used to classify the data, e.g. Winston Churchill: PERSON; U.K.: COUNTRY.
Possible entity types are:
- Who; people, job title, and national identification numbers
- What; companies, organizations, financial indexes, and products
- When; dates, days, holidays, months, years, times, and time periods
- Where; addresses, cities, states, countries, facilities, Internet addresses, and phone numbers
- How much; currencies and units of measure
- Generic concepts; big data, text data, global piracy, and so on
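A dictionary lookup gives the simplest mental model of entity determination. The entity list below is made up for illustration; the real product ships large curated dictionaries plus rules rather than a hand-written table.

```python
# Toy entity dictionary; purely illustrative.
ENTITY_TYPES = {
    "winston churchill": "PERSON",
    "u.k.": "COUNTRY",
    "big data": "CONCEPT",
}

def classify_entities(candidates):
    """Look up each candidate phrase and return (phrase, entity type) pairs."""
    return [(c, ENTITY_TYPES.get(c.lower(), "UNKNOWN")) for c in candidates]

print(classify_entities(["Winston Churchill", "U.K.", "big data"]))
# [('Winston Churchill', 'PERSON'), ('U.K.', 'COUNTRY'), ('big data', 'CONCEPT')]
```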
3. Fact extraction
Fact extraction is realized through rules that detect relationships between entities, such as sentiments.
Example: I love your product.
In this context, "love" is classified as a strong positive sentiment.
Another example is known as Voice of the Customer, with typical classifications:
- Sentiments: strong positive, weak positive, neutral, weak negative, strong negative, and problems
- Requests: general and contact info
- Emoticons: strong positive, weak positive, weak negative, strong negative
- Profanity: ambiguous and unambiguous
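A rule that scans a sentence for lexicon words gives a feel for how such sentiment facts are extracted. The lexicon below is a tiny stand-in of my own, loosely mirroring the Voice of the Customer categories; the real configuration is far richer.

```python
# Toy sentiment lexicon (illustrative, not the shipped configuration).
LEXICON = {
    "love": "StrongPositiveSentiment",
    "like": "WeakPositiveSentiment",
    "dislike": "WeakNegativeSentiment",
    "hate": "StrongNegativeSentiment",
    "broken": "Problem",
}

def extract_sentiments(sentence):
    """Return (token, category) pairs for every token found in the lexicon."""
    tokens = sentence.lower().strip(".!?").split()
    return [(t, LEXICON[t]) for t in tokens if t in LEXICON]

print(extract_sentiments("I love your product."))
# [('love', 'StrongPositiveSentiment')]
```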
With these steps, text analysis gives 'structure' to two sorts of elements from unstructured text: entities and facts. Counts of entities and facts can then be combined with structured information.
The second topic in the course is Text Mining.
Text Mining works at the document level: it makes semantic determinations about the overall content of documents relative to other documents.
This differs from text analysis, which performs linguistic analysis and extracts information embedded within documents.
With text mining you can:
- identify similar documents;
- identify key terms of a document;
- identify related terms;
- categorize new documents based on a training corpus.
Text Mining works by representing a document collection as a huge terms/documents matrix, in which each element represents the weight of a term in a document.
Based on the elements of this matrix, the Vector Space Model is used to calculate the similarity between documents.
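To make this concrete, the sketch below builds a tiny terms/documents matrix with TF-IDF weights (one common weighting choice; the course does not tie itself to a specific scheme) and compares documents by cosine similarity, the standard Vector Space Model measure. Documents and terms are invented for the example.

```python
import math
from collections import Counter

docs = {
    "d1": "global piracy threatens shipping",
    "d2": "piracy of shipping routes grows",
    "d3": "text data grows rapidly",
}

# Term frequency per document: one row of the terms/documents matrix each.
tf = {d: Counter(text.split()) for d, text in docs.items()}

# Inverse document frequency: terms occurring in fewer documents weigh more.
vocab = {t for counts in tf.values() for t in counts}
idf = {t: math.log(len(docs) / sum(t in c for c in tf.values())) for t in vocab}

def vector(d):
    """TF-IDF weight vector of document d (sparse, as a dict)."""
    return {t: n * idf[t] for t, n in tf[d].items()}

def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = lambda vec: math.sqrt(sum(w * w for w in vec.values()))
    return dot / (norm(u) * norm(v))

# d1 and d2 share 'piracy' and 'shipping'; d1 and d3 share no terms.
print(cosine(vector("d1"), vector("d2")))
print(cosine(vector("d1"), vector("d3")))
```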
To categorize documents, a "reference set" of previously classified documents is used. By comparing an input document to the documents in the reference set, the most likely categories are returned.
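This categorization step can be pictured as a nearest-neighbour lookup against the reference set. The documents, category labels, and similarity measure below (simple Jaccard token overlap, in place of a weighted matrix) are my own placeholders for illustration.

```python
# Toy reference set: documents with known categories.
reference = [
    ("printer jams when feeding paper", "hardware"),
    ("screen flickers after update", "hardware"),
    ("invoice total computed incorrectly", "billing"),
]

def jaccard(a, b):
    """Token-set overlap between two documents."""
    a, b = set(a.split()), set(b.split())
    return len(a & b) / len(a | b)

def categorize(doc, k=2):
    """Categories of the k reference documents most similar to doc."""
    ranked = sorted(reference, key=lambda rc: jaccard(doc, rc[0]), reverse=True)
    return [category for _, category in ranked[:k]]

print(categorize("paper jams in the printer"))  # 'hardware' ranks first
```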
Text Mining is used for example to:
- Highlight the key terms when viewing a patent document
- Identify similar incidents for faster problem solving
- Categorize new scientific papers along a hierarchy of topics
To summarize: Text Analytics and Text Mining are very interesting tools for dealing with unstructured data.