This article covers Text Data Processing using SAP Business Objects Data Services (BODS) for the purpose of text analytics. SAP BODS provides a single ETL platform for both structured and unstructured data, along with Data Quality, Data Profiling and Data Cleansing functionality. The Entity Extraction transform, available as part of the Text Data Processing capability of Data Services, extracts entities, entity relationships and facts from unstructured data for downstream analytics. The transform performs linguistic processing on the content, using semantic and syntactic knowledge of words to identify paragraphs, sentences, clauses, entities and facts in textual information.
This transform provides a user-friendly GUI with three tabs: Input, Options and Output. It accepts content in text, HTML or XML format. The language of the text content must be specified explicitly. Entity extraction is performed with the help of the built-in SYSTEM source, or a user-defined DICTIONARY or RULE, to filter specific entities into the output.
Let's look at the basic features of this transform using a small example. Later we will refine the entity extraction further using a Dictionary and a Rule.
Next, take a look at the default fields created for the file format. Four fields are generated automatically: FileName, LastModified, Data and IsText. The FileName field holds the absolute name of each file to be processed in the Data Files directory. The Data field holds the entire content of the file as a long value. Next, in the dataflow, place a Base_EntityExtraction transform after the unstructured file format and link the transform to the file format.
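To make the four default fields more concrete, here is a minimal Python sketch that mimics the kind of row an unstructured file format yields for each file in a directory. It is only an illustration and not how Data Services is implemented. The function name `read_unstructured_files` is hypothetical, and the meaning given to IsText here (the content decodes as UTF-8 text) is an assumption made for the example.

```python
import os
from datetime import datetime


def read_unstructured_files(directory):
    """Yield one row per file with the fields FileName, LastModified, Data, IsText.

    Illustrative only: this imitates the shape of the rows, not the
    actual behaviour of the Data Services unstructured file format.
    """
    for name in sorted(os.listdir(directory)):
        path = os.path.abspath(os.path.join(directory, name))
        if not os.path.isfile(path):
            continue  # skip sub-directories

        with open(path, "rb") as handle:
            raw = handle.read()

        # Assumption for this sketch: a file counts as text if it decodes as UTF-8.
        try:
            data = raw.decode("utf-8")
            is_text = True
        except UnicodeDecodeError:
            data = raw
            is_text = False

        yield {
            "FileName": path,  # absolute file name
            "LastModified": datetime.fromtimestamp(os.path.getmtime(path)),
            "Data": data,  # entire content of the file
            "IsText": is_text,
        }
```

Each yielded row corresponds to one source file, which is why the Data field has to be able to hold the whole document before it reaches the Entity Extraction transform.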
Typically we are interested in the fields TYPE, SOURCE, SOURCE_FORM and STANDARD_FORM. Sample output contents of these fields are shown below:
The input text data file and the generated output dataset are shown in the screenshots below:
We can also do a little analysis, for example to find the places of interest mentioned in the textual data:
SELECT STANDARD_FORM FROM STG_UNSTRUCT_DATA WHERE TYPE IN ('COUNTRY', 'CITY', 'PLACE_OTHER/LAND', 'COMMON_PLACE_OTHER/LAND')
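The query above can be tried outside Data Services with a small Python sketch that uses an in-memory SQLite table. The table and column names are taken from the query. The sample rows, the helper names `build_demo_table` and `places_of_interest`, and the added ORDER BY clause (used only to make the result order predictable) are assumptions for illustration and are not output from the transform.

```python
import sqlite3

# Hypothetical sample rows (TYPE, STANDARD_FORM); not real transform output.
SAMPLE_ROWS = [
    ("PERSON", "John Smith"),
    ("COUNTRY", "India"),
    ("CITY", "Kolkata"),
    ("ORGANIZATION/COMMERCIAL", "SAP"),
    ("PLACE_OTHER/LAND", "Himalayas"),
]

# Same filter as the query in the article, with ORDER BY added for a stable order.
PLACES_QUERY = """
    SELECT STANDARD_FORM
    FROM STG_UNSTRUCT_DATA
    WHERE TYPE IN ('COUNTRY', 'CITY', 'PLACE_OTHER/LAND', 'COMMON_PLACE_OTHER/LAND')
    ORDER BY STANDARD_FORM
"""


def build_demo_table():
    """Create an in-memory staging table holding the sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE STG_UNSTRUCT_DATA (TYPE TEXT, STANDARD_FORM TEXT)")
    conn.executemany("INSERT INTO STG_UNSTRUCT_DATA VALUES (?, ?)", SAMPLE_ROWS)
    conn.commit()
    return conn


def places_of_interest(conn):
    """Return the standard forms of all place-like entities."""
    return [row[0] for row in conn.execute(PLACES_QUERY)]


if __name__ == "__main__":
    connection = build_demo_table()
    for place in places_of_interest(connection):
        print(place)
    connection.close()
```

Filtering on TYPE and reading STANDARD_FORM, rather than SOURCE_FORM, means that different spellings of the same entity in the text are reported under one normalized name.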
Next we will learn how to create and use a Dictionary and a Rule to filter the entities of interest during extraction from the unstructured source data file.