We are moving soon! A faster, fresher and bigger DWBI Website is on your way. Follow us here.

Join the Community

Sponsor

Free BI Digest

Need more help on this subject?

Text Data Processing using SAP Data Services

This article deals with Text Data Processing using SAP Business Objects Data Services with the intension of Text Analytics. SAP BODS provides a single ETL platform for both Structured and Unstructured data as well as Data Quality, Data Profiling and Data Cleansing functionalities.Entity Extraction transform available as a part of Text Data Processing of Data Services, helps to extract entities, entity relationships and facts from unstructured data for downstream analytics. The transform performs linguistic processing on content by using semantic and syntactic knowledge of words, to identify paragraphs, sentences, clauses, entities and facts from textual information.

Base_EntityExtraction Transform

This transform provides a user friendly GUI inteface, having three tabs namely Input, Options and Output. The transform accepts textual format such as a text, HTML, or XML. We need to specify explicitly the language for processing the text content. Entity extraction is performed with the help of in-built SYSTEM source or user defined DICTIONARY or RULE to filter specific entities as output.

TEXT DATA PROCESSING

Lets see the basic feature of this transform using a small example. Later we will manipulate the entity extraction further using Dictionary and Rule.

  1. Create a New File Format with file type as Unstructured text.
    • Next define the Root directory to specify the file location.
    • Enter the File name(s) to process. Also we can use wild-character in this placeholder as *.*. This will process all the textual files present in the directory for entity extraction.
    • Other properties like Read subfolders and Skip empty files by default are set to Yes.
    • Aditionally at the dataflow level we have the option to provide value for the Number of files to read. By default this is set to {None}.

    Unstructured Text File Format

    Next take a look at the default fields created for the file format. Four fields have been generated automatically namely FileName, LastModified, Data and IsText. Here the FileName field will extract the absolute file names to be processed in the Data Files directory.The Data field will actually extract the entire content of the file as long format.

  2. Next in the dataflow, place a Base_EntityExtraction transform of Data Services, after the unstructured file format. Link the transform with the file format.

    Dataflow Text Data Processing

  3. Next in the Input tab of the transform, drag and drop the Data column of the input schema to the Input Schema Column Name so as to map with the Transform Input Field Name TEXT.

    Base_EntityExtraction Transform Input

  4. Check the Options tab and set the Language value accordingly, in my case English. Leave the rest of the options as it is. Later we will deal with Dictionary File and Rules.

    Base_EntityExtraction Transform Options

  5. On the Output tab select the fields of interest. Best practice deals with only the fields STANDARD_FORM and TYPE. By default the output schema of the transform will generate a maximum of 11 fields. We can use them if we want.

    Base_EntityExtraction Transform Output

    Typically our interest will be on the fields TYPE, SOURCE, SOURCE_FORM and STANDARD_FORM. Sample output contents of these fields are as below:

    • STANDARD_FORM: Intel Corporation, 200 USD, 2009-11-25
    • SOURCE_FORM: Intel Corp, $200, Nov. 25, 2009
    • TYPE: Action, Address, BuyEvent, Organization, Phone, Price, PRODUCT, CURRENCY, DATE, TIME
    • SOURCE: SYSTEM, DICTIONARY, RULE
  6. Next we just add a Template Table as target. Run the job and lets check the meaningful text data extracted by the transform.

Please find the input text data file and the output dataset generated as screenshots below:

Sample Unstructured Data File

Entity Extraction Output

Also a little bit of analysis, to know places of interest as in the textual data:

SELECT STANDARD_FORM 
FROM STG_UNSTRUCT_DATA
WHERE TYPE IN 
('COUNTRY', 'CITY', 'PLACE_OTHER/LAND', 'COMMON_PLACE_OTHER/LAND')

Text Analytics

Next we will learn how to create and use Dictonary and Rule to filter the entities of interest during extraction from the unstructured source data file.

About the Author

Saurav Mitra

Saurav Mitra is a business intelligence professional.
Connect with the author via Google+
 



If you have any doubt or question on the above article, please Ask your question here. We will surely help you out!

Before you leave, Kindly provide your comments / suggestions / feedback below. Thank You!