Level: Introductory
Rolf Baurle (dwinfo@us.ibm.com), Team member, IBM's Boeblingen Laboratory, Germany
Matthias Tschaffler, Developer, IBM
19 Dec 2002
Take a whirlwind tour through information mining by joining our authors as they cover the basics, from installation and defining taxonomies to querying the information.
©2002 International Business Machines Corporation. All rights reserved.
Note: IBM® DB2® Information Integrator for Content replaces Enterprise Information Portal in versions 8.1 and earlier (March 2003). See http://www.ibm.com/software/data/eip/ for more information.
Introduction
This article explains how and when to use Enterprise Information Portal (EIP) Information Mining (IM). Furthermore, we show how IM makes use of the IBM Content Manager V8 programming model.
Information Mining (IM) is a feature of IBM® Enterprise Information Portal Version 8; it is the follow-on product of IBM Intelligent Miner for Text. IM works with unstructured data, such as Web pages and text documents, in contrast to data mining, which is based on structured data such as relational data.
Applications that use EIP can create a federated data store, which acts as a common server. EIP federated classes enable federated searching, retrieval, and updating across several content servers.
EIP IM provides a way to automatically create metadata for documents that are stored in EIP backends. The following mining features are supported:
- Categorization
Assigns one or more categories to a document based on a trained user-defined taxonomy. A taxonomy is comprised of topic areas of interest in terms of a hierarchical structure of categories.
- Information extraction
Recognizes significant vocabulary items, like names, terms and expressions, in text documents.
- Summarization
Extracts the most important sentences of a document, and therefore can help the user decide whether to read the entire document or not.
- Language identification
Determines the language a document is written in.
- Clustering
Groups a collection of documents into clusters according to their similarity. They are labeled using meaningful terms from the documents that belong to the same cluster.
The generated metadata can be stored in the EIP administration database, enabling users to use the full search capability provided by IM.
A scenario
Let's look at the following scenario to see how IM is used in the "real world." A service company offers newspaper articles that can be requested by customers. It uses the Information Mining services to create a taxonomy with categories that cover all newspaper articles that appear on a daily basis. The incoming articles are scanned and assigned to a category, and a summary is created of each article. Depending on the subject interests specified by the customer, the relevant articles and summaries are directly routed to the customer, who uses the summary to decide on whether to read the entire article or not.
Under the covers
Using an example based on the EIP IM "First Steps," we describe what happens "under the covers" when setting up the IM environment. The text that describes the underlying technology appears in shaded areas.
Architectural overview
The IM data model is built by means of the IBM Content Manager (CM) programming model; therefore, all data access uses the IBM Content Manager V8 Java™ Connector provided by EIP. The model and user data are stored in the EIP administration database, which is a DB2® Universal Database™ database.
The CM V8 model is comprised of components that are sets of system-defined and user-defined attributes that you use to describe any type of data. Components come in two flavors: root and child components. You can use them to build hierarchies. A root component is the first level of a hierarchy, and child components are optional second or lower-level components in the hierarchy directly associated with the level above it. An item type is a template to create items consisting of a root and zero or more child components.
If you are not familiar with the CM modeling features, think of a database table by which a component is represented. The component attributes are the columns in that table, and the child components are associated with the higher-level components by foreign keys. An instance of a component is an item. It is represented as a row or entry in a component table.
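The table analogy above can be made concrete with a small sketch. It is an illustration only: it uses plain SQLite tables through Python's standard sqlite3 module, and the table and column names (article, article_author, item_id, and so on) are hypothetical. It does not show the CM schema or the CM API.

```python
import sqlite3

# Analogy only: plain SQL tables standing in for CM components.
# All table and column names are hypothetical, not the CM schema.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE article (            -- plays the role of a root component
        item_id INTEGER PRIMARY KEY,
        title   TEXT NOT NULL
    );
    CREATE TABLE article_author (     -- plays the role of a child component
        child_id INTEGER PRIMARY KEY,
        item_id  INTEGER NOT NULL REFERENCES article(item_id),
        name     TEXT NOT NULL
    );
""")

# One item = one row in the root table plus zero or more child rows.
conn.execute("INSERT INTO article VALUES (1, 'ThinkPad announcement')")
conn.executemany(
    "INSERT INTO article_author (item_id, name) VALUES (?, ?)",
    [(1, "First Author"), (1, "Second Author")],
)


def authors_of(item_id):
    """Return the child rows that belong to one root item."""
    rows = conn.execute(
        "SELECT name FROM article_author WHERE item_id = ? ORDER BY child_id",
        (item_id,),
    )
    return [name for (name,) in rows]
```

The foreign key from article_author to article corresponds to the association between a child component and the level above it.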
Other features in CM allow you to create associations between items. One of them is a link reference. It associates an item with one or more other items. For a detailed description of the CM modeling capabilities, refer to Reference 3, Modeling Your Data In Content Manager V.8.
IM creates its own set of CM attributes and components to hold the model entities. All of these entities have the special prefix "IKF" to avoid name clashes with other user-defined entities. Additionally, they are not visible in the CM Administration client to separate them from other user-defined attributes or item types.
For IM there are two different sets of entities: one part is known in advance and is created during the configuration step. The other part is generated dynamically, whenever a new catalog is created.
The shaded components in Figure 1 show the IM/EIP components that are described in this article.
Figure 1. IM Component overview
Step 1: Installing the product
It's easy to install IM. Just select the feature EIP Information Mining when you install EIP and all of the prerequisite components are installed. To verify that the installation was successful, you can run the installation verification test script run.bat located in <CMBROOT>\ikf\bin\tools.
The output of the installation verification test should be:
<CMBROOT>\ikf\bin\tools>run icmnlsdb icmadmin <password>
[...]
connected to icmnlsdb as icmadmin
catalog created
document filter finished
language identifier finished (en)
searching (waiting up to 6 minutes for index update)
result retrieved (PID1)
catalog deleted
cleanup finished
Step 2: Defining a taxonomy using the Information Structuring Tool
The following assumes a successful installation of EIP with the Information Mining feature. This implies that you have deployed the Information Structuring Tools (IST) and the sample mining search JSP Web app (client).
You must define the taxonomy before you can use the categorization feature of EIP Information Mining. For the other mining features no further manual steps are necessary.
The IST Web application is deployed on a WebSphere® Application Server (which is a prerequisite of EIP V8) and allows you to define a taxonomy. You can also deploy the IST (and the JSP sample) to other J2EE-compliant application servers (such as Apache's Jakarta Tomcat). After you have logged on to the IST you will see the screen shown in Figure 2.
Figure 2. Information Structuring Tool after first logon
The library is a conceptual view of the contents of the Information Mining database. The library contains a set of catalogs. A catalog is the anchor point of a taxonomy. It also contains other data, like the training and evaluation results, that is needed to work with a taxonomy.
Right-click on Library -> new catalog and enter a name to create the catalog, as shown in Figure 3.
Figure 3. Creating your first catalog
When a catalog is created, a root category with the same name as the catalog is also created. Every catalog has a root category, which cannot be deleted. The root category is the entry point for accessing the taxonomy.
Additionally, IM generates three new CM root components to store all data associated with this catalog:
- A category component is created that will hold all categories that constitute the taxonomy for the associated catalog.
- A training document component that stores all training documents used to train the taxonomy.
- The record component that stores all extracted metadata for imported documents.
This set of metadata is defined by the IM schema. It specifies the names and types of attributes that can be stored for a catalog. Currently the IST uses the following default schema when a new catalog is created:
- IKF_CONTENT (string)
- IKF_TITLE (string)
- IKF_AUTHOR (string)
- IKF_CATEGORIES (string)
- IKF_SUMMARY (string)
- IKF_LANGUAGE (string)
- IKF_FEATURES (string)
- IKF_COMMENTS (string)
- IKF_DATE (timestamp)
- IKF_IDNUMBER (integer)
These schema keys are internally mapped to CM attributes when the record item type is created. One special schema key is the textual content IKF_CONTENT of a document. IM automatically creates a DB2 Text Information Extender (TIE) index for the content to enable it for full text search.
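To illustrate what a schema of typed keys implies for an application, here is a minimal sketch that checks values against the default schema listed above. The mapping of "string", "timestamp", and "integer" to Python types and the helper name validate_values are assumptions made for this example; the IM API performs its own mapping to CM attributes.

```python
from datetime import datetime

# Default schema keys as listed above. Mapping the schema types to Python
# types is an assumption made for this sketch.
DEFAULT_SCHEMA = {
    "IKF_CONTENT": str,
    "IKF_TITLE": str,
    "IKF_AUTHOR": str,
    "IKF_CATEGORIES": str,
    "IKF_SUMMARY": str,
    "IKF_LANGUAGE": str,
    "IKF_FEATURES": str,
    "IKF_COMMENTS": str,
    "IKF_DATE": datetime,
    "IKF_IDNUMBER": int,
}


def validate_values(values, schema=DEFAULT_SCHEMA):
    """Return a list of problems; an empty list means the values fit the schema."""
    problems = []
    for key, value in values.items():
        expected = schema.get(key)
        if expected is None:
            problems.append(f"{key}: not defined in the schema")
        elif not isinstance(value, expected):
            problems.append(
                f"{key}: expected {expected.__name__}, got {type(value).__name__}"
            )
    return problems
```

A check of this kind catches a date passed as a string, or a key that the catalog's schema does not define, before the values are handed to the data store.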
Figure 4 shows the relationships between the CM components created by IM and their usage in IM.
Figure 4. Component architecture
After the catalog has been created you can add categories to the taxonomy by right clicking on the root category folder and selecting new category (See Figure 5). By iterating these steps you define your category hierarchy. This way you manually create your taxonomy category by category. If you already have a taxonomy stored as a directory tree on the file system, you can import this taxonomy tree.
Figure 5. Creating a new category for your taxonomy
Adding a new category to a catalog means that a new category item is created for the category item type associated with the catalog. To build the category hierarchy, each category is linked to its parent category by means of a CM link reference. Do not confuse the category hierarchy with the component hierarchy. The category hierarchy is built using link relations; there is only one single category component per catalog.
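The difference between a link-based category hierarchy and a component hierarchy can be sketched as follows. This is a hypothetical in-memory model, not the CM link API: the class CategoryStore, its methods, and the category "Leasing" are invented for illustration, and category names are assumed to be unique within a catalog to keep the sketch short.

```python
class CategoryStore:
    """All categories are items of ONE type; the tree lives only in the links."""

    def __init__(self, catalog_name):
        # The root category carries the catalog's name and has no parent.
        self.root = catalog_name
        self.parent_link = {catalog_name: None}

    def add_category(self, name, parent):
        if parent not in self.parent_link:
            raise KeyError(f"unknown parent category: {parent}")
        if name in self.parent_link:
            raise ValueError(f"category already exists: {name}")
        self.parent_link[name] = parent

    def path(self, name):
        """Follow the parent links up to the root and return 'Root/.../name'."""
        parts = []
        current = name
        while current is not None:
            parts.append(current)
            current = self.parent_link[current]
        return "/".join(reversed(parts))

    def children(self, name):
        """Return the categories whose parent link points at 'name'."""
        return sorted(c for c, p in self.parent_link.items() if p == name)


store = CategoryStore("Sample")
store.add_category("Global Financing", "Sample")
store.add_category("Leasing", "Global Financing")  # hypothetical category
```

Every category sits in the same flat collection; only the parent links give it a position in the taxonomy, which is why a path such as "Sample/Global Financing" is derived by following links rather than by nesting components.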
Step 3: Loading the training documents
The purpose of "training" is to create a set of categorization rules, which IM can use later to categorize new documents with respect to your taxonomy.
You must select the appropriate training documents first. The quality of the categorization is strongly dependent on the quality of the training documents assigned to each category. Refer to Reference 2 to see the criteria for selecting training documents. Forty (40) training documents per category is recommended.
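Before uploading, it can help to check how many training documents you have collected per category. The following sketch assumes you keep the candidate documents in a simple mapping from category name to a list of file names; the function name undertrained and that data layout are assumptions for this example, not part of the IST.

```python
RECOMMENDED_MINIMUM = 40  # training documents per category, as recommended above


def undertrained(categories, minimum=RECOMMENDED_MINIMUM):
    """Map each category with too few training documents to its shortfall."""
    return {
        name: minimum - len(documents)
        for name, documents in categories.items()
        if len(documents) < minimum
    }
```

A category that appears in the result needs more training documents before the recommended count is reached. Note that the count says nothing about the quality of the documents, which matters just as much.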
Figure 6. Uploading training documents
Figure 6 shows how you use the IST to upload training documents.
- First select the respective category, and the Training Documents List window is displayed.
- Click the Add Training Document... button, and a new window opens.
- Click the Browse... button and select the relevant files.
- Click Submit to import the training documents.
Training documents are added to the training document item type associated with the catalog. A CM link reference between the category and the training document item is used to establish a relation between these items.
Step 4: Training a taxonomy
With the IST, you can optionally evaluate the training. Evaluating the taxonomy helps you assess how good your training documents are with respect to your predefined taxonomy. You should run the evaluation before you run the training. See Reference 2 for more details on evaluation.
After your taxonomy is completely created and the training documents are loaded, you can start the training of your categorization model. To do this, click on the root category of your taxonomy and then click on the Start Training button of the Training tab, as shown in Figure 7.
Figure 7. Start the training of your taxonomy
The training result is stored as a child of the catalog item. It is retrieved from the data store whenever a new document has to be categorized. Note that there is one and only one training result associated with the catalog.
Whenever you remove or rename categories in your taxonomy, the training result is invalidated, and you must rerun the training. If you add categories, or add or remove training documents, the training result is not invalidated: until you rerun the training, the old training result continues to be used.
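The rules for when a training result remains usable can be summarized as a small state table. The sketch below is our reading of those rules, not IM code; the state names (in particular "stale" for a result that is still used but predates later changes) and the change names are labels invented for this example.

```python
from enum import Enum


class TrainingState(Enum):
    UNTRAINED = "untrained"
    CURRENT = "current"
    STALE = "stale"      # still used, but predates later changes
    INVALID = "invalid"  # training must be rerun


# Change names are hypothetical labels for the operations described above.
INVALIDATING = {"remove_category", "rename_category"}
NON_INVALIDATING = {
    "add_category",
    "add_training_document",
    "remove_training_document",
}


def next_state(state, change):
    """Return the state of the training result after one change."""
    if change == "train":
        return TrainingState.CURRENT
    if change not in INVALIDATING | NON_INVALIDATING:
        raise ValueError(f"unknown change: {change}")
    if state is TrainingState.UNTRAINED:
        return state  # nothing to invalidate yet
    if change in INVALIDATING:
        return TrainingState.INVALID
    # Non-invalidating change: the old result is kept and still used.
    return TrainingState.STALE if state is TrainingState.CURRENT else state


def usable(state):
    """A result can categorize documents unless it is missing or invalidated."""
    return state in (TrainingState.CURRENT, TrainingState.STALE)
```

The "stale" state is the one to watch for: categorization keeps working, but categories added since the last training run are not reflected in the result.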
Step 5: Importing documents from an EIP backend
To access distributed data sources, you can use the federated capabilities of EIP. To fill the Information Mining catalog with metadata, the application programmer writes a mining application that retrieves the data from the data source and stores the textual content, along with meta information such as the summary and document language, in the mining catalog. The content is stored so that an index can be built for use in searching. IM supports not only plain text files but a multitude of other formats (Reference 2). The content extraction is done by using the Outside In™ document filters from Stellent Corp. By accessing data from different backends (for example, from different medical data repositories), the Information Mining catalog becomes a centralized location of data for your area of interest. In other words, it's like a data warehouse for text mining.
To help you get started quickly, EIP IM comes with a sample application ("accessAndMine"), which imports sample documents from a DB2 backend (the database IBMPRESS), applies Information Mining operations to the retrieved textual data and stores the results in the mining catalog.
The accessAndMine application shows you how to use the IM Java Beans API to store summaries, language and category information of the text document in the IM catalog. It could be easily extended to store extracted features as well.
Note that the IM product comes with samples for all supported mining features. See Reference 2.
Records are stored in the record item type associated with the catalog item. There is one record item created for each document. The Service API described in the IM Java API provides methods to manage records. When setting the schema key value, e.g., for the summary key IKF_SUMMARY, you can either pass the values calculated by a preceding mining step, or you are free to set user-defined values, such as a human-edited version of the automatic summary. Note that when creating a new record item it is not mandatory to set all schema keys values that are defined by the catalog.
Records are assigned by CM link references to one or more categories. You also are free to choose the categories from the categorization result or any other categories, or have it depend on the outcome of other mining steps like information extraction.
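The following sketch models the two points made above: a record may set only some of the schema keys, and it is linked to one or more categories. RecordStore, its methods, and the record ID "doc-1" are hypothetical stand-ins written for this article; they are not the Service API.

```python
class RecordStore:
    """Hypothetical stand-in for the record item type of one catalog."""

    def __init__(self, schema_keys):
        self.schema_keys = set(schema_keys)
        self.records = {}         # record id -> schema key values
        self.category_links = {}  # record id -> set of category paths

    def store(self, record_id, values, categories):
        unknown = set(values) - self.schema_keys
        if unknown:
            raise KeyError(f"keys not in schema: {sorted(unknown)}")
        if not categories:
            raise ValueError("a record is linked to one or more categories")
        # Not every schema key has to be set; missing keys are simply absent.
        self.records[record_id] = dict(values)
        self.category_links[record_id] = set(categories)

    def in_category(self, category):
        """Return the IDs of all records linked to the given category."""
        return sorted(
            record_id
            for record_id, linked in self.category_links.items()
            if category in linked
        )


records = RecordStore(["IKF_CONTENT", "IKF_SUMMARY", "IKF_LANGUAGE"])
records.store(
    "doc-1",
    # The summary could come from a mining step or be edited by a person.
    {"IKF_SUMMARY": "Human-edited summary.", "IKF_LANGUAGE": "en"},
    categories=["Sample/Global Financing", "Sample"],
)
```

Here the record sets two of the three keys and is linked to two categories. Whether the categories come from the categorization result or from some other decision is up to the application.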
If the content attribute IKF_CONTENT is set when storing or updating a record item, the Text Information Extender (TIE) index is updated automatically according to the update frequency specified in the Infomining.properties file located in the <CMBROOT>\ikf\lib directory. You can change the default update frequency by modifying the entries in the [Search_Index] section. The names should be self-explanatory. Note that you have to modify the [Search_Index] values before you create a new catalog; otherwise the update frequency of that catalog's index is not affected. These values are inserted into the virtual TIE configuration file whenever a new TIE index is created. For an existing index, you can therefore change the update frequency manually by modifying the TIE configuration file directly.
Step 6: Using the sample JSP to retrieve mined documents
Let's summarize what we did up to this point:
- We created a taxonomy, trained it and evaluated the training results.
- We loaded the mining catalog with data from one or more EIP backends and enriched the content with mining metadata.
Now that we have all the data "in place," we want to run "smart" queries against our mining metadata store. For this purpose, EIP Information Mining comes with a sample JSP, which you can access from your Internet browser.
You log on by specifying your user ID and password as well as the mining catalog you want to search in. The sample client combines full text search with category search, thus offering an advanced search function. The sample application can also easily be extended to search on other metadata (for example, the summary).
By entering the search term "ThinkPad" and "All Categories Sample" you will see the results shown in Figure 8, if you used the "Sample" catalog that comes with the EIP Infomining First Steps.
Figure 8. Search result using the sample JSP search client
Now we want to refine our search to the category "Global Financing." As you can see in Figure 9, the refined query returns fewer results (exactly two).
Figure 9. Result of refining the search
Information Mining provides a text-based query API that can be used to execute queries against the data store to search in imported documents and their respective metadata. The IM query language allows you to look for any schema key values combined with categorization information (See Reference 2).
In the sample above, when you look for "all documents in which content contains the word 'ThinkPad'," the query string passed to the IM query API looks like:
(IkfContent contains "'ThinkPad'")
This query is internally mapped to a CM query, which is implemented in the XPath language (Reference 5):
/<IKFRxxxx>[contains-text(@<IKFAxxxx>,"'ThinkPad'")=1]
- IKFRxxxx is the name of the dynamically created record item type. (xxxx stands for dynamically created unique identifiers which are needed to enforce EIP/CM unique name constraint.)
- IKFAxxxx is the name of the CM attribute that maps to the IKF_CONTENT schema key.
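To make the shape of the mapped query easier to see, the sketch below assembles the CM query string for the one case shown above, a full-text condition on the content attribute. It is not the mapping code used by IM. The function name is invented, the names IKFR0001 and IKFA0001 in the usage note are hypothetical placeholders for the dynamically generated ones, and quoting rules for terms that themselves contain quotes are not covered, so such terms are rejected.

```python
def content_query_to_cm(record_item_type, content_attribute, term):
    """Build the CM query for 'content contains term' (illustration only)."""
    if "'" in term or '"' in term:
        # Escaping rules are not covered by this sketch.
        raise ValueError("terms containing quotes are not supported here")
    return '/{}[contains-text(@{},"\'{}\'")=1]'.format(
        record_item_type, content_attribute, term
    )
```

With the hypothetical names, content_query_to_cm("IKFR0001", "IKFA0001", "ThinkPad") yields a string of the same form as the query above, with the placeholders filled in.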
This query is finally submitted against the data store.
Note that text search is provided via special built-in predicates based on the SQL/MM full-text standard [Reference 6] (see "contains-text" in the query above).
As shown in Figure 9, you can refine the search result by specifying a category to search in. The sample query would look for "all documents in which content contains the word 'ThinkPad' AND are in Category 'Sample/Global Financing'."
The associated query string looks like:
(IkfContent contains "'ThinkPad'") AND (IkfCategory = "Sample/Global Financing")
This way you can build more complex queries that may return more relevant result sets.
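A search client typically assembles such query strings from user input. The sketch below builds the two query strings shown above from a search term and an optional category. The helper name build_im_query is an assumption for this example; only the two query forms themselves come from the samples in this article, and quoting of special characters is not handled.

```python
def build_im_query(term, category=None):
    """Build an IM query string for a content search, optionally by category."""
    query = '(IkfContent contains "\'{}\'")'.format(term)
    if category is not None:
        query += ' AND (IkfCategory = "{}")'.format(category)
    return query
```

Calling build_im_query("ThinkPad") gives the first sample query, and build_im_query("ThinkPad", "Sample/Global Financing") gives the refined one.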
Information Mining also supports projection, which allows you to retrieve a subset of the schema key values of the defined attributes for a record item type. You may know this feature from a standard SQL SELECT statement, where you can restrict the output columns. Using projection, you can improve query performance by selecting only the schema keys that are actually displayed. For example, you might want to avoid selecting attributes that are defined by LOB data types unless absolutely necessary.
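The idea of projection can be shown with plain dictionaries. In the sketch below, each record is a mapping from schema key to value, and the helper project (a name chosen for this example, not an IM API call) keeps only the keys a result list would display. In IM itself the projection is applied by the query, so that the unwanted values are never retrieved; the sketch only illustrates the effect on the result.

```python
def project(records, keys):
    """Keep only the requested schema keys of each record."""
    return [
        {key: record[key] for key in keys if key in record}
        for record in records
    ]


hits = [
    {
        "IKF_TITLE": "ThinkPad announcement",
        "IKF_SUMMARY": "Short summary.",
        "IKF_CONTENT": "Full text of the document ...",  # potentially large
    },
]
# A result list only needs title and summary, so the content is left out.
display_rows = project(hits, ["IKF_TITLE", "IKF_SUMMARY"])
```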
Removing an Information Mining catalog
If you want to discard your taxonomy, use the IST to remove it. Right-click on the catalog and select "delete" as shown in Figure 10.
After the catalog is removed, all the associated information (training results, evaluation results, and the records loaded into the catalog) is removed.
Figure 10. Removing a catalog
When a catalog is deleted, all associated data, such as training results, extracted metadata, and the item types created for the catalog, are deleted as well. Depending on the size of the catalog, this might be a long-running operation. Therefore this operation is divided into two separate tasks: deletion of a catalog and a subsequent cleanup. The delete operation only removes the catalog and its dependent child items, for example, the training result. All other data still exists in the data store but is now invisible to the outside world. A succeeding cleanup step removes all remaining data (all items and the three dynamically created item types). It is a non-atomic operation and can be run for several deleted catalogs at the same time.
The IST starts the cleanup operation automatically in a separate thread for you. Hence the delete catalog invocation will return immediately and the new thread will clean up the database for you.
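The two-phase pattern described above (a fast delete that hides the catalog, followed by a background cleanup) can be sketched as follows. MiningStore and its fields are hypothetical and exist only to show the control flow; the real delete and cleanup operations work on CM items and item types.

```python
import threading


class MiningStore:
    """Hypothetical store illustrating delete followed by background cleanup."""

    def __init__(self):
        self.catalogs = {}         # visible catalogs: name -> record ids
        self.pending_cleanup = {}  # deleted catalogs whose data still exists
        self._lock = threading.Lock()

    def delete_catalog(self, name):
        """Phase 1: hide the catalog, then start the cleanup in a new thread."""
        with self._lock:
            self.pending_cleanup[name] = self.catalogs.pop(name)
        worker = threading.Thread(target=self.cleanup)
        worker.start()
        return worker  # the caller does not have to wait for it

    def cleanup(self):
        """Phase 2: remove the leftover data of all deleted catalogs."""
        with self._lock:
            names = list(self.pending_cleanup)
        for name in names:
            # Each catalog is removed on its own, so the step is not atomic.
            with self._lock:
                self.pending_cleanup.pop(name, None)
```

The caller sees the catalog disappear as soon as delete_catalog returns, while the leftover data is removed later. One cleanup run handles every catalog that is waiting at that moment.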
Storing metadata outside of the IM catalog
EIP IM provides you with a flexible set of Java interfaces. You are not limited to using IM the way it is described in this article; you can also store the mining metadata outside of the Information Mining catalog. You might consider this when you want to keep the mining metadata "close" to the original data on the backend. By using the Service API, you can write your own application that enriches your backend documents with Information Mining metadata to extend your existing document search solution. The disadvantage is that you lose the "advanced search" capability (like the category search) provided by IM.
It is very important to understand that you can use the IM categorization feature only in conjunction with the IM catalog. The reason for this is that the categorization model is stored inside the IM catalog. However, you can store the categorization result (that is, the categories matching for a document) outside of the IM catalog by using IM as a "mining engine."
Conclusion
This article gave you a brief overview of how to use EIP IM to create catalogs by using the IST. For each phase in the "life cycle," we described the steps performed using the IST GUI, followed by the underlying implementation concept based on EIP/CM. This should not only give you a better understanding of Information Mining but also demonstrate the powerful modeling capabilities of EIP/CM V8.
Resources
About the authors
Rolf Baurle is a member of the EIP Information Mining team located at IBM's Boeblingen Laboratory, Germany.
Matthias Tschaffler joined IBM in 2000. He worked on DB2 Text Information Extender and the IBM Warehouse Manager Connector for the Web deliverable. He currently works as a developer on the EIP Information Mining product.