
Reading between the lines

Mining unstructured data

Level: Introductory

Stefanos Manganaris (stefanos@us.ibm.com), Senior Consultant, Knowledge Discovery Consulting group, IBM

01 Mar 2001

Data mining techniques are used to profile customers or to assess their propensity to purchase or their risk for attrition (among many other things) based on structured data, such as demographics, psychographics, and the customer's past transactions with the company. But what about knowledge discovery from unstructured data collections such as text documents, images, video, and audio? This article discusses methods of mining unstructured data including three fundamental text mining operations: clustering, categorization, and information retrieval.

Reprinted with permission from DB2 Magazine.

By some estimates, the Web now contains close to 2.1 billion pages, a number that roughly doubles every 12 months. Worldwide, more than 225 million people send and receive e-mail messages. AOL alone conveys 760 million messages a day. The Internet's phenomenal growth and the widespread use of computers to store, process, and communicate text have created the need for tools that help individuals and businesses find the information they need in the most effective and efficient manner possible.

Documents such as Web pages, news postings, and e-mail messages contain information as unstructured free-form text, that is, streams of characters that make up words and sentences conforming to the syntax and grammar rules of a specific natural language. In sharp contrast, traditional databases are rigidly structured collections of tables that contain records representing specific instances of entities, relationships between entities, and columns representing the various records' attributes. For example, a table of customers could have one record per customer and the following columns: customer ID, customer name, and customer address.

Data mining extracts previously unknown, comprehensible, and ultimately actionable information from structured databases. Using various analytical techniques that fall under the umbrella of data mining, you can perform knowledge discovery tasks, including:

  • Discovering patterns
  • Revealing hidden relationships
  • Detecting unusual behaviors
  • Organizing entities into similar groups
  • Inducing models that can explain the underlying rules that govern a process
  • Formulating models that can accurately predict a process.


Such capabilities are often critical components in solutions to business problems across industries, ranging from demand forecasting in supply chain management to targeted marketing in e-commerce, to fraud and abuse detection in processing medical claims. I've focused on several of these analytic techniques in previous issues (see, for example, the Data Miner column in the Quarter 3, 2000: Fall issue, DB2 magazine).

But what about knowledge discovery in unstructured data collections? Are there analytical techniques for extracting previously unknown actionable information from unstructured data collections, which might include text documents, images, video, and audio? What business problems could you solve using knowledge discovery on unstructured data?

Mining unstructured data

You can perform knowledge discovery on unstructured data with the help of tools such as IBM Intelligent Miner (IM) for Text and IBM ViaVoice for continuous speech audio. ViaVoice facilitates knowledge discovery with its voice recognition capabilities. Other technologies, such as handwriting recognition, content-based querying for images and time series (such as IBM's Query by Image Content), image classification, and video indexing, are in various stages of readiness for widespread use.

There are three fundamental text mining operations: clustering, categorization, and information retrieval.

Clustering. You can use clustering techniques to impose an organizational structure on a collection of text documents by grouping together documents that are related or similar based on their content. The clustering induces the number and type of thematic categories from the data; the organizational structure is data-driven, and you don't have to prespecify document categories. Many clustering techniques produce flat organizational structures in which document groups are disjoint and each document is assigned to a single thematic category. Other clustering techniques produce hierarchical structures in which groups may be decomposed recursively into subgroups corresponding to refined or lower-level thematic categories. (See Figure 1.) In either case, clustering turns unstructured document collections into thematically organized groups, each of which provides a summary view of its documents and facilitates effective and efficient navigation.
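To make the idea of a flat, data-driven structure concrete, here is a minimal sketch in Python. It is not the algorithm used by any IBM product; it assumes a crude bag-of-words representation, cosine similarity, and a simple "leader" scheme in which a document joins the first group whose first member is similar enough. The function names and the 0.3 threshold are illustrative assumptions.

```python
import math
import re
from collections import Counter

def bag_of_words(text):
    """Lowercased word counts; a crude stand-in for real linguistic analysis."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def leader_cluster(docs, threshold=0.3):
    """Put each document in the first group whose leader (first member) is
    similar enough; otherwise start a new group. Groups are disjoint and
    their number is induced from the data, not specified in advance."""
    clusters = []  # list of (leader_vector, [document indices])
    for i, text in enumerate(docs):
        vec = bag_of_words(text)
        for leader, members in clusters:
            if cosine(vec, leader) >= threshold:
                members.append(i)
                break
        else:
            clusters.append((vec, [i]))
    return [members for _, members in clusters]

docs = [
    "interest rates and bank credit lines",
    "bank raises interest rates on credit",
    "football team wins the cup final",
    "the cup final and the football fans",
]
print(leader_cluster(docs))  # [[0, 1], [2, 3]]
```

The result depends on document order and on the threshold, which is one reason production tools use more robust techniques.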

Categorization. Some document collections already have an organizational structure, either imposed by a document clustering tool or crafted manually by identifying thematic categories and assigning documents to them. In such situations, categorization techniques can uncover the principles governing the assignment of documents to categories. By analyzing the content of the documents and their assigned categories, categorization techniques produce classification models that detail the discriminating features of the various categories. Such models explain the key differences between categories and can automatically classify new documents and incorporate them into the collection and the existing structure.
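As a rough illustration of inducing a classification model from documents that already carry category labels, the sketch below trains a small multinomial naive Bayes classifier with add-one smoothing. Naive Bayes is swapped in here as a well-known technique; the article does not say which algorithm IM for Text uses. The category names and sample texts are made up.

```python
import math
import re
from collections import Counter, defaultdict

def tokens(text):
    return re.findall(r"[a-z]+", text.lower())

def train(labeled_docs):
    """labeled_docs: list of (text, category) pairs.
    Returns per-category word counts and per-category document counts."""
    word_counts = defaultdict(Counter)
    doc_counts = Counter()
    for text, category in labeled_docs:
        word_counts[category].update(tokens(text))
        doc_counts[category] += 1
    return word_counts, doc_counts

def classify(text, word_counts, doc_counts):
    """Return the category with the highest smoothed log-probability."""
    vocab = {w for counts in word_counts.values() for w in counts}
    total_docs = sum(doc_counts.values())
    best, best_score = None, -math.inf
    for category in sorted(word_counts):
        counts = word_counts[category]
        total = sum(counts.values())
        score = math.log(doc_counts[category] / total_docs)
        for w in tokens(text):
            # Add-one smoothing so unseen words do not zero out a category.
            score += math.log((counts[w] + 1) / (total + len(vocab)))
        if score > best_score:
            best, best_score = category, score
    return best

training = [
    ("quarterly invoice payment due", "billing"),
    ("payment received for invoice", "billing"),
    ("meeting agenda for project review", "projects"),
    ("project schedule and meeting notes", "projects"),
]
model = train(training)
print(classify("invoice for the payment", *model))  # billing
print(classify("project meeting", *model))          # projects
```

The per-category word counts double as a crude view of the discriminating features: words that are frequent in one category and rare in the others drive the decision.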

Information retrieval. In addition to organizing document collections and categorizing documents, users must be able to retrieve pertinent information from the collection. With structured data, users perform a database query (which is not usually considered a data mining operation because the results of a query are explicitly stored in the database, and no new information is produced). With unstructured data, querying is considerably more difficult because effective information retrieval requires analysis of document content; in other words, it requires text mining. Simply matching the text of a query to documents is not very useful. You might want to retrieve relevant documents even when they don't contain any of the text found in the query. In addition, the fact that some of the text in the query is present in a document does not always mean that the document is relevant. The key characteristics of a good information retrieval engine are query flexibility, effective retrieval, and computational efficiency.
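The following sketch shows one simple way a retrieval engine can return a relevant document that shares no words with the query: it expands the query with a thesaurus and ranks documents with a TF-IDF-style weight. The tiny SYNONYMS table, the function names, and the weighting formula are assumptions chosen for illustration, not a description of any particular search engine.

```python
import math
import re
from collections import Counter

# Hypothetical thesaurus; a real one would be far larger.
SYNONYMS = {"car": {"automobile"}, "automobile": {"car"}}

def tokens(text):
    return re.findall(r"[a-z]+", text.lower())

def rank(query, docs, synonyms=None):
    """Return indices of matching documents, best match first."""
    synonyms = synonyms or {}
    doc_counts = [Counter(tokens(d)) for d in docs]
    n = len(docs)
    terms = set(tokens(query))
    for t in list(terms):
        terms |= synonyms.get(t, set())  # query expansion
    scored = []
    for i, counts in enumerate(doc_counts):
        score = 0.0
        for t in terms:
            df = sum(1 for c in doc_counts if t in c)  # document frequency
            if df:
                # Rare terms weigh more than terms found in many documents.
                score += counts[t] * math.log(1 + n / df)
        scored.append((score, i))
    ordered = sorted(scored, key=lambda s: (-s[0], s[1]))
    return [i for score, i in ordered if score > 0]

docs = [
    "automobile repair manual",
    "used car prices",
    "cooking recipes for pasta",
]
print(rank("car", docs))            # [1]     plain term matching
print(rank("car", docs, SYNONYMS))  # [0, 1]  expansion also finds document 0
```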

A data miner will immediately recognize that clustering text is similar to clustering data points, while text categorization is similar to inducing classification models. Indeed, the basic ideas in the algorithms used to perform these operations are the same, but the devil is in the details. The free-form nature of text makes handling it difficult. You have to analyze the semantics of words and sentences in the context in which they appear to derive features before you can apply any knowledge discovery techniques. Features are significant items in text, such as names and technical terms. Which features are significant in a document may depend on the content of other documents in the collection and on the text-mining task at hand. Thus, text-mining algorithms must incorporate sophisticated text analysis tools.

IBM IM for Text offers a rich set of text analysis tools, including a feature extraction component that can automatically derive a vocabulary that captures key terms and concepts appropriate for the document collection analyzed. Elements of the vocabulary can be multiword terms; names of people, organizations, and places; abbreviations; and key numeric figures, such as currency amounts and dates. IM uses algorithms that are sophisticated enough to recognize "credit facility," "credit line," "Credit Lyonnais," and "Credit Suisse" as four separate concepts, while recognizing "Bill Clinton" and "President Clinton" as the same entity, distinct from "Clinton, N.J." A similar tool extracts significant sentences from a document to create a summarized version. Figure 2 and Figure 3 show examples of feature extraction.
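For a feel of what feature extraction produces, the sketch below pulls candidate multiword names and currency amounts out of a sentence with two regular expressions. This is a deliberately naive heuristic and an assumption of this example; it cannot tell "Bill Clinton" from "Clinton, N.J." or resolve two names to the same entity, which is exactly the kind of analysis the product's algorithms are described as performing.

```python
import re

# Runs of two or more capitalized words are treated as candidate names.
NAME = re.compile(r"\b[A-Z][a-z]+(?:\s+[A-Z][a-z]+)+\b")
# Dollar amounts, optionally followed by "million" or "billion".
MONEY = re.compile(r"\$\d[\d,]*(?:\.\d+)?(?:\s(?:million|billion))?")

def extract_features(text):
    """Return candidate names and currency amounts found in the text."""
    return {"names": NAME.findall(text), "amounts": MONEY.findall(text)}

text = "Credit Suisse extended a $250 million credit line, President Clinton said."
features = extract_features(text)
print(features["names"])    # ['Credit Suisse', 'President Clinton']
print(features["amounts"])  # ['$250 million']
```

Note that the lowercase "credit line" is not captured; recognizing it as a multiword term requires statistics over the whole collection rather than capitalization cues.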


Figure 1. Hierarchical document clustering

Figure 2. Sample feature extraction

Figure 3. Sample feature extraction

Another IM component can automatically identify the language of a document, an important capability in text mining. You can implement this capability using IM for Text categorization tools that also make the product trainable and extensible to other languages.
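Because language identification can be treated as a categorization problem, a toy version fits in a few lines: build a word profile per language from training samples, then pick the language whose profile covers the most words of a new text. The sample sentences and function names are assumptions for illustration; adding a language only requires adding training text, which is the sense in which such an approach is trainable and extensible.

```python
import re
from collections import Counter

def words(text):
    # Matches runs of letters, including accented ones.
    return re.findall(r"[^\W\d_]+", text.lower())

def train_profiles(samples):
    """samples: list of (language, text). Returns {language: word counts}."""
    profiles = {}
    for lang, text in samples:
        profiles.setdefault(lang, Counter()).update(words(text))
    return profiles

def identify(text, profiles):
    """Return the language whose profile contains the most words of the text."""
    def score(lang):
        return sum(1 for w in words(text) if w in profiles[lang])
    return max(sorted(profiles), key=score)

profiles = train_profiles([
    ("en", "the cat is on the table and the dog is in the garden"),
    ("fr", "le chat est sur la table et le chien est dans le jardin"),
    ("de", "die katze ist auf dem tisch und der hund ist im garten"),
])
print(identify("the dog is on the table", profiles))   # en
print(identify("le chien est sur la table", profiles)) # fr
print(identify("der hund ist im garten", profiles))    # de
```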

The feature extraction and language identification components in IM for Text are available for developing custom text-mining applications. IM for Text's sophisticated search engine also uses them for information retrieval. The search engine includes an indexing tool that performs in-depth linguistic analysis of the documents in the collection to prepare a data structure that facilitates fast and effective information retrieval. Although most of the indexing is performed offline, IM for Text is capable of updating the index on the fly, while it's processing queries. Client applications typically submit queries to the search engine's server. Using the index and other optional resources, such as dictionaries and thesaurus files, the server can efficiently search very large collections of documents written in any of 16 different languages, including double-byte character-set languages (such as Japanese, Chinese, and Korean), and stored in various file formats.
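The data structure behind fast retrieval is typically an inverted index, which maps each term to the documents that contain it. The sketch below is a minimal in-memory version, assuming simple lowercase word splitting in place of in-depth linguistic analysis; the class and method names are hypothetical. It also shows the index being updated between queries.

```python
import re
from collections import defaultdict

class InvertedIndex:
    """Maps each term to the set of document IDs that contain it."""

    def __init__(self):
        self.postings = defaultdict(set)

    def add(self, doc_id, text):
        """Index one document; can be called at any time, even between queries."""
        for term in re.findall(r"[a-z]+", text.lower()):
            self.postings[term].add(doc_id)

    def search(self, term):
        """Return sorted IDs of documents containing the term."""
        return sorted(self.postings.get(term.lower(), set()))

index = InvertedIndex()
# "Offline" indexing of the initial collection.
index.add(1, "Credit line approved by the bank")
index.add(2, "Bank holiday schedule")
first = index.search("bank")
print(first)   # [1, 2]

# A new document arrives while the index is in use.
index.add(3, "Central bank raises rates")
second = index.search("bank")
print(second)  # [1, 2, 3]
```

A lookup touches only the postings for the query term, so its cost does not grow with the total amount of text in the collection.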

Figure 4 shows a typical client/server configuration. The query language in this example allows the use of free text; Boolean expressions with conjunction, disjunction, and exclusion of search terms; and a hybrid combination of the two. The search engine supports several paradigms, including precise-term searching, probabilistic retrieval, phonetic searches, and fuzzy searching.
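Conjunction, disjunction, and exclusion of search terms can be sketched as three filters over each document's set of words. The parameter names all_of, any_of, and none_of are assumptions of this example and do not reflect the product's actual query syntax.

```python
import re

def boolean_search(docs, all_of=(), any_of=(), none_of=()):
    """docs: {doc_id: text}. Return sorted IDs of documents that contain
    every term in all_of, at least one term in any_of (if given),
    and no term in none_of."""
    results = []
    for doc_id, text in docs.items():
        terms = set(re.findall(r"[a-z]+", text.lower()))
        if not all(t in terms for t in all_of):
            continue  # conjunction failed
        if any_of and not any(t in terms for t in any_of):
            continue  # disjunction failed
        if any(t in terms for t in none_of):
            continue  # excluded term present
        results.append(doc_id)
    return sorted(results)

docs = {
    1: "credit line approved by the bank",
    2: "credit card fraud detected",
    3: "bank holiday schedule",
}
print(boolean_search(docs, all_of=["credit"], none_of=["fraud"]))        # [1]
print(boolean_search(docs, any_of=["credit", "bank"]))                   # [1, 2, 3]
print(boolean_search(docs, all_of=["bank"], any_of=["credit", "holiday"]))  # [1, 3]
```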

Document collections are often decentralized and constantly changing (the Internet is a prime example). Therefore, you need tools that automatically explore the collection to track the changes and identify all documents in it. IM for Text includes a Web crawler toolkit, which you can use to develop customized agents that monitor collections of Web pages on the Internet or an intranet.
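The change-tracking part of such an agent can be sketched without any networking: keep a fingerprint of each page from the previous visit and compare it with the current one. The example below assumes page content has already been fetched into a dictionary; the URLs are placeholders, and the function names are hypothetical rather than part of the crawler toolkit.

```python
import hashlib

def fingerprint(content):
    """Stable digest of a page's content."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()

def diff_snapshot(previous, pages):
    """previous: {url: fingerprint} from the last visit.
    pages: {url: content} fetched now.
    Returns the new snapshot and the added, removed, and changed URLs."""
    current = {url: fingerprint(body) for url, body in pages.items()}
    changes = {
        "added": sorted(u for u in current if u not in previous),
        "removed": sorted(u for u in previous if u not in current),
        "changed": sorted(
            u for u in current if u in previous and current[u] != previous[u]
        ),
    }
    return current, changes

# First visit: every page is new.
snapshot, changes = diff_snapshot({}, {
    "http://example.com/a": "first version",
    "http://example.com/b": "stable page",
})

# Second visit: page a was edited, page c appeared, page b is unchanged.
snapshot, changes = diff_snapshot(snapshot, {
    "http://example.com/a": "second version",
    "http://example.com/b": "stable page",
    "http://example.com/c": "brand new page",
})
print(changes["added"])    # ['http://example.com/c']
print(changes["changed"])  # ['http://example.com/a']
print(changes["removed"])  # []
```

Only the added and changed pages would need to be reindexed.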





E-mail and beyond

Let's now consider a practical example of text mining: coping with e-mail. If you are like most people, you receive more e-mail messages in a day than hard-copy correspondence. You read some, delete many, and file the rest. You probably have devised some rules for filing: you may file some messages by sender, others by topic. A text-mining application for e-mail could employ categorization tools to automatically determine which incoming messages should be kept for reading, immediately deleted, or filed unread. Messages to be filed could also be categorized automatically and assigned to the appropriate folder using a second categorization engine. The categorization algorithm can discover the rules that govern your particular message routing and filing scheme automatically, by analyzing your actions as you process your e-mail. Information retrieval tools can enhance the primitive string search capabilities of most e-mail reading tools. Linguistic analysis and feature extraction in the collection of accumulated e-mail would allow sophisticated text queries to identify pertinent e-mail messages based on their content. (For example, that kind of analysis would let you find all archived messages that are related in content to the one you are reading.) You can use clustering to further organize your e-mail notes thematically or to identify predominant themes within a category of filed notes.

Clearly, the potential of text mining goes beyond e-mail. Applications involving news, patents, and other intellectual capital, as well as knowledge management in general, can also benefit from such capabilities. Analysts estimate that 80 percent of the information an enterprise possesses is in unstructured form, vs. 20 percent structured. Faster, better access to this information can certainly have a positive impact on a business. Internally, text-mining solutions can help store, manage, retrieve, and deliver the intellectual capital of business communications. Externally, text mining can help track developments in an industry and analyze the competition by pulling and analyzing relevant news, reports, and patents. The IBM Business Intelligence organization has a worldwide consulting practice that specializes in text mining applications, with experience helping customers in various industries develop solutions that leverage their most important asset: information.


Figure 4. An information retrieval solution




Global uses for text mining

Beyond facilitating access to information, text-mining solutions can help improve operations in profound ways. In the context of customer relationship management, text mining can help glean useful information about customers. Customer communications encompass more than placing orders and processing returns. A lot of useful information about a customer and the business arrives in the form of praises, complaints, desires, and suggestions. Feature extraction can also be used to automatically assess customers' satisfaction with their relationship to the business. Data mining techniques are used to profile customers or to assess their propensity to purchase or their risk for attrition (among many other things) based on structured data, such as demographics, psychographics, and the customer's past transactions with the company. Powerful solutions combine data and text mining capabilities and expand these analyses to factor in customer attributes inferred from nontransactional communications.

Reprinted with permission from the Spring 2001 issue of DB2 Magazine. Copyright CMP Media.




About the author

Stefanos Manganaris is a senior consultant in the Knowledge Discovery Consulting group at IBM. He is responsible for data mining solutions in consulting engagements and research in knowledge discovery with applications to business problems across industries. You can e-mail him at stefanos@us.ibm.com.



