Level: Introductory
Stefanos Manganaris (stefanos@us.ibm.com), Senior Consultant, Knowledge Discovery Consulting group, IBM
01 Mar 2001
Data mining techniques are used to profile customers or to assess their propensity to purchase or their risk for attrition (among many other things) based on structured data, such as demographics, psychographics, and the customer's past transactions with the company. But what about knowledge discovery from unstructured data collections such as text documents, images, video, and audio? This article discusses methods of mining unstructured data, including three fundamental text mining operations: clustering, categorization, and information retrieval.
Reprinted with permission from DB2 Magazine.
By some estimates, the Web now contains close to 2.1
billion pages, a number that roughly doubles every 12 months.
Worldwide, more than 225 million people send and receive e-mail
messages. AOL alone conveys 760 million messages a day. The
Internet's phenomenal growth and the widespread use of computers to
store, process, and communicate text have created the need for
tools that help individuals and businesses find the information
they need in the most effective and efficient manner possible.
Such documents as Web pages, news postings, and
e-mail messages contain information in unstructured free-form text,
or streams of characters that make up words and sentences that
conform to the syntax and grammar rules of a specific natural
language. In sharp contrast, traditional databases are rigidly
structured collections of tables, with records representing
specific instances of entities and of relationships between entities, and
columns representing the records' attributes. For example,
a table of customers could have one record per customer and the
following columns: customer ID, customer name, and customer
address.
Data mining extracts previously unknown,
comprehensible, and ultimately actionable information from
structured databases. Using various analytical techniques that fall
under the umbrella of data mining, you can perform knowledge
discovery tasks, including:
- Discovering patterns
- Revealing hidden relationships
- Detecting unusual behaviors
- Organizing entities into similar groups
- Inducing models that can explain the underlying rules that
govern a process
- Formulating models that can accurately predict a process.
Such capabilities are often critical components in
solutions to business problems across industries, ranging from
demand forecasting in supply chain management to targeted marketing
in e-commerce, to fraud and abuse detection in processing medical
claims. I've focused on several of these analytic techniques in
previous issues (see, for example, the Data
Miner column in the Quarter 3, 2000: Fall issue of DB2 Magazine).
But what about knowledge discovery in unstructured
data collections? Are there analytical techniques for extracting
previously unknown actionable information from unstructured data
collections, which might include text documents, images, video, and
audio? What business problems could you solve using knowledge
discovery on unstructured data?
Mining unstructured data
You can perform knowledge discovery on unstructured
data with the help of tools such as IBM Intelligent Miner (IM) for
Text and IBM ViaVoice for continuous speech audio. ViaVoice
facilitates knowledge discovery with its voice recognition
capabilities. Other technologies, such as handwriting recognition,
content-based querying for images and time series (such as IBM's
Query by Image Content), image classification, and video indexing,
are in various stages of readiness for widespread use.
There are three fundamental text mining operations:
clustering, categorization, and information retrieval.
Clustering. You can use clustering techniques
to impose an organizational structure on a collection of text
documents by grouping together documents that are related or similar
based on their content. The clustering induces the number and type
of thematic categories from the data; the organizational structure
is data-driven, and you don't have to prespecify document
categories. Many clustering techniques produce flat organizational
structures in which document groups are disjoint and each document
is assigned to a single thematic category. Other clustering techniques produce
hierarchical structures in which groups may be decomposed
recursively into subgroups corresponding to refined or lower-level
thematic categories. (See Figure 1.) In either case, clustering turns unstructured
document collections into thematically organized groups, each of which
provides a summary view of its documents and
facilitates effective and efficient navigation.
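As a rough illustration of flat, data-driven clustering, the following Python sketch groups documents by the cosine similarity of their word counts. It is not how IM for Text works internally; the single-pass "leader" strategy, the stop-word list, the similarity threshold, and all function names are assumptions chosen to keep the example short. Note that the number of groups is not given in advance: it falls out of the data and the threshold.

```python
import math
import re
from collections import Counter

# Illustrative stop-word list (an assumption, far from complete).
STOP_WORDS = {"the", "as", "to", "a", "of", "and"}

def bag_of_words(text):
    """Lowercase the text and count its word tokens, ignoring stop words."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(w for w in words if w not in STOP_WORDS)

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(count * b[word] for word, count in a.items() if word in b)
    norm = (math.sqrt(sum(c * c for c in a.values()))
            * math.sqrt(sum(c * c for c in b.values())))
    return dot / norm if norm else 0.0

def leader_cluster(docs, threshold=0.3):
    """Put each document in the first cluster whose first member (its
    'leader') is similar enough; otherwise start a new cluster.
    Returns disjoint clusters as lists of document indexes."""
    leaders, clusters = [], []
    for i, doc in enumerate(docs):
        vec = bag_of_words(doc)
        for leader, members in zip(leaders, clusters):
            if cosine(vec, leader) >= threshold:
                members.append(i)
                break
        else:  # no existing cluster was similar enough
            leaders.append(vec)
            clusters.append([i])
    return clusters

docs = [
    "interest rates rise as the bank tightens credit",
    "the bank cuts interest rates to ease credit",
    "the team wins the football match",
    "football fans cheer as the team scores",
]
clusters = leader_cluster(docs)
print(clusters)
```

A hierarchical variant would apply the same idea recursively within each group, typically with a stricter threshold at each level.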
Categorization. Some document collections
already have an organizational structure, either imposed by a
document clustering tool or crafted manually by identifying
thematic categories and assigning documents to them. In such
situations, categorization techniques can uncover the principles
governing the assignment of documents to categories. By analyzing
the content of the documents and their assigned categories,
categorization techniques produce classification models that detail
the discriminating features of the various categories. Such models
explain the key differences between categories and can
automatically classify new documents and incorporate them into the
collection and the existing structure.
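To make the idea of inducing a classification model from already-categorized documents concrete, here is a minimal Python sketch of a multinomial naive Bayes categorizer with add-one smoothing. Naive Bayes is only one common choice of technique and is an assumption here, not a description of the algorithm in IM for Text; the training messages, category names, and function names are likewise made up for illustration.

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    """Split text into lowercase word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def train(labeled_docs):
    """Build a model from (text, category) pairs: per-category word
    counts, per-category document counts, and the overall vocabulary."""
    word_counts = defaultdict(Counter)
    doc_counts = Counter()
    for text, category in labeled_docs:
        word_counts[category].update(tokenize(text))
        doc_counts[category] += 1
    vocabulary = {w for counts in word_counts.values() for w in counts}
    return word_counts, doc_counts, vocabulary

def classify(text, model):
    """Return the category with the highest log-probability for the text."""
    word_counts, doc_counts, vocabulary = model
    total_docs = sum(doc_counts.values())
    best_category, best_score = None, -math.inf
    for category in doc_counts:
        total_words = sum(word_counts[category].values())
        score = math.log(doc_counts[category] / total_docs)
        for word in tokenize(text):
            if word in vocabulary:  # words never seen in training are skipped
                # Add-one smoothing avoids zero probabilities.
                score += math.log((word_counts[category][word] + 1)
                                  / (total_words + len(vocabulary)))
        if score > best_score:
            best_category, best_score = category, score
    return best_category

model = train([
    ("invoice payment overdue account", "billing"),
    ("refund charged twice on my invoice", "billing"),
    ("cannot log in password reset fails", "support"),
    ("application crashes after password update", "support"),
])
print(classify("please reset my password", model))
```

The per-category word counts play the role of the "discriminating features" mentioned above: words that are frequent in one category and rare in the others pull a new document toward that category.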
Information retrieval. In addition to
organizing document collections and categorizing documents, users
must be able to retrieve pertinent information from the collection.
With structured data, users perform a database query (which is not
usually considered a data mining operation, because the results of a
query are explicitly stored in the database and no new information
is produced). With unstructured data, querying is considerably more
difficult because effective information retrieval requires analysis
of document content - in other words, text mining. Simply matching
the text of a query to documents is not very useful. You might want
to retrieve relevant documents even when they don't contain any of
the text found in the query. In addition, the fact that some of the
text in the query is present in a document does not always mean
that the document is relevant. The key characteristics of a good
information retrieval engine are query flexibility, effective
retrieval, and computational efficiency.
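The Python sketch below illustrates two of these points under simplifying assumptions: documents are ranked by a weighted score rather than by a plain yes/no text match, and a small thesaurus lets a query retrieve a document that shares no words with it. The TF-IDF weighting, the toy thesaurus, and the function names are illustrative assumptions, not the approach used by any particular product.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Split text into lowercase word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def build_index(docs):
    """Return per-document term counts and an inverse document
    frequency (IDF) weight for every term in the collection."""
    term_counts = [Counter(tokenize(d)) for d in docs]
    doc_freq = Counter(term for counts in term_counts for term in counts)
    idf = {t: math.log(len(docs) / df) + 1.0 for t, df in doc_freq.items()}
    return term_counts, idf

def search(query, index, synonyms=None):
    """Rank documents by the summed TF-IDF weight of the query terms,
    after expanding the query with optional thesaurus entries."""
    term_counts, idf = index
    terms = set(tokenize(query))
    for term in list(terms):
        terms.update((synonyms or {}).get(term, []))
    scored = []
    for doc_id, counts in enumerate(term_counts):
        score = sum(counts[t] * idf[t] for t in terms if t in counts)
        if score > 0:
            scored.append((score, doc_id))
    # Highest score first; ties broken by document order.
    return [doc_id for score, doc_id in sorted(scored, key=lambda s: (-s[0], s[1]))]

docs = [
    "the car dealer lowered prices on every car",
    "automobile sales rose sharply this quarter",
    "the weather was mild this quarter",
]
index = build_index(docs)
thesaurus = {"car": ["automobile"]}  # hypothetical thesaurus entry

print(search("car", index))             # plain term matching
print(search("car", index, thesaurus))  # with query expansion
```

Without the thesaurus, the second document is missed even though it is clearly relevant to a query about cars; with it, the document is retrieved but ranked below the one that mentions the query term directly.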
A data miner will immediately recognize that
clustering text is similar to clustering data points, while text
categorization is similar to inducing classification models.
Indeed, the basic ideas in the algorithms used to perform these
operations are the same, but the devil is in the details. The
free-form nature of text makes handling it difficult. You have to
analyze the semantics of words and sentences in the context in which
they appear in order to derive features before you can apply any
knowledge discovery techniques. Features are significant items in
text, such as names and technical terms. Which features are
significant in a document may depend on the content of other
documents in the collection and on the text-mining task at hand.
Thus, text-mining algorithms must incorporate sophisticated text
analysis tools.
IBM IM for Text offers a rich set of text analysis
tools, including a feature extraction component that can
automatically derive a vocabulary that captures key terms and
concepts appropriate for the document collection analyzed. Elements
of the vocabulary can be multiword terms; names of people,
organizations, and places; abbreviations; and key numeric figures,
such as currency amounts and dates. IM uses algorithms that are
sophisticated enough to recognize "credit facility," "credit line,"
"Credit Lyonnais," and "Credit Suisse" as four separate concepts,
while recognizing "Bill Clinton" and "President Clinton" as the
same entity, distinct from "Clinton, N.J." A similar tool extracts
significant sentences from a document to create a summarized
version. Figure 2 and Figure
3 show examples of feature extraction.
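For a sense of what the simplest form of feature extraction looks like, the Python sketch below pulls capitalized multiword names and dollar amounts out of a sentence with regular expressions. This is a deliberately naive stand-in, not the linguistic analysis IM for Text performs: the patterns and names are assumptions, and the sketch cannot tell that "Bill Clinton" and "President Clinton" are the same entity, nor avoid mistaking a capitalized sentence opening for a name.

```python
import re

# Two or more consecutive capitalized words, e.g. "Credit Suisse".
NAME = re.compile(r"\b[A-Z][a-z]+(?: [A-Z][a-z]+)+\b")

# Dollar amounts such as "$25", "$1,200.50", or "$25 million".
MONEY = re.compile(r"\$\d+(?:,\d{3})*(?:\.\d+)?(?: (?:million|billion))?")

def extract_features(text):
    """Return candidate names and currency amounts found in the text."""
    return {
        "names": NAME.findall(text),
        "amounts": MONEY.findall(text),
    }

text = ("Credit Suisse extended a credit line of $25 million "
        "to the firm, President Clinton said.")
features = extract_features(text)
print(features)
```

Even this crude version separates "Credit Suisse" (a name) from "credit line" (an ordinary term), which hints at why capitalization, context, and a learned vocabulary all matter in a production-quality extractor.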
Figure 1. Hierarchical document clustering

Figure 2. Sample feature extraction

Figure 3. Sample feature extraction

Another IM component can automatically identify the
language of a document - an important feature in text mining. You
can implement this capability using IM for Text categorization
tools that also make the product trainable and extensible to other
languages.
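Treating language identification as a categorization task can be sketched in a few lines of Python. The example below compares character-trigram profiles, a widely used technique that is assumed here for illustration and is not necessarily what IM for Text does; the tiny training samples and function names are also assumptions. Adding a language only requires adding a training sample, which is the sense in which such an approach is trainable and extensible.

```python
import math
from collections import Counter

def trigram_profile(text):
    """Count overlapping character trigrams in lowercased text whose
    whitespace has been normalized to single spaces."""
    padded = " " + " ".join(text.lower().split()) + " "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))

def similarity(a, b):
    """Cosine similarity between two trigram profiles."""
    dot = sum(count * b[gram] for gram, count in a.items() if gram in b)
    norm = (math.sqrt(sum(c * c for c in a.values()))
            * math.sqrt(sum(c * c for c in b.values())))
    return dot / norm if norm else 0.0

def train(samples):
    """Build one trigram profile per language from sample text."""
    return {language: trigram_profile(text) for language, text in samples.items()}

def identify_language(text, profiles):
    """Return the language whose profile is most similar to the text."""
    profile = trigram_profile(text)
    return max(profiles, key=lambda language: similarity(profile, profiles[language]))

# Toy training samples; a real system would train on much more text.
profiles = train({
    "english": "the quick brown fox jumps over the lazy dog and the cat sleeps in the house",
    "german": "der schnelle braune fuchs springt und die katze und der hund schlafen im haus",
    "french": "le renard brun rapide saute par dessus le chien paresseux et le chat dort dans la maison",
})
print(identify_language("the dog and the cat are in the house", profiles))
```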
The feature extraction and language identification
components in IM for Text are available for developing custom
text-mining applications. IM for Text's sophisticated search engine
also uses them for information retrieval. The search engine
includes an indexing tool that performs in-depth linguistic
analysis of the documents in the collection to prepare a data
structure that facilitates fast and effective information
retrieval. Although most of the indexing is performed offline, IM
for Text is capable of updating the index on the fly, while it's
processing queries. Client applications typically submit queries to
the search engine's server. Using the index and other optional
resources, such as dictionaries and thesaurus files, the server can
efficiently search very large collections of documents written in
any of 16 different languages, including double-byte character-set
languages (such as Japanese, Chinese, and Korean), and stored in
various file formats.
Figure 4 shows a typical
client/server configuration. The query language in this example
allows the use of free text; Boolean expressions with conjunction,
disjunction, and exclusion of search terms; and a hybrid
combination of the two. The search engine supports several
paradigms, including precise-term searching, probabilistic
retrieval, phonetic searches, and fuzzy searching.
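The query styles named above can be illustrated with a small inverted index in Python. This sketch is not the IM for Text query language or API: the keyword arguments standing in for conjunction, disjunction, and exclusion, the use of the standard library's difflib for fuzzy matching, and the sample documents are all assumptions made for the example.

```python
import difflib
import re
from collections import defaultdict

def build_inverted_index(docs):
    """Map each lowercase term to the set of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for term in re.findall(r"[a-z]+", text.lower()):
            index[term].add(doc_id)
    return index

def boolean_search(index, all_of=(), any_of=(), none_of=()):
    """Precise-term search with conjunction (all_of), disjunction
    (any_of), and exclusion (none_of) of search terms."""
    result = set().union(*index.values())  # start from every indexed document
    for term in all_of:
        result = result & index.get(term, set())
    if any_of:
        result = result & set().union(*(index.get(t, set()) for t in any_of))
    for term in none_of:
        result = result - index.get(term, set())
    return sorted(result)

def fuzzy_search(index, term, cutoff=0.8):
    """Find documents containing any indexed term spelled close to `term`."""
    matches = difflib.get_close_matches(term, list(index), n=5, cutoff=cutoff)
    return sorted(set().union(*(index[m] for m in matches)))

docs = [
    "credit line approved for the retailer",
    "credit facility extended by the bank",
    "bank holiday schedule announced",
]
index = build_inverted_index(docs)

print(boolean_search(index, all_of=["credit"], none_of=["bank"]))
print(boolean_search(index, any_of=["line", "holiday"]))
print(fuzzy_search(index, "facilty"))  # misspelled on purpose
```

Building the index once and answering every query from it is also what makes retrieval fast: no query has to rescan the document text.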
Document collections are often decentralized and are
constantly changing (the Internet is a prime example). Therefore,
you need tools that automatically explore the collection to track
the changes and identify all documents in it. IM for Text includes
a Web crawler toolkit, which you can use to develop customized
agents that monitor collections of Web pages on the Internet or an
intranet.
E-mail and beyond
Let's now consider a practical example of text
mining: coping with e-mail. If you are like most people, you
receive more e-mail messages in a day than hard-copy
correspondence. You read some, delete many, and file the rest. You
probably have devised some rules for filing: you may file some
messages by sender, others by topic. A text-mining application for
e-mail could employ categorization tools to automatically determine
which incoming messages should be kept for reading, immediately
deleted, or filed unread. Messages to be filed could also be
categorized automatically and assigned to the appropriate folder
using a second categorization engine. The categorization algorithm
can discover the rules that govern your particular message routing
and filing scheme automatically, by analyzing your actions as you
process your e-mail. Information retrieval tools can enhance the
primitive string search capabilities of most e-mail reading tools.
Linguistic analysis and feature extraction in the collection of
accumulated e-mail would allow sophisticated text queries to
identify pertinent e-mail messages based on their content. (For
example, that kind of analysis would let you find all archived
messages that are related in content to the one you are reading.)
You can use clustering to further organize your e-mail notes
thematically or to identify predominant themes within a category of
filed notes.
Clearly, the potential of text mining goes beyond
e-mail. News, patents, other forms of intellectual capital, and
knowledge management in general can also benefit from such capabilities.
Analysts estimate that 80 percent of the information an enterprise
possesses is in unstructured form, vs. 20 percent structured.
Faster, better access to this information can certainly have a
positive impact on a business. Internally, text-mining solutions
can help store, manage, retrieve, and deliver the intellectual
capital of business communications. Examining external sources,
text mining can help track developments in an industry and analyze
the competition by pulling and analyzing relevant news, reports,
and patents. The IBM Business Intelligence organization has a
worldwide consulting practice that specializes in text mining
applications, with experience helping customers in various
industries develop solutions that leverage their most important
asset: information.
Figure 4. An information retrieval solution

Global uses for text mining
Beyond facilitating access to information,
text-mining solutions can help improve operations in profound ways.
In the context of customer relationship management, text mining can
help glean useful information about customers. Customer
communications encompass more than placing orders and processing
returns. A lot of useful information about a customer and the
business arrives in the form of praises, complaints, desires, and
suggestions. Feature extraction can also be used to automatically
assess customers' satisfaction with their relationship to the
business. Data mining techniques are used to profile customers or
to assess their propensity to purchase or their risk for attrition
(among many other things) based on structured data, such as
demographics, psychographics, and the customer's past transactions
with the company. Powerful solutions combine data and text mining
capabilities and expand these analyses to factor in customer
attributes inferred from nontransactional communications.
Reprinted with permission from the Spring 2001 issue
of DB2 Magazine. Copyright
CMP Media.
About the author
Stefanos Manganaris is a senior consultant in the
Knowledge Discovery Consulting group at IBM. He is responsible for
data mining solutions in consulting engagements and research in
knowledge discovery with applications to business problems across
industries. You can e-mail him at stefanos@us.ibm.com.