数据集特点

Instant Message:即时聊天信息

Usage:用处

Instant message clustering is very useful for analyzing its content characteristic or establishing other mining application.

Features:文本特点

短文本,特征稀疏:sparse key words when clustering on instant messages

The most common text processing approach is to represent the documents with vectors. This is so—called vector space model. in which a vector corresponds to one document and the dimensions correspond to words in this document. Once the high-dimensional vectors are derived. the major challenge left for document clustering is how to deal with these high dimensional data. However. instant message is extremely shorter than the common document. There are usually only several key words in one instant message. and key words about the message topic are even latent sometimes.The sparsity of key words makes word—frequency based methods inappropriate to measure the similarity among Instant messages.

Process:处理方法

methos to enhance the description of instant messages to response the problem of sparse key words when clustering on instant message.Firstly, we notice that instant message is a kind of semi-structured data. which has source and destination addresses with time stamp. Instant messages sent back and forth among specific persons during some specific time intervals form a conversation, which groups these instant messages into a specific topic. So we combine these messages as one conversation. it is obvious that conversation has more key works and more integral context information than simply single message. Then clustering is performed toward conversations instead of messages. Secondly. we enhance the content description of a conversation with words. which are not in the conversation but are closely related with existing words in this conversation. Fox example. words ‘ball’ and ‘football‘ are added to lM-2. which are not appear in lM-2 but have obvious correlation with the word. ‘basketball’. in lM-2. in this paper. we propose an instant message clustering method called WR-KMcans. which can automatically scan instant message corpora, construct conversations and enhance traditional TF-lDF model by adding relevant words in conversations.