Techreport,

An Information-Theoretical Approach to Genetic Algorithms for Clustering

, and .
(2001)

Abstract

In a database with categorical attributes each attribute denes a partition whose classes can be regarded as natural clusters of rows. The main theme of this paper is nding a partition of the rows of the database that is as close as possible to the partitions associated to each attribute. The classes of this partition will then be treated as clusters of rows. We evaluate the closeness of two partitions by using certain generalizations of the classical conditional entropy. From this perspective, we wish to construct a partition (referred to as the median partition) such that the sum of the dissimilarities between this partition and all the partitions determined by the attributes of the database is minimal. Then, the problem of nding the median partition is an optimization problem over the space of all partitions of the rows of the database for which we give an approximative solution. To search more eciently the space of possible partitions, which can be very large, we are using a genetic algorithm. Partitions are represented by chromosomes and we tested both the classical techniques of mutation and crossover and certain special mutation and crossover methods that contain specic knowledge of the problem domain. Keywords: median partition, Shannon entropy, Gini index, mutations, crossover operations 1

Tags

Users

  • @k.e.

Comments and Reviews