================================================================
 Bayon : a simple and fast clustering tool

 Copyright(C) Mizuki Fujisawa <fujisawa@bayon.cc>
================================================================

Overview:
  Bayon is a simple and fast hard-clustering tool.
  Bayon supports repeated-bisection clustering and k-means clustering.

Install:
  % ./configure
  % make
  % make check
  % sudo make install

Usage:
  * Clustering input data
    % bayon -n num [options] file
    % bayon -l limit [options] file
       -n, --number=num      the number of clusters
       -l, --limit=lim       limit value of cluster bisection
       -p, --point           output similarity points
       -c, --clvector=file   save vectors of cluster centroids
       --clvector-size=num   max size of output vectors of
                             cluster centroids (default: 50)
       --method=method       clustering method (rb, kmeans), default: rb
       --seed=seed           set a seed for random number generator

  * Get the similar clusters for each input document
    % bayon -C file [options] file
       -C, --classify=file   target vectors
       --inv-keys=num        max size of the keys of each vector to be
                             looked up in inverted index (default: 20)
       --inv-size=num        max size of the inverted index of each key
                             (default: 100)
       --classify-size=num   max size of output similar groups
                             (default: 20)

  * Common options
       --idf                 apply idf to input vectors
       -h, --help            show help messages
       -v, --version         show the version and exit

Example:
  * clustering (number_of_output_clusters = 100)
    % bayon -n 100 input.tsv > cluster.tsv

  * clustering (save the vectors of cluster centroids)
    % bayon -n 100 -c centroid.tsv input.tsv > cluster.tsv

  * classification (get similar clusters using centroid vectors)
    % bayon -C centroid.tsv input.tsv > classify.tsv
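
  * calling the clustering commands above from a script
    The following is a minimal Python sketch, not part of Bayon itself.
    It only assembles the -n and -c options shown above; the helper names
    clustering_command and run_to_file are hypothetical, and it assumes
    the bayon executable is on PATH.

```python
import subprocess


def clustering_command(input_path, num_clusters, centroid_path=None):
    """Build the argument list for 'bayon -n num [-c file] input'."""
    cmd = ["bayon", "-n", str(num_clusters)]
    if centroid_path is not None:
        # -c saves the vectors of cluster centroids to the given file.
        cmd += ["-c", centroid_path]
    cmd.append(input_path)
    return cmd


def run_to_file(cmd, output_path):
    """Run the command and redirect its standard output to output_path.

    Assumes the bayon executable is installed and on PATH.
    """
    with open(output_path, "w") as out:
        subprocess.run(cmd, stdout=out, check=True)


# Equivalent to: bayon -n 100 -c centroid.tsv input.tsv
cmd = clustering_command("input.tsv", 100, centroid_path="centroid.tsv")
```

    Passing the arguments as a list avoids shell quoting problems with
    file names; run_to_file(cmd, "cluster.tsv") would then write the result.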

Format of Input Data:
  * list of the vectors of input documents for clustering and classification

    document_id1 \t key1-1 \t value1-1 \t key1-2 \t value1-2 \t ...\n
    document_id2 \t key2-1 \t value2-1 \t key2-2 \t value2-2 \t ...\n
    ...

    * document_id : string
    * key         : string
    * value       : double

  * list of the vectors of cluster centroids

    cluster_id1 \t key1-1 \t value1-1 \t key1-2 \t value1-2 \t ...\n
    cluster_id2 \t key2-1 \t value2-1 \t key2-2 \t value2-2 \t ...\n
    ...

    * cluster_id : string
    * key        : string
    * value      : double
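
  * writing vectors in this format
    Both lists above share one layout: an id followed by tab-separated
    key/value pairs, one vector per line. The following Python sketch
    shows one way to produce such a file. The function name write_vectors
    and the sample documents are illustrative assumptions, not part of
    Bayon.

```python
import io


def write_vectors(vectors, out):
    """Write {id: {key: value}} as one tab-separated line per vector."""
    for vec_id, features in vectors.items():
        fields = [str(vec_id)]
        for key, value in features.items():
            fields.append(str(key))
            # Values are doubles; repr() keeps the full float precision.
            fields.append(repr(float(value)))
        out.write("\t".join(fields) + "\n")


# Hypothetical sample data, for illustration only.
docs = {
    "doc1": {"apple": 2.0, "banana": 1.0},
    "doc2": {"banana": 3.0, "cherry": 0.5},
}
buf = io.StringIO()
write_vectors(docs, buf)
```

    Replace the StringIO buffer with a file opened for writing to create
    an input file. Ids and keys must not contain tab or newline
    characters, since those are the field and record separators.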

Format of Output Data:
  * list of clusters (output of clustering)

    cluster_id1 \t document_id1 \t document_id2 \t document_id3 \t ...\n
    cluster_id2 \t document_id4 \t document_id5 \t document_id6 \t ...\n
    ...

    * cluster_id  : integer (>= 1)
    * document_id : string

  * list of clusters with similarity values between documents and clusters
    (when clustering with the [-p] option)

    cluster_id1 \t document_id1 \t point1 \t document_id2 \t point2 \t ...\n
    cluster_id2 \t document_id3 \t point3 \t document_id4 \t point4 \t ...\n
    ...

    * cluster_id  : integer (>= 1)
    * document_id : string
    * point       : double

  * list of the vectors of cluster centroids
    (when clustering with the [-c file] option)

    cluster_id1 \t key1-1 \t value1-1 \t key1-2 \t value1-2 \t ...\n
    cluster_id2 \t key2-1 \t value2-1 \t key2-2 \t value2-2 \t ...\n
    ...

    * cluster_id : integer (>= 1)
    * key        : string
    * value      : double

  * list of similar clusters for each input document

    document_id1 \t cluster_id1 \t point1 \t cluster_id2 \t point2 \t ...\n
    document_id2 \t cluster_id3 \t point3 \t cluster_id4 \t point4 \t ...\n
    ...

    * document_id : string
    * cluster_id  : string
    * point       : double
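
  * reading the output formats
    The following Python sketch parses the cluster list (with or without
    the [-p] option) and the list of similar clusters into dictionaries.
    The function names and the sample lines are illustrative assumptions,
    not actual Bayon output.

```python
def _pairs(fields):
    """Turn [name1, point1, name2, point2, ...] into [(name, point), ...]."""
    return [(fields[i], float(fields[i + 1]))
            for i in range(0, len(fields) - 1, 2)]


def parse_clusters(lines, with_points=False):
    """Parse clustering output into {cluster_id: members}.

    Members are document ids, or (document_id, point) tuples when the
    output was produced with the -p option.
    """
    clusters = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if fields[0] == "":
            continue  # skip blank lines
        rest = fields[1:]
        clusters[int(fields[0])] = _pairs(rest) if with_points else rest
    return clusters


def parse_classification(lines):
    """Parse classification output into {document_id: [(cluster_id, point)]}."""
    result = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if fields[0] == "":
            continue  # skip blank lines
        # cluster_id stays a string here, as in the format description.
        result[fields[0]] = _pairs(fields[1:])
    return result


# Hypothetical sample lines, for illustration only.
plain = parse_clusters(["1\tdoc1\tdoc2\n", "2\tdoc3\n"])
scored = parse_clusters(["1\tdoc1\t0.9\tdoc2\t0.8\n"], with_points=True)
similar = parse_classification(["doc1\t3\t0.75\t7\t0.5\n"])
```

    Each function accepts any iterable of lines, so an open file object
    can be passed directly.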

Requirement:
  * C++ compiler with STL (Standard Template Library)
  * google-sparsehash <http://code.google.com/p/google-sparsehash/>
    (If google-sparsehash is not installed,
     the clustering tool uses __gnu_cxx::hash_map or std::map instead)

License:
  GPL2 (GNU General Public License Version 2)
