Parallel K-Means Data Clustering for Large Data Sets

This software package for parallel K-means data clustering is an extension of
the Parallel K-means Software for handling data sets with more than 2 billion
data points. It uses the "long long" data type to represent the number of data
points, instead of "int" as in the previous release. Note that this software
contains only the MPI version and uses Parallel netCDF (PnetCDF) for its I/O.
You can build and install PnetCDF in your home directory; the PnetCDF release
includes build recipes for various machine platforms.

To compile:

Edit the Makefile to make the following changes and then run the command "make".
  MPICC       -- set to the path of the MPI C compiler.
  PNETCDF_DIR -- set to the path where Parallel netCDF is installed.
  DATATYPE    -- set to the data type of your input data. In this release,
                 "short" is used. Valid values are: "char", "int", "float",
                 "double", "long long"

To run:
  The "make" command produces an executable file named "mpi_main".
  Command-line arguments:
  Usage: mpi_main [switches]
       -i filename    : input netCDF file containing data to be clustered
       -v var_name    : name of variable in the netCDF file to be clustered
       -c filename    : name of netCDF file that contains the initial cluster
                        centers; if skipped, the same file from option "-i" is
                        used
       -k var_name    : name of variable in the netCDF file to be used as the
                        initial cluster centers; if skipped, the variable name
                        from the option "-v" is used
       -n num_clusters: number of clusters (K, must be > 1)
       -t threshold   : threshold value (default 0.0010)
       -o filename    : name of the output netCDF file; if skipped, a default
                        name derived from the input file name is used
       -q             : quiet mode
       -d             : enable debug mode
       -h             : print this help information
 
Input file format:

  Only the netCDF file format is supported in this software release. A few
  example files are provided in the sub-directory ./Image_data. More
  information about the netCDF file format can be found at the links below.

  * netCDF is a portable and self-describing file format. See
    http://www.unidata.ucar.edu/software/netcdf
  * Parallel netCDF (PnetCDF) is used to carry out parallel I/O. For further
    information about PnetCDF, see
    http://cucis.ece.northwestern.edu/projects/PnetCDF

Output file:

  The output file is in netCDF format (CDF-5). If the command-line option "-o"
  is not used, the default output file name is the input file name with
  ".kmeans_out" inserted before the file extension ".nc", which is preserved.
  For example, if the input file name is "input.nc", then the default output
  file name will be "input.kmeans_out.nc".

Output variables:
  * Coordinates of the cluster centers are stored in the variable named
    "clusters".
  * Membership of all data points to the clusters is stored in the variable
    named "membership".

Examples: Below are a sample run and the file headers of the input and output
files, using the example file "Image_data/color17695.nc".

    % mpiexec -n 4 mpi_main -q -i Image_data/color17695.nc -v color17695 -c Image_data/color17695.nc -k color17695 -n 4 -o output/out.nc

    Performing **** Parallel Kmeans  (MPI) ****
    Num of processes = 4
    Input file       : Image_data/color17695.nc
    Output file      : output/out.nc
    numObjs          = 17695
    numCoords        = 9
    numClusters      = 4
    threshold        = 0.0010
    I/O time         =     0.0672 sec
    Computation time =     0.0463 sec



    % ncmpidump -h Image_data/color17695.nc
    netcdf color17695 {
    // file format: CDF-5 (big variables)
    dimensions:
	    num_elements = 17695 ;
	    num_coordinates = 9 ;
    variables:
	    float color17695(num_elements, num_coordinates) ;
    }



    % ncmpidump -h output/out.nc
    netcdf out {
    // file format: CDF-5 (big variables)
    dimensions:
	    num_clusters = 4 ;
	    num_coordinates = 9 ;
	    num_elements = 17695 ;
    variables:
	    float clusters(num_clusters, num_coordinates) ;
	    int64 membership(num_elements) ;
    }



Wei-keng Liao (wkliao@eecs.northwestern.edu)
EECS Department
Northwestern University

Nov. 30, 2013

