confStream


A common problem in clustering is the proper choice of algorithm and optimal parameter settings. This is even more challenging in stream clustering where clusters are identified and updated over time. Traditionally, automated algorithm selection and configuration is available which can automatically find the best parameter settings. Unfortunately, none of the existing approaches are directly applicable to the streaming scenario. confStream presents the first approach for automated algorithm selection and configuration of stream clustering algorithms. It uses an ensemble of different configurations which allows to automatically find and adapt optimal parameters over time.

Publications

The corresponding paper has been accepted for publication at the ECMLPKDD ’19. Other publications are currently in review and will be available here once accepted.


Implementation

An implementation of the proposed algorithm is available here: https://github.com/MatthiasCarnein/moa/tree/algorithmConfiguration.

The algorithm is implemented in Java as a stream clustering algorithm for the Massive Online Analysis (MOA) framework. The algorithm makes use of a configuration file which specifies the algorithms to use as well as their initial configuration. All stream clustering algorithms implemented in MOA are available. The algorithm is currently developed in a fork and will be merged eventually.

An example on how to cluster a Random RBF stream is shown below:

public static void main(String[] args) {

	// initialise algorithm and set configuration file
	ConfStream algorithm = new ConfStream();
	algorithm.fileOption.setValue("settings.json");
	algorithm.prepareForUse();

	// generate stream
	RandomRBFGeneratorEvents stream = new RandomRBFGeneratorEvents();
	stream.prepareForUse();

	// train algorithm
	for (int i = 0; i < 1000000; i++) {
		Instance inst = stream.nextInstance().getData();
		algorithm.trainOnInstanceImpl(inst);
	}

	// get result
	algorithm.getClusteringResult();

}

A configuration file which performs algorithm selection and configuration for the DenStream ("denstream.WithDBSCAN"), ClusTree ("clustree.ClusTree"), CluStream ("clustream.WithKmeans") and BICO ("kmeanspm.BICO") algorithm is shown below:

{
	"windowSize": 1000,
	"ensembleSize": 20,
	"newConfigurations": 10,
	"keepCurrentModel": "true",
	"reinitialiseWithClusters": "true",
	"preventAlgorithmDeath": "true",
	"evaluateMacro": "false",
	"keepGlobalIncumbent": "true",
	"keepAlgorithmIncumbents": "true",
	"keepInitialConfigurations": "true",
	"useTestEnsemble": "true",
	"lambda": 0.05,
	"resetProbability": 0.01,
	"numberOfCores": 1,
	"algorithms": [
		{
			"algorithm": "denstream.WithDBSCAN",
			"parameters": [
				{"parameter": "e", "type":"numeric", "value":0.02, "range":[0,1]},
				{"parameter": "b", "type":"numeric", "value":0.2, "range":[0,1]},
				{"parameter": "m", "type":"integer", "value":1, "range":[0,10000]},
				{"parameter": "o", "type":"integer", "value":2, "range":[2,20]},
				{"parameter": "l", "type":"numeric", "value":0.25, "range":[0,1]}
			]
		}
		,
		{
			"algorithm": "clustree.ClusTree",
			"parameters": [
				{"parameter": "H", "type":"integer", "value":8, "range":[1,20]},
				{"parameter": "B", "type":"boolean", "value":"false"}
			]
		}
		,
		{
			"algorithm": "clustream.WithKmeans",
			"parameters": [
				{"parameter": "k", "type":"integer", "value":5, "range":[2, 20]},
				{"parameter": "m", "type":"integer", "value":100, "range":[1,10000]},
				{"parameter": "t", "type":"integer", "value":2, "range":[1,10]}
			]
		}
		,
		{
			"algorithm": "kmeanspm.BICO",
			"parameters": [
				{"parameter": "k", "type":"integer", "value":5, "range":[2, 20]},
				{"parameter": "n", "type":"integer", "value":1000, "range":[1,2000]},
				{"parameter": "p", "type":"integer", "value":10, "range":[1,20]},
				{"parameter": "d", "type":"integer", "value":10, "optimise":"false"}
			]
		}
	]
}

For every parameter, optimisation can be turned off with "optimise":"false" in order to prevent changes to the defined parameter value. In the example above, the dimensionality parameter ("d") of BICO ("kmeanspm.BICO") is set to a fixed value without parameter optimisation. If a parameter is undefined, its default value will be used.

A wide range of parameter types is supported, including numerical, integer, ordinal, categorical and boolean. Their basic usage is shown below:

"parameters": [
	{"parameter": "n", "type":"numeric", "value":0.02, "range":[0,1]},
	{"parameter": "i", "type":"integer", "value":10, "range":[0,100]},
	{"parameter": "o", "type":"ordinal", "value":"good", "range":["bad","medium","good"]},
	{"parameter": "c", "type":"categorical", "value":"euclidean", "range":["euclidean","manhattan","mahalanobis"]},
	{"parameter": "b", "type":"boolean", "value":"true"}
]

ECMLPKDD ’19

We proposed an initial idea for our algorithm at the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECMLPKDD '19). For reference, the implementation at the time of publication is still available here: https://github.com/MatthiasCarnein/moa/tree/ECMLPKDD.

Since then, this version has been superseded and heavily extended as shown above. In particular, the configuration file used to have fewer settings. A configuration file for this version which optimises the ε ("e") parameter of the DenStream algorithm ("denstream.WithDBSCAN") is shown below:

{
	"windowSize": 1000,
	"ensembleSize": 25,
	"newConfigurations": 10,
	"keepCurrentModel": "true",
	"preventAlgorithmDeath": "true",
	"lambda": 0.05,
	"algorithms": [
		{
			"algorithm": "denstream.WithDBSCAN",
			"parameters": [
				{"parameter": "e", "type":"numeric", "value":0.02, "range":[0,1]}
			]
		}
	]
}