Matilda (Mining Artificially Generated Data). The overall goal of this research line is to apply data mining techniques to engineering problems, specifically to civil engineering design tasks, but also to the model formation and analysis tasks that arise in diagnosis problems.
Simulation Data Mining. This work concerns generating knowledge and decision rules from large numbers of simulations. Simulation data mining is particularly valuable when engineering design is impeded by the time required to generate simulation results for large models in an interactive setting. Instead, simulation results can be batch-processed in the background while the designer continues with other work, and returned later on demand. If the requested result is not yet available, the result of the nearest neighbor can be returned in the interim as an approximate preview, allowing the designer to continue without interruption. It is also of interest to predict the expected behavior of models from training data made up of existing simulation results when the designer faces a design space of enormous size.
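The nearest-neighbor preview described above can be sketched as a cache lookup with a fallback. This is a minimal illustration, not the project's implementation: the function name `lookup_with_preview`, the parameter tuples, and the stored values are hypothetical, and plain Euclidean distance on unscaled parameters is an assumption (in practice parameters with different units would need scaling).

```python
import math

def lookup_with_preview(requested, cache):
    """Return (result, is_exact) for a requested design-parameter tuple.

    `cache` maps parameter tuples of finished simulations to their results.
    """
    if not cache:
        raise LookupError("no simulation results available yet")
    if requested in cache:
        return cache[requested], True
    # Fall back to the finished design closest in Euclidean distance
    # (assumption: unscaled parameters are comparable).
    nearest = min(cache, key=lambda params: math.dist(params, requested))
    return cache[nearest], False

# Hypothetical data: (span in m, thickness in m) -> maximum deflection in mm.
finished = {
    (10.0, 0.30): 4.2,
    (12.0, 0.30): 6.9,
    (12.0, 0.40): 5.1,
}

exact = lookup_with_preview((12.0, 0.30), finished)    # already simulated
preview = lookup_with_preview((11.8, 0.38), finished)  # approximate preview
```

The boolean flag lets an interface mark a preview as approximate until the batch-processed result for the requested model arrives.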
Similarity Measurement. A similarity measure can support simulation data mining, based on the idea that models of similar design will have similar simulation results. Under this assumption, models similar to any given model can be looked up, or models of similar design can be identified from their simulation behavior.
First, a subset of the design space such as geometry and material parameters is considered for exploration (step 1). Next, the simulation results are produced using the Finite Element Method (step 2). Following that, thousands of simulation measurements are aggregated into a more manageable subset (step 3). Then clustering technology is applied to generate knowledge about nearest neighbors (step 4), and an appropriate set of this knowledge is sampled (step 5). Finally, class probability estimates from machine learning classification technology are exploited to produce similarity scores (step 6). With the exception of the simulations, all steps pose interesting data mining questions, for which many competing alternatives have to be considered and evaluated.
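One way to read the last two steps is sketched below: pairs of designs are labeled by whether clustering placed them in the same cluster, a classifier is trained on those pairs, and its class probability for "same cluster" serves as the similarity score. This is an interpretation under stated assumptions, not the project's actual method: the pair features (absolute parameter differences), the hand-written logistic regression, and the design and cluster data are all hypothetical.

```python
import math
from itertools import combinations

def pair_features(a, b):
    """Describe a pair of designs by absolute per-parameter differences."""
    return [abs(x - y) for x, y in zip(a, b)]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(X, y, lr=1.0, epochs=2000):
    """Fit logistic regression by batch gradient descent; returns (weights, bias)."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) - yi
            grad_b += err
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
        w = [wj - lr * g / len(X) for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / len(X)
    return w, b

def similarity(a, b, model):
    """Class probability that designs a and b fall in the same cluster."""
    w, bias = model
    return sigmoid(bias + sum(wj * fj for wj, fj in zip(w, pair_features(a, b))))

# Hypothetical normalized design parameters and cluster labels (from step 4).
designs = [(0.10, 0.20), (0.15, 0.25), (0.20, 0.15), (0.25, 0.10),
           (0.80, 0.85), (0.85, 0.75), (0.90, 0.80), (0.75, 0.90)]
clusters = [0, 0, 0, 0, 1, 1, 1, 1]

# Step 5 (here: all pairs instead of a sample), labeled by cluster agreement.
pairs = list(combinations(range(len(designs)), 2))
X = [pair_features(designs[i], designs[j]) for i, j in pairs]
y = [1.0 if clusters[i] == clusters[j] else 0.0 for i, j in pairs]

# Step 6: the trained classifier's probability is used as a similarity score.
model = train_logistic(X, y)
near = similarity((0.20, 0.20), (0.25, 0.20), model)
far = similarity((0.20, 0.20), (0.80, 0.80), model)
```

Because the classifier only sees design parameters, it can score designs that have never been simulated, which is what makes the learned measure useful for lookups.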
Mining and Storing Big Data. A student project in summer 2012 called "Mining and Storing Big Data" studied the relationship between simulation results and machine learning results. As a sub-theme, there is interest in applying so-called "big data" technology to address bottlenecks. The Hadoop and Mahout frameworks were applied to address this theme. Hadoop is used for concurrent processing of the numerical simulations, which is an inherently parallelizable task. Mahout provides a library of supervised and unsupervised learning methods, enabling concurrent processing of those parts that are not otherwise parallelizable.
As another sub-theme, there is interest in implementing methods for making the work easier to reproduce and disseminate. An online version of the six-step "simulation pipeline" implemented as a TIRA experiment addresses reproducibility. The TIRA web service provides an online framework to allow researchers to share experiments on the web for others to reproduce, and provides other features such as the ability to explore experiment parameters, monitor experiment progress, and reuse cached results.
Domain Decomposition. There is also interest in speeding up numerical analysis in general, so that the findings can be applied efficiently elsewhere. A method called domain decomposition can be applied to parallelize the processing. To do this, the domain on which the numerical problem is solved is broken into several overlapping sub-domains for concurrent processing on modern computing architectures. The full solution is then reassembled using the overlapping parts of the sub-domains.
The additive Schwarz method is one way to implement domain decomposition. In this problem setting, a complex trade-off must be managed between the number of iterations the additive Schwarz method requires and the total size of the problem, including the redundancy introduced by the overlaps. As a simple baseline, a human could arrange the sub-domains in a checkerboard pattern and apply a uniform overlap. However, our current work shows that more customized and efficient solutions can be developed with regression analysis in machine learning. Our code for supporting this work is available for download: dd-code-release.zip (1.4 MB).
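The trade-off between overlap and iteration count can be illustrated on a toy problem. The sketch below applies a damped additive Schwarz iteration to the 1D Poisson problem -u'' = f on (0, 1) with zero boundary values, using two overlapping sub-domains. It is not taken from the released code: the function names, the damping factor of 0.5, and the stationary iteration are assumptions made for illustration (in practice additive Schwarz is usually used as a preconditioner for a Krylov method).

```python
def solve_local(rhs, h):
    """Solve (1/h^2) * tridiag(-1, 2, -1) x = rhs with the Thomas algorithm."""
    n = len(rhs)
    off = -1.0 / h**2
    diag = [2.0 / h**2] * n
    d = list(rhs)
    for i in range(1, n):
        w = off / diag[i - 1]
        diag[i] -= w * off
        d[i] -= w * d[i - 1]
    x = [0.0] * n
    x[-1] = d[-1] / diag[-1]
    for i in range(n - 2, -1, -1):
        x[i] = (d[i] - off * x[i + 1]) / diag[i]
    return x

def residual(u, f, h):
    """Residual f - A u of the finite-difference Poisson operator."""
    n = len(u)
    r = []
    for i in range(n):
        left = u[i - 1] if i > 0 else 0.0
        right = u[i + 1] if i < n - 1 else 0.0
        r.append(f[i] - (2.0 * u[i] - left - right) / h**2)
    return r

def additive_schwarz(f, h, subdomains, damping=0.5, tol=1e-8, max_iter=10000):
    """Damped additive Schwarz iteration; returns (solution, iterations).

    `subdomains` is a list of half-open index ranges (lo, hi) that overlap
    and together cover all interior nodes.
    """
    u = [0.0] * len(f)
    for it in range(max_iter):
        r = residual(u, f, h)
        if max(abs(v) for v in r) < tol:
            return u, it
        correction = [0.0] * len(f)
        # The local solves are independent and could run concurrently.
        for lo, hi in subdomains:
            for k, v in enumerate(solve_local(r[lo:hi], h)):
                correction[lo + k] += v
        u = [ui + damping * ci for ui, ci in zip(u, correction)]
    return u, max_iter

n, h = 49, 1.0 / 50
f = [1.0] * n
u_wide, iters_wide = additive_schwarz(f, h, [(0, 30), (20, n)])      # 10 shared nodes
u_narrow, iters_narrow = additive_schwarz(f, h, [(0, 26), (24, n)])  # 2 shared nodes
```

The wider overlap needs fewer iterations but makes each local solve larger, which is the kind of trade-off that a learned regression model would be asked to balance.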
- Steven Burrows
- Tim Gollub
- Benno Stein
- Andreas Bunte, University of Applied Sciences Ostwestfalen Lippe
- Jörg Frochte, Bochum University of Applied Sciences
- Oliver Niggemann, Helmut-Schmidt-Universität Hamburg
Students: David Wiesner, Katja Müller, Peter Hirsch, Jens Opolka, Tom Paschke, and Michael Völske.