Project purpose

TIMaCS addresses the administrative challenges arising from the increasing complexity of computing systems, especially of resources with a performance of several petaflops.

The project aims to reduce the complexity of manually administering computing systems by realising a framework for the intelligent management of even very large computing systems, based on technologies for virtualisation, knowledge-based analysis and validation of collected information, and the definition of metrics and policies. This framework should be able to start predefined actions automatically, in addition to notifying an administrator. Beyond that, data analysis based on previous monitoring data, regression tests and intensive regular checks aims at enabling preventive actions before failures occur. The framework will include open interfaces so that it can easily be bound to relevant existing systems such as accounting or user management systems (user policies, priority, ...). We aim to develop a production-ready framework and to validate it at the High Performance Computing Center Stuttgart (HLRS), the Center for Information Services and High Performance Computing (ZIH) and the Computing Center at the Philipps-Universität Marburg.

Objectives

  1. Concept and implementation of a robust, highly scalable and production-ready monitoring solution for very large computing systems, based on existing tools and supplementary implementations.
  2. Design and implementation of a system for partitioning very large computing systems and dynamically assigning them to users, based on virtualisation concepts. Easy setup and removal of single compute nodes in a heterogeneous or hybrid system will be included.
  3. On top of that, a management framework will be developed that supports different policy-based automation and escalation strategies: notification of an administrator, semi-automatic to fully automatic counteractions, prognoses and anomaly detection, together with their validation under production conditions.
  4. Tools for error detection and automatic error handling, as well as the design and realisation of preventive actions that check the infrastructure, e.g. between jobs, and support regular maintenance.
  5. Sustainability through the definition of standards-compliant interfaces and an integrated framework that aims to combine the not yet synchronised developments of tools for monitoring and management, cluster virtualisation, policy-based management and knowledge-based data analysis.