Date: 27/02/2010
Early detection technology for cloud computing system failures from Fujitsu
Fujitsu Laboratories has developed a technology that detects signs of system failures before they happen by improving how cloud system data is gathered and analyzed, then narrows down the causes of those failures and resolves them automatically. The technology reduces the workload of administrators and allows users to utilize the cloud with confidence. Technologies such as this improve the reliability and stability of cloud systems.
1. Detects signs of failures
Fujitsu Laboratories has developed two technologies for detecting signs of failure, each suited to a different type of failure.
(1) Detection of failures through the analysis of system messages:
This technology focuses on specific patterns in messages that are generated just before failures occur and detects warning signs. By comparing the pattern of generated messages with messages from previous system failures, the technology can pick up on signs of failure.
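The comparison described above can be sketched roughly as follows. This is only an illustration, not Fujitsu's method: it assumes messages have already been reduced to message IDs, and it uses Jaccard similarity over a sliding window as a stand-in for whatever pattern matching the real technology performs. The message IDs, the `KNOWN_PRECURSORS` table, the window size and the threshold are all hypothetical.

```python
from collections import deque


def jaccard(a, b):
    """Jaccard similarity between two collections (0.0 when both are empty)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def detect_warning_signs(messages, known_precursors, window=5, threshold=0.6):
    """Yield (index, failure_name, score) whenever the most recent window of
    message IDs resembles a pattern seen before a past failure."""
    recent = deque(maxlen=window)
    for i, msg_id in enumerate(messages):
        recent.append(msg_id)
        for name, pattern in known_precursors.items():
            score = jaccard(recent, pattern)
            if score >= threshold:
                yield i, name, score


# Hypothetical message IDs and precursor patterns, for illustration only.
KNOWN_PRECURSORS = {
    "disk_failure": {"IO_RETRY", "SMART_WARN", "FS_REMOUNT"},
}
stream = ["LOGIN", "IO_RETRY", "SMART_WARN", "FS_REMOUNT", "LOGIN"]
alerts = list(detect_warning_signs(stream, KNOWN_PRECURSORS, window=3))
print(alerts)
```

In this toy stream the only alert is raised at the fourth message, when the window holds exactly the three messages of the stored pattern.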
(2) Detection of potential failures that do not generate messages:
When configuring equipment such as servers, human error can lead to the input of incorrect settings. In this kind of situation, the server operates according to those settings and may not generate any error messages. An effective way to detect failures in this case is to capture the data packets that travel across the networks linking servers and systems, and then analyze minor changes at the packet level, such as data loss, retransmitted packets and transmission delays. To monitor the large-scale systems involved in cloud computing, Fujitsu Laboratories has developed a technology that is compatible with 10 Gbps high-speed communication and detects network and server system failures in real time.
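As a rough sketch of what packet-level analysis can look for, the following counts lost, retransmitted and delayed packets in a small set of records. It works offline on a simplified record and is in no way a 10 Gbps real-time implementation. The `Packet` fields, the flow label and the 0.2-second delay threshold are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Packet:
    flow: str                  # hypothetical flow label, e.g. "web1->db"
    seq: int                   # sequence number within the flow
    sent_at: float             # send time in seconds
    acked_at: Optional[float]  # None means no acknowledgement was observed


def summarize(packets, delay_threshold=0.2):
    """Count retransmitted, lost and delayed packets in a list of records."""
    seen = set()
    stats = {"retransmissions": 0, "lost": 0, "delayed": 0}
    for p in packets:
        key = (p.flow, p.seq)
        if key in seen:
            # The same sequence number sent again on the same flow.
            stats["retransmissions"] += 1
        seen.add(key)
        if p.acked_at is None:
            stats["lost"] += 1
        elif p.acked_at - p.sent_at > delay_threshold:
            stats["delayed"] += 1
    return stats


# Illustrative records only.
packets = [
    Packet("web1->db", 1, 0.00, 0.01),
    Packet("web1->db", 2, 0.01, None),   # never acknowledged
    Packet("web1->db", 2, 0.30, 0.31),   # sent again
    Packet("web1->db", 3, 0.32, 0.80),   # acknowledged late
]
stats = summarize(packets)
print(stats)
```

A monitoring system would compare such counts against a baseline for each flow, since a misconfigured server may show up only as a gradual rise in retransmissions or delay.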
2. Narrows down causes of failures
The technology examines the detected signs of system failure and infers which areas are most likely to have generated them. Using each observed symptom as a point of origin, it employs network and system configuration information to trace the symptom back to its possible causes. It then overlays the results of the traces from multiple points of origin and infers the most likely causes from the areas where the traces overlap most, or where expected activity is absent.
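The overlay step can be illustrated with a small sketch. It assumes the configuration information is available as a dependency map and covers only the "most overlap" criterion, not the detection of areas with missing activity. The topology and component names are hypothetical.

```python
from collections import Counter

# Hypothetical topology: each component maps to the components it depends on.
DEPENDS_ON = {
    "web1": ["switch_a", "db"],
    "web2": ["switch_a", "db"],
    "batch": ["switch_b", "db"],
    "db": ["storage"],
    "switch_a": [],
    "switch_b": [],
    "storage": [],
}


def upstream(component, depends_on):
    """Return every component reachable by following dependencies."""
    seen, stack = set(), [component]
    while stack:
        node = stack.pop()
        for dep in depends_on.get(node, []):
            if dep not in seen:
                seen.add(dep)
                stack.append(dep)
    return seen


def rank_suspects(symptoms, depends_on):
    """Trace each symptom upstream and rank components by how many traces
    pass through them (ties broken alphabetically)."""
    votes = Counter()
    for symptom in symptoms:
        votes.update(upstream(symptom, depends_on) | {symptom})
    return sorted(votes.items(), key=lambda kv: (-kv[1], kv[0]))


ranked = rank_suspects(["web1", "web2", "batch"], DEPENDS_ON)
print(ranked[:3])
```

Here all three traces pass through `db` and `storage`, so those two components rank ahead of `switch_a`, which appears in only two traces.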
3. Resolves causes of failures
The system draws on past knowledge of how system failures were handled, including system log information, and presents administrators with the most suitable methods for dealing with the causes it has identified. Because failures that have occurred before often occur again, the system stores previous cases of system failure, together with the history of procedures used to resolve them, in its knowledge base, so that it can quickly determine how to resolve the cause of a new failure.
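A knowledge base of this kind could be sketched, in a very reduced form, as a lookup from a determined cause to the procedures that resolved it in the past. The `KnowledgeBase` class, its method names and the sample causes and procedures are all hypothetical, and ranking by how often a procedure was used is an assumption rather than a description of Fujitsu's system.

```python
from collections import Counter, defaultdict


class KnowledgeBase:
    """Stores which procedures resolved which causes in past failures."""

    def __init__(self):
        self._cases = defaultdict(Counter)

    def record(self, cause, procedure):
        """Record that `procedure` resolved a failure caused by `cause`."""
        self._cases[cause][procedure] += 1

    def suggest(self, cause, limit=3):
        """Return up to `limit` procedures for `cause`, most used first."""
        history = self._cases.get(cause, {})
        ranked = sorted(history.items(), key=lambda kv: (-kv[1], kv[0]))
        return [procedure for procedure, _ in ranked[:limit]]


# Illustrative cases only.
kb = KnowledgeBase()
kb.record("connection_pool_exhausted", "restart pool")
kb.record("connection_pool_exhausted", "restart pool")
kb.record("connection_pool_exhausted", "raise pool limit")

suggestions = kb.suggest("connection_pool_exhausted")
print(suggestions)
```

An unknown cause returns an empty list, which is the point at which an administrator would investigate manually and record the new case.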