Can Computers Heal Themselves?
This article showcases new trends in self-managed computing systems that configure, heal, and protect themselves and adapt to the user's needs automatically. The idea derives from the autonomic nervous system of the human body, which controls the entire body without conscious effort. Building self-managing and self-healing systems, moving towards autonomic computing, is a grand challenge of today's complex computing world. With increasing complexity in hardware and software and the spread of everyday computing, reliable, self-healing system administration has become a pressing need. We evaluate that need and focus on the challenges of embedding a system thinker that helps the system heal itself, at least partially if not entirely.
Genesis:
The idea behind autonomic computing derives from the autonomic nervous system of the human body, which controls important bodily functions without any conscious intervention. Drawing an analogy to this phenomenon, IBM proposed in 2001 to create self-managing computer systems that could automatically configure, heal, optimize, and protect themselves. Sun Microsystems, Hewlett-Packard, and Microsoft followed suit, and their products are leading the development and implementation of autonomic computing.
IBM calls this seminal field of research a grand challenge: "a problem that, by virtue of its degree of difficulty and the importance of its solution, both from a technical and societal point of view, becomes the focus of interest to a specific scientific community."
What is Autonomic Computing?
Autonomic computing introduces a new buzzword, self-CHOP: systems that self-configure, self-heal, self-optimize, and self-protect without human intervention, thereby delivering the benefits of self-managed systems.
Self-configure:
Let’s try picturing this:
An organization is running an Enterprise Application Integration (EAI) application in a clustered environment with failover mechanisms and other distributed and object-oriented features built into it. It includes a set of Java EE servers deployed on a cluster of nodes, with each tier of the system replicated for better performance and availability. Every day this application caters to thousands of requests over the internet and carries out hundreds of transactions per minute across locations on five continents.
Given the complexity and size of this application and its environment, it takes a group of system administrators to install the system in all the physical locations and to configure the application and its associated resources such as JMS queues, mail servers, application servers, workflow servers, and many more. This entire process of installation and configuration consumes several person-days, multiplied by the number of such installations. A self-configuring system would carry out this installation and configuration automatically from high-level policies, leaving little or nothing to be done by hand.
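To make this concrete, here is a minimal sketch, in Java, of what such a self-configuring installer might look like. The policy file name, the resource element names, and the configuration step are illustrative assumptions rather than a real product API.

```java
import java.io.File;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

/**
 * Minimal sketch of a policy-driven self-configurator (hypothetical).
 * It reads an XML policy listing resources (JMS queues, mail servers, etc.)
 * and configures each one, so that a new node needs no manual setup.
 */
public class SelfConfigurator {

    public void configureFrom(File policyFile) throws Exception {
        Document policy = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().parse(policyFile);
        NodeList resources = policy.getElementsByTagName("resource");
        for (int i = 0; i < resources.getLength(); i++) {
            Element resource = (Element) resources.item(i);
            String type = resource.getAttribute("type");   // e.g. "jms-queue"
            String name = resource.getAttribute("name");   // e.g. "OrderQueue"
            configureResource(type, name);
        }
    }

    private void configureResource(String type, String name) {
        // In a real system this would call the application server's admin API;
        // here we only log the intended action.
        System.out.println("Configuring " + type + " '" + name + "'");
    }

    public static void main(String[] args) throws Exception {
        new SelfConfigurator().configureFrom(new File("autonomic-policy.xml"));
    }
}
```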
Self-Heal
Autonomic computing suggests building software systems with self-healing features that monitor the application and server logs periodically for system failures, and have enough intelligence built in to diagnose the root cause, determine the problem, and finally recover the system from the show-stopper. In more down-to-earth terms, this means first detecting and isolating the failed component, taking it offline, fixing or replacing it, doing an auto-build of the system, and redeploying the application without any human intervention. For instance, an autonomic system would respond to a failed database index by re-indexing the files, and subsequently testing them and loading them back into production. If the issue lies with storage constraints, the self-healing manager would automatically extend file space and database storage based on historical data on growth and expansion.
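A bare-bones sketch of such a self-healing monitor is shown below; the log location, the failure signature, and the recovery step are assumptions made purely for illustration.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;

/**
 * Minimal sketch of a self-healing monitor, assuming a hypothetical log file
 * and recovery action. It periodically scans the application log for a known
 * failure signature and triggers a recovery step when it finds one.
 */
public class SelfHealingMonitor {

    private static final Path LOG = Paths.get("logs/application.log"); // assumed location
    private static final String FAILURE_SIGNATURE = "ORA-01578";       // e.g. a corrupted index/block

    public static void main(String[] args) throws Exception {
        while (true) {
            if (failureDetected()) {
                recover();
            }
            Thread.sleep(60_000); // poll once a minute
        }
    }

    private static boolean failureDetected() throws IOException {
        if (!Files.exists(LOG)) return false;
        List<String> lines = Files.readAllLines(LOG);
        return lines.stream().anyMatch(line -> line.contains(FAILURE_SIGNATURE));
    }

    private static void recover() {
        // Placeholder for the actual healing step: take the component offline,
        // rebuild the index, test it, and bring it back into production.
        System.out.println("Failure detected: rebuilding index and redeploying...");
    }
}
```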
Self-Optimize
Autonomic computing systems are empowered with a self-optimizing workload manager capable of logical partitioning and dynamic server clustering, extended across multiple heterogeneous systems to provide a single collection of computing resources across the enterprise. Whether the issue lies with storage, databases, networks, or other resources, the workload manager continually monitors and tunes the available resources for optimal usage. Formulating new algorithms for this self-optimizing design pattern is an open area of research that calls upon advanced data management techniques and feedback mechanisms drawn from control theory.
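As a rough illustration of such a feedback mechanism, the sketch below tunes the size of a worker pool toward a target backlog; the thresholds and intervals are arbitrary assumptions, not a prescription.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

/**
 * Minimal sketch of a self-optimizing feedback loop (hypothetical parameters).
 * It watches the backlog of a worker pool and grows or shrinks the pool, the
 * way a workload manager would continually tune resources toward a target.
 */
public class WorkloadTuner {

    private static final int TARGET_BACKLOG = 50; // desired queue length (assumption)

    public static void main(String[] args) {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                4, 64, 60, TimeUnit.SECONDS, new LinkedBlockingQueue<>());

        ScheduledExecutorService tuner = Executors.newSingleThreadScheduledExecutor();
        tuner.scheduleAtFixedRate(() -> {
            int backlog = pool.getQueue().size();
            int workers = pool.getCorePoolSize();
            if (backlog > TARGET_BACKLOG && workers < pool.getMaximumPoolSize()) {
                pool.setCorePoolSize(workers + 1);   // under-provisioned: add a worker
            } else if (backlog < TARGET_BACKLOG / 2 && workers > 1) {
                pool.setCorePoolSize(workers - 1);   // over-provisioned: release a worker
            }
            System.out.println("backlog=" + backlog + " workers=" + pool.getCorePoolSize());
        }, 5, 5, TimeUnit.SECONDS);
    }
}
```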
Self-Protect
We live in uncertain times where every software and hardware vulnerability is exploited with malicious intent. For instance, an unethical hacker might exploit a memory leak in the printer spooler of a particular unit to flood the entire LAN and jam the network. Self-protecting autonomic systems will be able to diagnose the attack, isolate the component in question, and redirect printer usage to an alternate location without human intervention. Self-protection also acts as an early warning mechanism to anticipate and prevent system failures.
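The following sketch hints at how such a self-protecting check might look; the thresholds, host handling, and spooler names are invented for the example.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/**
 * Minimal sketch of a self-protection check, with made-up thresholds and
 * spooler names. It counts print requests per host and, when one host floods
 * the spooler, quarantines it and redirects further jobs to a backup spooler.
 */
public class SpoolerGuard {

    private static final int MAX_JOBS_PER_MINUTE = 200;  // assumed threshold
    private final Map<String, Integer> jobsPerHost = new ConcurrentHashMap<>();

    /** Returns the spooler that should handle the job from the given host. */
    public String route(String host) {
        int count = jobsPerHost.merge(host, 1, Integer::sum);
        if (count > MAX_JOBS_PER_MINUTE) {
            System.out.println("Quarantining " + host + ": suspicious print volume");
            return "backup-spooler";   // isolate the misbehaving host
        }
        return "primary-spooler";
    }

    /** Called once a minute by a scheduler to reset the counting window. */
    public void resetWindow() {
        jobsPerHost.clear();
    }
}
```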
Autonomic System Architecture:
White et al. have described an architectural approach to achieving the goals of autonomic computing. They suggest making each autonomic element responsible for monitoring its input services and determining whether they are performing according to the agreed-upon agreements and contracts covering them. In case of a failure, partial or complete, whether because of wrong or out-of-bounds results or something else, the requesting autonomic element may react by cutting off its relationship with the problem-creating service and requesting a fresh one.
In an architecture supporting autonomic computing, a control loop spans a centralized, or possibly distributed and horizontally partitioned, knowledge base to sniff out problems detected through sensors: monitoring, analyzing, planning, executing the action plan, and finally triggering the effector that applies the solution to self-heal the system. The sources of such problem reports could be system logs, exceptions that occurred but were not handled and were only reported through log files, in-memory processes, or agents planted on client machines. The knowledge base holds rules mapping the symptoms of such incidents to possible execution plans. When no solution exists, the knowledge base is periodically analyzed for possible solution schemes and action plans, which are then used for later occurrences.
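The control loop described above can be summarized in a small skeleton like the one below; the Sensor and Effector interfaces and the symptom-to-plan map are hypothetical stand-ins for the real sensing and actuation machinery.

```java
import java.util.Map;

/**
 * Minimal sketch of the monitor-analyze-plan-execute control loop over a
 * shared knowledge base. Sensor, Effector, and the symptom-to-plan lookup are
 * hypothetical interfaces, not a standard API.
 */
public class AutonomicManager {

    interface Sensor   { String readSymptom(); }           // e.g. tail a log file
    interface Effector { void apply(String actionPlan); }   // e.g. restart a component

    private final Sensor sensor;
    private final Effector effector;
    private final Map<String, String> knowledgeBase;         // symptom -> action plan

    AutonomicManager(Sensor sensor, Effector effector, Map<String, String> knowledgeBase) {
        this.sensor = sensor;
        this.effector = effector;
        this.knowledgeBase = knowledgeBase;
    }

    /** One pass of the control loop: monitor, analyze, plan, execute. */
    public void runOnce() {
        String symptom = sensor.readSymptom();                // monitor
        if (symptom == null) return;
        String plan = knowledgeBase.get(symptom);             // analyze + plan
        if (plan != null) {
            effector.apply(plan);                             // execute
        } else {
            System.out.println("Unknown symptom, logged for later analysis: " + symptom);
        }
    }
}
```

The point of the skeleton is the separation of concerns: sensors only report symptoms, effectors only apply plans, and everything in between is driven by the knowledge base, which can be updated without touching the loop.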
Object-oriented Autonomic Computing:
Object-oriented software systems are based on hiding the implementation and exposing a public interface through which outside clients control behavior and state changes. If an object fails to restore its state, or ends up in an inconsistent state because of a transaction, an autonomic engine implemented at the object level could bring it back to a consistent state based on a consistency-profile check.
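A tiny illustration of this idea follows; the Account class and its invariant are made up, but they show the pattern of checking a consistency profile after each state change and rolling back to the last good snapshot.

```java
/**
 * Minimal sketch of object-level self-healing: the object checks a simple
 * consistency profile (its invariant) after each state change and rolls back
 * to the last known-good snapshot if the invariant is violated.
 */
public class Account {

    private long balance;
    private long lastConsistentBalance;   // last snapshot known to satisfy the invariant

    public Account(long openingBalance) {
        this.balance = openingBalance;
        this.lastConsistentBalance = openingBalance;
    }

    public void apply(long delta) {
        balance += delta;
        if (isConsistent()) {
            lastConsistentBalance = balance;   // commit the snapshot
        } else {
            balance = lastConsistentBalance;   // self-heal: roll back to consistency
            System.out.println("Inconsistent state detected, rolled back");
        }
    }

    /** The consistency profile: here, simply that the balance never goes negative. */
    private boolean isConsistent() {
        return balance >= 0;
    }
}
```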
Service-oriented architecture (SOA) provides a loosely coupled composition of services. With heterogeneous services offered on varied platforms, web services built on distributed middleware technology use open standards for interoperability, describing services and exchanging messages in XML. With SOA in conjunction with web services, application integration with cross-platform interoperability, scalability, and availability becomes achievable. Loose coupling and asynchronous linkage by messaging are important aspects of SOA, because a problem-creating component can be quarantined and a fresh copy of the component re-instantiated to serve the same purpose after resuming from the last rolled-back consistent state.
In a loosely coupled SOA, when an attempt to respond to a request for a designated service fails, the log of failures can be analyzed to detect the cause of such failures, and corrective action can be taken. For example, in a Java EE application, if a JDBC connection fails to connect to a designated database, the cause of the failure can be detected from the nature of the exception. If it was caused by a class-not-found exception, the classpath and the existence of the related driver can be searched automatically, and depending on the cause, a proper action plan drawn from a rule-based knowledge base can be carried out by the effector.
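A possible shape for that diagnosis step is sketched below; the SQLState interpretation follows the standard JDBC convention for connection errors, while the remediation comments are assumptions.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

/**
 * Minimal sketch of diagnosing a JDBC connection failure by the nature of the
 * exception, as described above. The remediation steps are illustrative.
 */
public class JdbcDiagnoser {

    public Connection connectWithDiagnosis(String url, String user, String password) {
        try {
            return DriverManager.getConnection(url, user, password);
        } catch (SQLException e) {
            if (e.getCause() instanceof ClassNotFoundException
                    || (e.getMessage() != null && e.getMessage().contains("No suitable driver"))) {
                // Driver missing: an effector could add the driver jar to the classpath.
                System.out.println("Diagnosis: JDBC driver not on classpath");
            } else if (e.getSQLState() != null && e.getSQLState().startsWith("08")) {
                // SQLState class 08 = connection exception: database likely unreachable.
                System.out.println("Diagnosis: database server unreachable or down");
            } else {
                System.out.println("Diagnosis: unknown failure, logging for analysis");
            }
            return null;
        }
    }
}
```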
Suggested Approach:
We can have multiple agents running as daemon processes on client machines to work as sniffers for problems. Each agent keeps track of the system and application logs for applications that are registered and configured for autonomic computing. When an exception comes out of a running Java-based web application and the failure falls within a known set of problems, the agent attempts corrective steps, such as repairing the database, or running an auto-correct code snippet at the server end by sending a remote procedure call (RPC) or invoking Java RMI (Remote Method Invocation) to rectify the problem or prevent it from recurring.
If the garbage collector in a Java-based application fails due to PermGen errors (an overflow of the permanent-generation area of JVM memory), the healing agent could temporarily stall the application and resume it, or automatically restart a fresh copy of the application after reporting the problem.
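One way such an agent could be sketched is shown below, assuming a hypothetical log location and restart script.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

/**
 * Minimal sketch of a healing agent for the scenario above. The log location,
 * failure signature, and restart script are assumptions made for illustration.
 */
public class PermGenHealer {

    private static final Path SERVER_LOG = Paths.get("logs/server.log");
    private static final String SYMPTOM = "java.lang.OutOfMemoryError: PermGen space";

    public static void main(String[] args) throws IOException, InterruptedException {
        boolean exhausted;
        try (Stream<String> lines = Files.lines(SERVER_LOG)) {
            exhausted = lines.anyMatch(l -> l.contains(SYMPTOM));
        }
        if (exhausted) {
            System.out.println("PermGen exhaustion detected, restarting application...");
            // Hypothetical restart script; a real agent would call the server's admin API.
            new ProcessBuilder("./restart-app.sh").inheritIO().start().waitFor();
        }
    }
}
```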
We suggest that the problem-determination algorithms, which are crucial to any autonomic system, be designed to be rule-based, with a rules-based engine (RBE) applying fuzzy logic to the set of exceptions that the log tracer/analyzer extracts from a system log after a failure. The RBE can be made easily configurable by using XML as the underlying knowledge repository. On encountering multiple possible causes for a system failure, the self-healing manager can use fuzzy logic to determine the root cause. For instance, if the log analyzer/tracer reveals a scenario where "the application server cannot connect to the database" but "the application server can ping the database server machine", the RBE can figure out that the database server is down, and consequently the workload manager may take appropriate action to recover from the crash.
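A minimal sketch of such an RBE follows; the rules, symptom names, and weights are illustrative, and the weights are only a crude stand-in for fuzzy membership values that a real system would load from the XML knowledge repository.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.Set;

/**
 * Minimal sketch of the suggested rules-based engine. Each rule assigns a
 * weight to a diagnosis when its symptoms are observed; the engine picks the
 * diagnosis with the highest aggregate score.
 */
public class RulesEngine {

    record Rule(Set<String> symptoms, String diagnosis, double weight) {}

    private final List<Rule> rules = List.of(
        new Rule(Set.of("db-connect-failed", "db-host-ping-ok"), "database server down", 0.9),
        new Rule(Set.of("db-connect-failed", "db-host-ping-failed"), "network outage", 0.8),
        new Rule(Set.of("db-connect-failed"), "transient database error", 0.3)
    );

    /** Returns the diagnosis whose matching rules carry the highest total weight. */
    public Optional<String> diagnose(Set<String> observed) {
        Map<String, Double> scores = new HashMap<>();
        for (Rule r : rules) {
            if (observed.containsAll(r.symptoms())) {
                scores.merge(r.diagnosis(), r.weight(), Double::sum);
            }
        }
        return scores.entrySet().stream()
                .max(Map.Entry.comparingByValue())
                .map(Map.Entry::getKey);
    }

    public static void main(String[] args) {
        Set<String> observed = Set.of("db-connect-failed", "db-host-ping-ok");
        System.out.println(new RulesEngine().diagnose(observed).orElse("unknown"));
        // prints: database server down
    }
}
```

Keeping the rules as data rather than code is what makes the engine easy to reconfigure: new symptoms and action plans can be added to the XML repository without redeploying the self-healing manager itself.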
Trends in Autonomic Computing:
The increasing heterogeneity, dynamism, and interconnectivity of software applications, services, and networks have led to complex, unmanageable, and insecure systems. Coping with such complexity necessitates investigating this new paradigm of computers that heal themselves, and points to two broad areas of research: technologies related to autonomic computing, and the development of autonomic computing products. Open areas of research include peer-to-peer and grid computing as means of implementing autonomic systems, and designing autonomic managers in a multi-layer P2P form, so that the autonomic behavior and the underlying RBE knowledge base are kept in separate layers.
The scope of autonomic computing involves not only the rules-based, enterprise-wide applications we build, but also the underlying operating systems, middleware, database systems, server/network systems, and shared services. This will become evident in B2B and B2C collaboration.