Introduction
- A reference roadmap for cloud high availability and reliability is proposed.
- The problem space is divided into four major parts to give a big picture.
- A comprehensive cloud task taxonomy and failure classification are presented.
- Research gaps neglected in the literature are identified.
Research background
Actors | Definition |
---|---|
Cloud consumer | Any individual person or organization that has a business relationship with cloud providers and consumes available services |
Cloud provider | Any individual entity or organization which is responsible for making services available and providing computing resources to cloud consumers |
Cloud broker | An IT entity that provides an entry for managing performance and QoS of cloud computing services. In addition, it helps cloud providers and consumers with management of service negotiations |
Cloud auditor | A party that can provide an independent evaluation of the services offered by cloud providers in terms of performance, security and privacy impact, information system operations, etc. in cloud environments |
Cloud carrier | An intermediary party that provides access and connectivity to consumers through access devices such as networks. The cloud carrier transports services from a cloud provider to cloud consumers |
Reference | Main idea | Advantages | Challenges | Evaluation metrics |
---|---|---|---|---|
[46] | Investigating the effect of the repair policy and system parameters on cloud availability | Repair policy effectiveness is evaluated; differential analysis is used to analyze parameter sensitivity | Types of failures are not considered; the number of physical machines is limited; different types of repair facilities are missing from the evaluation | MTTR; steady-state availability; MTTM; pool size |
[47] | Storage services availability evaluation using hierarchical models | Adoption of an availability importance index; identification of critical components' availability | Case study and assessments are limited to the Eucalyptus platform | MTTF MTTR File Size MaxClients InService Throughput |
[32] | Using VM replicas in cloud datacenters to provide high availability | Resource optimization while assuring availability | VM and application scheduling is not considered in the proposed method; evaluation is on a small cloud infrastructure | Latency OMG DDS QoS Standard Deviation |
[48] | Applying non-sequential Monte Carlo simulation to reliability evaluation | A new cloud computing test-bed was developed; a new algorithm for expansion planning was presented | This approach cannot be used for modeling other reliability features such as live VM migration | Number of failures; number of VM allocations |
[49] | Using a combination of a stochastic Petri net model and a proposed cloud scoring system | Considering both cloud consumers and cloud providers in the proposed method; proposing a cloud scoring system | The overhead and cost of the proposed cloud scoring system are not considered; user requirements are limited to cost and energy in this study | OPEX option; carbon footprint option; overload factor; deployment distances; relative average utilization |
[50] | Comparing two fault tolerance techniques according to the cloud consumers’ and providers’ requirements | Considering both cloud consumers’ and cloud providers’ requirements | A failure prediction mechanism is required | MTBF; electricity bill; failure prediction accuracy; energy consumption; task completion rate |
[51] | Amending the current cloud simulators to support HA features | Considering green computing | Limited availability evaluation metrics | Requests per second; average service time; power consumption |
Proposed reference roadmap
Where?
“Where are the vital parts for providing HA in the body of cloud computing datacenters?”
- The agreement context: the signatory parties (generally the consumer and the provider, and possibly third parties entrusted to enforce the agreement), an expiration date, and any other relevant information.
- A description of the offered services, including both functional and non-functional aspects such as QoS.
- The obligations of each party, which are mainly domain-specific.
- Policies: the penalties incurred if an SLA term is not respected and an SLA violation occurs.
- Customer-based SLA: an agreement with a single customer that covers all the necessary services. It is similar to the SLA between an IT service provider and the IT department of an organization for all required IT services.
- Service-based SLA: a general agreement for all customers who use the services delivered by the service provider.
- Multi-level SLA: an agreement that can be split into different levels, with each level addressing a different set of customers for the same services.
- 80% of cloud service providers’ profits may come from 20% of customers.
- 80% of requested services consist of just 20% of the entire cloud providers’ services.
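The 80/20 observations above can be checked against a provider's own request logs. The following is a minimal sketch, not the authors' method: the function name `smallest_share_covering` and the request counts are hypothetical and purely illustrative.

```python
def smallest_share_covering(counts, target=0.8):
    """Return the fraction of items, taken largest first, needed to cover
    `target` of the total count."""
    total = sum(counts.values())
    if total <= 0:
        raise ValueError("counts must sum to a positive value")
    ranked = sorted(counts.values(), reverse=True)
    covered = 0
    for used, value in enumerate(ranked, start=1):
        covered += value
        if covered >= target * total:
            return used / len(ranked)
    return 1.0


# Hypothetical request counts per service (illustrative only, not from the paper).
requests_per_service = {
    "vm": 520, "storage": 300, "db": 70, "queue": 40, "cdn": 30,
    "dns": 15, "mail": 10, "ml": 5, "iot": 5, "batch": 5,
}
share = smallest_share_covering(requests_per_service)
print(f"{share:.0%} of services receive at least 80% of requests")
```

With this made-up data, two of the ten services cover more than 80% of the requests, which matches the second observation. Real logs may show a weaker or stronger skew.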
Final class | Duration (h) | CPU (Cores) | Memory (GBs) |
---|---|---|---|
1: sss | Small | Small | Small |
2: sm* | Small | Med | All |
3: slm | Small | Large | Small + med |
4: sll | Small | Large | Large |
5: lss | Large | Small | Small |
6: lsl | Large | Small | Large |
7: llm | Large | Large + med | Small + med |
8: lll | Large | Large + med | Large |
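The class table above can be read as a lookup from bucketed resource levels to a final class label. The sketch below encodes that reading in Python. It assumes the duration, CPU, and memory values have already been bucketed into `small`, `med`, or `large`; the thresholds for that bucketing are not given in this section, and the function name `final_class` is hypothetical.

```python
LEVELS = ("small", "med", "large")


def final_class(duration, cpu, memory):
    """Map bucketed (duration, cpu, memory) levels to a final class label.

    Returns None for combinations that the table does not list.
    """
    for level in (duration, cpu, memory):
        if level not in LEVELS:
            raise ValueError(f"unknown level: {level!r}")
    if duration == "small":
        if cpu == "small" and memory == "small":
            return "sss"
        if cpu == "med":
            return "sm*"  # any memory level
        if cpu == "large":
            return "sll" if memory == "large" else "slm"
    elif duration == "large":
        if cpu == "small":
            return {"small": "lss", "large": "lsl"}.get(memory)
        # cpu is "med" or "large" here
        return "lll" if memory == "large" else "llm"
    return None


print(final_class("small", "med", "large"))
print(final_class("large", "large", "med"))
```

Combinations such as a long task with small CPU and medium memory do not appear in the table, so the sketch returns `None` for them rather than guessing a class.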
Which?
“Which components play key roles in cloud computing HA and reliability?”
Failure classification | Failure modes | Description |
---|---|---|
Software failures | System/application software failure [36] | Cloud tasks and VM hypervisors are software programs running on different computing nodes, and they may contain software faults, bugs, and errors |
Software failures | Database failure [36] | Any database system can suffer a hardware or software failure, so database systems are prone to losing data |
Hardware failures | Hardware component failure [65] | Computing resources generally have hardware components (such as storage devices, processing elements, and memory) that may also encounter hardware failures |
Hardware failures | Network failure [65] | When cloud tasks access remote data sources, the communication channels can break, which causes a network failure, especially during long transmissions of large datasets |
Cloud management system (CMS) failures | Overflow | There is usually a limit on the maximum number of incoming requests in the queue. If the queue is full, new requests are simply dropped, which is called an overflow failure. Waiting too long in the queue can also cause a timeout failure for new requests |
Cloud management system (CMS) failures | Timeout | A cloud service commonly has a due time set by the owner or the service monitor. If the waiting time of a queued request exceeds the due time, a timeout failure occurs and the request is dropped from the queue |
Cloud management system (CMS) failures | Data resource missing | In the CMS, the data resource manager should register data resources. However, some previously registered data may be removed without the data resource manager being updated, which causes a data resource missing failure |
Cloud management system (CMS) failures | Computing resource missing | Computing resource missing is a failure similar to data resource missing that can also happen in the cloud management system. It occurs, for example, when a PC is turned off without notifying the CMS |
Security failures | Customer faults [68] | Recent research results show that only a small portion of the security failures affecting cloud service consumers have been the provider’s fault. According to Gartner’s top predictions for IT users for 2016 and beyond, about 95% of cloud security failures through 2020 will be the customer’s fault |
Security failures | Software security breaches [69] | Software security breaches can lead to cloud service failures. When attackers gain access to customer information such as login data, credits, etc. through cloud-based software security breaches, the result can be serious problems for customers who rely on cloud services for their daily activities |
Security failures | Security policy failure [69] | Miscalculating the cloud security requirements when defining a security policy is a major challenge that leads to system failures. Common mistakes in defining a comprehensive security policy are among the main reasons for security failures |
Human operational faults | Misoperation [67] | This kind of failure results from accidental faults made by the personnel operating or configuring the system, both during system updates and during a repair process. How far a misoperation affects the cloud system depends on the level at which the fault occurs |
Human operational faults | Misconfiguration [67] | Misconfigured network node software can affect a whole cluster or even a whole datacenter in a cloud system. The worst case, however, is a misconfiguration of the cloud management software, which can bring down the entire cloud at once |
Environmental failures | Environmental disasters [67] | Environmental disasters can play a major role in the dependability of a cloud system. Events such as floods, power outages, and fires are outside the control of the service provider, but they can always interrupt service provision. Because they affect a whole cloud datacenter, their consequence can be a very large-scale service disruption |
Environmental failures | Cooling system failure [67] | The functionality of the physical servers in a cloud datacenter also depends on the thermal conditions of the location where they are installed, so a failure of the air-conditioning system also causes failures in service provision. The servers either shut down completely or are under-utilized for offering services, and hence can be regarded as unavailable |
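The overflow and timeout failures described for the cloud management system in the table above can be illustrated with a bounded request queue. This is a minimal sketch under stated assumptions: the class `RequestQueue`, its parameters `max_length` and `due_time`, and the sample timestamps are all hypothetical and do not come from any particular CMS.

```python
from collections import deque


class RequestQueue:
    """Bounded FIFO queue that records overflow and timeout failures."""

    def __init__(self, max_length, due_time):
        self.max_length = max_length  # assumed queue capacity
        self.due_time = due_time      # assumed maximum waiting time
        self._queue = deque()         # entries are (request_id, arrival_time)
        self.overflowed = []
        self.timed_out = []

    def submit(self, request_id, now):
        """Enqueue a request, or drop it as an overflow failure if the queue is full."""
        if len(self._queue) >= self.max_length:
            self.overflowed.append(request_id)
            return False
        self._queue.append((request_id, now))
        return True

    def next_request(self, now):
        """Return the oldest request still within its due time.

        Requests that waited longer than the due time are dropped as timeouts.
        """
        while self._queue:
            request_id, arrived = self._queue.popleft()
            if now - arrived > self.due_time:
                self.timed_out.append(request_id)
                continue
            return request_id
        return None


queue = RequestQueue(max_length=2, due_time=5.0)
queue.submit("r1", now=0.0)
queue.submit("r2", now=4.0)
queue.submit("r3", now=4.5)            # queue is full: overflow failure
served = queue.next_request(now=6.0)   # r1 has waited too long: timeout failure
print(served, queue.overflowed, queue.timed_out)
```

Here the third request is dropped on arrival because the queue is full, and the first request is dropped when it is dequeued after its due time, so the second request is the one that gets served.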
- Manage a pool of heterogeneous resources.
- Provide remote access for end users.
- Monitor system security.
- Manage resource allocation policies.
- Manage tracking of resource usage.
When?
“When will reliability and HA decrease in cloud computing environments?”
Combinatorial model types
State-space models
Hierarchical models
Reliability | Maintainability | Availability |
---|---|---|
Constant | Decrease | Decrease |
Constant | Increase | Increase |
Increase | Constant | Increase |
Decrease | Constant | Decrease |
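The four rows of the table above are consistent with the standard steady-state availability formula, A = MTTF / (MTTF + MTTR), which this section does not state explicitly. The sketch below assumes that formula, treats a longer MTTF as higher reliability and a shorter MTTR as higher maintainability, and uses hypothetical values in hours.

```python
def steady_state_availability(mttf, mttr):
    """Steady-state availability A = MTTF / (MTTF + MTTR)."""
    if mttf <= 0 or mttr < 0:
        raise ValueError("mttf must be positive and mttr non-negative")
    return mttf / (mttf + mttr)


# Hypothetical baseline: 1000 h mean time to failure, 10 h mean time to repair.
base = steady_state_availability(mttf=1000.0, mttr=10.0)

# Reliability constant, maintainability decreases or increases (repair slower or faster).
slower_repair = steady_state_availability(mttf=1000.0, mttr=20.0)
faster_repair = steady_state_availability(mttf=1000.0, mttr=5.0)

# Maintainability constant, reliability increases or decreases.
more_reliable = steady_state_availability(mttf=2000.0, mttr=10.0)
less_reliable = steady_state_availability(mttf=500.0, mttr=10.0)

print(f"base={base:.4f} slower_repair={slower_repair:.4f} "
      f"faster_repair={faster_repair:.4f}")
print(f"more_reliable={more_reliable:.4f} less_reliable={less_reliable:.4f}")
```

In each case availability moves in the same direction as the quantity that changed, which is the pattern the table summarizes.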
How?
“How can high availability and reliability be provided while preventing performance degradation or supporting graceful degradation?”
FT policy | FT technique | Description |
---|---|---|
Proactive FT | Preemptive migration | Preemptive migration involves suspending a process, recording its state, transferring it to another node, and resuming the process on the new node. It makes use of a feedback-loop control system in which applications are constantly monitored and analyzed |
Proactive FT | Software rejuvenation | Software rejuvenation can be applied proactively because software aging inevitably leads to software system failures. The technique schedules periodic reboots of the system. After each reboot, the system resumes from a clean state |
Reactive FT | Checkpointing/restart | The application checkpoint/restart technique saves the state of a running application so that its execution can be resumed later, on any arbitrary machine, from the point at which it was checkpointed. After a failure has occurred, the application is restarted from the last checkpoint instead of rerunning the whole application from scratch. It is an efficient fault tolerance technique for computation-intensive applications hosted in the cloud |
Reactive FT | Replication | Replication is one of the most popular techniques used under the reactive policy. In cloud computing, replication is applied by keeping multiple replicas of data and services, so an incoming request can be handled by a set of available replicas. Several replicas run on different computing resources to complete the requested task |
Reactive FT | Task resubmission | The failed task can be resubmitted either to the same host or to a different host at system runtime, without interrupting the system workflow |
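To make the checkpointing/restart row of the table above concrete, the following sketch saves a task's state after every step and resumes from the saved state after a simulated failure. It is an illustration under stated assumptions, not an implementation from the surveyed work: the function `run_with_checkpoints`, the JSON checkpoint file, and the `fail_at` parameter are all hypothetical.

```python
import json
import os
import tempfile


def run_with_checkpoints(n_steps, checkpoint_path, fail_at=None):
    """Sum the integers 0..n_steps-1, checkpointing after every step.

    If a checkpoint file exists, execution resumes from the saved state.
    `fail_at` simulates a crash just before the given step is executed.
    """
    state = {"next_step": 0, "total": 0}
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            state = json.load(f)
    for step in range(state["next_step"], n_steps):
        if fail_at is not None and step == fail_at:
            raise RuntimeError(f"simulated failure at step {step}")
        state["total"] += step
        state["next_step"] = step + 1
        # Write to a temporary file first so that a crash during the write
        # cannot leave a half-written checkpoint behind.
        tmp_path = checkpoint_path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump(state, f)
        os.replace(tmp_path, checkpoint_path)
    return state["total"]


with tempfile.TemporaryDirectory() as workdir:
    path = os.path.join(workdir, "task.ckpt")
    try:
        run_with_checkpoints(10, path, fail_at=6)  # first run crashes
    except RuntimeError:
        pass
    with open(path) as f:
        resumed_from = json.load(f)["next_step"]
    result = run_with_checkpoints(10, path)        # restart resumes the work
    print(resumed_from, result)
```

The second run repeats none of the completed steps; it continues from the step recorded in the checkpoint. Checkpointing after every step is a simplification, since real systems usually trade checkpoint frequency against the overhead of saving state.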
Discussion and open issues
Conclusion
- A comprehensive study of cloud failure modes, causes and failure rate, and reliability/availability measuring tools;
- A highly utilized and more profitable cloud economy that can guarantee the provision of highly available and reliable services.
- Evaluating availability and reliability of cloud computing system and components based on the proposed architecture;
- Studying the mutual impact of HA mechanisms and VM performance overhead.