Publication Details PU-1 - "A Task Replication and Fair Resource Management Scheme for Fault Tolerant Grids"


February 2005; Antonios Litke, Konstantinos Tserpes, Konstantinos Dolkas, Theodora Varvarigou

Abstract:
In this paper we study a fault-tolerant model for Grid environments based on task replication. The basic idea is to produce multiple replicas of a given task and submit them to the Grid, assuming that the failure probability of each replica is known a priori. We introduce a scheme for calculating the number of replicas when the replicas of a task have diverse failure probabilities, and we propose an efficient resource management scheme, based on the fair share technique, which handles the task replicas so that fault tolerance is maintained fairly across the Grid. The study concludes with simulation results that validate the proposed scheme.
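
The replica-count idea in the abstract can be illustrated with a minimal sketch. It is not the scheme from the paper: it assumes that replicas fail independently, that a task fails only when all of its replicas fail, and that replicas are taken in the order given. The function name `replicas_needed` and the parameter `max_task_failure` are hypothetical and chosen only for this example.

```python
from typing import Sequence


def replicas_needed(failure_probs: Sequence[float], max_task_failure: float) -> int:
    """Return how many replicas, taken in the given order, are needed so that
    the probability of all of them failing is at most max_task_failure.

    Assumption (not from the paper): replica failures are independent, so the
    probability that every replica fails is the product of the individual
    failure probabilities.
    """
    if not 0.0 < max_task_failure < 1.0:
        raise ValueError("max_task_failure must lie strictly between 0 and 1")

    joint_failure = 1.0  # probability that all replicas chosen so far fail
    for count, p in enumerate(failure_probs, start=1):
        if not 0.0 <= p <= 1.0:
            raise ValueError("each failure probability must lie in [0, 1]")
        joint_failure *= p
        if joint_failure <= max_task_failure:
            return count

    raise ValueError("the available replicas cannot meet the failure target")
```

For example, with replica failure probabilities 0.2, 0.1 and 0.05 and a target task failure probability of 0.03, the first two replicas already suffice, since 0.2 × 0.1 = 0.02, so the function returns 2.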

Source:
Proceedings of European Grid Conference 2005 (EGC2005)
