Clustering--In Search for Scalable Commodity Supercomputing

Guest Editorial

 Informatica: An International Journal of Computing and Informatics, Vol 23, No. 1, 1999.



The history of computing can be viewed as a constant search for computational power. As soon as a new, more powerful computer is developed, a larger problem to be solved appears on the horizon. This need for computational power, instead of leveling off, grows day by day. During recent decades, various high performance computer systems have attempted to satisfy power-hungry users. The most common systems were:

  • Vector Computers (VC)
  • Massively Parallel Processors (MPP)
  • Symmetric Multiprocessors (SMP)
  • Cache-Coherent Nonuniform Memory Access Computers (CC-NUMA)
  • Distributed Systems

    Although vector computers provided the breakthrough needed for computational science to emerge as an independent discipline, they were only a partial answer, as they could deliver top performance for only a few classes of problems. Many powerful scalable MPP systems have been built, but most of them have failed commercially due to their high cost and low performance/price ratio. Symmetric multiprocessors are attractive, but they suffer from limited scalability. Nonuniform memory access computers address some of the scalability and cost issues, but they suffer from a single point of failure, as they use a single operating system kernel across all nodes (as SMPs do). Distributed systems are scalable, but they offer neither ease of use nor means for fast communication, both of which are essential for efficient execution of parallel applications.

    Recent years have witnessed a new direction in the search for computational power -- cluster computing (CC). (A cluster is a computer system that forms, in varying degrees, a single unified resource composed of several interconnected computers.) Although clustering has been around for more than 25 years, it did not gain momentum until three technology trends converged in the 1980s: the development of high performance microprocessors, the emergence of high-speed networks, and the maturation of standard tools for high performance distributed computing. Another trend worth mentioning in this context is the increased need for computing power in commercial applications, coupled with the high cost and low accessibility of traditional supercomputers.

    In recent years, the availability of high-speed networks and of high performance microprocessors and workstations as commodity components has made networks of workstations an appealing vehicle for cost-effective parallel computing. Clusters/networks of computers (workstations, PCs, or SMPs) built from commodity hardware and software (such as Linux, PVM, or MPI) play a major role in redefining the concept of high performance computing. As a whole, clusters are becoming an alternative to MPPs and supercomputers in many application areas.

    This Special Issue is the result of an extremely large number of submissions that we received for the Special Issue of the Parallel and Distributed Computing Practices (PDCP) journal [3]. Among more than five dozen submissions, 24 papers received very high recommendations from the reviewers. We could not publish them all in PDCP, but we were able to find a home for them here in INFORMATICA. Half of the selected papers will appear in PDCP, and the remaining half appears in this issue. The focus of this Special Issue is on both hardware and software aspects of cluster computing.

    There are many ways of looking at cluster computing. Typically we think of a cluster as a number of machines physically connected via a wired network, but this need not be the case in the future. We thus start with the paper by H. Zheng et al., Mobile Cluster Computing and Timeliness Issues, which presents an overview of the research issues involved in mobile cluster computing. In particular, the authors consider the problems that arise when cluster nodes migrate between cells of a wireless network. The next paper returns to Earth and considers how modern high-speed networks can be used to facilitate cluster computing. In High-Performance Cluster Computing over Gigabit/Fast Ethernet, J. Sang et al. consider cluster computing over Fast Ethernet and present practical results obtained using the NAS parallel benchmarks. The last paper addressing global issues is The Remote Enqueue Operation on Networks of Workstations by M. Katevenis et al. It contains an overview of the communication mechanisms necessary to support cluster computing; a remote-enqueue atomic operation is introduced and compared with possible alternatives.

    The second group of papers is devoted to the load balancing and scheduling issues of clustering. In Preserving Mutual Interests in High Performance Computing Clusters, O. Kremien et al. address one of the drawbacks of the PVM environment (its use of a simple round-robin process allocation policy) by adding a resource manager to the environment. This enables them to effectively support the construction of clusters built of heterogeneous computers. Another approach to load balancing is presented by A. Bevilacqua in A Dynamic Load Balancing Method on a Heterogeneous Cluster of Workstations. His load balancing algorithm is based on dynamic data assignment, and its performance is studied for a 3D image reconstruction problem. In Minimizing Communication Conflicts with Load-Skewing Task Assignment Techniques on Network of Workstations, W.-M. Lin and W. Xie study the load balancing problem for low-speed networks. They present an algorithm which is particularly well suited for bus-based communication. Finally, in Scheduling of I/O in Multiprogrammed Parallel Systems, P. Kwong and S. Majumdar address the effective management of parallel I/O for cluster computing. They develop a simulation model and compare the performance of various I/O scheduling strategies.
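    The difference between static round-robin allocation (PVM's default policy) and dynamic, load-aware assignment on a heterogeneous cluster can be illustrated with a small simulation. This is purely an illustrative sketch: the function names, task costs, and worker speeds below are invented for the example and are not taken from any of the papers in this issue.

```python
import heapq

def round_robin(tasks, speeds):
    """Statically assign task i to worker i mod n (round-robin),
    ignoring both task cost and worker speed."""
    loads = [0.0] * len(speeds)
    for i, cost in enumerate(tasks):
        w = i % len(speeds)
        loads[w] += cost / speeds[w]
    return max(loads)  # makespan: when the last worker finishes

def dynamic_pull(tasks, speeds):
    """Dynamic assignment: each task goes to whichever worker
    becomes idle first (a shared work-queue model)."""
    # min-heap of (time_when_free, worker_index)
    heap = [(0.0, w) for w in range(len(speeds))]
    heapq.heapify(heap)
    for cost in tasks:
        t, w = heapq.heappop(heap)
        heapq.heappush(heap, (t + cost / speeds[w], w))
    return max(t for t, _ in heap)

# Eight tasks of mixed cost on two workers, one twice as fast as the other.
tasks = [4, 4, 4, 4, 1, 1, 1, 1]
speeds = [1.0, 0.5]
print(round_robin(tasks, speeds))   # -> 20.0 (slow worker is overloaded)
print(dynamic_pull(tasks, speeds))  # -> 14.0 (work follows idle capacity)
```

    With heterogeneous workers, round-robin hands the slow machine the same share of work as the fast one, so it dominates the makespan; the pull model automatically skews load toward faster machines, which is the intuition behind replacing round-robin allocation with a resource manager.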

    The next two papers are devoted to fault tolerance. D. Kebbal et al., in Fault Tolerance of Parallel Adaptive Applications in Heterogeneous Systems, discuss fault tolerance for heterogeneous adaptive systems. The proposed fault tolerance policy, based on optimized coordinated checkpointing, is shown to be an effective strategy that allows recovery from failure by involving only a minimal part of the failed application. A fault tolerance approach based on utilizing the idle cycles of computers in the cluster is proposed by T. Setz in Fault Tolerant Execution of Compute-Intensive Distributed Applications in LiPS. The proposed approach eliminates the need for the application-wide synchronization otherwise used to generate sets of consistent checkpoints.

    The last three papers are devoted to application development. E. Manolakos and D. Galatopoullos, in JavaPorts: An Environment to Facilitate Parallel Computing on a Heterogeneous Cluster of Workstations, show how the Java language can be used for high performance computing. They present experimental results showing that good performance can be achieved even on a relatively slow 10 Mb/s Ethernet-based cluster of workstations. In Structured Performability Analysis of Parallel Applications, J. Dougherty presents a unified performance and dependability evaluation methodology for practical large-scale parallel applications. Experimental results comparing the performance obtained on a network of DEC Alpha stations with the performance predicted by the theoretical model are presented. Finally, D. Helman and J. JaJa, in Sorting on Clusters of SMPs, discuss the practical issues involved in developing an efficient sorting algorithm for a cluster of DEC Alpha SMP servers.

    We would like to express our deep gratitude to Prof. M. Gams, Managing Editor of INFORMATICA, who agreed to publish this Special Issue on very short notice. This issue would not have been possible without the help of the referees (listed below), who worked very hard to review all the submitted papers. We would like to thank them all.

    We hope you will find this special issue interesting.

    Guest Editors

    Rajkumar Buyya
    Co-Chair
    IEEE Task Force on Cluster Computing (TFCC)
    School of Computer Science and Software Engineering
    Monash University Clayton Campus, Melbourne, Australia.
    Email: rajkumar@ieee.org
    URL: http://www.dgs.monash.edu.au/~rajkumar/tfcc/

    Marcin Paprzycki
    Coordinator,
    IEEE TFCC Tech. Area--Algorithms and Applications
    Department of Computer Science and Statistics
    University of Southern Mississippi
    Hattiesburg, MS 39406, USA
    Email: m.paprzycki@usm.edu
    URL: http://orca.st.usm.edu/marcin/

    Bibliography

    [1] R. Buyya (editor). High Performance Cluster Computing: Systems and Architectures. Vol. 1, 1st ed., Prentice Hall PTR, NJ, 1999.

    [2] R. Buyya (editor). High Performance Cluster Computing: Programming and Applications. Vol. 2, 1st ed., Prentice Hall PTR, NJ, 1999.

    [3] R. Buyya and C. Szyperski. Special Issue on High Performance Computing on Clusters. Parallel and Distributed Computing Practices (PDCP) Journal, Vol. 2, No. 2, June 1999.

    [4] G. Pfister. In Search of Clusters. 2nd ed., Prentice Hall PTR, NJ, 1998.

    [5] T. Sterling, J. Salmon, D. Becker, and D. Savarese. How to Build a Beowulf. The MIT Press, 1999.

    [6] B. Wilkinson and M. Allen. Parallel Programming: Techniques and Applications Using Networked Workstations and Parallel Computers. Prentice Hall, NJ, 1999.
     

    Reviewers

    Alessandro Bevilacqua
    Alfred Weaver
    Amin Vahdat
    Amitabh Dave
    Amy Apon
    Andrzej Goscinski
    Benedict Gomes
    Ernst Biersack
    Boleslaw Szymanski
    Boris Weissman
    Cho-Li Wang
    Chung-Ta King
    Dan Hyde
    David Bader
    Davide Rossetti
    Domenico Talia
    Dorina Petriu
    Dror Feitelson
    El-ghazali Talbi
    Erhan Saridogan
    Gagan Agrawal
    Gihwan Cho
    Giuseppe Ciaccio
    Harjinder Sandhu
    Hye-Seon Meeng
    Jamel Gafsi
    Jay Fenwick
    Jianyong Wang
    John Dougherty
    Kenneth Birman
    Lars Rzymianowicz
    Lori Pollock
    Luis Silva
    Maciej Golebiewski
    Mark Baker
    Mark Clement
    Marianne Winslett
    Mathew Chidester
    Michele Colajanni
    Orly Kremien
    Paddy Nixon
    Paul Roe
    Putchong Uthayopas
    Quinn Snell
    Rainer Fraedrich
    Rajeev Raje
    Rajeev Thakur
    Ricky Kwok
    Robert Brunner
    Robert Todd
    Samuel Russ
    Toni Cortes
    Yong Cho
    Yoshio Tanaka
    Yu-Kwong Kwok