HeteroPar'2011 Program

Session I Chair: George Bosilca
9:30-10:30 Invited Speaker: Rosa M. Badia, Barcelona Supercomputing Center, Spain StarSs support for task-based parallel programming of heterogeneous platforms Abstract: Current hardware platforms, and their expected evolution towards heterogeneous configurations composed of multicores and GPUs, put significant stress on programmers. Although this has been observed for several years, no programming model has yet been widely accepted as a clear winner. The talk will focus on the characteristics of the task-based programming model StarSs, especially its current OmpSs implementation, which combines OpenMP with StarSs ideas. OmpSs implements OpenMP tasks, but is able to take into account the data dependences that exist between different task instances in order to build the corresponding data-dependence graph. The OmpSs runtime can run the tasks of a single application on the CPUs and GPUs of the same platform, taking care of the required data transfers. Moreover, the runtime also supports distributed heterogeneous clusters. The talk will describe the main characteristics of the programming model and how these features are supported by the compiler and runtime.
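The data-dependence graph mentioned in the abstract can be built from the declared inputs and outputs of each task. The sketch below is an illustration of that idea, not the OmpSs implementation: it tracks the last writer of each datum and adds a read-after-write edge for every later reader (write-after-read and write-after-write dependences are ignored for brevity).

```python
def build_dependence_graph(tasks):
    """tasks: list of (name, ins, outs); returns list of (producer, consumer) edges."""
    last_writer = {}          # datum -> task that last wrote it
    edges = []
    for name, ins, outs in tasks:
        for d in ins:         # read-after-write dependence on the last writer
            if d in last_writer:
                edges.append((last_writer[d], name))
        for d in outs:        # this task becomes the current writer of d
            last_writer[d] = name
    return edges

# Hypothetical task chain: t1 writes x, t2 reads x and writes y, t3 reads y.
edges = build_dependence_graph([
    ("t1", [], ["x"]),
    ("t2", ["x"], ["y"]),
    ("t3", ["y"], []),
])
```

A runtime with this graph in hand can execute any tasks whose incoming edges are all satisfied, on whichever device (CPU or GPU) is available.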
10:30-11:00 Jean-Marc Nicod, Laurent Philippe and Lamiel Toch. A Genetic Algorithm to Schedule Workflows on a SOA-Grid with Communication Costs Abstract: In this paper we study the problem of scheduling a collection of workflows, identical or not, on a SOA grid. A workflow (job) is represented by a directed acyclic graph (DAG) with typed tasks. All of the grid hosts are able to process a set of task types with unrelated processing costs, and to transmit files through communication links whose communication times are not negligible. The goal is to minimize the maximum completion time (makespan) of the workflows. To solve this problem we propose a genetic approach. The contributions of this paper are both the design of a Genetic Algorithm that takes communication costs into account and its performance analysis.
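A genetic algorithm of this kind needs a fitness function that scores a candidate task-to-host assignment. The sketch below shows one plausible makespan evaluation under the assumptions the abstract states: host-dependent processing costs and a communication delay paid only when producer and consumer run on different hosts. The DAG, costs, and uniform `comm` delay are illustrative, not taken from the paper.

```python
def makespan(dag, assign, cost, comm):
    # dag: task -> list of predecessors, given in topological order
    finish, host_free = {}, {}
    for t, preds in dag.items():
        h = assign[t]
        # a task is ready once all predecessors finished, plus a transfer
        # delay for predecessors placed on a different host
        ready = max(
            [finish[p] + (comm if assign[p] != h else 0.0) for p in preds],
            default=0.0,
        )
        start = max(ready, host_free.get(h, 0.0))   # host runs one task at a time
        finish[t] = start + cost[t][h]
        host_free[h] = finish[t]
    return max(finish.values())

# Hypothetical diamond-shaped workflow on two hosts (0 and 1).
dag = {"a": [], "b": ["a"], "c": ["a"], "d": ["b", "c"]}
cost = {"a": {0: 2, 1: 3}, "b": {0: 3, 1: 2}, "c": {0: 2, 1: 2}, "d": {0: 1, 1: 1}}
ms = makespan(dag, {"a": 0, "b": 0, "c": 1, "d": 0}, cost, comm=1.0)
```

The GA would evolve the `assign` mapping (the chromosome), using this makespan as the fitness to minimize.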
Coffee Break
Session II Chair: Emmanuel Jeannot
11:30-12:00 Jinpil Lee, Minh Tuan Tran, Tetsuya Odajima, Taisuke Boku and Mitsuhisa Sato. An Extension of XcalableMP PGAS Language for Multi-node GPU Clusters Abstract: The GPU is a promising device for further increasing computing performance in the high performance computing field. Many programming languages besides CUDA have been proposed for programming the GPU as an offload target of the host. However, parallel programming on a multi-node GPU cluster, where each node has one or more GPUs, remains hard work: users have to describe multi-level parallelism, both between nodes and within the GPU, using MPI together with a GPGPU language like CUDA. In this paper, we propose a parallel programming language targeting multi-node GPU clusters. We extend XcalableMP, a PGAS (Partitioned Global Address Space) parallel programming language for PC clusters, to provide a productive parallel programming model for multi-node GPU clusters. Our performance evaluation with the N-body problem demonstrated that not only does our model achieve scalable performance, but it also increases productivity, since it requires only small modifications to the serial code.
12:00-12:30 Hamid Arabnejad and Jorge Barbosa. Performance Evaluation of List Based Scheduling on Heterogeneous Systems Abstract: This paper addresses the problem of evaluating the schedules produced by list based scheduling algorithms against those produced by meta-heuristic algorithms. Task scheduling on heterogeneous systems is an NP-hard problem, so several heuristic approaches have been proposed to solve it. These heuristics fall into several classes, such as list based, clustering and task duplication scheduling; here we consider the list scheduling approach. The objective of this study is to assess the solutions obtained by list based algorithms in order to quantify the room for improvement available to new heuristics, taking as reference the solutions obtained with meta-heuristics, which are higher time complexity approaches. We conclude that for a low Communication to Computation Ratio (CCR) of 0.1, the schedules given by the list scheduling approach are on average close to the meta-heuristic solutions, and that for CCRs up to 1 the solutions are less than 11% worse than the meta-heuristic solutions, showing that it may not be worth using higher complexity approaches and that the room for improvement is narrow.
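The list scheduling approach evaluated here can be sketched in a few lines. This is an assumption of the general scheme (a HEFT-like rule, not necessarily the exact algorithms the paper evaluates): tasks are taken in priority order and each is placed on the processor that finishes it earliest, with communication costs omitted for brevity.

```python
def list_schedule(order, cost, n_procs):
    """order: tasks in priority order; cost[t][p]: time of task t on processor p."""
    free = [0.0] * n_procs      # time at which each processor becomes free
    placement = {}
    for t in order:
        # greedy rule: pick the processor with the earliest finish time
        p = min(range(n_procs), key=lambda q: free[q] + cost[t][q])
        free[p] += cost[t][p]
        placement[t] = p
    return placement, max(free)

# Hypothetical instance: three independent tasks, two unrelated processors.
placement, ms = list_schedule(
    ["a", "b", "c"], {"a": [2, 4], "b": [3, 3], "c": [2, 1]}, 2
)
```

A meta-heuristic would instead search over many placements, which is why the paper uses it as the (more expensive) reference point.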
12:30-13:00 David Clarke, Alexey Lastovetsky and Vladimir Rychkov. Column-Based Matrix Partitioning for Parallel Matrix Multiplication on Heterogeneous Processors Based on Functional Performance Models Abstract: In this paper we present a new data partitioning algorithm to improve the performance of parallel matrix multiplication of dense square matrices on heterogeneous clusters. Existing algorithms either use single-speed performance models, which are too simplistic, or do not attempt to minimize the total volume of communication. The functional performance model (FPM) is more realistic than single-speed models because it integrates many important features of heterogeneous processors, such as processor heterogeneity, the heterogeneity of memory structure, and the effects of paging. To load balance the computations, the new algorithm uses FPMs to compute the area of the rectangle assigned to each processor. The total volume of communication is then minimized by choosing a shape and ordering such that the sum of the half-perimeters is minimized. Experimental results demonstrate that this new algorithm can reduce the total execution time of parallel matrix multiplication in comparison to existing algorithms.
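The half-perimeter objective mentioned above can be made concrete with a small sketch (illustrative, not the paper's algorithm): in a column-based partition of the unit square, a processor with area a in a column of width w gets a w-by-(a/w) rectangle, and its communication volume is proportional to the half-perimeter w + a/w.

```python
def half_perimeter_sum(columns):
    """columns: list of columns, each a list of processor areas (total area 1)."""
    total = 0.0
    for col in columns:
        w = sum(col)               # column width = the column's share of the area
        for a in col:
            total += w + a / w     # half-perimeter of a (w x a/w) rectangle
    return total

# Four equal processors: a 2x2 arrangement beats a single column of four.
square_like = half_perimeter_sum([[0.25, 0.25], [0.25, 0.25]])
one_column = half_perimeter_sum([[0.25, 0.25, 0.25, 0.25]])
```

The algorithm's job is then to choose the column structure (and the ordering of processors within columns) that minimizes this sum, subject to the FPM-derived areas.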
Lunch Break
Session III Chair: Edgar Gabriel
14:30-15:00 Gennaro Cordasco, Rosario De Chiara, Ada Mancuso, Dario Mazzeo, Vittorio Scarano and Carmine Spagnuolo. A Framework for distributing Agent-based simulations: D-MASON Abstract: Agent-based simulation models are an increasingly popular tool for research and management in many diverse fields. In executing such simulations, speed is one of the most general and important issues. The traditional answer to this issue is to invest resources in a dedicated installation of dedicated computers. In this paper we present D-MASON, a framework that is a parallel version of MASON, a library for writing and running agent-based simulations. D-MASON is designed to harness unused PCs for increased performance.
15:00-15:30 Jacques Bahi, Raphaël Couturier and Lilia Ziane Khodja. Parallel Sparse Linear Solver GMRES for GPU Clusters with Compression of Exchanged Data Abstract: GPU clusters have become attractive parallel platforms for high performance computing due to their ability to compute faster than CPU clusters. We use this architecture to accelerate the mathematical operations of the GMRES method for solving large sparse linear systems. However, the parallel sparse matrix-vector product of GMRES causes overheads in CPU/CPU and GPU/CPU communications when large shared vectors of unknowns are exchanged between the GPUs of the cluster. Since a sparse matrix-vector product often does not need all the unknowns of the vector, we propose to use compression and decompression operations on the shared vectors in order to exchange only the needed unknowns. In this paper we present a new parallel GMRES algorithm for GPU clusters using compressed vectors. Our experimental results show that the GMRES solver is more efficient when the data compression technique is used on large shared vectors.
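The compression idea described above can be illustrated with a minimal sketch (function names and the index set are hypothetical): before the sparse matrix-vector product, a neighbour only needs the entries of the shared vector that its local matrix rows actually reference, so the nodes can exchange (index, value) pairs instead of the full vector.

```python
def compress(x, needed):
    """Pack only the entries of vector x that the receiver's rows reference."""
    return [(i, x[i]) for i in sorted(needed)]

def decompress(pairs, n):
    """Unpack into a length-n buffer; entries never referenced stay 0.0."""
    y = [0.0] * n
    for i, v in pairs:
        y[i] = v
    return y

x = [1.0, 2.0, 3.0, 4.0, 5.0]
msg = compress(x, {0, 3})      # two values transferred instead of five
y = decompress(msg, len(x))
```

On a GPU cluster, the win is that the smaller message reduces both GPU-to-host copies and inter-node traffic, at the price of the pack/unpack kernels.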
Coffee Break
Session IV Chair: Emmanuel Agullo
16:30-17:00 Aleksandar Ilic and Leonel Sousa. Scheduling Divisible Loads on Heterogeneous Desktop Systems with Limited Memory Abstract: This paper addresses the problem of scheduling discretely divisible applications on heterogeneous desktop systems with limited memory, relying on realistic performance models for computation and for communication over bidirectional asymmetric full-duplex buses. We propose an algorithm for multi-installment processing with multi-distributions that efficiently overlaps computation and communication at the device level, with respect to the supported concurrency. The presented approach was experimentally evaluated for a real application, a 2D FFT batch collaboratively executed on a graphics processing unit and a multi-core CPU. The experimental results show that the proposed approach outperforms the optimal implementation by about 4 times, whereas current state-of-the-art approaches are not able to determine a load-balanced distribution.
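The multi-installment idea can be sketched as follows, under stated assumptions that simplify the paper's models heavily: each device gets a share of the divisible load proportional to its measured speed, delivered in k equal installments so that transfers of later chunks can overlap computation of earlier ones. The speeds and installment count below are made up for illustration.

```python
def installments(total, speeds, k):
    """Split a divisible load among devices by speed, in k equal installments."""
    # proportional share per device (speed-weighted load balancing)
    share = [total * s / sum(speeds) for s in speeds]
    # each device's share is delivered as k equal chunks; rounding keeps the
    # chunk sizes tidy for discretely divisible loads
    return [[round(x / k, 6) for x in share] for _ in range(k)]

# Hypothetical GPU twice as fast as the CPU, load of 120 units, 2 installments.
plan = installments(120.0, [2.0, 1.0], 2)
```

The paper's algorithm additionally accounts for limited device memory and asymmetric bus bandwidths, which a proportional split like this ignores.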
17:00-17:30 Ma. Guadalupe Sánchez, Vicente Vidal and Jordi Bataller. Peer Group and Fuzzy Metric to Remove Noise in Images using Heterogeneous Computing Abstract: In this study, we parallelize the removal of impulsive noise from images based on the concepts of peer group and fuzzy metric, on multi-core processors using Open Multi-Processing (OpenMP) and on the Graphics Processing Unit (GPU) using the Compute Unified Device Architecture (CUDA). Many sequential algorithms have been proposed to remove such noise, but they have an excessive computational cost on large images when the purpose is real-time processing. We analyze performance for different image sizes in order to identify the best parallel architecture for each size, comparing multi-core, multi-GPU, and combined implementations. When the algorithm runs on the GPU, we study the use of shared memory and texture memory to minimize the access time to data in global memory. Results show that when the image is distributed over multi-core and multi-GPU, a greater number of megapixels per second is processed.
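The peer-group concept can be illustrated with a simplified grayscale sketch (the paper works on color images with a fuzzy metric; here a plain absolute difference and the threshold values are stand-ins): a pixel with too few "peers" among its 8 neighbours is classified as impulsive noise and replaced by the neighbourhood median.

```python
def peer_group_filter(img, d=40, m=3):
    """img: 2D list of grayscale values; d: peer distance; m: minimum peers."""
    h, w = len(img), len(img[0])
    out = [row[:] for row in img]           # borders are left unchanged
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            c = img[y][x]
            neigh = [img[y + dy][x + dx]
                     for dy in (-1, 0, 1) for dx in (-1, 0, 1)
                     if (dy, dx) != (0, 0)]
            peers = [v for v in neigh if abs(v - c) <= d]
            if len(peers) < m:              # too few peers: impulsive noise
                out[y][x] = sorted(neigh)[len(neigh) // 2]  # neighbour median
    return out
```

Since every pixel is processed independently, the loop body maps directly onto OpenMP threads or one CUDA thread per pixel, which is what makes the algorithm a good fit for the heterogeneous setups the paper compares.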
17:30-18:00 Jaspal Subhlok, Edgar Gabriel, Girish Nandagudi and Judit Jimenez. Estimation of MPI Application Performance on Volunteer Environments Abstract: Emerging MPI libraries, such as VolpexMPI and P2P MPI, allow message passing parallel programs to execute effectively in heterogeneous volunteer environments despite frequent failures. However, the performance of message passing codes varies widely in a volunteer environment, depending on the application characteristics and the computation and communication characteristics of the nodes and the interconnection network. This paper has the dual goal of developing and validating a tool chain to estimate performance of MPI codes in a volunteer environment and analyzing the suitability of the class of computations represented by NAS benchmarks for volunteer computing. The framework is deployed to estimate performance in a variety of possible volunteer configurations, including some based on the measured parameters of a campus volunteer pool. The results show slowdowns by factors between 2 and 6 for different NAS benchmark codes for execution on a realistic volunteer campus pool as compared to dedicated clusters.