Innovative Computing Laboratory, University of Tennessee

 ______ _______ _____   _______ _______ ______
|   __ \   |   |     |_|     __|   _   |   __ \
|    __/   |   |       |__     |       |      <
|___|  |_______|_______|_______|___|___|___|__|

Parallel Ultra Light Systolic Array Runtime

PULSAR 2.0.0, November 2014

  • PRT includes multi-GPU support for Nvidia GPUs using CUDA.

    • The prt_vsa_new() function takes the number of devices in addition to the number of threads.

    • The VDP mapping function returns the unit’s location (host or device) in addition to the rank.

    • The channel functions prt_channel_push() and prt_channel_pop() are replaced by the VDP functions prt_vdp_channel_push() and prt_vdp_channel_pop().

    • The packet functions prt_packet_new() and prt_packet_release() are replaced by the VDP functions prt_vdp_packet_new() and prt_vdp_packet_release().

    • The new auxiliary function prt_vdp_packet_new_host_to_device() creates a packet in host memory and queues a transfer to device memory.

    • The new function prt_vsa_device_warmup_func_set() makes it possible to warm up devices before timing.

  • PRT can be built with or without CUDA support, and with or without MPI support. If PRT is built with MPI support or with CUDA support, an extra thread is launched to control MPI and/or the GPUs. If PRT is launched in a single-node configuration without GPUs, the extra thread is not launched and all CPU cores can be used for computing at full speed.

  • The prt_channel_new() channel constructor now takes the data size in bytes instead of the count and the MPI datatype. This makes it consistent with the prt_vdp_packet_new() constructor.

  • Variable-size packets can be created and sent down a channel. The channel size designates the maximum packet size allowed in the channel. The actual packet size can be queried on the receiving side after reception.

  • Channels can be turned on and off. Newly created channels are active. Inactive channels are excluded from readiness checks. prt_vdp_channel_on() activates a channel; prt_vdp_channel_off() deactivates it.

  • The prt_vsa_run() function now returns the execution time in seconds as a double-precision floating-point number. Timing starts after a global MPI barrier and a local Pthreads barrier, and ends after a global MPI barrier and a local Pthreads barrier. The intention is to measure the execution time of the workload precisely, without the overhead of initialization.

  • The function prt_vsa_thread_warmup_func_set() specifies a thread warmup function, which is called on each thread at the start of prt_vsa_run(), before the threads synchronize at a barrier and timing starts. It is intended for library initialization procedures, such as dynamic loading and memory allocations (possibly expensive pinned memory allocations), so that the overhead of such initialization is excluded from the timing of the main workload.

  • The tile QR code now allocates and initializes the matrix before launching the VSA. The matrix is allocated in a distributed fashion. The tile QR can be launched with or without GPU acceleration. If launched with GPU acceleration, the tile QR offloads the DORMQR and DTSMQR kernels to GPUs.
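
The channel semantics listed above (a maximum packet size per channel, variable-size packets whose actual size is known on the receiving side, and channels that can be switched off so that readiness checks skip them) can be illustrated with a small stand-alone model. This is only a sketch: every name in it (toy_channel, toy_channel_push(), toy_ready(), and so on) is hypothetical, the queue depth and byte limit are arbitrary, and none of it is the PRT API or its implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical toy model; none of these names belong to the PRT API. */
#define TOY_DEPTH     8    /* packets a channel can queue (arbitrary) */
#define TOY_MAX_BYTES 64   /* upper bound on any channel size (arbitrary) */

typedef struct {
    size_t max_size;                 /* channel size: largest packet allowed */
    bool   active;                   /* inactive channels are skipped by toy_ready() */
    size_t head, count;              /* ring buffer state */
    size_t sizes[TOY_DEPTH];         /* actual size of each queued packet */
    unsigned char data[TOY_DEPTH][TOY_MAX_BYTES];
} toy_channel;

/* Create an empty channel; new channels start active. */
void toy_channel_init(toy_channel *ch, size_t max_size)
{
    assert(max_size <= TOY_MAX_BYTES);
    ch->max_size = max_size;
    ch->active = true;
    ch->head = 0;
    ch->count = 0;
}

void toy_channel_on(toy_channel *ch)  { ch->active = true; }
void toy_channel_off(toy_channel *ch) { ch->active = false; }

/* Queue a packet of `size` bytes; fails if it exceeds the channel size
   or if the queue is full. */
bool toy_channel_push(toy_channel *ch, const void *buf, size_t size)
{
    if (size > ch->max_size || ch->count == TOY_DEPTH)
        return false;
    size_t tail = (ch->head + ch->count) % TOY_DEPTH;
    memcpy(ch->data[tail], buf, size);
    ch->sizes[tail] = size;
    ch->count++;
    return true;
}

/* Dequeue the oldest packet and report its actual size through `size`. */
bool toy_channel_pop(toy_channel *ch, void *buf, size_t *size)
{
    if (ch->count == 0)
        return false;
    *size = ch->sizes[ch->head];
    memcpy(buf, ch->data[ch->head], *size);
    ch->head = (ch->head + 1) % TOY_DEPTH;
    ch->count--;
    return true;
}

/* Readiness check: every ACTIVE input channel must hold a packet. */
bool toy_ready(const toy_channel *in, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (in[i].active && in[i].count == 0)
            return false;
    return true;
}
```

In this model, a consumer with two input channels is not ready while the second channel is empty, but becomes ready as soon as that channel is switched off with toy_channel_off(). A 5-byte packet pushed into a 16-byte channel is popped with a reported size of 5.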

PULSAR 1.0.0, August 2013

The first release of PULSAR provides a complete API for building and executing a Virtual Systolic Array (VSA) - a collection of Virtual Data Processors (VDPs) connected with channels and communicating with packets.

The runtime supports distributed memory systems with multicore processors and relies on POSIX Threads (a.k.a. Pthreads) for intra-node multithreading, and on the Message Passing Interface (MPI) for inter-node communication.

This release is accompanied by an implementation of the tile QR factorization with sequential (a.k.a. “domino”) panel reduction and an implementation of the LU factorization with no pivoting.
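
As a rough illustration of the VSA idea (VDPs connected with channels and communicating with packets), the following stand-alone sketch models a linear array in plain C. It is a toy under stated assumptions, not the PULSAR API: the names toy_link, toy_vdp and toy_vsa_run() are hypothetical, each channel holds at most one packet, a packet is a single integer, the firing rule (input present, output free) is a simplification chosen for this example, and the "scheduler" is a sequential sweep rather than the multithreaded, distributed runtime described above.

```c
#include <assert.h>

/* Hypothetical toy model of a linear systolic array; not the PRT API. */
#define TOY_VDPS 4

typedef struct {
    int has_packet;   /* 1 if a packet is waiting in this channel */
    int payload;      /* packet contents: a single integer here */
} toy_link;

typedef struct {
    toy_link *in;     /* input channel */
    toy_link *out;    /* output channel */
} toy_vdp;

/* Feed one packet into a chain of TOY_VDPS processors. Each VDP fires
   when its input channel holds a packet and its output channel is free;
   firing consumes the input packet and emits payload + 1. */
int toy_vsa_run(int input)
{
    toy_link links[TOY_VDPS + 1] = {{0, 0}};
    toy_vdp vdps[TOY_VDPS];

    /* Connect VDP i between channel i and channel i + 1. */
    for (int i = 0; i < TOY_VDPS; i++) {
        vdps[i].in = &links[i];
        vdps[i].out = &links[i + 1];
    }

    /* Inject the initial packet into the first channel. */
    links[0].has_packet = 1;
    links[0].payload = input;

    /* Sweep over the VDPs until no VDP can fire any more. */
    int fired;
    do {
        fired = 0;
        for (int i = 0; i < TOY_VDPS; i++) {
            toy_vdp *v = &vdps[i];
            if (v->in->has_packet && !v->out->has_packet) {
                v->out->payload = v->in->payload + 1;
                v->out->has_packet = 1;
                v->in->has_packet = 0;
                fired = 1;
            }
        }
    } while (fired);

    /* The packet has travelled through every VDP to the last channel. */
    assert(links[TOY_VDPS].has_packet);
    return links[TOY_VDPS].payload;
}
```

With four VDPs that each add one to the payload, toy_vsa_run(0) returns 4.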