Using the new features of SYCL 2020 in BabelStream

Today marks the release of the new specification of SYCL: SYCL 2020. The press release gives a great summary of all the new features and changes that this new version brings. The full specification can be found in the Khronos Registry as a PDF and in HTML. In this post, I look at some of the changes and new features of SYCL 2020, and how they are being used in BabelStream. You can see the transformation at the GitHub Pull Request.

BabelStream is a modern implementation of the McCalpin STREAM benchmark, written in a wide variety of programming models and supports heterogeneous devices. Other differences to STREAM are summarised elsewhere. The HPC Research Group has often used BabelStream to explore the topic of Performance Portability, and in our recent study you can find results from many programming models running on a diverse range of CPUs and GPUs from different vendors.

BabelStream has had a SYCL implementation since 2016, and has been kept up to date with the revisions to the specification. Updating BabelStream from SYCL 1.2.1 to SYCL 2020 has resulted in at the very least fewer lines of code (233 down from 290), and 22% fewer characters (6,928 down from 8,919; a saving of 1,991) thanks to some simplifications brought in in the latest version of SYCL.

Naming and un-naming

There are some cosmetic changes, such as the header file being moved from <CL/sycl.hpp> to <sycl/sycl.hpp> in a nod to supporting a wider variety of backends. SYCL 2020 expands the broad support of using OpenCL under the hood as SYCL implementations can use other approaches. This gives a boost to the ecosystem and helps expand the support for SYCL across lots of different architectures from different vendors. The namespace is now shortened just sycl::, rather than cl::sycl::.

The kernel function naming using templates is now optional in SYCL 2020. This means we don’t need to create unique names for all the kernels, and in the case of BabelStream instantiate those names to help with the templated type selection of float or double data types.

Simplified accessors

SYCL 2020 is now based on C++17, and as a result SYCL codes can take advantage of Class template argument deduction (CTAD) which removes some verbosity in specifying all the template types. This is most noticeable for accessors where SYCL 2020 brings some simplifications to constructing the accessor class.

The result is they can be much cleaner to use than the get_access<>() buffer member function. This method is still available in the new standard, but using the constructors in BabelStream looks, in my opinion, far less cumbersome and in some cases better shows the usage.

As an example, let’s look at the Copy kernel. In SYCL 1.2.1, the code looked like this:

template <class T>
void SYCLStream<T>::copy()
{
  queue->submit([&](handler &cgh)
  {
    auto ka = d_a->template get_access<access::mode::read>(cgh);
    auto kc = d_c->template get_access<access::mode::write>(cgh);
    cgh.parallel_for<copy_kernel>(range<1>{array_size}, [=](id<1> idx)
    {
      kc[idx] = ka[idx];
    });
  });
  queue->wait();
}

In SYCL 2020, we replace the get_access<>() buffer member functions and use accessor constructors instead:

template <class T>
void SYCLStream<T>::copy()
{
  queue->submit([&](sycl::handler &cgh)
  {
    sycl::accessor ka {d_a, cgh, sycl::read_only};
    sycl::accessor kc {d_c, cgh, sycl::write_only};
    cgh.parallel_for(sycl::range<1>{array_size}, [=](sycl::id<1> idx)
    {
      kc[idx] = ka[idx];
    });
  });
  queue->wait();
}

The access mode (read, write) can now be deduced from the shorthand tag in the constructor instead of appearing in long form in the template argument.

Note, the BabelStream code now uses a constructor initialiser list to construct the buffers instead of allocating them as pointers and allocating them on the free store with new. This is a change to better match modern C++ coding style.

Simplified host accessors

This convenience also extends to host accessors. In this SYCL 1.2.1 version of BabelStream, the auto hid a lot of the complexity of the accessor type. Without auto, we would otherwise have had to specify access::target::host_buffer in the type:

cl::sycl::accessor<T, 1, cl::sycl::access::mode::read, cl::sycl::access::target::host_buffer>

Using auto saves writing this, but means a reader of the code only knows the accessors are host accessors because they are declared in application scope: the variables _a, _b, and _c below are declared in the host code instead of command group scope.

template <class T>
void SYCLStream<T>::read_arrays(std::vector<T>& a, std::vector<T>& b, std::vector<T>& c)
{
  auto _a = d_a->template get_access<access::mode::read>();
  auto _b = d_b->template get_access<access::mode::read>();
  auto _c = d_c->template get_access<access::mode::read>();
  for (int i = 0; i < array_size; i++)
  {
    a[i] = _a[i];
    b[i] = _b[i];
    c[i] = _c[i];
  }
}

SYCL 2020 adds the host_accessor class, so it’s now much clearer that the code wishes to access the buffer on the host. It also uses access mode tags like (device) accessors to specify the dependencies. The example below is much clearer that the previous one above in its intent that the buffers are accessed on the host:

template <class T>
void SYCLStream<T>::read_arrays(std::vector<T>& a, std::vector<T>& b, std::vector<T>& c)
{
  sycl::host_accessor _a {d_a, sycl::read_only};
  sycl::host_accessor _b {d_b, sycl::read_only};
  sycl::host_accessor _c {d_c, sycl::read_only};
  for (int i = 0; i < array_size; i++)
  {
    a[i] = _a[i];
    b[i] = _b[i];
    c[i] = _c[i];
  }
}

Host accessors are a handy feature of SYCL. They give the host access to the buffers used on the device, and fit within the queue dependency graph to ensure correct synchronisation between the host and device. For those familiar with heterogeneous programming, getting the data back on the host can be a little verbose at times, but the host accessor in SYCL makes this convenient when the data really is required on the host.

Unified Shared Memory (USM)

SYCL 2020 also adds USM for those devices which can support it as an alternative to the buffer/accessor memory model. We’ll add this to a future version of BabelStream soon.

Reductions

SYCL 2020 brings with it first-class support for parallel reductions. This means that we no longer have to write our own reduction implementation by hand, which is quite verbose as can be seen in the SYCL 1.2.1 implementation of the dot product kernel in BabelStream. The following code implements a dot product operation.

T SYCLStream<T>::dot()
{
  queue->submit([&](handler &cgh)
  {
    auto ka   = d_a->template get_access<access::mode::read>(cgh);
    auto kb   = d_b->template get_access<access::mode::read>(cgh);
    auto ksum = d_sum->template get_access<access::mode::write>(cgh);

    auto wg_sum = accessor<T, 1, access::mode::read_write, access::target::local>(range<1>(dot_wgsize), cgh);

    size_t N = array_size;
    cgh.parallel_for<dot_kernel>(nd_range<1>(dot_num_groups*dot_wgsize, dot_wgsize), [=](nd_item<1> item)
    {
      size_t i = item.get_global_id(0);
      size_t li = item.get_local_id(0);
      size_t global_size = item.get_global_range()[0];

      wg_sum[li] = 0.0;
      for (; i < N; i += global_size)
        wg_sum[li] += ka[i] * kb[i];

      size_t local_size = item.get_local_range()[0];
      for (int offset = local_size / 2; offset > 0; offset /= 2)
      {
        item.barrier(cl::sycl::access::fence_space::local_space);
        if (li < offset)
          wg_sum[li] += wg_sum[li + offset];
      }

      if (li == 0)
        ksum[item.get_group(0)] = wg_sum[0];
    });
  });

  T sum = 0.0;
  auto h_sum = d_sum->template get_access<access::mode::read>();
  for (int i = 0; i < dot_num_groups; i++)
  {
    sum += h_sum[i];
  }

  return sum;
}

This code follows the usual approach used when writing global reductions in OpenCL. Each work-group produces a partially reduced value, storing them in global memory and using the host to perform the secondary stage to produce the final result. BabelStream used a “rule of thumb” to choose appropriate numbers of work-group sizes on CPUs and GPUs.

SYCL 2020 makes this much simpler. The sycl::reduction function creates the appropriate objects which describe what buffers need reduction semantics and what type of reduction operation needs applying (addition, multiplication, etc.). The lambda function now also takes an additional argument: the reduction variable. This reduction variable is used inside the kernel and updated using the operation (or combine()). This approach may look familiar to those readers who use the RAJA Performance Portability Layer.

The following code implements the same BabelStream dot kernel as above using the new reduction support in SYCL 2020. The kernel code is greatly simplified and reduced just to a single line.

template <class T>
T SYCLStream<T>::dot()
{

  queue->submit([&](sycl::handler &cgh)
  {
    sycl::accessor ka {d_a, cgh, sycl::read_only};
    sycl::accessor kb {d_b, cgh, sycl::read_only};

    cgh.parallel_for(sycl::range<1>{array_size},
      // Reduction object, to perform summation - initialises the result to zero
      sycl::reduction(d_sum, cgh, std::plus<T>(), sycl::property::reduction::initialize_to_identity),
      [=](sycl::id<1> idx, auto& sum)
      {
        sum += ka[idx] * kb[idx];
      });
  });

  // Get access on the host, and return a copy of the data (single number)
  // This will block until the result is available, so no need to wait on the queue.
  sycl::host_accessor result {d_sum, sycl::read_only};
  return result[0];
}

For BabelStream, we do not need to include the original value of the buffer (d_sum) in the reduction, so we can simply ask SYCL to ignore its value with the sycl::property::reduction::initialize_to_identity property. This saves us zeroing the buffer before calling the kernel. As an aside, there are a number of ways that the buffer could be zeroed: host_accessor, a single_task kernel, or the fill() explicit memory operation, each with potentially different performance characteristics.

The parallel_for function uses C++ parameter packs so that multiple variables can be reduced simultaneously.

The reduction is complete at the end of the kernel, so the host need only read it back using a host_accessor, as in the example above.

Support for reductions is an important addition to SYCL 2020, and as these examples show using reductions in codes is now far simpler.

More SYCL 2020

You can find out more about SYCL 2020 on the Khronos website or find links to lots of resources from the SYCL community at sycl.tech. Do join us for the 9th International Workshop on OpenCL and SYCL (IWOCL and SYCLCon) online on 27-29 April, 2021.

Tom Deakin is a Senior Research Associate in the HPC Research Group at the University of Bristol. Tom is Chair of the Khronos SYCL Advisory Panel and SYCL Outreach Liaison Officer.