Summary Guide for new Cusp developers

Introduction

This page describes how to develop CUDA applications with CUSP, a C++ template library for sparse matrix computations. This guide is intended to be accessible, even to developers with limited C++ experience.

Prerequisites

Cusp v0.4.0 requires CUDA 5.5 (or newer). You can check the CUDA installation by using nvcc on the command line as follows :

$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2014 NVIDIA Corporation
Built on Thu_Jul_17_21:41:27_CDT_2014
Cuda compilation tools, release 6.5, V6.5.12

Since Cusp is a C++ template library there is nothing to "build". Simply download the newest version of Cusp from here and extract the contents of each zip file to a directory. We suggest installing Cusp to the CUDA include directory, which is usually

/usr/local/cuda/include/ on a Linux and Mac OSX
C:\CUDA\include\ on a Windows system

Simple Example

Let's compile a simple program to make sure all the prerequisites are satisfied. Save the following source code to a file named version.cu.

#include <thrust/version.h>
#include <cusp/version.h>
#include <iostream>
int main(void)
{
    int cuda_major =  CUDA_VERSION / 1000;
    int cuda_minor = (CUDA_VERSION % 1000) / 10;
    int thrust_major = THRUST_MAJOR_VERSION;
    int thrust_minor = THRUST_MINOR_VERSION;
    int cusp_major = CUSP_MAJOR_VERSION;
    int cusp_minor = CUSP_MINOR_VERSION;
    std::cout << "CUDA   v" << cuda_major   << "." << cuda_minor   << std::endl;
    std::cout << "Thrust v" << thrust_major << "." << thrust_minor << std::endl;
    std::cout << "Cusp   v" << cusp_major   << "." << cusp_minor   << std::endl;
    return 0;
}

Now compile version.cu with nvcc. If Cusp was installed to the CUDA include directory then the following commands should work.

$ nvcc version.cu -o version
$ ls
thrust  version version.cu
$ ./version
CUDA   v6.5
Thrust v1.7
Cusp   v0.4

Other Examples

Additional Cusp examples are available for download or online browsing. These examples can be compiled using the same procedure as above. For instance, the Conjugate Gradient solver example is compiled and run as follows:

$ cd examples/Solvers/
$ nvcc -O2 cg.cu -o cg
$ ./cg
Solver will continue until residual norm 0.01 or reaching 100 iterations
Iteration Number  | Residual Norm
            0       1.000000e+01
            1       1.414214e+01
            2       1.093707e+01
            3       8.949319e+00
            4       6.190055e+00
            5       3.835189e+00
            6       1.745481e+00
            7       5.963546e-01
            8       2.371134e-01
            9       1.152524e-01
           10       3.134467e-02
           11       1.144415e-02
           12       1.824176e-03
Successfully converged after 12 iterations.

Sparse Matrices

Cusp natively supports several sparse matrix formats:

When manipulating matrices it's important to understand the advantages and disadvantages of each format. Broadly speaking, the DIA and ELL formats are the most efficient for computing sparse matrix-vector products, and therefore are the fastest formats for solving sparse linear systems with iterative methods (e.g. Conjugate Gradients). The COO and CSR formats are more flexible than DIA and ELL and easier manipulate. The HYB format is a hybrid combination of the ELL (fast) and COO (flexible) formats and is a good default choice. Refer to the matrix format examples for additional information.

Format Conversions

Cusp makes it easy to transfer data between the host and device and convert between sparse matrix formats. For example,

#include <cusp/csr_matrix.h>
#include <cusp/hyb_matrix.h>
int main()
{
  // Allocate storage for a 5 by 8 sparse matrix in CSR format with 12
  // nonzero entries on the host
  cusp::csr_matrix<int,float,cusp::host_memory> csr_host(5,8,12);
  // Transfer the matrix to the device
  cusp::csr_matrix<int,float,cusp::device_memory> csr_device(csr_host);
  // Convert the matrix to HYB format on the device
  cusp::hyb_matrix<int,float,cusp::device_memory> csr_device(csr_device);
}

Iterative Solvers

Cusp provides a variety of iterative methods for solving sparse linear systems. Established Krylov subspace methods are available:

More detailed examples are available here.

#include <cusp/hyb_matrix.h>
#include <cusp/monitor.h>
#include <cusp/gallery/poisson.h>
#include <cusp/krylov/cg.h>
// where to perform the computation
typedef cusp::device_memory MemorySpace;
// which floating point type to use
typedef float ValueType;
int main(void)
{
    // create an empty sparse matrix structure (HYB format)
    cusp::hyb_matrix<int, ValueType, MemorySpace> A;
    // create a 2d Poisson problem on a 10x10 mesh
    cusp::gallery::poisson5pt(A, 10, 10);
    // allocate storage for solution (x) and right hand side (b)
    cusp::array1d<ValueType, MemorySpace> x(A.num_rows, 0);
    cusp::array1d<ValueType, MemorySpace> b(A.num_rows, 1);
    // set stopping criteria:
    //  iteration_limit    = 100
    //  relative_tolerance = 1e-3
    //  absolute_tolerance = 0
    //  verbose            = true
    cusp::monitor<ValueType> monitor(b, 100, 1e-3, 0, true);
    // set preconditioner (identity)
    cusp::identity_operator<ValueType, MemorySpace> M(A.num_rows, A.num_rows);
    // solve the linear system A * x = b with the Conjugate Gradient method
    cusp::krylov::cg(A, x, b, monitor, M);
    return 0;
}

Preconditioners

Preconditioners are a way to improve the rate of convergence of iterative solvers. The good preconditioner is fast to compute and approximates the inverse of the matrix in some sense. Cusp provides the following preconditioners:

#include <cusp/precond/aggregation/smoothed_aggregation.h>
#include <cusp/krylov/cg.h>
#include <cusp/gallery/poisson.h>
#include <cusp/csr_matrix.h>
#include <iostream>
template <typename Monitor>
void report_status(Monitor& monitor)
{
    if (monitor.converged())
    {
        std::cout << "  Solver converged to " << monitor.tolerance() << " tolerance";
        std::cout << " after " << monitor.iteration_count() << " iterations";
        std::cout << " (" << monitor.residual_norm() << " final residual)" << std::endl;
    }
    else
    {
        std::cout << "  Solver reached iteration limit " << monitor.iteration_limit() << " before converging";
        std::cout << " to " << monitor.tolerance() << " tolerance ";
        std::cout << " (" << monitor.residual_norm() << " final residual)" << std::endl;
    }
}
int main(void)
{
    typedef int                 IndexType;
    typedef float               ValueType;
    typedef cusp::device_memory MemorySpace;
    // create an empty sparse matrix structure
    cusp::coo_matrix<IndexType, ValueType, MemorySpace> A;
    // create 2D Poisson problem
    cusp::gallery::poisson5pt(A, 256, 256);
    cusp::array1d<ValueType, MemorySpace> x0(A.num_rows, 0);
    cusp::array1d<ValueType, MemorySpace> b(A.num_rows, 1);
    cusp::monitor<ValueType> monitor(b, 1000, 1e-6);
    // solve without preconditioning
    {
        std::cout << "\nSolving with no preconditioner..." << std::endl;
        // allocate storage for solution (x)
        cusp::array1d<ValueType, MemorySpace> x(x0);
        // solve
        cusp::krylov::cg(A, x, b, monitor);
        // report status
        monitor.print();
    }
    // solve with smoothed aggregation algebraic multigrid preconditioner
    {
        std::cout << "\nSolving with smoothed aggregation preconditioner..." << std::endl;
        // allocate storage for solution (x)
        cusp::array1d<ValueType, MemorySpace> x(x0);
        // reset the monitor
        monitor.reset(b);
        // setup preconditioner
        cusp::precond::aggregation::smoothed_aggregation<IndexType, ValueType, MemorySpace> M(A);
        // solve
        cusp::krylov::cg(A, x, b, monitor, M);
        // report status
        monitor.print();
        // print hierarchy information
        std::cout << "\nPreconditioner statistics" << std::endl;
        M.print();
    }
    return 0;
}

User-Defined Linear Operators

Sometimes it is useful to solve a linear system A * x = b without converting the matrix A into one of Cusp's formats. For this reason Cusp supports user-defined linear operators that take in a vector x and compute the result y = A * x. These black-box operators can be used to interface matrix-free methods with Cusp's iterative solvers.

#include <cusp/linear_operator.h>
#include <cusp/krylov/cg.h>
// This example shows how to use cusp::linear_operator to solve
// a linear system with a user-defined linear operator A.  The
// linear_operator is a way to interface custom sparse matrix
// formats or so-called "matrix-free" methods with the iterative
// solvers in Cusp.  In this example, we illustrate a matrix-free
// implementation of a simple 5-point finite-difference stencil,
//
//                [  0 -1  0 ]
//                [ -1  4 -1 ]
//                [  0 -1  0 ]
//
// using a CUDA kernel.  We combine the linear_operator with the
// Conjugate Gradient method to solve a 2D Poisson problem.
__global__
void stencil_kernel(int N, const float * x, float * y)
{
    // compute y = A*x, where A is the 5-point stencil
    // note: pre-caching a window of x into __shared__ memory
    // would make this a lot faster.
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    int j = blockDim.y * blockIdx.y + threadIdx.y;
    if (i < N && j < N)
    {
        // linear index into 2D grid
        int index = N * i + j;
        float result = 4.0f * x[index];         // center point
        if (i > 0    ) result -= x[index - N];  // lower neighbor
        if (i < N - 1) result -= x[index + N];  // upper neighbor
        if (j > 0    ) result -= x[index - 1];  // left neighbor
        if (j < N - 1) result -= x[index + 1];  // right neighbor
        // write result
        y[N * i + j] = result;
    }
}
class stencil : public cusp::linear_operator<float,cusp::device_memory>
{
public:
    typedef cusp::linear_operator<float,cusp::device_memory> super;
    int N;
    // constructor
    stencil(int N)
        : super(N*N,N*N), N(N) {}
    // linear operator y = A*x
    template <typename VectorType1,
             typename VectorType2>
    void operator()(const VectorType1& x, VectorType2& y) const
    {
        // obtain a raw pointer to device memory
        const float * x_ptr = thrust::raw_pointer_cast(&x[0]);
        float * y_ptr = thrust::raw_pointer_cast(&y[0]);
        dim3 dimBlock(16,16);
        dim3 dimGrid((N + 15) / 16, (N + 15) / 16);
        stencil_kernel<<<dimGrid,dimBlock>>>(N, x_ptr, y_ptr);
    }
};
int main(void)
{
    // number of grid points in each dimension
    const int N = 10;
    // create a matrix-free linear operator
    stencil A(N);
    // allocate storage for solution (x) and right hand side (b)
    cusp::array1d<float, cusp::device_memory> x(A.num_rows, 0);
    cusp::array1d<float, cusp::device_memory> b(A.num_rows, 1);
    // set stopping criteria:
    //  iteration_limit    = 100
    //  relative_tolerance = 1e-6
    cusp::monitor<float> monitor(b, 100, 1e-5, 0, false);
    // solve the linear system A * x = b with the Conjugate Gradient method
    cusp::krylov::cg(A, x, b, monitor);
    return 0;
}

Additional Resources

This guide only scratches the surface of what you can do with Cusp. The following resources can help you learn to do more with Thrust or provide assistance when problems arise.

Comprehensive Documentation of Cusp's API
A list of Frequently Asked Questions
Collection of example programs

We strongly encourage users to subscribe to the cusp-users mailing list. The mailing list is a great place to seek out help from the Cusp developers and other Cusp users.