CUDA Swarm Simulator A lightweight simulator for swarms using CUDA

Notes: The code for the following simulation can be found on my GitHub page. Some knowledge of programming, parallel computing, and OpenGL may be useful when reading.


In the summer of 2015 I decided to begin work on a simulator to aid my lab in research on human-swarm interaction (HSI). Swarms represent a different sort of system to simulate than most other types of robotics in that it involves large numbers of independent agents with limited abilities, instead of a single complex robot. Thus, it required a different sort of simulation engine than those commonly used in the scientific literature. I had used previous simulations, from the Stage Robot Simulator to a crude, self-designed Java-based web simulation for previous studies.

The new simulator, however, would make use of GPUs to allow for parallel simulation of each robot during a step in a control loop. In other words, instead of having the CPU cycle through each robot in the swarm to process its next movement, the GPU would process each robot simultaneously, then update their positions and velocities together. The two main reasons for this implementation were:

  1. To increase the number of robots that could be simulated at any one time. The previous simulations I had used were limited to < 200 robots, due to increasing computational demands. The new, CUDA-based simulator initially allowed for over 1000 robots, and through later optimization, up to 16,000.
  2. To have fun. Parallel computing had been a growing interest of mine for several years, and I now finally had a good excuse to dive into it.

Creating a simulation using the CUDA framework as the computational backend, and OpenGL for the graphics, forced me to significantly expand my programming knowledge, but ultimately proved extremely useful for my work in the Usability Lab at the University of Pittsburgh. CUDA is NVIDIA’s proprietary framework for performing general purpose GPU computing. Projects using the simulator have now spawned two papers submitted to the 2016 IEEE Systems, Man, and Cybernetics conference, and allowed me to make significant progress toward finishing my dissertation.


How Does it Work?

Details on the CUDA architecture can be found on NVIDIA’s CUDA website, and so I will not discuss them here. Instead, I will (very roughly) lay out the overall architecture of the simulator.

The simulation displays everything in the simulated 2-dimensional world to the user in an OpenGL environment. A 3-dimensional mode is planned for later, but has not been implemented because it is not yet needed. This includes drawing the world boundaries, obstacles, and robots themselves. The OpenGL environment also accepts input from the users so that it can be used as an interface for user control studies in HSI. This environment is in change of the high-level operations of the simulator, including scheduling simulation steps of the control loop, to logging user input and swarm information between every step (or as dictated by the user in a separate parameters file).

The operations of the robots themselves is handled on the GPU through the execution of CUDA kernels. This is made simple by the CUDA-OpenGL interoperability provided by the CUDA framework, which allows the robot data to be stored on the GPU for both computation and output to the display, without the need to copy the updated data back to the system memory after every step. The simulation runs with a 60Hz control loop, and each step of the control loop follows this rough pattern (see figure):

  1. First, two kernels are executed in simultaneous CUDA streams. The first kernel performs the primary robot operations, such as sensing neighbors, determining goal vector (heading and speed), and moving forward one step based on this goal vector. The second kernel handles extra operations used for selecting robot leaders, computing the adjacency matrix, etc.
  2. Within each main kernel step, each robot first determines it’s current control law set based on the parameters of the experiment, then samples each neighboring robot and uses each robot’s data to update its goal vector in accordance with the control law. Finally, before updating its position, it will use simulation parameters to alter its goal vector (e.g. cap the goal speed to a maximum velocity) before then updating its position.

Outline of a single step in the simulator’s operation.

After the two kernels complete, OpenGL updates the display with the new robot positions, and any global operations are taken (such as data logging, or processing user input).

Finally, the simulation allows a user to run it without the GUI shown. In this case, the simulation is not bound by the 60Hz limit imposed by the OpenGL GUI. Instead, steps can be run as fast as the GPU can process them. This is useful for cases where I need to debug operations over many steps, or future cases where path planning may be done offline.



The NVIDIA Visual Profiler is an excellent tool to help optimize CUDA code, especially for those just learning parallel programming. Using the tool I was able to determine that the two main parts of the code preventing faster simulation were the lack of full use of shared memory, and the need to overlap memory transfers with computation. NVIDIA GPUs allow kernel threads within a single block to all access the same pool of shared memory, which is significantly faster to read and write from the global GPU memory. The main limitation, however, is that the size of shared memory is fairly limited, so it must be used intelligently. Overlapping shared memory with computation is simpler, however, once I discovered the cudaMemcpyAsync() methods.


An example of analysis done by the NVIDIA Visual Profiler on an early version of cuSwarmSim. This analysis was purposefully performed with a very high number of robots to highlight optimization problems. The profiler allows you to view the timeline of the program’s execution, and gives guided analysis towards improving performance. In particular, this example suggests overlapping memory copies with compute executions. The low memory copy throughput is due to the nature of the simulators design: frequent small memory copies do not lend themselves well to high throughput.

Using the tools from the Visual Profiler, I implemented several optimizations to improve both the speed and number of robot the simulation can handle. The most significant of which utilized the shared memory I just mentioned. Because parallel simulation of a swarm is very similar to n-body simulations, I took inspiration from a blog post on fast N-body simulations to implement a method of speeding up the time it takes each robot to sample every other member of the swarm to perform neighbor operations. Under this method, the kernel pulls data from “tiles” of robots–each the size of a block–into shared memory. Each robot then computes the interactions between itself and each member of the tile, then when all robots have completed, a new tile is loaded. Below is a rough outline of how this occurs:

// Index of this robot
uint i = blockIdx.x * blockDim.x + threadIdx.x;

__shared__ float4 s_pos[BLOCK_SIZE];
__shared__ float3 s_vel[BLOCK_SIZE];
__shared__ int s_mode[BLOCK_SIZE];

// Iterate through blocks to use shared memory within a block
for (uint tile = 0; tile < gridDim.x; tile++) {

    // Assign shared memory for this block
    uint n = tile * blockDim.x + threadIdx.x;
    s_pos[threadIdx.x] = pos[n];
    s_vel[threadIdx.x] = vel[n];
    s_mode[threadIdx.x] = mode[n];

    // Synchronize threads after shared memory is assigned

    // Iterate through all robots in this tile
    for (uint ti = 0; ti < blockDim.x; ti++) {
        // Do not perform an interaction between this robot and itself
        if (i != tile * blockDim.x + ti) {
            ...operations go here....

As mentioned previously, another optimization is to allow for asynchronous memory transfers between the GPU and system memory. What this means is that the GPU can execute a memory transfer, then continue further executions without waiting for the memory transfer to finish. This can be useful in situations where, for instance, the program wishes to write robot data to a log file, but such an operation would take longer than a control step to make. Without asynchronous transfers, the simulation would pause which the memory transfer and logging occurred. This can be done in CUDA easily with the following example:

// Synchronous memory transfer. Further execution is blocked until the transfer completes
cudaMemcpy(positions, d_positions, n * sizeof(float4), cudaMemcpyDeviceToHost);

// Asynchronous memory transfer. Further execution is not blocked.
cudaMemcpyAsync(velocities, d_velocities, n * sizeof(float3), cudaMemcpyDeviceToHost);

While such an operation can be extremely useful, it must be avoided in some cases. For instance, if a GUI element is to be drawn using robot data from the GPU, the system memory will need to be updated after every step with this information. Thus, further steps should not be executed until this data transfer completes.

After making these optimizations, the remaining complaints from the profiler were of low memory copy throughput, and a high number of registers used per thread. The prior problem is most likely due to the nature of the simulation: because GUI elements require updated summary information about the robots, small batches of data needed to be transferred back to the system side between each step, and small, frequent memory transfers will mean low throughput. The latter problem is one I am currently addressing.


What’s Next?

Because my current work deals with swarms in two dimensions, I have not focused on adding support for three. However, this is the main direction I plan to take once the current batch of experiments is done. I would also like a way for the user to specify control laws in separate text files so that new behaviors can be added without changing the source code. The code is available on GitHub. Please visit my GitHub page to see other projects.