AVASX: A GPU-accelerated High-performance Beam Dynamics Simulation Code Using Hybrid Tracking Methods for Ion Linear Accelerators
Yuan Tian, Chao Jin, Zhongyi Li, Changwei Hao, Xin Qi, Yaxin Hu, Zhijun Wang, Yuan He
Submitted 2025-11-03 | ChinaXiv: chinaxiv-202511.00004 | Original in English

Abstract

In the study of beam dynamics for high-intensity linear accelerators, 3D multi-particle simulations based on the particle-in-cell (PIC) algorithm are necessary due to the numerous nonlinear effects caused by strong space charge effects. However, since such simulations require a large amount of computation, simulations running on traditional CPUs (Central Processing Units) are inefficient and time-consuming, which restricts their simulation scales and applications. In this work, a high-performance beam dynamics simulation code, AVASX (Advanced Virtual Accelerator Software X), designed on the basis of the CPU-based code AVAS (Advanced Virtual Accelerator Software), is developed with CUDA (Compute Unified Device Architecture) and runs on NVIDIA GPUs (Graphics Processing Units). AVASX enables adaptive and dynamic switching between two particle tracking methods, one using time as the independent variable (t-code) and the other using position (z-code). This switching mechanism optimizes the trade-off between simulation accuracy and computational efficiency. To improve the computing performance of charge deposition when performing atomic operations on GPU global memory, three optimization schemes, namely the thread aggregation method, the strategy of scattering data processing scopes, and the approach of using duplicate memory instances, are proposed in this work and profiled using NVIDIA Nsight Compute to verify their effectiveness. According to the test results of simulating different beamlines, AVASX running on an A100 GPU is reliable and achieves a 174-550× speedup over AVAS running on 56 physical CPU cores, reducing simulation durations from days or hours to a few minutes or seconds. Based on AVASX, simulation tasks such as accelerator design, fault compensation, and machine learning dataset generation can be performed efficiently.

Full Text

Preamble

Yuan Tian,1,2,3 Chao Jin,1,2,3 Zhongyi Li, Changwei Hao, Xin Qi,1,2,3 Yaxin Hu, Zhijun Wang,1,2,3 and Yuan He1,2,3
1 Institute of Modern Physics, Chinese Academy of Sciences, Lanzhou 730000, China
2 School of Nuclear Science and Technology, University of Chinese Academy of Sciences, Beijing 101408, China
3 Advanced Energy Science and Technology Guangdong Laboratory, Huizhou 516000, China
These authors contributed equally to this work and should be considered co-first authors.

Keywords

Beam dynamics simulation, Ion linear accelerators, Particle-in-Cell algorithm, GPU-accelerated code

1 INTRODUCTION

Accelerator-driven sub-critical systems (ADS) have been proposed as a promising technological path to achieve cleaner nuclear energy sources [1-4]. The China initiative Accelerator Driven System (CiADS) program aims to achieve accelerator-target-reactor coupling experiments and conduct nuclear waste transmutation research. The CiADS driving linear accelerator is designed to accelerate a 5 mA proton beam to 500 MeV in continuous wave mode. For such high-intensity, high-power accelerators, beam loss control represents a critical challenge. Beam halo formation, induced by nonlinear factors such as space charge effects and nonlinear radio frequency fields, constitutes the primary cause of beam loss. To study beam dynamics with space charge effects in accelerator physics, 3D multi-particle simulations based on the Particle-in-Cell (PIC) algorithm [11] are widely adopted in this field.

In order to accurately represent the underlying physics, a large number of particles (macro-particles) are required in the simulations. However, the simulations running on traditional CPUs (Central Processing Units) are always inefficient and time-consuming, restricting their applications in critical areas such as accelerator design [12, 13], fault compensation [14], and machine learning dataset generation [15-18].

Recent advances in the high-performance computing capacity of Graphics Processing Units (GPUs) have promoted their widespread adoption in scientific computing applications. The parallel nature of large-scale multi-particle simulations aligns well with the massive parallelism of GPUs, enabling significant computational efficiency gains. Driven by the growing demand for high computing speed and large simulation scale in beam dynamics, several GPU-accelerated codes have been developed for accelerator simulations, including WarpX [19], Elegant [20], PARMILA [21], and ImpactX [22]. WarpX, awarded the 2022 Gordon Bell Prize in high-performance computing, is a fully electromagnetic PIC code using time as the independent variable (t-code) in particle pushing, while Elegant, PARMILA, and ImpactX are s-based beam dynamics codes using the longitudinal coordinate as the independent variable (z-code). Generally speaking, t-code methods provide high precision in beam dynamics but take a long computation time, while z-code methods have high computational efficiency but lose precision when the energy of particles is low [23, 24]. Therefore, the implementation of a hybrid particle tracking methodology for comprehensive simulations can effectively achieve an optimal balance between computational efficiency and simulation accuracy. In addition, the optimizations of calculating charge deposition on GPUs in WarpX, Elegant, and PARMILA were not fully considered. In WarpX and PARMILA, only hardware-accelerated atomic operations, working in a mutually exclusive manner, were employed to accumulate the charges from different particles onto the grid points allocated in GPU global memory (device memory, an off-chip storage).

Although the hardware-accelerated atomic operations provide a relatively efficient solution for the mutually exclusive accumulation of charges, the computing performance of charge deposition may not be optimal due to massive memory access conflicts and the high latency of global memory access. In Elegant, duplicate memory instances allocated in GPU shared memory (cache memory, an on-chip storage) were used to reduce the memory access latency of performing atomic operations. However, because the size of shared memory is limited, the number of grid points that can be stored in shared memory is restricted. Furthermore, the excessive use of shared memory can reduce the compute throughput of GPUs, thereby decreasing the computing performance of charge deposition. Therefore, optimizations of calculating charge deposition with atomic operations performed on global memory are necessary for removing the limitations of using shared memory while increasing the computing performance.

In this work, a GPU-accelerated high-performance beam dynamics simulation code, AVASX (Advanced Virtual Accelerator Software X), was developed based on our previously released CPU-based code, AVAS (Advanced Virtual Accelerator Software) [25]. In AVASX, depending on the configuration of beamline elements, particles can be pushed in one of three modes, that is, t-code, z-code, or hybrid mode, meeting the requirements of computational efficiency and simulation accuracy. To reduce the memory access conflicts from the massive threads performing atomic operations on global memory when calculating charge deposition, three optimization schemes were employed in this work: the thread aggregation method, the strategy of scattering data processing scopes, and the approach of using duplicate memory instances. These optimization schemes were profiled using NVIDIA Nsight Compute to verify their effectiveness, and the results showed that they all play an important role in improving the computing performance of charge deposition. To verify the reliability of AVASX's simulation results, the simulation results of CAFe (Chinese ADS Front-end Demo Linac) and CiADS from AVASX were compared to those from AVAS. The comparison showed that AVASX is reliable and no accuracy loss occurs in GPU-accelerated simulations. By comparing the simulation durations of AVASX and AVAS, we found that our GPU-accelerated code running on an NVIDIA A100 GPU achieved a 174-550× speedup over the CPU-based code running on 56 CPU cores.

This paper is organized as follows. In Section 2, we briefly introduce the GPU programming model. In Section 3, we elaborate on the implementation details of AVASX, including the space charge effects, external field interpolation, and particle pushing. In Section 4, we verify the reliability of AVASX and analyze the effectiveness of the optimizations in the charge deposition calculation.

2 OVERVIEW OF GPU PROGRAMMING WITH CUDA

CUDA (Compute Unified Device Architecture), developed by NVIDIA, is a parallel computing platform and programming model that enables NVIDIA GPUs to act as general-purpose processors and solve complex computational problems faster than CPUs. With CUDA, programmers can use standard programming languages such as C/C++ to define functions, called kernels, that launch massive numbers of threads executed on GPUs. In kernels, threads are grouped into thread blocks and organized into a 1D, 2D, or 3D array in each block according to the layout of the data to be processed. Since the number of SMs (Streaming Multiprocessors, each containing a set of physical processing units called CUDA cores) on a GPU is limited, blocks are scheduled in batches onto SMs to be processed. Within a block, every 32 threads form a basic scheduling unit, called a warp, and warps are processed alternately on CUDA cores. The threads in a warp run in an SIMT (Single Instruction Multiple Thread) manner, that is, all threads execute the same instruction in an instruction cycle while processing different data.
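As a concrete illustration of the programming model described above, the following minimal kernel (not taken from AVASX; all names are illustrative) launches a 1D grid of blocks in which each thread scales one array element; consecutive groups of 32 threads form the warps that the SMs execute in SIMT fashion.

```cuda
#include <cuda_runtime.h>

// Minimal kernel: each thread scales one array element (illustrative, not AVASX code).
__global__ void scaleKernel(double* data, double factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
    if (i < n) data[i] *= factor;
}

int main() {
    const int n = 1 << 20;
    double* d_data = nullptr;
    cudaMalloc(&d_data, n * sizeof(double));

    // Threads are grouped into 1D blocks of 256; blocks are scheduled in batches onto the
    // SMs, and every 32 consecutive threads of a block form a warp executed in SIMT mode.
    int threadsPerBlock = 256;
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    scaleKernel<<<blocks, threadsPerBlock>>>(d_data, 2.0, n);
    cudaDeviceSynchronize();

    cudaFree(d_data);
    return 0;
}
```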

The memory system in CUDA consists of global memory, constant memory, texture memory, local memory, shared memory, and registers. Global memory, constant memory, texture memory, and local memory all reside in device memory, which has a low memory bandwidth and a high access latency. Shared memory and registers are on-chip storages with much higher bandwidth and much lower latency but smaller capacity than the off-chip storages. Global memory is the most plentiful storage and acts as the primary data provider and the only data output channel for threads. Generally, the computing performance of kernels is constrained by the pattern of global memory access. On GPUs, the memory requests issued by the threads within the same warp to global memory are served by one or more memory transactions. The number of memory transactions depends on the distribution of the memory addresses across the threads. If the memory addresses are more concentrated, the memory requests of the threads can be coalesced and served by fewer memory transactions; otherwise, more memory transactions are needed. More memory transactions always lead to a higher access latency and a lower memory bandwidth utilization, thereby reducing the compute throughput of kernels. Therefore, the optimization of global memory access is most important for achieving high computing performance of the kernels executed on GPUs. Because the performance considerations in this paper only involve global memory access, the details of the other storages can be found in the CUDA C++ Programming Guide [26].
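The effect of the access pattern can be made concrete with the two kernels below (a sketch for illustration only, not part of AVASX): the first issues fully coalesced requests, while the second scatters the addresses of a warp across many 32-byte sectors and therefore needs more memory transactions for the same amount of useful data.

```cuda
__global__ void coalescedRead(const float* __restrict__ in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];                            // adjacent threads read adjacent addresses
}

__global__ void stridedRead(const float* __restrict__ in, float* out, int n, int stride) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[(size_t)i * stride % n];       // addresses within a warp are scattered
}
```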

3 IMPLEMENTATION DETAILS

In this section, the implementation details of calculating space charge effects, external field interpolation, and particle pushing on GPUs are introduced. The threads in the kernels calculating charge deposition, field interpolation, and particle pushing are organized into 1D arrays in blocks, working one-to-one with particles, while those in the field solve kernel are organized into 3D arrays, working one-to-one with PIC grid points. For coalesced global memory access, the particle data are stored in a structure-of-arrays (SoA) manner, and the field data are stored in an array-of-structures (AoS) manner (explained in Subsection 3.2).

3.1 Space Charge Effects

The calculation of space charge effects for accelerator beams typically follows three steps:
1) Charge deposition: depositing particle charges on PIC grid points to obtain the charge density distribution.
2) Space charge field solve: solving the Poisson equation using the charge density distribution to obtain the electric potentials at all grid points and then calculating the space charge field from the potential gradients.
3) Space charge field interpolation: interpolating the space charge field from the nearest grid points to the particle positions.

3.1.1 Charge Deposition

In the charge deposition step, each particle deposits its charge on all vertices of the grid cell in which it is located through a weighted allocation method. A 2D allocation method is illustrated in the figure, where the normalized lengths in the x-direction and y-direction are projected from the distance between the particle center and the left-bottom vertex of the cell. For any vertex in the figure, its charge allocation weight is determined by the area of the rectangle formed by the particle center and the vertex itself. The charge deposited on each vertex is then calculated by multiplying the particle charge by the allocation weight of the vertex. For a vertex shared by multiple cells, the total charge deposited on this vertex is accumulated from the charges contributed by the particles located in those cells. In a parallel computing system, the accumulation of different charges on the same grid point must be performed in a mutually exclusive manner. Although the CUDA toolkit provides a double-precision floating-point function named atomicAdd to achieve the mutually exclusive accumulation of charges, the computing performance of charge deposition may not be good due to the large number of memory access conflicts caused by massive threads. In order to improve the computing performance of charge deposition on GPUs, three optimization schemes were employed in our algorithm.
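For reference, a straightforward (unoptimized) deposition kernel of the kind discussed above could look as follows; this is a simplified 2D sketch with illustrative names rather than the AVASX implementation, and every conflicting update is serialized by atomicAdd.

```cuda
// Illustrative 2D cloud-in-cell deposition kernel (a simplified sketch, not the AVASX
// source). Each thread handles one particle and scatters its charge onto the four
// vertices of the occupied cell with area weights; atomicAdd serializes conflicting
// updates. Note: double-precision atomicAdd requires compute capability 6.0 or newer.
__global__ void depositCharge2D(const double* __restrict__ x,
                                const double* __restrict__ y,
                                double charge, double dx, double dy,
                                int nx, double* rho, int nParticles)
{
    int p = blockIdx.x * blockDim.x + threadIdx.x;
    if (p >= nParticles) return;

    int ix = static_cast<int>(x[p] / dx);       // left-bottom vertex of the occupied cell
    int iy = static_cast<int>(y[p] / dy);
    double wx = x[p] / dx - ix;                 // normalized distances inside the cell
    double wy = y[p] / dy - iy;

    // Area weights of the four vertices of the cell containing the particle.
    atomicAdd(&rho[iy * nx + ix],           charge * (1.0 - wx) * (1.0 - wy));
    atomicAdd(&rho[iy * nx + ix + 1],       charge * wx * (1.0 - wy));
    atomicAdd(&rho[(iy + 1) * nx + ix],     charge * (1.0 - wx) * wy);
    atomicAdd(&rho[(iy + 1) * nx + ix + 1], charge * wx * wy);
}
```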

The first optimization scheme is the thread aggregation method, which was previously used to simulate the energy deposition of grains bombarded by a high-energy beam [27]. The basic idea of this method is to assign specific threads in a CUDA warp as data collectors that collect data from the non-collector threads within the same warp. Only collector threads are required to perform atomic functions to output final results to GPU global memory. The figure shows an example of using the thread aggregation method to calculate the charge deposition in a multi-particle system. Assuming that the particles in the figure are processed by a set of threads within the same warp, the threads are partitioned into six groups according to the cells in which the particles are located.

A collector thread is assigned to each group and accumulates the charges from the threads belonging to the same group in its own register.

Finally, each collector thread outputs the accumulated charges to the grid points through the atomic function atomicAdd. Compared to the case where all threads access global memory through atomic functions, the thread aggregation method can effectively reduce the memory access conflicts caused by threads accessing the same memory addresses, which may lead to a computing performance enhancement. The procedure of using the thread aggregation method to calculate charge deposition in our algorithm can be divided into two steps. Two CUDA built-in data shuffling (warp shuffle) functions, which transfer register data between the threads within the same warp, are employed in these two steps.

The first step, illustrated in the figure, is to determine the collector threads in a warp. The elements of an array are used to indicate which threads are data collectors and are initialized to 1. The collector threads are determined after running a loop of n cycles, where n is equal to the number of threads minus one. The loop variable δ, with an initial value of 1, is used to determine the data sources for the threads calling the shuffle function. In each loop cycle, a thread whose identifier is greater than or equal to δ is a fetcher thread and fetches the grid index from the thread whose identifier is obtained by subtracting δ from the fetcher's identifier. For any fetcher thread, if the fetched grid index is equal to its own, it is not a collector thread and its corresponding element in the array is set to 0. At the end of the loop, the threads whose corresponding elements in the array are still equal to 1 are selected as the collector threads.

Figure: (a) partitioning threads (particles) into groups and assigning a collector thread to each group to accumulate charges; (b) employing the warp shuffling function in cycles to select a collector thread for each group; (c) employing the warp shuffling function in cycles to accumulate charges for the collector threads.

Once the collector thread is determined for each thread group, the collector thread accumulates the charges from the non-collector threads within the same group in the second step, which is also illustrated in the figure. In the second step, a new loop, which has the same number of cycles as in the first step, is performed to accumulate charges. When the identifier of a collector thread plus the loop variable δ does not exceed the largest thread identifier, the collector thread fetches the grid index and charge through the shuffle function from the thread whose identifier is obtained by adding δ to the collector's identifier. The collector thread accumulates the fetched charge in its register if the fetched grid index matches its own. At the end of the loop, the charges deposited on grid points by all particles are stored in the registers of the collector threads and are output by the collector threads through the function atomicAdd. The purpose of the first optimization scheme is to reduce memory access conflicts by reducing the number of threads performing atomic functions.
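A device-side sketch of this two-step warp aggregation is given below. The text does not name the shuffle intrinsics, so __shfl_up_sync and __shfl_down_sync are assumed here, and all identifiers are illustrative; the logic only mirrors the procedure described above (the lowest lane of each cell group becomes the collector and sums the charges of the matching higher lanes before issuing a single atomicAdd).

```cuda
// Sketch of warp-level aggregation before atomicAdd (assumes a full warp of 32 active
// threads). Each thread holds the linear index of its particle's cell (cellIdx) and the
// charge it wants to deposit (q).
__device__ void warpAggregateDeposit(int cellIdx, double q, double* rho) {
    const unsigned mask = 0xffffffffu;
    int lane = threadIdx.x & 31;
    bool isCollector = true;

    // Step 1: a thread is NOT a collector if any lower lane handles the same cell.
    for (int delta = 1; delta < 32; ++delta) {
        int srcCell = __shfl_up_sync(mask, cellIdx, delta);
        if (lane >= delta && srcCell == cellIdx) isCollector = false;
    }

    // Step 2: collectors accumulate the charges of higher lanes with the same cell index.
    double sum = q;
    for (int delta = 1; delta < 32; ++delta) {
        int srcCell = __shfl_down_sync(mask, cellIdx, delta);
        double srcQ = __shfl_down_sync(mask, q, delta);
        if (lane + delta < 32 && srcCell == cellIdx) sum += srcQ;
    }

    // Only collectors touch global memory, reducing the number of atomic conflicts.
    if (isCollector) atomicAdd(&rho[cellIdx], sum);
}
```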

The second scheme reduces memory access conflicts by scattering the data processing scopes of thread blocks. Because the number of physical processing units on a GPU is limited, blocks are scheduled in batches to be processed. In a batch of scheduled blocks, if the particles processed by different blocks are located in adjacent cells, memory access conflicts may occur among the threads belonging to those blocks. A possible way to reduce the memory access conflicts in a batch of scheduled blocks is illustrated by the example in the figure. Assume that the particles in the figure are processed by four different blocks and that the blocks are scheduled in two batches. In the case shown in the upper left corner of the figure, when the particles processed by the different blocks within the same batch are located in adjacent cells, memory access conflicts are present on the common vertices of those cells. However, if the data processing scopes of the blocks are changed, as shown in the lower left corner of the figure, the memory access conflicts can be reduced because the memory access addresses of the threads are scattered.

To realize the second scheme, a strategy of scattering the data processing scopes of blocks is designed in our algorithm, as illustrated in the figure. Assuming that the blocks in the figure are scheduled in three batches, the blocks belonging to the same batch are marked with the same color. For each block, its data processing scope is determined by both the quotient and the remainder of dividing its identifier by the number of batches. Through the formula in the figure, where 5 is the size of a batch, the data processing scopes of the blocks within the same batch are scattered to a certain degree. This means that when the particles in adjacent data processing scopes are located in adjacent cells, scattering the scopes may result in the particles processed by the blocks within the same batch being located in non-adjacent cells.

Therefore, the memory access conflicts among threads from different blocks can be reduced.
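The exact remapping formula is given only in the figure, but one plausible realization consistent with the description (based on the quotient and remainder of the block identifier divided by the number of batches; all names here are assumptions) is:

```cuda
// One plausible realization of the scope-scattering remap (the exact formula appears
// only in the figure; numBatches and batchSize are assumed names).
__device__ int scatteredBlockScope(int blockId, int numBatches, int batchSize) {
    int quotient  = blockId / numBatches;   // position of the block inside its batch
    int remainder = blockId % numBatches;   // which batch the block belongs to
    // Blocks of the same batch are mapped to scopes that lie batchSize apart, so their
    // particles are unlikely to share adjacent grid cells.
    return remainder * batchSize + quotient;
}
```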

The last scheme, which is a common optimization solution in parallel computing systems, employs duplicate memory instances to scatter memory access addresses and thus reduce memory access conflicts, as illustrated in the figure. In our algorithm, several duplicate memory instances of the grid points are allocated in global memory according to the number of warps in a block. The threads from different warps output their charges to different memory instances, so the memory access conflicts can be reduced. Before solving the electric potentials at all grid points, all memory instances are merged into a single one by a sum reduction to obtain the complete charge density distribution, as shown in the figure.
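A minimal sketch of this third scheme is shown below (illustrative names, not the AVASX source): each warp of a block writes into its own copy of the charge array, and the copies are summed into one before the field solve.

```cuda
// Sketch of the duplicate-instance idea: each warp writes into its own copy of the grid.
__global__ void depositToWarpCopies(const int* cellIdx, const double* q,
                                    double* rhoCopies, int nCells, int nParticles) {
    int p = blockIdx.x * blockDim.x + threadIdx.x;
    if (p >= nParticles) return;
    int warpInBlock = threadIdx.x / 32;                         // which copy this warp uses
    atomicAdd(&rhoCopies[warpInBlock * nCells + cellIdx[p]], q[p]);
}

// Merge all copies into copy 0 before the Poisson solve (one thread per grid point).
__global__ void mergeCopies(double* rhoCopies, int nCells, int nCopies) {
    int g = blockIdx.x * blockDim.x + threadIdx.x;
    if (g >= nCells) return;
    double sum = 0.0;
    for (int c = 0; c < nCopies; ++c) sum += rhoCopies[c * nCells + g];
    rhoCopies[g] = sum;
}
```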

It should be noted that, except for the scheme using duplicate memory instances, the other two schemes are only effective when all particles are stored in an ordered sequence according to the cells in which they are located. Because the cells occupied by the particles may change after a pushing step, all particles need to be sorted again to preserve the sequential ordering. Although the CUB (CUDA UnBound) library in the CUDA toolkit provides high-performance sorting functions that can accelerate particle sorting, frequent invocations of sorting may degrade the overall computing performance of charge deposition due to the high computational cost of sorting, even making the performance worse than that without optimizations. In order to minimize the computing performance degradation caused by sorting, particles are sorted every n pushing steps in our algorithm. Although the optimizations cannot perform optimally during the pushing steps between two sorting operations, they still have an effect because some particles remain in ordered sequences.
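A host-side sketch of such a periodic sort with CUB is shown below; the buffer names are illustrative, and cub::DeviceRadixSort::SortPairs is called once to query its temporary storage size before the actual sort, as the CUB interface requires.

```cuda
#include <cub/cub.cuh>

// Sketch of the periodic particle sort: particles are reordered by the linear index of
// the cell they occupy, using CUB's key-value radix sort (illustrative names).
void sortParticlesByCell(int* d_cellKeysIn, int* d_cellKeysOut,
                         int* d_particleIdsIn, int* d_particleIdsOut, int nParticles) {
    void*  d_temp = nullptr;
    size_t tempBytes = 0;
    // First call only queries the required temporary storage size.
    cub::DeviceRadixSort::SortPairs(d_temp, tempBytes,
                                    d_cellKeysIn, d_cellKeysOut,
                                    d_particleIdsIn, d_particleIdsOut, nParticles);
    cudaMalloc(&d_temp, tempBytes);
    // Second call performs the actual sort on the device.
    cub::DeviceRadixSort::SortPairs(d_temp, tempBytes,
                                    d_cellKeysIn, d_cellKeysOut,
                                    d_particleIdsIn, d_particleIdsOut, nParticles);
    cudaFree(d_temp);
}
```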

3.1.2 Space Charge Field Solve

In our algorithm, the Poisson equation is solved through a spectral method using discrete Fourier transforms. The method can be demonstrated via a one-dimensional example. Given a grid side length of Δx, the Poisson equation can be discretized as:

-\frac{\rho_j}{\varepsilon} = \frac{\phi_{j-1} - 2\phi_j + \phi_{j+1}}{\Delta x^2}    (1)

where ρ_j and φ_j represent the charge and the electric potential at grid point j, respectively. By applying the discrete Fourier transform (DFT) to Eq. (1), we obtain:

\phi_j = \frac{1}{N\Delta x} \sum_{n=0}^{N-1} \exp\left(-i\,\frac{2\pi n j}{N}\right) \phi_n    (2)

where N represents the number of grid cells and φ_n represents the electric potential in wavevector space. The Poisson equation in wavevector space is then:

\phi_n = \frac{\rho_n}{\varepsilon K_n^2}    (3)

where ρ_n represents the charge in wavevector space, and K_n^2 is given by:

K_n^2 = k_n^2 \left[\frac{\sin(k_n \Delta x / 2)}{k_n \Delta x / 2}\right]^2    (4)

where k_n = 2\pi n / (N \Delta x). Once φ_n is solved in wavevector space through Eq. (3), φ_j can be obtained by applying the inverse discrete Fourier transform (IDFT) to φ_n, and the space charge field at each grid point is then calculated from the potential gradients. According to the description above, the procedure of the field solve is divided into four steps:
1) Transforming the charges at all grid points from real space to wavevector space by DFT.
2) Solving the electric potentials using the transformed charges in wavevector space.
3) Transforming the electric potentials from wavevector space back to real space by IDFT.
4) Calculating the space charge field at each grid point by the finite difference method (FDM).

On GPUs, the CUDA toolkit includes a fast Fourier transform (FFT) library known as cuFFT, which provides high-performance functions for DFT and IDFT. In our algorithm, the charges at all grid points are first transformed to wavevector space by a three-dimensional DFT through the function cufftExecC2C with the parameter CUFFT_FORWARD. The electric potential at each grid point in wavevector space is then calculated using Eq. (3), with each thread assigned to one grid point. After that, the electric potentials are transformed back to real space via IDFT using cufftExecC2C with the parameter CUFFT_INVERSE. Finally, each thread calculates the space charge field for one grid point by FDM based on the electric potentials.
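A host-side sketch of this four-step procedure with cuFFT is given below (buffer names and the omitted scaling kernel are illustrative, not the AVASX source):

```cuda
#include <cufft.h>

// Host-side sketch of the spectral Poisson solve with cuFFT.
void solvePoisson(cufftComplex* d_rho, cufftComplex* d_phi, int nx, int ny, int nz) {
    cufftHandle plan;
    cufftPlan3d(&plan, nx, ny, nz, CUFFT_C2C);

    // 1) charges: real space -> wavevector space
    cufftExecC2C(plan, d_rho, d_rho, CUFFT_FORWARD);

    // 2) divide each mode by eps * K_n^2 following Eq. (3) (one thread per grid point;
    //    the scaling kernel is not shown here)
    // scalePotentialKernel<<<grid, block>>>(d_rho, d_phi, nx, ny, nz);

    // 3) potentials: wavevector space -> real space (cuFFT leaves the inverse transform
    //    unnormalized, so a 1/(nx*ny*nz) factor must be applied somewhere)
    cufftExecC2C(plan, d_phi, d_phi, CUFFT_INVERSE);

    // 4) space charge field from the potential gradients by finite differences (not shown)
    cufftDestroy(plan);
}
```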

3.1.3 Space Charge Field Interpolation

Once the space charge field is obtained, each thread calculates the field for one particle through linear interpolation from the nearest grid points. The weights used in the field interpolation are exactly the same as those used in the charge deposition. To avoid redundant calculations, the weights computed in the charge deposition are stored in global memory and reused in the field interpolation. It should be noted that the field interpolation is only applied to the particles inside the grid domain.

For the particles outside the domain, their fields are calculated by Coulomb's law from an equivalent point charge, where the point charge is the sum of the charges of all particles inside the domain. Therefore, after the field interpolation, the number of particles inside the grid domain is counted via the function cub::DeviceReduce::Sum provided by the CUB library, and then the equivalent point charge is obtained.
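A minimal sketch of this counting step with CUB (the flag array and names are illustrative) looks as follows; as with the sort, the reduction is called twice, first to query its temporary storage size and then to perform the sum on the device.

```cuda
#include <cub/cub.cuh>

// Count the particles inside the grid domain: insideFlags[p] is 1 if particle p lies
// inside the domain, 0 otherwise; d_count points to a single int on the device.
void countParticlesInside(const int* d_insideFlags, int* d_count, int nParticles) {
    void*  d_temp = nullptr;
    size_t tempBytes = 0;
    cub::DeviceReduce::Sum(d_temp, tempBytes, d_insideFlags, d_count, nParticles);
    cudaMalloc(&d_temp, tempBytes);
    cub::DeviceReduce::Sum(d_temp, tempBytes, d_insideFlags, d_count, nParticles);
    cudaFree(d_temp);
}
```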

3.2 External Field Interpolation

When particles enter a beamline element with predefined electromagnetic or magnetic fields, the fields acting on the particles are interpolated with the same method as in the space charge field interpolation. In this stage, the computing performance is constrained by the global memory bandwidth utilization, which can be explained with the example in the figure. Assuming that the particles in the figure are processed by a set of threads within the same warp, all threads load the electromagnetic or magnetic field at the left-bottom vertices of the cells in which the particles are located from global memory. In the first case shown in the figure, the data of each field component are stored in a separate contiguous array in global memory (SoA manner), following row-major ordering according to the linear indexes of the grid points, and the data to be accessed in each array are not completely contiguous. According to the pattern of global memory access on GPUs, the memory requests of the threads loading the data of all field components in this case are coalesced into 9 memory transactions, where each transaction reads a 32-byte sector (a 32-byte aligned memory chunk) from global memory. Because the actually utilized data in each 32-byte sector are 12 bytes, the memory bandwidth utilization is limited to 37.5%. However, if the memory layout of the field is changed such that the data of all field components are stored in a unified contiguous array (AoS manner), the number of memory transactions is reduced from 9 to 6, improving the memory bandwidth utilization from 37.5% to 56.25%. Therefore, in our algorithm, a user-defined floating-point data structure Float6 {Bx, By, Bz, Ex, Ey, Ez} is employed to store the data of all components of an electromagnetic or magnetic field in a unified contiguous array.
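A sketch of such an AoS layout and of a gather kernel using it is given below; the struct member order follows the text, while the kernel and variable names are illustrative.

```cuda
// AoS field layout along the lines of the Float6 structure described above.
struct Float6 {
    float Bx, By, Bz, Ex, Ey, Ez;   // all six components of one grid point, contiguous
};

__global__ void gatherField(const Float6* __restrict__ field,
                            const int* __restrict__ cellIdx,
                            Float6* out, int nParticles) {
    int p = blockIdx.x * blockDim.x + threadIdx.x;
    if (p >= nParticles) return;
    // The six components of one grid point are read from one contiguous 24-byte region,
    // so a warp touches fewer 32-byte sectors than with six separate component arrays.
    out[p] = field[cellIdx[p]];
}
```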

3.3 Particle Pushing

In the stage of particle pushing, whether in the t-code or z-code pushing mode, each thread processes one particle and updates its velocity and position. At the end of a t-code pushing step, the current beamline elements of the particles have to be determined for the next pushing step. Because particles may stride across multiple beamline elements during a t-code pushing step, as illustrated by the first pushing step in the figure, each thread determines the latest element index for its particle by searching forward from the element index cached by the thread for that particle. Our algorithm supports dynamic switching between the t-code and z-code pushing modes. When particles pass through the entrance of a z-code element under the t-code pushing mode, their phase-space coordinates are interpolated at the entrance plane of the element, as illustrated by the second pushing step in the figure, and the pushing mode is switched to the z-code mode after all particles have entered the z-code element. In our algorithm, when a tracer particle (a virtual particle located at the beam centroid, illustrated in orange in the figure) passes through the entrance of a z-code element at a certain t-code pushing step, the number of simulated particles (excluding the tracer particle, illustrated in green in the figure) that have entered the element is counted via the function cub::DeviceReduce::Sum at the following t-code pushing steps. Once the counted number is equal to the total number of simulated particles, the pushing mode is switched from t-code to z-code.

In the z-code pushing mode, for any z-code element, particles are pushed with space charge effects in several steps from the entrance to the outlet of the element. As illustrated in the figure, the particles are pushed in 3 steps in a z-code element of a given length. Except for the first pushing step, the phase-space coordinates of the simulated particles are temporarily converted to real-space coordinates to calculate the space charge effects before each subsequent pushing step, as shown in the figure. When particles are pushed to the outlet of a z-code element, their coordinates are converted from phase-space coordinates to real-space coordinates, and then the pushing mode is switched from z-code back to t-code.

4 RESULTS AND DISCUSSIONS

In this section, the reliability of the GPU-accelerated code was first verified by comparing the simulation results between AVASX and AVAS. Then, the memory bandwidth utilization in the external field interpolation kernel and the optimization effects in the charge deposition kernel were analyzed using NVIDIA Nsight Compute. At last, the computing performance of AVASX was evaluated in CAFe simulations and compared with that of AVAS. The profiling of the kernels and the CAFe simulations were all performed on NVIDIA A100 GPUs.

4.1 Reliability Verification

This section compares the simulation results of AVASX and AVAS. The comparison consists of two parts: (1) the results of AVAS and AVASX are compared on the CAFe superconducting section; (2) a further comparison is carried out on CiADS, which features a higher beam current and a longer structure. Simulations with 1,000,000 and 10,000,000 proton particles are performed on both CAFe and CiADS in order to reduce the influence of particle statistics.

In the CAFe simulations, the total length of the superconducting section is 16.272 m, including field-map solenoids and RF (Radio Frequency) cavities. The bunch used in the simulations had an initial energy of 1.36 MeV, an intensity of 0.27 mA, and a frequency of 162.5 MHz. The dimensions of the PIC grid were set to 128 × 128 × 128. The emittance, rms (root-mean-square) size, and energy of the bunch at the exit of the CAFe superconducting section, simulated by AVASX and AVAS, are listed in the table; clearly, the simulation results of AVASX were almost the same as those of AVAS.

TABLE: Emittance, bunch size, and energy at the exit of the CAFe superconducting section simulated by AVASX and AVAS with 1,000,000 and 10,000,000 particles.

Two figures provide the phase space distributions of the exit bunch, the first obtained from 1,000,000 particles and the second from 10,000,000 particles. In both figures, panels (a)-(c) correspond to the results of AVAS and (d)-(f) to those of AVASX. Obviously, it is hard to distinguish differences between the distributions of AVASX and AVAS, regardless of the number of particles used in the simulations.

Figure: Evolution of the bunch parameters (energy, rms size, and emittance) along the CAFe superconducting section for AVASX and AVAS; the left panels were obtained from the simulations with 1,000,000 particles and the right panels from 10,000,000 particles. The simulation results reveal that AVASX and AVAS exhibit a high degree of consistency throughout the beamline.

To further verify the reliability of AVASX, the beam transport in the CiADS superconducting section, involving stronger space charge effects and a longer transport distance of 202.75 m, was simulated. The bunch used in the simulations was initialized with an energy of 2.1 MeV, an intensity of 5.0 mA, and a frequency of 162.5 MHz. The table below compares the simulation results between AVASX and AVAS, including the energy, rms size, and emittance of the bunch at the exit of the CiADS superconducting section. The results indicate that the energy and emittance of AVASX are nearly identical to those of AVAS, with only a minor difference in rms size.

TABLE: Emittance, bunch size, and energy at the exit of the CiADS superconducting section simulated by AVASX and AVAS with 1,000,000 and 10,000,000 particles.

The comparisons of the phase space distributions of the exit bunch between AVASX and AVAS are shown in two figures, obtained from 1,000,000 and 10,000,000 particles, respectively, where panels (a)-(c) in both figures correspond to the results of AVAS and (d)-(f) to those of AVASX. Although there are some minor but detectable differences at the distribution edges between AVASX and AVAS, their overall phase space distributions of the bunch remain largely consistent.

Since minor differences at the phase space distribution edges exist between AVASX and AVAS, we compared the maximum bunch size along the CiADS superconducting section between these two codes to further verify the reliability of AVASX, as shown in the figure, where the left panels correspond to the results of simulating 1,000,000 particles and the right to those of 10,000,000 particles. It is clearly evident that the maximum bunch size of AVASX is highly consistent with that of AVAS.

In summary, regardless of the beamline length or the strength of the space charge effects, AVASX exhibits highly consistent simulation results with AVAS, despite some minor differences at the edges of the phase space distributions in the CiADS simulations, strongly confirming the reliability of AVASX.

4.2 Memory Bandwidth Utilization in External Field Interpolation

As mentioned in Subsection 3.2, the memory layout in which all field components are stored in a unified contiguous array achieves better memory bandwidth utilization than the layout using separate arrays. To confirm this, two field interpolation kernels, labeled kernel-1 (loading the field data from separate arrays) and kernel-2 (loading the field data from a unified array), were profiled using NVIDIA Nsight Compute. The memory workload metrics of field data loading and the execution durations of these two kernels, both measured from simulating 1,000,000 particles in an electromagnetic field, are reported in the table below.

TABLE: Memory workload metrics of kernel-1 (loading the field data from separate arrays) and kernel-2 (loading the field data from a unified array) in field data loading.

            | Threads to L1 cache                  | L1 cache to L2 cache                 | L2 cache to device memory
            | Memory requests | Requested sectors  | Memory requests | Requested sectors  | Requested sectors | Requested bytes
kernel-1    |    2,812,500    |     6,008,627      |     300,060     |     1,144,828      |     1,125,396     |   36,012,672
kernel-2    |    1,656,250    |     4,501,828      |     290,704     |     1,135,932      |     1,125,176     |   36,005,632

Based on the metrics reflecting the amount of data transferred from device memory to the L2 cache and from the L2 cache to the L1 cache, it can be seen that the amount of data transferred from device memory to the L1 cache through the L2 cache was almost the same in both kernels. However, the number of memory requests from threads to the L1 cache in kernel-2 was obviously smaller than that in kernel-1, resulting in a reduction in the number of requested sectors. This means that when the threads load the same amount of data, the memory bandwidth utilization of kernel-2 is higher than that of kernel-1 because of the fewer requested sectors. Higher memory bandwidth utilization reduces memory access latency and thereby improves the computing performance, as demonstrated by comparing the durations of the two kernels. However, because of the high L1 hit rate in both kernels, the computing performance of kernel-1 was not much different from that of kernel-2.

4.3 Optimization Effects in Charge Deposition

To analyze the optimization effects in charge deposition, the same case simulating 1,000,000 particles was repeated nine times, and each repetition employed one of nine kernels to calculate the charge deposition.

For descriptive convenience, these nine kernels are labeled kernel-3, kernel-4, ..., kernel-11, and their runtime options (whether particles are sorted, threads are aggregated, data processing scopes are scattered, and duplicate memory instances are used) are listed in a dedicated table. The memory workload metrics during the atomicAdd executions, the compute and memory throughputs, the execution durations, and the performance improvements of these kernels are reported in the table below, where the profiling of kernel-7, kernel-8, kernel-9, kernel-10, and kernel-11 was only performed after sorting particles. The performance improvements were measured relative to the computing performance of the simulation employing kernel-3.

TABLE: Memory workload metrics of kernel-3 to kernel-11 during the atomicAdd executions.

            | Threads to L1 cache                              | L1 cache to L2 cache
            | Memory requests | Requested sectors  | Hit rate  | Memory requests | Requested sectors
kernel-3    |     250,000     |     7,832,161      |    0%     |    7,691,097    |     7,832,161
kernel-4    |     250,016     |     7,814,976      |    0%     |    7,673,911    |     7,814,976
kernel-5    |     251,904     |     7,822,528      |    0%     |    7,681,463    |     7,822,528
kernel-6    |     250,000     |     7,832,737      |    0%     |    7,692,102    |     7,832,737
kernel-7    |     250,000     |     7,869,016      |    0%     |    7,842,952    |     7,869,016
kernel-8    |     250,016     |       997,324      |    0%     |      992,204    |       997,324
kernel-9    |     251,904     |     1,004,876      |    0%     |      999,756    |     1,004,876
kernel-10   |     250,016     |       997,324      |    0%     |      992,204    |       997,324
kernel-11   |     251,904     |     1,004,876      |    0%     |      999,756    |     1,004,876

As reported in the second column of the table, the number of memory requests from threads to the L1 cache was almost the same in all kernels. However, as shown in the third column, the number of sectors requested from threads to the L1 cache in kernel-8, kernel-9, kernel-10, and kernel-11 was significantly smaller than that in the others. Although the thread aggregation method described in Subsection 3.1.1 was employed in kernel-4 and kernel-5, it had no effect on reducing the number of requested sectors at all.

The hit rates of all kernels on the L1 cache were 0%, indicating that all requested sectors missed in the L1 cache and were served by the L2 cache, as reflected by comparing the third and sixth columns. The fewer the sectors requested from the L1 cache to the L2 cache were, the smaller the number of memory requests to the L2 cache was, as reported in the fifth column. From the seventh column, it can be seen that the corresponding value in kernel-4, kernel-5, and kernel-6 accounted for 9.90% of the total, while that in kernel-8, kernel-9, kernel-10, and kernel-11 was only 0.01%. From all the above, it can be confirmed that if particles are sorted according to the indexes of the cells in which they are located, the memory addresses accessed by the threads within the same warp are more concentrated, making the thread aggregation method effective.

As shown in the last four columns, the compute throughput and the memory throughput in kernel-3, kernel-4, kernel-5, kernel-6, and kernel-7 were lower than those in the other kernels, which indicated that more warps were stalled due to the memory access conflicts during the atomicAdd executions.

Although the compute throughput in kernel-4 and kernel-5 was higher than that in kernel-3, the memory throughput and the execution duration remained almost unchanged. This means that, under the condition of no particle sorting, the thread aggregation method and the strategy of scattering data processing scopes have no effect on reducing the memory access conflicts caused by the atomicAdd executions, which is consistent with what was previously described in Subsection 3.1.1.

The memory throughput in kernel-6 was slightly higher than that in kernel-3, kernel-4, and kernel-5, leading to a small reduction in the execution duration of kernel-6, which means that using duplicate memory instances has only a very small effect on overcoming memory access conflicts under the condition of no particle sorting. In kernel-7, under the condition that particles were only sorted, the compute and memory throughputs were both the lowest among all kernels, leading to the longest duration of all and resulting in an obvious degradation (-62.36%) in computing performance. This means that, without the thread aggregation method, the memory access conflicts become more serious because of the concentrated memory access addresses. However, when the thread aggregation method was employed in kernel-8, the compute throughput reached 29.95% of the peak throughput and the memory throughput increased to 167.78 GB/s, leading to the computing performance increasing by 222.54%. When the thread aggregation method was combined with one of the other two schemes, the compute and memory throughputs in kernel-9 and kernel-10 were further significantly increased. The performance improvement in kernel-9 (490.57%) was higher than that in kernel-10 (378.96%), meaning that the strategy of scattering data processing scopes can reduce memory access conflicts better than using duplicate memory instances. In kernel-11, where all optimization schemes were employed, the compute and memory throughputs were the highest of all, resulting in the shortest execution duration and an increase of 580.30% in computing performance.

4.4 Computing Performance Tests

In Subsection 3.1.1, we pointed out that, to avoid the computing performance degradation caused by frequent particle sorting, particles are sorted every n pushing steps in our algorithm. To determine a suitable value for n, CAFe simulations using different sorting intervals were first performed with 1,000,000 and 10,000,000 particles. For the best performance, all simulations employed kernel-2 for the external field interpolation and kernel-11 for the charge deposition. The figure shows the sorting intervals and the corresponding simulation durations in terms of the natural logarithm. It can be seen that the simulation duration was the shortest when n = 18 or 20. Therefore, we chose n = 18 as the period of particle sorting.

After determining the suitable value of n, we then tested the influence of the kernels in Subsection 4.3 on the computing performance in CAFe simulations with 1,000,000 particles.

Simulation durations with different sorting intervals in CAFe simulations. (a) Simulation durations of 1,000,000 particles; (b) Simulation durations of 10,000,000 particles.

The test results are reported in the table, where the performance improvement is measured relative to the computing performance of the simulation employing kernel-3.

TABLE: Per-step duration, computing performance (G particles/s), and performance improvement of the CAFe simulations employing the different charge deposition kernels.

As shown in the table, the durations and the computing performances of the simulations employing kernel-3, kernel-4, and kernel-5 were almost the same, proving that the thread aggregation method and the scattering strategy have no effect on increasing the computing performance of the simulations without particle sorting. The computing performance of the simulation employing kernel-6 was comparable to those employing kernel-3, kernel-4, and kernel-5, being consistent with what was previously analyzed. Under the condition of periodically sorting particles in the simulations, the computing performance of the simulation employing kernel-7 decreased by 16.37%. However, when the thread aggregation method was employed, the computing performances of the simulations employing kernel-8, kernel-9, kernel-10, and kernel-11 all increased, and the one using all optimization schemes was the best of all. The performance improvements in the simulations using the optimizations were significantly lower than those in Subsection 4.3. The reason can be explained in two aspects: first, the computational load of charge deposition accounts for about 15% of the total; second, the profiling in Subsection 4.3 was only performed after particle sorting, leading the optimizations to achieve their best effects, while the effects were not optimal in the pushing steps without particle sorting. Nevertheless, the computing performance of the simulation using the optimizations was still increased by up to 21.48%.

Then, two CAFe simulations with 1,000,000 particles were performed to test the influence of the kernels in Subsection 4.2 on the computing performance, where all optimization schemes in calculating charge deposition were employed. The table below shows the test results.

TABLE: Per-step duration, computing performance (G particles/s), and performance improvement of the CAFe simulations employing the two external field interpolation kernels.

It can be seen that the computing performance of the simulation using a unified array in field loading was very slightly higher than that using separate arrays, increasing the performance by only 1.30%. One of the reasons for the small improvement is that the computational load of external field interpolation accounts for about 23% of the total; another is that the high L1 hit rate reduces the memory access latency in both cases.

At last, the computing performance of AVASX was compared with that of AVAS in both CAFe and CiADS simulations with 1,000,000 and 10,000,000 particles. Since AVAS was parallelized based on OpenMP (Open Multi-Processing), it was tested on a server equipped with two Intel Gold 6330 CPUs, each of which contains 28 physical cores. The simulation durations of AVASX and AVAS, along with the corresponding speedup ratios, are listed in the table below.

TABLE: Simulation durations of AVASX and AVAS in the CAFe and CiADS simulations with 1,000,000 and 10,000,000 particles, and the corresponding speedup ratios.

In the simulations with 1,000,000 particles, AVASX running on a single A100 achieved a 472-550× speedup over AVAS running on 56 physical CPU cores. As the number of simulated particles increased to 10,000,000, AVAS fully leveraged the multi-core computing capability of the CPUs, and the speedup decreased to 174-176×. Nevertheless, AVASX still outperforms AVAS by a large margin in computing performance, reducing the simulation durations from days or hours to a few minutes or seconds and providing a high-performance tool for linear accelerator simulations.

5 CONCLUSION

In this paper, a GPU-accelerated high-performance beam dynamics simulation code, AVASX, was introduced. It provides a variety of particle tracking methods to balance simulation accuracy and computational speed. To overcome the performance degradation caused by massive threads performing atomic operations on GPU global memory in the charge deposition calculation, three optimization schemes, namely the thread aggregation method, the strategy of scattering data processing scopes, and the approach of using duplicate memory instances, were introduced in detail.

All optimization schemes were profiled using NVIDIA Nsight Compute and tested in CAFe simulations.

The analysis and simulation results indicate that they play an important role in reducing the memory access conflicts among the threads performing atomic operations on global memory. The reliability of AVASX was confirmed by comparing its simulation results with those of AVAS in CAFe and CiADS simulations. The comparison results showed that AVASX exhibits high consistency with AVAS, indicating that no accuracy loss occurs in GPU-accelerated simulations. By comparing the simulation durations of AVASX and AVAS in CAFe and CiADS simulations with 1,000,000 and 10,000,000 particles, it can be seen that AVASX running on an A100 achieved a 174-550× speedup over AVAS running on 56 physical CPU cores. Thus, AVASX could be a highly useful and efficient tool for linear accelerator simulations.

ACKNOWLEDGEMENT

This work was supported by the National Natural Science Foundation of China (Grant Nos. 12475161, 12405189, and U22A20261) and the Large Research Infrastructures China initiative Accelerator Driven System (Grant No. 2017-000052-75-01-000590).

REFERENCES

[1] Bowman C D, Arthur E D, Lisowski P W, et al. Nuclear energy generation and waste transmutation using an accelerator-driven intense thermal neutron source[J]. Nuclear Instruments and Methods in Physics Research Section A: Accelerators, Spectrometers, Detectors and Associated Equipment, (1992).

[2] Conde H. The technological status of accelerator-driven nuclear systems for transmutation and energy production[J]. Progress in Nuclear Energy, 32(3/4), 463-470 (1998). 10.1016/S0149-1970(97)00054-1
[3] Kapoor S S. Accelerator-driven sub-critical reactor system (ADS) for nuclear energy generation[J]. Pramana, 59(6), 941-950 (2002).
[4] Wenlong Z, Lei Y, Xuesong Y, et al. Accelerator-driven Advanced Nuclear Energy System and Its Research Progress[J]. Atomic Energy Science and Technology, (2019).
[5] Guoqing X, Hushan X, Sicheng W. HIAF and CiADS National Research Facilities: Progress and Prospect[J]. Nuclear Physics Review, (2017).
[6] Shuhui L, Zhijun W, et al. Physics design of the CiADS 25 MeV demo facility[J]. Nuclear Instruments & Methods in Physics Research, 11-17 (2017).
[7] Shuhui L, Zhijun W, et al. Commissioning of China ADS demo Linac and baseline design of CiADS project[J]. Journal of Physics: Conference Series, 1401, 012009 (2020). 10.1088/1742-6596/1401/1/012009
[8] Shouyan X, Sheng W. Study on space charge effects of the CSNS/RCS[J]. Chinese Physics C, 35(12), 1152-1158 (2011).
[9] Zhang X Y, Zhang L H, Tang J Y. Study on time-dependent lattice to alleviate space charge effects in CSNS/RCS[J]. Radiation Detection Technology and Methods, 1-11 (2019).
[10] Duanyang J, Zhijun W, Huan J, et al. Investigation of beam loss mechanism by parasitic H- in high-power proton linac[J]. Physical Review Accelerators and Beams, 28(9), 090101 (2025).
[11] Birdsall C K, Langdon A B. Plasma Physics via Computer Simulation[M]. The Adam Hilger Series on Plasma Physics. Adam Hilger, Bristol, England (ISBN: 0-07-005371-5), (1991).
[12] Wang Z J, Liu S H, Chen W L, et al. Beam physics design of a superconducting linac[J]. Physical Review Accelerators and Beams, 27(1), 010101 (2024). 10.1103/PhysRevAccelBeams.27.010101
[13] Zheng P P, Wang X H, He Z F, et al. Design and beam dynamics study of a magnet system for an 11 MeV superconducting isochronous cyclotron[J]. Nuclear Science and Techniques, 36(6), 107 (2025).
[14] Yee-Rendon B, Kondo Y, Tamura J, et al. Beam dynamics studies for fast beam trip recovery of the Japan Atomic Energy Agency accelerator-driven subcritical system[J]. Physical Review Accelerators and Beams, 25(8), 080101 (2022). 10.1103/PhysRevAccelBeams.25.080101
[15] Kaiser J, Xu C, Eichler A, et al. Bridging the gap between machine learning and particle accelerator physics with high-speed, differentiable simulations[J]. Physical Review Accelerators and Beams, 27(5), 054601 (2024). 10.1103/PhysRevAccelBeams.27.054601
[16] Yang L J, Peng J Y, Qiu F, et al. Classification of superconducting radio-frequency cavity faults of CAFE2 using machine learning[J]. Nuclear Science and Techniques, 36(6), 104 (2025).
[17] Chen X L, Wang Z J, He Y, et al. Machine learning for online control of particle accelerators[J]. Science China Physics, Mechanics & Astronomy, 68(2), 1-11 (2025). 10.1007/s11433-024-2492-5
[18] Chen X L, Jia Y Z, Wang Z J, et al. Orbit correction based on improved reinforcement learning algorithm[J]. Physical Review Accelerators and Beams, 26(4), 044601 (2023). 10.1103/PhysRevAccelBeams.26.044601
[19] Vay J L, Huebl A, Almgren A, et al. Modeling of a chain of three plasma accelerator stages with the WarpX electromagnetic PIC code on GPUs[J]. Physics of Plasmas, 28(2), 023105 (2021).
[20] King J R, Pogorelov I V, Amyx K M, et al. GPU acceleration and performance of the particle-beam-dynamics code Elegant[J]. Computer Physics Communications, 235, 346-355 (2019).
[21] Pang X, Rybarcyk L. GPU accelerated online multi-particle beam dynamics simulator for ion linear particle accelerators[J]. Computer Physics Communications, 185(3), 744-753 (2014).
[22] Huebl A, Lehe R, Mitchell C E, et al. Next generation computational tools for the modeling and design of particle accelerators at Exascale[C]. 5th International Particle Accelerator Conference (NAPAC'22), 302-306 (2022). 10.18429/JACoW-NAPAC2022-TUYE2
[23] Nath S, Qiang J, Ryne R, et al. Comparison of linac simulation codes[C]. Proceedings of the 2001 Particle Accelerator Conference (Cat. No. 01CH37268), 264-266 (2001). 10.1109/PAC.2001.987488
[24] Li H, Easton M, Lu Y, et al. Development and benchmarking of the IMPACT-T code[C]. Proc. 9th International Particle Accelerator Conference (IPAC'18), 3408-3410 (2018). 10.18429/JACoW-IPAC2018-THPAK076
[25] Jin C, Wang Z, Qi X, et al. Advanced Virtual Accelerator Software: A linear accelerator simulation code[J]. Physical Review Accelerators and Beams, 28(4), 044602 (2025). 10.1103/PhysRevAccelBeams.28.044602
[26] NVIDIA. CUDA C++ Programming Guide[EB/OL]. Version 13.0, [2025-09-02].
[27] Hao C W, Tian Y, Lin P, et al. Optimizing the GPU based method calculating energy deposition of beams coupling with discrete materials in dynamical and thermal simulations for higher computing efficiency[J]. Computer Physics Communications, 108426 (2022).
[28] NVIDIA. NVIDIA Nsight Compute[EB/OL]. Version 2025.3.1, [2025-03-01].
