Parallel Computation of Large-Scale Finite Element Sparse Matrices with GPU
ZHOU Qinglong, LIN Wancang
(School of Resources and Safety Engineering, Central South University, Changsha, Hunan 410083, China)
Abstract: Currently, high-performance computing for large-scale finite element analysis primarily relies on multi-CPU computer cluster parallelization. This approach entails high computational costs and offers limited efficiency gains. Leveraging the advantages of state-of-the-art GPU chips in large-scale data parallel processing, this paper proposes a GPU-based parallel solution method for large-scale finite element sparse matrices. The method encompasses compression storage encoding and decoding schemes for large-scale sparse matrices, preconditioned parallel iterative solution strategies, and GPU kernel function implementation for key computational steps. A GPU parallel finite element program was developed using C++ and tested on finite element sparse matrices of varying scales. Numerical results demonstrate that GPU-based parallel solving significantly improves computational efficiency, with the advantage becoming more pronounced as matrix dimensions increase. Furthermore, preconditioning the global sparse matrix during GPU solution substantially accelerates convergence while maintaining high computational accuracy.
Keywords: parallel computing; finite element method; GPU computing; sparse matrix
1 Introduction
Large-scale geotechnical engineering numerical simulations currently face several critical challenges. First, geotechnical projects operate at enormous scales, with computational domains often spanning several kilometers to tens of kilometers. Due to memory and computational efficiency limitations in conventional commercial software, numerical analysis requires mesh discretization at the scale of hundreds of meters or even kilometers. Such coarse meshes fail to capture local deformation and failure characteristics, yet engineering catastrophes typically initiate from localized regions \cite{1}. Consequently, computational results offer limited practical guidance. Second, many major geotechnical projects in China operate under complex geological conditions heavily influenced by faults, fractures, topography, hydrogeology, in-situ stress conditions, and rock properties, while incorporating numerous reinforcement measures such as rock bolts and anchors. However, computational efficiency constraints force substantial model simplifications or reduced computational domains, eliminating many important influencing factors and reinforcement effects. The resulting calculations poorly reflect actual disaster initiation and evolution processes. Third, disaster incubation and occurrence often involve nonlinear, complex multi-physics coupling processes requiring ultra-long timescale iterative solutions spanning years or decades. Simulating these phenomena demands millions to hundreds of millions of iterations—computationally infeasible for general commercial software.
To address these numerical challenges in large-scale geotechnical engineering, researchers have conducted pioneering studies. Some scholars \cite{2-4} proposed parallel computation methods based on domain decomposition strategies, distributing computational information across different machines. Wang et al. \cite{5} employed ParMetis for parallel pre- and post-processing and implemented preconditioned parallel solvers for finite element linear systems to simulate tunnel excavation. Ni et al. \cite{6,7} developed a master-slave distributed parallel implementation of optimization algorithms for large-scale underground geotechnical engineering inverse analysis. References \cite{8-10} introduced element-by-element (EBE) parallel algorithms for distributed-memory systems, adopting on-demand data collection and exchange strategies to reduce data transfer and storage requirements. Xie et al. \cite{11} ran the self-developed RFPA3D-Parallel program on clusters to analyze three-dimensional fracture processes in rock specimens with cavities. Reference \cite{12} performed dynamic seismic analysis of subway tunnel-civil air defense basement interaction using Abaqus on a 64-CPU explicit parallel computing cluster.
Compared to CPU parallel computing, GPU parallelization offers tremendous advantages in efficiency and cost. Extensive research has investigated GPU-based sparse matrix computations. Bolz et al. \cite{13} implemented GPU-accelerated conjugate gradient algorithms. Göddeke et al. \cite{14,15} studied GPU-based multigrid methods. Naumov \cite{16} demonstrated IC/ILU preconditioned conjugate gradient and stabilized biconjugate gradient algorithms using CUSPARSE and CUBLAS libraries. Chen et al. \cite{17} proposed using incomplete Cholesky decomposition preconditioned conjugate gradient methods for large sparse symmetric positive definite linear systems. Zhang et al. \cite{18} explored optimization methods for sparse matrix-vector multiplication using the CUSPARSE library.
This paper addresses the solution of large equation systems in large-scale geotechnical finite element parallel computing by systematically presenting storage methods for large-scale finite element sparse matrices, GPU parallel iterative solution methods, and corresponding GPU kernel implementations. Using C++, we developed a finite element GPU parallel computing program and conducted numerical experiments to systematically evaluate GPU parallel solving efficiency and accuracy.
2 Preconditioned Iterative Solution of Large-Scale Sparse Matrices
Standard conjugate gradient methods converge rapidly for well-conditioned matrices but may converge slowly for matrices with large condition numbers. Preconditioning techniques can significantly improve convergence rates.
The fundamental idea is to apply the standard conjugate gradient method to a transformed system of the form:
$$
\tilde{A}\tilde{x} = \tilde{b} \quad (1)
$$
where $\tilde{A} = C^{-1}AC^{-1}$, $\tilde{x} = Cx$, $\tilde{b} = C^{-1}b$, and $C$ is a symmetric positive definite matrix. A properly chosen matrix $C$ can substantially reduce the condition number of $\tilde{A}$. Applying the standard conjugate gradient method with initial solution $\tilde{x}_0$, initial residual $\tilde{r}_0 = \tilde{b} - \tilde{A}\tilde{x}_0$, and initial search direction $\tilde{p}_0 = \tilde{r}_0$, each iteration $k = 0, 1, 2, \ldots$ proceeds as follows until convergence:
$$
\alpha_k = \frac{\tilde{r}_k^{T}\tilde{r}_k}{\tilde{p}_k^{T}\tilde{A}\tilde{p}_k} \quad (2)
$$
$$
\tilde{x}_{k+1} = \tilde{x}_k + \alpha_k \tilde{p}_k \quad (3)
$$
$$
\beta_k = \frac{\tilde{r}_{k+1}^{T}\tilde{r}_{k+1}}{\tilde{r}_k^{T}\tilde{r}_k} \quad (4)
$$
$$
\tilde{p}_{k+1} = \tilde{r}_{k+1} + \beta_k \tilde{p}_k \quad (5)
$$
where the residual is advanced, using $\tilde{A} = C^{-1}AC^{-1}$, by
$$
\tilde{r}_{k+1} = \tilde{r}_k - \alpha_k C^{-1}AC^{-1}\tilde{p}_k \quad (6)
$$
Established matrix preconditioning methods include Jacobi preconditioning, block Jacobi preconditioning, and incomplete factorization preconditioning.
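As the simplest illustration, Jacobi preconditioning takes $C = D^{1/2}$ with $D = \operatorname{diag}(A)$, so that
$$
\tilde{A} = D^{-1/2} A D^{-1/2}
$$
which rescales each row and column of $A$ by its diagonal entry and often improves conditioning at negligible extra cost.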
For the finite element system $Ax = b$, performing incomplete Cholesky decomposition on matrix $A$ yields:
$$
A = LL^T - R \quad (7)
$$
where $L$ is a lower triangular matrix. Using $C = LL^T$ as the preconditioning matrix, the following relationship holds:
$$
(LL^T)^{-1}A(LL^T)^{-1} \cdot LL^Tx = (LL^T)^{-1}b \quad (8)
$$
Since $(LL^T)^{-1} = (L^T)^{-1}L^{-1}$, the preconditioner can be applied through $L$ alone. Premultiplying $Ax = b$ by $L^{-1}$ gives:
$$
L^{-1}Ax = L^{-1}b \quad (9)
$$
Inserting the identity $(L^T)^{-1}L^T$ yields:
$$
L^{-1}A(L^T)^{-1}L^Tx = L^{-1}b \quad (10)
$$
Defining
$$
\tilde{A} = L^{-1}A(L^T)^{-1}, \quad \tilde{x} = L^Tx, \quad \tilde{b} = L^{-1}b \quad (11)
$$
recovers the transformed form $\tilde{A}\tilde{x} = \tilde{b}$ of equation (1); $\tilde{A}$ remains symmetric positive definite and typically has a far smaller condition number than $A$.
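In practice, $\tilde{A}$ is never assembled explicitly. Each preconditioned iteration instead applies the preconditioner by solving $LL^{T}z_k = r_k$ with two sparse triangular substitutions, which are inexpensive because $L$ retains the sparsity of the incomplete factorization:
$$
L y_k = r_k \;\text{(forward substitution)}, \qquad L^{T} z_k = y_k \;\text{(backward substitution)}
$$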
3 GPU Parallel Solution of Large-Scale Sparse Matrices Based on CUDA
3.1 General Procedure for CUDA-Based Large-Scale Sparse Matrix Parallel Solution
The general procedure for solving large-scale sparse matrices using CUDA involves the following steps (a minimal setup sketch covering the first five steps follows this list):
1. Initialize the cuSPARSE and cuBLAS libraries required for CUDA-based solution.
2. Allocate appropriate memory for the pointers and arrays used in the program and initialize them to zero.
3. Ensure that the sparse matrix values, row indices, and column indices have been computed and reside in GPU device memory.
4. Convert the data to standard cuSPARSE formats for function-interface compatibility using conversion functions such as cusparseCreateCsr (which creates a CSR sparse-matrix descriptor) and cusparseCreateDnVec (which creates a dense-vector descriptor).
5. Transfer the sparse matrix parameters from host to device using cudaMemcpy (the CPU-to-GPU data-transfer function). For CSR-compressed matrices \cite{19}, the transferred parameters include the matrix dimensions, the non-zero values, and the row/column indices of the non-zero elements.
6. Develop GPU kernel functions implementing vector multiplication ($x \cdot y$), sparse matrix-vector multiplication ($Ax$), and scalar-vector multiplication with vector addition ($\alpha x + y$).
7. Compute the initial residual $r = b - Ax$ using the kernels developed in step 6 (sparse matrix-vector multiplication and scalar-vector operations).
8. Perform the iterative calculation of equations (2)-(6) with the kernels from step 6 (dot products, sparse matrix-vector multiplication, and scalar-vector operations) until the residual converges to the specified tolerance, yielding the final solution.
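To make the setup steps concrete, the following minimal sketch (error checking omitted) assumes the host-side CSR arrays h_rowptr, h_col, h_val, the right-hand side h_b, and the sizes n and nnz have already been assembled; all identifiers are illustrative rather than the names used in our program:
#include <cuda_runtime.h>
#include <cusparse.h>
#include <cublas_v2.h>

void setup_device_system(int n, int nnz, const int *h_rowptr, const int *h_col,
                         const float *h_val, const float *h_b) {
    cusparseHandle_t sp; cusparseCreate(&sp);   // step 1: initialize cuSPARSE
    cublasHandle_t bl;   cublasCreate(&bl);     // step 1: initialize cuBLAS

    int *d_rowptr, *d_col; float *d_val, *d_b;  // step 2: allocate device memory
    cudaMalloc(&d_rowptr, (n + 1) * sizeof(int));
    cudaMalloc(&d_col, nnz * sizeof(int));
    cudaMalloc(&d_val, nnz * sizeof(float));
    cudaMalloc(&d_b, n * sizeof(float));

    // step 5: host-to-device transfer of the CSR parameters and right-hand side
    cudaMemcpy(d_rowptr, h_rowptr, (n + 1) * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(d_col, h_col, nnz * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(d_val, h_val, nnz * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, n * sizeof(float), cudaMemcpyHostToDevice);

    // step 4: wrap the raw device arrays in cuSPARSE descriptors
    cusparseSpMatDescr_t matA;
    cusparseCreateCsr(&matA, n, n, nnz, d_rowptr, d_col, d_val,
                      CUSPARSE_INDEX_32I, CUSPARSE_INDEX_32I,
                      CUSPARSE_INDEX_BASE_ZERO, CUDA_R_32F);
    cusparseDnVecDescr_t vecB;
    cusparseCreateDnVec(&vecB, n, d_b, CUDA_R_32F);
    // steps 6-8 then run the kernels of Section 3.2 on these device buffers
}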
3.2 GPU-Side Parallel Multi-Thread Kernel Design for CUDA
The preconditioned sparse matrix iterative solution requires three computationally intensive operations: vector multiplication ($z = x \cdot y$), scalar multiplication and vector addition ($y = \alpha x + y$), and sparse matrix-vector multiplication ($y = Ax$). Efficient GPU kernel design is essential for parallelizing these operations.
Vector Multiplication GPU Kernel Design
The GPU kernel for vector multiplication ($z = x \cdot y$) can be designed as follows:
Algorithm 2: GPU Kernel Design for Vector Multiplication
__global__ void dot(const float *x, const float *y, float *z, int N) {
    int tid = threadIdx.x + blockIdx.x * blockDim.x;  // global thread index
    while (tid < N) {
        z[tid] = x[tid] * y[tid];                     // element-wise product
        tid += blockDim.x * gridDim.x;                // advance by total threads in the grid
    }
}
// Host side: copy the element-wise products back and accumulate the scalar result.
cudaMemcpy(z, dev_z, N * sizeof(float), cudaMemcpyDeviceToHost);
float sum = 0;
for (int i = 0; i < N; i++) {
    sum += z[i];
}
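Copying all $N$ partial products back to the host leaves the final accumulation serial. A common refinement, sketched below under the assumption that the block size is a power of two (this is not the implementation described above), reduces the partial products in shared memory so that only one value per block returns to the host:
__global__ void dot_reduce(const float *x, const float *y, float *block_sums, int N) {
    extern __shared__ float cache[];                  // one float per thread in the block
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    float temp = 0.0f;
    for (; tid < N; tid += blockDim.x * gridDim.x)    // grid-stride accumulation
        temp += x[tid] * y[tid];
    cache[threadIdx.x] = temp;
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {    // tree reduction in shared memory
        if (threadIdx.x < s)
            cache[threadIdx.x] += cache[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0)
        block_sums[blockIdx.x] = cache[0];            // one partial sum per block
}
Launched as dot_reduce<<<blocks, threads, threads * sizeof(float)>>>(...), this leaves the host to sum only blocks values (or a second kernel pass can finish the reduction on the device).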
Scalar Multiplication and Vector Addition GPU Kernel Design
Assuming $n$ is the length of vectors $x$ and $y$, and $\alpha$ is the scalar parameter, designing each thread to compute one element of $x$ multiplied by $\alpha$ and added to the corresponding element of $y$ yields a simple kernel for $y = \alpha x + y$:
Algorithm 3: GPU Kernel Design for Scalar Multiplication and Vector Addition
__global__ void saxpy(int n, float alpha, float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // one thread per vector element
    if (i < n) {
        y[i] = alpha * x[i] + y[i];
    }
}
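A typical launch of this kernel, assuming device vectors d_x and d_y of length n and 256 threads per block (the names and block size here are illustrative):
int threads = 256;                               // threads per block
int blocks = (n + threads - 1) / threads;        // round up so every element is covered
saxpy<<<blocks, threads>>>(n, alpha, d_x, d_y);  // d_x, d_y are device pointers
cudaDeviceSynchronize();                         // wait for the kernel to finish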
Sparse Matrix-Vector Multiplication GPU Kernel Based on CSR Storage
Sparse matrix-vector multiplication occurs repeatedly during the iterative solution. A CPU-based implementation consumes substantial computational time and memory, making efficient GPU parallelization critical for large sparse matrix solutions. Under the CUDA framework, three kernel design approaches enable high-performance parallel computation for CSR-based matrix-vector multiplication. The first method assigns one thread per output vector element, where each thread computes the products of the non-zero elements in a matrix row with the vector and accumulates the sum \cite{20} (see Algorithm 4). The second approach uses one warp per output vector element, where the threads within a warp compute the row products and perform a reduction sum on the intermediate warp results \cite{21}; a sketch of this variant follows Algorithm 4. The third method assigns one block per output vector element, followed by reduction summation of intermediate results within the block. For a matrix $A$ stored in CSR format with arrays data (non-zero values), col (column indices), and rowptr (row offsets), a GPU kernel for $y = Ax$ can be implemented as:
Algorithm 4: GPU Kernel Design for Sparse Matrix-Vector Multiplication Using CSR Storage
__global__ void spmv_scalar_kernel(const int num_rows, const int *rowptr,
                                   const int *col, const float *data,
                                   const float *x, float *y) {
    int row = blockDim.x * blockIdx.x + threadIdx.x;  // one thread per matrix row
    if (row < num_rows) {
        float dot = 0;
        int row_start = rowptr[row];                  // first non-zero of this row
        int row_end = rowptr[row + 1];                // one past the last non-zero
        for (int jj = row_start; jj < row_end; jj++)
            dot += data[jj] * x[col[jj]];             // multiply-accumulate along the row
        y[row] += dot;                                // y must be zero-initialized for y = Ax
    }
}
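For comparison, a minimal sketch of the second (warp-per-row) approach, assuming a warp size of 32 and full-warp participation; this follows the CSR-vector pattern of \cite{21} rather than our own implementation:
__global__ void spmv_vector_kernel(const int num_rows, const int *rowptr,
                                   const int *col, const float *data,
                                   const float *x, float *y) {
    int warp_id = (blockDim.x * blockIdx.x + threadIdx.x) / 32;  // one warp per row
    int lane = threadIdx.x & 31;                                 // lane index within the warp
    if (warp_id < num_rows) {
        float sum = 0.0f;
        for (int jj = rowptr[warp_id] + lane; jj < rowptr[warp_id + 1]; jj += 32)
            sum += data[jj] * x[col[jj]];                        // 32 strided partial sums
        for (int offset = 16; offset > 0; offset >>= 1)
            sum += __shfl_down_sync(0xffffffff, sum, offset);    // warp-level reduction
        if (lane == 0)
            y[warp_id] += sum;                                   // lane 0 holds the row result
    }
}
Because a whole warp cooperates on each row, this variant performs better than Algorithm 4 when rows contain many non-zeros, at the cost of idle lanes on very short rows.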
4 Numerical Experiments
4.1 Experimental Platform and Model
Numerical experiments were conducted on a conventional laboratory desktop computer with the following specifications: CPU: Intel(R) Core(TM) i7-8700 @ 3.20 GHz with 8192 MB RAM; GPU: NVIDIA GeForce GTX 1050 Ti with 8026 MB total memory (4019 MB dedicated video memory, 4007 MB shared memory), compute capability 6.1; Operating system: Windows 7 64-bit; Development environment: Visual Studio 2019 with CUDA 12.4.
The numerical model consists of a two-dimensional soil site measuring 100 m in both height and width. Material properties include an elastic modulus of 10 MN/m² and Poisson's ratio of 0.3. A surface load of 1 kN is applied. The computational domain is discretized using four-node rectangular elements with varying mesh densities corresponding to node counts of 18, 800, 1800, 80000, 180000, and 8000000 (see [FIGURE:1]). With two degrees of freedom (x and y displacement) per node, the resulting global stiffness sparse matrix dimensions are 36×36, 1600×1600, 3600×3600, 160000×160000, 360000×360000, and 16000000×16000000, respectively. Thus, different mesh resolutions produce finite element sparse matrices of varying scales.
4.2 Experimental Results and Analysis
Using the sparse matrix conjugate gradient method described in Section 2, we developed both CPU-based serial and GPU-based parallel iterative solvers using CUDA. The convergence tolerance was set to $1 \times 10^{-12}$. Computational times for different matrix dimensions are presented in [TABLE:1].
For small to medium-scale matrices, CPU serial solving exhibits higher efficiency than GPU parallel solving. At dimensions of 36 and 1600, CPU performance exceeds GPU performance, while at dimension 3600 the two approaches show comparable efficiency. This behavior stems primarily from data-transfer overhead between CPU and GPU, as all data must first be transferred from CPU to GPU memory. However, for large-scale sparse matrices (dimensions in the hundreds of thousands to tens of millions), GPU parallel solving dramatically outperforms the CPU, and the advantage grows rapidly with matrix size: at dimension 160000, GPU parallel solving achieves an 18.1× speedup; at 360000, 23.7×; and at 16 million, 176.7×.
To validate our results, we reproduced the finite element mesh from \cite{22} and performed comparative numerical experiments on both CPU and GPU. The detailed comparison appears in [TABLE:2]. Our results show significantly improved computational efficiency compared with \cite{22}, partly attributable to newer CPU and GPU hardware. The overall trend, however, remains consistent: CPU solving is more efficient for small matrices, while the GPU speedup grows rapidly with increasing matrix dimension.
We also investigated the effect of preconditioning on finite element sparse matrix solving by implementing incomplete Cholesky decomposition preconditioning of the global stiffness matrix prior to GPU-based iterative solution. Results are summarized in [TABLE:3]. Preconditioning substantially reduces convergence steps—for instance, from 1742 to 545 steps at dimension 160000 under the specified tolerance of $1 \times 10^{-12}$. [FIGURE:2] illustrates the residual reduction during the first 200 iterations for a 160000×160000 matrix, showing significantly faster residual decrease with preconditioning. At iteration 200, the standard conjugate gradient method yields a residual of $1.21 \times 10^{-1}$, while the preconditioned method reaches $5.35 \times 10^{-5}$. These results demonstrate that preconditioning the global finite element sparse matrix markedly improves convergence and computational efficiency.
Setting the convergence tolerance to $1 \times 10^{-4}$, we analyzed the final computational error for preconditioned matrices solved on GPU. Results appear in [TABLE:4] and [FIGURE:3]. While final errors exhibit some randomness without direct correlation to matrix dimension, preconditioned solutions consistently show smaller errors and higher accuracy than non-preconditioned solutions.
Compared to open-source sparse matrix solvers available in the CUDA platform, our optimized preconditioned parallel solving method significantly reduces matrix condition numbers, thereby decreasing numerical errors during solution. Additionally, accelerated convergence through preconditioning reduces iteration counts and cumulative numerical errors.
5 Conclusions
This paper presents a GPU-based parallel solution method for large-scale finite element sparse matrices in geotechnical engineering, detailing storage schemes for large sparse matrices, preconditioned optimization methods, and corresponding GPU kernel implementations. Numerical experiments conducted on an NVIDIA GeForce GTX 1050 Ti graphics card in a conventional desktop computer yield the following conclusions: (1) For large-scale finite element sparse matrices, GPU parallel solving significantly improves computational efficiency compared to CPU serial solving, with the advantage becoming more pronounced as matrix dimensions increase; (2) Preconditioning the global sparse matrix during GPU parallel solving substantially accelerates convergence and improves computational efficiency; (3) Preconditioning also yields smaller computational errors and higher solution accuracy.
References
[1] Zhang Youliang, Tan Fei, Zhang Liren, et al. Scalable parallel computing for billion-element finite element models in geotechnical engineering[J]. Rock and Soil Mechanics, 2016, 37(11): 3309-3317.
[2] Zhang Youliang, Feng Xiating. Parallel computation of finite element analysis with over one million degrees of freedom in geotechnical engineering[J]. Rock and Soil Mechanics, 2007, 28(4): 684-688.
[3] Ru Zhongliang, Feng Xiating, Zhang Youliang, et al. Parallel computation of finite element analysis of anchored rock mass in underground engineering[J]. Chinese Journal of Rock Mechanics and Engineering, 2005, 24(1): 13-13.
[4] Ru Zhongliang, Feng Xiating, Li Hongdong, et al. Three-dimensional elastoplastic parallel finite element analysis of large underground engineering[J]. Chinese Journal of Rock Mechanics and Engineering, 2006(6): 1141-1147.
[5] Wang Xiaorui, Zhang Zhen, Jia Xiaofeng. Numerical simulation of tunnel excavation based on high-performance parallel computing[J]. Earth Science: Journal of China University of Geosciences, 2015(12): 2119-2126.
[6] Ni Shaohu, Xiao Ming, He Shihai, et al. Parallel optimization back analysis and case verification of underground engineering[J]. Chinese Journal of Rock Mechanics and Engineering, 2013, 32(3): 501-511.
[7] Ni Shaohu, Xiao Ming. Displacement back analysis of underground engineering parameters based on surrounding rock loosening zone[J]. Chinese Journal of Rock Mechanics and Engineering, 2009, 28(7): 1439-1446.
[8] Liu Yaoru, Zhou Weiyuan, Yang Qiang. Finite element parallel EBE method and its application[J]. Chinese Journal of Rock Mechanics and Engineering, 2005, 24(17): 3023-3028.
[9] Bova S, Carey G. A distributed memory parallel element-by-element scheme for semiconductor device simulation[J]. Computer Methods in Applied Mechanics and Engineering, 2000, 181(4): 403-423.
[10] Khan A, Topping B. Parallel finite element analysis using Jacobi-conditioned conjugate gradient algorithm[J]. Advances in Engineering Software, 1996, 25(2-3): 309-319.
[11] Xie Linmao, Zhu Wancheng, Wang Shuhong, et al. Parallel computational analysis of three-dimensional fracture process in rock specimens with cavities[J]. Chinese Journal of Geotechnical Engineering, 2011, 33(9): 1447-1453.
[12] Mao Kunming, Zhao Kai, Zhu Liming, et al. Seismic response analysis of subway tunnel-civil air defense basement interaction[J]. Journal of Vibration and Shock, 2019, 38(5): 243-250.
[13] Bolz J, Farmer I, Grinspun E. Sparse matrix solvers on the GPU: conjugate gradients and multigrid[J]. ACM Transactions on Graphics (TOG), 2003, 22(3): 917-924.
[14] Göddeke D, Strzodka R, Turek S. Performance and accuracy of hardware-oriented native-, emulated- and mixed-precision solvers in FEM simulations[J]. International Journal of Parallel, Emergent and Distributed Systems, 2007, 22(4): 221-256.
[15] Göddeke D, Buijssen S H, Wobker H. GPU acceleration of an unmodified parallel finite element Navier-Stokes solver[C]// 2009 International Conference on High Performance Computing & Simulation. IEEE, 2009: 12-21.
[16] Naumov M. Incomplete-LU and Cholesky preconditioned iterative methods using CUSPARSE and CUBLAS[R]. NVIDIA White Paper, 2011.
[17] Chen Yao, Zhao Yonghua, Zhao Wei, et al. GPU-accelerated incomplete Cholesky decomposition preconditioned conjugate gradient method[J]. Journal of Computer Research and Development, 2015, 52(4): 843-852.
[18] Zhang Jianfei, Shen Defei. Preconditioned conjugate gradient method for sparse linear systems based on GPU[J]. Computer Applications, 2013, 33(3): 825-829.
[19] Buatois L, Caumon G, Levy B. Concurrent number cruncher: a GPU implementation of a general sparse linear solver[J]. International Journal of Parallel, Emergent and Distributed Systems, 2009, 24(3): 205-223.
[20] Bell N, Garland M. Implementing sparse matrix-vector multiplication on throughput-oriented processors[C]// Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, 2009: 1-11.
[21] Filippone S, Cardellini V, Barbieri D. Sparse matrix-vector multiplication on GPGPUs[J]. ACM Transactions on Mathematical Software, 2017, 43(4): Article 30.
[22] Zheng Jingwei, An Xuehui, Huang Miansong. Optimization of PCG algorithm for large-scale sparse matrices based on CUDA[J]. Journal of Tsinghua University (Science and Technology), 2014(8): 889-894.