ChinaRxiv

Deep Q-learning for autonomous optimization of neutron thermalization devices for PGNAA applications

Abdelnour, Miss Marina, Liu, Mr. Juntao, Wajid, Mr. Muneeb, Yuan, Mr. Chao, Heng, Mr. Tian, Li, Miss Wenxin, Liu, Prof. Zhiyi

Submitted 2025-11-28 | ChinaXiv: chinaxiv-202512.00009 | Original in English

Note: Figures in this paper have not yet been translated.

Abstract

Prompt Gamma Neutron Activation Analysis (PGNAA) uses thermal neutron capture to perform isotopic analysis from characteristic gamma emissions. However, practical neutron sources mostly emit fast neutrons, so efficient thermalization is required to enhance signal quality and analytical accuracy. The complex multiparameter optimization of material compositions and geometries makes the design of thermalization devices computationally challenging. In this study, we introduce a Deep Q-Learning (DQL) framework that couples reinforcement learning with the Monte Carlo N-Particle (MCNP) code to autonomously optimize neutron thermalization device design. By formulating the optimization as a Markov Decision Process, the DQL agent explores a space of over seven million possible design configurations across 1,500 training episodes. By episode 520, the agent had reached its best configuration, increasing thermalization efficiency by a factor of 1.42 and reducing computing cost by 23% compared with a standard genetic algorithm. This work demonstrates that transport simulations can serve as dynamic reinforcement learning environments, offering a scalable approach to intelligent, self-adaptive nuclear system design for complex analytical applications.

Full Text

Preamble

Marina R. Abdelnour, Juntao Liu, A. M. Wajid, Chao Yuan, Tian Heng, Wenxin Li, Zhiyi Liu

Author Affiliations
1. Frontiers Science Center for Rare Isotopes, Lanzhou University, Lanzhou, Gansu, 730000, China
2. School of Nuclear Science and Technology, Lanzhou University, Lanzhou, Gansu, 730000, China
3. Department of Physics, Faculty of Women for Arts, Science, and Education, Ain Shams University, Cairo, Egypt

Keywords

Neutron thermalization, PGNAA, deep Q-learning, reinforcement learning, Monte Carlo simulation, nuclear system optimization

1 Introduction

Prompt gamma-ray neutron activation analysis (PGNAA) has been extensively studied due to its high potential and broad range of applications as a quantitative isotope identification method.

The main components of a PGNAA system are a neutron source, an optimized moderator, a collimator, the sample under investigation, and a detector array. PGNAA systems widely use modern neutron generators, which offer advantages in compactness and operational efficiency over conventional neutron sources. These generators use deuterium-deuterium (D-D) and deuterium-tritium (D-T) fusion reactions to produce neutrons at 2.5 or 14.1 MeV, respectively. Effective thermalization is necessary to slow these high-energy neutrons to thermal energies appropriate for analytical use [ ].

The analytical method bombards the target material with thermal neutrons, which interact with the target nuclei to induce characteristic gamma emissions; the resulting energy spectra contain discrete peaks that serve as isotopic fingerprints for elemental identification. One of the main advantages of PGNAA in material characterization is that irradiation and detection take place simultaneously, which makes real-time analysis highly efficient. Given this potential, PGNAA has become an essential tool in many applications [ ], including advanced material analysis in nuclear technology, medicine, and environmental monitoring; geophysical tracking for oil and coal exploration; security screening at airports and borders; and the detection of buried explosive devices [ ].

Although PGNAA systems have advanced significantly, optimizing neutron thermalization remains a major challenge. Insufficiently thermalized neutrons degrade the gamma spectra, which can lead to inaccurate determination of isotopic composition. Designing effective neutron thermalization devices (NTDs) is challenging due to the complex interplay of moderator geometry, material choice, and surrounding system components. Even small adjustments in these parameters can significantly affect thermal neutron flux and spectrum quality, so traditional optimization methods are often insufficient [ ] (Fig. ).

Monte Carlo (MC) simulations provide highly accurate neutron transport modeling, but their computational demands make practical optimization of NTDs challenging. Optimization often requires thousands of calculations through gradual layer adjustments or exhaustive parameter sweeps. PGNAA designs based on Monte Carlo N-Particle transport (MCNP) have achieved success [ ], yet the lengthy design cycles highlight the need for more efficient strategies to navigate complex, multi-dimensional parameter spaces. Automated approaches, including simulated annealing [ ], particle swarm optimization [ ], differential evolution, and genetic algorithms (GA) [ ], reduce computational effort compared to manual tuning. However, these methods remain constrained by fixed parameter spaces, sensitivity to initial conditions [ ], extensive hyperparameter adjustments [ ], and the inability to leverage insights from previous optimization steps. While recent applications of GA to the design of neutron thermalization devices are promising, they are limited by local optima, subjective weighting, and high MC costs [ ]. Their static nature prevents adaptation, highlighting the need for self-learning strategies that respond dynamically to simulation feedback.

In recent years, nuclear physics has made extensive use of artificial intelligence (AI) methods such as machine learning and neural networks to enhance nuclear systems [ ]. AI has transformed nuclear design by enabling the automation of complex processes that were previously too computationally taxing. Supervised learning techniques have been used successfully to optimize reactor fuel loading and beam parameters [ ], and recent research has demonstrated the integration of MC simulations and neural networks for beam shaping and isotope detection [ ]. However, supervised approaches are vulnerable to Monte Carlo noise, require large pre-computed datasets, and suffer from criticality limitations.

Reinforcement learning (RL), which learns optimal policies through direct interaction with the environment, is a good alternative. RL agents are well suited to complex optimization problems [ ]. By receiving rewards or penalties based on their behavior, they independently discover the best strategies and gradually improve their performance through trial-and-error exploration. Recent PGNAA applications of Q-learning have shown promise in optimizing collimator geometries and moderator dimensions [ ]. However, these implementations are still limited to discrete action spaces and simpler configurations, and have not fully used RL's potential for continuous parameter optimization and autonomous learning.

In this study, we present the first application of deep Q-learning (DQL) for autonomous optimization of thermal neutron assemblies in PGNAA systems. The DQL agent autonomously identifies optimal configurations by exploring the parameter space and interacting with MCNP simulations (Fig. ), leveraging learned value functions to capture complex dependencies between the six optimization parameters: shielding, collimator diameter, multiplier, and moderator thickness/material. Morris sensitivity analysis identified the most influential parameters for guiding the search [ ]. With a reward that increases with thermal flux and penalizes epithermal/fast neutrons, the method was trained over 1,500 episodes and reaches optimal configurations with 23% lower computational cost than a genetic algorithm.

Figure caption: DQL with high thermalization efficiency for PGNAA applications. Both Table and the text provide definitions for abbreviations.

2 Materials and Methods

Computational environment and setup

We optimized PGNAA setups using deep Q-learning (DQL) and MCNP6 [ ] radiation transport simulations, comparing the outcomes to a genetic algorithm (GA) commonly used in neutron activation studies [ ]. The GA used standard hyperparameters (mutation rate 0.001, crossover rate 0.7, ranking selection, and simulated binary crossover with 0.1 precision [ ]) with a population of 50 across 30 generations (1,500 evaluations) and a random seed of 42 for repeatability (see supplementary materials). Testing both strategies with the same computing budget allowed for a direct performance comparison.

The baseline geometry included a 14.1 MeV D-T neutron source and a multilayer assembly consisting of a 45 cm beryllium oxide (BeO) primary moderator (S1), a 31 cm polyethylene (PE) secondary moderator (S2), a 10 cm lead (Pb) neutron multiplier (S3), a 6 cm lead gamma shield (S4), and a 5 cm beryllium (Be) collimator (S5). While such materials are common in PGNAA designs [ ], the distinguishing aspect of this work lies in the method used to determine their optimal arrangement and thickness. Rather than relying on expert knowledge or predefined rules, the DQL agent autonomously explored the design space, interacting directly with the MCNP simulation environment to discover high-performance configurations. The optimization process consisted of iterative simulation, learning, and validation stages.

Sensitivity analysis and design framework

We used the Morris method, a computationally efficient one-at-a-time (OAT) global sensitivity strategy appropriate for high-dimensional models [ ], to prioritize the important design variables for DQL optimization (Fig. (a)); 600 evaluations were generated with 6 levels and 100 trajectories. The technique calculates the mean absolute elementary effect ($\mu^*$) to measure overall sensitivity and the standard deviation ($\sigma$) to evaluate parameter interactions and nonlinearities.
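To make the two Morris statistics concrete, the following is a minimal sketch of elementary-effect screening on a toy function, not the authors' implementation. It assumes five inputs scaled to the unit hypercube, the common step size levels / (2 (levels - 1)), and upward steps only; the function name `morris_screening` and the toy model are hypothetical. With 5 inputs and 100 trajectories it performs 100 × (5 + 1) = 600 model evaluations, which matches the evaluation count quoted above.

```python
import random
import statistics


def morris_screening(model, k, r=100, levels=6, seed=0):
    """Simplified Morris screening on [0, 1]^k (upward steps only).

    Returns (mu_star, sigma, n_evals): mean absolute elementary effect and
    standard deviation of the elementary effects per input, plus the number
    of model evaluations used.
    """
    rng = random.Random(seed)
    delta = levels / (2.0 * (levels - 1))  # common Morris step choice (assumption)
    grid = [i / (levels - 1) for i in range(levels)]
    starts = [g for g in grid if g + delta <= 1.0 + 1e-12]  # base values that can step up
    effects = [[] for _ in range(k)]
    n_evals = 0
    for _ in range(r):
        x = [rng.choice(starts) for _ in range(k)]
        y = model(x)
        n_evals += 1
        order = list(range(k))
        rng.shuffle(order)  # perturb one input at a time, in random order
        for i in order:
            x_new = list(x)
            x_new[i] = x[i] + delta
            y_new = model(x_new)
            n_evals += 1
            effects[i].append((y_new - y) / delta)
            x, y = x_new, y_new
    mu_star = [statistics.fmean([abs(e) for e in ee]) for ee in effects]
    sigma = [statistics.stdev(ee) for ee in effects]
    return mu_star, sigma, n_evals


def toy_model(x):
    # Hypothetical stand-in for an MCNP run: x[0] acts linearly,
    # x[1] and x[2] interact, x[3] and x[4] have no effect.
    return 10.0 * x[0] + 5.0 * x[1] * x[2]


mu_star, sigma, n_evals = morris_screening(toy_model, k=5)
```

In this toy case the linear input has a large mean effect with near-zero spread, while the interacting inputs show a non-zero spread, which is the pattern the text uses to flag coupled parameters.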

Five primary parameters (moderator layer thicknesses, collimator diameter, shielding, and neutron multiplier thickness) were chosen because they had a significant impact on thermal neutron flux, whereas secondary factors such as climatic variations and neutron source stability were considered insignificant in the controlled simulation context [ ]. The results for the PGNAA system demonstrated considerable parameter coupling (Fig. (b)). The primary moderator layer showed the highest sensitivity ($\mu^*$ = 112), indicating its critical role in thermal neutron flux enhancement. Its large standard deviation $\sigma$ indicates strong parameter interactions: the moderator's effectiveness depends strongly on the configuration of the other components. The same interaction pattern was observed for the side moderator S2 and the Pb multiplier; for each of the three parameters, the large $\sigma$ indicated strong nonlinear coupling between the design variables.

These large $\sigma$ values demonstrate that optimizing any one parameter on its own will lead to suboptimal results, since the impact of changing one parameter depends on the settings of the others.

In contrast, detector shielding showed low sensitivity ($\mu^*$) and limited interactions ($\sigma$), suggesting that its impact on neutron thermalization is essentially independent of the other parameters. Its low $\sigma$ value indicates that it can be tuned independently without concern for coupling effects, even though it remains crucial for improving the signal-to-noise ratio (SNR) through background radiation attenuation [ ]. The interaction patterns show that the coupled dynamics indicated by large $\sigma$ values cannot be captured by sequential parameter adjustment.

These findings motivated the use of Deep Q-Learning, which naturally accounts for parameter interactions and explores the entire joint parameter space. In contrast to gradient-based or grid-search methods, which may fall into local optima in coupled systems, DQL's exploration allows it to find the synergistic parameter combinations that the Morris analysis points to.

Figure caption: (b) Parameter sensitivity rankings showing $\mu^*$ values with 95% confidence intervals (whiskers) and $\sigma$ values as point estimates. The primary moderator exhibits the highest sensitivity ($\mu^*$ = 112), confirming its priority in the DQL action space design.

Table . State variables and ranges.
Columns: Component; Layer (State); Material; Thickness range (cm); Steps.
Moderator: primary mod. (S1), material: see Table ; secondary mod. 1 (S2); secondary mod. 2 (mirror)
Multiplier: multiplier (S3)
Shield: detector shield (S4)
Collimator: Coll. 1 (S5); Coll. 2 (mirror)
State: The discretized state space contained 7,022,400 configurations.

State space definition

The state space was defined by the geometrical properties and materials of the moderator, collimator, neutron multiplier, and shielding components. To reduce computational overhead, symmetric components were optimized on one side only, with the results automatically mirrored to preserve symmetry. State variables and ranges are summarized in Table .

Action space and material selection

Actions defined discrete adjustments to component geometry and moderator materials within predefined ranges. Each parameter had two actions: increase toward the upper bound or decrease toward the lower bound. Based on the sensitivity analysis results, six candidate materials were selected for the primary moderator layer using key nuclear property criteria: optimal neutron slowing-down power through elastic scattering with light nuclei (hydrogen, carbon, beryllium, and oxygen) and low thermal neutron absorption cross-sections to minimize neutron loss [ ]. The material selection prioritized solid moderators to meet compact PGNAA system requirements. While water exhibits excellent neutron moderation properties, solid moderator materials provide superior mechanical stability, eliminate containment complexities, and maintain effective neutron thermalization for portable applications [ ]. These materials enable the DQL agent to explore trade-offs between geometry, shielding synergy, and moderation efficiency. Each candidate material is given a unique ID between 1 and 6 (Table ). Actions 10 and 11 of the DQL algorithm allow the agent to choose explicitly among all six candidate materials by changing the material ID of the primary moderator layer. Table contains the complete definitions of the actions.
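The paired increase/decrease actions and the size of the discretized state space can be sketched as follows. This is an illustrative encoding, not the authors' code: the bounds and step sizes are read from the action table as magnitudes in cm (the sign convention used in that table is not reproduced), and the names `BOUNDS`, `apply_action`, and `state_space_size` are hypothetical. Under these assumptions the level counts multiply to 20 × 35 × 19 × 8 × 11 × 6 = 7,022,400, which agrees with the number of configurations stated in the text.

```python
# (low, high, step) per state variable; magnitudes in cm taken from the
# action table. Treating them as unsigned values is an assumption.
BOUNDS = {
    "S1": (26.0, 45.0, 1.0),   # primary moderator thickness
    "S2": (1.0, 35.0, 1.0),    # secondary moderator thickness
    "S3": (1.0, 10.0, 0.5),    # multiplier thickness
    "S4": (1.0, 8.0, 1.0),     # shield thickness
    "S5": (3.0, 5.0, 0.2),     # collimator
    "material": (1, 6, 1),     # primary moderator material ID
}
KEYS = list(BOUNDS)


def n_levels(low, high, step):
    """Number of discrete values between low and high, inclusive."""
    return int(round((high - low) / step)) + 1


def state_space_size():
    size = 1
    for low, high, step in BOUNDS.values():
        size *= n_levels(low, high, step)
    return size


def apply_action(state, action):
    """Even actions decrease and odd actions increase the variable at index
    action // 2 by one step; the result is clamped to the variable's bounds."""
    key = KEYS[action // 2]
    low, high, step = BOUNDS[key]
    sign = 1 if action % 2 else -1
    new_value = min(high, max(low, state[key] + sign * step))
    new_state = dict(state)
    new_state[key] = int(new_value) if key == "material" else round(new_value, 1)
    return new_state


# Baseline configuration from the text; material ID 5 for BeO assumes the
# IDs follow the order of the material list.
baseline = {"S1": 45.0, "S2": 31.0, "S3": 10.0, "S4": 6.0, "S5": 5.0, "material": 5}
thinner_collimator = apply_action(baseline, 8)   # action 8: decrease S5 by 0.2 cm
```

Clamping at the bounds means an action that would leave the allowed range simply returns the same state, which keeps every reachable state inside the discretized grid.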

Table . Candidate materials for the primary moderator (ID, material).
1 Polyethylene
2 Graphite (C)
3 Teflon
4 Plexiglass
5 Beryllium oxide (BeO)
6 Borated polyethylene

Table . Action definitions (action, description).
0 Decrease S1 thickness by 1 cm (min -45 cm)
1 Increase S1 thickness by 1 cm (max -26 cm)
2 Decrease S2 thickness by 1 cm (min -35 cm)
3 Increase S2 thickness by 1 cm (max -1 cm)
4 Decrease S3 thickness by 0.5 cm (min -1 cm)
5 Increase S3 thickness by 0.5 cm (max -10 cm)
6 Decrease S4 thickness by 1 cm (min 1 cm)
7 Increase S4 thickness by 1 cm (max 8 cm)
8 Decrease S5 thickness by 0.2 cm (min -3 cm)
9 Increase S5 thickness by 0.2 cm (max -5 cm)
10 Decrease the material ID of S1 by 1 (min 1)
11 Increase the material ID of S1 by 1 (max 6)

Reward function

We aim to obtain the optimal system geometry by simultaneously maximizing the thermal neutron flux ($\Phi_{th}$) and minimizing fast neutron contributions to improve the SNR for PGNAA applications. The reward function $R$ guides the optimization by incorporating three key performance metrics:

$$R = w_1 \Phi_{th} + w_2 \frac{\Phi_{th}}{\Phi_{fast}} + w_3 \frac{\Phi_{th}}{\Phi_{total}}$$

where $\Phi_{th}$, $\Phi_{fast}$, and $\Phi_{total}$ are the thermal, fast, and total neutron fluxes, respectively. The weights $w_1$, $w_2$, and $w_3$ balance the direct thermal flux contribution, the thermal-to-fast neutron ratio, and the thermalization efficiency. A larger $R$ value corresponds to a higher thermal neutron flux and a more efficient neutron beam for PGNAA applications. To evaluate the neutron flux components, the MCNP F4 tally was applied across the thermal, epithermal, and fast energy ranges with relative errors below 0.5%.
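Reading the reward as a weighted sum of the three metrics named in the text (thermal flux, thermal-to-fast ratio, and thermalization efficiency taken as thermal over total flux), a minimal sketch looks like this. The weight values are not given in the source, so the defaults below are placeholders, and the function name `reward` is hypothetical.

```python
def reward(phi_th, phi_fast, phi_total, w1=1.0, w2=1.0, w3=1.0):
    """Weighted sum of thermal flux, thermal-to-fast ratio, and
    thermalization efficiency (thermal / total).

    The weights are placeholders; the source does not state their values.
    """
    if phi_fast <= 0 or phi_total <= 0:
        raise ValueError("fast and total flux must be positive")
    return w1 * phi_th + w2 * phi_th / phi_fast + w3 * phi_th / phi_total


# Arbitrary illustrative fluxes (not simulation results).
example = reward(phi_th=2.0, phi_fast=4.0, phi_total=8.0)
```

Because the first term is an absolute flux and the other two are ratios, the weights in practice also have to absorb the difference in scale between the terms.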

Deep Q-learning architecture

We implement deep Q-learning to optimize PGNAA system parameters through reinforcement learning. The end-to-end framework consists of three main stages: initial state, training process, and terminal state (Fig. ).

The framework optimizes the action-value function $Q(s, a)$, representing the expected cumulative reward for action $a$ in state $s$:

$$Q(s, a) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t r_t \,\middle|\, s_t = s,\ a_t = a\right]$$

where $r_t$ is the reward at time step $t$ and $\gamma$ is the discount factor. The reward function incorporates thermal neutron flux maximization to guide convergence toward optimal configurations. The complete DQL training procedure integrating all components is presented in Algorithm 1. We ensure stable learning through experience replay and target networks. Experience replay stores agent interactions $(s, a, r, s')$ in memory $D$, breaking data correlations during training.
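The discounted return inside the expectation and the replay memory can be sketched in a few lines. This is a generic illustration, not the authors' code: the class name `ReplayBuffer`, the `done` flag, the capacity, and the discount value used in the example are assumptions, since the source does not state them.

```python
import random
from collections import deque


def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t over one episode's reward sequence."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))


class ReplayBuffer:
    """Fixed-capacity memory of (s, a, r, s', done) transitions."""

    def __init__(self, capacity, seed=None):
        self.memory = deque(maxlen=capacity)  # oldest transitions drop out first
        self.rng = random.Random(seed)

    def push(self, state, action, reward, next_state, done):
        self.memory.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform sampling breaks the correlation between consecutive steps.
        return self.rng.sample(list(self.memory), batch_size)

    def __len__(self):
        return len(self.memory)


buffer = ReplayBuffer(capacity=3, seed=0)
for step in range(5):
    buffer.push(state=step, action=0, reward=1.0, next_state=step + 1, done=False)
batch = buffer.sample(2)
```

Sampling uniformly from a bounded memory is the simplest variant; prioritized schemes exist but are not mentioned in the source.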

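The target network with parameters $\theta^-$ provides stable targets:

$$L_i(\theta_i) = \mathbb{E}_{(s,a,r,s') \sim D}\left[\left(r + \gamma \max_{a'} Q(s', a'; \theta^-) - Q(s, a; \theta_i)\right)^2\right]$$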

where $\theta_i$ represents the primary network parameters.

Training and exploration strategy

The model uses dual neural networks: a primary network for action prediction and a target network for weight updates. Experience replay stabilizes learning from past experiences, while randomly initialized weights allow comprehensive exploration of the configuration space. To balance exploration and exploitation, we use an epsilon-greedy approach. Over training, epsilon declines from 0.98 to 0.01, so that 1% random exploration is retained. At each step, action selection is determined by a random number: numbers below epsilon trigger a random action (exploration), while numbers above epsilon select the action with the highest estimated value (exploitation).

MPI parallelization was used to run MCNP6 simulations on a high-performance computing cluster (Intel Platinum 8358, 503 GB RAM, 128 cores). All simulations used 128 cores, which made use of almost all available cores and gave effective parallel execution. Over 1,500 training episodes, the agent gradually improves its choices while lowering the loss. Every geometric and material parameter is included in the state representation, and the policy is updated at each step based on simulation feedback. The epsilon-greedy decay curve (Fig. (a)) shows the transition from exploration-dominated early training to exploitation-focused later episodes. Training performance metrics (Fig. (b)) show the evolution of total reward, average reward, and training loss, indicating successful convergence with increasing reward stability over episodes. Hyperparameters are listed in Table and were selected within established reinforcement learning ranges [ ].
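The epsilon-greedy rule and the target used in the loss can be illustrated with a short sketch. The start and floor values of epsilon come from the text; the multiplicative decay form, the decay rate, and the function names are assumptions made for illustration only.

```python
import random

EPS_START, EPS_MIN = 0.98, 0.01  # values stated in the text
DECAY_RATE = 0.995               # hypothetical; the source does not give the rate


def select_action(q_values, epsilon, rng):
    """Random action with probability epsilon, greedy action otherwise."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=q_values.__getitem__)


def decay_epsilon(epsilon):
    """Multiplicative decay with a floor (an assumed schedule)."""
    return max(EPS_MIN, epsilon * DECAY_RATE)


def td_target(reward, next_q_from_target_net, gamma, done):
    """Bootstrapped target r + gamma * max_a' Q(s', a') from the target network."""
    if done:
        return reward
    return reward + gamma * max(next_q_from_target_net)


rng = random.Random(0)
greedy_choice = select_action([0.1, 0.9, 0.3], epsilon=0.0, rng=rng)

epsilon = EPS_START
for _ in range(1500):
    epsilon = decay_epsilon(epsilon)
```

With these placeholder settings epsilon reaches its floor well before the last episode; a slower rate would spread exploration over more of the 1,500 episodes.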

Table . DQL hyperparameters.
Q-Learning Parameters
Activation Functions: ReLU, Linear
Optimizer
Training
Learning Rate: 0.0005
Minibatch Size: 64
Other Parameters

Figure caption (fragment): (convergence criteria met). The agent continuously interacts with the MCNP simulation environment until the optimal geometry configuration is achieved.

Figure caption: (a) Epsilon decreases from 0.98 to 0.01. (b) Training metrics display the total reward (blue), average reward (orange), and training loss (green), demonstrating convergence through increased reward stability and decreased loss over episodes. The highest average reward was observed at episode 520, indicating that the optimal policy had been learned during training.

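Algorithm 1 MCNP DQL Auto-Optimization
Initialize: primary network $\theta$, target network $\theta^- \leftarrow \theta$, replay memory $D$ (capacity), $\epsilon \leftarrow \epsilon_{start}$
Parameters: episodes, step_size, $\gamma$, batch_size, target update frequency, $\epsilon_{min}$, decay_rate
Environment: MCNP simulator with random initial geometry/material
for each episode do
    $s \leftarrow$ ResetEnvironment() (random geometry/material)
    for each step up to step_size do
        Select action $a = \arg\max_a Q(s, a; \theta)$ with probability $1 - \epsilon$, random action otherwise
        Execute action $a$, run MCNP simulation
        Observe reward $r$, next state $s'$
        $D \leftarrow D \cup \{(s, a, r, s')\}$
        if $|D| \geq$ batch_size then
            Sample minibatch from $D$
            Update $\theta$ via gradient descent on $L(\theta)$
        end if
        if update frequency reached then
            Update target network $\theta^- \leftarrow \theta$
        end if
    end for
    $\epsilon \leftarrow \max(\epsilon_{min}, \epsilon \times$ decay_rate$)$
    if convergence criteria met then
        Return: Optimal geometry and material found
    else if episode = episodes (maximum episodes reached) then
        stop training
    end if
end for
Return: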

Best configuration achieving highest cumulative reward

Model validation and reproducibility

We performed 30 independent training runs with different random initializations to evaluate learning consistency and policy stability, and thereby to verify the robustness of our DQL technique (Fig. , Supplementary Table ). Because each run began from its own baseline environment state and randomly initialized network weights, this tests the stability of the optimization process. The model showed consistent optimization, with a mean final reward of 16.41 ± 2.32 across all runs (CV = 14.13%) [ ]. The highest rewards were regularly obtained by Action 9, which increased the thickness of the S5 collimator by 0.2 cm (mean = 17.18, 95% CI [14.59, 19.77], n = 6, CV = 14.4%). The performance of Action 1 (raising the S1 moderator thickness toward -26 cm) was consistently low (mean = 5.56, 95% CI [4.20, 6.92], n = 6) (Fig. ). Welch's t-test [ ] (t(10) = 10.18, p < 0.001) and Cohen's d = 5.93 both revealed a highly significant performance difference. The large effect size and high statistical power (0.99) [ ] support the stability of this result. Normality tests [ ] and Mann-Whitney U values [ ] were also evaluated. The statistical analysis of action-reward distributions from the 30 DQL runs determined the mean reward, SD, 95% CI, CV, Shapiro-Wilk test (W and p-value), and action ranking. Reproducibility metrics for high-reward runs are provided in Table , comparing high- and low-reward actions.
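The reported comparison between Action 9 and Action 1 can be approximately reproduced from the summary statistics alone. The sketch below assumes the 95% confidence intervals are Student-t intervals on the mean with n = 6 (so the standard deviation can be backed out from the interval half-width), and it uses the equal-n pooled standard deviation for Cohen's d; both are assumptions, since the source does not state how the intervals or the effect size were computed. Under these assumptions the t statistic comes out near 10.2 and d near 5.9, close to the reported 10.18 and 5.93.

```python
import math

T_CRIT_DF5 = 2.5706  # two-sided 95% Student-t quantile for 5 degrees of freedom


def sd_from_ci(lower, upper, n, t_crit):
    """Back out the sample SD from a t-based confidence interval on the mean."""
    half_width = (upper - lower) / 2.0
    return half_width / t_crit * math.sqrt(n)


def welch_t(m1, s1, n1, m2, s2, n2):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    v1, v2 = s1 ** 2 / n1, s2 ** 2 / n2
    t = (m1 - m2) / math.sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df


def cohens_d(m1, s1, m2, s2):
    """Cohen's d with the pooled SD for two groups of equal size."""
    pooled = math.sqrt((s1 ** 2 + s2 ** 2) / 2.0)
    return (m1 - m2) / pooled


# Summary statistics quoted in the text.
sd_a9 = sd_from_ci(14.59, 19.77, n=6, t_crit=T_CRIT_DF5)
sd_a1 = sd_from_ci(4.20, 6.92, n=6, t_crit=T_CRIT_DF5)
t_stat, df = welch_t(17.18, sd_a9, 6, 5.56, sd_a1, 6)
d = cohens_d(17.18, sd_a9, 5.56, sd_a1)
cv_a9 = 100.0 * sd_a9 / 17.18
```

The recovered coefficient of variation for Action 9 is also about 14.4%, which is consistent with the value quoted in the text and supports the t-interval assumption.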

Figure caption: Action 9 demonstrates superior efficiency, validating the DQN's reward-based optimization strategy for thermal neutron flux maximization. (Axes: Action; Thermalization Efficiency; Thickness.)

Table . Statistical analysis of action-reward distributions from 30 DQL runs.
Columns: Action; 95% CI [L, U]; CV (%); p-value.
95% CI values, in table order: [14.59, 19.77] (Action 9), [8.36, 20.66], [12.84, 14.98], [10.30, 17.44], [4.58, 17.52], [3.81, 16.53], [1.92, 15.30], [3.56, 12.48], [4.20, 6.92] (Action 1), [3.93, 6.27].
Table notes: Action 9 vs Action 1 (Welch's t, Mann-Whitney U, Cohen's d); high-reward runs (n, mean, power).

3 Results and Discussion

The DQL-MCNP framework found the optimal solution at episode 520 after 1,035 training steps and 2.7 hours of training, exploring a design space of 7,022,400 possible configurations.

This corresponds to a 23% reduction in computational time compared with the GA baseline. Table lists the five best-performing episodes together with the relevant geometric parameters and complete thermal metrics. BeO is found to be the best primary moderator at 44 cm thickness (Fig. (a)), which illustrates the methodical material exploration. The probability distributions (Fig. (b)) confirm that BeO outperforms the other materials in both performance and reliability. Figures (c) (26 cm S2), (d) (8.5 cm S3), (e) (4 cm S4), and (f) (3.2 cm S5) illustrate the optimization of the secondary components. Our DQL-MCNP approach outperforms the baseline algorithm on several important criteria. The main goal of maximizing thermal neutron production is reflected in the improvement factor of 1.81 for thermal neutron flux ($\Phi_{th}$) over the baseline. With a DQL-to-baseline factor of 1.42, thermalization efficiency is also improved, indicating more efficient neutron moderation. The SNR is reduced to 0.7 times the baseline value; however, since the reward function does not target SNR, this change reflects shielding and detector effects and should be regarded as a secondary outcome rather than a direct optimization result. The SNR in PGNAA systems is calculated by comparing the net gamma signal from sample elements with background contributions from surrounding structures [ ].

The FMESH neutron flux tally is displayed in Fig. . The thermal neutron flux inside the sample region is comparatively small in the baseline arrangement (Fig. (a)). In contrast, the thermal and total flux inside the sample region are significantly higher in the DQL-optimized design (Fig. (b)). The DQL-derived design improves thermalization efficiency by optimizing the reward term $\Phi_{th}/\Phi_{total}$. In the optimal configuration, the thermal flux shows a 1.81× enhancement over the baseline and the thermalization efficiency reaches 1.65×10⁻⁵.

Table . Five best-performing episodes.
Columns: Episode; Optimized Geometry [Material, Thickness (cm)] for S1 (Moderator), S2 (Moderator), S3 (Multiplier), S4 (Shield), S5 (Collimator); Thermalization Efficiency; (DQL/Baseline).
S1 and thermalization efficiency, in table order: 44 (BeO), 1.65×10⁻⁵; 43 (BeO), 1.53×10⁻⁵; 42 (BeO), 1.36×10⁻⁵; 45 (BeO), 1.46×10⁻⁵; 43 (BeO), 1.11×10⁻⁵.
GA Baseline Initial Configuration: [45 (BeO), 31, 10, 6, 5], 1.16×10⁻⁵.

Figure caption (fragment): Configurations (Episode 240 as Geometry 1 and Episode 780 as Geometry 2) in Figure , an SNR of 2, and an improvement factor of 1.42.
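As a quick consistency check on the quoted improvement factor, the ratio of the DQL and GA-baseline thermalization efficiencies reported in the comparison table (1.65E-05 and 1.16E-05) can be written out explicitly. The symbol $\eta$ for thermalization efficiency is introduced here for illustration and does not appear in the source.

$$\frac{\eta_{DQL}}{\eta_{GA}} = \frac{1.65 \times 10^{-5}}{1.16 \times 10^{-5}} \approx 1.42$$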

Figure caption (fragment): Black indicates the sample region.

The spectral performance analysis shows a considerable enhancement of peak neutron flux at thermal energy. With similar energy resolution (FWHM = 0.600 for both configurations), the DQL-optimized design achieves a peak intensity about 1.8 times higher than the baseline (prefactors of 1.89 versus 1.04, in n/cm²/s) (Fig. (a)). Figure shows improvement factors for the DQL-optimized design along the energy spectrum, with values ranging from 2.3× at lower indices to 7.4× at higher indices. The spectral fitting study (Fig. (c,d)) indicates that Lorentzian profiles offer the best fit for both configurations (goodness of fit 0.891 for the baseline and 0.829 for the optimized design). This increase in peak flux without loss of energy selectivity supports the DQL optimization approach.
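For reference, a Lorentzian peak profile of the kind used in such fits has the standard form below, where the full width at half maximum enters directly as a parameter. The symbols $A$ (peak height), $E_0$ (peak position), and $\Gamma$ (FWHM) are notation chosen here and are not taken from the source.

$$L(E) = A \, \frac{(\Gamma/2)^2}{(E - E_0)^2 + (\Gamma/2)^2}$$

At $E = E_0 \pm \Gamma/2$ the profile falls to $A/2$, so two spectra with the same $\Gamma$ but different $A$ differ in peak height without any change in width, which is the situation described for the baseline and optimized designs.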

Figure caption: Optimal PGNAA architecture. States: [S1, Sm, S2, S3, S4, S5] = [44, BeO, 26, 8.5, 4, 3.2]; Episode/Step: 520/1,035.

We chose aluminum (Al) to validate the proposed DQL approach because of its industrial relevance and its importance as the third most abundant element in the Earth's crust, with an annual global production of around 62 million tons [ ]. For its two strongest characteristic gamma signatures, at 7.724 MeV and 4.133 MeV ($\sigma$ = 0.0493 and 0.0149 barns, respectively), the DQL modification effectively improved Al detection. Comparing three geometries reveals important performance trade-offs. Geometry 1 (Episode 240) is an early-stage DQL solution with moderate performance. Geometry 2 (Episode 780) trades absolute gamma-ray flux intensity for the maximum SNR (2.19) with good noise suppression. The DQL-optimized configuration achieves enhanced thermalization efficiency (1.65×10⁻⁵) while maintaining balanced spectral performance. Gamma spectrum analysis shows improvements in the characteristic peaks of Al. At the primary 7.724 MeV line, the DQL-optimized configuration offers a significant increase, raising the intensity by 25% (prefactor from 4.0 to 5.0, in photons/cm²·s) (Fig. ). At 4.133 MeV, the baseline setup shows a slightly greater intensity (prefactors of 1.18 versus 1.14, in photons/cm²·s).

To evaluate the generalizability of our method, we used sodium chloride (NaCl), a compound that has been widely studied in neutron activation investigations [ ]. By increasing the neutron thermalization efficiency, the DQL optimization increased the prompt gamma-ray emissions from both sodium and chlorine in the NaCl compound. The nearly identical enhancement factors (31% for Na and 30% for Cl) show that this approach reliably increases thermal neutron flux and detection sensitivity (Fig. ).

3.1 Highlight Differences between Deep Q-learning-based Monte Carlo and Genetic Algorithm optimization

The DQL-MCNP and standard GA approaches differ in how they explore the optimization space and handle parameter tuning. The GA approach uses fixed, well-established hyperparameters for mutation, population size, and crossover rates [ ]. In contrast, DQL adaptively adjusts its policy and action-selection strategy based on feedback from previous simulation outcomes.

DQL shows online convergence because it uses real-time reward feedback to modify its policy at each step. This makes it possible to identify the optimal region of the solution space and to detect performance plateaus during training. In contrast, GA requires an extensive generational analysis: actual convergence can only be confirmed once all generations are completed, since population-based search makes it difficult to distinguish true convergence from an early plateau during training [ ].

DQL and GA examined a comparable number of evaluations (DQL: 1,500 episodes, 10 steps, approximately 3,000 MCNP evaluations; GA: 30 generations × 50 population = 1,500 MCNP evaluations); however, they reached the optimal region in significantly different ways. The GA reached its best fitness at generation 21 after 3.56 hours of training, while DQL reached its best reward at episode 520, equivalent to 2.74 hours of training. This implies that DQL found its optimal configuration about 23% faster (Fig. ). Because DQL continuously modifies its strategy and identifies high-reward regions throughout training, it converges earlier. GA, on the other hand, can become trapped in a local optimum. The best GA configuration for this run consisted of BeO as the material with the parameters S1, S2, S3, S4, and S5 at extreme or boundary values (Table , Supplementary Figure ). This suggests that, although this was the highest fitness the GA could find, the GA may not have fully explored the search space and may have overlooked better-performing configurations.
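The quoted speed-up follows directly from the two times to convergence. Writing $t_{GA}$ and $t_{DQL}$ for these times (symbols introduced here for illustration):

$$\frac{t_{GA} - t_{DQL}}{t_{GA}} = \frac{3.56 - 2.74}{3.56} \approx 0.23$$

Likewise, a GA that reaches its best fitness at generation 21 with a population of 50 has used $21 \times 50 = 1050$ evaluations, which matches the evaluations-to-convergence entry listed for the baseline GA.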

Table . Comparison of DQL and GA optimization.
Columns: Method; Optimized Geometry [cm]; Therm. Eff.; Improvement.
DQL: S1 (BeO); 1.65E-05; 1.42× / 0.7×
GA (30×50): 1.16E-05
GA (50×30) [Baseline]: 1.16E-05
GA (100×30): 1.15E-05
Convergence Performance
Configuration: DQL, 1500 episodes; GA, 30 pop × 50 gen; GA, 50 pop × 30 gen [Baseline]; GA, 100 pop × 30 gen
Total Evaluations
Evaluations to Convergence: 1050 [Baseline]
Time to Convergence (h): 3.56 [Baseline]
Total Time (h)
Best Fitness: 754.43 [Baseline]

Compared with other studies, our approach outperforms the multi-objective automated GA of Ma et al. [ ] in thermal flux and offers superior thermalization. Single-objective automated methods such as that reported by Cheng et al. [ ] achieve a higher SNR (2.56×). However, as SNR was not included in our reward, the observed values are comparable. In terms of thermal flux and thermalization efficiency, real-time adaptive optimization performs better than pre-trained or manual techniques such as hybrid MLP + Q-learning and manual tuning [ ].

Although the reported improvement factors (Table ) are relative because of variations in neutron sources, system designs, and simulation parameters, they demonstrate how well the DQL approach adapts to novel geometries in real-time MCNP simulations. Although this study focuses on a 14.1 MeV D-T neutron source, the framework is inherently flexible and can be extended to other neutron energies and PGNAA system scales by changing the state representation and reward function. Larger state spaces or more complex geometries may increase computing costs, although the technique is expected to remain effective.

Columns: Method; Multi-objective; Automated; Computational Cost; Improvement Factor [ ]; Learning Type; Reference
DQL-MCNP; Medium; Real-time adaptive; This work
Genetic Algorithm; Population-based; Ma et al. [ ]
Genetic Algorithm; Population-based; Cheng et al. [ ]
Hybrid MLP + Q-Learning; Pre-trained model; Zolfaghari et al. [ ]
Manual Parameter Tuning; Very High; Manual iteration; Li et al. [ ]

4 Conclusions

We conclude that our deep Q-learning approach, coupled with Monte Carlo simulations, yielded superior multi-parameter optimization for PGNAA system design. The framework incorporated sensitivity analysis using the Morris method to reduce computing cost and identify key parameters. The optimization for a 14.1 MeV neutron source was divided into three stages: initialization (random state and network setup), training (iterative state observation, action selection, reward evaluation, and Q-network updates), and termination (convergence criteria fulfilled). The reward function includes thermalization efficiency, thermal-to-fast flux ratio, and thermal flux, which depend on the geometric properties of the moderator, collimator, multiplier, and shielding components. The DQL agent efficiently searched a space of about 7 million states and performed multi-objective optimization, reducing computation time by 23% relative to the GA baseline without the need for expert intervention. The effectiveness of the optimization was demonstrated by the notable improvement in characteristic gamma signals observed during materials validation. The DQL algorithm demonstrates broad applicability to different neutron thermalization device designs and material compositions. Its text-based input processing allows direct optimization from Monte Carlo simulation results without requiring pre-training data, offering autonomous, data-driven nuclear system development with demonstrable advantages over traditional methods. Although this study focuses on a 14.1 MeV D–T neutron source, the framework is naturally flexible and could be extended to different neutron energies, system scales, and reward functions, highlighting its potential generalizability to a wide range of PGNAA configurations. We believe that automated hyperparameter optimization and real-time experimental data integration can significantly improve the efficiency of the proposed pipeline. Future work may include transfer learning methodologies for broader nuclear engineering applications and experimental verification to validate the simulation-based results.
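The three stages summarized above (initialization, training, termination) can be illustrated with a minimal sketch. It is not the authors' implementation: a tabular Q-table stands in for the Q-network, a toy function `surrogate_tallies` stands in for an MCNP run, the reward weights are placeholders, and training stops after a fixed episode budget rather than on a convergence criterion. All names and numeric settings are hypothetical.

```python
import random

LEVELS = 5        # hypothetical number of discrete values per parameter
N_PARAMS = 3      # hypothetical number of geometry parameters
ACTIONS = [(p, d) for p in range(N_PARAMS) for d in (-1, 1)]  # +/- one level


def surrogate_tallies(state):
    """Toy stand-in for an MCNP run: returns (thermal, fast) counts."""
    thermal = 1.0 + sum(s * (LEVELS - 1 - s) for s in state)  # peaks mid-range
    fast = 1.0 + sum(LEVELS - 1 - s for s in state)
    return thermal, fast


def reward(state, w_eff=0.4, w_ratio=0.3, w_flux=0.3):
    """Weighted sum of efficiency, thermal-to-fast ratio, and thermal flux.
    The weights are placeholders, not the values used in the paper."""
    thermal, fast = surrogate_tallies(state)
    efficiency = thermal / (thermal + fast)
    return w_eff * efficiency + w_ratio * thermal / fast + w_flux * thermal


def step(state, action):
    """Apply an action, clipping the parameter to its allowed range."""
    p, d = ACTIONS[action]
    nxt = list(state)
    nxt[p] = min(LEVELS - 1, max(0, nxt[p] + d))
    return tuple(nxt)


def train(episodes=300, steps=10, alpha=0.5, gamma=0.9, eps=0.2, seed=0):
    rng = random.Random(seed)
    q = {}  # Q-table: (state, action) -> value
    best_state, best_reward = None, float("-inf")
    for _ in range(episodes):
        # Initialization: random starting state for each episode.
        state = tuple(rng.randrange(LEVELS) for _ in range(N_PARAMS))
        for _ in range(steps):
            # Action selection: epsilon-greedy over the current Q-values.
            if rng.random() < eps:
                action = rng.randrange(len(ACTIONS))
            else:
                action = max(range(len(ACTIONS)),
                             key=lambda a: q.get((state, a), 0.0))
            nxt = step(state, action)
            r = reward(nxt)  # reward evaluation (one "simulation" per step)
            # Q-learning update toward the bootstrapped target.
            target = r + gamma * max(q.get((nxt, a), 0.0)
                                     for a in range(len(ACTIONS)))
            old = q.get((state, action), 0.0)
            q[(state, action)] = old + alpha * (target - old)
            if r > best_reward:
                best_state, best_reward = nxt, r
            state = nxt
    # Termination: fixed episode budget in this sketch.
    return best_state, best_reward


best_state, best_reward = train()
print(best_state, round(best_reward, 3))
```

In a real coupling, `surrogate_tallies` would be replaced by writing a modified input deck, running the transport code, and parsing the tallies, which is why each step is expensive and the number of steps per episode is kept small.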

Data Availability

The complete reproducibility analysis across 30 independent DQL runs is provided in the supplementary materials. The pseudocode and key results of the GA baseline are also included for reference. Additional implementation details and simulation scripts are available from the authors; the full MCNP simulation datasets are not publicly available but can be shared upon request for research purposes.

References

M. R. Abdelnour, J. Liu, K. Hossny, A. Wajid, W. Li, and Z. Liu, “Prompt gamma neutron activation analysis: A review of applications, design, analytics, challenges, and prospects,” Radiation Physics and Chemistry, p. 112693, 2025. [Online]. Available:

Z. U. Koreshi and H. Khan, “Optimization of Moderator Design for Explosive Detection by Thermal Neutron Activation Using a Genetic Algorithm,” Journal of Nuclear Engineering and Radiation Science, vol. 2, no. 3, p. 031018, 06 2016. [Online]. Available:

M. Zolfaghari, S. F. Masoudi, and F. Rahmani, “Optimization of linac-based neutron source for thermal neutron activation analysis,” Journal of Radioanalytical and Nuclear Chemistry, vol. 317, pp. 1477–1483, 2018.

M. Vatani, M. Hassanvand, J. Mokhtari, and M. Choopan Dastjerdi, “Design of an in-tank thermal neutron beam for PGNAA application at Isfahan MNSR,” Nuclear Engineering and Design, vol. 412, p. 112451, 2023. [Online]. Available:

R. Uhlář, M. Kadulová, P. Alexa, and J. Pištora, “A new reflector structure for facility thermalizing D–T neutrons,” Journal of Radioanalytical and Nuclear Chemistry, vol. 300, pp. 809–818, 2014.

A. H. Hegazy, V. Skoy, and K. Hossny, “Optimization of shielding-collimator parameters for ING-27 neutron generator using MCNP5,” EPJ Web of Conferences, vol. 177, p. 02003, 2018, the XXI International Scientific Conference of Young Scientists and Specialists (AYSS-2017). [Online]. Available:

C. Cheng, Z. Wei, D. Hei, W. Jia, A. Sun, J. Li, P. Cai, D. Zhao, Q. Shan, and Y. Ling, “Design of a PGNAA facility using D-T neutron generator for bulk samples analysis,” Nuclear Instruments and Methods in Physics Research Section B: Beam Interactions with Materials and Atoms, vol. 452, pp. 30–35, 2019. [Online]. Available:

Z. Wang, Y. Wang, H. Xu, and H. Xie, “Application of simulated annealing algorithm in core flow distribution optimization,” Energies, vol. 15, no. 21, p. 8242, 2022.

Y.-H. Lin, M.-T. Lee, and Y.-H. Hung, “A thermal management control using particle swarm optimization for hybrid electric energy system of electric vehicles,” Results in Engineering, vol. 21, p. 101717, 2024. [Online]. Available: https:

C. Cheng, Y. Xie, X. Xia, J. Gu, P. Wang, L. Xing, M. Wang, D. Hei, H. Lei, and J. Wenbao, “Neutron collimator optimization for 14.1 MeV DT neutrons using Monte Carlo and genetic algorithms,” Applied Radiation and Isotopes, vol. 198, p. 110838, 2023. [Online]. Available: S0969804323001914

B. Ma, M. Yan, X. Li, Q. Jiang, S. Wang, and Z. Liu, “Optimization design for moderator, reflector, and shielding of deuterium-deuterium neutron source,” Journal of Instrumentation, vol. 19, no. 05, p. P05076, May 2024. [Online]. Available:

B. Kazimipour, X. Li, and A. K. Qin, “A review of population initialization techniques for evolutionary algorithms,” in 2014 IEEE Congress on Evolutionary Computation (CEC). IEEE, 2014, pp. 2585–2592.

H. Alibrahim and S. A. Ludwig, “Hyperparameter optimization: Comparing genetic algorithm against grid search and Bayesian optimization,” in 2021 IEEE Congress on Evolutionary Computation (CEC). IEEE, 2021, pp. 1551–1559.

Y. Ge, Y. Zhong, I. Murata, S. Tamaki, N. Yuan, Y. Sun, W. Ma, L. Zou, Z. Yang, and L. Lu, “Efficient optimization of an accelerator neutron source for neutron capture therapy using genetic algorithms,” Medical Physics, vol. 51, no. 9, pp. 6445–6457, 2024. [Online]. Available:

V. Sobes, B. Hiscox, E. Popov, R. Archibald, C. Hauck, B. Betzler, and K. Terrani, “AI-based design of a nuclear reactor core,” Scientific Reports, vol. 11, no. 1, p. 19646, 2021. [Online]. Available:

A. Erdoğan and M. Geçkinli, “A PWR reload optimisation code (xcore) using artificial neural networks and genetic algorithms,” Annals of Nuclear Energy, vol. 30, no. 1, pp. 35–53, 2003. [Online]. Available: S0306454902000415

M. Kamuda and C. J. Sullivan, “An automated isotope identification and quantification algorithm for isotope mixtures in low-resolution gamma-ray spectra,” Radiation Physics and Chemistry, vol. 155, pp. 281–286, 2019, IRRMA-10. [Online]. Available:

L.-F. Chen, “Machine learning-assisted optimization of modular neutron shielding based on Monte Carlo simulations,” arXiv preprint arXiv:2504.17319, 2025.

Y. Liu, B. Wang, S. Tan, T. Li, W. Lv, Z. Niu, J. Li, P. Gao, and R. Tian, “Applications reinforcement learning nuclear energy: review,” Nuclear Engineering and Design, vol. 429, p. 113655, 2024. [Online]. Available:

M. Zolfaghari, S. F. Masoudi, F. Rahmani, and A. Fathi, “Thermal neutron beam optimization for PGNAA applications using Q-learning algorithm and neural network,” Scientific Reports, vol. 12, no. 1, p. 8635, 2022.

V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski et al., “Human-level control through deep reinforcement learning,” Nature, vol. 518, no. 7540, pp. 529–533, 2015.

X.-. M. C. Team, “MCNP – a general Monte Carlo N-particle transport code, version 6,” Los Alamos National Laboratory, Tech. Rep. LA-UR-13-22934, 2013. [Online]. Available:

J. Li, W. Jia, D. Hei, Z. Yao, and C. Cheng, “Research on the optimization method for PGNAA system design based on signal-to-noise ratio evaluation,” Nuclear Engineering and Technology, vol. 54, no. 6, pp. 2221–2229, 2022. [Online]. Available:

C. Wang, M. Peng, and G. Xia, “Sensitivity analysis based on Morris method of passive system performance under ocean conditions,” Annals of Nuclear Energy, vol. 137, p. 107067, 2020. [Online]. Available: S0306454919305699

A. Saltelli, M. Ratto, T. Andres, F. Campolongo, J. Cariboni, D. Gatelli, M. Saisana, and S. Tarantola, Global sensitivity analysis: the primer. John Wiley & Sons, 2008.

Y. Wang, Y. Song, W. Yin, H. Li, J. Lv, A.-J. Wang, and H.-C. Wang, “Modeling processes and sensitivity analysis of machine learning methods for environmental data,” in Water Security: Big Data-Driven Risk Identification, Assessment and Control of Emerging Contaminants. Elsevier, 2024, pp. 511–522.

L. L. Snead, D. Sprouster, B. Cheng, N. Brown, C. Ang, E. M. Duchnowski, X. Hu, and J. Trelewicz, “Development and potential of composite moderators for elevated temperature nuclear applications,” Journal of Asian Ceramic Societies, vol. 10, no. 1, pp. 9–32,

R. S. Detwiler, R. J. McConn, T. F. Grimes, S. A. Upton, and E. J. Engel, “Compendium of material composition data for radiation transport modeling,” Pacific Northwest National Lab. (PNNL), Richland, WA (United States), Tech. Rep., 2021. [Online]. Available:

R. G. Williams, C. J. Gesh, and R. T. Pagh, “Compendium of material composition data for radiation transport modeling,” Pacific Northwest National Lab. (PNNL), Richland, WA (United States), Tech. Rep., 2006. [Online]. Available:

R. Uhlář, P. Alexa, and J. Pištora, “A system of materials composition and geometry arrangement for fast neutron beam thermalization: An MCNP study,” Nuclear Instruments and Methods in Physics Research Section B: Beam Interactions with Materials and Atoms, pp. 81–85. [Online]. Available:

C. J. C. H. Watkins and P. Dayan, “Q-learning,” Machine Learning, vol. 8, no. 3, pp. 279–292, 1992. [Online]. Available:

T. Eimer, M. Lindauer, and R. Raileanu, “Hyperparameters in reinforcement learning and how to tune them,” in International Conference on Machine Learning. PMLR, 2023, pp.

X. Dong, J. Shen, W. Wang, L. Shao, H. Ling, and F. Porikli, “Dynamical hyperparameter optimization via deep reinforcement learning in tracking,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 43, no. 5, pp. 1515–1529, 2021.

S. Aronhime, C. Calcagno, G. H. Jajamovich, H. A. Dyvorne, P. Robson, D. Dieterich, M. Isabel Fiel, V. Martel-Laferriere, M. Chatterji, H. Rusinek, and B. Taouli, “DCE-MRI of the liver: Effect of linear and nonlinear conversions on hepatic perfusion quantification and reproducibility,” Journal of Magnetic Resonance Imaging, vol. 40, no. 1, pp. 90–98, 2014. [Online]. Available:

M. Delacre, D. Lakens, and C. Leys, “Why psychologists should by default use Welch’s t-test instead of Student’s t-test,” International Review of Social Psychology, vol. 30, no. 1, pp. 92–101, 2017.

D. S. Quintana, “Statistical considerations for reporting and planning heart rate variability case-control studies,” Psychophysiology, vol. 54, no. 3, pp. 344–349, 2017.

P. E. McKnight and J. Najab, “Mann-Whitney U test,” The Corsini Encyclopedia of Psychology, pp. 1–1, 2010.

Z. Birnbaum, “On a use of the Mann-Whitney statistic,” in Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Contributions to the Theory of Statistics, vol. 3. University of California Press, 1956, pp. 13–18.

H. Folz, J. Henjes, A. Heuer, J. Lahl, P. Olfert, B. Seen, S. Stabenau, K. Krycki, M. Lange-Hegermann, and H. Shayan, “PGNAA spectral classification of aluminium and copper alloys with machine learning,” arXiv preprint arXiv:2404.14107, 2024.

T. Czakoj, M. Košťál, E. Novák, E. Losa, J. Šimon, M. Schulc, F. Mravec, F. Cvachovec, J. Rataj, and Zdeněk Matěj, “Measurement of prompt neutron capture gamma coming from iron and chlorine,” Annals of Nuclear Energy, vol. 198, p. 110317, 2024. [Online]. Available:

N. A. Elsheikh, “Characterization of (252Cf-ZrH2) Monte Carlo model for detection of nitrogen and chlorine by thermal neutron-capture PGNAA,” Radiation Physics and Chemistry, vol. 188, p. 109591, 2021.

M. Sarfraz and S. A. Raza, “Visualization of data using genetic algorithm,” in Soft Computing and Industry, R. Roy, M. Köppen, S. Ovaska, T. Furuhashi, and F. Hoffmann, Eds. Springer, London, 2002, pp. 403–410. [Online]. Available:

R. Uhlar, M. Kadulova, P. Alexa, and J. Pistora, “A new reflector structure for facility thermalizing D-T neutrons,” Journal of Radioanalytical and Nuclear Chemistry, vol. 300, no. 2, pp. 809–818, 2014.

Acknowledgments

The authors acknowledge the support of the following: the Fundamental Research Funds for Central Universities at Lanzhou University (lzujbky-2023-ct05, lzujbky-2023-stlt01); the Central Government’s Guidance Funds for Local Science and Technology Development (24ZYQA045, YDZX20216200001297); the Ling Chuang Research Project of China National Nuclear Corporation (CNNC-LCKY-2024-080); the Special Funds from Gansu Nuclear Industry Research Institute; the National Key Research and Development Program of China (2023YFF1303501); the Lanzhou University Talent Cooperation Research Funds sponsored by Lanzhou City (561121203); and the National Natural Science Foundation of China (11975115).

Author contributions

Competing interests

The authors declare no competing interests.

Supplementary Material

Deep Q-learning for autonomous optimization of neutron thermalization devices for PGNAA applications

Supplementary Tables

Table S1: Reproducibility over 30 independent DQL runs. M1–M6 denote the optimal geometry parameters (cm). The two Action columns give the actions with the highest and lowest rewards, which show consistent learning behavior.

Therm. Thermal Action Reward Action Reward

43.80 BeO 24.17

CV (%)

High-reward actions: Action 5 (26.7%), Action 9 (20.0%), Action 0 (13.3%), Action 2 (13.3%)

Low-reward actions: Action 1 (16.7%), Action 6 (16.7%), Action 8 (16.7%), Action 4 (13.3%)

Performance gap: Mean reward high = 16.41 vs Mean reward low = 5.30 (67.7% lower, p < 0.001)
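The percentages above can be read back as run counts and the performance gap recomputed from the two means. The snippet assumes that each percentage is a share of the 30 independent runs, which the table does not state explicitly; the variable names are illustrative.

```python
n_runs = 30  # independent DQL runs reported in Table S1

# Reported shares of runs for the listed high-reward actions.
high_share = {5: 26.7, 9: 20.0, 0: 13.3, 2: 13.3}
# Implied run counts, assuming each share is a fraction of the 30 runs.
high_counts = {a: round(p / 100 * n_runs) for a, p in high_share.items()}

mean_high, mean_low = 16.41, 5.30
# Relative drop from the high-reward mean to the low-reward mean.
gap = (mean_high - mean_low) / mean_high

print(high_counts)
print(f"{gap:.1%}")
```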

Supplementary Figures

Algorithm 2 Genetic Algorithm for PGNAA Optimization

Require:
    Parameters: population size, generations, mutation rate, crossover rate, elitism, precision
    Input: MCNP base input file, parameter ranges, material options
Initialize random seed
Create initial population with create_individual
Initialize best_ind, best_fitness
for each generation do
    Evaluate fitness for each individual:
        – create modified MCNP input from individual’s parameters
        – run MCNP and extract counts (thermal, fast, total)
        – compute reward R from the thermal and total counts (first weight: 0.3)
    if reward > best_fitness then
        update best_fitness and best_ind
    end if
    Record statistics (best, average)
    Create new population (elites by fitness)
    while new population is not full do
        Select parents by ranking selection (rank-based probabilities)
        if rand() < crossover rate then
            apply simulated binary crossover; mutate both children; append (second child if space)
        else
            mutate parent and append
        end if
    end while
end for
Save results, plots and best configuration
return best_ind

Figure S1: Fitness evolution in genetic algorithms: (a) 50 populations over 30 generations, (b) 50 populations over 50 generations, and (c) 30 populations over 100 generations.
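The structure of Algorithm 2 (elitism, ranking selection, simulated binary crossover, mutation) can be sketched in a few lines. This is an illustration under stated assumptions rather than the authors' code: `fitness` is a toy stand-in for the MCNP-based reward, genes are hypothetical parameters normalized to [0, 1], the linear rank weights, Gaussian mutation, and all rates are assumptions, and material choice and parameter precision are not modeled.

```python
import random


def fitness(ind):
    """Toy stand-in for the MCNP-based reward: maximal at 0.5 in every gene."""
    return -sum((x - 0.5) ** 2 for x in ind)


def rank_select(ranked, rng):
    """Ranking selection: `ranked` is sorted best-first; better ranks weigh more."""
    n = len(ranked)
    weights = [n - i for i in range(n)]
    return rng.choices(ranked, weights=weights, k=1)[0]


def sbx(p1, p2, rng, eta=2.0):
    """Simulated binary crossover on genes bounded to [0, 1]."""
    c1, c2 = [], []
    for a, b in zip(p1, p2):
        u = rng.random()
        if u <= 0.5:
            beta = (2 * u) ** (1 / (eta + 1))
        else:
            beta = (1 / (2 * (1 - u))) ** (1 / (eta + 1))
        c1.append(min(1.0, max(0.0, 0.5 * ((1 + beta) * a + (1 - beta) * b))))
        c2.append(min(1.0, max(0.0, 0.5 * ((1 - beta) * a + (1 + beta) * b))))
    return c1, c2


def mutate(ind, rng, rate=0.1, sigma=0.1):
    """Gaussian mutation applied gene by gene, clipped to [0, 1]."""
    return [min(1.0, max(0.0, x + rng.gauss(0, sigma))) if rng.random() < rate else x
            for x in ind]


def run_ga(pop_size=50, generations=30, n_genes=5, cx_rate=0.9, elitism=2, seed=0):
    rng = random.Random(seed)
    pop = [[rng.random() for _ in range(n_genes)] for _ in range(pop_size)]
    best_ind, best_fitness = None, float("-inf")
    for _ in range(generations):
        ranked = sorted(pop, key=fitness, reverse=True)
        if fitness(ranked[0]) > best_fitness:
            best_ind, best_fitness = ranked[0], fitness(ranked[0])
        new_pop = [list(ind) for ind in ranked[:elitism]]  # keep elites
        while len(new_pop) < pop_size:
            p1, p2 = rank_select(ranked, rng), rank_select(ranked, rng)
            if rng.random() < cx_rate:
                c1, c2 = sbx(p1, p2, rng)
                new_pop.append(mutate(c1, rng))
                if len(new_pop) < pop_size:  # append second child if space
                    new_pop.append(mutate(c2, rng))
            else:
                new_pop.append(mutate(p1, rng))
        pop = new_pop
    return best_ind, best_fitness


best_ind, best_fitness = run_ga()
print(best_ind, best_fitness)
```

With the default arguments the loop performs 50 × 30 = 1,500 fitness rankings per individual slot, matching the evaluation budget quoted for the baseline GA.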

Submission history

[v1] 2025-11-28

3. Department of Physics, Faculty of Women for Arts, Science, and Education, Ain Shams
