Machine learning methods for predicting coordination numbers based on EXAFS spectra
Haitao Zeng, Hu Longfei, Yao Tao
Submitted 2025-10-30 | ChinaXiv: chinaxiv-202511.00010 | Mixed source text

Abstract

X-ray absorption fine structure (XAFS) is an important structural analysis technique widely used to study the oxidation states, coordination environments, and neighboring atomic characteristics of amorphous materials and disordered systems. However, due to the complexity of XAFS spectra, their interpretation relies on experienced researchers and is prone to inaccuracies. This study utilizes machine learning methods, specifically neural networks, bagging decision tree models, and random forest models, to analyze XAFS data for predicting the coordination numbers of absorbing elements. EXAFS data for fourth-period transition metal elements were collected from the Materials Project database, covering a variety of coordination environments and totaling 13,374 valid data entries. The results demonstrate that both neural network and random forest models exhibit high accuracy in predicting coordination numbers. By enhancing the generalization capability and interpretability of the models, this study provides a more efficient and reliable method for XAFS data analysis.

Full Text

Machine Learning Methods for Predicting Coordination Numbers from EXAFS Spectra

(National Synchrotron Radiation Laboratory, University of Science and Technology of China)

Abstract

X-ray Absorption Fine Structure (XAFS) is a critical structural analysis technique widely utilized to investigate the oxidation states, coordination environments, and neighboring atomic characteristics of amorphous materials and disordered systems. However, due to the inherent complexity of XAFS spectra, data interpretation traditionally relies heavily on the expertise of researchers and is prone to inaccuracies. To address these challenges, this study employs machine learning methods—specifically neural networks, Bagging decision trees, and Random Forest models—to automate and enhance the precision of spectral analysis.

To predict the coordination numbers of absorbing elements, data were collected from the Materials Project database. The dataset comprises Extended X-ray Absorption Fine Structure (EXAFS) data for fourth-period transition metal elements, covering a wide variety of coordination environments and totaling 13,374 valid entries. The results demonstrate that both neural network and random forest models achieve high accuracy in predicting coordination numbers. By enhancing the generalization capability and interpretability of these models, this study provides an efficient and reliable methodology for XAFS data analysis. Feature importance analysis reveals that specific points within the low-$R$ region are critical for predictions, consistent with the dominance of short-range atomic interactions in EXAFS theory.

1. Introduction

X-ray Absorption Fine Structure (XAFS) spectroscopy and its extended region, Extended X-ray Absorption Fine Structure (EXAFS), provide detailed structural information about the local chemical environment surrounding an absorbing atom. The technique has become an essential tool in materials science and catalysis research, with applications that include the coordination environment of metal clusters, local structural distortions in ferroelectric materials, and the catalytic mechanisms of single-atom catalysts.

The standard EXAFS equation is typically expressed as:

$$\chi(k) = \sum_{j} \frac{N_j S_0^2 f_j(k)}{k R_j^2} e^{-2k^2 \sigma_j^2} e^{-2R_j/\lambda(k)} \sin(2kR_j + \delta_j(k))$$

In this expression, the sum runs over coordination shells $j$, and the coordination number $N_j$ is a critical parameter describing the number of neighboring atoms in the $j$-th shell at distance $R_j$. Traditionally, $N_j$ is obtained through non-linear least-squares fitting. However, this process is often hindered by the high correlation between $N_j$ and other amplitude parameters, such as the amplitude reduction factor $S_0^2$, which enters the equation only through the product $N_j S_0^2$, and the Debye-Waller factor $\sigma_j^2$. Furthermore, traditional analysis requires high-quality initial structural models and significant computational expertise.
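The correlation between the coordination number and $S_0^2$ can be made concrete with a minimal single-shell sketch of the equation above. The constant backscattering amplitude, zero phase shift, constant mean free path, and all numeric values are illustrative assumptions, not parameters from this study.

```python
import math

def chi_single_shell(k, n, s02, r, sigma2, f=1.0, delta=0.0, lam=10.0):
    """Single-shell EXAFS term chi(k) from the standard equation.

    f, delta and lam stand in for f_j(k), delta_j(k) and lambda(k); treating
    them as constants is a simplifying assumption for illustration only.
    """
    amplitude = n * s02 * f / (k * r ** 2)
    damping = math.exp(-2.0 * k ** 2 * sigma2) * math.exp(-2.0 * r / lam)
    return amplitude * damping * math.sin(2.0 * k * r + delta)

k_grid = [3.0 + 0.05 * i for i in range(181)]  # k from 3 to 12 inverse Angstrom

# Two hypothetical parameter sets with the same product N * S0^2 = 4.8.
chi_a = [chi_single_shell(k, n=6.0, s02=0.8, r=2.0, sigma2=0.005) for k in k_grid]
chi_b = [chi_single_shell(k, n=4.8, s02=1.0, r=2.0, sigma2=0.005) for k in k_grid]

max_diff = max(abs(a - b) for a, b in zip(chi_a, chi_b))
print(f"largest difference between the two spectra: {max_diff:.2e}")
```

The two spectra agree to floating-point precision, so a fit to this term alone cannot separate a six-coordinate shell with $S_0^2 = 0.8$ from a 4.8-coordinate shell with $S_0^2 = 1.0$ unless one of the two parameters is fixed independently.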

With the rapid development of artificial intelligence, machine learning (ML) has emerged as a transformative technique for spectral analysis. Unlike traditional fitting, ML models can capture complex, non-linear relationships within the data without requiring a detailed physical model for each spectrum. In this work, we leverage neural networks and ensemble learning (Bagging and Random Forest) to establish a predictive framework for determining coordination numbers directly from EXAFS data.

2. Methodology

2.1 Data Acquisition and Preprocessing

The data for this study were sourced from the Materials Project database. We systematically retrieved materials containing fourth-period transition metals and extracted their corresponding EXAFS data generated using the FEFF calculation package. After filtering, 13,374 entries with integer coordination numbers were retained.

For large-scale data processing, we used the Larch toolkit. The autobk function was used for background subtraction and normalization, and the xftf function performed the Fourier transform (FT) that converts EXAFS data from $k$-space to $R$-space. In $R$-space, intensity values were extracted as features at intervals of 0.003 Å. The ground-truth coordination numbers were calculated with the CrystalNN algorithm from the pymatgen package, which is based on Voronoi decomposition and is recognized for its high accuracy.
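The $k$-space to $R$-space step can be sketched with a plain NumPy FFT. This is not Larch's xftf implementation; the function name k_to_r, the window, the $k$-range, the $k$-weight, kstep, nfft, and the 8 Å feature cutoff are all assumptions chosen for illustration, and the input is a synthetic single-shell signal rather than FEFF output.

```python
import numpy as np

def k_to_r(k, chi, kmin=3.0, kmax=12.0, kweight=2, kstep=0.05, nfft=2048):
    """Fourier transform k-weighted chi(k) to |chi(R)| on a uniform R grid."""
    # Resample onto a uniform k grid starting at 0, as the FFT requires.
    k_uniform = kstep * np.arange(nfft)
    chi_uniform = np.interp(k_uniform, k, chi, left=0.0, right=0.0)

    # Hanning window over [kmin, kmax]; zero elsewhere.
    window = np.zeros(nfft)
    inside = (k_uniform >= kmin) & (k_uniform <= kmax)
    window[inside] = np.hanning(int(inside.sum()))

    weighted = chi_uniform * k_uniform ** kweight * window
    # With the EXAFS convention exp(2ikR), the R spacing is pi / (kstep * nfft).
    r = np.pi / (kstep * nfft) * np.arange(nfft // 2)
    chi_r = kstep / np.sqrt(np.pi) * np.fft.fft(weighted)[: nfft // 2]
    return r, np.abs(chi_r)

# Synthetic single-shell signal at R = 2.0 Angstrom with no phase shift.
k = np.arange(0.0, 15.0, 0.05)
chi = np.sin(2.0 * k * 2.0) * np.exp(-2.0 * 0.005 * k ** 2)
r, mag = k_to_r(k, chi)

features = mag[r <= 8.0]  # R-space intensities that would serve as model inputs
print(f"peak at R = {r[np.argmax(mag)]:.2f} Angstrom, {features.size} features")
```

In this sketch the grid spacing is fixed by the assumed kstep and nfft at roughly 0.031 Å; a different sampling interval would need a different nfft or interpolation of the transformed spectrum.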

2.2 Machine Learning Models

We implemented three primary machine learning strategies:

  1. Fully Connected Neural Network (FCNN): The input layer takes the intensity values of the EXAFS spectrum as features. The network has two hidden layers, each followed by a ReLU activation function and a Dropout layer to mitigate overfitting.
  2. Bagging (Bootstrap Aggregating): This involves training multiple versions of a regressor on different bootstrap samples of the training data, each drawn with replacement. The final prediction is the average of the individual model outputs, which reduces variance.
  3. Random Forest: As an extension of Bagging, Random Forest adds feature randomness when building each decision tree: only a random subset of $\sqrt{d}$ of the $d$ input features is considered at each split. This further decorrelates the trees, leading to a more robust model.
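The three strategies listed above can be set up in a few lines with scikit-learn, shown here as a rough sketch rather than the configuration used in this study. Treating the task as classification, the layer sizes, the tree counts, and the synthetic spectra are all assumptions; MLPClassifier also has no Dropout layer, so it only approximates the network described in item 1.

```python
import numpy as np
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)

# Synthetic stand-in for R-space spectra: one first-shell peak whose height
# scales with the coordination number, plus noise. Not the study's data.
r = np.linspace(0.0, 8.0, 80)
labels = rng.choice([4, 6, 8], size=300)
peak = np.exp(-((r - 2.0) ** 2) / (2 * 0.15 ** 2))
X = labels[:, None] * peak[None, :] + rng.normal(0.0, 0.3, size=(300, r.size))

models = {
    # Default base estimator is a decision tree; every split sees all features.
    "bagging": BaggingClassifier(n_estimators=50, random_state=0),
    # Same bootstrap idea, but only sqrt(d) random features are tried per split.
    "random_forest": RandomForestClassifier(
        n_estimators=50, max_features="sqrt", random_state=0
    ),
    # Two hidden ReLU layers (sizes are hypothetical); no dropout available here.
    "fcnn": MLPClassifier(
        hidden_layer_sizes=(64, 32), activation="relu", max_iter=500, random_state=0
    ),
}

scores = {name: cross_val_score(m, X, labels, cv=5).mean() for name, m in models.items()}
for name, score in scores.items():
    print(f"{name}: mean 5-fold accuracy = {score:.2f}")
```

The only structural difference between the two ensembles in this sketch is max_features: Bagging lets every split search all $d$ features, while the Random Forest restricts each split to a random subset of about $\sqrt{d}$.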

A decision tree is built by recursively partitioning the feature space. While neural networks often act as "black boxes," ensembles of decision trees allow feature importance analysis via the Gini index, which makes it easier to see which spectral regions influence the coordination number prediction.
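As a small worked example of the Gini criterion, the sketch below computes the Gini impurity $1 - \sum_k p_k^2$ of a set of labels and the impurity decrease produced by one candidate split. The function names and the toy peak heights and coordination numbers are hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini impurity 1 - sum_k p_k^2 of a list of class labels."""
    total = len(labels)
    return 1.0 - sum((count / total) ** 2 for count in Counter(labels).values())

def impurity_decrease(values, labels, threshold):
    """Weighted Gini decrease when splitting on values <= threshold."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    if not left or not right:
        return 0.0
    n = len(labels)
    children = len(left) / n * gini(left) + len(right) / n * gini(right)
    return gini(labels) - children

# Hypothetical first-shell peak heights and coordination numbers.
peak_height = [1.9, 2.1, 2.0, 3.1, 2.9, 3.0]
coordination = [4, 4, 4, 6, 6, 6]

clean = impurity_decrease(peak_height, coordination, threshold=2.5)
mixed = impurity_decrease(peak_height, coordination, threshold=2.0)
print(f"decrease at 2.5: {clean:.3f}, decrease at 2.0: {mixed:.3f}")
```

A tree picks the split with the largest decrease at each node. Accumulating these decreases for every node that splits on a given feature, weighted by the share of samples reaching the node and averaged over trees, gives the usual Gini-based importance of that feature.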

3. Results and Discussion

The performance of the models was evaluated using five-fold cross-validation. Computational results indicate that the prediction accuracy for both neural networks and ensemble methods is approximately 75-80%.
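For readers unfamiliar with the procedure, five-fold cross-validation can be sketched in plain Python as follows. The helper names, the majority-label stand-in model, and the toy labels are hypothetical and serve only to show how the folds and the accuracy are computed.

```python
import random

def kfold_indices(n_samples, n_folds=5, seed=0):
    """Shuffle sample indices and deal them into n_folds disjoint test folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::n_folds] for i in range(n_folds)]

def cross_validated_accuracy(features, labels, fit, predict, n_folds=5):
    """Mean and per-fold accuracy; each sample is tested exactly once."""
    fold_scores = []
    for test_idx in kfold_indices(len(labels), n_folds):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(labels)) if i not in held_out]
        model = fit([features[i] for i in train_idx], [labels[i] for i in train_idx])
        hits = sum(predict(model, features[i]) == labels[i] for i in test_idx)
        fold_scores.append(hits / len(test_idx))
    return sum(fold_scores) / n_folds, fold_scores

# Toy stand-in model: always predict the most common training label.
def fit_majority(train_x, train_y):
    return max(set(train_y), key=train_y.count)

def predict_majority(model, x):
    return model

toy_labels = [6] * 14 + [4] * 6           # hypothetical coordination numbers
toy_features = [[0.0]] * len(toy_labels)  # unused by this stand-in model
mean_acc, per_fold = cross_validated_accuracy(
    toy_features, toy_labels, fit_majority, predict_majority
)
print(f"mean accuracy over 5 folds: {mean_acc:.2f}")
```

Each model is trained on four folds and scored on the held-out fifth, so the reported accuracy is always measured on spectra the model did not see during training.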

[TABLE:1]

As shown in [FIGURE:1], the neural network model exhibits slightly better predictive performance than the tree-based ensemble models. For example, Fe shows the best performance within the neural network framework, reaching an accuracy of 81.74%, while the prediction accuracy for Cu is lower at 67.39%. This discrepancy is likely related to the distribution of coordination numbers: Fe maintains a stable six-coordinate structure in most materials, whereas Cu exhibits more diverse coordination environments.

[FIGURE:1]

Confusion matrices were constructed to compare true versus predicted coordination numbers. Furthermore, based on the Gini index of the ensemble models, we constructed feature importance distribution maps. The analysis reveals that feature importance weights are primarily concentrated in the short-range interaction zone below 0.3 nm (3 Å). This aligns with EXAFS theory, in which the first coordination shell dominates the signal. For distant coordination shells beyond 0.6 nm (6 Å), the models struggle to establish reliable correspondences due to low signal intensity and increased noise.
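Both diagnostics are simple to compute. The following sketch builds a confusion matrix from true and predicted coordination numbers and measures how much of the total feature importance lies below a chosen $R$ value. The function names, the predictions, and the importance profile are hypothetical and are not results from this study.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, classes):
    """Rows are true coordination numbers, columns are predicted ones."""
    index = {c: i for i, c in enumerate(classes)}
    matrix = np.zeros((len(classes), len(classes)), dtype=int)
    for t, p in zip(y_true, y_pred):
        matrix[index[t], index[p]] += 1
    return matrix

def importance_fraction(r, importances, r_max):
    """Share of total feature importance carried by features with R <= r_max."""
    importances = np.asarray(importances, dtype=float)
    return importances[np.asarray(r) <= r_max].sum() / importances.sum()

# Hypothetical predictions, for illustration only.
y_true = [4, 4, 6, 6, 6, 8]
y_pred = [4, 6, 6, 6, 6, 6]
cm = confusion_matrix(y_true, y_pred, classes=[4, 6, 8])
print(cm)
print("accuracy:", np.trace(cm) / cm.sum())

# Hypothetical importance profile peaked near 2 Angstrom on a 0-8 Angstrom grid.
r = np.linspace(0.0, 8.0, 81)
importances = np.exp(-((r - 2.0) ** 2) / 0.5) + 0.01
low_r_share = importance_fraction(r, importances, 3.0)
print(f"importance below 3 Angstrom: {low_r_share:.2f}")
```

Off-diagonal entries of the matrix show which coordination numbers are confused with one another, and the low-$R$ share condenses an importance map into a single number that can be compared across models.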

We also observed that the Bagging method tends to amplify importance differences, whereas the Random Forest model provides a more balanced distribution. This suggests that Random Forest better handles high-dimensional spectral data and reduces the risk of overfitting.

4. Conclusion

This study developed a machine learning-based approach to predict coordination numbers from EXAFS spectra. By employing neural networks and ensemble learning on a large-scale dataset of fourth-period transition metals, we achieved prediction accuracies of approximately 75-80%.

The research demonstrates that the intensity information within the range of 0 to 0.3 nm (0 to 3 Å) in $R$-space is crucial for predicting coordination numbers, providing a bridge between data-driven models and physical EXAFS theory. The model's accuracy depends on the element, likely because of the diversity of coordination environments in the training data. Even so, the approach provides a scalable tool for high-throughput, automated structural analysis. Future work will focus on optimizing feature engineering and exploring hybrid methods that combine physical models with deep learning to further enhance reliability across all elemental species.
