Machine Learning Approach for Predicting Coordination Numbers from EXAFS Spectra
Haitao Zeng¹, Longfei Hu¹, Tao Yao¹
¹National Synchrotron Radiation Laboratory, University of Science and Technology of China, Hefei 230029, China
Abstract
[Background] X-ray Absorption Fine Structure (XAFS) is a vital technique for structural analysis, widely employed to investigate the oxidation state, coordination environment, and neighboring-atom properties of amorphous materials and disordered systems. However, the complexity of XAFS spectra means interpretation usually relies on experienced researchers and remains prone to inaccuracies. [Purpose] This study aims to use machine learning approaches to analyze XAFS data and predict the coordination number of absorbing atoms. [Methods] First, a dataset of 13,374 valid EXAFS spectra of fourth-period transition metal elements was sourced from the Materials Project database. Second, these data were used to train three machine learning models: neural networks, Bagging models, and random forest models. Finally, the trained models were applied to predict the coordination numbers of the absorbing atoms from the spectra. [Results] The models achieved an average prediction accuracy of approximately 70%. Feature importance analysis revealed that data points within R < 3.0 Å were critical for predictions, consistent with the prominence of short-range atomic interactions in EXAFS theory. [Conclusions] This research enhances the efficiency and reliability of XAFS data analysis by improving model generalizability and interpretability.
Keywords: EXAFS, Coordination Number, Bagging, Random Forest
Introduction
X-ray Absorption Fine Structure (XAFS) can be divided into two main regions: X-ray Absorption Near Edge Structure (XANES) and Extended X-ray Absorption Fine Structure (EXAFS). XAFS is highly sensitive to the oxidation state and coordination environment of absorbing atoms, as well as the type and distance of neighboring atoms. Consequently, it finds widespread application across physics, chemistry, materials science, and biology. However, due to the complexity of XAFS spectra, their interpretation typically relies on experienced researchers, who may still reach inaccurate conclusions [1]. Therefore, developing more efficient and reliable analytical methods to improve the accuracy of XAFS data interpretation has become an important research direction.
Machine learning, a representative data-driven research method, has become a major focus in scientific research in recent years. Among these methods, neural networks have attracted considerable attention for their high prediction accuracy, and they have been widely applied to EXAFS spectral analysis and XANES feature extraction. For example, Tetef et al. (2021) demonstrated that unsupervised methods such as t-distributed stochastic neighbor embedding (t-SNE) and variational autoencoders (VAE) can effectively classify organosulfur compounds based on XANES and valence-to-core X-ray emission spectroscopy (VtC-XES), revealing detailed chemical properties such as oxidation state and aromaticity [2]. Timoshenko et al. (2018) developed a neural-network-based approach to extract radial distribution functions (RDF) directly from EXAFS spectra without relying on prior structural assumptions; the method was successfully applied to analyze the high-temperature bcc-fcc phase transition in iron (Fe) and extended to cobalt (Co) and nickel (Ni) systems [3].
Nevertheless, existing studies are typically limited to simple material systems, analyzing a specific material or material category under varying parameters (such as temperature), which presents challenges for model generalization. Additionally, due to the complexity of neural networks and their "black box" nature, researchers often struggle to interpret the specific basis for model-generated parameters and cannot intuitively understand their decision-making processes. This lack of interpretability limits the application of neural networks in research scenarios requiring explicit feature importance analysis [4]. To improve the interpretability of machine learning models, researchers have developed various techniques, including Ridge Regression (RR) [5], LASSO regression [6], the SISSO method [7], and Decision Trees (DT) [8]. Among these, decision trees have gained widespread attention due to their simple structure and high interpretability. Furthermore, building on Ensemble Learning (EL) [9], researchers have developed additional methods such as Bagging [10], Random Forest (RF) [11], and Boosting [12]. Decision tree models have already demonstrated their potential in XAFS spectroscopy research. For instance, Torrisi et al. (2020) used random forest machine learning models to analyze XANES spectra, predicting coordination numbers, nearest-neighbor distances, and Bader charges of absorbing atoms in transition metal oxides [13]. The study improved model interpretability through multi-scale polynomial featurization, revealing key spectral regions related to different properties.
Addressing the two core constraints of machine learning in XAFS analysis—model generalization difficulties and lack of interpretability—this study constructed a large-scale EXAFS dataset using the Materials Project database, covering multiple fourth-period transition metal elements across different material categories and diverse coordination environments. Neural network and random forest models developed based on this dataset demonstrated generalization capability in predicting coordination numbers with high accuracy. Moreover, interpretability analysis of the random forest model successfully identified the key spectral feature region determining coordination number: the low-R region after Fourier transformation. This study provides a scalable, highly generalizable XAFS data analysis tool, laying a solid foundation for high-throughput, automated XAFS structural analysis across broad material systems, with significant methodological value and practical prospects.
The workflow of this study is shown in Figure 1 [FIGURE:1]. We collected a total of 15,498 XAFS data entries for fourth-period transition metal elements from the Materials Project. After screening, 13,374 entries with integer coordination numbers were retained. Subsequently, we transformed these XAFS data from E-space to R-space and imported them into fully-connected neural networks and decision tree-based ensemble learning models (Bagging and Random Forest). Separate models were trained and evaluated for each element.
1.1 X-ray Absorption Fine Structure (XAFS) and Its Extended Region (EXAFS)
The XAFS phenomenon was first discovered and scientifically documented in experiments by Fricke and Hertz in 1920. A breakthrough in XAFS research came from the innovative method proposed by Sayers, Stern, and Lytle: applying Fourier transform techniques to analyze the Extended X-ray Absorption Fine Structure (EXAFS) oscillation signal in wave-vector (k) space [14]. This milestone work marked the entry of XAFS technology into the field of quantitative analysis and initiated its systematic development for materials characterization [15].
XAFS technology provides detailed structural information about the local chemical environment around absorbing atoms. In recent years, atomic-level local property analysis based on XAFS has become an important approach in materials science and catalysis research. Specifically, XAFS techniques have been applied to investigate the coordination environments of metal clusters, local structural distortions in ferroelectric materials, and catalytic mechanisms in single-atom catalysts.
XAFS spectra are obtained by measuring the variation of a material's X-ray absorption coefficient with incident photon energy. XAFS spectra are typically divided into two characteristic regions: X-ray Absorption Near Edge Structure (XANES) and Extended X-ray Absorption Fine Structure (EXAFS). The theoretical expression for EXAFS can be written as:
$$\chi(k) = S_0^2 \sum_j \frac{N_j f_j(k)}{k R_j^2} e^{-2k^2\sigma_j^2} e^{-2R_j/\lambda(k)} \sin\bigl(2kR_j + \delta_j(k)\bigr)$$
where k represents the photoelectron wave vector, N_j represents the coordination number of the j-th coordination shell, R_j represents the average distance from the absorbing atom to neighboring atoms in the j-th shell, σ_j characterizes the disorder of the j-th shell, f_j(k) represents the scattering ability of neighboring atoms for photoelectrons, δ_j(k) represents the phase shift, λ(k) represents the mean free path of photoelectron propagation, and S₀² represents the amplitude reduction factor.
This theoretical expression shows that a Fourier transform can convert the EXAFS oscillation function from k-space to R-space, yielding radial distribution information about the local structure around absorbing atoms. However, because the EXAFS equation contains multiple coupled parameters and multiple scattering paths may superpose, precise structural analysis relying solely on EXAFS spectra is a severely ill-posed problem. Therefore, in practical EXAFS data analysis, researchers typically adopt a "fingerprint comparison" approach, obtaining plausible models of the material's local structure by comparing experimental data with reference spectra from known structures or theoretical simulations.
Although directly inverting all structural parameters from XAFS spectra presents inherent difficulties, calculating theoretical XAFS spectra based on known structural models has become a reliable research method. In XAFS research, FEFF [16] and FDMNES [17] are two widely adopted computational packages. Both software tools employ first-principles (ab initio) calculation methods to accurately simulate XAFS spectral features by solving the Schrödinger equation. FEFF uses Real-Space Multiple Scattering (RSMS) theory, while FDMNES is based on the Finite Difference Method (FDM), providing reliable theoretical references for experimental data analysis.
1.2 Larch [18]
Larch is a comprehensive toolkit specifically designed for XAFS and related spectroscopic data analysis. Compared to the IFEFFIT package developed in FORTRAN, Larch's reconstruction in Python significantly enhances large-scale data processing capabilities and data visualization functions. In this study, we utilized core functions from the Larch toolkit for data processing: the autobk function performs background subtraction and normalization to convert EXAFS spectra from E-space to k-space, while the xftf function executes Fourier Transform (FT) to convert EXAFS spectra from k-space to R-space.
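As an illustration of this pipeline, the minimal sketch below converts one calculated spectrum from E-space to R-space with Larch; the file name and the rbkg, window, and k-range values are illustrative assumptions, not the exact settings used in this work.

```python
import numpy as np
from larch import Group
from larch.xafs import autobk, xftf

# Load a calculated mu(E) spectrum (two columns: energy in eV, mu).
energy, mu = np.loadtxt("feff_spectrum.dat", unpack=True)
g = Group(energy=energy, mu=mu)

# Background subtraction and normalization: E-space -> k-space chi(k).
autobk(g.energy, g.mu, group=g, rbkg=1.0)

# Fourier transform of the k^2-weighted chi(k): k-space -> R-space.
xftf(g.k, g.chi, group=g, kweight=2, kmin=3, kmax=14, dk=1, window="hanning")

# g.r and g.chir_mag now hold the R grid and the |chi(R)| magnitude.
```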
1.3 Coordination Number
Coordination number is a key structural parameter describing the local coordination environment of atoms in materials and is crucial for studying material properties. In traditional research, coordination numbers were typically determined through intuitive judgment of individual atom coordination. However, with the development of high-throughput materials computing technology and the generation of large-scale datasets, developing reliable and automated coordination number calculation algorithms has become an urgent research need.
Currently, various coordination number calculation methods have been proposed in materials science [19]. Simpler algorithms are based on interatomic distances and empirical tolerance parameters, such as BrunnerNN [20] and MinimumOKeeffeNN [21]. These methods determine coordination numbers by comparing actual interatomic distances in models with tolerance values. Although straightforward and intuitive, these algorithms are sensitive to small atomic perturbations and changes in empirical tolerance parameters, where minor atomic variations may lead to different coordination number assignments.
The VoronoiNN algorithm proposed by O'Keeffe [22] offers an alternative solution. Based on geometric principles, this algorithm uses Voronoi decomposition [23] to treat the environment around a central atom as a polyhedron and performs weighted processing according to spatial angles to determine the number of neighboring atoms. Similarly, this study employs the CrystalNN [24] algorithm. The CrystalNN method is also based on Voronoi decomposition and is widely recognized for its high computational accuracy. This method calculates probabilities of possible coordination environments through Voronoi decomposition and selects the result with the highest probability as the atom's coordination environment. In this paper, we systematically calculated coordination numbers of absorbing atoms in various material systems using the CrystalNN algorithm from the pymatgen materials computing package [25].
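A minimal sketch of this coordination-number calculation with pymatgen is shown below; the material ID is a hypothetical example, and a Materials Project API key is required.

```python
from pymatgen.ext.matproj import MPRester  # legacy Materials Project client
from pymatgen.analysis.local_env import CrystalNN

# Fetch a structure; "mp-19009" is a hypothetical example ID.
with MPRester("YOUR_API_KEY") as mpr:
    structure = mpr.get_structure_by_material_id("mp-19009")

# CrystalNN assigns the most probable coordination environment to each site.
cnn = CrystalNN()
for i, site in enumerate(structure):
    print(site.specie, "CN =", cnn.get_cn(structure, i))
```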
1.4 Neural Networks
As an important branch of machine learning, neural network (deep learning) technology has become one of the core areas in current scientific research and technological applications due to its exceptional performance in regression and classification tasks. Neural network models achieve feature extraction and pattern recognition through numerous adjustable parameters that are automatically optimized during training via backpropagation algorithms. Researchers primarily optimize model performance by tuning hyperparameters (including batch size, learning rate, and network layer structure).
A single hidden layer feedforward neural network model can be formally defined by the following mathematical expression:
$$f(X) = \beta_0 + \sum_k \beta_k g\left(\alpha_{k0} + \sum_j \alpha_{kj} X_j\right)$$
where $X = (X_1, X_2, \ldots, X_p) \in \mathbb{R}^p$ represents the p-dimensional input feature vector. Model parameters include: $\alpha_{kj}$ representing connection weights from the j-th feature in the input layer to the k-th neuron in the hidden layer, $\beta_k$ representing weight coefficients from the k-th hidden layer neuron to the output layer, and $\alpha_{k0}$ and $\beta_0$ representing bias terms for the corresponding linear transformations.
Although this architecture appears as a linear superposition in form, nonlinear activation functions $g(\cdot)$ must be introduced to effectively approximate nonlinear functional relationships in real-world scenarios. In machine learning, the Sigmoid function and ReLU (Rectified Linear Unit) function [26] are two classic activation functions. The Sigmoid function has S-shaped saturation characteristics with an output range of (0,1), which made it widely used in early neural network research. However, this function suffers from vanishing gradients when input values approach positive or negative infinity (derivatives approach zero), significantly reducing parameter update efficiency during backpropagation. In contrast, the ReLU function is defined as $g(x) = \max(0,x)$, and its piecewise linear characteristics ensure a constant derivative value in the positive interval, effectively alleviating the vanishing gradient problem. Moreover, ReLU's computational complexity is significantly lower than that of the Sigmoid function, enabling superior computational efficiency and convergence properties in deep neural network training.
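The contrast between the two activation functions can be seen in a few lines of NumPy; the sample inputs below are arbitrary.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])

# The sigmoid gradient s(x)(1 - s(x)) vanishes for large |x| (saturation)...
print(sigmoid(x) * (1.0 - sigmoid(x)))  # ~4.5e-5 at x = +/-10
# ...while the ReLU gradient is a constant 1 throughout the positive range.
print(np.where(x > 0, 1.0, 0.0))
```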
1.5 Decision Trees
Despite neural networks' significant advantages in prediction performance, their inherent "black box" characteristics result in insufficient model interpretability. Specifically, the mechanism for setting numerous neuron parameters in neural networks is difficult to interpret clearly, prompting researchers to seek alternative analysis tools with stronger interpretability. In this study, we employ decision trees as an analysis tool due to their intuitive structure and ease of interpretation. Through decision trees, researchers can clearly identify relationships between data features and outcomes. However, traditional decision tree methods exhibit obvious deficiencies in prediction accuracy. To overcome this limitation, researchers have developed various ensemble learning methods, including Bagging, Random Forest, and Boosting. While these ensemble methods significantly improve prediction performance, they sacrifice the intuitive interpretability of single decision trees. Nevertheless, by analyzing the Gini Index, we can still effectively evaluate the importance of each feature in classification decisions.
Decision tree modeling follows a recursive feature-space partitioning strategy. Using preset structural hyperparameters (such as maximum tree depth) and supervised learning, the algorithm extracts optimal splitting criteria from training data, decomposing the p-dimensional feature space (assuming p-dimensional input feature vectors) into several mutually exclusive subregions. Theoretical studies show that such models classify well on datasets whose classes can be separated by explicit, axis-aligned partitions, but their effectiveness decreases significantly for nonlinearly separable data or problems with complex decision boundaries. Nevertheless, empirical studies demonstrate that, on tabular data, ensemble learning algorithms derived from decision tree architectures often achieve prediction accuracy superior to neural network models; the XGBoost algorithm, for example, achieves significant performance gains across multiple benchmarks through iterative decision tree integration and regularization strategies [27].
1.6 Materials Database
Robust predictive model development in materials science machine learning research highly depends on support from large-scale, high-quality datasets [28]. Current mainstream materials informatics platforms include Open Catalyst [29], Materials Project [30], and the Open Quantum Materials Database (OQMD) [31]. This study selected Materials Project as the data source based on the following considerations: the database integrates structural information for over 160,000 materials and provides XAFS spectral data based on FEFF theoretical calculations, which is valuable for establishing structure-property relationships. Additionally, its open Python Application Programming Interface (API) significantly improves data acquisition efficiency and enables seamless integration with high-throughput computational workflows. Data acquisition was performed through the pymatgen materials analysis toolkit, and all processed structured data were stored in a MongoDB document database for efficient retrieval.
According to literature referenced in the official Materials Project documentation, convergence checks and optimization tests were performed on input fields in FEFF9: in convergence checks, researchers varied the rfms1 value from 2 Å to 8 Å in 1 Å increments to alter FEFF's self-consistent potential calculations, controlled by the Self-Consistent Field (SCF) card. Simultaneously, the rfms value was varied from 3 Å to 11 Å in 1 Å increments to determine parameters for the Full Multiple Scattering (FMS) card.
2 Dataset and Preprocessing
The data for this study were sourced from the Materials Project database. We systematically retrieved materials containing fourth-period transition metals and extracted EXAFS calculation data. All EXAFS data were generated through the FEFF computational package. To ensure data quality, we excluded samples where different absorbing atoms within the same material had different coordination numbers, as such data could interfere with coordination number classification.
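The sketch below illustrates this retrieval step with the legacy pymatgen MPRester; the query fields and the get_xas_data endpoint reflect our reading of the legacy API and should be treated as assumptions, not the exact retrieval script used here.

```python
from pymatgen.ext.matproj import MPRester

TM_ELEMENTS = ["Sc", "Ti", "V", "Cr", "Mn", "Fe", "Co", "Ni", "Cu", "Zn"]

records = []
with MPRester("YOUR_API_KEY") as mpr:
    for elem in TM_ELEMENTS:
        # Materials containing the absorbing element of interest.
        for doc in mpr.query({"elements": elem}, properties=["material_id"]):
            try:
                # FEFF-calculated XAS for this absorber, if available.
                records.append(mpr.get_xas_data(doc["material_id"], elem))
            except Exception:
                continue  # no calculated spectrum for this material
```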
In this study, we used theoretically calculated EXAFS data from the Materials Project database rather than noise-added calculated data. According to studies by Ghigna et al. [32] and Marcella et al. [33], the impact of adding noise to calculated data is smaller than that of systematic factors (such as background subtraction and other data-processing steps). Therefore, the calculated EXAFS data could be used directly for machine learning.
The database contains various complex material systems, such as oxides, sulfides, and alloys. While this enhances model generalization capability, it poses challenges for classification. Additionally, overly complex classification dilutes prediction accuracy for individual categories. Therefore, this study focuses on classifying only the absorbing atoms.
For periodic EXAFS data, we employed Fourier transform for preprocessing. For large-scale data processing, we implemented automated processing using the Larch toolkit combined with Python programming. Specifically, the autobk function in Larch was used to convert raw E-space data to k-space data, followed by the xftf function to Fourier transform k²-weighted k-space data into R-space data.
For R-space data processing, we extracted intensity values at 0.030 Å intervals as features. Additionally, we used Python's Peak package to calculate R-coordinates and intensity values corresponding to each characteristic peak, incorporating these parameters as additional features into the dataset. For comparison, we also trained neural network models using k-space data and wavelet-transformed data. Wavelet transformation employed the cauchy_wavelet function in the Larch toolkit with parameters set to kweight = 0, rmax_out = 6, transforming k-space data from k_min = 3 to k_max = 14.
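Continuing the earlier Larch sketch, feature extraction might look like the following; scipy's find_peaks stands in for the peak-detection routine, and the height threshold is an arbitrary assumption.

```python
import numpy as np
from scipy.signal import find_peaks

# The xftf output grid is spaced ~0.03 Å apart, so the first 326 points of
# |chi(R)| span roughly 0-10 Å and serve directly as the feature vector.
features = g.chir_mag[:326]
r_grid = g.r[:326]

# Peak positions and heights as optional additional features.
idx, props = find_peaks(features, height=0.01)
peak_positions = r_grid[idx]
peak_heights = props["peak_heights"]
```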
3 Model Construction and Training
To accomplish coordination number prediction tasks, this study constructed a prediction model based on fully-connected neural networks. As shown in Figure 2 [FIGURE:2], the neural network's input layer contains 326 features corresponding to intensity values of EXAFS R-space spectra at 0.030–0.031 Å intervals. The network employs a dual hidden-layer structure with 512 neurons in each layer. Each hidden layer is followed by ReLU activation function layers and Dropout regularization layers: ReLU activation functions enhance the model's nonlinear expression capability, while Dropout regularization mitigates overfitting. The output layer parameter k varies according to the coordination number range for different elements. During model optimization, the model with optimal validation set performance was selected as the final model through five-fold cross-validation, effectively improving model generalization.
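A sketch of this architecture in PyTorch is given below (the paper does not name a framework, so the framework choice and the dropout rate are assumptions).

```python
import torch.nn as nn

def make_model(k: int, p_drop: float = 0.5) -> nn.Sequential:
    """Fully connected classifier: 326 R-space intensities -> k CN classes."""
    return nn.Sequential(
        nn.Linear(326, 512), nn.ReLU(), nn.Dropout(p_drop),
        nn.Linear(512, 512), nn.ReLU(), nn.Dropout(p_drop),
        nn.Linear(512, k),  # k = size of the element's coordination-number range
    )
```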
To enhance model interpretability and validate neural network predictions, this study simultaneously constructed ensemble learning models based on decision trees. We first implemented two ensemble models using the Scikit-learn [34] machine learning framework: decision tree ensemble based on Bagging and Random Forest. Bagging generates sub-training sets equal to the data volume through Bootstrap Sampling, trains independent decision tree models on each subset, and integrates predictions through voting. Random Forest introduces feature randomness on top of Bagging—during each node split, only sqrt(d) features (where d is the total feature count) are randomly selected for optimal partitioning, thereby enhancing model diversity. Both models use the Gini Index as the splitting criterion. As shown in Figure 3 [FIGURE:3], each box in the diagram represents a split node. "Freq_r" in the box indicates the intensity corresponding to coordinate r in R-space of the EXAFS spectrum, while "samples" indicates the data volume contained in that node before splitting.
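The two ensembles can be set up in scikit-learn as sketched below; the tree count and the synthetic placeholder data are assumptions for illustration only.

```python
import numpy as np
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

# Placeholders for the real R-space features and coordination-number labels.
X = np.random.rand(500, 326)
y = np.random.randint(4, 7, size=500)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2)

# Bagging: bootstrap samples; every feature is considered at each split.
bagging = BaggingClassifier(DecisionTreeClassifier(criterion="gini"),
                            n_estimators=100).fit(X_tr, y_tr)

# Random Forest: additionally restricts each split to sqrt(d) random features.
forest = RandomForestClassifier(n_estimators=100, criterion="gini",
                                max_features="sqrt").fit(X_tr, y_tr)

print(bagging.score(X_te, y_te), forest.score(X_te, y_te))
```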
This study systematically analyzed EXAFS spectra of fourth-period transition metal materials using neural networks and decision tree algorithms. As shown in Figure 4 [FIGURE:4], for each metal, the models from left to right are: neural network model, Bagging model, Random Forest model, Bagging model with peak height and position features, and Random Forest model with peak height and position features. Results show that both methods achieve prediction accuracies around 70%, with neural networks exhibiting slightly better performance than decision tree models. Notably, the accuracies of both methods show significant correlation, displaying synchronous increasing or decreasing trends with element type. In ensemble learning applications, Random Forest models achieve approximately 1–3 percentage points higher prediction accuracy than Bagging models. Additionally, this study examined feature datasets incorporating peak position and height information, finding no significant difference in decision tree model prediction accuracy compared to the original dataset.
Results indicate that model prediction accuracy exhibits significant element dependence in coordination number classification. Vanadium (V) shows the best prediction performance in the neural network framework, reaching 81.74% accuracy, while cobalt (Co) shows relatively lower accuracy at only 67.39%. Further analysis reveals that this difference may relate to coordination number distribution characteristics of elements in the database. Specifically, V maintains a relatively stable six-coordinate structure in most materials, whereas Co exhibits more diverse coordination number variations, likely contributing to its lower prediction accuracy.
As shown in Figure 5 [FIGURE:5], we plotted each element's prediction rate against the information entropy of coordination numbers. The scatter plot uses least squares linear regression to reveal the relationship between these variables, with R-squared values indicating fit strength. The results show that prediction accuracy exhibits certain correlation with coordination number distribution characteristics.
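Here the information entropy is the Shannon entropy of each element's coordination-number distribution (the logarithm base, taken as 2 below, is our assumption):

$$H = -\sum_{c} p_c \log_2 p_c$$

where $p_c$ is the fraction of that element's spectra whose absorbing atoms have coordination number c. Elements dominated by a single coordination number (low entropy, such as V) are easier to predict than those with diverse coordination environments (high entropy, such as Co).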
Additionally, this experiment compared prediction accuracies based on k-space data and wavelet-transformed data. As shown in Figure 6 [FIGURE:6], prediction accuracies vary across the three data representations (R-space, k-space, and wavelet), but the overall trend is similar: prediction rates rise and fall together as the element changes. R-space prediction accuracy is superior to the other two representations in most cases, leading us to recommend R-space data for neural network training.
To demonstrate model bias characteristics, we used Zn as an example to create a confusion matrix comparing true and predicted coordination numbers.
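A confusion matrix of this kind can be produced directly from the held-out predictions; the arrays below are placeholders, not the actual Zn results.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Placeholder labels and predictions standing in for the Zn test set.
y_true = np.array([4, 4, 6, 6, 6, 5, 4, 6])
y_pred = np.array([4, 6, 6, 6, 4, 5, 4, 6])
labels = sorted(set(y_true))

# Rows: true coordination numbers; columns: predicted coordination numbers.
print(confusion_matrix(y_true, y_pred, labels=labels))
```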
This study further constructed feature importance distribution maps for EXAFS spectra of each element based on the Gini Index from decision tree ensemble models. Figure 7 [FIGURE:7] shows the feature importance and its correspondence to R-space for Co and Mn EXAFS data in Bagging and Random Forest models, where (a) represents Bagging Gini Index for Co, (b) represents Random Forest Gini Index for Co, (c) represents Bagging Gini Index for Mn, and (d) represents Random Forest Gini Index for Mn. Since the R-space dataset contains intensity values at 326 equally spaced coordinate points, the maximum single feature importance value is generally below 0.1. Based on experience in EXAFS spectral processing, we only discuss the region where R < 6.0 Å. Comparative analysis of relative importance distributions reveals that feature importance weights are primarily concentrated in the short-range interaction region with R < 3.0 Å, consistent with the empirical understanding from EXAFS theory that short-range atomic interactions dominate spectral features.
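Continuing the scikit-learn sketch above, the Gini-based importances can be mapped back onto R coordinates as follows (the ~0.030 Å grid spacing is an assumption consistent with the feature construction):

```python
import numpy as np

# feature_importances_ has one Gini-based weight per R-space feature; the
# weights sum to 1, so with 326 features single values rarely exceed 0.1.
importances = forest.feature_importances_
r_grid = np.arange(326) * 0.030

# Report the ten most informative R positions.
for i in np.argsort(importances)[::-1][:10]:
    print(f"R = {r_grid[i]:.2f} Å  importance = {importances[i]:.4f}")
```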
For distant coordination shells with R > 6.0 Å, current models struggle to reliably establish direct correspondence between high-shell coordination numbers and R-space parameters. Because high-shell signal intensities are low, prediction errors increase significantly, failing to meet accuracy requirements. In practice, obtaining coordination information about high coordination shells is also difficult.
Furthermore, this study observed that Bagging markedly amplifies importance differences between features in the feature importance distribution maps, whereas Random Forest yields a more balanced distribution. This phenomenon may relate to Bagging's tendency to overfit: because every Bagging tree considers all features at each split, the most informative features are selected repeatedly, and their importance is amplified across the ensemble. In contrast, the Random Forest model used in this study randomly screens only 14 candidate features at each split, preventing excessive enhancement of any single feature's importance and effectively improving model generalization. This finding indicates that Random Forest models can better balance feature importance and reduce overfitting risk when processing high-dimensional data, providing a more robust solution for EXAFS spectral analysis.
Conclusion
This study successfully developed and validated a machine learning method for predicting coordination numbers of fourth-period transition metal elements from EXAFS spectra. By collecting large-scale EXAFS data from the Materials Project database and combining neural networks with decision tree algorithms, our models achieved approximately 70% average prediction accuracy in coordination number classification. Notably, neural network models performed exceptionally well for certain elements (such as V), reaching up to 81.74% accuracy, demonstrating their strong potential for processing complex spectral data.
Results show that Random Forest models not only match neural networks in prediction performance but also reveal key information in EXAFS spectra through feature importance analysis. Specifically, intensity information from data points in R-space with R < 3.0 Å after Fourier transformation is crucial for coordination number prediction. This finding aligns with the EXAFS theory perspective that short-range atomic interactions dominate spectral features, providing theoretical support and interpretability for model decision-making processes.
However, the study also found significant element dependence in model prediction accuracy. For example, Co showed lower accuracy (67.39%), possibly related to diverse coordination number distributions in the database. Additionally, feature importance analysis revealed that peak position and intensity parameters for some elements (such as Co) were less important than expected, suggesting potentially underutilized spectral features. This indicates room for improvement in current feature extraction methods.
Future research directions should include optimizing feature engineering, exploring more sophisticated spectral feature extraction techniques, and trying other advanced machine learning algorithms to improve prediction accuracy and generalization across different elements. Meanwhile, hybrid approaches combining physical models with machine learning also warrant further investigation to enhance the efficiency and reliability of EXAFS spectral analysis. The results of this study provide new tools and ideas for rapid analysis of local material structures, with significant scientific importance and application prospects.
References
[1] (Citation referenced in text but not provided in reference list)
[2] Tetef S, Govind N, Seidler G T. Unsupervised machine learning for unbiased chemical classification in X-ray absorption spectroscopy and X-ray emission spectroscopy[J]. Physical Chemistry Chemical Physics, 2021, 23(41): 23586–23601. DOI: 10.1039/D1CP02903G.
[3] Timoshenko J, Anspoks A, Cintins A, et al. Neural Network Approach for Characterizing Structural Transformations by X-Ray Absorption Fine Structure Spectroscopy[J]. Physical Review Letters, 2018, 120(22): 225502. DOI: 10.1103/PhysRevLett.120.225502.
[4] James G, Witten D, Hastie T, et al. An Introduction to Statistical Learning: with Applications in Python[M]. Springer New York, 2023.
[5] Hoerl A E, Kennard R W. Ridge regression: biased estimation for nonorthogonal problems[J]. Technometrics, 2000, 42(1): 80-86. DOI: 10.2307/1271436.
[6] Tibshirani R. Regression shrinkage and selection via the lasso: a retrospective[J]. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 2011, 73(3): 273-282. DOI: 10.1111/j.1467-9868.2011.00771.x.
[7] Ouyang R, Curtarolo S, Ahmetcik E, et al. SISSO: a compressed-sensing method for identifying the best low-dimensional descriptor in an immensity of offered candidates[J]. Phys. Rev. Mater, 2018, 2: 083802. DOI: 10.1103/PhysRevMaterials.2.083802.
[8] Quinlan J R. Induction of Decision Trees[J]. Machine Learning, 1986, 1(1): 81-106. DOI: 10.1007/BF00116251.
[9] Opitz D, Maclin R. Popular Ensemble Methods: An Empirical Study[J]. Journal of Artificial Intelligence Research, 1999, 11: 169-198. DOI: 10.1613/jair.614.
[10] Breiman L. Bagging predictors[J]. Machine Learning, 1996, 24(2): 123–140. DOI: 10.1007/BF00058655.
[11] Ho T K. The random subspace method for constructing decision forests[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 1998, 20(8): 832-844. DOI: 10.1109/34.709601.
[12] Freund Y, Schapire R E. A decision-theoretic generalization of on-line learning and an application to boosting[J]. Journal of Computer and System Sciences, 1997, 55(1): 119-139. DOI: 10.1006/jcss.1997.1504.
[13] Torrisi S B, Carbone M R, Rohr B A, et al. Random forest machine learning models for interpretable X-ray absorption near-edge structure spectrum-property relationships[J]. npj Comput Mater, 2020, 6(1): 109. DOI: 10.1038/s41524-020-00376-6.
[14] Sayers D E, Stern E A, Lytle F W. New Technique for Investigating Noncrystalline Structures: Fourier Analysis of the Extended X-Ray—Absorption Fine Structure[J]. Physical Review Letters, 1971, 27(18): 1204-1207. DOI: 10.1103/PhysRevLett.27.1204.
[15] Sun Z, Liu Q, Yao T, et al. X-ray absorption fine structure spectroscopy in nanomaterials[J]. Science China Materials, 2015, 58(4): 313-341. DOI: 10.1007/s40843-015-0043-4.
[16] Rehr J J, Kas J J, Vila F D, et al. Parameter-free calculations of X-ray spectra with FEFF9[J]. Physical Chemistry Chemical Physics, 2010, 12(21): 5503-5513. DOI: 10.1039/b926434e.
[17] Joly Y. X-ray absorption near-edge structure calculations beyond the muffin-tin approximation[J]. Physical Review B, 2001, 63: 125120. DOI: 10.1103/PhysRevB.63.125120.
[18] Newville M. Larch: An Analysis Package for XAFS and Related Spectroscopies[J]. Journal of Physics Conference Series, 2013, 430: 012007. DOI: 10.1088/1742-6596/430/1/012007.
[19] Pan H, Ganose A M, Horton M, et al. Benchmarking Coordination Number Prediction Algorithms on Inorganic Crystal Structures[J]. Inorganic Chemistry, 2021, 60(3): 1590–1603. DOI: 10.1021/acs.inorgchem.0c02996.
[20] Brunner G O. A definition of coordination and its relevance in the structure types AlB2 and NiAs[J]. Acta Crystallographica Section A, 1977, 33(1): 226-227. DOI: 10.1107/S0567739477000461.
[21] O'Keeffe M, Brese N E. Atom sizes and bond lengths in molecules and crystals[J]. Journal of the American Chemical Society, 1991, 113(9): 3226-3229. DOI: 10.1021/ja00009a002.
[22] O'Keeffe M. A proposed rigorous definition of coordination number[J]. Acta Crystallographica Section A, 1979, 35: 772-775. DOI: 10.1107/S0567739479001765.
[23] Voronoi G. Nouvelles applications des paramètres continus à la théorie des formes quadratiques. Premier mémoire. Sur quelques propriétés des formes quadratiques positives parfaites[J]. Journal für die reine und angewandte Mathematik, 1908, 133: 97–178. DOI: 10.1515/crll.1908.133.97.
[24] Zimmermann N E R, Jain A. Local structure order parameters and site fingerprints for quantification of coordination environment and crystal structure similarity[J]. RSC Advances, 2020, 10: 6063–6081. DOI: 10.1039/C9RA07755C.
[25] Ong S P, Richards W D, Jain A, et al. Python Materials Genomics (pymatgen): A robust, open-source python library for materials analysis[J]. Computational Materials Science, 2013, 68: 314-319. DOI: 10.1016/j.commatsci.2012.10.028.
[26] Agarap A F M. Deep Learning using Rectified Linear Units (ReLU)[J]. arXiv, 2018. DOI:10.48550/arXiv.1803.08375.
[27] Chen T, Guestrin C. XGBoost: A Scalable Tree Boosting System[C]. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2016, 785–794. DOI: 10.1145/2939672.2939785.
[28] Lee K L K, Gonzales C, Nassar M, et al. MatSciML: A Broad, Multi-Task Benchmark for Solid-State Materials Modeling[J]. arXiv, 2023. DOI: 10.48550/arXiv.2309.05934.
[29] Chanussot L, Das A, Goyal S, et al. Open Catalyst 2020 (OC20) Dataset and Community Challenges[J]. ACS Catalysis, 2021, 11(10): 6059-6072. DOI: 10.1021/acscatal.0c04525.
[30] Jain A, Ong S P, Hautier G, et al. Commentary: The Materials Project: A materials genome approach to accelerating materials innovation[J]. APL Materials, 2013, 1(1): 011002. DOI: 10.1063/1.4812323.
[31] Kirklin S, Saal J E, Meredig B, et al. The Open Quantum Materials Database (OQMD): assessing the accuracy of DFT formation energies[J]. npj Computational Materials, 2015, 1: 15010. DOI: 10.1038/npjcompumats.2015.10.
[32] Ghigna P, Di Muri M, Spinolo G. Computer simulation approach to reliability and accuracy in EXAFS structural determinations[J]. Journal of Applied Crystallography, 2001, 34(3): 325-329. DOI: 10.1107/S0021889801004745.
[33] Marcella N, Shimogawa R, Xiang Y, et al. First shell EXAFS data analysis of nanocatalysts using neural networks[J]. Journal of Catalysis, 2025: 116145. DOI: 10.1016/j.jcat.2025.116145.
[34] Pedregosa F, Varoquaux G, Gramfort A, et al. Scikit-learn: Machine Learning in Python[J]. Journal of Machine Learning Research, 2011, 12: 2825-2830.