A Novel Approach for Signal Number Estimation in Low-Statistics Measurements
Tian, Mr. Ye, Lu, Dr. Senquan, Tang, Dr. Zhi-Cheng, Chou, Dr. Hsin-Yi, Zhang, Dr. Feng-Ze, Chang, Prof. Yuan-Hann, Li, Prof. Zu-Hao
Submitted 2025-11-11 | ChinaXiv: chinaxiv-202511.00129 | Original in English

Abstract

We present CombineFit, a novel approach for estimating the signal number in low-statistics measurements. Traditional binned maximum likelihood and template fitting techniques often suffer from significant bias and increased uncertainties when the statistics of templates or target data are limited. CombineFit employs analytical functions to simultaneously fit the signal and background templates and target data by minimizing a joint likelihood function. This method is validated with toy Monte Carlo simulation by varying the number of signal/background templates and data samples and has been successfully applied in the data analysis of the Alpha Magnetic Spectrometer. With 10 events in the background template, the binned CombineFit achieved a minimal bias of 2% and an uncertainty of 5.8%, compared to TFractionFitter's bias of 30% and uncertainty of 10%. The unbinned CombineFit further reduces the bias to 1% while maintaining the same uncertainty, whereas the RooFit with Kernel Density Estimation method yields a bias of 3.5% and an uncertainty of 11.5%. These results demonstrate that CombineFit provides a robust solution for signal number estimation under limited data statistics, offering broad applicability in the search for new physics.

Full Text

Preamble

Ye Tian, Sen-Quan Lu, Zhi-Cheng Tang, Hsin-Yi Chou, Feng-Ze Zhang, Yuan-Hann Chang, and Zu-Hao Li

Key Laboratory of Particle Astrophysics, Institute of High Energy Physics, Chinese Academy of Sciences, 100049 Beijing
University of Chinese Academy of Sciences, 100049 Beijing
Institute of Physics, Academia Sinica, 115201 Taipei

Keywords

Template Fitting; Statistical Analysis; RooFit; Low Statistics

INTRODUCTION

The decomposition of measured data into contributions from distinct physical processes lies at the core of statistical analysis in particle physics[ ]. Generally, the probability density functions (PDFs) of the observable (e.g., invariant mass) for the different contributions are obtained from Monte Carlo (MC) simulations. The template fitting technique is then applied to the experimental data by minimizing the χ² statistic or the negative log-likelihood (NLL) function to obtain the composition of each contribution.

In some cases, Monte Carlo simulations do not reproduce the data precisely, so the templates are obtained directly from data by applying tighter selections. For example, in the measurement of monthly cosmic antiproton fluxes with the Alpha Magnetic Spectrometer[ ], the background mainly consists of electrons and pions, and the templates are selected using the Ring Imaging Cherenkov detector, which reduces the samples significantly. The antiproton signal, due to the nature of its production mechanism, is rare and has a low signal-to-background ratio. The low statistics of the templates or target data introduce critical challenges: statistical fluctuations in the noisy templates distort the true underlying distributions, resulting in a significantly biased number of antiprotons and increased uncertainties with traditional template fitting methods.

This work was supported by the Ministry of Science and Technology, National Key R&D Program Grant No. 2022YFA1604802, and the National Science and Technology Council, NSTC 114-2123-M-001-007.

Under low-statistics conditions, the assumption of Gaussian errors is no longer valid because of the low-count bins, so the traditional weighted least squares or χ² minimization is inadequate. To correctly account for the low statistics in each bin, one can use the Poisson likelihood and minimize the binned NLL to obtain the signal fraction $\alpha$ [ ]:

$$ -\ln L = \sum_i \left( f_i - n_i \ln f_i \right), $$

where $n_i$ is the number of observed events in the $i$-th bin and $f_i$ is the sum of predictions from signal and background with the signal fraction $\alpha$.

However, this likelihood does not incorporate the fluctuations of the template distributions. Barlow and Beeston [5] found the exact likelihood by using the template expectations in each bin and source as nuisance parameters:

$$ -\ln L = \sum_i \left( f_i - n_i \ln f_i \right) + \sum_i \sum_j \left( A_{ji} - a_{ji} \ln A_{ji} \right), \quad \text{where } f_i = \sum_j p_j A_{ji}. $$

Here $a_{ji}$ is the number of template events from source $j$ observed in bin $i$, $A_{ji}$ is its expectation, and $p_j$ is the strength of source $j$. This method is implemented as TFractionFitter in ROOT[ ] and is widely used in High Energy Physics experiments searching for new physics [ ]. However, the large number of nuisance parameters, which scales as $N_{\text{bins}} \times N_{\text{sources}}$, poses a significant challenge in solving the non-linear equations[ ], requires long computation time, and often leads to biased results. Several works [ ] propose using only one nuisance parameter in each bin to approximate the exact likelihood; these are implemented in the iMinuit package[ ]. Shrinking the number of nuisance parameters reduces the computation time, but the resulting bias is not resolved.
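As an illustration of the simple binned Poisson likelihood (the first likelihood of this section, without template nuisance parameters), the following is a minimal sketch in Python. The five-bin template shapes, the data counts, and the function name `binned_poisson_nll` are hypothetical and chosen only for illustration; they are not taken from the analysis.

```python
import numpy as np
from scipy.optimize import minimize_scalar


def binned_poisson_nll(alpha, data_counts, sig_shape, bkg_shape):
    """Binned Poisson NLL (constant terms dropped) for a two-component fit.

    sig_shape and bkg_shape are per-bin probabilities that each sum to 1;
    they are treated as exact, i.e. template fluctuations are ignored.
    """
    n_total = data_counts.sum()
    expected = n_total * (alpha * sig_shape + (1.0 - alpha) * bkg_shape)
    expected = np.clip(expected, 1e-12, None)  # guard against log(0)
    return np.sum(expected - data_counts * np.log(expected))


# Hypothetical 5-bin template shapes and data counts, for illustration only.
sig_shape = np.array([0.05, 0.20, 0.50, 0.20, 0.05])
bkg_shape = np.array([0.40, 0.30, 0.15, 0.10, 0.05])
data_counts = np.array([45.0, 50.0, 65.0, 30.0, 10.0])

result = minimize_scalar(
    binned_poisson_nll,
    bounds=(0.0, 1.0),
    method="bounded",
    args=(data_counts, sig_shape, bkg_shape),
)
alpha_hat = result.x
print(f"fitted signal fraction: {alpha_hat:.3f}")
```

Because the template shapes enter as fixed numbers, any statistical fluctuation in a low-statistics template propagates directly into the fitted fraction, which is the limitation discussed above.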

With the unbinned likelihood method, the template PDFs are analytical functions or kernel densities. For $N$ events, the joint likelihood can be written as:

$$ -\ln L = -\sum_{k=1}^{N} \ln\left( \sum_j \alpha_j f_j(x_k) \right), $$

where $\alpha_j$ is the fraction of component $j$ and $f_j(x_k)$ is the PDF of component $j$ evaluated at $x_k$. Unbinned template fitting can reduce the bias of the fitted signal number by avoiding binning artifacts. However, unbinned methods are sensitive to the PDF modeling, i.e., to the parameters describing the template distributions or to the widths of the kernel densities.
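The unbinned likelihood for a two-component mixture can be sketched as follows. This is a minimal example under stated assumptions: the Gaussian shapes, their parameters, and the helper name `unbinned_nll` are hypothetical stand-ins for the template PDFs, which are held fixed during the fit.

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import norm


def unbinned_nll(alpha, x, sig_pdf, bkg_pdf):
    """Unbinned NLL for a signal fraction alpha with fixed component PDFs."""
    density = alpha * sig_pdf(x) + (1.0 - alpha) * bkg_pdf(x)
    return -np.sum(np.log(np.clip(density, 1e-300, None)))


# Hypothetical Gaussian shapes standing in for the template PDFs.
sig_pdf = norm(loc=1.0, scale=0.2).pdf
bkg_pdf = norm(loc=0.3, scale=0.3).pdf

# Toy target sample: 200 signal-like and 200 background-like events.
rng = np.random.default_rng(1)
x = np.concatenate([
    rng.normal(1.0, 0.2, size=200),
    rng.normal(0.3, 0.3, size=200),
])

result = minimize_scalar(
    unbinned_nll,
    bounds=(0.0, 1.0),
    method="bounded",
    args=(x, sig_pdf, bkg_pdf),
)
alpha_hat = result.x
print(f"fitted signal fraction: {alpha_hat:.3f}")
```

In this sketch the PDFs are exact by construction; in practice their parameters (or kernel widths) must themselves be estimated, which is where the sensitivity to PDF modeling enters.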

In this paper, we propose the CombineFit method, which jointly models the templates and target data through analytical likelihood optimization, aiming to significantly reduce the biases and constrain the template uncertainties.

METHODOLOGY: LIKELIHOOD FOR COMBINEFIT

Generally, template fitting is performed by first constructing the signal template PDF $f_{\text{sig}}(\theta_s)$ and the background template PDF $f_{\text{bkg}}(\theta_b)$, by minimizing:

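1. Signal template NLL:

$$ -\ln L_{\text{sig}} = \sum_i \left[ \lambda_{s,i}(\theta_s) - n_{s,i} \ln \lambda_{s,i}(\theta_s) \right], $$

where $n_{s,i}$ is the signal count in the $i$-th bin, and $\lambda_{s,i}(\theta_s) = N_s^{\text{temp}} \int_{\text{bin } i} f_{\text{sig}}(x; \theta_s)\,dx$ represents the number of expected signal events in the $i$-th bin, with $N_s^{\text{temp}}$ the total number of events in the signal template.

2. Background template NLL:

$$ -\ln L_{\text{bkg}} = \sum_i \left[ \lambda_{b,i}(\theta_b) - n_{b,i} \ln \lambda_{b,i}(\theta_b) \right], $$

where $n_{b,i}$ is the background count in the $i$-th bin, and $\lambda_{b,i}(\theta_b) = N_b^{\text{temp}} \int_{\text{bin } i} f_{\text{bkg}}(x; \theta_b)\,dx$ represents the number of expected background events in the $i$-th bin, with $N_b^{\text{temp}}$ the total number of events in the background template.

The template PDFs with the fitted $\theta_s$ and $\theta_b$ are then used to fit the experimental data to determine the signal fraction $\alpha$, by minimizing the data NLL:

$$ -\ln L_{\text{data}} = \sum_i \left[ \lambda_i(\alpha) - n_i \ln \lambda_i(\alpha) \right], $$

where

$$ \lambda_i(\alpha) = N \int_{\text{bin } i} \left[ \alpha f_{\text{sig}}(x; \theta_s) + (1 - \alpha) f_{\text{bkg}}(x; \theta_b) \right] dx $$

represents the predicted number of events in the $i$-th bin, $n_i$ is the data event count, and $N$ is the total number of events in the target data.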

In this process, the NLL minimizations for the templates and for the data are separate: the parameters $\theta_s$ and $\theta_b$ are fixed during the fit of $\alpha$. Thus the uncertainties of the signal and background templates are not taken into account when fitting the composition.

In CombineFit, instead of fitting $\theta_s$, $\theta_b$, and $\alpha$ in separate steps, we construct a combined likelihood function that includes both the data and the templates:

$$ -\ln L_{\text{total}} = -\ln L_{\text{sig}} - \ln L_{\text{bkg}} - \ln L_{\text{data}}. $$

By minimizing this combined NLL, the template parameters $\theta_s$ and $\theta_b$ are constrained not only by their respective template data but also by the target data, i.e., they are determined simultaneously with the signal fraction $\alpha$.

This is similar to the TFractionFitter method, where the statistical uncertainties of the signal and background templates are taken into account. However, CombineFit differs from TFractionFitter in that it models the template distributions as continuous analytical functions (e.g., exponential, Gaussian mixtures), which limits the number of parameters and smooths the bin-wise fluctuations, reducing the biases and errors.
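The combined binned likelihood can be sketched as below. This is a minimal illustration, not the implementation used in the analysis: it assumes plain Gaussian shapes for both templates (the paper uses the ExpGaussExp function), a hypothetical binning `EDGES`, expectations normalized to the observed totals of each histogram, and SciPy's L-BFGS-B minimizer in place of whatever minimizer the authors used.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

EDGES = np.linspace(-1.0, 2.0, 31)  # hypothetical binning of the observable


def bin_probs(mu, sigma):
    """Per-bin probabilities of a Gaussian, renormalised to the histogram range."""
    probs = np.diff(norm.cdf(EDGES, loc=mu, scale=sigma))
    return probs / probs.sum()


def poisson_nll(counts, expected):
    """Binned Poisson NLL with the constant ln(n!) terms dropped."""
    expected = np.clip(expected, 1e-12, None)  # guard against log(0)
    return np.sum(expected - counts * np.log(expected))


def combined_nll(params, n_sig, n_bkg, n_data):
    """Sum of the signal-template, background-template and data NLLs."""
    mu_s, sigma_s, mu_b, sigma_b, alpha = params
    p_s = bin_probs(mu_s, sigma_s)
    p_b = bin_probs(mu_b, sigma_b)
    nll_sig = poisson_nll(n_sig, n_sig.sum() * p_s)
    nll_bkg = poisson_nll(n_bkg, n_bkg.sum() * p_b)
    mixture = alpha * p_s + (1.0 - alpha) * p_b
    nll_data = poisson_nll(n_data, n_data.sum() * mixture)
    return nll_sig + nll_bkg + nll_data


# Toy inputs: 5000-event signal template, 10-event background template,
# and a target sample of 200 signal + 200 background events.
rng = np.random.default_rng(42)
n_sig = np.histogram(rng.normal(1.0, 0.2, 5000), bins=EDGES)[0].astype(float)
n_bkg = np.histogram(rng.normal(0.3, 0.3, 10), bins=EDGES)[0].astype(float)
target = np.concatenate([rng.normal(1.0, 0.2, 200), rng.normal(0.3, 0.3, 200)])
n_data = np.histogram(target, bins=EDGES)[0].astype(float)

# All shape parameters and the signal fraction float in a single minimization.
start = np.array([0.9, 0.3, 0.4, 0.4, 0.5])
bounds = [(0.5, 1.5), (0.05, 1.0), (-0.5, 0.8), (0.05, 1.0), (0.0, 1.0)]
fit = minimize(combined_nll, start, args=(n_sig, n_bkg, n_data),
               method="L-BFGS-B", bounds=bounds)
alpha_hat = fit.x[4]
signal_yield = alpha_hat * n_data.sum()
print(f"fitted signal fraction = {alpha_hat:.3f}, signal yield = {signal_yield:.1f}")
```

The design point is that the 10-event background histogram contributes its own NLL term but does not fix the background shape: the 200 background events in the target sample also pull on `mu_b` and `sigma_b` through the data term.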

Similarly, CombineFit can also be implemented as an unbinned likelihood fit by replacing the binned Poisson likelihood functions with unbinned likelihood functions. This further reduces the biases due to binning effects.
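An unbinned variant of the combined likelihood can be sketched in the same way. As before, this is only an illustration under assumptions: Gaussian shapes stand in for the ExpGaussExp templates, the toy parameters are hypothetical, and the name `unbinned_combined_nll` is introduced here for the example.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm


def unbinned_combined_nll(params, x_sig, x_bkg, x_data):
    """Unbinned NLL of both template samples plus the target-data mixture."""
    mu_s, sigma_s, mu_b, sigma_b, alpha = params
    nll_sig = -np.sum(norm.logpdf(x_sig, mu_s, sigma_s))
    nll_bkg = -np.sum(norm.logpdf(x_bkg, mu_b, sigma_b))
    mixture = (alpha * norm.pdf(x_data, mu_s, sigma_s)
               + (1.0 - alpha) * norm.pdf(x_data, mu_b, sigma_b))
    nll_data = -np.sum(np.log(np.clip(mixture, 1e-300, None)))
    return nll_sig + nll_bkg + nll_data


# Toy samples: 5000 signal-template events, 10 background-template events,
# and 200 + 200 events in the target data.
rng = np.random.default_rng(3)
x_sig = rng.normal(1.0, 0.2, 5000)
x_bkg = rng.normal(0.3, 0.3, 10)
x_data = np.concatenate([rng.normal(1.0, 0.2, 200), rng.normal(0.3, 0.3, 200)])

start = np.array([0.9, 0.3, 0.4, 0.4, 0.5])
bounds = [(0.5, 1.5), (0.05, 1.0), (-0.5, 0.8), (0.05, 1.0), (0.0, 1.0)]
fit = minimize(unbinned_combined_nll, start, args=(x_sig, x_bkg, x_data),
               method="L-BFGS-B", bounds=bounds)
alpha_hat = fit.x[4]
print(f"fitted signal fraction = {alpha_hat:.3f}")
```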

MONTE CARLO SIMULATION TEST

The MC data generation is based on the data used in the time-variation antiproton flux analysis of the Alpha Magnetic Spectrometer (AMS-02), in which the number of antiproton signal events is extracted by performing a template fit on the mass distribution of negatively charged samples[ ]. The signal templates are constructed from the proton data sample, which is more than $10^4$ times more abundant than the antiproton sample. The background templates are constructed from the pion+electron sample, which has much lower statistics because tighter event selection criteria must be applied to obtain clean samples. Since the analysis is performed on a monthly basis, the statistics of the data sample are also low, with approximately 200 background events and 200 signal events.

The signal and background mass templates are modeled using a function (ExpGaussExp) characterized by a Gaussian core with exponential tails on both sides[ ], which represents their physical distributions well. These parameterized functions are then used for the toy Monte Carlo simulations.

1. Parameterization of Template PDFs from Data

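The AMS-02 proton sample and the electron+pion sample selected from flight data are fitted using the ExpGaussExp function, as shown in the accompanying figure:

$$ \text{ExpGaussExp}(x) = \begin{cases} \exp\left[ \dfrac{\alpha_L^2}{2} + \alpha_L \dfrac{x - x_0}{\sigma_L} \right], & x < x_0 - \alpha_L \sigma_L, \\ \exp\left[ -\dfrac{(x - x_0)^2}{2\sigma_L^2} \right], & x_0 - \alpha_L \sigma_L \le x \le x_0, \\ \exp\left[ -\dfrac{(x - x_0)^2}{2\sigma_R^2} \right], & x_0 < x \le x_0 + \alpha_R \sigma_R, \\ \exp\left[ \dfrac{\alpha_R^2}{2} - \alpha_R \dfrac{x - x_0}{\sigma_R} \right], & x > x_0 + \alpha_R \sigma_R, \end{cases} $$

where:

$x_0$: The central location parameter of the distribution.
$\sigma_L$, $\sigma_R$: The scale parameters on the left and right sides, respectively, analogous to the standard deviation, determining the spread around the central point.
$\alpha_L$, $\alpha_R$: The transition parameters on the left and right sides, respectively, specifying where the function transitions from a Gaussian core to exponential tails.

The resulting PDFs are used to generate the Monte Carlo (MC) data.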
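A function of this type can be written compactly in Python. The sketch below assumes the piecewise form implied by the parameter definitions (a Gaussian core with side-dependent widths, joined continuously to exponential tails at $x_0 - \alpha_L\sigma_L$ and $x_0 + \alpha_R\sigma_R$) and leaves the overall normalization out; the function name `exp_gauss_exp` and the example parameter values are hypothetical.

```python
import numpy as np


def exp_gauss_exp(x, x0, sigma_l, sigma_r, alpha_l, alpha_r):
    """Unnormalised Gaussian core with exponential tails; equals 1 at x = x0."""
    x = np.asarray(x, dtype=float)
    # Standardised distance from the peak, using the width of the relevant side.
    t = np.where(x < x0, (x - x0) / sigma_l, (x - x0) / sigma_r)
    core = np.exp(-0.5 * t**2)
    # The min/max keep the unused branch from overflowing far from its region.
    left_tail = np.exp(0.5 * alpha_l**2 + alpha_l * np.minimum(t, 0.0))
    right_tail = np.exp(0.5 * alpha_r**2 - alpha_r * np.maximum(t, 0.0))
    return np.where(t < -alpha_l, left_tail,
                    np.where(t > alpha_r, right_tail, core))


# Hypothetical parameter values, for illustration only.
params = dict(x0=1.0, sigma_l=0.2, sigma_r=0.3, alpha_l=1.5, alpha_r=1.2)
grid = np.linspace(-1.0, 3.0, 401)
values = exp_gauss_exp(grid, **params)
print(f"maximum of the shape on the grid: {values.max():.3f}")
```

Both the function and its first derivative are continuous at the two transition points, which is what allows the tails to be fitted without introducing kinks in the template.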

Figure: Mass distributions with large-statistics samples: (a) background distribution, including electrons and pions; (b) signal distribution obtained from proton data. The black triangles and squares are the data points and the red lines represent the fits to the ExpGaussExp function.

2. Generation of MC Data

For each MC test, the signal templates, background templates, and target datasets (mixtures of signal and background) are all generated independently using the fitted PDFs. For each test, we generate:

Signal Template: 5000 events
Background Template: ranging from 10 to 8000 events
Target Data: 400 events per experiment, composed of a mix of 200 signal events and 200 background events.

The accompanying figure shows an example of the generated (a) background template, (b) signal template, and (c) target data. For each configuration of event numbers, multiple rounds of experiments are simulated.
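The generation step listed above can be sketched with simple accept-reject sampling. All numerical shape parameters, the mass range, and the helper names (`sample`, `generate_toy`) are hypothetical; the shape function repeats the assumed ExpGaussExp form so that the block is self-contained.

```python
import numpy as np


def exp_gauss_exp(x, x0, sigma_l, sigma_r, alpha_l, alpha_r):
    """Unnormalised Gaussian core with exponential tails; peak value 1 at x0."""
    x = np.asarray(x, dtype=float)
    t = np.where(x < x0, (x - x0) / sigma_l, (x - x0) / sigma_r)
    left = np.exp(0.5 * alpha_l**2 + alpha_l * np.minimum(t, 0.0))
    right = np.exp(0.5 * alpha_r**2 - alpha_r * np.maximum(t, 0.0))
    core = np.exp(-0.5 * t**2)
    return np.where(t < -alpha_l, left, np.where(t > alpha_r, right, core))


def sample(rng, n, params, lo, hi):
    """Draw n values in [lo, hi] by accept-reject under a flat envelope of height 1."""
    accepted = np.empty(0)
    while accepted.size < n:
        x = rng.uniform(lo, hi, size=4 * n + 100)
        keep = rng.uniform(size=x.size) < exp_gauss_exp(x, *params)
        accepted = np.concatenate([accepted, x[keep]])
    return accepted[:n]


# Hypothetical (x0, sigma_L, sigma_R, alpha_L, alpha_R) values and mass range.
SIG_PARAMS = (0.94, 0.10, 0.12, 1.5, 1.2)
BKG_PARAMS = (0.20, 0.15, 0.25, 1.0, 0.8)
MASS_RANGE = (-1.0, 3.0)


def generate_toy(rng, n_bkg_template, n_sig_template=5000, n_sig=200, n_bkg=200):
    """One toy experiment: two independent templates and a mixed target sample."""
    lo, hi = MASS_RANGE
    return {
        "sig_template": sample(rng, n_sig_template, SIG_PARAMS, lo, hi),
        "bkg_template": sample(rng, n_bkg_template, BKG_PARAMS, lo, hi),
        "target": np.concatenate([
            sample(rng, n_sig, SIG_PARAMS, lo, hi),
            sample(rng, n_bkg, BKG_PARAMS, lo, hi),
        ]),
    }


rng = np.random.default_rng(7)
toys = {n: generate_toy(rng, n) for n in (10, 100, 1000, 8000)}
print({n: toy["bkg_template"].size for n, toy in toys.items()})
```

The templates and the target sample are drawn in separate calls, so their statistical fluctuations are independent, as required by the test described above.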

3. Decomposition of signal contribution

For each generated dataset, we perform fits to obtain the signal numbers with several methods: the CombineFit described in this paper, the TFractionFitter implemented in the ROOT framework[ ], and the recommended method [ ] implemented in the latest iMinuit package (denoted DA below).

Figure: Example of a toy MC experiment, shown as events per bin versus mass: (a) background template; (b) 5000-event signal template; (c) target data (200 background + 200 signal events). The black dots are the generated data and the curves are the functions fitted with CombineFit. The magenta (dash-dotted) and red (dashed) curves are constrained by both the template histograms in (a) and (b) and the target data histogram in (c).

To further evaluate the robustness of our method, we also implemented the unbinned version of CombineFit and compared it with RooFit[ ] results obtained using RooAbsPdf (analytical function) and RooKeysPdf (kernel density estimation). The unbinned CombineFit follows the same procedure as described for the binned scenario, but with unbinned datasets for the signal, background, and target data.

RESULTS

For each configuration of event numbers, multiple rounds of MC simulation are performed, and the different template fitting methods are applied to obtain the signal numbers. The distribution of fitted signal numbers from each method is obtained and follows a Gaussian distribution. A Gaussian fit is performed: the difference between the Gaussian mean and the number of generated signal events represents the bias, and the Gaussian $\sigma$ represents the uncertainty of the method. The root mean squared error then represents the total error relative to the number of generated signal events:

$$ \text{Total error} = \frac{\sqrt{\text{bias}^2 + \sigma^2}}{N_{\text{gen}}}, $$

where $N_{\text{gen}}$ is the number of generated signal events.
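The three figures of merit can be computed from a set of fitted signal numbers as in the sketch below. As a simplification, the sample mean and sample standard deviation are used here in place of the Gaussian fit described above; the function name `summarize_fits` and the input values are hypothetical.

```python
import numpy as np


def summarize_fits(fitted_signal, n_generated):
    """Relative bias, uncertainty and total error of a set of fitted yields.

    Uses the sample mean and standard deviation as stand-ins for the mean
    and sigma of a Gaussian fit to the distribution of fitted yields.
    """
    fitted_signal = np.asarray(fitted_signal, dtype=float)
    bias = fitted_signal.mean() - n_generated
    sigma = fitted_signal.std(ddof=1)
    total = np.sqrt(bias**2 + sigma**2)
    return bias / n_generated, sigma / n_generated, total / n_generated


# Hypothetical fitted yields from four toy experiments with 200 true signal events.
rel_bias, rel_sigma, rel_total = summarize_fits([190.0, 200.0, 210.0, 220.0], 200)
print(f"bias = {rel_bias:.3%}, uncertainty = {rel_sigma:.3%}, total = {rel_total:.3%}")
```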

1. Binned Scenario

Panel (a) of the binned-scenario figure shows the bias in the signal numbers as a function of the statistics of the background template for the different methods. As expected, for all methods the bias decreases as the number of events in the background template increases. The results show that the TFractionFitter and DA methods exhibit large biases under low template statistics (≈ 30%), while the CombineFit method results in a minimal bias of around 2% with 10 events in the background template, dropping further with 100 events, where the other methods still show larger biases.

Panel (b) shows the statistical uncertainties of the different methods. With 10 events in the background template, CombineFit shows an uncertainty of 5.8%, compared to about 9% for DA and 10% for TFractionFitter. With more than 100 events in the background template, all methods show uncertainties below 5%.

Panel (c) shows the total error. With fewer than 100 events in the background template, since CombineFit has both the smaller bias and the smaller sigma, its total error is also smaller. As the number of events increases, all methods perform similarly.

2. Unbinned Scenario

Panel (a) of the unbinned-scenario figure shows that, as in the binned scenario, the estimated signal yields converge toward the true signal value as the number of background template events increases, resulting in diminishing biases. In particular, the unbinned RooFit function fit shows larger biases than the binned CombineFit method, demonstrating that separating the minimization of the templates from that of the target data causes larger biases. The unbinned CombineFit method further reduces the bias to about 1% with 10 events in the background template, and the bias remains small with 100 events. The RooFit KDE method shows faster-decreasing biases; however, it exhibits a systematic shift to negative biases at larger numbers of events, demonstrating the limitation of this method.

Panel (b) shows the statistical uncertainties of the different methods. With 10 events in the background template, the unbinned RooFit methods show larger uncertainties than the binned methods. The unbinned and binned versions of CombineFit have similar uncertainties. With more than 100 events, all methods show uncertainties below 5%, consistent with the binned methods.

Panel (c) shows the total error. With 10 events in the background template, since the unbinned CombineFit has a smaller bias, its total error is also smaller. With more than 100 events in the background template, all methods perform similarly.

Figure (binned scenario): (a) The bias of the fitted signal number as a function of the number of events in the background templates (legend: CombineFit, TFractionFitter; magenta triangles: DA). (b) The statistical uncertainties of the fitted signal number as a function of the number of events in the background templates. (c) The total error representing the deviation relative to the number of generated signal events.

3. Robustness Across Configurations

The tables demonstrate CombineFit's robust performance under varied signal/background compositions in the target data and different numbers of events in the templates. They compare the total errors of the different methods for $N_b : N_s = 200 : 200$, $N_b : N_s = 50 : 200$, or $N_b : N_s = 200 : 50$ in the target data, with different numbers of events in the background or signal templates. In the table for $N_b : N_s = 200 : 200$, the numbers of signal and background events are the same in the target data, and the background and signal templates are each varied. In the table for $N_b : N_s = 50 : 200$ or $200 : 50$, only the template corresponding to the component with fewer events in the target data is varied. When the statistics of one template are varied, the other is fixed at 8000 events so that their contributions can be checked independently.

As shown in the tables, at low statistics the CombineFit methods result in the smallest errors compared to the other methods. In most cases, the binned CombineFit method shows smaller errors than the unbinned version. In the two tables for the binned and unbinned scenarios, the target data are fixed to $N_b : N_s = 200 : 200$, but the numbers of events in both the signal and background templates are small compared to the target data. In most cases, CombineFit shows the smallest total errors.

Figure (unbinned scenario): (a) The bias between the fitted signal and the true signal events as a function of the number of events in the background templates (red dots: binned CombineFit; blue circles: unbinned CombineFit; magenta squares: RooFit with function PDF; green triangles: RooFit with KDE). (b) The statistical uncertainties of the fitted signal number. (c) The total error representing the deviation relative to the number of generated signal events.

CONCLUSIONS

A new approach for signal number estimation in low-statistics measurements, CombineFit, which uses analytical functions to simultaneously fit the templates and data by combining the likelihoods of the data and every component, is proposed and tested. CombineFit improves fitting stability and accuracy compared to traditional methods by reducing the bias from ∼30% to <2% and lowering errors by 35% (with 10 events in the background template). It further reduces the bias to 1% in the unbinned scenario while maintaining accuracy across different configurations.

This method has been successfully applied in the time-variation analysis of the cosmic antiproton flux with AMS, providing unique information for understanding particle transport in the Solar System[ ]. The method can also be applied to other experiments searching for new physics.

Table: Total error (%) and ratio for $N_b : N_s = 200 : 200$ in the target data. Columns: $N_{\text{temp}}$, CombineFit, UCombineFit, RooFit.

Table: Total error (%) and ratio for $N_b : N_s = 50 : 200$ or $N_b : N_s = 200 : 50$ in the target data. Columns: $N_{\text{temp}}$, CombineFit, UCombineFit, RooFit.

Table: Total error (%) for $N_b : N_s = 200 : 200$ in the target data (binned scenario), for pairs of template sizes $N_{\text{temp}}$. Entries: CombineFit.

Table: Total error (%) for $N_b : N_s = 200 : 200$ in the target data (unbinned scenario), for pairs of template sizes $N_{\text{temp}}$. Entries: UCombineFit, RooFit.

REFERENCES

[1] F. James, Statistical Methods in Experimental Physics (2nd Edition), World Scientific Publishing Company, 2006. ISBN: 9789813101845.

[2] J. A. Nelder and R. W. M. Wedderburn, "Generalized Linear Models," Journal of the Royal Statistical Society. Series A (General), vol. 135, no. 3, pp. 370–384, 1972.

[3] M. Aguilar, G. Ambrosi, H. Anderson, et al. (AMS Collaboration), "Antiprotons and Elementary Particles over a Solar Cycle: Results from the Alpha Magnetic Spectrometer," Physical Review Letters, vol. 134, 051002, 2025. Available at: 10.1103/PhysRevLett.134.051002

[4] S. Baker and R. Cousins, "Clarification of the use of chi-square and likelihood functions in fits to histograms," Nuclear Instruments and Methods in Physics Research, vol. 221, pp. 437–442, 1984. Available at: com/science/article/pii/0167508784900164

[5] R. Barlow and C. Beeston, "Fitting using finite Monte Carlo samples," Computer Physics Communications, vol. 77, pp. 219–228, 1993. Available at: https: pii/001046559390005W

[6] R. Brun and F. Rademakers, "ROOT - An object oriented data analysis framework," Nuclear Instruments and Methods in Physics Research A, vol. 389, pp. 81–86, 1997. Available at: 10.1016/S0168-9002(97)00048-X

[7] M. Aguilar, L. Ali Cavasonza, G. Ambrosi, et al. (AMS Collaboration), "The Alpha Magnetic Spectrometer (AMS) on the International Space Station: Part II - Results from the First Seven Years," Physics Reports, vol. 894, pp. 1–116, 2021. Available at: com/science/article/pii/S0370157320303434

[8] G. Aad, B. Abbott, K. Abeling, et al. (ATLAS Collaboration), "Precise measurements of W- and Z-boson transverse momentum spectra with the ATLAS detector using pp collisions at √s = 5.02 TeV and 13 TeV," Eur. Phys. J. C, vol. 84, 1126, 2024. Available at: epjc/s10052-024-13414-0

[9] G. Aad et al. (ATLAS Collaboration), "Combined measurements of Higgs boson production and decay using up to 80 fb⁻¹ of proton-proton collision data at √s = 13 TeV collected with the ATLAS experiment," Phys. Rev. D, vol. 101, no. 1, 012002, 2020. Available at: 10.1103/PhysRevD.101.012002

[10] A. M. Sirunyan et al. (CMS Collaboration), "Combined measurements of Higgs boson couplings in proton–proton collisions at √s = 13 TeV," Eur. Phys. J. C, vol. 79, no. 5, 421, 2019. Available at: article/10.1140/epjc/s10052-019-6909-y

[11] M. Aguilar et al. (AMS Collaboration), "Antiproton Flux, Antiproton-to-Proton Flux Ratio, and Properties of Elementary Particle Fluxes in Primary Cosmic Rays Measured with the Alpha Magnetic Spectrometer on the International Space Station," Phys. Rev. Lett., vol. 117, 091103, 2016. Available at: https:

[12] J. S. Conway, "Incorporating Nuisance Parameters in Likelihoods for Multisource Spectra," arXiv:1103.0354.

[13] H. Dembinski and A. Abdelmotteleb, "A new maximum-likelihood method for template fits," Eur. Phys. J. C, vol. 82, 1043, 2022. Available at: epjc/s10052-022-11019-z

[14] C. A. Argüelles, A. Schneider and T. Yuan, "A binned likelihood for stochastic models," J. High Energ. Phys., 2019:30, 2019. Available at: JHEP06(2019)030

[15] F. James and M. Roos, "Minuit: A System for Function Minimization and Analysis of the Parameter Errors and Correlations," Comput. Phys. Commun., vol. 10, pp. 343–367, 1975. Available at: 0010-4655(75)90039-9

[16] S. Das, "A simple alternative to the Crystal Ball function," arXiv:1603.08591v1. Available at: 10.48550/arXiv.1603.08591

[17] W. Verkerke and D. Kirkby, "The RooFit toolkit for data modeling," arXiv:physics/0306116. Available at: article/pii/0167508784900164
