Unraveling the Black Box of Neural Networks: A Dynamic Extremum Mapper
Shengjian Chen
Submitted 2025-10-13 | ChinaXiv: chinaxiv-202510.00066


Unraveling the Black Box of Neural Networks: A Dynamic Extremum Mapper

Shengjian Chen
Intelligent Robotics Center, Jihua Laboratory
Foshan, 528200, China
chshengj@alumni.sysu.edu.cn, chensj@jihualab.ac.cn

Abstract

We argue that neural networks are not black boxes, and that their generalization capability stems from the ability to dynamically map a dataset to the extrema of the model function. We further prove that the number of extrema in a neural network is positively correlated with the number of its parameters. We then propose a novel algorithm that differs significantly from backpropagation, which primarily obtains parameter values by solving a system of linear equations. Within this framework, challenging issues such as gradient vanishing and overfitting can be explained and addressed straightforwardly.

Keywords: neural network, generalization, black box, extreme increment, homogeneous system of linear equations, large language model

1 Introduction

Although artificial intelligence models based on neural networks have been extensively studied and widely applied—achieving prediction accuracy in image recognition, natural language processing, text processing, and question answering that far surpasses traditional machine learning algorithms—research on their underlying principles remains limited, and they are still generally regarded as black boxes. As model parameters increase dramatically from ANNs to CNNs, RNNs, and further to GPT and LLMs \cite{Wu et al., 2025}, system complexity rises sharply while stability becomes increasingly vulnerable. Without understanding the model's logic, we cannot quickly identify root causes and resolve issues when malfunctions occur. While neural network algorithms can be confidently deployed in domains with low real-time requirements, such as image classification and AI-generated artwork, high-stakes, real-time domains—particularly safety-critical applications like autonomous driving \cite{Kiran et al., 2021}—demand greater attention to the fundamental principles of neural networks, clarifying the conditions under which they succeed and fail so that AI can better serve human society.

Despite the prohibitive complexity of modern neural network architectures, scholars continue striving to explore their working principles. Buhrmester et al. \cite{Buhrmester et al., 2021} investigated recently popular explanation methods that attempt to interpret neural networks by analyzing input-output connections. Black-box explainers, characterized by their ability to reveal model interactions without accessing internal structures, are primarily divided into ante-hoc systems with global, model-agnostic features \cite{Lipton, 2018} and post-hoc systems with local, model-specific features \cite{Ribeiro et al., 2016}. Oh et al. \cite{Oh et al., 2019} analyzed neural networks from a reverse engineering perspective, finding them extremely vulnerable to various attacks and noting that the boundary between white-box and black-box models is not clearly defined. Tishby and Zaslavsky \cite{Tishby and Zaslavsky, 2015} proposed the Information Plane, arguing that neural networks primarily optimize the Information Bottleneck between compression and prediction at each layer, with Shwartz-Ziv and Tishby \cite{Shwartz-Ziv and Tishby, 2017} later proving its effectiveness. While these works provide valuable references for foundational neural network research, significant challenges remain.

Current researchers appear overly focused on engineering-based explanations of neural network behavior while neglecting theoretical or mathematical perspectives. Although neural networks may seem complex, their structure is clear—composed essentially of identical neurons—making them particularly amenable to mathematical analysis. It is necessary to revisit the pioneering work of Cybenko \cite{Cybenko, 1989} and Hornik et al. \cite{Hornik et al., 1989}, who proved that feedforward neural networks can approximate any continuous function on a compact set. Specifically, a feedforward neural network with a single hidden layer containing sufficient neurons and using the sigmoid activation function can approximate any complex function with arbitrary accuracy, establishing a fundamental mathematical principle for neural networks. The limitation of this work is its lack of a method for finding the specific function for a given dataset or determining whether that function is optimal. Our work addresses these deficiencies.

Specifically, our contributions include: (1) We present the main characteristics of an ideal machine learning model and derive general model training steps, discussed primarily in Section 2. (2) We examine whether neural networks satisfy these ideal characteristics, demonstrating mathematically that neural networks achieve generalization primarily by mapping a dataset to the local extrema of the function. We further propose a model training algorithm distinct from backpropagation (BP), called the extremum-increment (EI) algorithm, discussed in Sections 3 and 4. (3) Based on the EI algorithm, we can readily explain the causes of common problems such as vanishing/exploding gradients and overfitting, and provide corresponding solutions, discussed in Section 5.

2 General Characteristics of an Ideal Model

Let us temporarily set aside the concept of neural networks and consider the fundamental characteristics a model should possess to satisfy a dataset and target task. The training objective of machine learning is to obtain a function curve that precisely fits all sample inputs to their corresponding outputs. In other words, the model should clearly state the exact value for each input sample. For classification problems, for instance, the model should output "This is a cat" rather than the vague "This is very likely a cat."

2.1 Precise Mapping

Situations without identical-type samples: For visualization purposes, we limit the sample size to 3 in this discussion. As shown in Figure 1 [FIGURE:1], let the dataset be $D = \{(x^{(i)}, y^{(i)}) \mid i \in [1, 3]\}$, where $(x^{(i)}, y^{(i)})$ represents the $i$-th sample, $x^{(i)}$ is the original representation of the sample, and $y^{(i)}$ is the category to which $x^{(i)}$ belongs. Our goal is to find a function $F$ for each $x^{(i)}$ such that $y^{(i)} = F(x^{(i)})$.

To reveal the true working principle of neural networks, we abandon the concepts of feature and label, instead using surface and essence to refer to $x^{(i)}$ and $y^{(i)}$, respectively. To grasp the core problem and simplify it, both surface and essence in Section 2 are represented as scalars. The function $F$ shown in Figure 1 is the ideal model we seek, as it can precisely provide the corresponding essence $y^{(i)}$ for any surface $x^{(i)}$.

Situations with identical-type samples: As shown in Figure 2 [FIGURE:2], if a new sample essentially identical to one in dataset $D$ is added—for instance, adding sample $(x^{(3,1)}, y^{(3)})$—the function curve $F$ must change shape so that the new sample falls exactly on the function curve. At this point, a local maximum appears on the function curve between samples $(x^{(3,1)}, y^{(3)})$ and $(x^{(3)}, y^{(3)})$.

Similarly, as shown in Figure 3 [FIGURE:3], if new samples sharing the essence $y^{(3)}$—such as $(x^{(3,2)}, y^{(3)})$, $(x^{(3,3)}, y^{(3)})$, and $(x^{(3,4)}, y^{(3)})$—are continuously added, the function curve must further change shape to accommodate these new samples, thereby forming multiple local minima and maxima. If function $F$ can achieve such shape alteration, it possesses the ability to precisely map any surface to its essence, meaning it has true generalization capability.

2.2 Weakened Mapping

Obtaining the aforementioned ideal function typically requires enormous computation. For a function with a limited number of parameters, the degree of curve shape change is constrained, and the number of extreme values cannot be increased arbitrarily. How then should we handle situations where a sample's surface changes only slightly while its essence remains unchanged? A natural approach is to expand the essence from a single point to an interval, allowing samples with slightly different surfaces but the same essence to be concentrated within this interval. As shown in Figure 4 [FIGURE:4], we add sample $(x^{(3,5)}, y^{(3)})$, where the distance between $x^{(3,5)}$ and $x^{(3)}$ is sufficiently small. We adjust the precise mapping function $F$ to the approximate fitting function $F^*$, making the difference between $F^*(x^{(3,5)})$ and $F^*(x^{(3)})$ as small as possible. When $|F^*(x^{(3,5)}) - F^*(x^{(3)})|$ is sufficiently small, we can approximately consider that all surfaces whose function values fall within the interval $[F^*(x^{(3,5)}), F^*(x^{(3)})]$ share the essence $y^{(3)}$. We then call the function $F^*$ a weakened model of the function $F$.

Interval partition: Each sample consists of both a surface and an essence. A surface is typically a one-dimensional vector or multi-dimensional matrix. Once the algorithm for generating the surface is determined—for example, using a two-dimensional matrix to represent a grayscale image where each element's value ranges from 0 to 255—the surface becomes fixed and cannot be further modified. The essence, however, is different. It is usually just an abstract concept and can be represented by any scalar or vector. As shown in Figure 4, if the shape of function $F$'s curve is restricted, each essence requires a tolerance interval. How is this interval selected? One method is to divide the range of function $F$ into $N$ intervals of equal length, where $N$ is the total number of essence types. Each interval is then assigned to an essence, and surfaces falling within the same interval share that essence. When using this method, the range of the objective function's values should be finite.
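As a brief sketch of the interval-partition rule (the Python function below and its boundary handling are our own illustrative assumptions, not part of the paper), an output value in a bounded range can be mapped to one of $N$ equal-length essence intervals as follows:

def essence_from_output(value, lower, upper, n_essences):
    # Map an output in [lower, upper] to one of n_essences equal-length intervals (1-based essence index).
    width = (upper - lower) / n_essences
    index = int((value - lower) // width)
    return min(max(index, 0), n_essences - 1) + 1

# Example: a sigmoid-bounded output in (0, 1) split among 4 essence types.
print(essence_from_output(0.63, 0.0, 1.0, 4))   # -> 3 (the value falls in the third interval)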

2.3 From N-Classification to Binary Classification

A problem arises with the interval partition method: when there are many essence types and the function $F$'s value range is limited to a small interval—for example, when each element in a neural network's output layer is constrained to $(0, 1)$—overlap becomes likely. One solution is to reduce the number of essence types, thereby expanding each partition. However, this introduces two new questions: to what extent should essence types be reduced, and how should excluded essences be handled?

To address these issues, we can reduce the number of essences to exactly one type. That is, the target model changes from an $N$-classification function $F$ to $N$ binary classification functions $\{F_j \mid j \in [1, N]\}$, where the $j$-th binary classification function $F_j$ only determines whether the input sample belongs to the $j$-th essence type. For any given sample $(x^{(i)}, y^{(i)})$ where $i > 0$, the ideal objective function $F_j$ satisfies:

$$
F_j(x^{(i)}) = \begin{cases}
1 & y^{(i)} = j \\
0 & y^{(i)} \neq j
\end{cases}
$$

The weakened objective function $F_j^*$ satisfies:

$$
F_j^*(x^{(i)}) \in \begin{cases}
\left(\frac{LB^* + UB^*}{2},\, UB^*\right] & y^{(i)} = j \\
\left[LB^*,\, \frac{LB^* + UB^*}{2}\right) & y^{(i)} \neq j
\end{cases}
$$

where $LB$ and $UB$ are the lower and upper limits of function $F_j$, and $LB^*$ and $UB^*$ are the corresponding limits of $F_j^*$. For the ideal function $F_j$, each given sample is adjusted to be an extremum point. Figure 5 [FIGURE:5] illustrates a binary classification function $F_3$. We adjust $F_3$'s parameters so that all third-essence samples are adjusted to the function's upper limit, while all other-essence samples are adjusted to the lower limit. Correspondingly, the weakened function $F_3^*$ uses the midpoint of the value range as the dividing line. The same parameter adjustment applies to other binary classification functions (such as $F_1$, $F_2$).
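To make the transition from $N$-classification to $N$ binary classification functions concrete, the short sketch below (an illustration under our own assumptions, not code from the paper) builds the ideal binary targets that each $F_j$ should output for a batch of labeled samples:

import numpy as np

def binary_targets(essences, n_essences):
    # Row i, column j-1 holds the ideal value of F_j for sample i: 1 if y_i = j, else 0.
    targets = np.zeros((len(essences), n_essences))
    targets[np.arange(len(essences)), np.asarray(essences) - 1] = 1.0
    return targets

print(binary_targets([3, 1, 3, 2], 3))
# Column j-1 is the target sequence that the binary classification function F_j must reproduce.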

2.4 General Training Process of an Ideal Model

In summary, the ideal training process for all machine learning models that are essentially classification problems can be summarized as follows: (1) Transform the $N$-class objective function $F$ into a family of binary classification functions ${F_j|j \in [1, N]}$ and initialize all parameters. (2) For each $F_j$, adjust parameters so that each training surface $x^{(i)}$ is exactly one of the function's extrema. (3) For each $F_j$, adjust parameters so that training samples of the $j$-th essence become local maxima, while non-$j$-th essence samples become local minima. (4) Adjust parameters to make local maxima become global maxima and local minima become global minima.

Section 3 demonstrates that neural networks can be decomposed into binary classification functions, with corresponding surfaces mapped to local extrema by finding the general solution of a homogeneous system of linear equations, thereby validating Steps 1 and 2. Section 4 presents a method for mapping surfaces to global extrema by enumerating particular solutions of the homogeneous system, thus confirming Steps 3 and 4.

3.1 Model Decomposition

Any neural network—whether a traditional artificial neural network, a convolutional neural network, or a recurrent neural network—consists of three components: an input vector with a fixed number of elements, an intermediate processing layer with undetermined parameters, and an output vector whose length equals the number of essence types. To reduce computational complexity and focus on the core working process, we conduct derivative analysis only on fully connected neural networks. Additionally, mainstream neural networks often use the softmax function in the output layer, which is merely a normalization operation added to the sigmoid function. To simplify operations, we directly use the sigmoid function as the output layer, so both hidden and output layers employ the sigmoid function.

Furthermore, we remove biases as they are irrelevant to the model's essential attributes but make calculations cumbersome and reduce readability. Based on this simplification, we analyze the structure of a fully connected neural network for an $N$-classification problem.

Figure 6 [FIGURE:6] schematically illustrates a fully connected neural network expressed through numerical relationships rather than graphical representation. Each sample is denoted as $(x, y)$, where the surface $x$ is an $m$-dimensional column vector $x = (x_1, x_2, \ldots, x_m)^T$, and $y \in [1, l_n]$ is the essence corresponding to $x$, with $l_n$ representing the number of elements in the output vector (the number of essence types).

The neural network comprises $n$ layers with identical processing methods, where the first $n-1$ layers are hidden layers and the $n$-th layer is the output layer. The total number of elements in the $u$-th layer (with the input vector as the 0-th layer) is denoted $l_u$, and the $v$-th element in the $u$-th layer is denoted $h^{[u]}_v(x)$, where $v \in [1, l_u]$. As shown in Figure 6, despite the dazzling neuronal connections, a neural network is actually a set of $l_n$ composite functions $\{h^{[n]}_v(x) \mid v \in [1, l_n]\}$, all sharing the same hidden layers.

For samples belonging to the $v$-th essence ($v \in [1, l_n]$), the neural network's target output vector is $(0, \ldots, h^{[n]}_v(x) = 1, \ldots, 0)^T$. For all other essence types, the target output vector is $(\omega, \ldots, h^{[n]}_v(x) = 0, \ldots, \omega)^T$, where one $\omega$ is 1 and the others are 0. In this simplified model, it is worth noting that when the sigmoid function is used as output, the upper and lower limits of $h^{[n]}_v(x)$ can only approach 1 and 0 asymptotically. When we transform the seemingly $l_n$-dimensional output vector into $l_n$ scalars, the entire model becomes clear: each composite function $h^{[n]}_v(x)$ is actually a binary classification problem for the $v$-th essence.

Therefore, a neural network with a multi-dimensional output vector can be regarded as a collection of multiple binary classification functions, as shown in Figure 7 [FIGURE:7]. We can analyze each function $h^{[n]}_v(x)$ separately and then integrate them to obtain the neural network's characteristics.

3.2 Extreme Points of the Model

Specifically, the expression for function $h^{[u]}_v(x)$ satisfies:

$$
h^{[u]}_v(x) = S\left(\sum_{k=1}^{l_{u-1}} w^{[u]}_{v,k} \cdot h^{[u-1]}_k(x)\right), \quad u > 1
$$

$$
h^{[u]}_v(x) = S\left(\sum_{k=1}^{m} w^{[u]}_{v,k} \cdot x_k\right), \quad u = 1
$$

where $S(\theta) = \frac{1}{1 + e^{-\theta}}$ is the sigmoid function, and $w^{[u]}_{v,k}$ represents the parameters between the $(u-1)$-th and $u$-th layers, identical to traditional neural network parameters. To enhance readability, let $z^{[u]}_v(x) = \sum_{k=1}^{l_{u-1}} w^{[u]}_{v,k} \cdot h^{[u-1]}_k(x)$. Taking the partial derivative of $h^{[u]}_v(x)$ with respect to $x_t$:

$$
\frac{\partial h^{[u]}_v(x)}{\partial x_t} = S'(z^{[u]}_v(x)) \cdot \frac{\partial z^{[u]}_v(x)}{\partial x_t} = S(z^{[u]}_v(x)) \cdot (1 - S(z^{[u]}_v(x))) \cdot \frac{\partial z^{[u]}_v(x)}{\partial x_t}
$$

where $t \in [1, m]$, using the derivative property $S'(\theta) = S(\theta) \cdot (1 - S(\theta))$. Let $c^{[u]}_v(x) = S(z^{[u]}_v(x)) \cdot (1 - S(z^{[u]}_v(x)))$, then:

$$
\frac{\partial h^{[u]}_v(x)}{\partial x_t} = c^{[u]}_v(x) \cdot \sum_{k=1}^{l_{u-1}} w^{[u]}_{v,k} \cdot \frac{\partial h^{[u-1]}_k(x)}{\partial x_t}
$$

For an extreme point, $\frac{\partial h^{[u]}_v(x)}{\partial x_t} = 0$. Since $c^{[u]}_v(x) > 0$, we have $\sum_{k=1}^{l_{u-1}} w^{[u]}_{v,k} \cdot \frac{\partial h^{[u-1]}_k(x)}{\partial x_t} = 0$. Starting from the output layer ($u = n$) and taking partial derivatives with respect to all components of $x$, we obtain the following system of equations:

$$
L(n, v) = \begin{cases}
\sum_{k=1}^{l_{n-1}} w^{[n]}_{v,k} \cdot \frac{\partial h^{[n-1]}_k(x)}{\partial x_1} = 0 \\
\sum_{k=1}^{l_{n-1}} w^{[n]}_{v,k} \cdot \frac{\partial h^{[n-1]}_k(x)}{\partial x_2} = 0 \\
\vdots \\
\sum_{k=1}^{l_{n-1}} w^{[n]}_{v,k} \cdot \frac{\partial h^{[n-1]}_k(x)}{\partial x_m} = 0
\end{cases}
$$

When a surface $x$ is given, this system constitutes a homogeneous linear system of $m$ equations with $l_{n-1}$ independent variables $\{w^{[n]}_{v,k} \mid k \in [1, l_{n-1}]\}$ and $m \cdot l_{n-1}$ coefficients $\{\frac{\partial h^{[n-1]}_k(x)}{\partial x_t} \mid k \in [1, l_{n-1}], t \in [1, m]\}$. Let the rank of the coefficient matrix of $L(n, v)$ be $r(n, v)$. When $r(n, v)$ is less than the number of unknowns $l_{n-1}$, the linear system has infinitely many solutions. Since $r(n, v) \leq m$, as long as the number of neurons in the last hidden layer $l_{n-1}$ exceeds $m$ when designing a neural network, we can always find infinitely many parameter combinations that make surface $x$ an extremum point of the binary classification function $h^{[n]}_v(x)$ for any $v \in [1, l_n]$. This is the primary reason neural networks possess strong generalization ability, and the black box begins to be unveiled.
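The following toy sketch (with randomly generated coefficients standing in for the partial derivatives $\partial h^{[n-1]}_k(x)/\partial x_t$; all shapes and names are our own illustrative assumptions) shows how, once a surface $x$ fixes those coefficients, output-layer weights that make $x$ a stationary point of $h^{[n]}_v(x)$ can be read directly from the null space of the $m \times l_{n-1}$ coefficient matrix of $L(n, v)$:

import numpy as np
from scipy.linalg import null_space

rng = np.random.default_rng(0)

m, l_prev = 4, 6                      # surface dimension and width of the last hidden layer (l_prev > m)
J = rng.standard_normal((m, l_prev))  # J[t, k] stands in for d h^{[n-1]}_k(x) / d x_t at the given surface

basis = null_space(J)                 # columns span the general solution of L(n, v): J @ w = 0
print(basis.shape[1])                 # 2 free directions, i.e., infinitely many parameter combinations

w = basis @ rng.standard_normal(basis.shape[1])   # one member of the general solution
print(np.allclose(J @ w, 0.0))        # all partial derivatives of z^{[n]}_v vanish at the given surface

Because $c^{[n]}_v(x) > 0$, the same weights also make every partial derivative of $h^{[n]}_v(x)$ vanish at $x$, which is exactly the condition imposed by $L(n, v)$.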

The curve shapes of other binary classification functions can be adjusted simultaneously. Let:

$$
L(n, -) = L(n, 1) \cup L(n, 2) \cup \cdots \cup L(n, l_n)
$$

When given a surface $x$, $L(n, -)$ forms a homogeneous linear system of $m \cdot l_n$ equations with $l_n \cdot l_{n-1}$ variables $\{w^{[n]}_{v,k} \mid v \in [1, l_n], k \in [1, l_{n-1}]\}$ and $m \cdot l_{n-1}$ coefficients $\{\frac{\partial h^{[n-1]}_k(x)}{\partial x_t} \mid k \in [1, l_{n-1}], t \in [1, m]\}$. Any solution of $L(n, -)$ makes surface $x$ an extremum point of each binary classification function. We can then select a particular solution such that when surface $x$ belongs to the $v$-th essence, the corresponding extremum is a maximum, and when $x$ belongs to other essences, the extremum is a minimum. That is, $h^{[n]}_v(x)$ satisfies the ideal termination condition:

$$
h^{[n]}_v(x) = \begin{cases}
1 & y = v \\
0 & y \neq v
\end{cases}, \quad y \in [1, l_n]
$$

If finding this particular solution proves difficult, constraints can be relaxed by adopting a weakened termination condition:

$$
h^{[n]}_v(x) \in \begin{cases}
(0.5, 1] & y = v \\
[0, 0.5) & y \neq v
\end{cases}, \quad y \in [1, l_n]
$$

3.3 Continuous Optimization of Parameter Combinations

The above discussion covers only the case with a single training sample. How should we proceed when the number of samples increases? Let:

$$
L(n, v, x^{(i)}) = \begin{cases}
\sum_{k=1}^{l_{n-1}} w^{[n]}_{v,k} \cdot \frac{\partial h^{[n-1]}_k(x)}{\partial x_1}\bigg|_{x = x^{(i)}} = 0 \\
\sum_{k=1}^{l_{n-1}} w^{[n]}_{v,k} \cdot \frac{\partial h^{[n-1]}_k(x)}{\partial x_2}\bigg|_{x = x^{(i)}} = 0 \\
\vdots \\
\sum_{k=1}^{l_{n-1}} w^{[n]}_{v,k} \cdot \frac{\partial h^{[n-1]}_k(x)}{\partial x_m}\bigg|_{x = x^{(i)}} = 0
\end{cases}
$$

Then:

$$
L(n, -, x^{(i)}) = L(n, 1, x^{(i)}) \cup L(n, 2, x^{(i)}) \cup \cdots \cup L(n, l_n, x^{(i)})
$$

When training the neural network with dataset $\Phi = \{(x^{(i)}, y^{(i)}) \mid i \in [1, \phi]\}$, we are actually solving the following homogeneous linear system:

$$
L(n, -, \Phi) = L(n, -, x^{(1)}) \cup L(n, -, x^{(2)}) \cup \cdots \cup L(n, -, x^{(\phi)})
$$

If $L(n, -, \Phi)$ has infinitely many solutions, a particular solution meeting the conditions can be found from the general solution. Otherwise, parameters between the $(n-2)$-th and $(n-1)$-th layers must be introduced. That is, we need to expand the partial derivatives in system $L(n, v)$ again.
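A sketch of this decision point (again with synthetic coefficient blocks; the shapes are assumptions chosen only for illustration): the per-sample systems $L(n, v, x^{(i)})$ are stacked vertically, and the rank of the stacked matrix tells us whether the output-layer weights alone still admit a non-trivial solution or whether the parameters of the previous layer must be introduced.

import numpy as np

rng = np.random.default_rng(1)

m, l_prev, phi = 4, 6, 3             # surface dimension, last-hidden-layer width, number of samples
blocks = [rng.standard_normal((m, l_prev)) for _ in range(phi)]   # one coefficient block per sample x^{(i)}
A = np.vstack(blocks)                # coefficient matrix of L(n, v, Phi), shape (phi * m, l_prev)

if np.linalg.matrix_rank(A) < l_prev:
    print("non-trivial general solution exists; update W^[n] only")
else:
    print("only the trivial solution; expand to L(n-1, v, Phi) and include W^[n-1]")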

Substituting $\frac{\partial h^{[n-1]}_k(x)}{\partial x_t} = c^{[n-1]}_k(x) \cdot \sum_{p=1}^{l_{n-2}} w^{[n-1]}_{k,p} \cdot \frac{\partial h^{[n-2]}_p(x)}{\partial x_t}$ into $L(n, v)$ and simplifying yields:

$$
L(n-1, v) = \begin{cases}
\sum_{k=1}^{l_{n-1}} w^{[n]}_{v,k} \cdot \sum_{p=1}^{l_{n-2}} w^{[n-1]}_{k,p} \cdot \frac{\partial h^{[n-2]}_p(x)}{\partial x_1} = 0 \\
\sum_{k=1}^{l_{n-1}} w^{[n]}_{v,k} \cdot \sum_{p=1}^{l_{n-2}} w^{[n-1]}_{k,p} \cdot \frac{\partial h^{[n-2]}_p(x)}{\partial x_2} = 0 \\
\vdots \\
\sum_{k=1}^{l_{n-1}} w^{[n]}_{v,k} \cdot \sum_{p=1}^{l_{n-2}} w^{[n-1]}_{k,p} \cdot \frac{\partial h^{[n-2]}_p(x)}{\partial x_m} = 0
\end{cases}
$$

Similarly, we can obtain system $L(n-1, -)$, which forms a homogeneous nonlinear system of $m \cdot l_n$ equations with $l_{n-1} \cdot l_n + l_{n-2} \cdot l_{n-1}$ independent variables $\{w^{[n]}_{v,k} \mid v \in [1, l_n], k \in [1, l_{n-1}]\}$ and $\{w^{[n-1]}_{k,p} \mid k \in [1, l_{n-1}], p \in [1, l_{n-2}]\}$. Although $L(n-1, -)$ appears nonlinear, its highly regular structure allows solution methods for homogeneous linear equations to be applied (for instance, $w^{[n]}_{v,k} \cdot w^{[n-1]}_{k,p}$ can be treated as a single entity). We then simply need to find the particular solution of system $L(n-1, -, \Phi)$ that meets the requirements.

By solving homogeneous equations layer by layer, the dataset can be mapped to the neural network.

4.1 General Training Method

From the above discussion, we have derived a preliminary model training framework, which we call the EI algorithm. Its main steps differ significantly from current neural network training methods such as backpropagation. First, BP uses gradient updates to approximate ideal parameter values, while EI attempts to directly obtain parameter values by solving systems of equations. Second, BP updates all parameters each iteration, whereas EI only updates some parameters. Adjusting the parameters so that every training sample becomes an extremum point of the model is the key to the entire framework. This subsection discusses algorithmic details in greater depth.

Table 1 [TABLE:1] shows the neural network parameter states under the EI algorithm at each round, where $W^{[u]} = \{w^{[u]}_{v,k} \mid v \in [1, l_u], k \in [1, l_{u-1}]\}$, "init" indicates parameters remain at their initial values, and "update" indicates parameters are updated in the current round. In the first round, we solve system $L(n, -, \Phi)$. If a solution exists, only the parameters $W^{[n]}$ need updating. Otherwise, in the second round, we solve $L(n-1, -, \Phi)$. If a solution exists, the parameters $W^{[n]}$ and $W^{[n-1]}$ are updated. This process continues with $L(n-2, -, \Phi)$, and so on.

Algorithm 4.1 presents the main steps for precisely mapping a dataset to the neural network model. Symbols retain their previous meanings unless otherwise specified. In the algorithm's initial stage, we manually label the sample set $\Phi$. If sample $(x^{(i)}, y^{(i)})$ is classified as the $j$-th essence ($j \in [1, l_n]$), then $(x^{(i)}, y^{(i)}) = (x^{(i)}, j)$. Subsequently, we initialize parameter set $W$ to non-zero real numbers.

Algorithm 4.1: Precise Mapping from Input to Output

Input: $\Phi = \{(x^{(i)}, y^{(i)}) \mid i \in [1, \phi]\}$
Output: $W = \{w^{[u]}_{v,k} \mid u \in [1, n], v \in [1, l_u], k \in [1, l_{u-1}]\}$

function FittingCurve()
    Init(W)
    for u ∈ [1, n-1], v ∈ [1, l_u], t ∈ [1, m], i ∈ [1, φ] do
        Calculate(∂h^{[u]}_v(x)/∂x_t|_{x=x^{(i)}})  // for calculating W^{[u:n]}
    end for
    for u ∈ [1, n-1], v ∈ [1, l_u], i ∈ [1, φ] do
        Calculate(h^{[u]}_v(x^{(i)}))
    end for
    u ← n
    while u ≥ 1 do
        W^{[u:n]} ← Solve(L(u, -, Φ))  // the general solution, i.e., {W^{[j]} | j ∈ [u, n]}
        W^{[u:n]} ← Polarize({h^{[n]}_v(x) | v ∈ [1, l_n]}, W^{[u:n]}, Φ)  // select a particular solution from it
        if W^{[u:n]} ≠ {} ∧ W^{[u:n]} ≠ {0} then
            W ← Update(W^{[u:n]})
            break
        end if
        u ← u - 1
    end while
    if u ≥ 1 then
        return W
    else
        return Error()
    end if
end function

Like the BP algorithm, parameter updates proceed layer by layer from the last hidden layer to the first. We first calculate the values and partial derivatives of all hidden-layer neurons, then solve system $L(u, -, \Phi)$ for its general solution, which spans the parameter sets $W^{[u:n]} = \{W^{[j]} \mid j \in [u, n]\}$. Next, we select from this general solution a particular solution $W^{[u:n]}$ satisfying the termination condition, an operation we call polarization. If a particular solution is found, the parameters $W$ are updated. If not, the parameter sets $W^{[u:n]}$ alone cannot precisely map the sample set $\Phi$ to the neural network, and the parameters $W^{[u-1]}$ must be introduced to find a particular solution $W^{[u-1:n]}$ of system $L(u-1, -, \Phi)$. If no suitable particular solution is found after traversing all neural network parameters, we must consider adjusting the network structure (e.g., increasing the number of hidden layers or nodes per layer).
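Since polarization is currently done by enumeration (see Section 6.1), the sketch below illustrates that step in a deliberately reduced setting: random combinations of the null-space basis are tried until one pushes the sample's output to the correct side of the weakened termination condition. The sigmoid helper, the 0.5 threshold, and the scaling factor are our own simplifying assumptions rather than the paper's exact procedure.

import numpy as np
from scipy.linalg import null_space

rng = np.random.default_rng(2)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

m, l_prev = 4, 6
J = rng.standard_normal((m, l_prev))          # d h^{[n-1]}_k / d x_t at a sample of the v-th essence
h_prev = rng.uniform(0.1, 0.9, size=l_prev)   # hidden-layer values h^{[n-1]}_k(x) at the same sample
basis = null_space(J)                         # general solution of L(n, v, x)

for attempt in range(1000):                   # polarization by enumeration
    w = 10.0 * (basis @ rng.standard_normal(basis.shape[1]))   # scaling pushes the sigmoid toward 0 or 1
    if sigmoid(w @ h_prev) > 0.5:             # weakened condition for y = v; use < 0.5 for other essences
        print("particular solution found after", attempt + 1, "attempts")
        break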

4.2 Reducing Computational Complexity

In Algorithm 4.1, the time required for polarization, that is, selecting a particular solution $W^{[u:n]}$ from the general solution of $L(u, -, \Phi)$, is uncertain because we lack knowledge of the particular solution's characteristics and must verify each instance of the general solution through enumeration. To reduce training time, we can relax the model training termination condition. Specifically, when sample $(x^{(i)}, y^{(i)})$ belongs to the $v$-th essence, the $v$-th binary classification function $h^{[n]}_v(x^{(i)})$ only needs to be much larger than any other binary classification function $h^{[n]}_q(x^{(i)})$, without requiring these values to be maxima or minima. This corresponds to the following weakened condition:

$$
\frac{h^{[n]}_v(x^{(i)})}{\sum_{j=1}^{l_n} h^{[n]}_j(x^{(i)})} > 1 - \alpha, \quad v \in [1, l_n], \; y^{(i)} = v
$$

$$
\frac{h^{[n]}_q(x^{(i)})}{\sum_{j=1}^{l_n} h^{[n]}_j(x^{(i)})} < \beta, \quad \text{for any } q \in [1, l_n], \; q \neq v
$$

where $\alpha$ and $\beta$ are sufficiently small positive real numbers. This describes neural networks using the softmax function as the output layer. Thus, a neural network with softmax can be viewed as a weakened version of an ideal model.
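A minimal numeric check of this weakened condition (the output values, $\alpha$, and $\beta$ below are arbitrary illustrative numbers, not taken from the paper):

import numpy as np

alpha, beta = 0.1, 0.1
outputs = np.array([0.04, 0.02, 0.93, 0.01])   # h^{[n]}_j(x^{(i)}) for l_n = 4; the true essence is v = 3
ratios = outputs / outputs.sum()               # softmax-style normalization of the output layer

v = 2                                          # 0-based index of the true essence
print(ratios[v] > 1 - alpha)                   # the v-th ratio dominates
print(np.all(np.delete(ratios, v) < beta))     # every other ratio is negligible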

4.3 Reducing Computational Scale

If each training sample corresponds to an extreme point on the model curve—i.e., by adding equation set $L(n, -, x^{(i)})$—the required network parameter scale becomes extremely large and training time increases significantly. Can we reduce the number of equation sets? To address this, we propose the concept of surface neighborhood. In a further weakened neural network, only a portion of samples need to be extreme points; other samples only need to satisfy the weakened termination condition. Which samples can have relaxed restrictions? An intuitive idea is that only representatives of adjacent samples need to satisfy strict conditions.

Let $A = (x^{(a)}, y^{(a)})$ and $B = (x^{(b)}, y^{(b)})$ be two samples in dataset $\Phi = \{(x^{(i)}, y^{(i)}) \mid i \in [1, \phi]\}$, where $a, b \in [1, \phi]$. The distance between these samples is defined as:

$$
D_s(A, B) = \sqrt{\frac{2}{\text{dim}(x)} \sum_{j=1}^{\text{dim}(x)} (x^{(a)}_j - x^{(b)}_j)^2}
$$

If $A$ and $B$ are samples of the same essence ($y^{(a)} = y^{(b)}$) and satisfy the proximity criterion:

$$
D_s(A, B) < \gamma
$$

where $\text{dim}(x)$ represents sample surface dimension and $\gamma$ is a sufficiently small positive real number, then samples $A$ and $B$ of the same essence are considered to be within each other's neighborhood. Due to function continuity, the function values of samples $A$ and $B$ are close on each binary classification function. Thus, one sample need not be sent to the algorithm for training but only requires function value verification. A further weakened training algorithm can be adjusted as follows: (1) Manually classify the training sample set and designate it as the major category. (2) Use a numerical algorithm (e.g., clustering) to divide samples of each major category into minor categories, each with a central sample. (3) Train the model on all central samples. (4) After training, verify whether predicted values of all non-central samples on the neural network meet specified accuracy requirements. If accuracy requirements are satisfied, the algorithm terminates; otherwise, mark non-central samples that fail to meet requirements as central samples and repeat steps 3 and 4.
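The sketch below implements the distance $D_s$ and the central-sample selection of steps (1)-(3) with an off-the-shelf clustering routine; the choice of KMeans and all parameter values are our own assumptions, since the paper only calls for "a numerical algorithm (e.g., clustering)".

import numpy as np
from sklearn.cluster import KMeans

def surface_distance(xa, xb):
    # D_s(A, B) = sqrt( 2 / dim(x) * sum_j (xa_j - xb_j)^2 )
    return float(np.sqrt(2.0 / xa.size * np.sum((xa - xb) ** 2)))

def central_samples(surfaces, n_minor):
    # Split one major category into n_minor minor categories and return one central surface per category.
    km = KMeans(n_clusters=n_minor, n_init=10, random_state=0).fit(surfaces)
    centres = []
    for c in range(n_minor):
        members = surfaces[km.labels_ == c]
        dists = [surface_distance(s, km.cluster_centers_[c]) for s in members]
        centres.append(members[int(np.argmin(dists))])
    return np.stack(centres)

surfaces = np.random.default_rng(3).uniform(0.0, 1.0, size=(40, 8))   # 40 surfaces of one major category
print(central_samples(surfaces, n_minor=4).shape)                      # (4, 8): only these enter training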

5.1 Gradient Vanishing/Explosion

Gradient vanishing/explosion is a common and challenging issue in neural network training, particularly in deep networks. To mitigate these effects, scholars have proposed various methods such as batch normalization \cite{Santurkar et al., 2018} and LSTM architectures \cite{Yu et al., 2019}. In the BP algorithm, gradient vanishing/explosion is typically regarded as an abnormal issue to be avoided.

Regarding gradient vanishing, as discussed in Sections 3 and 4, after network parameter initialization, the number of parameter updates required varies with sample size. If a particular solution $W^{[u:n]}$ can be found from the general solution of $L(u, -, \Phi)$, then the parameters $W^{[1:u-1]}$ of the earlier hidden layers can remain at their initial values. According to the neural network characteristics revealed by the EI algorithm, gradient vanishing is an inevitable result.

Gradient explosion is similar. In the EI algorithm, when solving system $L(1, -, \Phi)$, cases may arise where no solution exists—i.e., the solution value is infinite—corresponding to gradient explosion in the BP algorithm. If the EI algorithm is adopted, simply increasing the number of hidden layers or parameters per layer suffices.

5.2 Overfitting

The overfitting problem \cite{Santos and Papa, 2022} appears different from gradient vanishing/explosion but is essentially caused by the same operational process. In the EI algorithm, if system $L(1, -, \Phi)$ has solutions, but when increasing the number of samples, system $L(1, -, \Phi_\Delta)$ may have no solution (where $\Phi \subset \Phi_\Delta$), the neural network with its current parameter scale can only accommodate a limited number of samples $\Phi$, manifesting as the overfitting phenomenon in the BP algorithm. This is an inherent characteristic of neural networks: only a limited number of extreme values exist under parameter constraints. Rather than calling it overfitting, we might say it fits exactly right.

The BP algorithm reduces model dependence on training samples by adding noise to samples and network parameters \cite{Ying, 2019}. This method resembles the clustering operation described in Section 4.3, enabling a fixed-structure neural network to accommodate more samples, but often at the cost of model accuracy. Another approach is to increase the number of hidden layers or parameters per layer—i.e., increase independent variables in system $L(1, -, \Phi_\Delta)$—thereby accommodating more samples without sacrificing accuracy, at the cost of increased training time.

5.3 Adding Noise

During neural network training, robustness is often enhanced by adding noise to existing samples and retraining \cite{Xia et al., 2022}. We observe that after adding noise, even when humans perceive minimal difference between original and modified samples, machine prediction accuracy drops sharply. This phenomenon can be explained by the neighborhood concept. Let the initial sample be $A = (x, y)$ where $x = (x_1, x_2, \ldots, x_m)^T$, and the noisy sample be $A_\Delta = (x_\Delta, y_\Delta)$ where $x_\Delta = (x_1 + \Delta x_1, x_2 + \Delta x_2, \ldots, x_m + \Delta x_m)^T$. Then:

$$
D_s(A, A_\Delta) = \sqrt{\frac{2}{\text{dim}(x)} \sum_{j=1}^{\text{dim}(x)} (\Delta x_j)^2}
$$

The noisy sample may significantly deviate from the original sample's neighborhood. If it also falls outside the neighborhood of other same-essence samples, the neural network cannot correctly process it ($y_\Delta \neq y$). With too many noisy samples, model convergence becomes difficult because we can add random noise arbitrarily. This is why we call a neural network's input vector a "surface"—there is a significant difference between what a neural network perceives and what humans see.
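As a quick worked case (with a hypothetical uniform perturbation, not an example taken from the paper), if every component of the surface is shifted by the same amount $\varepsilon$, the distance reduces to

$$
D_s(A, A_\Delta) = \sqrt{\frac{2}{\text{dim}(x)} \cdot \text{dim}(x) \cdot \varepsilon^2} = \sqrt{2}\,\varepsilon,
$$

so whether the noisy sample escapes its original neighborhood is decided by comparing $\sqrt{2}\,\varepsilon$ with the threshold $\gamma$, regardless of the surface dimension.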

5.4 Shallow vs. Deep Networks

From the discussions in Sections 3 and 4, the number of samples a neural network can precisely fit is primarily positively correlated with the total number of network parameters, not necessarily with network depth. If the number of samples is limited, we can directly adopt a network structure with only one hidden layer. According to the condition for homogeneous linear equations to have general solutions, a single-hidden-layer network's parameters should exceed the product of sample number, surface dimension, and essence types. If the number of samples is large and can increase dynamically, we can adopt a "tilted trapezoidal" network structure where the last hidden layer has the most parameters, decreasing successively toward the first hidden layer. This minimizes computation of invalid equation sets in the EI algorithm.
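For a rough sense of scale (the numbers below are hypothetical, not from the paper), take $\phi = 1000$ samples, surface dimension $m = 784$, and $l_n = 10$ essence types. A bias-free single-hidden-layer network with $l_1$ hidden neurons has $l_1(m + l_n)$ parameters, so the rule requires

$$
l_1 (784 + 10) > 1000 \times 784 \times 10 \quad \Rightarrow \quad l_1 > \frac{7{,}840{,}000}{794} \approx 9874,
$$

that is, at least $9875$ neurons, on the order of $10^4$, in the single hidden layer.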

5.5 Probability

The traditional view holds that a neural network's output layer provides the probability that an input surface belongs to different essences. We argue this view is not entirely accurate, at least not in the strict statistical sense. In statistics, a random event's probability is defined as the ratio of certain outcomes to total possible outcomes. No matter how large our training set, we cannot exhaust or nearly exhaust the entire sample space, and no clear relationship exists between a finite sample set and infinite sample space. For instance, we can add various noises to existing samples, easily expanding the sample set several-fold or infinitely. Additionally, as shown in Figure 8 [FIGURE:8], the training sample set does not necessarily occupy all extreme points of the trained binary classification function $h^{[n]}_v(x)$. Unoccupied maximum points are not necessarily occupied by $v$-th essence samples, and minimum points are not necessarily occupied by non-$v$-th essence samples. In extreme cases, even if a sample makes $h^{[n]}_v(x) = 1$ hold, it may still be a non-$v$-th essence sample, though this is rare.

6.1 Polarization

Beyond enumeration, we have not yet proposed an efficient algorithm for finding a particular solution from a general solution. This is key to the practical applicability of the EI algorithm. Polarization can be summarized as the following problem:

Problem 1: Given a set of binary classification functions $\{h^{[n]}_v(x) \mid v \in [1, l_n]\}$, how can we flip their local extreme points and arbitrarily expand their values?

For example, for parabolic function $y = a(x - b)^2$, changing parameter $a$'s sign flips its extreme point, while increasing $|a|$ increases the extreme value. Furthermore, aside from $L(n, -, \Phi)$, systems $L(u, -, \Phi)$ with $u \in [1, n-1]$ are atypical homogeneous linear equations. Whether their unique structure facilitates obtaining particular solutions warrants further investigation.

6.2 The Output Layer

For computational convenience and demonstration, we adopted the sigmoid function as the output layer processing unit. Although Section 4.2 demonstrated that neural networks using the softmax function correspond to weakened models, complete partial differential analysis of the softmax function remains worth discussing.

6.3 Activation Functions

Our analysis assumes neural networks are continuous functions—i.e., hidden layer neurons use the continuous sigmoid function. How should analysis proceed if other functions are adopted, particularly non-differentiable functions like ReLU?

6.4 Saddle Points

Our discussion assumes that a sample satisfying the system $\{\frac{\partial h^{[n]}_v(x)}{\partial x_t} = 0 \mid t \in [1, m]\}$ is an extreme point of the binary classification function. For multivariate functions, zero first-order partial derivatives do not necessarily imply an extreme point; the point could be a saddle point. This is not problematic if we can find global maxima or minima using the polarization algorithm; otherwise, this remains a topic worthy of discussion.

6.5 Alternative Functions

Our analysis reveals that neural networks' strong generalization ability depends on dynamic variability of their function curves, particularly dynamic adjustment of extreme points. Can other functions with similar properties provide equally strong generalization? For example, the sine function has infinitely many extreme points with range limited to a finite interval, and a polynomial's number of extreme points is positively correlated with its degree. These seemingly simple functions may possess unexpected generalization capabilities.

7 Summary

From a mathematical perspective, we identify the reason for neural networks' strong generalization capabilities, supplementing the limitations in Cybenko \cite{Cybenko, 1989} and Hornik et al. \cite{Hornik et al., 1989}. We also present the corresponding EI algorithm, which differs from the BP algorithm. Without an effective polarization method, the EI algorithm can at least serve as an important submodule of the BP algorithm—i.e., we can first initialize parameters using an EI algorithm general solution instance, then train using BP. If an efficient polarization algorithm is discovered, the EI algorithm could become a strong competitor to BP, particularly when the most ideal model is required.

Acknowledgments and Disclosure of Funding

I am deeply grateful to my supervisor, Professor Yang Lihua, who provided initial guidance and assistance on the basic research direction when I began my research journey. Without his guidance, I might have chosen a different path and missed this study. This research was funded by the Science and Technology Research Project of Key Areas in Nanhai District, Foshan City (Grant No. 2230032004637), for which I express my gratitude.

References

V. Buhrmester, D. Münch, and M. Arens. Analysis of explainers of black box deep neural networks for computer vision: A survey. Machine Learning and Knowledge Extraction, 3(4):966–989, 2021.

G. Cybenko. Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems, 2(4):303–314, 1989.

K. Hornik, M. Stinchcombe, and H. White. Multilayer feedforward networks are universal approximators. Neural Networks, 2(5):359–366, 1989.

B. R. Kiran, I. Sobh, V. Talpaert, P. Mannion, A. A. Al Sallab, S. Yogamani, and P. Pérez. Deep reinforcement learning for autonomous driving: A survey. IEEE Transactions on Intelligent Transportation Systems, 23(6):4909–4926, 2021.

Z. C. Lipton. The mythos of model interpretability: In machine learning, the concept of interpretability is both important and slippery. Queue, 16(3):31–57, 2018.

S. J. Oh, B. Schiele, and M. Fritz. Towards reverse-engineering black-box neural networks. In Explainable AI: Interpreting, Explaining and Visualizing Deep Learning, pages 121–144, 2019.

M. T. Ribeiro, S. Singh, and C. Guestrin. "Why should I trust you?" Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1135–1144, 2016.

C. F. G. D. Santos and J. P. Papa. Avoiding overfitting: A survey on regularization methods for convolutional neural networks. ACM Computing Surveys (CSUR), 54(10s):1–25, 2022.

S. Santurkar, D. Tsipras, A. Ilyas, and A. Madry. How does batch normalization help optimization? Advances in Neural Information Processing Systems, 31, 2018.

R. Shwartz-Ziv and N. Tishby. Opening the black box of deep neural networks via information. arXiv preprint arXiv:1703.00810, 2017.

N. Tishby and N. Zaslavsky. Deep learning and the information bottleneck principle. In 2015 IEEE Information Theory Workshop (ITW), pages 1–5. IEEE, 2015.

J. Wu, S. Yang, R. Zhan, Y. Yuan, L. S. Chao, and D. F. Wong. A survey on LLM-generated text detection: Necessity, methods, and future directions. Computational Linguistics, pages 1–66, 2025.

W. Xia, Y. Zhang, Y. Yang, J.-H. Xue, B. Zhou, and M.-H. Yang. GAN inversion: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(3):3121–3138, 2022.

X. Ying. An overview of overfitting and its solutions. In Journal of Physics: Conference Series, volume 1168, page 022022. IOP Publishing, 2019.

Y. Yu, X. Si, C. Hu, and J. Zhang. A review of recurrent neural networks: LSTM cells and network architectures. Neural Computation, 31(7):1235–1270, 2019.
