A Nonparallel Support Tensor Machine for Binary Classification Based on Large Margin Distribution and Iterative Optimization
Du Zhuolin, Song Yisheng
Submitted 2025-07-17 | ChinaXiv: chinaxiv-202507.00342

Abstract

Based on the tensor-based large margin distribution and the nonparallel support tensor machine, we establish a novel classifier for binary classification problems in this paper, termed the Large Margin Distribution based NonParallel Support Tensor Machine (LDM-NPSTM). The proposed classifier has the following advantages: First, it utilizes tensor data as training samples, which helps to comprehensively preserve the inherent structural information of high-dimensional data, thereby improving classification accuracy. Second, this classifier not only considers traditional empirical risk and structural risk but also incorporates the marginal distribution information of the samples, further enhancing its classification performance. To solve this classifier, we use an alternating projection algorithm. Specifically, building on the formulation in which the parameters defining the separating hyperplane form a tensor (tensorplane) constrained to be a sum of rank-one tensors, the corresponding optimization problem is solved iteratively using an alternating projection algorithm. In each iteration, the parameters related to the projections along a single tensor mode are estimated by solving a typical Support Vector Machine-type optimization problem. Finally, the efficiency and performance of the proposed model and algorithm are verified through theoretical analysis and numerical examples.

Full Text

Preamble

A Nonparallel Support Tensor Machine for Binary Classification based on Large Margin Distribution and Iterative Optimization*

Zhuolin Du, Yisheng Song†
School of Mathematical Sciences, Chongqing Normal University, Chongqing, 401331, P.R. China.

Email: duzhuolin728@163.com (Du); yisheng.song@cqnu.edu.cn (Song)

Abstract

Based on the tensor-based large margin distribution and the nonparallel support tensor machine, we establish a novel classifier for binary classification problems in this paper, termed the Large Margin Distribution based NonParallel Support Tensor Machine (LDM-NPSTM). The proposed classifier has the following advantages: First, it utilizes tensor data as training samples, which helps to comprehensively preserve the inherent structural information of high-dimensional data, thereby improving classification accuracy. Second, this classifier not only considers traditional empirical risk and structural risk but also incorporates the marginal distribution information of the samples, further enhancing its classification performance. To solve this classifier, we use an alternating projection algorithm. Specifically, building on the formulation where the parameters defining the separating hyperplane form a tensor (tensorplane) constrained to be the sum of rank-one tensors, the corresponding optimization problem is solved iteratively using an alternating projection algorithm. In each iteration, the parameters related to the projections along a single tensor mode are estimated by solving a typical Support Vector Machine-type optimization problem. Finally, the efficiency and performance of the proposed model and algorithm are verified through theoretical analysis and numerical examples.

Keywords. Nonparallel support tensor machine; margin distribution; CANDECOMP/PARAFAC (CP) decomposition.

AMS subject classifications. 62H30, 15A63, 90C55.

Introduction

A significant number of real-world datasets, especially those involving image data, are frequently represented in tensor format. For instance, a grayscale face image [10] can be modeled as a second-order tensor (or matrix), while color images [12, 26], grayscale video sequences [13], gait contour sequences [22], and hyperspectral cubes [28] are typically expressed as third-order tensors. Additionally, color video sequences are often represented as fourth-order tensors [18, 38]. Tensor-based data representations, however, pose particular difficulties for classifier design: their multi-dimensional structure makes spatial and temporal relationships harder to capture, their high dimensionality invites the curse of dimensionality and overfitting, and they impose substantial storage and computational costs, all of which complicate feature extraction and representation.

One of the most representative and successful classification algorithms is Support Vector Machines (SVM) [6, 30, 31], which have been successfully applied to a variety of real-world pattern recognition problems, such as text classification [15, 1], image classification [29, 27], feature extraction [21, 33, 9], web mining [3], and function estimation [4, 25]. The central idea of SVM is to find the optimal separating hyperplane between positive and negative examples. The optimal hyperplane is defined as the one giving maximum margin between the training examples that are closest to the hyperplane. Different from traditional SVM, in 2007, Jayadeva et al. [16] proposed Twin SVM (TWSVM), which also aims at generating two nonparallel planes such that each plane is closer to one of the two classes and is as far as possible from the other. Notably, the formulation of TWSVMs is very much in line with standard SVMs. However, TWSVMs seek to solve two dual QPPs of smaller size rather than solving a single dual QPP with a large number of parameters in conventional SVM. As a result, the algorithm achieves a processing speed roughly fourfold faster compared to traditional SVM [16]. While the aforementioned classification methods focus on maximizing the minimum margin, research by Gao et al. [11, 39] has indicated that doing so does not necessarily guarantee improved generalization performance. Instead, the distribution of margins has been shown to play a more critical role. Here, the margin distribution is defined by the margin mean and margin variance. Therefore, in order to improve the generalization performance of SVM, Zhou et al. [41] characterized margin distribution through its mean and variance, leading to the development of the Large Margin Distribution Machine (LDM), which builds upon the SVM framework. The effectiveness of LDM has been proved in theory and experiments.

In recent years, there has been a growing interest in extending traditional vector or matrix-based machine learning algorithms to better handle tensor data [5, 14]. This shift is motivated by the need to effectively process high-dimensional datasets, such as those encountered in image and video analysis. In 2005, Tao et al. [35] proposed a Supervised Tensor Learning (STL) scheme by replacing vector inputs with tensor inputs and decomposing the corresponding weight vector into a rank-1 tensor, which is trained by the alternating projection optimization method. Based on this learning scheme, in 2007, Tao et al. [36] further extended the standard linear SVM to a tensorial format known as the Support Tensor Machine (STM). This adaptation allows for more effective classification of tensor data by leveraging its inherent structure. Following this development, Zhang et al. [42] generalized the vector-based learning algorithm TWSVM to the tensor-based method Twin STM (TWSTM), and implemented the classifier for microcalcification clusters detection. By comparison with TWSVM, the tensor version reduces the overfitting problem significantly. Additionally, Khemchandani et al. developed a least squares variant of STM, termed Proximal STM (PSTM) [17], where the classifier is obtained by solving a system of linear equations rather than a quadratic programming problem at each iteration of the PSTM algorithm as compared to the STM algorithm. This modification enhances computational efficiency while maintaining classification performance. Tensor-based algorithms on the other hand decompose the whole problem into several smaller and simpler subproblems, each defined over specific tensor modes and characterized by lower dimensionality. This decomposition has been shown to reduce the degree of overfitting that appears in vector-based learning techniques, particularly when few training samples are available [36].

In this paper, we propose a novel framework, termed Nonparallel Support Tensor Machine based on Large Margin Distribution (LDM-NPSTM), aimed at further enhancing the generalization performance of the Twin Support Tensor Machine (TWSTM). Drawing on the strengths of Large Margin Distribution theory [41] and TWSTM [42], our approach integrates their core principles to address classification challenges more effectively. Specifically, we characterize the margin distribution using first-order (margin mean) and second-order (margin variance) statistics, with the core objective of maximizing the margin mean while minimizing the margin variance to improve classification robustness. To ensure a more rigorous model structure, we incorporate a regularization term into the LDM-NPSTM framework, balancing empirical risk and structural complexity. For model optimization, we adopt an iterative solution based on CANDECOMP/PARAFAC (CP) decomposition. In each iteration, parameters corresponding to projections along a single tensor mode are estimated by solving a typical SVM-type optimization problem. Notably, the inverse matrix involved in the dual problem is inherently nonsingular, eliminating the need for additional assumptions and simplifying the computational process.

The remainder of this paper is structured as follows: In Section 2, we introduce the notations consistently used throughout the paper and provide a concise overview of fundamental concepts, including those related to SVM, TWSVM, TWSTM, and LDM. Section 3 elaborates on our proposed framework, the LDM-NPSTM, with detailed formulations and a discussion of its key advantages. Experimental results that demonstrate the effectiveness of the LDM-NPSTM are discussed in Section 4. Finally, concluding remarks are given in Section 5.

2 Preliminaries

In this section, we first introduce some notation and basic definitions used throughout the paper, and then briefly review related works.

2.1 Notation and Basic Definitions

An m-th order tensor is defined as a collection of measurements indexed by m indices, with each index corresponding to a mode. Vectors are considered first-order tensors, while matrices represent second-order tensors [19].

In this paper, we utilize lowercase letters (e.g., $x$) to denote scalars, boldface lowercase letters (e.g., $\mathbf{x}$) and boldface capital letters (e.g., $\mathbf{X}$) to represent vectors and matrices, respectively. Tensors of order 3 or higher will be denoted by boldface Euler script calligraphic letters (e.g., $\mathcal{X}$). Furthermore, we denote the set of all $m$th-order $n$-dimensional real tensors as $\mathcal{T}_{m,n}$. The $i$-th element of a vector $\mathbf{x} \in \mathbb{R}^n$ is denoted by $x_i$, $i = 1, 2, \ldots, n$. Similarly, the elements of an $m$-th order tensor $\mathcal{X}$ will be denoted by $x_{i_1 i_2 \ldots i_m}$, where $i_j = 1, 2, \ldots, n_j$ for $j = 1, \ldots, m$. Moreover, we summarize some notations used throughout the paper in Table 1 [TABLE:1].

In the following, we introduce some notation and definitions of tensors and matrices in the area of multilinear algebra [7, 19].

Definition 2.1. (Inner product) Given tensors $\mathcal{X}, \mathcal{Y} \in \mathbb{R}^{I_1 \times \cdots \times I_M}$, the inner product of $\mathcal{X}$ and $\mathcal{Y}$ is defined as
$$
\langle \mathcal{X}, \mathcal{Y} \rangle := \sum_{i_1=1}^{I_1} \sum_{i_2=1}^{I_2} \cdots \sum_{i_M=1}^{I_M} x_{i_1i_2\ldots i_M} y_{i_1i_2\ldots i_M}. \tag{2.1}
$$

Definition 2.2. (Frobenius norm) The Frobenius norm of a tensor $\mathcal{A} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_M}$ is defined as
$$
\|\mathcal{A}\|_F := \sqrt{\langle \mathcal{A}, \mathcal{A} \rangle}.
$$

Remark 2.1. Given two same-sized tensors $\mathcal{A} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_M}$ and $\mathcal{B} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_M}$, the distance between tensors $\mathcal{A}$ and $\mathcal{B}$ is defined as $\|\mathcal{A} - \mathcal{B}\|_F$. Note that the Frobenius norm of the difference between two tensors equals the Euclidean distance of their vectorized representations [23].
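For concreteness, the following small NumPy sketch (illustrative only, not part of the original experiments) evaluates the inner product (2.1), the Frobenius norm, and the distance of Remark 2.1 on random third-order tensors.

```python
import numpy as np

# Illustrative sketch: the tensor inner product (2.1), the Frobenius norm,
# and the distance of Remark 2.1, computed with NumPy on random tensors.
rng = np.random.default_rng(42)
A = rng.standard_normal((3, 4, 5))    # a third-order tensor
B = rng.standard_normal((3, 4, 5))

inner = np.sum(A * B)                 # <A, B> as in (2.1)
fro_norm = np.sqrt(np.sum(A * A))     # ||A||_F = sqrt(<A, A>)
dist = np.sqrt(np.sum((A - B) ** 2))  # ||A - B||_F

# The Frobenius distance equals the Euclidean distance of the vectorizations.
assert np.isclose(dist, np.linalg.norm(A.ravel() - B.ravel()))
print(inner, fro_norm, dist)
```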

Definition 2.3. (Outer product) We use $\otimes$ to denote the tensor outer product; that is, for any two tensors $\mathcal{A} \in \mathcal{T}_{m,n}$ and $\mathcal{B} \in \mathcal{T}_{p,n}$, their outer product is given by
$$
\mathcal{A} \otimes \mathcal{B} = (a_{i_1\ldots i_m}b_{j_1\ldots j_p}) \in \mathcal{T}_{m+p,n}. \tag{2.3}
$$
According to this definition, it is easy to check that
$$
\underbrace{\mathbf{x} \otimes \mathbf{x} \otimes \cdots \otimes \mathbf{x}}_{k \text{ times}} = (x_{i_1} x_{i_2} \cdots x_{i_k}) \in \mathcal{T}_{k,n}.
$$

Definition 2.4. (CP decomposition) Given $\mathcal{X} \in \mathbb{R}^{I_1 \times \cdots \times I_M}$, if there exist $\mathbf{u}_r^{(1)} \in \mathbb{R}^{I_1}$, $\mathbf{u}_r^{(2)} \in \mathbb{R}^{I_2}$, $\ldots$, $\mathbf{u}_r^{(M)} \in \mathbb{R}^{I_M}$, $r = 1, \ldots, R$, such that
$$
\mathcal{X} = \sum_{r=1}^R \mathbf{u}_r^{(1)} \otimes \mathbf{u}_r^{(2)} \otimes \cdots \otimes \mathbf{u}_r^{(M)}, \tag{2.4}
$$
where $R$ is a positive integer, we call (2.4) a tensor CANDECOMP/PARAFAC (CP) decomposition of $\mathcal{X}$.
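As a quick illustration of Definition 2.4, the following NumPy sketch (with arbitrarily chosen sizes and rank) assembles a third-order tensor from hypothetical CP factor matrices whose columns are the vectors $\mathbf{u}_r^{(j)}$.

```python
import numpy as np

# Sketch: assembling a tensor from a CP decomposition (Definition 2.4) with
# hypothetical rank R = 3 factor matrices; not taken from the paper's code.
rng = np.random.default_rng(1)
I1, I2, I3, R = 6, 5, 4, 3
U1 = rng.standard_normal((I1, R))    # columns are u_r^(1)
U2 = rng.standard_normal((I2, R))    # columns are u_r^(2)
U3 = rng.standard_normal((I3, R))    # columns are u_r^(3)

# X = sum_r u_r^(1) o u_r^(2) o u_r^(3), cf. (2.4)
X = sum(np.einsum('i,j,k->ijk', U1[:, r], U2[:, r], U3[:, r]) for r in range(R))

# Equivalent one-shot contraction over the rank index.
assert np.allclose(X, np.einsum('ir,jr,kr->ijk', U1, U2, U3))
```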

Definition 2.5. (Matricization) Matricization (also known as unfolding or flattening) is the reordering of the elements of a tensor into a matrix. The $n$-mode matricization of a tensor $\mathcal{X} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_M}$, denoted by $\mathbf{X}_{(n)} \in \mathbb{R}^{I_n \times (\prod_{k \neq n} I_k)}$, arranges the $n$-mode fibers to be the columns of the resulting matrix. Each tensor element $(i_1, i_2, \ldots, i_M)$ maps to the matrix element $(i_n, j)$, where
$$
j = 1 + \sum_{\substack{k=1 \\ k \neq n}}^M (i_k - 1)J_k \quad \text{with} \quad J_k = \prod_{\substack{l=1 \\ l \neq n}}^{k-1} I_l.
$$
A more general treatment of matricization can be found in [20].
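The index formula above can be checked numerically; the sketch below (illustrative, with a hypothetical `unfold` helper) implements the mode-$n$ matricization of Definition 2.5 in NumPy and spot-checks the element mapping on a small tensor.

```python
import numpy as np

# Sketch of the n-mode matricization of Definition 2.5: move mode n to the
# front and flatten the remaining modes in column-major order, so that the
# element (i_1, ..., i_M) lands in row i_n and the column given by the index
# formula above. Illustrative only.
def unfold(X, n):
    return np.moveaxis(X, n, 0).reshape(X.shape[n], -1, order='F')

X = np.arange(24).reshape(2, 3, 4)   # a small 2 x 3 x 4 tensor
X2 = unfold(X, 1)                    # mode-2 unfolding, shape 3 x 8
assert X2.shape == (3, 2 * 4)

# Spot-check the index map for element (i1, i2, i3) = (1, 2, 3) (0-based).
i1, i2, i3 = 1, 2, 3
j = i1 + X.shape[0] * i3             # 0-based version of the column formula
assert X2[i2, j] == X[i1, i2, i3]
```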

Definition 2.6. (Matrix Kronecker product) The Kronecker product of matrices $\mathbf{A} \in \mathbb{R}^{I \times J}$ and $\mathbf{B} \in \mathbb{R}^{K \times L}$ is denoted by $\mathbf{A} \otimes \mathbf{B}$. The result is a matrix of size $(IK) \times (JL)$ and defined as
$$
\mathbf{A} \otimes \mathbf{B} = \begin{bmatrix}
a_{11}\mathbf{B} & a_{12}\mathbf{B} & \cdots & a_{1J}\mathbf{B} \\
a_{21}\mathbf{B} & a_{22}\mathbf{B} & \cdots & a_{2J}\mathbf{B} \\
\vdots & \vdots & \ddots & \vdots \\
a_{I1}\mathbf{B} & a_{I2}\mathbf{B} & \cdots & a_{IJ}\mathbf{B}
\end{bmatrix}.
$$

Definition 2.7. (Matrix Khatri-Rao product) Given matrices $\mathbf{A} \in \mathbb{R}^{I \times K}$ and $\mathbf{B} \in \mathbb{R}^{J \times K}$, their Khatri-Rao product is denoted by $\mathbf{A} \odot \mathbf{B}$. The result is a matrix of size $(IJ) \times K$ defined as
$$
\mathbf{A} \odot \mathbf{B} = [\mathbf{a}_1 \otimes \mathbf{b}_1 \quad \mathbf{a}_2 \otimes \mathbf{b}_2 \quad \cdots \quad \mathbf{a}_K \otimes \mathbf{b}_K].
$$

Remark 2.2. If matrices $\mathbf{A}$ and $\mathbf{B}$ of Definition 2.7 are vectors, i.e., $\mathbf{a}$ and $\mathbf{b}$, then the Khatri-Rao and Kronecker products are identical, i.e., $\mathbf{a} \otimes \mathbf{b} = \mathbf{a} \odot \mathbf{b}$.
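The following short NumPy sketch (illustrative only) computes the Kronecker product of Definition 2.6, builds the Khatri-Rao product of Definition 2.7 column by column, and checks the vector case of Remark 2.2.

```python
import numpy as np

# Sketch of Definitions 2.6-2.7: Kronecker product of two matrices and the
# Khatri-Rao product as a column-wise Kronecker product. Illustrative only.
A = np.array([[1., 2.], [3., 4.]])            # 2 x 2
B = np.array([[0., 1.], [1., 0.], [2., 2.]])  # 3 x 2

K = np.kron(A, B)                             # (2*3) x (2*2) Kronecker product

def khatri_rao(A, B):
    # Stack the column-wise Kronecker products, size (I*J) x K.
    return np.column_stack([np.kron(A[:, k], B[:, k]) for k in range(A.shape[1])])

KR = khatri_rao(A, B)
assert KR.shape == (A.shape[0] * B.shape[0], A.shape[1])

# Remark 2.2: for single-column matrices (vectors) the two products coincide.
assert np.allclose(khatri_rao(A[:, :1], B[:, :1]).ravel(), np.kron(A[:, 0], B[:, 0]))
```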

2.2 Related Works

Support Vector Machines (SVMs) form a class of supervised machine learning algorithms that train the classifier function using pre-labeled data. Specifically, for a given training set $\{(\mathbf{x}_i, y_i) \mid i = 1, \cdots, m\}$, where data points $\mathbf{x}_i \in \mathbb{R}^n$ and class labels $y_i \in \{-1, 1\}$, the objective of the support vector machine problem is to identify a hyperplane $\mathbf{w}^\top\mathbf{x} + b = 0$, where $\mathbf{w} \in \mathbb{R}^n$ and $b \in \mathbb{R}$, in such a way that the two different classes of data points are separated with maximal separation margin and minimal classification loss. The standard SVM problem can be formulated as the following convex quadratic program [37]:
$$
\begin{aligned}
\min_{\mathbf{w}, b, \boldsymbol{\xi}} \quad & \frac{1}{2}\|\mathbf{w}\|^2 + C \sum_{i=1}^m \xi_i \\
\text{s.t.} \quad & y_i(\mathbf{w}^\top\mathbf{x}_i + b) \geq 1 - \xi_i, \quad \xi_i \geq 0, \quad i = 1, 2, \ldots, m, \tag{2.5}
\end{aligned}
$$
where $\xi_i$ is a slack variable, and $C > 0$ is a penalty parameter that represents the loss weight.
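As a small illustration (not taken from the paper), the sketch below evaluates the objective of (2.5) for a fixed $(\mathbf{w}, b)$ by choosing the smallest feasible slacks $\xi_i = \max(0, 1 - y_i(\mathbf{w}^\top\mathbf{x}_i + b))$, i.e., the hinge loss.

```python
import numpy as np

# Sketch: evaluating the soft-margin objective of (2.5) for a given (w, b)
# using the optimal slacks for fixed (w, b); data and parameters are made up.
def svm_primal_objective(w, b, X, y, C):
    margins = y * (X @ w + b)             # y_i (w^T x_i + b)
    xi = np.maximum(0.0, 1.0 - margins)   # smallest feasible slack variables
    return 0.5 * w @ w + C * xi.sum()

rng = np.random.default_rng(3)
X = rng.standard_normal((20, 5))
y = np.sign(rng.standard_normal(20))
print(svm_primal_objective(rng.standard_normal(5), 0.1, X, y, C=1.0))
```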

Moreover, the standard SVM model requires solving a single large-scale optimization problem, which can be computationally intensive. To address this issue, Khemchandani et al. proposed Twin SVM (TWSVM) [16]. TWSVM generates two non-parallel planes such that each plane is closer to one of the two classes and is as far as possible from the other. This approach allows TWSVM to solve a pair of smaller-sized quadratic programming problems (QPPs) rather than a single large QPP, resulting in computational speed approximately four times faster than traditional SVMs. The optimization problems in TWSVM can be formulated as the following pair of quadratic programming problems:
$$
\begin{aligned}
\min_{\mathbf{w}^{(1)}, b^{(1)}, \mathbf{q}} \quad & \frac{1}{2}(\mathbf{A}\mathbf{w}^{(1)} + \mathbf{e}_1 b^{(1)})^\top(\mathbf{A}\mathbf{w}^{(1)} + \mathbf{e}_1 b^{(1)}) + c_1\mathbf{e}_2^\top\mathbf{q} \\
\text{s.t.} \quad & -(\mathbf{B}\mathbf{w}^{(1)} + \mathbf{e}_2 b^{(1)}) + \mathbf{q} \geq \mathbf{e}_2, \quad \mathbf{q} \geq \mathbf{0}, \tag{2.6}
\end{aligned}
$$
$$
\begin{aligned}
\min_{\mathbf{w}^{(2)}, b^{(2)}, \mathbf{q}} \quad & \frac{1}{2}(\mathbf{B}\mathbf{w}^{(2)} + \mathbf{e}_2 b^{(2)})^\top(\mathbf{B}\mathbf{w}^{(2)} + \mathbf{e}_2 b^{(2)}) + c_2\mathbf{e}_1^\top\mathbf{q} \\
\text{s.t.} \quad & -(\mathbf{A}\mathbf{w}^{(2)} + \mathbf{e}_1 b^{(2)}) + \mathbf{q} \geq \mathbf{e}_1, \quad \mathbf{q} \geq \mathbf{0}, \tag{2.7}
\end{aligned}
$$
where $c_1, c_2 > 0$ are penalty parameters, $\mathbf{e}_1$ and $\mathbf{e}_2$ are vectors of ones of appropriate dimensions, and $\mathbf{q} \in \mathbb{R}^n$ is the vector of slack variables to deal with linearly nonseparable problems.

Both SVM and TWSVM are vector-based learning algorithms that accept vectors as inputs. In practice, real-world image and video data are more naturally represented as matrices (second-order tensors) or higher-order tensors. Therefore, Zhang et al. [42] generalized the vector-based learning algorithm TWSVM to the tensor-based method Twin Support Tensor Machines (TWSTM), which accepts general tensors as input. The following formulation for TWSTM can be established:
$$
\begin{aligned}
\min_{\mathbf{w}^{(1)}_k, b^{(1)}, \mathbf{q}} \quad & \frac{1}{2}\left\|\sum_{k=1}^M \mathcal{X} \times_k \mathbf{w}^{(1)}_k + \mathbf{e}_1 b^{(1)}\right\|^2 + c_1\mathbf{e}_2^\top\mathbf{q} \\
\text{s.t.} \quad & -\left(\sum_{k=1}^M \mathcal{Y} \times_k \mathbf{w}^{(1)}_k + \mathbf{e}_2 b^{(1)}\right) + \mathbf{q} \geq \mathbf{e}_2, \quad \mathbf{q} \geq \mathbf{0}, \tag{2.8}
\end{aligned}
$$
$$
\begin{aligned}
\min_{\mathbf{w}^{(2)}_k, b^{(2)}, \mathbf{q}} \quad & \frac{1}{2}\left\|\sum_{k=1}^M \mathcal{Y} \times_k \mathbf{w}^{(2)}_k + \mathbf{e}_2 b^{(2)}\right\|^2 + c_2\mathbf{e}_1^\top\mathbf{q} \\
\text{s.t.} \quad & \left(\sum_{k=1}^M \mathcal{X} \times_k \mathbf{w}^{(2)}_k + \mathbf{e}_1 b^{(2)}\right) + \mathbf{q} \geq \mathbf{e}_1, \quad \mathbf{q} \geq \mathbf{0}, \tag{2.9}
\end{aligned}
$$
where $c_1$ and $c_2$ are penalty parameters, $\mathbf{e}_1$ and $\mathbf{e}_2$ are vectors of ones of appropriate dimensions, and $\mathbf{q} \in \mathbb{R}^n$ is the vector of slack variables to deal with linearly nonseparable problems.

The objective functions of the above models are still built around the margin itself. From the perspective of structural risk, Gao et al. [11] have verified that the margin distribution is more important than the minimum margin in optimizing generalization performance. By characterizing the margin distribution in terms of margin mean and margin variance, Zhou et al. proposed the Large Margin Distribution Machine (LDM) on the basis of SVM, which optimizes the margin distribution rather than only the minimum margin. The following formulation for LDM can be established:
$$
\begin{aligned}
\min_{\mathbf{w}, \boldsymbol{\xi}} \quad & \mathbf{w}^\top\mathbf{w} + \lambda_1\hat{\gamma} - \lambda_2\bar{\gamma} + C \sum_{i=1}^m \xi_i \\
\text{s.t.} \quad & y_i\mathbf{w}^\top\mathbf{x}_i \geq 1 - \xi_i, \quad \xi_i \geq 0, \tag{2.10}
\end{aligned}
$$
where $\lambda_1$ and $\lambda_2$ are parameters trading off the margin variance $\hat{\gamma}$, the margin mean $\bar{\gamma}$, and model complexity.

3 Proposed LDM-NPSTM

In this section, LDM-NPSTM is proposed. Specifically, Subsection 3.1 introduces the structure of the model, optimization strategies, and associated dual problems. Moreover, the detailed implementation algorithm of LDM-NPSTM is shown in Subsection 3.2.

3.1 Model Construction and Optimization

Consider the binary classification problem in tensor space, where the training set is defined as $T_m = \{(\mathcal{X}_p, y_p) \mid p = 1, 2, \ldots, m_1\} \cup \{(\mathcal{Y}_q, y_q) \mid q = 1, 2, \ldots, m_2\}$. Here, $\mathcal{X}_p, \mathcal{Y}_q \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_M}$ denote the feature tensor of the $p$-th and $q$-th sample, respectively; $y_p = 1$ and $y_q = -1$ are their corresponding class labels; and $m = m_1 + m_2$ (where $m_1$ and $m_2$ are the numbers of positive and negative samples in $T_m$, respectively). Let $\mathbf{y}_1 = [1, \cdots, 1]^\top \in \mathbb{R}^{m_1 \times 1}$ be the label vector for all positive samples, where each element corresponds to the label of a positive instance, and $\mathbf{y}_2 = [-1, \cdots, -1]^\top \in \mathbb{R}^{m_2 \times 1}$ denote the label vector for all negative samples, with each element corresponding to the label of a negative instance. The LDM-NPSTM identifies two non-parallel hyperplanes in the feature space:
$$
f_1(\mathcal{X}) = \langle \mathcal{W}_1, \mathcal{X} \rangle = 0, \quad f_2(\mathcal{Y}) = \langle \mathcal{W}_2, \mathcal{Y} \rangle = 0,
$$
where $\mathcal{W}_1, \mathcal{W}_2 \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_M}$.

Notably, following Zhou et al. [41], the bias term is omitted since it does not affect the derivation. Furthermore, unlike the conventional approach that calculates the margin mean and variance over the entire dataset, LDM-NPSTM separates these calculations into positive-class and negative-class components [40]. Specifically, these statistics are defined through the distances to the respective hyperplanes, as elaborated below.

The distance from an individual data point to the hyperplane is defined by:
$$
\gamma_p^+ = y_p \frac{|\langle \mathcal{W}_2, \mathcal{X}_p \rangle|}{\|\mathcal{W}_2\|_F}, \quad p = 1, \ldots, m_1,
$$
$$
\gamma_q^- = y_q \frac{|\langle \mathcal{W}_1, \mathcal{Y}_q \rangle|}{\|\mathcal{W}_1\|_F}, \quad q = 1, \ldots, m_2,
$$
where $y_p \in \{1, -1\}$ and $y_q \in \{1, -1\}$ denote the labels of the positive and negative samples, respectively.

The margin means are defined as follows:
$$
\bar{\gamma}^+ = \frac{1}{m_1} \sum_{p=1}^{m_1} \gamma_p^+ = \frac{1}{m_1} \sum_{p=1}^{m_1} \frac{|\langle \mathcal{W}_2, \mathcal{X}_p \rangle|}{\|\mathcal{W}_2\|_F},
$$
$$
\bar{\gamma}^- = \frac{1}{m_2} \sum_{q=1}^{m_2} \gamma_q^- = \frac{1}{m_2} \sum_{q=1}^{m_2} \frac{|\langle \mathcal{W}_1, \mathcal{Y}_q \rangle|}{\|\mathcal{W}_1\|_F}.
$$

The margin variances are defined as follows:
$$
\hat{\gamma}^+ = \frac{1}{m_1-1} \sum_{p=1}^{m_1} (\gamma_p^+ - \bar{\gamma}^+)^2 = \frac{1}{m_1-1} \sum_{p=1}^{m_1} \left(\frac{|\langle \mathcal{W}_2, \mathcal{X}_p \rangle|}{\|\mathcal{W}_2\|_F} - \bar{\gamma}^+\right)^2,
$$
$$
\hat{\gamma}^- = \frac{1}{m_2-1} \sum_{q=1}^{m_2} (\gamma_q^- - \bar{\gamma}^-)^2 = \frac{1}{m_2-1} \sum_{q=1}^{m_2} \left(\frac{|\langle \mathcal{W}_1, \mathcal{Y}_q \rangle|}{\|\mathcal{W}_1\|_F} - \bar{\gamma}^-\right)^2. \tag{3.1}
$$

To effectively capture complex data relationships and enhance classification performance, we assume the weights in subsequent classifiers form a tensor $\mathcal{W}$. This tensor can be decomposed into a sum of $R$ rank-one tensors, as defined by the CP decomposition outlined in Definition 2.4, i.e.,
$$
\mathcal{W} = \sum_{r=1}^R \mathbf{u}_r^{(1)} \otimes \mathbf{u}_r^{(2)} \otimes \cdots \otimes \mathbf{u}_r^{(M)} \tag{3.3}
$$
where $\mathbf{u}_r^{(j)} \in \mathbb{R}^{I_j}$, $j = 1, 2, \ldots, M$, and $M$ is the number of tensor modes. The weight tensor $\mathcal{W}$ generalizes the weight vector $\mathbf{w}$ in SVMs, where $\mathbf{w}$ represents the normal vector perpendicular to the separating hyperplane. The tensor form extends traditional vector-based weights to capture higher-dimensional structural information.

For mode $j$ ($j = 1, \ldots, M$), we stack the rank-one components $\{\mathbf{u}_r^{(j)}\}_{r=1}^R$ into a matrix:
$$
\mathbf{U}^{(j)} = [\mathbf{u}_1^{(j)}, \mathbf{u}_2^{(j)}, \ldots, \mathbf{u}_R^{(j)}] \in \mathbb{R}^{I_j \times R}.
$$
The $j$-th matricization of $\mathcal{W}$ (i.e., reshaping the tensor into a matrix by unfolding along the $j$-th mode) is given by:
$$
\mathbf{W}_{(j)} = \mathbf{U}^{(j)} \left(\mathbf{U}^{(M)} \odot \cdots \odot \mathbf{U}^{(j+1)} \odot \mathbf{U}^{(j-1)} \odot \cdots \odot \mathbf{U}^{(1)}\right)^\top = \mathbf{U}^{(j)} \left(\mathbf{U}^{(-j)}\right)^\top. \tag{3.4}
$$
Similarly, the $j$-th matricization of samples $\mathcal{X}$ is given by $\mathbf{X}_{(j)} = \mathbf{V}^{(j)} \left(\mathbf{V}^{(-j)}\right)^\top$. For samples $\mathcal{Y}$, the matricization follows the same structural form (consistent with $\mathcal{X}$).

Further, consider the tensor inner product property:
$$
\langle \mathcal{W}, \mathcal{W} \rangle = \text{Tr}\left(\mathbf{W}_{(j)}\mathbf{W}_{(j)}^\top\right) = \text{vec}\left(\mathbf{W}_{(j)}\right)^\top \text{vec}\left(\mathbf{W}_{(j)}\right), \tag{3.5}
$$
where $\text{Tr}(\cdot)$ denotes the trace operator, and $\text{vec}(\cdot)$ denotes vectorization.
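A small numerical check of the identities (3.4) and (3.5) can be helpful; the NumPy sketch below (illustrative sizes, hypothetical `unfold` and `khatri_rao` helpers) builds a rank-$R$ weight tensor from its CP factors and verifies both identities for $j = 1$.

```python
import numpy as np

# Illustrative check of Eq. (3.4) and Eq. (3.5) for a third-order weight tensor
# W = sum_r u_r^(1) o u_r^(2) o u_r^(3); sizes and rank are made up.
rng = np.random.default_rng(0)
I1, I2, I3, R = 4, 3, 5, 2
U1 = rng.standard_normal((I1, R))
U2 = rng.standard_normal((I2, R))
U3 = rng.standard_normal((I3, R))

# Build the weight tensor from its CP factors, cf. (3.3).
W = np.einsum('ir,jr,kr->ijk', U1, U2, U3)

def unfold(T, mode):
    """Mode-n matricization following Definition 2.5."""
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1, order='F')

def khatri_rao(A, B):
    """Column-wise Kronecker product (Definition 2.7)."""
    return np.einsum('ir,jr->ijr', A, B).reshape(A.shape[0] * B.shape[0], -1)

# Eq. (3.4) for j = 1: W_(1) = U^(1) (U^(3) (.) U^(2))^T.
W1_unf = unfold(W, 0)
assert np.allclose(W1_unf, U1 @ khatri_rao(U3, U2).T)

# Eq. (3.5): <W, W> = Tr(W_(j) W_(j)^T) = vec(W_(j))^T vec(W_(j)).
ip_tensor = np.sum(W * W)
assert np.isclose(ip_tensor, np.trace(W1_unf @ W1_unf.T))
assert np.isclose(ip_tensor, W1_unf.ravel() @ W1_unf.ravel())
print("identities (3.4) and (3.5) verified numerically")
```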

Analogous to Equation (3.5), for individual sample tensors $\mathcal{X}_p$ and $\mathcal{Y}_q$, their inner products with the weight tensors $\mathcal{W}_1, \mathcal{W}_2$ satisfy:
$$
\langle \mathcal{W}_1, \mathcal{X}_p \rangle = \text{Tr}\left(\mathbf{W}_{1(j)}^\top \mathbf{X}_{p(j)}\right), \quad \langle \mathcal{W}_2, \mathcal{X}_p \rangle = \text{Tr}\left(\mathbf{W}_{2(j)}^\top \mathbf{X}_{p(j)}\right),
$$
$$
\langle \mathcal{W}_1, \mathcal{Y}_q \rangle = \text{Tr}\left(\mathbf{W}_{1(j)}^\top \mathbf{Y}_{q(j)}\right), \quad \langle \mathcal{W}_2, \mathcal{Y}_q \rangle = \text{Tr}\left(\mathbf{W}_{2(j)}^\top \mathbf{Y}_{q(j)}\right). \tag{3.6}
$$

To guarantee that the matrices appearing in the LDM-NPSTM dual problems are nonsingular, we add the regularization term
$$
\|\mathcal{W}_i\|_F^2, \quad i = 1, 2, \tag{3.7}
$$
which also plays the role of margin maximization. By utilizing (3.7), the LDM-NPSTM dual problems can be derived without any additional assumptions or modifications.

LDM-NPSTM seeks a pair of tensors $\mathcal{W}_1$ and $\mathcal{W}_2$ that simultaneously maximize the positive and negative margin means while minimizing the margin variances, i.e., it solves the following two optimization problems:
$$
\begin{aligned}
\min_{\mathcal{W}_1, \boldsymbol{\xi}_2} \quad & \frac{1}{2}\sum_{p=1}^{m_1} \langle \mathcal{W}_1, \mathcal{X}_p \rangle^2 + \|\mathcal{W}_1\|_F^2 + \lambda_1\hat{\gamma}^- - \lambda_3\bar{\gamma}^- + c_3\mathbf{e}_2^\top\boldsymbol{\xi}_2 \\
\text{s.t.} \quad & -\langle \mathcal{W}_1, \mathcal{Y}_q \rangle + \xi_{2q} \geq 1, \quad \xi_{2q} \geq 0, \quad q = 1, \ldots, m_2, \tag{3.8}
\end{aligned}
$$
$$
\begin{aligned}
\min_{\mathcal{W}_2, \boldsymbol{\xi}_1} \quad & \frac{1}{2}\sum_{q=1}^{m_2} \langle \mathcal{W}_2, \mathcal{Y}_q \rangle^2 + \|\mathcal{W}_2\|_F^2 + \lambda_2\hat{\gamma}^+ - \lambda_4\bar{\gamma}^+ + c_4\mathbf{e}_1^\top\boldsymbol{\xi}_1 \\
\text{s.t.} \quad & -\langle \mathcal{W}_2, \mathcal{X}_p \rangle + \xi_{1p} \geq 1, \quad \xi_{1p} \geq 0, \quad p = 1, \ldots, m_1, \tag{3.9}
\end{aligned}
$$
where $\lambda_1, \lambda_2, \lambda_3, \lambda_4$ are hyperparameters that balance margin variance, margin mean, and model complexity; $c_1$ and $c_2$ control regularization strength, while $c_3$ and $c_4$ weight the penalty on slack variables $\boldsymbol{\xi}_1$ and $\boldsymbol{\xi}_2$; $\mathbf{e}_1$ and $\mathbf{e}_2$ are all-ones vectors matching the dimensions of $\boldsymbol{\xi}_1$ and $\boldsymbol{\xi}_2$.

However, the optimization problems (3.8)-(3.9) are non-convex in the tensor parameters $\mathcal{W}_1$ and $\mathcal{W}_2$. To address this, we adopt an alternating optimization scheme. At each iteration, we fix all tensor modes except the $j$-th and solve for $\mathbf{W}_{1(j)}$ and $\mathbf{W}_{2(j)}$ while keeping the other modes constant.

More specifically, taking (3.5) and (3.6) into consideration, at the iterations for the $j$-th mode we solve the following optimization problems:
$$
\begin{aligned}
\min_{\mathbf{W}_{1(j)}, \boldsymbol{\xi}_2} \quad & \frac{1}{2}\sum_{p=1}^{m_1} \left(\text{Tr}\left(\mathbf{W}_{1(j)}^\top \mathbf{X}_{p(j)}\right)\right)^2 + \text{Tr}\left(\mathbf{W}_{1(j)}^\top \mathbf{W}_{1(j)}\right) + \lambda_1\hat{\gamma}^- - \lambda_3\bar{\gamma}^- + c_3\mathbf{e}_2^\top\boldsymbol{\xi}_2 \\
\text{s.t.} \quad & -\text{Tr}\left(\mathbf{W}_{1(j)}^\top \mathbf{Y}_{q(j)}\right) + \xi_{2q} \geq 1, \quad \xi_{2q} \geq 0, \quad q = 1, \ldots, m_2, \tag{3.10}
\end{aligned}
$$
$$
\begin{aligned}
\min_{\mathbf{W}_{2(j)}, \boldsymbol{\xi}_1} \quad & \frac{1}{2}\sum_{q=1}^{m_2} \left(\text{Tr}\left(\mathbf{W}_{2(j)}^\top \mathbf{Y}_{q(j)}\right)\right)^2 + \text{Tr}\left(\mathbf{W}_{2(j)}^\top \mathbf{W}_{2(j)}\right) + \lambda_2\hat{\gamma}^+ - \lambda_4\bar{\gamma}^+ + c_4\mathbf{e}_1^\top\boldsymbol{\xi}_1 \\
\text{s.t.} \quad & -\text{Tr}\left(\mathbf{W}_{2(j)}^\top \mathbf{X}_{p(j)}\right) + \xi_{1p} \geq 1, \quad \xi_{1p} \geq 0, \quad p = 1, \ldots, m_1. \tag{3.11}
\end{aligned}
$$

Under the assumption that the tensors $\mathcal{W}_1$ and $\mathcal{W}_2$ are written as sums of rank-one tensors as in (3.3), we replace $\mathbf{W}_{i(j)}$ with $\mathbf{U}^{(j)}_i \left(\mathbf{U}^{(-j)}\right)^\top$ ($i = 1, 2$), cf. (3.4), in the above problems. We should mention here that the initial values of the rank-one factors, and subsequently of the factor matrices $\mathbf{U}^{(j)}_1$ and $\mathbf{U}^{(j)}_2$, are chosen randomly. Then (3.10) and (3.11) are rewritten as:
$$
\begin{aligned}
\min_{\mathbf{U}^{(j)}_1, \boldsymbol{\xi}_2} \quad & \frac{1}{2}\sum_{p=1}^{m_1} \left(\text{Tr}\left(\left(\mathbf{U}^{(j)}_1 \mathbf{U}^{(-j)\top}\right)^\top \mathbf{X}_{p(j)}\right)\right)^2 + \text{Tr}\left(\left(\mathbf{U}^{(j)}_1 \mathbf{U}^{(-j)\top}\right)^\top \left(\mathbf{U}^{(j)}_1 \mathbf{U}^{(-j)\top}\right)\right) \\
& + \lambda_1\hat{\gamma}^- - \lambda_3\bar{\gamma}^- + c_3\mathbf{e}_2^\top\boldsymbol{\xi}_2 \\
\text{s.t.} \quad & -\text{Tr}\left(\left(\mathbf{U}^{(j)}_1 \mathbf{U}^{(-j)\top}\right)^\top \mathbf{Y}_{q(j)}\right) + \xi_{2q} \geq 1, \quad \xi_{2q} \geq 0, \quad q = 1, \ldots, m_2, \tag{3.12}
\end{aligned}
$$
$$
\begin{aligned}
\min_{\mathbf{U}^{(j)}_2, \boldsymbol{\xi}_1} \quad & \frac{1}{2}\sum_{q=1}^{m_2} \left(\text{Tr}\left(\left(\mathbf{U}^{(j)}_2 \mathbf{U}^{(-j)\top}\right)^\top \mathbf{Y}_{q(j)}\right)\right)^2 + \text{Tr}\left(\left(\mathbf{U}^{(j)}_2 \mathbf{U}^{(-j)\top}\right)^\top \left(\mathbf{U}^{(j)}_2 \mathbf{U}^{(-j)\top}\right)\right) \\
& + \lambda_2\hat{\gamma}^+ - \lambda_4\bar{\gamma}^+ + c_4\mathbf{e}_1^\top\boldsymbol{\xi}_1 \\
\text{s.t.} \quad & -\text{Tr}\left(\left(\mathbf{U}^{(j)}_2 \mathbf{U}^{(-j)\top}\right)^\top \mathbf{X}_{p(j)}\right) + \xi_{1p} \geq 1, \quad \xi_{1p} \geq 0, \quad p = 1, \ldots, m_1. \tag{3.13}
\end{aligned}
$$

To simplify the trace operations in (3.12)-(3.13), we introduce a change of variables leveraging the positive definiteness of factor matrices. Define $\mathbf{A} = \mathbf{U}^{(-j)\top} \mathbf{U}^{(-j)}$, which is a positive definite matrix. Introduce the transformed factor matrix: $\tilde{\mathbf{U}}_1 = \mathbf{U}^{(j)}_1 \mathbf{A}^{1/2}$. The regularization trace in (3.12) becomes:
$$
\text{Tr}\left(\left(\mathbf{U}^{(j)}_1 \mathbf{U}^{(-j)\top}\right)^\top \left(\mathbf{U}^{(j)}_1 \mathbf{U}^{(-j)\top}\right)\right) = \text{Tr}\left(\tilde{\mathbf{U}}_1^\top \tilde{\mathbf{U}}_1\right) = \text{vec}(\tilde{\mathbf{U}}_1)^\top \text{vec}(\tilde{\mathbf{U}}_1). \tag{3.14}
$$

Define transformed data matrices for mode $j$:
$$
\tilde{\mathbf{X}}_p = \mathbf{X}_{p(j)} \mathbf{U}^{(-j)} \mathbf{A}^{-1/2}, \quad \tilde{\mathbf{Y}}_q = \mathbf{Y}_{q(j)} \mathbf{U}^{(-j)} \mathbf{A}^{-1/2}.
$$

The data fidelity trace in (3.12) simplifies to:
$$
\text{Tr}\left(\left(\mathbf{U}^{(j)}_1 \mathbf{U}^{(-j)\top}\right)^\top \mathbf{X}_{p(j)}\right) = \text{Tr}\left(\tilde{\mathbf{U}}_1^\top \tilde{\mathbf{X}}_p\right) = \text{vec}(\tilde{\mathbf{U}}_1)^\top \text{vec}(\tilde{\mathbf{X}}_p), \tag{3.15}
$$
$$
\text{Tr}\left(\left(\mathbf{U}^{(j)}_1 \mathbf{U}^{(-j)\top}\right)^\top \mathbf{Y}_{q(j)}\right) = \text{Tr}\left(\tilde{\mathbf{U}}_1^\top \tilde{\mathbf{Y}}_q\right) = \text{vec}(\tilde{\mathbf{U}}_1)^\top \text{vec}(\tilde{\mathbf{Y}}_q). \tag{3.16}
$$
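The change of variables (3.14)-(3.16) can also be verified numerically. The sketch below (illustrative sizes; the symmetric square roots are taken via an eigendecomposition, which is one possible choice) checks the trace identities for a single sample.

```python
import numpy as np

# Illustrative check of the change of variables (3.14)-(3.16); all sizes are
# made up and the square roots of A are computed via an eigendecomposition.
rng = np.random.default_rng(7)
I_j, R, P = 6, 3, 10                      # mode-j size, CP rank, product of other modes
U_j    = rng.standard_normal((I_j, R))    # factor matrix U^(j)_1
U_rest = rng.standard_normal((P, R))      # stands in for the Khatri-Rao product U^(-j)
X_pj   = rng.standard_normal((I_j, P))    # mode-j unfolding of a sample X_p

A = U_rest.T @ U_rest                     # assumed positive definite
evals, evecs = np.linalg.eigh(A)
A_half     = evecs @ np.diag(np.sqrt(evals)) @ evecs.T
A_half_inv = evecs @ np.diag(1.0 / np.sqrt(evals)) @ evecs.T

U_tilde = U_j @ A_half                    # transformed factor matrix (3.14)
X_tilde = X_pj @ U_rest @ A_half_inv      # transformed data matrix

# (3.15): Tr((U^(j) U^(-j)T)^T X_p(j)) = vec(U~)^T vec(X~).
lhs = np.trace((U_j @ U_rest.T).T @ X_pj)
rhs = U_tilde.ravel() @ X_tilde.ravel()
assert np.isclose(lhs, rhs)

# (3.14): the regularization trace becomes a plain squared norm of vec(U~).
assert np.isclose(np.trace((U_j @ U_rest.T).T @ (U_j @ U_rest.T)),
                  U_tilde.ravel() @ U_tilde.ravel())
```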

Then, combining (3.1) with (3.2) and using the transformed variables (3.14)-(3.16), the subproblem (3.12) becomes:
$$
\begin{aligned}
\min_{\tilde{\mathbf{U}}_1, \boldsymbol{\xi}_2} \quad & \frac{1}{2} \sum_{p=1}^{m_1} \left(\text{vec}(\tilde{\mathbf{U}}_1)^\top \text{vec}(\tilde{\mathbf{X}}_p)\right)^2 + \text{vec}(\tilde{\mathbf{U}}_1)^\top \text{vec}(\tilde{\mathbf{U}}_1) \\
& + \frac{\lambda_1}{m_2-1} \sum_{q=1}^{m_2} \left(\text{vec}(\tilde{\mathbf{U}}_1)^\top \text{vec}(\tilde{\mathbf{Y}}_q) - \bar{\gamma}^-\right)^2 + c_3\mathbf{e}_2^\top\boldsymbol{\xi}_2 \\
\text{s.t.} \quad & -\text{vec}(\tilde{\mathbf{U}}_1)^\top \text{vec}(\tilde{\mathbf{Y}}_q) + \xi_{2q} \geq 1, \quad \xi_{2q} \geq 0, \quad q = 1, \ldots, m_2. \tag{3.17}
\end{aligned}
$$

Similarly, let us define $\mathbf{B} = \mathbf{U}^{(-j)\top} \mathbf{U}^{(-j)}$, where $\mathbf{B}$ is a positive definite matrix. Then, $\tilde{\mathbf{U}}_2 = \mathbf{U}^{(j)}_2 \mathbf{B}^{1/2}$, and we have:
$$
\text{Tr}\left(\left(\mathbf{U}^{(j)}_2 \mathbf{U}^{(-j)\top}\right)^\top \left(\mathbf{U}^{(j)}_2 \mathbf{U}^{(-j)\top}\right)\right) = \text{Tr}\left(\tilde{\mathbf{U}}_2^\top \tilde{\mathbf{U}}_2\right) = \text{vec}(\tilde{\mathbf{U}}_2)^\top \text{vec}(\tilde{\mathbf{U}}_2). \tag{3.18}
$$

The transformed data matrices are:
$$
\bar{\mathbf{Y}}_q = \mathbf{Y}_{q(j)} \mathbf{U}^{(-j)} \mathbf{B}^{-1/2}, \quad \bar{\mathbf{X}}_p = \mathbf{X}_{p(j)} \mathbf{U}^{(-j)} \mathbf{B}^{-1/2}.
$$

We have:
$$
\text{Tr}\left(\left(\mathbf{U}^{(j)}_2 \mathbf{U}^{(-j)\top}\right)^\top \mathbf{Y}_{q(j)}\right) = \text{vec}(\tilde{\mathbf{U}}_2)^\top \text{vec}(\bar{\mathbf{Y}}_q), \tag{3.19}
$$
$$
\text{Tr}\left(\left(\mathbf{U}^{(j)}_2 \mathbf{U}^{(-j)\top}\right)^\top \mathbf{X}_{p(j)}\right) = \text{vec}(\tilde{\mathbf{U}}_2)^\top \text{vec}(\bar{\mathbf{X}}_p). \tag{3.20}
$$

Then, combining (3.1) with (3.2), (3.13) is written as:
$$
\begin{aligned}
\min_{\tilde{\mathbf{U}}_2, \boldsymbol{\xi}_1} \quad & \frac{1}{2} \sum_{q=1}^{m_2} \left(\text{vec}(\tilde{\mathbf{U}}_2)^\top \text{vec}(\bar{\mathbf{Y}}_q)\right)^2 + \text{vec}(\tilde{\mathbf{U}}_2)^\top \text{vec}(\tilde{\mathbf{U}}_2) \\
& + \frac{\lambda_2}{m_1-1} \sum_{p=1}^{m_1} \left(\text{vec}(\tilde{\mathbf{U}}_2)^\top \text{vec}(\bar{\mathbf{X}}_p) - \bar{\gamma}^+\right)^2 + c_4\mathbf{e}_1^\top\boldsymbol{\xi}_1 \\
\text{s.t.} \quad & -\text{vec}(\tilde{\mathbf{U}}_2)^\top \text{vec}(\bar{\mathbf{X}}_p) + \xi_{1p} \geq 1, \quad \xi_{1p} \geq 0, \quad p = 1, \ldots, m_1. \tag{3.21}
\end{aligned}
$$

Theorem 3.1. The optimal solutions $\text{vec}(\tilde{\mathbf{U}}_1)^*$ of (3.17) and $\text{vec}(\tilde{\mathbf{U}}_2)^*$ of (3.21) can be expressed succinctly as $\text{vec}(\tilde{\mathbf{U}}_1) = \mathbf{V}^{(j)}\boldsymbol{\beta}_1$ and $\text{vec}(\tilde{\mathbf{U}}_2) = \mathbf{V}^{(j)}\boldsymbol{\beta}_2$, where $\boldsymbol{\beta}_1, \boldsymbol{\beta}_2 \in \mathbb{R}^m$ are coefficient vectors.

Proof. To analyze the objective function, we decompose the vectorized factor matrix $\text{vec}(\tilde{\mathbf{U}})$ along the span of the columns of $\mathbf{V}^{(j)}$. By the projection theorem, $\text{vec}(\tilde{\mathbf{U}})$ can be split into a part that lives in this span and an orthogonal part, i.e.,
$$
\text{vec}(\tilde{\mathbf{U}}) = \sum_{l=1}^m \beta_l \mathbf{v}_l^{(j)} + \boldsymbol{\eta} = \mathbf{V}^{(j)}\boldsymbol{\beta} + \boldsymbol{\eta},
$$
where $\mathbf{v}_l^{(j)}$ denotes the $l$-th column of $\mathbf{V}^{(j)}$ and $\boldsymbol{\eta}$ is a vector satisfying $(\mathbf{V}^{(j)})^\top \boldsymbol{\eta} = \mathbf{0}$ [41].

The objective function of (3.17) includes a quadratic term in $\text{vec}(\tilde{\mathbf{U}})$:
$$
\text{vec}(\tilde{\mathbf{U}})^\top \text{vec}(\tilde{\mathbf{U}}) = (\mathbf{V}^{(j)}\boldsymbol{\beta} + \boldsymbol{\eta})^\top (\mathbf{V}^{(j)}\boldsymbol{\beta} + \boldsymbol{\eta}) = \boldsymbol{\beta}^\top (\mathbf{V}^{(j)})^\top \mathbf{V}^{(j)}\boldsymbol{\beta} + \boldsymbol{\eta}^\top \boldsymbol{\eta}. \tag{3.22}
$$
By the orthogonality $(\mathbf{V}^{(j)})^\top \boldsymbol{\eta} = \mathbf{0}$, the cross term vanishes. Define $\mathbf{K} = (\mathbf{V}^{(j)})^\top \mathbf{V}^{(j)}$, then
$$
\text{vec}(\tilde{\mathbf{U}})^\top \text{vec}(\tilde{\mathbf{U}}) = \boldsymbol{\beta}^\top \mathbf{K}\boldsymbol{\beta} + \|\boldsymbol{\eta}\|^2.
$$
Since $\|\boldsymbol{\eta}\|^2 \geq 0$, the quadratic term is minimized when $\boldsymbol{\eta} = \mathbf{0}$. This implies $\text{vec}(\tilde{\mathbf{U}}) = \mathbf{V}^{(j)}\boldsymbol{\beta}$.

A parallel decomposition applies to $\text{vec}(\tilde{\mathbf{U}}_2)$: $\text{vec}(\tilde{\mathbf{U}}_2) = \mathbf{V}^{(j)}\boldsymbol{\beta}_2 + \boldsymbol{\eta}_2$, with $(\mathbf{V}^{(j)})^\top \boldsymbol{\eta}_2 = \mathbf{0}$, leading to the same conclusion: $\boldsymbol{\eta}_2 = \mathbf{0}$ minimizes the quadratic term.

Thus, setting $\boldsymbol{\eta}_i = \mathbf{0}$ ($i = 1, 2$) does not affect the constraints or the other terms of the objective, while it minimizes the quadratic term. Therefore, Theorem 3.1 is established.

Based on Theorem 3.1, the transformation of the original problem into the Wolfe dual problem can be derived.

Theorem 3.2. The original problem (3.8) can be transformed into the Wolfe dual problem as follows:
$$
\begin{aligned}
\max_{\boldsymbol{\alpha}_1} \quad & -\frac{1}{2}\boldsymbol{\alpha}_1^\top \mathbf{H}_1\boldsymbol{\alpha}_1 + \left(\frac{\lambda_3}{2}\mathbf{y}_2\right)^\top \mathbf{H}_1\mathbf{y}_2 + \mathbf{e}_2^\top\boldsymbol{\alpha}_1 \\
\text{s.t.} \quad & \mathbf{0} \leq \boldsymbol{\alpha}_1 \leq c_3\mathbf{e}_2,
\end{aligned}
$$
where $\mathbf{H}_1 = \mathbf{M}_1\mathbf{G}_1^{-1}\mathbf{M}_1^\top$, $\mathbf{G}_1 = \mathbf{K}_1^\top\mathbf{K}_1 + c_1\mathbf{K} + \frac{2\lambda_1}{m_2-1}\mathbf{M}_1^\top\mathbf{M}_1$, $\boldsymbol{\alpha}_1$ is a Lagrangian multiplier vector, and $\boldsymbol{\beta}_1$ has the following representation:
$$
\boldsymbol{\beta}_1 = \mathbf{G}_1^{-1}\left(\frac{\lambda_3}{2}\mathbf{M}_1^\top\mathbf{y}_2 - \mathbf{M}_1^\top\boldsymbol{\alpha}_1\right).
$$

Proof. According to (3.22) from Theorem 3.1, we have:
$$
\sum_{p=1}^{m_1} \left(\text{vec}(\tilde{\mathbf{U}}_1)^\top \text{vec}(\tilde{\mathbf{X}}_p)\right)^2 = \sum_{p=1}^{m_1} \left(\boldsymbol{\beta}_1^\top (\mathbf{V}^{(j)})^\top \text{vec}(\tilde{\mathbf{X}}_p)\right)^2 = \boldsymbol{\beta}_1^\top \mathbf{K}_1^\top\mathbf{K}_1\boldsymbol{\beta}_1, \tag{3.23}
$$
where $\mathbf{K}_1 = (\mathbf{V}^{(j)})^\top \tilde{\mathbf{X}}$.

Similarly,
$$
\sum_{q=1}^{m_2} \text{vec}(\tilde{\mathbf{U}}_1)^\top \text{vec}(\tilde{\mathbf{Y}}_q) = \mathbf{y}_2^\top (\mathbf{V}^{(j)})^\top \text{vec}(\tilde{\mathbf{Y}}) = \mathbf{y}_2^\top \mathbf{M}_1\boldsymbol{\beta}_1, \tag{3.24}
$$
where $\mathbf{M}_1 = (\mathbf{V}^{(j)})^\top \tilde{\mathbf{Y}}$, and
$$
\sum_{q=1}^{m_2} \left(\text{vec}(\tilde{\mathbf{U}}_1)^\top \text{vec}(\tilde{\mathbf{Y}}_q)\right)^2 = \boldsymbol{\beta}_1^\top \mathbf{M}_1^\top\mathbf{M}_1\boldsymbol{\beta}_1. \tag{3.25}
$$

Substituting (3.23), (3.24), and (3.25) into (3.17), we obtain the following final matrix form:
$$
\begin{aligned}
\min_{\boldsymbol{\beta}_1, \boldsymbol{\xi}_2} \quad & \frac{1}{2}\boldsymbol{\beta}_1^\top \mathbf{K}\boldsymbol{\beta}_1 + \frac{\lambda_1}{m_2-1}\boldsymbol{\beta}_1^\top \mathbf{M}_1^\top\mathbf{M}_1\boldsymbol{\beta}_1 - \frac{\lambda_3}{2}\mathbf{y}_2^\top \mathbf{M}_1\boldsymbol{\beta}_1 + c_3\mathbf{e}_2^\top\boldsymbol{\xi}_2 \\
\text{s.t.} \quad & -\mathbf{M}_1\boldsymbol{\beta}_1 + \boldsymbol{\xi}_2 \geq \mathbf{e}_2, \quad \boldsymbol{\xi}_2 \geq \mathbf{0}. \tag{3.26}
\end{aligned}
$$

Equation (3.26) can be written in a more concise form as shown below:
$$
\begin{aligned}
\min_{\boldsymbol{\beta}_1, \boldsymbol{\xi}_2} \quad & \frac{1}{2}\boldsymbol{\beta}_1^\top \mathbf{G}_1\boldsymbol{\beta}_1 + c_3\mathbf{e}_2^\top\boldsymbol{\xi}_2 \\
\text{s.t.} \quad & -\mathbf{M}_1\boldsymbol{\beta}_1 + \boldsymbol{\xi}_2 \geq \mathbf{e}_2, \quad \boldsymbol{\xi}_2 \geq \mathbf{0}, \tag{3.27}
\end{aligned}
$$
where $\mathbf{G}_1 = \mathbf{K}_1^\top\mathbf{K}_1 + c_1\mathbf{K} + \frac{2\lambda_1}{m_2-1}\mathbf{M}_1^\top\mathbf{M}_1$ is a symmetric nonnegative definite matrix.

The Lagrangian function of the optimization problem in (3.27) is:
$$
\mathcal{L}_1(\boldsymbol{\beta}_1, \boldsymbol{\xi}_2, \boldsymbol{\alpha}_1, \boldsymbol{\delta}_1) = \frac{1}{2}\boldsymbol{\beta}_1^\top \mathbf{G}_1\boldsymbol{\beta}_1 + c_3\mathbf{e}_2^\top\boldsymbol{\xi}_2 - \boldsymbol{\alpha}_1^\top(-\mathbf{M}_1\boldsymbol{\beta}_1 + \boldsymbol{\xi}_2 - \mathbf{e}_2) - \boldsymbol{\delta}_1^\top\boldsymbol{\xi}_2, \tag{3.28}
$$
where $\boldsymbol{\alpha}_1, \boldsymbol{\delta}_1 \in \mathbb{R}^{m_2}$ are Lagrangian multiplier vectors.

According to dual theorem and Karush-Kuhn-Tucker (KKT) conditions, the minimum of the Lagrangian function in (3.28) with respect to $\boldsymbol{\beta}_1, \boldsymbol{\xi}_2$ equals the maximum of the function with respect to $\boldsymbol{\alpha}_1$. By satisfying the necessary conditions for the optimal solution for the Lagrange function, i.e., $\partial\mathcal{L}_1/\partial\boldsymbol{\beta}_1 = \mathbf{0}$ and $\partial\mathcal{L}_1/\partial\boldsymbol{\xi}_2 = \mathbf{0}$, we obtain:
$$
\mathbf{G}_1\boldsymbol{\beta}_1 = \mathbf{M}_1^\top\boldsymbol{\alpha}_1, \quad c_3\mathbf{e}_2 - \boldsymbol{\alpha}_1 - \boldsymbol{\delta}_1 = \mathbf{0} \Rightarrow \mathbf{0} \leq \boldsymbol{\alpha}_1 \leq c_3\mathbf{e}_2. \tag{3.29}
$$

Since $\mathbf{G}_1$ is nonsingular, $\boldsymbol{\beta}_1$ can be deduced from (3.29) as:
$$
\boldsymbol{\beta}_1 = \mathbf{G}_1^{-1}\left(\frac{\lambda_3}{2}\mathbf{M}_1^\top\mathbf{y}_2 - \mathbf{M}_1^\top\boldsymbol{\alpha}_1\right). \tag{3.30}
$$

Substituting (3.29) into the Lagrangian function (3.28), the Wolfe dual form of model (3.27) is:
$$
\begin{aligned}
\max_{\boldsymbol{\alpha}_1} \quad & -\frac{1}{2}\boldsymbol{\alpha}_1^\top \mathbf{H}_1\boldsymbol{\alpha}_1 + \left(\frac{\lambda_3}{2}\mathbf{y}_2\right)^\top \mathbf{H}_1\mathbf{y}_2 + \mathbf{e}_2^\top\boldsymbol{\alpha}_1 \\
\text{s.t.} \quad & \mathbf{0} \leq \boldsymbol{\alpha}_1 \leq c_3\mathbf{e}_2, \tag{3.31}
\end{aligned}
$$
where $\mathbf{H}_1 = \mathbf{M}_1\mathbf{G}_1^{-1}\mathbf{M}_1^\top$.

Similarly, we derive the Wolfe dual form of model (3.21) using Theorem 3.1 as indicated below:
$$
\begin{aligned}
\max_{\boldsymbol{\alpha}_2} \quad & -\frac{1}{2}\boldsymbol{\alpha}_2^\top \mathbf{H}_2\boldsymbol{\alpha}_2 + \left(\frac{\lambda_4}{2}\mathbf{y}_1\right)^\top \mathbf{H}_2\mathbf{y}_1 + \mathbf{e}_1^\top\boldsymbol{\alpha}_2 \\
\text{s.t.} \quad & \mathbf{0} \leq \boldsymbol{\alpha}_2 \leq c_4\mathbf{e}_1, \tag{3.32}
\end{aligned}
$$
where $\boldsymbol{\alpha}_2$ is a nonnegative Lagrangian multiplier vector, $\mathbf{H}_2 = \mathbf{M}_2\mathbf{G}_2^{-1}\mathbf{M}_2^\top$, and $\boldsymbol{\beta}_2$ is expressed as:
$$
\boldsymbol{\beta}_2 = \mathbf{G}_2^{-1}\left(\frac{\lambda_4}{2}\mathbf{M}_2^\top\mathbf{y}_1 - \mathbf{M}_2^\top\boldsymbol{\alpha}_2\right),
$$
where $\mathbf{G}_2 = \mathbf{K}_2^\top\mathbf{K}_2 + c_2\mathbf{K} + \frac{2\lambda_2}{m_1-1}\mathbf{M}_2^\top\mathbf{M}_2$ is a symmetric nonnegative definite matrix, $\mathbf{K}_2 = \bar{\mathbf{Y}}^\top \mathbf{V}^{(j)}$, and $\mathbf{M}_2 = \bar{\mathbf{X}}^\top \mathbf{V}^{(j)}$.
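The box-constrained duals (3.31) and (3.32) can be handed to any standard QP solver; as an illustrative stand-in for the QPP toolkit used in Algorithm 1, the sketch below applies projected gradient ascent to a toy instance and then recovers $\boldsymbol{\beta}_1$ via (3.30). All sizes and parameter values here are hypothetical.

```python
import numpy as np

# Sketch: projected gradient ascent for a box-constrained dual of the form
# (3.31)/(3.32); this is an illustrative stand-in, not the authors' solver.
def solve_box_qp(H, e, c, step=None, iters=2000):
    """Maximize -0.5 a^T H a + e^T a subject to 0 <= a <= c (elementwise)."""
    m = H.shape[0]
    if step is None:
        step = 1.0 / (np.linalg.norm(H, 2) + 1e-12)   # conservative step size
    a = np.zeros(m)
    for _ in range(iters):
        grad = e - H @ a                              # gradient of the dual objective
        a = np.clip(a + step * grad, 0.0, c)          # project onto the box [0, c]
    return a

# Toy usage with random stand-ins for M_1, G_1 and c_3 = 1.
rng = np.random.default_rng(11)
M1 = rng.standard_normal((8, 12))                     # hypothetical M_1
G1 = M1.T @ M1 + np.eye(12)                           # stands in for G_1 (nonsingular)
H1 = M1 @ np.linalg.solve(G1, M1.T)                   # H_1 = M_1 G_1^{-1} M_1^T
alpha1 = solve_box_qp(H1, np.ones(8), c=1.0)
# Recover beta_1, cf. (3.30), with illustrative lambda_3 = 1 and y_2 = -e.
beta1 = np.linalg.solve(G1, 0.5 * M1.T @ (-np.ones(8)) - M1.T @ alpha1)
```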

Theorem 3.3. By utilizing Theorem 3.2, we obtain the Wolfe dual problems and the parameter vectors $\boldsymbol{\beta}_1$ and $\boldsymbol{\beta}_2$ related to the optimal hyperplanes. Then the decision function of LDM-NPSTM can be constructed as follows:
$$
f(\mathcal{X}) = \arg\min_{i=1,2} \frac{|\langle \mathcal{W}_i, \mathcal{X} \rangle|}{\|\mathcal{W}_i\|_F} = \arg\min_{i=1,2} \frac{|\mathbf{K}_i\boldsymbol{\beta}_i|}{\sqrt{\boldsymbol{\beta}_i^\top \mathbf{K}\boldsymbol{\beta}_i}},
$$
$$
f(\mathcal{Y}) = \arg\min_{i=1,2} \frac{|\langle \mathcal{W}_i, \mathcal{Y} \rangle|}{\|\mathcal{W}_i\|_F} = \arg\min_{i=1,2} \frac{|\mathbf{M}_i\boldsymbol{\beta}_i|}{\sqrt{\boldsymbol{\beta}_i^\top \mathbf{K}\boldsymbol{\beta}_i}}.
$$

Proof. The decision function assigns an input point to the class whose hyperplane is closer, and the detailed argument parallels that of the traditional TWSVM [16].
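In tensor form, the decision rule of Theorem 3.3 simply assigns a test sample to the class whose hyperplane is closer after Frobenius normalization; the following sketch (with random stand-in weight tensors) illustrates this, assuming $\mathcal{W}_1$ is the positive-class tensorplane.

```python
import numpy as np

# Illustrative sketch of the decision rule of Theorem 3.3; W1 and W2 are
# random stand-ins for the learned weight tensors, not trained models.
def predict(T, W1, W2):
    d1 = abs(np.sum(W1 * T)) / np.linalg.norm(W1)   # |<W_1, T>| / ||W_1||_F
    d2 = abs(np.sum(W2 * T)) / np.linalg.norm(W2)   # |<W_2, T>| / ||W_2||_F
    return +1 if d1 <= d2 else -1                   # closer to the positive-class plane

rng = np.random.default_rng(5)
W1, W2 = rng.standard_normal((4, 4, 3)), rng.standard_normal((4, 4, 3))
print(predict(rng.standard_normal((4, 4, 3)), W1, W2))
```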

3.2 Algorithm and Pseudocode

In this section, we address problems (3.8) and (3.9) by proposing an alternating projection algorithm. For subproblems (3.10) and (3.11), we leverage their dual counterparts (3.31) and (3.32) to derive an efficient solution method. The subsequent content is structured as follows: we first delineate the iterative procedure of the algorithm (summarized concisely in Algorithm 1) and then rigorously establish its convergence properties through theoretical analysis.

Algorithm 1 Alternating Projection for LDM-NPSTM
Input: The training tensors $\{\mathcal{X}_p\}_{p=1}^{m_1}$, $\{\mathcal{Y}_q\}_{q=1}^{m_2} \subset \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_M}$, their corresponding labels $y_i \in \{+1, -1\}$, and the maximum number of iterations $N = 5000$.

Output: The parameters of the classification tensors $\mathcal{W}_1$ and $\mathcal{W}_2$.

  1. Initialize $\mathcal{W}_1$ and $\mathcal{W}_2$ written as a sum of random rank-one tensors and set $k = 0$.
  2. While $\|\mathcal{W}_i^{(k)} - \mathcal{W}_i^{(k-1)}\|_F/\|\mathcal{W}_i^{(k-1)}\|_F > \epsilon$ ($i = 1, 2$) and $k < N$ do:
  3. For $j = 1, \ldots, M$ do (number of modes):
    • Update variables $\boldsymbol{\alpha}_1$ in (3.31) and $\boldsymbol{\alpha}_2$ in (3.32) using the QPP toolkit.
    • Compute $\boldsymbol{\beta}_1$ from $\boldsymbol{\alpha}_1$ via (3.30), and $\boldsymbol{\beta}_2$ from $\boldsymbol{\alpha}_2$ analogously.
    • Set $\text{vec}(\tilde{\mathbf{U}}_1) \leftarrow \mathbf{V}^{(j)}\boldsymbol{\beta}_1$ and $\text{vec}(\tilde{\mathbf{U}}_2) \leftarrow \mathbf{V}^{(j)}\boldsymbol{\beta}_2$ (Theorem 3.1).
    • Calculate $\mathbf{U}^{(j)}_1$ and $\mathbf{U}^{(-j)}$ by (3.14), $\mathbf{U}^{(j)}_2$ and $\mathbf{U}^{(-j)}$ by (3.18).
  4. End for
  5. $\mathbf{W}_{1(j)}^{(k)} \leftarrow \mathbf{U}^{(j)}_1 \left(\mathbf{U}^{(-j)}\right)^\top$, $\mathbf{W}_{2(j)}^{(k)} \leftarrow \mathbf{U}^{(j)}_2 \left(\mathbf{U}^{(-j)}\right)^\top$.
  6. Form $\mathcal{W}_1^{(k)}$ and $\mathcal{W}_2^{(k)}$ from their mode-$j$ matricizations and set $k \leftarrow k + 1$.
  7. End while
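The stopping test in step 2 is a relative-change criterion on the classification tensors; a minimal sketch (with an illustrative tolerance) is given below.

```python
import numpy as np

# Minimal sketch of the relative-change stopping test in step 2 of Algorithm 1
# (hypothetical helper; epsilon is chosen here only for illustration).
def converged(W_new, W_old, eps=1e-4):
    return np.linalg.norm(W_new - W_old) / max(np.linalg.norm(W_old), 1e-12) <= eps

# The outer loop would stop once both classification tensors have stabilized
# or the iteration budget N is exhausted, e.g.:
#   while not (converged(W1_k, W1_prev) and converged(W2_k, W2_prev)) and k < N: ...
```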

The convergence of the algorithm is given below.

Theorem 3.4. Assume that $\{\mathcal{W}_i^{(k)}, \boldsymbol{\xi}^{(k,j)}\}_{j=1}^M$ ($i = 1, 2$) are the sequences generated by Algorithm 1. Here, $\boldsymbol{\xi}^{(k,j)}$ represents the slack solution of the $j$-th subproblem of problems (3.8) and (3.9), respectively. Then the sequences $\{f_i(\mathcal{W}_i^{(k)}, \boldsymbol{\xi}^{(k)})\}$ are each monotonically non-increasing and converge to limit points.

Proof. We define the index set $I = \{1, 2\}$, where for each $i \in I$, the notation $-i$ denotes the complementary element in $I$ (i.e., $-i = 2$ if $i = 1$, and vice versa). We aim to minimize problems (3.8) and (3.9). Specifically, we consider minimizing functions $f_i(\mathbf{U}^{(1)}_i, \ldots, \mathbf{U}^{(M)}_i, \boldsymbol{\xi}_{-i})$ of the form $f_i: \mathbb{R}^{I_1 \times R} \times \cdots \times \mathbb{R}^{I_M \times R} \times \mathbb{R}^{m_{-i}} \to \mathbb{R}$. The alternating optimization procedure can be written as:
$$
\left(\mathbf{U}^{(j)}_i\right)^* = \arg\min_{\mathbf{U}^{(j)}_i \in \mathbb{R}^{I_j \times R},\, \boldsymbol{\xi}_{-i}} g_j\left(\mathbf{U}^{(1)}_i, \ldots, \mathbf{U}^{(j-1)}_i, \mathbf{U}^{(j)}_i, \mathbf{U}^{(j+1)}_i, \ldots, \mathbf{U}^{(M)}_i, \boldsymbol{\xi}_{-i}\right),
$$
where $\{\mathbf{U}^{(l)}_i\}_{l \neq j}$ denotes the set of all variables other than $\mathbf{U}^{(j)}_i$, which are held fixed, and the superscript $*$ indicates the optimal solution at step $j$.

In each iteration $k$, the function $g_j$ is computed using $\mathbf{U}^{(1)}_i, \ldots, \mathbf{U}^{(j-1)}_i$ from the current iteration $k$ and $\mathbf{U}^{(j+1)}_i, \ldots, \mathbf{U}^{(M)}_i$ obtained from the previous iteration $k-1$.

Given an initialization $\{\mathcal{W}_i^{(0)}, \boldsymbol{\xi}_{-i}^{(0)}\}$, the alternating projection generates a sequence $\{\mathcal{W}_i^{(k)}, \boldsymbol{\xi}_{-i}^{(k)}\}$. By construction, each subproblem $g_j$ is a minimization step that guarantees:
$$
g_j\left(\mathbf{U}^{(1)}_i(k), \ldots, \mathbf{U}^{(j)}_i(k), \mathbf{U}^{(j+1)}_i(k-1), \ldots, \mathbf{U}^{(M)}_i(k-1), \boldsymbol{\xi}_{-i}(k)\right) \leq g_j\left(\mathbf{U}^{(1)}_i(k), \ldots, \mathbf{U}^{(j-1)}_i(k), \mathbf{U}^{(j)}_i(k-1), \ldots, \mathbf{U}^{(M)}_i(k-1), \boldsymbol{\xi}_{-i}(k-1)\right).
$$

Specifically, for any $j$ and $k$:
$$
f_i\left(\mathcal{W}_i^{(k)}, \boldsymbol{\xi}_{-i}^{(k)}\right) \leq g_{j-1}\left(\mathbf{U}^{(j-1)}_i(k), \boldsymbol{\xi}_{-i}^{(k)}\right) \leq f_i\left(\mathcal{W}_i^{(k-1)}, \boldsymbol{\xi}_{-i}^{(k-1)}\right).
$$

Therefore, the following holds:
$$
\bar{\gamma}_i = g_1\left(\mathbf{U}^{(1)}_i(1), \boldsymbol{\xi}_{-i}^{(1)}\right) \geq \cdots \geq g_M\left(\mathbf{U}^{(M)}_i(1), \boldsymbol{\xi}_{-i}^{(1)}\right) \geq \cdots \geq g_1\left(\mathbf{U}^{(1)}_i(k), \boldsymbol{\xi}_{-i}^{(k)}\right) \geq \cdots.
$$

Since $f_i$ is bounded below and the sequence is monotonically non-increasing, by the Monotone Convergence Theorem it converges to a limit $\gamma_i^*$. Formally, $\lim_{k \to \infty} g_j\left(\mathbf{U}^{(j)}_i(k), \boldsymbol{\xi}_{-i}^{(k)}\right) = \gamma_i^*$.

The overall algorithm $\Omega$ can be decomposed into sub-algorithms $\Omega = \Omega_1 \circ \cdots \circ \Omega_M$, where each $\Omega_j$ corresponds to solving $g_j$. $\Omega$ is a closed algorithm since all updates are performed by continuous functions. All sub-algorithms decrease the value of $f_i$, therefore it is clear that $\Omega$ is monotonic with respect to $f_i$. By the properties of monotonic algorithms combined with the closed nature ensured by continuous updates, $\Omega$ converges.

4 Numerical Experiments

To verify the effectiveness of the proposed optimization model for linear binary classification problems, we selected several types of datasets from Kaggle public databases and conducted a series of numerical experiments. These experiments allowed us to thoroughly investigate the model's performance in terms of classification accuracy and computational efficiency.

All numerical experiments were implemented in MATLAB 9.0 on a personal computer with an AMD Ryzen 7 4800H CPU (2.90 GHz) and 16 GB of RAM. For simplicity, we use TC and FC to denote the numbers of true classifications and false classifications, respectively, and use ACCU to denote classification accuracy, i.e., ACCU = TC/(TC + FC). The numerical analysis focuses on classification accuracy.

In our numerical experiments, we choose 5 widely-used classifiers, i.e., the classifiers developed respectively by [6], [32], [36], [42], and [34] to make a numerical comparison with the classifier established in this paper. To ensure the reliability of the experimental results, we employed 10-fold cross-validation and repeated the experiment ten times, with the final results averaged to minimize random variance.

The following table lists some information of these models when conducting the numerical experiments, where $\Theta = \{2^{-5}, 2^{-3}, \ldots, 2^7\}$.

Table 2 [TABLE:2]: Classification models, parameters, and references

| Method | Model | Algorithm | Number of Parameters | Parameter Set | Reference |
| --- | --- | --- | --- | --- | --- |
| LIBSVM | SVM | Algorithm 1 | 2 | $C \in \Theta$ | [6] |
| TWSVM | TWSVM | Algorithm 1 | 2 | $c_1, c_2 \in \Theta$ | [32] |
| STL | STM | Algorithm 1 | 2 | $C \in \Theta$ | [36] |
| TWSTM | TWSTM | Algorithm 1 | 2 | $c_1, c_2 \in \Theta$ | [42] |
| TBSTM | TBSTM | Algorithm 1 | 3 | $c_1, c_2, \lambda \in \Theta$ | [34] |
| LDM-NPSTM | LDM-NPSTM | Algorithm 1 | 6 | $\lambda_1, \lambda_2, \lambda_3, \lambda_4, c_3, c_4 \in \Theta$ | This paper |

4.1 Numerical Experiments on Skin Lesions

In this experiment, we selected four types of skin cancer with significant pathological differences for evaluation, namely actinic keratosis, dermatofibroma, squamous cell carcinoma, and vascular lesion. All sample images were converted to grayscale before the experiments. Related information about these skin cancer datasets is described in Table 3 [TABLE:3], where Size denotes the number of samples, Features denotes the number of features, and Ratio denotes this category's proportion of the total. To further assess classification performance between different skin cancer types, all possible pairwise combinations of the four skin cancer categories are constructed, and binary classification tasks are conducted accordingly. The numerical results for the datasets are given in Table 4 [TABLE:4], where W, T, and L denote the number of wins, ties, and losses against another classifier, respectively.

Table 3: Details of skin cancer datasets

| Data Set | Features | Size | Ratio |
| --- | --- | --- | --- |
| Actinic keratosis | 600×450 | | |
| Dermatofibroma | 600×450 | | |
| Squamous cell carcinoma | 600×450 | | |
| Vascular lesion | 600×450 | | |

Table 4: Numerical comparisons of 6 solvers for skin cancer datasets

| Data Sets | HSVM | TWSVM | STL | TWSTM | TBSTM | LDM-NPSTM | W/T/L |
| --- | --- | --- | --- | --- | --- | --- | --- |
| (Acti, Derm) | 53.69±1.85 | 53.69±1.63 | 62.70±0.89 | 63.50±1.73 | 62.79±4.59 | 63.79±2.40 | 6/0/0 |
| (Acti, Squa) | 62.70±0.94 | 65.72±4.38 | 43.38±7.84 | 71.62±0.66 | 67.53±2.99 | 88.35±1.79 | 6/0/0 |
| (Acti, Vasc) | 43.15±1.20 | 56.28±6.27 | 43.75±4.42 | 79.85±2.41 | 90.84±1.47 | 92.24±1.95 | 6/0/0 |
| (Derm, Squa) | 41.80±1.50 | 55.41±1.18 | 67.31±0.23 | 67.31±0.23 | 67.31±4.02 | 74.54±0.28 | 5/0/1 |
| (Derm, Vasc) | 59.51±4.23 | 55.34±1.67 | 65.82±5.31 | 55.41±1.18 | 76.45±1.11 | 78.94±3.71 | 6/0/0 |
| (Squ, Vasc) | 53.13±8.84 | 43.75±4.42 | 81.34±5.21 | 68.67±0.29 | 67.85±4.17 | 83.98±0.41 | 6/0/0 |

Table 4 indicates that LDM-NPSTM generally achieves higher classification accuracy across the datasets compared to HSVM, TWSVM, STL, TWSTM, and TBSTM. It also tends to exhibit lower time costs, smaller standard deviations, and fewer losses in the W/T/L statistics, suggesting better overall effectiveness, efficiency, and robustness.

4.2 Numerical Experiments on Color Image Datasets

In this subsection, we evaluate the performance of the proposed algorithm on color image datasets, focusing on classification accuracy and computational efficiency. A dataset of horse breed images is employed, where each image is represented as a third-order tensor $\mathcal{X}_i \in \mathbb{R}^{n_1 \times n_2 \times 3}$, with the last dimension corresponding to the RGB channels. Related information about the dataset is described in Table 5 [TABLE:5].

Table 5: Details of horse breeds datasets

| Dataset | Features | Size | Ratio |
| --- | --- | --- | --- |
| Akhal-Teke | 256×256×3 | | |
| Appaloosa | 256×256×3 | | |
| Orlov Trotter | 256×256×3 | | |
| Vladimir Heavy | 256×256×3 | | |
| Percheron | 256×256×3 | | |
| Arabian | 256×256×3 | | |
| Friesian | 256×256×3 | | |

The experimental results are reported in Table 6 [TABLE:6]. As observed from the table, in the RGB image classification tasks, the LDM-NPSTM algorithm generally exhibits better overall performance compared to the other methods. Across most datasets, it achieves relatively higher average classification accuracies and lower time costs. In particular, according to the W/T/L statistics, LDM-NPSTM records the fewest losses among all methods, further indicating its effectiveness and advantages in image classification tasks.

Table 6: Numerical comparisons of 6 solvers for horse breeds datasets

| Dataset | HSVM | TWSVM | STL | TWSTM | TBSTM | LDM-NPSTM | W/T/L |
| --- | --- | --- | --- | --- | --- | --- | --- |
| (Akha, Appa) | 55.73±11.18 | 55.73±11.18 | 54.02±3.49 | 59.09±9.55 | 55.53±4.50 | 59.78±5.47 | 6/0/0 |
| (Akha, Orlo) | 66.50±10.48 | 65.22±13.95 | 56.52±6.74 | 57.25±6.97 | 60.14±6.97 | 63.04±5.61 | 6/0/0 |
| (Akha, Vlad) | 48.81±1.68 | 68.75±17.68 | 73.96±2.55 | 76.04±4.70 | 76.04±2.55 | 76.56±3.13 | 6/0/0 |
| (Akha, Perc) | 61.43±26.26 | 65.36±17.56 | 65.36±4.75 | 67.16±8.98 | 67.32±1.60 | 67.65±1.96 | 5/0/1 |
| (Akha, Arab) | 60.48±5.98 | 46.92±1.53 | 61.08±6.92 | 55.69±4.45 | 62.44±7.27 | 59.00±2.97 | 6/0/0 |
| (Akha, Frie) | 55.63±4.89 | 55.25±10.25 | 63.89±6.76 | 65.92±9.84 | 74.81±10.88 | 74.54±3.39 | 6/0/0 |
| (Appa, Orlo) | 50.00±3.07 | 55.63±11.33 | 58.66±7.62 | 62.55±8.13 | 68.72±12.34 | 68.29±2.88 | 6/0/0 |
| (Appa, Vlad) | 59.38±13.26 | 72.14±11.11 | 76.98±10.60 | 73.33±4.62 | 73.25±2.77 | 79.40±5.09 | 6/0/0 |
| (Appa, Perc) | 40.20±9.71 | 57.54±1.82 | 56.74±6.40 | 63.85±7.60 | 65.99±3.13 | 63.14±4.12 | 6/0/0 |
| (Appa, Arab) | 53.17±7.31 | 55.47±4.28 | 55.47±4.28 | 59.85±5.78 | 62.81±7.44 | 62.70±5.10 | 6/0/0 |
| (Appa, Frie) | 81.75±8.13 | 69.30±5.78 | 69.30±5.78 | 84.75±8.97 | 79.58±8.94 | 88.04±4.16 | 6/0/0 |
| (Orlo, Vlad) | 56.25±8.84 | 69.29±5.74 | 69.29±5.74 | 79.52±10.48 | 73.89±2.42 | 74.64±2.62 | 6/0/0 |
| (Orlo, Perc) | 53.52±15.98 | 67.77±7.62 | 67.77±7.62 | 62.44±12.22 | 64.64±2.28 | 65.63±3.47 | 6/0/0 |
| (Orlo, Arab) | 68.75±8.84 | 57.61±5.79 | 57.61±5.79 | 62.88±8.24 | 56.23±5.26 | 66.52±4.38 | 6/0/0 |
| (Orlo, Frie) | 75.00±8.71 | 64.23±6.96 | 64.23±6.96 | 63.75±11.18 | 68.75±14.73 | 69.17±6.56 | 6/0/0 |
| (Vlad, Perc) | 66.67±11.79 | 68.33±6.58 | 68.33±6.58 | 68.33±9.63 | 71.85±5.69 | 74.17±10.67 | 6/0/0 |
| (Vlad, Arab) | 60.42±14.73 | 78.93±5.11 | 78.93±5.11 | 75.90±9.83 | 83.33±10.21 | 82.38±2.98 | 5/0/1 |
| (Vlad, Frie) | 68.33±9.13 | 74.86±7.65 | 74.86±7.65 | 73.89±11.99 | 71.74±10.02 | 77.81±3.29 | 6/0/0 |
| (Perc, Arab) | 68.58±8.84 | 68.25±2.50 | 68.25±2.50 | 66.23±9.96 | 67.32±3.86 | 73.20±3.15 | 6/0/0 |
| (Perc, Frie) | 66.67±8.33 | 67.37±6.09 | 67.37±6.09 | 67.37±4.98 | 69.12±9.28 | 71.81±4.61 | 6/0/0 |
| (Arab, Frie) | 75.42±6.48 | 62.42±6.90 | 62.42±6.90 | 60.36±8.78 | 72.61±6.69 | 69.50±6.35 | 6/0/0 |
| W/T/L | 14/0/7 | 19/0/2 | 13/0/8 | 19/0/2 | 18/0/3 | 21/0/0 | |

Next, the numerical results are statistically compared using the Friedman test and the Nemenyi post hoc test [8]. The Friedman test is first applied to the ACCU values of the different classifiers to evaluate whether significant differences exist. Let $\bar{R}_j$ denote the average rank of the $j$-th algorithm over the $N$ datasets. Under the null hypothesis that all algorithms perform equivalently, the average ranks $\bar{R}_j$ should be equal. The Friedman statistic is computed as:
$$
\chi_F^2 = \frac{12N}{K(K+1)}\left[\sum_{j=1}^{K} \bar{R}_j^2 - \frac{K(K+1)^2}{4}\right].
$$

If the null hypothesis is rejected, the Nemenyi post hoc test is conducted to identify specific differences. Two classifiers are considered significantly different if the absolute difference in their average ranks exceeds the critical difference (CD), given by:
$$
\text{CD} = q_\alpha \sqrt{\frac{K(K+1)}{6N}},
$$
where $K$ and $N$ denote the number of algorithms and datasets, respectively, and $q_\alpha$ is the critical value of the Studentized range statistic at significance level $\alpha$, which can be obtained from the threshold table of $q_\alpha$ (see Table 5 in [8]).
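To make the procedure concrete, the following short Python sketch implements the two formulas above; the function name, the SciPy-based $P$-value computation, and the commented example values are our own illustrative choices rather than part of the original experiments, and the resulting numbers depend on the $q_\alpha$ value taken from [8].

```python
import numpy as np
from scipy.stats import chi2

def friedman_and_cd(avg_ranks, n_datasets, q_alpha):
    """Friedman chi-square statistic, its p-value, and the Nemenyi critical
    difference, computed from the average ranks of K algorithms over N
    datasets using the two formulas above (cf. Demsar [8])."""
    ranks = np.asarray(avg_ranks, dtype=float)
    k = ranks.size
    chi_f = 12.0 * n_datasets / (k * (k + 1)) * (
        np.sum(ranks ** 2) - k * (k + 1) ** 2 / 4.0)
    p_value = chi2.sf(chi_f, df=k - 1)  # chi-square approximation, K-1 d.o.f.
    cd = q_alpha * np.sqrt(k * (k + 1) / (6.0 * n_datasets))
    return chi_f, p_value, cd

# Illustrative (commented) call with the average ranks from Table 7, where
# q_alpha is read from the threshold table in [8] for the chosen K and alpha:
# chi_f, p, cd = friedman_and_cd([6.0, 4.96, 4.04, 3.0, 2.0, 1.0],
#                                n_datasets=21, q_alpha=2.850)
```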

In the numerical experiments of this subsection, 6 classifiers are evaluated across 21 datasets, and their rankings based on ACCU are reported in Table 7 [TABLE:7]. Using the average ranks $\bar{R}_j$ provided in the last row of the table, the Friedman test yields a $P$-value of $1.78 \times 10^{-7}$. At the commonly adopted significance level $\alpha = 0.05$, this $P$-value falls far below the threshold, so the null hypothesis is rejected and significant differences exist among the classifiers. Substituting $K = 6$, $N = 21$, and $\alpha = 0.05$ into the formula for the critical difference gives CD = 0.5081. Based on this, the Nemenyi post hoc comparison results are shown on the left side of Fig. 4 [FIGURE:4]. The absolute differences in mean rank between our model and the other models exceed the CD value, indicating statistically significant differences in ACCU between LDM-NPSTM and most of the other models.

Similarly, Table 8 [TABLE:8] presents the rankings of the CPU times of the 6 classifiers across the 27 datasets. In this case, the Friedman test yields a $P$-value of $3.03 \times 10^{-17}$, with the same CD value of 0.5081. The corresponding Nemenyi post hoc results are shown on the right side of Fig. 4. The absolute differences in mean rank between LDM-NPSTM and all other models except HSVM exceed the CD value, demonstrating statistically significant differences in CPU time between LDM-NPSTM and most of the other models.

Table 7: Average rank of 6 solvers for ACCU

Dataset        HSVM   TWSVM   STL    TWSTM   TBSTM   LDM-NPSTM
(Acti, Derm)   6      5       4      3       2       1
(Acti, Squa)   6      5       4      3       2       1
(Acti, Vasc)   6      5       4      3       2       1
(Derm, Squa)   6      5       3.5    3.5     2       1
(Derm, Vasc)   6      5       4      3       2       1
(Squ, Vasc)    6      5       4      3       2       1
(Akha, Appa)   6      5       4      3       2       1
(Akha, Orlo)   6      5       4      3       2       1
(Akha, Vlad)   6      5       4      3       2       1
(Akha, Perc)   6      5       4      3       2       1
(Akha, Arab)   6      5       4      3       2       1
(Akha, Frie)   6      5       4      3       2       1
(Appa, Orlo)   6      5       4      3       2       1
(Appa, Vlad)   6      5       4      3       2       1
(Appa, Perc)   6      5       4      3       2       1
(Appa, Arab)   6      5       4      3       2       1
(Appa, Frie)   6      5       4      3       2       1
(Orlo, Vlad)   6      5       4      3       2       1
(Orlo, Perc)   6      5       4      3       2       1
(Orlo, Arab)   6      5       4      3       2       1
(Orlo, Frie)   6      5       4      3       2       1
(Vlad, Perc)   6      5       4      3       2       1
(Vlad, Arab)   6      4.5     4.5    3       2       1
(Vlad, Frie)   6      5       4      3       2       1
(Perc, Arab)   6      5       4      3       2       1
(Perc, Frie)   6      5       4      3       2       1
(Arab, Frie)   6      5       4      3       2       1
Average Rank   6.0    4.96    4.04   3.0     2.0     1.0

Table 8: Average rank of 6 solvers for CPU time

Dataset        HSVM   TWSVM   STL    TWSTM   TBSTM   LDM-NPSTM
(Acti, Derm)   3      4       5      6       2       1
(Acti, Squa)   3      4       5      6       2       1
(Acti, Vasc)   3      4       5      6       2       1
(Derm, Squa)   3      4       5      6       2       1
(Derm, Vasc)   3      4       5      6       2       1
(Squ, Vasc)    3      4       5      6       2       1
(Akha, Appa)   3      4       5      6       2       1
(Akha, Orlo)   3      4       5      6       2       1
(Akha, Vlad)   3      4       5      6       2       1
(Akha, Perc)   3      4       5      6       2       1
(Akha, Arab)   3      4       5      6       2       1
(Akha, Frie)   3      4       5      6       2       1
(Appa, Orlo)   3      4       5      6       2       1
(Appa, Vlad)   3      4       5      6       2       1
(Appa, Perc)   3      4       5      6       2       1
(Appa, Arab)   3      4       5      6       2       1
(Appa, Frie)   3      4       5      6       2       1
(Orlo, Vlad)   3      4       5      6       2       1
(Orlo, Perc)   3      4       5      6       2       1
(Orlo, Arab)   3      4       5      6       2       1
(Orlo, Frie)   3      4       5      6       2       1
(Vlad, Perc)   3      4       5      6       2       1
(Vlad, Arab)   3      4       5      6       2       1
(Vlad, Frie)   3      4       5      6       2       1
(Perc, Arab)   3      4       5      6       2       1
(Perc, Frie)   3      4       5      6       2       1
(Arab, Frie)   3      4       5      6       2       1
Average Rank   3.0    4.0     5.0    6.0     2.0     1.0

Figure 4 [FIGURE:4]: Nemenyi post hoc test with ACCU (left) and CPU time (right)

4.3 Numerical Experiments on Small Sample Classification Problems

In this experiment, we used the musical instruments dataset to validate the effectiveness of our model on small sample classification problems. This image dataset contains 30 categories of musical instruments. The number of images per category varies considerably, from about 70 to 200, while all images share the same size of 200×300×3. We selected six instrument types, namely drums, harp, piano, sax, sitar, and violin, for the binary classification experiments, using 60 color images per category.

Table 9 [TABLE:9] lists the classification results of the five comparison methods and the proposed model. Our model achieves comparable or better performance than the baselines on almost all class pairs, indicating its potential advantage for small sample classification problems.

Table 9: Numerical comparisons of 6 solvers for musical instruments datasets

Dataset           HSVM          TWSVM         STL           TWSTM         TBSTM         LDM-NPSTM     W/T/L
(drums, harp)     56.25±20.62   74.66±5.79    54.02±3.49    59.09±9.55    55.53±4.50    59.78±5.47    6/0/0
(drums, piano)    73.17±6.03    63.73±5.98    56.52±6.74    57.25±6.97    60.14±6.97    63.04±5.61    6/0/0
(drums, sax)      78.54±5.61    73.18±6.02    73.96±2.55    76.04±4.70    76.04±2.55    76.56±3.13    5/0/1
(drums, sitar)    60.84±6.00    55.25±10.25   65.36±4.75    67.16±8.98    67.32±1.60    67.65±1.96    5/0/1
(drums, violin)   58.92±5.98    55.63±11.33   61.08±6.92    55.69±4.45    62.44±7.27    59.00±2.97    6/0/0
(harp, piano)     56.25±8.84    72.14±11.11   58.66±7.62    62.55±8.13    68.72±12.34   68.29±2.88    6/0/0
(harp, sax)       53.52±15.98   57.54±1.82    76.98±10.60   73.33±4.62    73.25±2.77    79.40±5.09    6/0/0
(harp, sitar)     68.75±8.84    57.61±5.79    56.23±5.26    62.88±8.24    56.23±5.26    66.52±4.38    6/0/0
(harp, violin)    75.00±8.71    64.23±6.96    68.33±6.58    68.33±9.63    71.85±5.69    74.17±10.67   6/0/0
(piano, sax)      66.67±11.79   68.33±6.58    78.93±5.11    75.90±9.83    83.33±10.21   82.38±2.98    5/0/1
(piano, sitar)    60.42±14.73   78.94±3.71    78.94±3.71    62.81±8.39    76.02±9.70    69.17±6.56    6/0/0
(piano, violin)   68.33±9.13    83.98±0.41    83.98±0.41    73.89±11.99   71.74±10.02   77.81±3.29    6/0/0
(sax, sitar)      56.25±20.62   63.79±2.40    63.79±2.40    63.75±11.18   70.83±17.68   71.84±6.09    6/0/0
(sax, violin)     73.17±6.03    88.35±1.79    88.35±1.79    72.92±14.73   56.25±8.84    73.75±6.97    6/0/0
(sitar, violin)   78.54±5.61    92.24±1.95    92.24±1.95    83.33±5.89    58.33±5.11    86.67±6.11    6/0/0
W/T/L             11/0/4        13/0/2        12/0/3        10/0/5        12/0/3        15/0/0

5 Conclusion

In this paper, we proposed a novel LDM-NPSTM for binary classification tasks, which integrates both the distributional and the sample information of the training data within a tensor-based learning framework. Specifically, the separating hyperplane parameters of LDM-NPSTM form a tensorplane decomposed into a sum of rank-one tensors via CP decomposition, enabling efficient exploitation of multiway structural information. The corresponding optimization problems were solved iteratively using an alternating projection strategy, where each subproblem along a single tensor mode was formulated as a standard SVM-type convex optimization problem. The efficiency of the proposed method was illustrated on skin lesion datasets, color image datasets, and small sample classification problems.
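As a purely illustrative sketch rather than the authors' implementation, the snippet below shows how a decision value $\langle \mathcal{W}, \mathcal{X}\rangle + b$ can be evaluated when the weight tensor is kept in CP form as a sum of rank-one tensors, so that the full weight tensor never needs to be formed; the factor-matrix names and the closing decision-rule comment are assumptions made only for this example.

```python
import numpy as np

def cp_decision_value(X, U, V, W, b):
    """Evaluate <W_cp, X> + b, where W_cp = sum_r U[:, r] o V[:, r] o W[:, r]
    (outer products), without ever forming the full weight tensor.
    X: n1 x n2 x n3 sample; U, V, W: factor matrices with R columns; b: bias."""
    score = b
    for r in range(U.shape[1]):
        # <u_r o v_r o w_r, X> = sum_{ijk} X_{ijk} * u_r[i] * v_r[j] * w_r[k]
        score += np.einsum('ijk,i,j,k->', X, U[:, r], V[:, r], W[:, r])
    return float(score)

# In a nonparallel (twin-tensorplane) setting, a test sample would typically be
# assigned to the class whose tensorplane yields the smaller absolute score,
# mirroring the usual twin-SVM decision rule.
```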

References

[1] Amayri, O., Bouguila, N.: A study of spam filtering using support vector machines. Artificial Intelligence Review. 34, 73-108 (2010).

[2] Burges, C. J. C.: A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery. 2(2), 121-167 (1998).

[3] Bollegala, D., Matsuo, Y., Ishizuka, M.: A web search engine-based approach to measure semantic similarity between words. IEEE Transactions on Knowledge and Data Engineering. 23(7), 977-990 (2010).

[4] Cao, L. J., Tay, F. E. H.: Support vector machine with adaptive parameters in financial time series forecasting. IEEE Transactions on Neural Networks. 14(6), 1506-1518 (2003).

[5] Chen, C., et al.: Kernelized support tensor train machines. Pattern Recognition. 122, 108337 (2022).

[6] Cortes, C., Vapnik, V.: Support-vector networks. Machine Learning. 20, 273-297 (1995).

[7] De Lathauwer, L.: Signal processing based on multilinear algebra. PhD thesis, Katholieke Universiteit Leuven, Leuven (1997).

[8] Demšar, J.: Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research. 7(1), 1-30 (2006).

[9] Dhanjal, C., Gunn, S. R., Shawe-Taylor, J.: Efficient sparse kernel feature extraction based on partial least squares. IEEE Transactions on Pattern Analysis and Machine Intelligence. 31(8), 1347-1361 (2008).

[10] Etemad, K., Chellappa, R.: Discriminant analysis for recognition of human face images. Journal of the Optical Society of America A, 14(8), 1724-1733 (1997).

[11] Gao, W., Zhou, Z. H.: On the doubt about margin explanation of boosting. Artificial Intelligence. 203, 1-18 (2013).

[12] Gavrila, D. M.: The visual analysis of human movement: a survey. Computer Vision and Image Understanding. 73(1), 82-98 (1999).

[13] Green, R. D., Guan, L.: Quantifying and recognizing human movement patterns from monocular video images-part II: applications to biometrics. IEEE Transactions on Circuits and Systems for Video Technology. 14(2), 191-198 (2004).

[14] He, Z., et al.: Support tensor machine with dynamic penalty factors and its application to the fault diagnosis of rotating machinery with unbalanced data. Mechanical Systems and Signal Processing. 141, 106441 (2020).

[15] Isa, D., Lee, L. H., Kallimani, V. P., Rajkumar, R.: Text document preprocessing with the Bayes formula for classification using the support vector machine. IEEE Transactions on Knowledge and Data Engineering. 20(9), 1264-1272 (2008).

[16] Jayadeva, Khemchandani, R., Chandra, S.: Twin support vector machines for pattern classification. IEEE Transactions on Pattern Analysis and Machine Intelligence. 29(5), 905-910 (2007).

[17] Khemchandani, R., Karpatne, A., Chandra, S.: Proximal support tensor machines. International Journal of Machine Learning and Cybernetics. 4, 703-712 (2013).

[18] Kim, M., et al.: Moving object segmentation in video sequences by user interaction and automatic object tracking. Image and Vision Computing. 19(5), 245-260 (2001).

[19] Kolda, T. G., Bader, B. W.: Tensor decompositions and applications. SIAM Review. 51(3), 455-500 (2009).

[20] Kolda, T. G.: Multilinear operators for higher-order decompositions. Technical Report SAND2006-2081, Sandia National Laboratories, Albuquerque, NM and Livermore, CA (2006).

[21] Li, Y., Guan, C.: Joint feature re-extraction and classification using an iterative semi-supervised support vector machine algorithm. Machine Learning. 71, 33-53 (2008).

[22] Lu, H., Plataniotis, K. N., Venetsanopoulos, A. N.: Multilinear principal component analysis of tensor objects. 18th International Conference on Pattern Recognition. 2, 776-779 (2006).

[23] Lu, H., Plataniotis, K. N., Venetsanopoulos, A. N.: A taxonomy of emerging multilinear discriminant analysis solutions for biometric signal recognition. Biometrics: Theory, Methods, and Applications. 21-45 (2009).

[24] Mangasarian, O. L., Wild, E. W.: Multisurface proximal support vector classification via generalized eigenvalues. IEEE Transactions on Pattern Analysis and Machine Intelligence. 28(1), 69-74 (2006).

[25] Narwaria, M., Lin, W.: Objective image quality assessment based on support vector regression. IEEE Transactions on Neural Networks. 21(3), 515-519 (2010).

[26] Plataniotis, K. N.: Color image processing and applications. Measurement Science and Technology. 12(2), 222-222 (2001).

[27] Rahman, M. M., Antani, S. K., Thoma, G. R.: A learning-based similarity fusion and filtering approach for biomedical image retrieval using SVM classification and relevance feedback. IEEE Transactions on Information Technology in Biomedicine. 15(4), 640-646 (2011).

[28] Renard, N., Bourennane, S.: Dimensionality reduction based on tensor modeling for classification methods. IEEE Transactions on Geoscience and Remote Sensing. 47(4), 1123-1131 (2009).

[29] Sahbi, H., Audibert, J. Y., Keriven, R.: Context-dependent kernels for object classification. IEEE Transactions on Pattern Analysis and Machine Intelligence. 33(4), 699-708 (2010).

[30] Sain, S. R.: The nature of statistical learning theory (book review). Technometrics. 38(4), 409 (1996).

[31] Vapnik, V.: The nature of statistical learning theory. Springer Science & Business Media (2013).

[32] Shao, Y. H., et al.: Improvements on twin support vector machines. IEEE Transactions on Neural Networks. 22(6), 962-968 (2011).

[33] Shen, K. Q., et al.: Feature selection via sensitivity analysis of SVM probabilistic outputs. Machine Learning. 70, 1-20 (2008).

[34] Shi, H., et al.: Twin bounded support tensor machine for classification. International Journal of Pattern Recognition and Artificial Intelligence. 30(1), 1650002 (2016).

[35] Tao, D., Li, X., Hu, W., Maybank, S., Wu, X.: Supervised tensor learning. In: Fifth IEEE International Conference on Data Mining (ICDM). 8 pp. (2005).

[36] Tao, D., Li, X., Wu, X., Hu, W., Maybank, S. J.: Supervised tensor learning. Knowledge and Information Systems. 13(1), 1-42 (2007).

[37] Vapnik, V. N.: An overview of statistical learning theory. IEEE Transactions on Neural Networks. 10(5), 988-999 (1999).

[38] Wang, H., Ahuja, N.: Compact representation of multidimensional data using tensor rank-one decomposition. In: Proceedings of the 17th International Conference on Pattern Recognition (ICPR). 1, 44-47 (2004).

[39] Wang, L., et al.: A refined margin analysis for boosting algorithms via equilibrium margin. The Journal of Machine Learning Research. 12, 1835-1863 (2011).

[40] Zhang, L., et al.: A novel dual-center-based intuitionistic fuzzy twin bounded large margin distribution machines. IEEE Transactions on Fuzzy Systems. 31(9), 3121-3134 (2023).

[41] Zhang, T., Zhou, Z. H.: Large margin distribution machine. In: Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 313-322 (2014).

[42] Zhang, X., Gao, X., Wang, Y.: Twin support tensor machines for MCs detection. Journal of Electronics (China). 26(3), 318-325 (2009).
