
Open Access 09.04.2024 | Original Article

UCDCN: a nested architecture based on central difference convolution for face anti-spoofing

Authors: Jing Zhang, Quanhao Guo, Xiangzhou Wang, Ruqian Hao, Xiaohui Du, Siying Tao, Juanxiu Liu, Lin Liu

Published in: Complex & Intelligent Systems


Abstract

The significance of face anti-spoofing algorithms in enhancing the security of facial recognition systems cannot be overstated. Current approaches aim to compensate for the model's shortcomings in capturing spatial information by leveraging spatio-temporal information from multiple frames. However, the additional branches used to extract inter-frame details increase the model's parameter count and computational workload, leading to a decrease in inference efficiency. To address this, we have developed a robust and easily deployable face anti-spoofing algorithm. In this paper, we propose Central Difference Convolution UNet++ (UCDCN), which takes advantage of central difference convolution and improves the characterization of invariant details in diverse environments. In particular, we leverage domain knowledge from image segmentation and propose a multi-level feature fusion network structure to enhance the model's ability to capture semantic information, which is beneficial for face anti-spoofing tasks. In this manner, UCDCN greatly reduces the number of model parameters and achieves satisfactory metrics on three popular benchmarks, i.e., Replay-Attack, Oulu-NPU and SiW.

Introduction

In the following section, we will delve into the significance of face anti-spoofing, highlighting the shortcomings of traditional approaches and early deep learning methods. Furthermore, we will explore the use of auxiliary supervision, central difference convolution, and semantic information as potential solutions to enhance the effectiveness of face anti-spoofing techniques.

The importance of face anti-spoofing

With the continuous development of science and technology, biological information is increasingly being used in authentication systems as a replacement for traditional character passwords that require memorization. Due to their unique characteristics, human faces, as one of the most important biological features of human beings, have been widely used in various interactive systems, such as mobile phone unlocking, account control, permission access, and mobile payment, to provide more convenient operation. However, the presence of fake or spoofed faces limits the reliability of face interactive systems. Printed-face and replayed-video attacks can easily fool existing face recognition systems, leading to incorrect judgments. To ensure the effectiveness of face recognition, it is necessary to design a robust and easy-to-deploy face anti-spoofing system.

Limitation of traditional methods and early deep learning methods

In recent years, face anti-spoofing tasks have attracted a significant amount of attention from researchers. While early research in face anti-spoofing relied on hand-crafted features, such as LBP [1], LBP-TOP [2], HOG [3, 4], and SURF [5], these methods lack robustness and generalization capabilities. Hand-crafted features are not specifically designed for face anti-spoofing tasks and may not accurately represent the underlying data. In addition, they perform poorly when faced with high-definition images or the absence of detailed invariant information.
To overcome these limitations, researchers have turned to deep learning methods [6], which have shown effectiveness in extracting discriminative features and improving the generalization capability of face anti-spoofing systems. Convolutional neural networks (CNNs) have been utilized to design face anti-spoofing networks, leveraging their powerful feature extraction capabilities. However, existing methods often rely on fine-tuning image classification networks, which may not capture the essential features distinguishing spoof and real faces accurately. Binary supervision used in these networks can lead to overfitting and difficulty in generalizing to external datasets.

Auxiliary supervision, central difference convolution and semantic information

To address these challenges, researchers have proposed the use of auxiliary supervision, such as depth map supervision [7] and rPPG [8], to guide network learning and enhance the performance of face anti-spoofing. Most high-performance models rely on multi-frame input [7, 9–11], because traditional convolution extracts insufficient spatial features from a single frame. However, multi-frame models are difficult to deploy in actual production environments.
To extract more spatial features from a single frame, Central Difference Convolution (CDC) [12–15] has been introduced. CDC combines traditional convolution with LBP to extract gradient information and improves the characterization ability of traditional convolution. A large number of ablation experiments have shown that central difference convolution is more suitable for face anti-spoofing tasks and better describes the invariant information of details. In various environments, central difference convolution is more likely than traditional convolution to extract inherent spoofing patterns, such as lattice artifacts.
Furthermore, many high-performance face anti-spoofing networks use domain knowledge from other visual tasks to improve their performance, such as NAS [12] and De-X [16]. Therefore, considering the importance of semantic information in segmentation tasks, we transfer domain knowledge from segmentation to face anti-spoofing. We draw on the UNet++ structure to gather multi-level features of the image and enhance the ability to obtain semantic information.

Summary

In summary, the motivation of this research is driven by the importance of face anti-spoofing in various interactive systems. Existing methods face challenges in terms of reliability, robustness, and generalization capabilities. This research aims to design a robust and easy-to-deploy face anti-spoofing system by leveraging auxiliary supervision, central difference convolution, and domain knowledge transfer. This article makes the following main contributions:
1.
We construct a face anti-spoofing network with central difference convolution instead of traditional convolution, which improves the characterization of invariant details in diverse environments.
 
2.
We leverage domain knowledge from image segmentation and propose a multi-level feature fusion network structure to enhance the model's ability to capture semantic information. To the best of our knowledge, this is the first time that domain knowledge from image segmentation has been applied to face anti-spoofing. We name the designed network structure UCDCN, as shown in Fig. 1.
 
3.
We have redefined the loss function and training strategy to prevent overfitting.
 
4.
The network structure we designed achieves excellent performance on both internal and external datasets with minimal training, demonstrating the effectiveness of our model.
 
Related work

In this section, we provide an overview of existing research on face anti-spoofing, covering texture-based methods, time-based methods, and depth map auxiliary supervision.

Texture-based method

Most previous methods for face anti-spoofing are based on hand-crafted features and color texture analysis, such as LBP [1], LBP-TOP [2], HOG [3, 4] and SURF [5]; they rely on texture differences between live and spoof faces and classify them with traditional SVM [17] or LDA [18]. LBP, LBP-TOP, HOG, SIFT, and SURF are not specifically designed for face anti-spoofing, and their feature extraction ability is relatively limited. To overcome this, Jianwei Yang et al. [19] exploited the powerful feature extraction ability of CNNs and introduced it to face anti-spoofing tasks. The CNN model achieved impressive results through supervised training with a binary softmax loss. However, since face anti-spoofing hinges on a vast amount of detailed information, traditional CNNs tend to miss valid and generalizable detail cues such as moiré stripes, lattice artifacts, and phone borders. Furthermore, some studies [7] have indicated that face anti-spoofing models using binary CNNs are prone to overfitting, and their performance is susceptible to environmental changes, such as lighting changes, posture changes, and the spoofing medium. Some high-precision texture-based models have therefore abandoned binary softmax supervision and used auxiliary supervision to guide model training toward generalizable cues. For instance, some researchers [20] have used a learn-to-learn network to extract meta patterns and obtain discriminative information instead of hand-crafted feature extraction. In addition, depth map-based supervision [7] has been explored, showing significantly improved performance compared to previous binary supervision models.

Time-based method

The time-based method is one of the earliest schemes applied in face anti-spoofing. Some studies have used multi-frame input to capture facial motion features such as eye blinks [21] and lip motion [22] to achieve face anti-spoofing. However, such methods can easily be fooled by paper-cut attacks, where the eye and lip regions of the printed photo are cut out. Since the structures of living faces differ significantly from these spoof faces, it is challenging to design effective time-based face anti-spoofing systems. Some researchers [23] have implemented face anti-spoofing by comparing Fourier spectra of consecutive frames in a time series, but this approach relies heavily on accurate face localization and performs poorly on high-definition images. Time-based approaches [24, 25] require multiple frames as input, and multi-frame detection models are more challenging to deploy in production environments than single-frame detection models. Therefore, it is crucial to design robust single-frame face anti-spoofing models.
Table 1
Study contributions and research gaps

| Authors | Method | Year | Advantages and limitations |
|---|---|---|---|
| Pan Gang et al. [21, 22] | Facial motion feature-based method | 2007 | Uses multi-frame input to capture facial motion features, but can be easily confused by paper-cut attacks |
| Määttä et al. [1] | Traditional texture-based method | 2011 | Relies on texture features, but its feature extraction ability is limited |
| Liu, Shuying et al. [36] | Binary CNN-based method | 2015 | Exploits the powerful feature extraction ability of CNNs, but may not accurately capture the essential features distinguishing spoof and real faces and tends to overfit |
| Atoum, Yousef et al. [26, 27] | Pseudo-depth label-based method | 2017 | First uses pseudo-depth labels to guide a network, which enhances face anti-spoofing performance |
| Wang Zhuo et al. [24, 25] | Multi-frame based method | 2021 | Requires multiple frames as input and is challenging to deploy in production environments |
| This study | Depth map supervision CDC network | 2023 | Uses depth labels to guide the network and CDC to extract more spatial features; leverages domain knowledge to enhance the model's ability to capture semantic information |

Depth map auxiliary supervision

Estimating face depth from a single RGB facial image is a highly challenging computer vision problem that plays a crucial role in face anti-spoofing. Previous research has explored different approaches to tackle this challenge. Atoum et al. [26] first utilized pseudo-depth labels to guide a multi-scale fully convolutional network, while Zitong et al. [27] proposed a pyramid supervision technique to capture both local details and global semantics. Wang et al. [28] introduced a Generative Adversarial Network (GAN) to transfer RGB face images to the depth domain. Reference [29] provides a new method for multi-view data processing. Yahang et al. [30] introduced facial depth as well as the boundary of the spoof medium, moiré patterns, and reflection artifacts to imitate human decisions. Wang Yu et al. [31] proposed a face anti-spoofing method based on client identity information using a Siamese network and employed the depth map as auxiliary information to improve performance. Jie Jiang et al. [32] employed GCBlock to better mine face depth information for auxiliary supervision. These advancements have been made possible by the widespread use of convolutional neural networks in facial analysis. In recent years, face 3D reconstruction techniques, such as the 3D Morphable Model (3DMM) proposed by Booth et al. [33], have significantly contributed to the development of face anti-spoofing methods. Moreover, Jianzhu et al. introduced 3DDFAv2 [34, 35], which fits dense 3D face models to face images with a CNN and enables efficient processing on a CPU, reducing the time required for dataset processing. In this study, we leverage 3DDFAv2 [34, 35] to generate depth maps for live faces and use flat zero matrices as ground truth for the various spoof faces, including print attacks and replay attacks. The study contributions and research gaps are summarized in Table 1.
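As a concrete illustration of this labeling scheme, the minimal sketch below builds the ground-truth depth map for one sample; the `estimate_depth` wrapper standing in for 3DDFA_V2 is hypothetical, and the min-max normalization is our own choice.

```python
import numpy as np


def make_depth_label(face_img, is_live, estimate_depth, size=128):
    """Ground-truth depth used for auxiliary supervision: a normalized face
    depth map for live samples and an all-zero map for print/replay attacks.
    `estimate_depth` is a hypothetical wrapper around an external estimator
    such as 3DDFA_V2 and is assumed to return an HxW float array."""
    if not is_live:
        return np.zeros((size, size), dtype=np.float32)
    depth = estimate_depth(face_img)
    depth = (depth - depth.min()) / (depth.max() - depth.min() + 1e-8)  # scale to [0, 1]
    return depth.astype(np.float32)
```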

Proposed method

In this section, we provide a detailed description of the network structure, derive the CDC formula, and thoroughly explain the loss function.

Network structure

An overview of the proposed network is shown in Fig. 1. The entire network can be divided into two parts: the first part is the backbone, which mainly focuses on estimating the depth of the input face, and the second part is the classifier, which utilizes the estimated depth information to obtain the final classification.

Central difference convolution

Convolution is a fundamental operation in convolutional neural networks (CNNs), used for various computer vision tasks such as feature extraction, dimension transformation, and scale transformation. However, face anti-spoofing differs from traditional image classification tasks: distinguishing between living and spoof faces is challenging because the differences between them are subtle. Many researchers have pointed out that traditional CNNs have difficulty capturing the crucial information that differentiates living from spoof faces. To address this, we use the Central Difference Convolution (CDC) operation within our network. Through differencing, CDC obtains gradient information that traditional convolution lacks, incorporating prior knowledge about the three-dimensional differences between real and fake faces. CDC consists of two steps, sampling and aggregation, and differs from traditional convolution in the aggregation step. The mathematical description of CDC is given in Eq. 1:
$$\begin{aligned} y({p_0}) = \sum \limits _{{p_n} \in {\mathcal {R}}} {w({p_n}) \cdot \big (x({p_0} + {p_n})-x({p_0})\big )} \end{aligned}$$
(1)
where \(p_0\) denotes the current position on the input and output feature maps, and \(p_n\) enumerates the positions in the local receptive field \({\mathcal {R}}\). Combining CDC with vanilla convolution via a weight \(\theta \) gives Eq. 2:
$$\begin{aligned} \begin{aligned} y({p_0})&= \underbrace{\theta \cdot \sum \limits _{{p_n} \in {\mathcal {R}}} {w({p_n}) \cdot \big (x({p_0} + {p_n}) - x({p_0})\big )} }_{{\text {center difference convolution}}} \\&\quad + \underbrace{(1 - \theta ) \cdot \sum \limits _{{p_n} \in {\mathcal {R}}} {w({p_n}) \cdot x({p_0} + {p_n})} }_{{\text {vanilla convolution}}} \\&= \underbrace{\sum \limits _{{p_n} \in {\mathcal {R}}} {w({p_n}) \cdot x({p_0} + {p_n})} }_{{\text {vanilla convolution}}} + \underbrace{\theta \cdot \Big ( - x({p_0}) \cdot \sum \limits _{{p_n} \in {\mathcal {R}}} {w({p_n})} \Big )}_{{\text {center difference term}}} \end{aligned} \end{aligned}$$
(2)
Table 2
Notations and their meanings

| Notation | Meaning |
|---|---|
| \(p_0\) | Current position on the input and output feature maps |
| \(p_n\) | Position within the receptive field \({\mathcal {R}}\) |
| \(\theta \) | Weight between CDC and the traditional convolution |
| \(\beta \) | Threshold for switching between the L1 and L2 losses |
| \(\ell _{absolute} (x,y)\) | Depth map loss |
| \(K_i^{contrast}\) | Contrast convolution kernel |
| \({\ell _{contrast}}(x,y)\) | Contrast depth loss |
| \({\mathcal {L}}_{depth}(x,y)\) | Depth estimation loss |
| \(p\) | Prediction probability of the classifier |
| \(CE(p_t)\) | Conventional binary cross-entropy loss |
| \(\alpha _t\) | Weight to balance the CE loss |
| \(FL(p_t)\) | Focal loss used as the classification loss |
| \({\mathcal {L}}(x,y)\) | Total loss |
The specific implementation of the central difference convolution is shown in Fig. 2. For the input feature map, the convolution kernel is first applied as in traditional convolution. Then, the kernel is summed over its spatial (w and h) dimensions and the resulting kernel is convolved with the feature map. Finally, the two feature maps are subtracted to obtain the final output feature map.
Following the research of Zitong Yu et al. [13], we used the \(\theta \) parameter to obtain the weighted combination of the central difference convolution with the traditional convolution, resulting in the final central difference convolution. The mathematical formula is shown in Eq. 2. It should be noted that when \(\theta =0\), the CDC operation reduces to the traditional convolution.
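To make the operator concrete, here is a minimal PyTorch sketch of a CDC layer following Eq. 2; the class name and the default \(\theta =0.7\) are illustrative assumptions rather than the exact configuration used in our experiments.

```python
import torch.nn as nn
import torch.nn.functional as F


class CDConv2d(nn.Module):
    """Central Difference Convolution (Eq. 2): a theta-weighted mix of vanilla
    convolution and a central-difference term. Minimal sketch."""

    def __init__(self, in_ch, out_ch, kernel_size=3, stride=1, padding=1, theta=0.7):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size,
                              stride=stride, padding=padding, bias=False)
        self.theta = theta

    def forward(self, x):
        out = self.conv(x)                  # vanilla convolution term
        if self.theta == 0:
            return out                      # theta = 0 reduces to plain convolution
        # Summing the kernel over its spatial dims gives an equivalent 1x1 kernel,
        # so this second convolution equals x(p0) * sum_n w(p_n) at each position.
        kernel_sum = self.conv.weight.sum(dim=(2, 3), keepdim=True)
        diff = F.conv2d(x, kernel_sum, stride=self.conv.stride, padding=0)
        return out - self.theta * diff
```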

UCDCN

Our network is divided into two parts: a backbone responsible for extracting image features and regressing the depth information, and a classifier that uses the extracted depth information to achieve final classification. The backbone of our network draws inspiration from UNet++ and employs a multi-layer feature fusion approach. This approach combines low-level texture features with high-level abstract features and maximizes the utilization of fine-grained details captured by the Central Difference Convolution (CDC), which represents essential information. Figure 3 illustrates the structure of our network where each ConvBlock consists of two consecutive sets of CDCs, BatchNorm [37], and ReLU [38]. The CDC block includes a CDC layer followed by a Sigmoid layer. As mentioned earlier, the Sigmoid layer constrains the output values within the range of [0, 1], which corresponds to the actual pixel values in the image. This constraint aids in computing the loss function using normalized labels, facilitating accurate training of the network.
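A minimal sketch of the building blocks described above, reusing the CDConv2d layer sketched earlier; the block names and channel handling are illustrative, not the exact implementation.

```python
import torch.nn as nn


class ConvBlock(nn.Module):
    """Two consecutive CDC -> BatchNorm -> ReLU stages, as described above."""

    def __init__(self, in_ch, out_ch, theta=0.7):
        super().__init__()
        self.body = nn.Sequential(
            CDConv2d(in_ch, out_ch, theta=theta),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
            CDConv2d(out_ch, out_ch, theta=theta),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.body(x)


class CDCHead(nn.Module):
    """CDC block: a CDC layer followed by Sigmoid, so the predicted depth map
    lies in [0, 1] and matches the normalized depth labels."""

    def __init__(self, in_ch):
        super().__init__()
        self.out = nn.Sequential(CDConv2d(in_ch, 1, theta=0.7), nn.Sigmoid())

    def forward(self, x):
        return self.out(x)
```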

Loss function

This subsection elaborates on the loss function \({\mathcal {L}}\) that we have designed, inspired by previous research [9, 13]. Our loss function combines a depth loss and a classification loss, targeting both the depth estimation and classification tasks. The notations used in this article are listed in Table 2.

Depth map loss

Our depth loss comprises two components. The first component is the SmoothL1Loss in Eqs. 3, 4, and 5, which is commonly used in regression analysis. This loss function combines the advantages of the Mean Absolute Error (MAE) and Mean Squared Error (MSE) losses, making it less sensitive to outliers than MSE and able to prevent gradient explosion in certain cases. The loss function can be represented as Eq. 3:
$$\begin{aligned} loss (x,y) = L = {\{ {l_1},...,{l_N}\} ^T} \end{aligned}$$
(3)
where
$$\begin{aligned} {l_n} = \left\{ \begin{aligned}&0.5{({x_n} - {y_n})^2}/\beta ,{\text { }}if{\Vert {x_n} - {y_n}\Vert } < \beta \\&{\Vert {x_n} - {y_n}\Vert } - 0.5\beta ,{\text { }}otherwise \\ \end{aligned} \right. \end{aligned}$$
(4)
where \(\beta \) is the threshold for switching between the L1 and L2 losses.
$$\begin{aligned} \ell _{absolute} (x,y) = \left\{ \begin{aligned}&mean(L),{\text { }}if\;reduction = mean \\&sum(L),{\text { }\text { }}if\;reduction = sum \\ \end{aligned} \right. \end{aligned}$$
(5)
The depth map loss \(\ell _{absolute}(x,y)\) is given by Eq. 5.

Contrast depth loss

SmoothL1Loss, while effective in regression tasks, does not consider the varying weights with which pixels in different regions contribute to the overall regression. This limitation prevents the model from effectively capturing the detailed information present in different facial regions. To overcome it, we introduce a contrast depth loss that builds upon SmoothL1Loss and enhances the model's capability to capture fine-grained details by incorporating region-specific contributions into the loss function. As depicted in Fig. 1, different facial regions exhibit distinct features, such as a prominent nose bulge or noticeable depressions in the cheeks and eyes. These features provide strong cues for distinguishing between genuine and fake faces. Following the approach of Zezheng Wang et al. [9], we utilize a convolutional kernel, as shown in Fig. 4.
To incorporate the contrast depth loss into our framework, we utilize a convolutional kernel \(K_i^{contrast}\) to perform convolution operations on both the ground-truth depth labels and the predicted depth map. Since the contrast depth loss acts as a fine-tuning term rather than a standalone loss function, we directly compute the loss value using MSE. The entire process can be represented by Eq. 6:
$$\begin{aligned} {\ell _{contrast}}(x,y) = MSE(K_i^{contrast} \odot x,K_i^{contrast} \odot y) \end{aligned}$$
(6)
Therefore, the loss function for our depth estimation is expressed as Eq. 7:
$$\begin{aligned} {{\mathcal {L}}_{depth}}(x,y) = {\ell _{absolute}}(x,y) + {\ell _{contrast}}(x,y) \end{aligned}$$
(7)
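A minimal sketch of the combined depth loss of Eqs. 5-7 follows. The eight 3x3 contrast kernels (+1 at one neighbour, -1 at the centre) are our reading of the kernels in Fig. 4, so their exact construction should be treated as an assumption.

```python
import torch
import torch.nn.functional as F


def contrast_kernels(device):
    """Eight 3x3 kernels with +1 at one neighbour and -1 at the centre
    (our reading of the contrast-depth kernels of Wang et al. [9], Fig. 4)."""
    kernels = []
    for idx in range(9):
        if idx == 4:                       # skip the centre position itself
            continue
        k = torch.zeros(1, 1, 3, 3, device=device)
        k[0, 0, idx // 3, idx % 3] = 1.0
        k[0, 0, 1, 1] = -1.0
        kernels.append(k)
    return torch.cat(kernels, dim=0)       # shape (8, 1, 3, 3)


def depth_loss(pred, target, beta=1.0):
    """L_depth = SmoothL1 + contrast depth loss (Eqs. 5-7).
    pred, target: (N, 1, H, W) depth maps in [0, 1]; beta follows Eq. 4."""
    l_absolute = F.smooth_l1_loss(pred, target, beta=beta)
    k = contrast_kernels(pred.device)
    l_contrast = F.mse_loss(F.conv2d(pred, k, padding=1),
                            F.conv2d(target, k, padding=1))
    return l_absolute + l_contrast
```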

Classifier loss

Face anti-spoofing remains fundamentally a binary classification task, and thus, binary cross-entropy loss is frequently utilized to supervise this task. However, a considerable body of research [6, 3941] has demonstrated that binary cross-entropy loss is highly susceptible to causing overfitting in face anti-spoofing tasks, which is a key factor contributing to poor model generalization. Moreover, given the diverse range of spoofing techniques and media, the number of negative samples in face anti-spoofing datasets, i.e., the number of spoof faces, far exceeds the number of positive samples representing genuine faces. The large proportion of negative samples in the dataset accounts for a significant portion of the total loss, and is a key reason why the face anti-spoofing task tends to focus on classifying negative samples, leading the model optimization direction to deviate from our intended objective. Uneven data distribution is a common characteristic of face anti-spoofing datasets, and therefore, for the classifier, we need to address the following two challenges.
1.
To mitigate the impact of overfitting associated with softmax on model performance.
 
2.
To address the impact of uneven data distribution on model performance.
 
In this study, motivated by the application of Ref. [42] in object detection, we utilize focal loss as the classification loss function. We further illustrate the strong performance of focal loss in binary classification tasks by analyzing how it overcomes the limitations of the conventional binary cross-entropy loss, defined in Eq. 8:
$$\begin{aligned} CE(p,y) = \left\{ \begin{aligned}&- \log (p),{\text { }\text { }\text { }\text { }\text { }\text { }}if{\text { }}y = 1 \\&- \log (1 - p),{\text { }}otherwise \\ \end{aligned} \right. \end{aligned}$$
(8)
where \(y=1\) denotes a living face sample and p is the prediction probability of the classifier; for notational convenience, we define \(p_t\) as follows:
$$\begin{aligned} {p_t} = \left\{ \begin{aligned}&p,{\text { }\text { }\text { }\text { }\text { }\text { }}if{\text { }}y = 1 \\&1 - p,{\text { }}otherwise \\ \end{aligned} \right. \end{aligned}$$
(9)
Thus, we can obtain \(CE(p,y) = CE(p_t) = -\log (p_t)\). As in most scenarios, we add the weight parameter \({\alpha _t}\) to handle the class imbalance problem, obtaining the \(\alpha \)-balanced CE loss \(CE({p_t}) = - {\alpha _t}\log ({p_t})\), where \({\alpha _t} \in [0,1]\). As spoof faces typically constitute a significantly larger proportion of the dataset, the model evaluation metrics can be artificially inflated if the model tends to predict spoof faces more frequently. This phenomenon can give rise to an illusion of exceptional model performance and is often the primary cause of overfitting. To address this issue and achieve a more balanced loss between living and spoof faces, we introduce a modulation factor \({(1 - {p_t})^\gamma }\), resulting in the focal loss formulation in Eq. 10:
$$\begin{aligned} FL({p_t}) = - {(1 - {p_t})^\gamma }\log ({p_t}) \end{aligned}$$
(10)
Figure 5 displays the focal loss curves for various values of the \(\gamma \) coefficient, which regulates the behavior of the focal loss mechanism as follows.
Table 3
The details of the datasets for face anti-spoofing

| Dataset | Year | Subjects | Sessions | Live/attack | Pose range | Different expressions | Extra light | Spoof attacks |
|---|---|---|---|---|---|---|---|---|
| Replay-Attack | 2012 | 50 | 1 | 200/1000 | Frontal | No | Yes | Print, 2 Replay |
| Oulu-NPU | 2017 | 55 | 3 | 1980/3960 | Frontal | No | Yes | 2 Print, 2 Replay |
| SiW | 2018 | 165 | 4 | 1320/3300 | [−90, 90] | Yes | Yes | 2 Print, 4 Replay |
1.
When a sample is misclassified, there are only two cases. When the ground truth is 1, its prediction is close to 0, i.e., \(p \rightarrow 0\); according to Eq. 9, \(p_t \rightarrow 0\) and \(FL(p_t)\) is large. When the ground truth is 0, its prediction is close to 1, so \(p \rightarrow 1\), \(p_t \rightarrow 0\), and \(FL(p_t)\) is still large.
 
2.
The focusing parameter \(\gamma \) smoothly adjusts the rate at which easy examples are down-weighted. When \(\gamma =0\), FL is equivalent to CE, and as \(\gamma \) increases, the effect of the modulating factor likewise increases. Based on the experimental results of previous work, we also set \(\gamma =2\) in the face anti-spoofing task.
 
Adding the \({\alpha _t}\) parameter, we can obtain the final focal loss as our classifier loss in Eq. 11:
$$\begin{aligned} FL({p_t}) = - {\alpha _t}{(1 - {p_t})^\gamma }\log ({p_t}) \end{aligned}$$
(11)
In summary, we use focal loss to enhance the model's ability to correctly classify difficult samples and to mitigate the impact of the unbalanced data distribution on model performance. Finally, the total loss of our model can be expressed as Eq. 12, where \({{\mathcal {L}}_{classify}}(x,y)=FL({p_t})\):
$$\begin{aligned} {\mathcal {L}}(x,y) = {{\mathcal {L}}_{depth}}(x,y) + {{\mathcal {L}}_{classify}}(x,y) \end{aligned}$$
(12)
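The classification and total losses can be sketched as below. The sigmoid-based formulation and the default \(\alpha =0.25\) are assumptions (only \(\gamma =2\) is fixed by the text), and `depth_loss` refers to the sketch given earlier.

```python
import torch


def focal_loss(logits, labels, alpha=0.25, gamma=2.0):
    """Binary focal loss (Eq. 11): -alpha_t * (1 - p_t)^gamma * log(p_t).
    labels: 1 for living faces, 0 for spoof faces; alpha=0.25 is a common
    default borrowed from object detection, not a value fixed by the text."""
    p = torch.sigmoid(logits)
    p_t = torch.where(labels == 1, p, 1 - p)
    alpha_t = torch.where(labels == 1,
                          torch.full_like(p, alpha),
                          torch.full_like(p, 1 - alpha))
    return (-alpha_t * (1 - p_t).pow(gamma) * torch.log(p_t.clamp(min=1e-8))).mean()


def total_loss(pred_depth, gt_depth, logits, labels):
    """Eq. 12: total loss = depth loss + classification (focal) loss."""
    return depth_loss(pred_depth, gt_depth) + focal_loss(logits, labels)
```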

Implementation details

Datasets

In our experiments, three databases were used: OULU-NPU [43], SiW [7] and Replay-Attack [44]. The details of the databases are given in Table 3. OULU-NPU is a high-resolution database consisting of 4950 real-access and attack videos; it contains four protocols to validate the generality of the model. SiW contains more living subjects as well as three testing protocols. Replay-Attack is a database containing low-resolution videos.

Pre-processing stage

Our proposed method works on cropped face images due to the different resolutions of various devices. Jianwei Yang et al. [19] point out that a certain degree of background information is helpful to the model, so we use RetinaFace [45] to crop a square region in an isotropic way, setting the crop size to 1.2 times the detected face area. The input images are scaled to 128\(\times \)128 and normalized with the ImageNet [46] statistics, after which the corresponding depth maps are generated using 3DDFAv2 [34], as described earlier.
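A minimal sketch of this cropping and normalization step, assuming the face box has already been produced by a detector such as RetinaFace (the detector call itself is omitted):

```python
import cv2
import numpy as np

IMAGENET_MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
IMAGENET_STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)


def crop_and_normalize(img_bgr, box, scale=1.2, size=128):
    """Expand the detected face box to an isotropic square 1.2x region,
    crop, resize to 128x128 and apply ImageNet normalization.
    `box` = (x1, y1, x2, y2) from the face detector."""
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    half = scale * max(x2 - x1, y2 - y1) / 2.0
    x1n, y1n = int(max(cx - half, 0)), int(max(cy - half, 0))
    x2n, y2n = int(min(cx + half, img_bgr.shape[1])), int(min(cy + half, img_bgr.shape[0]))
    face = cv2.resize(img_bgr[y1n:y2n, x1n:x2n], (size, size))
    face = cv2.cvtColor(face, cv2.COLOR_BGR2RGB).astype(np.float32) / 255.0
    face = (face - IMAGENET_MEAN) / IMAGENET_STD
    return face.transpose(2, 0, 1)          # CHW layout for PyTorch
```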

Data augmentation

Unlike conventional image classification tasks, the data augmentation techniques employed for face anti-spoofing tasks require the incorporation of real-world scenarios, such as occlusions, changes in lighting conditions, variations in angle, and so on. In this regard, we utilized various data augmentation methods, including random erasing to simulate partial occlusions of the face, random brightness adjustments to simulate changes in lighting conditions, and random horizontal flips and rotations to simulate alterations in facial angles. Notably, the depth map labels only change proportionally with variations in the facial angle, and Fig. 6 provides a visualization of our augmentation strategy.
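A minimal torchvision sketch of such an augmentation pipeline; the probabilities and magnitudes are illustrative assumptions, and the geometric transforms (flip, rotation) would have to be applied to the depth labels as well, which is not shown here.

```python
import torchvision.transforms as T

train_transform = T.Compose([
    T.RandomHorizontalFlip(p=0.5),               # facial angle changes
    T.RandomRotation(degrees=10),                # facial angle changes
    T.ColorJitter(brightness=0.3),               # lighting changes
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    T.RandomErasing(p=0.3, scale=(0.02, 0.1)),   # partial occlusion of the face
])
```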

Training strategies

Our proposed method was implemented using the PyTorch framework. Training proceeds in stages: we first freeze the classifier and train the backbone for depth regression, then freeze the backbone weights and train the classifier, and finally fine-tune the whole network end to end. The advantage of this approach is that, at the early stage of end-to-end training, the backbone may not yet be able to estimate valid depth information, and backpropagation from the classifier may disturb the weight updates of the backbone, reducing the robustness of training. Therefore, we first train the backbone separately until its performance is good enough, then train the classifier separately, and finally perform end-to-end training to improve the model's performance. We trained the backbone for 100,000 steps, the classifier for 2500 steps, and the end-to-end stage for 10,000 steps using the POLY learning rate decay strategy. We set the learning rate to 0.0001 in both the backbone stage and the end-to-end stage, and 0.001 in the classifier stage.
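A minimal sketch of how one such training stage can be set up; the Adam optimizer, the POLY power of 0.9, and the `backbone`/`classifier` attribute names on the model are assumptions rather than the authors' exact configuration.

```python
import torch


def train_stage(model, part_to_train, lr, max_steps, power=0.9):
    """Freeze everything except `part_to_train` ('backbone' or 'classifier')
    and attach a POLY learning-rate schedule: lr * (1 - step/max_steps)^power."""
    for p in model.parameters():
        p.requires_grad = False
    for p in getattr(model, part_to_train).parameters():
        p.requires_grad = True
    opt = torch.optim.Adam((p for p in model.parameters() if p.requires_grad), lr=lr)
    sched = torch.optim.lr_scheduler.LambdaLR(
        opt, lambda step: max(1.0 - step / max_steps, 0.0) ** power)  # POLY decay
    return opt, sched

# Stage 1: backbone (depth regression), lr 1e-4, 100,000 steps
# Stage 2: classifier, lr 1e-3, 2,500 steps
# Stage 3: end-to-end fine-tuning (all parameters unfrozen), lr 1e-4, 10,000 steps
```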
Table 4
The number of images in the three databases

| Dataset | Replay-Attack | Oulu-NPU | SiW |
|---|---|---|---|
| Training images | 56,975 | 247,666 | 1,437,879 |
| Test images | 76,000 | 240,823 | 1,221,438 |
Table 5
The results of intra-testing on four protocols of OULU-NPU

| Prot | Method | APCER (%) | BPCER (%) | ACER (%) |
|---|---|---|---|---|
| 1 | CPqD | 2.9 | 10.8 | 6.9 |
| 1 | FAS-BAS [7] | 1.6 | 1.6 | 1.6 |
| 1 | GRADIANT [47] | 1.3 | 12.5 | 6.9 |
| 1 | CDCN [13] | 0.4 | 1.7 | 1.0 |
| 1 | UCDCN-\(\text {L}_2\) (Ours) | 1.4 | 0.4 | 0.9 |
| 1 | UCDCN-\(\text {L}_3\) (Ours) | 2.84 | 2.39 | 2.61 |
| 2 | MixedFASNet | 9.7 | 2.5 | 6.1 |
| 2 | GRADIANT [47] | 3.1 | 1.9 | 2.5 |
| 2 | FAS-BAS [7] | 2.7 | 2.7 | 2.7 |
| 2 | CDCN [13] | 1.5 | 1.4 | 1.5 |
| 2 | UCDCN-\(\text {L}_2\) (Ours) | 0.4 | 1.5 | 0.9 |
| 2 | UCDCN-\(\text {L}_3\) (Ours) | 2.84 | 2.39 | 2.61 |
| 3 | MixedFASNet | 5.3 ± 6.7 | 7.8 ± 5.5 | 6.5 ± 4.6 |
| 3 | GRADIANT [47] | 2.6 ± 3.9 | 5.0 ± 5.3 | 3.8 ± 2.4 |
| 3 | FAS-BAS [7] | 2.7 ± 1.3 | 3.1 ± 1.7 | 2.9 ± 1.5 |
| 3 | CDCN [13] | 2.4 ± 1.3 | 2.2 ± 2.0 | 2.3 ± 1.4 |
| 3 | UCDCN-\(\text {L}_2\) (Ours) | 1.6 ± 0.6 | 2.7 ± 1.2 | 2.2 ± 0.9 |
| 3 | UCDCN-\(\text {L}_3\) (Ours) | 1.7 ± 0.7 | 4.6 ± 3.2 | 3.2 ± 1.3 |
| 4 | Massy HNU | 35.8 ± 35.3 | 8.3 ± 4.1 | 22.1 ± 17.6 |
| 4 | GRADIANT [47] | 5.0 ± 4.5 | 15.0 ± 7.1 | 10.0 ± 5.0 |
| 4 | FAS-BAS [7] | 9.3 ± 5.6 | 10.4 ± 6.0 | 9.5 ± 6.0 |
| 4 | CDCN [13] | 4.6 ± 4.6 | 9.2 ± 8.0 | 6.9 ± 2.9 |
| 4 | UCDCN-\(\text {L}_2\) (Ours) | 4.4 ± 2.9 | 6.2 ± 4.4 | 5.3 ± 3.4 |
| 4 | UCDCN-\(\text {L}_3\) (Ours) | 5.1 ± 4.3 | 9.6 ± 7.1 | 7.4 ± 3.2 |
Table 6
The results on the three databases

| Dataset | Acc (%) | APCER (%) | BPCER (%) | ACER (%) |
|---|---|---|---|---|
| Replay-Attack | 99.18 | 0.813 | 0.05 | 0.41 |
| Oulu-NPU | 96.35 | 2.6 | 1.01 | 1.82 |
| SiW | 99.61 | 0.31 | 0.08 | 0.19 |
During different training phases, we will monitor different loss values. In the backbone stage, we will monitor the depth loss \({{\mathcal {L}}_{depth}}\). In the classifier stage, we will monitor the classification loss \({{\mathcal {L}}_{classify}}\) and accuracy. Finally, during end-to-end training, we will monitor the total loss function \({\mathcal {L}}\).
Table 7
Different classifier performance on datasets. The convolution of each classifier is replaced by CDC

| Dataset | Classifier | Acc (%) | APCER (%) | BPCER (%) | ACER (%) |
|---|---|---|---|---|---|
| SiW | Linear | 99.61 | 0.31 | 0.08 | 0.19 |
| SiW | ResNet18 | 99.62 | 0.27 | 0.12 | 0.19 |
| SiW | MobileNet | 99.47 | 0.48 | 0.05 | 0.27 |
| SiW | ShuffleNet | 99.55 | 0.39 | 0.06 | 0.23 |
| SiW | VGG11 | 99.16 | 0.45 | 0.40 | 0.42 |
| Replay | Linear | 99.18 | 0.81 | 0.01 | 0.41 |
| Replay | ResNet18 | 99.05 | 0.94 | 0.01 | 0.47 |
| Replay | MobileNet | 98.20 | 0.48 | 1.32 | 0.90 |
| Replay | ShuffleNet | 98.99 | 1.01 | 0.00 | 0.51 |
| Replay | VGG11 | 97.82 | 0.68 | 1.50 | 1.09 |
| OULU | Linear | 96.35 | 2.63 | 1.01 | 1.82 |
| OULU | ResNet18 | 95.12 | 4.41 | 0.47 | 2.44 |
| OULU | MobileNet | 95.39 | 2.30 | 2.32 | 2.31 |
| OULU | ShuffleNet | 94.94 | 4.00 | 1.07 | 2.53 |
| OULU | VGG11 | 92.94 | 6.30 | 0.75 | 3.53 |
Table 8
The performance of different classifiers on protocol 2

| Classifier | Accuracy (%) | APCER (%) | BPCER (%) | ACER (%) |
|---|---|---|---|---|
| Linear | 96.35 | 2.63 | 1.01 | 1.82 |
| ResNet18 | 95.12 | 4.42 | 0.47 | 2.44 |
| ShuffleNet | 94.94 | 4.00 | 1.07 | 2.53 |
| MobileNet | 95.39 | 2.30 | 2.32 | 2.31 |
| VGG11 | 92.94 | 6.30 | 0.75 | 3.53 |
Table 9
The four protocols on OULU-NPU

| Prot | Subset | Session | Phones | Users | Attacks created using | Real | Attack | All |
|---|---|---|---|---|---|---|---|---|
| 1 | Train | 1, 2 | 6 | 1–20 | Printer 1, 2; Display 1, 2 | 240 | 960 | 1200 |
| 1 | Dev | 1, 2 | 6 | 21–35 | Printer 1, 2; Display 1, 2 | 180 | 720 | 900 |
| 1 | Test | 3 | 6 | 36–55 | Printer 1, 2; Display 1, 2 | 120 | 480 | 600 |
| 2 | Train | 1, 2, 3 | 6 | 1–20 | Printer 1; Display 1 | 360 | 720 | 1080 |
| 2 | Dev | 1, 2, 3 | 6 | 21–35 | Printer 1; Display 1 | 270 | 540 | 810 |
| 2 | Test | 1, 2, 3 | 6 | 36–55 | Printer 2; Display 2 | 360 | 720 | 1080 |
| 3 | Train | 1, 2, 3 | 5 | 1–20 | Printer 1, 2; Display 1, 2 | 300 | 1200 | 1500 |
| 3 | Dev | 1, 2, 3 | 5 | 21–35 | Printer 1, 2; Display 1, 2 | 225 | 900 | 1125 |
| 3 | Test | 1, 2, 3 | 1 | 36–55 | Printer 1, 2; Display 1, 2 | 60 | 240 | 300 |
| 4 | Train | 1, 2 | 5 | 1–20 | Printer 1; Display 1 | 200 | 400 | 600 |
| 4 | Dev | 1, 2 | 5 | 21–35 | Printer 1; Display 1 | 150 | 300 | 450 |
| 4 | Test | 3 | 1 | 36–55 | Printer 2; Display 2 | 20 | 40 | 60 |

Evaluation metrics

In the OULU-NPU [43] database, we followed the original protocols and metrics for a fair comparison. The following evaluation metrics were used for all databases (a minimal computation sketch is given after the list).
1.
The Attack Presentation Classification Error Rate (\(APCER\) [47]) is used to calculate the misclassification error rate of a spoof face;
 
2.
The Bona Fide Presentation Classification Error Rate (BPCER [47]) is used to measure the error rate of a living face being misclassified;
 
3.
The Average Classification Error Rate (ACER [47]) is computed as the average of the APCER and the BPCER:
$$\begin{aligned} ACER=\frac{APCER+BPCER}{2} \end{aligned}$$
(13)
 
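The sketch below computes these three metrics from predicted live probabilities at the 0.5 decision threshold used in our experiments; the function name and array-based interface are our own.

```python
import numpy as np


def apcer_bpcer_acer(scores, labels, threshold=0.5):
    """scores: predicted probability of being a living face;
    labels: 1 = bona fide (live), 0 = attack.
    APCER = attacks accepted as live / all attacks,
    BPCER = live faces rejected as attacks / all live faces,
    ACER  = (APCER + BPCER) / 2 (Eq. 13)."""
    scores, labels = np.asarray(scores), np.asarray(labels)
    pred_live = scores >= threshold
    attacks, lives = labels == 0, labels == 1
    apcer = float(np.mean(pred_live[attacks])) if attacks.any() else 0.0
    bpcer = float(np.mean(~pred_live[lives])) if lives.any() else 0.0
    return apcer, bpcer, (apcer + bpcer) / 2
```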

Experiments and results

We utilized the entire training set to train our model and subsequently evaluated the performance on the entire test set to ascertain its effectiveness on each database. The number of images in the training and test sets of the three databases is given in Table 4. All images were subjected to rigorous data cleaning procedures, with each image resized to 128\(\times \)128. The results of our experiments using the full training and test sets are presented in Table 6. In Table 7, we provide supplementary results for different datasets using distinct classification heads. The results were classified with a threshold of 0.5, and most of the models achieved over 98% accuracy. These results are highly encouraging.
In addition, we provide metrics under the OULU protocol in Table 5 to demonstrate the effectiveness of our proposed method. UCDCN-\(\text {L}_3\) shows similar performance to the original CDCN, while UCDCN-\(\text {L}_2\) exhibits higher performance than the original CDCN.

Visualization and analysis

In the proposed architecture, the estimated depth maps serve as input to the classifier and are supervised by the focal loss. These maps indicate the importance of different facial areas to the classifier. Figure 9 shows the estimated depth maps and the corresponding classification results during training. The input image size is 128\(\times \)128, and the estimated depth map size is also 128\(\times \)128. An advantage of this approach is that the backbone can compute a point-to-point regression loss between the estimated depth map and the ground truth.
Through the visualization, it is evident that the depth map estimated by the model is highly accurate, detailed, and adaptable to various angles. This fine-grained depth information effectively reduces the classifier’s burden, leading to a robust architecture that is not affected by variations in luminance or angles.
Although the performance of the backbone affects the classifier, the classifier is able to compensate for the shortcomings of the backbone to some extent. Since the backbone cannot estimate near-perfect depth information for every image, Fig. 8 shows examples of imperfect depth maps estimated by the model. It is therefore also important to allow the classifier to autonomously determine the depth-information threshold that distinguishes spoof from living faces for such imperfect depth maps. In Table 8, we present the performance of different classifiers under OULU-NPU protocol II.
Our model maximizes the utilization of the 3D shape of the face, which involves obtaining gradient information using CDC, a crucial difference from the flat spoofing face. It is important to note that CDC is not only utilized in the depth regression task, but we also replace vanilla convolution with CDC in the downstream classification task, as depth maps also possess 3D information. To gain a better understanding of the features extracted by the model in the depth regression stage, we implement a visualization of the middle layer.
As depicted in Fig. 10, we visualize the output features of the first-layer convolution, where the leftmost column presents the original input images. The top and bottom rows show a living and a spoof face, respectively, along with their output features in the same channel of the same convolution layer. The extracted features of the living face better differentiate the foreground from the background and provide a clearer edge description, whereas the spoof face is less distinguishable from the background. In addition, for living faces the model effectively emphasizes regions with obvious gradients, such as the eyes and mouth corners. This further indicates that CDC can extract the most essential gradient information from living faces and distinguish them from flat spoof faces.
As shown in Fig. 11, we visualize the first layer of convolution features in the Decoder following the same order as the visualized image in Fig. 10. We observed significant differences in the output features of the living face and the spoof face in the Decoder. The second and fourth columns of the figure demonstrate that the model effectively regresses the depth information of the face in the Decoder. For the living face, the depth map is very clear and accurately portrays the nose tip and other facial parts. However, in contrast, the regression depth information for the spoof face is nearly zero.
Furthermore, the model’s regression pattern for the background remains consistent for both the living face and the spoof face. The disparity lies in the incorporation of image segmentation domain knowledge. In the case of the living face, a face-shaped region is present in the center, signifying that the utilization of depth map regression enhances the model’s emphasis on the facial region and captures features that differentiate the face from the background. As a result, the background alone has minimal impact on the model’s performance, whether it is a living face or a spoof face, thereby leading to an improvement in the model’s performance to some extent.

Model pruning

As shown in Figs. 7 and 1, our model incorporates features from multiple layers, but each layer's contribution to the model differs during inference. In Fig. 12, we use Eq. 14 to calculate the variance of each hierarchical parameter in Fig. 7, in order to assess the contribution of different hierarchical features to the model. It is evident that \(X^{0,1}\), \(X^{1,1}\) and \(X^{0,2}\), which describe the intermediate-layer features of the model, have the lowest variance. As shown in Fig. 13, the output features of \(X^{1,1}\) are very sparse, and \(X^{0,1}\) and \(X^{0,2}\) output zero features. These findings indicate that the outer model used by UCDCN for depth estimation contributes most to the model output, while the inner structure plays a minor role. This insight presents the potential for model pruning and parameter reduction:
$$\begin{aligned} {y_i} = \log \left( \frac{{{\theta _i} \cdot e}}{{\min \Theta }}\right) \end{aligned}$$
(14)
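For reference, a minimal sketch of the contribution score of Eq. 14; the dictionary-based interface and function name are illustrative assumptions.

```python
import math


def layer_contribution_scores(variances):
    """Eq. 14: y_i = log(theta_i * e / min(Theta)), where theta_i is the variance
    of the parameters of node i. Scores close to 1 (variance near the minimum)
    mark candidates for pruning."""
    v_min = min(variances.values())
    return {name: math.log(v * math.e / v_min) for name, v in variances.items()}
```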

Training strategy analysis

Instead of directly conducting end-to-end training, our model is first trained in the regression part to effectively regress the depth information of the face, which is then used to achieve classification, making our model more interpretable. In contrast, end-to-end training would generate abstract input features to the classifier, which may achieve correct classification but are difficult to explain and do not align well with human thinking. The following figure illustrates the difference between end-to-end training and non-end-to-end training mechanisms. It can be observed that end-to-end training can also generate depth information, but the detailed information description is poor. This is because the reduction of classification loss during end-to-end training reduces the contribution of regression loss to the overall loss, thereby reducing the model’s capability to regress the depth map. Therefore, separate training of the regression part is utilized to obtain a more detailed depth map.

Comparison and performance

Model pruning offers two advantages: it removes neurons that contribute little to the model, and it reduces the number of parameters, thus avoiding overfitting when dealing with small databases. The OULU-NPU protocols have different data allocation methods, resulting in a small amount of data for each protocol. To mitigate the risk of overfitting, we adopt pruning to reduce the complexity of the model. Therefore, we remove the last fused layer shown in Fig. 7 and name the result UCDCN-\(\text {L}_3\); removing one more layer from UCDCN-\(\text {L}_3\) yields UCDCN-\(\text {L}_2\). Our purpose is to further compress the complexity and parameter count of the model; the difference is that UCDCN-\(\text {L}_2\) uses the parameters of UCDCN-\(\text {L}_3\) as pretrained weights and is trained for a further 10,000 steps. We perform intra-testing on the Oulu database, following the four protocols in Table 9, and report the APCER, BPCER and ACER in Table 5. It can be seen that \(L_2\), using \(L_3\) as a pretrained model, achieves better results. This further suggests that \(L_3\) contains redundant intermediate-layer neurons whose redundant information may degrade model performance. In addition, \(L_3\) with a linear classifier requires 11.26 FLOPs and \(L_2\) requires 5.79 FLOPs, and its peak frame rate reaches 653 fps on an RTX 3090 graphics card. This greatly reduces the difficulty of deployment in production environments. Our model shows encouraging results, with improvements over previous methods on different protocols, which demonstrates the validity of our proposed method.

Discussion

The datasets we employ contain media-based attacks and assume that the depth of a spoofed face is zero. However, in reality, facial changes and scenes exhibit significant complexity, including intricate pitch-angle variations, diverse luminance changes, and various attack presentation methods such as different angles and distances, degrees of photo bending, and more. Consequently, the depth of such a spoofed face is not exactly zero; however, its depth information is significantly lower than that of a genuine living face.
In general, we have made improvements in data acquisition to ensure high-quality data. In addition, we have enhanced the network structure to enable the model to effectively utilize features at different levels. Furthermore, we have refined our training strategy and loss function to mitigate the risks of overfitting. We believe that methods from different domains within computer vision tasks can be adapted and applied to face anti-spoofing tasks, thereby enhancing the performance of the model. Overall, we firmly believe that depth-based face anti-spoofing is a promising and valuable area for further research.

Conclusion and future work

In this paper, we propose a novel face anti-spoofing network structure based on central difference convolution. The structure comprises a backbone for depth estimation and a classifier for binary prediction. We leverage domain knowledge from several areas, including UNet++ multi-layer feature fusion from image segmentation, focal loss from object detection to address data imbalance and hard-sample classification, and a face depth estimation method for constructing depth maps. Furthermore, we design a multi-task model training strategy that facilitates both depth map regression and classification. Our network achieves satisfactory metrics on Replay-Attack, Oulu-NPU and SiW and is easy to deploy thanks to its reduced parameter count. This easily deployable face anti-spoofing network can provide rapid and effective protection in scenarios where real-time decision-making is crucial, such as financial transactions and identity verification, enabling the timely identification and interception of potential fraudulent activities.
However, our method might generalize poorly to unknown attack types, so enhancing its generalization capacity is crucial. Recently proposed zero/few-shot learning [48–51] can quickly adapt the model to new attacks by learning from both the predefined attacks and the very few collected samples of a new attack. The Domain Effective Fast Adaptive nEtworK (DEFAEK) [52], based on the optimization-based meta-learning paradigm, effectively and quickly adapts to new tasks. On the other hand, Adv-APG [53] regards face anti-spoofing as a unified framework of attack and defense systems and optimizes the defense system against unseen attacks via adversarial training with the attack system. To avoid identity-biased and domain-biased features, UDG-FS [54] proposed a novel Split-Rotation-Merge module to build identity-agnostic local representations. These studies give us inspiration for future research, and we will improve our model along these directions.

Acknowledgements

This research was partly supported by the National Natural Science Foundation of China (No. 61405028) and the Fundamental Research Funds for the Central Universities (University of Electronic Science and Technology of China) (No. ZYGX2019J053 and ZYGX2021YGCX020). We would like to express our gratitude to Prof. Yutang Ye and the MOEMIL laboratory staff, who provided valuable advice on the experiments conducted in this study.

Declarations

Conflicts of interest

There are no conflicts of interest in the submission of this manuscript, and the manuscript is approved by all the authors for publication.
We only use data from the public datasets mentioned in Sect. 4.1; no extra self-generated or self-collected data are used.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
1. Jukka M, Abdenour H, Matti P (2011) Face spoofing detection from single images using micro-texture analysis. In: 2011 international joint conference on biometrics (IJCB), pages 1–7. IEEE
2. de Freitas Pereira Tiago, Anjos André, De Martino José Mario, Marcel Sébastien (2012) LBP-TOP based countermeasure against face spoofing attacks. In: Asian conference on computer vision, pages 121–132. Springer
3. Jukka K, Abdenour H, Matti P (2013) Context based face anti-spoofing. In: 2013 IEEE sixth international conference on biometrics: theory, applications and systems (BTAS), pages 1–8. IEEE
4. Jianwei Y, Zhen L, Shengcai L, Stan LZ (2013) Face liveness detection with component dependent descriptor. In: 2013 international conference on biometrics (ICB), pages 1–6. IEEE
5. Zinelabidine B, Jukka K, Abdenour H (2016) Face antispoofing using speeded-up robust features and Fisher vector encoding. IEEE Signal Process Lett 24(2):141–145
6. Lei L, Xiaoyi F, Zinelabidine B, Zhaoqiang X, Mingming L, Abdenour H (2016) An original face anti-spoofing approach using partial convolutional neural network. In: 2016 sixth international conference on image processing theory, tools and applications (IPTA), pages 1–6. IEEE
7. Yaojie L, Amin J, Xiaoming L (2018) Learning deep models for face anti-spoofing: binary or auxiliary supervision. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pages 389–398
8. Chi YP, Siqi L, Shengping Z, Guoying Z (2019) 3D mask face anti-spoofing with remote photoplethysmography, August 13. US Patent 10,380,444
9. Zezheng W, Chenxu Z, Yunxiao Q, Qiusheng Z, Guojun Q, Jun W, Zhen L (2018) Exploiting temporal and depth information for multi-frame face anti-spoofing. arXiv preprint arXiv:1811.05118
10. Xiao Y, Wenhan L, Linchao B, Yuan G, Dihong G, Shibao Z, Zhifeng L, Wei L (2019) Face anti-spoofing: model matters, so does data. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 3507–3516
11. Bofan L, Xiaobai L, Zitong Y, Guoying Z (2019) Face liveness detection by rPPG features and contextual patch-based CNN. In: Proceedings of the 2019 3rd international conference on biometric engineering and applications, pages 61–68
12. Zitong Y, Jun W, Yunxiao Q, Xiaobai L, Stan ZL, Guoying Z (2020) NAS-FAS: static-dynamic central difference network search for face anti-spoofing. IEEE Trans Pattern Anal Mach Intell 43(9):3005–3023
13. Zitong Y, Chenxu Z, Zezheng W, Yunxiao Q, Zhuo S, Xiaobai L, Feng Z, Guoying Z (2020) Searching central difference convolutional networks for face anti-spoofing. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5295–5305
14. Zitong Y, Xiaobai L, Xuesong N, Jingang S, Guoying Z (2020) Face anti-spoofing with human material perception. In: European conference on computer vision, pages 557–575. Springer
15. Zezheng W, Zitong Y, Chenxu Z, Xiangyu Z, Yunxiao Q, Qiusheng Z, Feng Z, Zhen L (2020) Deep spatial gradient and temporal depth learning for face anti-spoofing. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5042–5051
16. Amin J, Yaojie L, Xiaoming L (2018) Face de-spoofing: anti-spoofing via noise modeling. In: Proceedings of the European conference on computer vision (ECCV), pages 290–306
17. Hearst Marti A, Dumais Susan T, Edgar Osuna, John Platt, Bernhard Scholkopf (1998) Support vector machines. IEEE Intell Syst their Appl 13(4):18–28
18. Izenman Alan Julian (2008) Modern multivariate statistical techniques: regression, classification and manifold learning
19.
20. Cai Rizhao, Li Zhi, Wan Renjie, Li Haoliang, Hu Yongjian, Kot Alex C (2022) Learning meta pattern for face anti-spoofing. IEEE Trans Inform Forensics Secur 17:1201–1213
21. Gang P, Lin S, Zhaohui W, Shihong L (2007) Eyeblink-based anti-spoofing in face recognition from a generic webcamera. In: 2007 IEEE 11th international conference on computer vision, pages 1–8. IEEE
22. Klaus Kollreider, Hartwig Fronthaler, Maycel Isaac Faraj, Josef Bigun (2007) Real-time face detection and motion analysis with application in "liveness" assessment. IEEE Trans Inform Forensics Secur 2(3):548–558
23. Jiangwei L, Yunhong W, Tieniu T, Jain AK (2004) Live face detection based on the analysis of Fourier spectra. In: Biometric technology for human identification, volume 5404, pages 296–303. SPIE
24. Zhuoyi Z, Cheng J, Xiya Z, Chang S, Yifeng Z (2021) Two-stream convolutional networks for multi-frame face anti-spoofing. arXiv preprint arXiv:2108.04032
25. Zhuo Wang, Qiangchang Wang, Weihong Deng, Guodong Guo (2022) Learning multi-granularity temporal characteristics for face anti-spoofing. IEEE Trans Inform Forensics Secur 17:1254–1269
26. Yousef A, Yaojie L, Amin J, Xiaoming L (2017) Face anti-spoofing using patch and depth-based CNNs. In: 2017 IEEE international joint conference on biometrics (IJCB), pages 319–328. IEEE
27. Zitong Yu, Xiaobai Li, Jingang Shi, Zhaoqiang Xia, Guoying Zhao (2021) Revisiting pixel-wise supervision for face anti-spoofing. IEEE Trans Biometrics Behav Identity Sci 3(3):285–295
28. Yahang Wang, Xiaoning Song, Tianyang Xu, Zhenhua Feng, Xiao-Jun Wu (2021) From RGB to depth: domain transfer network for face anti-spoofing. IEEE Trans Inform Forensics Secur 16:4280–4290
29. Pan Baicheng, Li Chuandong, Che Hangjun, Leung Man-Fai, Yu Keping (2023) Low-rank tensor regularized graph fuzzy learning for multi-view data processing. IEEE Transactions on Consumer Electronics
30. Bian Ying, Zhang Peng, Wang Jingjing, Wang Chunmao, Pu Shiliang (2022) Learning multiple explainable and generalizable cues for face anti-spoofing. In: ICASSP 2022 - 2022 IEEE international conference on acoustics, speech and signal processing (ICASSP), pages 2310–2314. IEEE
31. Wang Yu, Pei Mingtao, Nie Zhengang, Qi Xinmu (2023) Face anti-spoofing based on client identity information and depth map. In: International conference on image and graphics, pages 380–389. Springer
32. Jie J, Yunlian S (2022) Depth-based ensemble learning network for face anti-spoofing. In: ICASSP 2022 - 2022 IEEE international conference on acoustics, speech and signal processing (ICASSP), pages 2954–2958. IEEE
33.
Zurück zum Zitat James B, Epameinondas A, Stylianos P, George T, Yannis P, Stefanos Z (2017) 3d face morphable models" in-the-wild". In: Proceedings of the IEEE conference on computer vision and pattern recognition, pages 48–57 James B, Epameinondas A, Stylianos P, George T, Yannis P, Stefanos Z (2017) 3d face morphable models" in-the-wild". In: Proceedings of the IEEE conference on computer vision and pattern recognition, pages 48–57
34.
Zurück zum Zitat Jianzhu G, Xiangyu Z, Yang Y, Fan Y, Zhen L, Li Stan Z (2020) Towards fast, accurate and stable 3d dense face alignment. In: European Conference on Computer Vision, pages 152–168. Springer Jianzhu G, Xiangyu Z, Yang Y, Fan Y, Zhen L, Li Stan Z (2020) Towards fast, accurate and stable 3d dense face alignment. In: European Conference on Computer Vision, pages 152–168. Springer
36.
Zurück zum Zitat Shuying L, Weihong D (2015) Very deep convolutional neural network based image classification using small training sample size. In: 2015 3rd IAPR Asian Conference on Pattern Recognition (ACPR), pages 730–734 Shuying L, Weihong D (2015) Very deep convolutional neural network based image classification using small training sample size. In: 2015 3rd IAPR Asian Conference on Pattern Recognition (ACPR), pages 730–734
37.
Zurück zum Zitat Sergey I, Christian S (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. In: International conference on machine learning, pages 448–456. PMLR Sergey I, Christian S (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. In: International conference on machine learning, pages 448–456. PMLR
39.
Zurück zum Zitat Anjith G, Sébastien M (2021) Cross modal focal loss for rgbd face anti-spoofing. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 7882–7891 Anjith G, Sébastien M (2021) Cross modal focal loss for rgbd face anti-spoofing. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 7882–7891
40.
Zurück zum Zitat Litong F, Lai-Man P, Li Yuming X, Xuyuan YF, Chun-Ho CT, Kwok-Wai C (2016) Integration of image quality and motion cues for face anti-spoofing: A neural network approach. J Visual Commun Image Representation 38:451–460CrossRef Litong F, Lai-Man P, Li Yuming X, Xuyuan YF, Chun-Ho CT, Kwok-Wai C (2016) Integration of image quality and motion cues for face anti-spoofing: A neural network approach. J Visual Commun Image Representation 38:451–460CrossRef
41.
Zurück zum Zitat Keyurkumar P, Hu H, Jain AK (2016) Cross-database face antispoofing with robust feature representation. In: Chinese Conference on Biometric Recognition, pages 611–619. Springer Keyurkumar P, Hu H, Jain AK (2016) Cross-database face antispoofing with robust feature representation. In: Chinese Conference on Biometric Recognition, pages 611–619. Springer
42.
Zurück zum Zitat Tsung-Yi L, Priya G, Ross G, Kaiming H, Piotr D (2017) Focal loss for dense object detection. In: Proceedings of the IEEE international conference on computer vision, pages 2980–2988 Tsung-Yi L, Priya G, Ross G, Kaiming H, Piotr D (2017) Focal loss for dense object detection. In: Proceedings of the IEEE international conference on computer vision, pages 2980–2988
43.
Zurück zum Zitat Zinelabinde B, Jukka K, Lei L, Xiaoyi F, Abdenour H (2017) Oulu-npu: A mobile face presentation attack database with real-world variations. In: 2017 12th IEEE international conference on automatic face & gesture recognition (FG 2017), pages 612–618. IEEE Zinelabinde B, Jukka K, Lei L, Xiaoyi F, Abdenour H (2017) Oulu-npu: A mobile face presentation attack database with real-world variations. In: 2017 12th IEEE international conference on automatic face & gesture recognition (FG 2017), pages 612–618. IEEE
44.
Zurück zum Zitat Ivana C, André A, Sébastien M (2012) On the effectiveness of local binary patterns in face anti-spoofing. In: 2012 BIOSIG-proceedings of the international conference of biometrics special interest group (BIOSIG), pages 1–7. IEEE Ivana C, André A, Sébastien M (2012) On the effectiveness of local binary patterns in face anti-spoofing. In: 2012 BIOSIG-proceedings of the international conference of biometrics special interest group (BIOSIG), pages 1–7. IEEE
45.
Zurück zum Zitat Jiankang D, Jia G, Evangelos V, Irene K, Stefanos Z (2020) Retinaface: single-shot multi-level face localisation in the wild. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5203–5212 Jiankang D, Jia G, Evangelos V, Irene K, Stefanos Z (2020) Retinaface: single-shot multi-level face localisation in the wild. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5203–5212
46.
Zurück zum Zitat Jia D, Wei D, Richard S, Jia LL, Fei LF (2009) Imagenet: a large-scale hierarchical image database. In: IEEE Conference on Computer Vision Pattern Recognition Jia D, Wei D, Richard S, Jia LL, Fei LF (2009) Imagenet: a large-scale hierarchical image database. In: IEEE Conference on Computer Vision Pattern Recognition
47.
Zurück zum Zitat Zinelabdine B, Jukka K, Zahid A, Azeddine B, Djamel S, Salah Eddine B, Abdelkrim O, Fadi D, Abdelmalik T-A, Le Q, et al (2017) A competition on generalized software-based face presentation attack detection in mobile scenarios. In: 2017 IEEE international joint conference on biometrics (IJCB), pages 688–696. IEEE Zinelabdine B, Jukka K, Zahid A, Azeddine B, Djamel S, Salah Eddine B, Abdelkrim O, Fadi D, Abdelmalik T-A, Le Q, et al (2017) A competition on generalized software-based face presentation attack detection in mobile scenarios. In: 2017 IEEE international joint conference on biometrics (IJCB), pages 688–696. IEEE
48.
Zurück zum Zitat Yaqing W, Quanming Y, Kwok James T, Ni Lionel M (2020) Generalizing from a few examples: a survey on few-shot learning. ACM Comput Surveys (csur) 53(3):1–34 Yaqing W, Quanming Y, Kwok James T, Ni Lionel M (2020) Generalizing from a few examples: a survey on few-shot learning. ACM Comput Surveys (csur) 53(3):1–34
49.
Zurück zum Zitat Yisheng S, Ting W, Puyu C, Mondal SK, Jyoti Prakash S (2023) A comprehensive survey of few-shot learning: Evolution, applications, challenges, and opportunities. ACM Computing Surveys Yisheng S, Ting W, Puyu C, Mondal SK, Jyoti Prakash S (2023) A comprehensive survey of few-shot learning: Evolution, applications, challenges, and opportunities. ACM Computing Surveys
50.
Zurück zum Zitat Dahyun K, Minsu C (2022) Integrative few-shot learning for classification and segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9979–9990 Dahyun K, Minsu C (2022) Integrative few-shot learning for classification and segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9979–9990
51.
Zurück zum Zitat Yunxiao Q, Chenxu Z, Xiangyu Z, Wang Zezheng Y, Tianyu ZF, Feng Zhou, Jingping S, Zhen L (2020) Learning meta model for zero-and few-shot face anti-spoofing. Proc AAAI Conf Artificial Intell 34:11916–11923 Yunxiao Q, Chenxu Z, Xiangyu Z, Wang Zezheng Y, Tianyu ZF, Feng Zhou, Jingping S, Zhen L (2020) Learning meta model for zero-and few-shot face anti-spoofing. Proc AAAI Conf Artificial Intell 34:11916–11923
52.
Zurück zum Zitat Jiun-Da L, Yue-Hua H, Po-Han H, Julianne T, Jun-Cheng C, Tanveer M, Kai-Lung H (2023) Defaek: Domain effective fast adaptive network for face anti-spoofing. Neural Netw 161:83–92CrossRef Jiun-Da L, Yue-Hua H, Po-Han H, Julianne T, Jun-Cheng C, Tanveer M, Kai-Lung H (2023) Defaek: Domain effective fast adaptive network for face anti-spoofing. Neural Netw 161:83–92CrossRef
53.
Zurück zum Zitat Ajian L, Zichang T, Yanyan L, Jun W (2023) Attack-agnostic deep face anti-spoofing. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6335–6344 Ajian L, Zichang T, Yanyan L, Jun W (2023) Attack-agnostic deep face anti-spoofing. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6335–6344
54.
Zurück zum Zitat Yuchen L, Yabo C, Mengran G, Chun-Ting H, Yaoming W, Wenrui D, Hongkai X (2023) Towards unsupervised domain generalization for face anti-spoofing. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 20654–20664 Yuchen L, Yabo C, Mengran G, Chun-Ting H, Yaoming W, Wenrui D, Hongkai X (2023) Towards unsupervised domain generalization for face anti-spoofing. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 20654–20664
Metadata
Title: UCDCN: a nested architecture based on central difference convolution for face anti-spoofing
Authors: Jing Zhang, Quanhao Guo, Xiangzhou Wang, Ruqian Hao, Xiaohui Du, Siying Tao, Juanxiu Liu, Lin Liu
Publication date: 09.04.2024
Publisher: Springer International Publishing
Published in: Complex & Intelligent Systems
Print ISSN: 2199-4536
Electronic ISSN: 2198-6053
DOI: https://doi.org/10.1007/s40747-024-01397-0
