Academic Profile · Updated Tuesday, May 5, 2026
Research Overview
Zexing Zhang · PhD Student, National University of Defense Technology
Research on foundation model-driven intelligent perception, trustworthy evaluation, and autonomous decision-making.
I am currently pursuing a PhD at the National University of Defense Technology, with research focused on foundation model-driven intelligent perception and autonomous decision-making. My work centers on large language models, reasoning agents, and multimodal physiological signals, with an emphasis on verifiable evaluation, cross-modal representation learning, and generalization in real-world scenarios.
Foundation Model-Driven Intelligent Perception · Trustworthy Evaluation of Large Models · Multimodal Medical Signals · Biometrics and Wearables
National University of Defense Technology · PhD Student
Sorted by CCF / JCR rank by default, filterable by year, with author names automatically highlighted.
From Teacher Pathways to Invariant Manifolds: Consensus Subspace Distillation for TSFMs
Zexing Zhang · First Author
International Conference on Machine Learning, 2026
CCF A (Main Conference) · EI · Scopus · Accepted
Abstract
Time-series foundation models (TSFMs) deliver strong cross-domain generalization, but their scale makes deployment costly. Knowledge distillation is a natural compression route, yet prior TSFM distillation typically imitates teacher outputs, features, or pairwise relations, and therefore remains tightly coupled to teacher-specific training trajectories while underutilizing two empirical properties: (i) high-level representations across model scales tend to converge toward a shared, approximately low-rank geometry, and (ii) layer-wise utility follows a long-tail pattern. We propose consensus subspace distillation, which reframes distillation as aligning a student to a model-agnostic geometric object: a scale-invariant low-rank consensus subspace together with its center statistics. Offline, we screen high-contribution layers via drop-layer marginal loss, estimate a shrinkage-stabilized covariance from their embeddings, and derive a truncated eigensubspace that defines a consensus projector. Online, we project student embeddings into this subspace and match the teacher’s projected mean and covariance using a lightweight mean-covariance objective, enabling stable optimization without rigid pointwise feature binding. To mitigate subset-induced bias, we further introduce a frequency-domain uncertainty injection mechanism that inflates spectral density based on characteristic-function discrepancies and injects dispersion only within the consensus directions. Across forecasting and imputation, the distilled student matches or slightly improves upon the teacher, while exhibiting a predictable trade-off under strict zero-shot classification. With MOMENT-Large as teacher, we achieve about 90% parameter reduction and substantial distillation-time savings while retaining comparable performance across multiple time-series tasks. Code and compressed weights are available at anonymous.4open.science/r/CSD-13C3/.
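The offline stage described in the abstract (shrinkage-stabilized covariance, truncated eigensubspace, projected mean/covariance matching) can be sketched as follows. This is a minimal illustration assuming pooled teacher embeddings as plain NumPy arrays; the function names, rank, and shrinkage scheme are hypothetical, not the paper's implementation.

```python
import numpy as np

def consensus_projector(teacher_embs, rank=8, shrinkage=0.1):
    """Estimate a shrinkage-stabilized covariance from teacher embeddings
    and return the top-`rank` eigenvector basis plus center statistics."""
    mu = teacher_embs.mean(axis=0)
    X = teacher_embs - mu
    cov = X.T @ X / max(len(X) - 1, 1)
    # Shrink toward a scaled identity to stabilize the eigensubspace estimate
    d = cov.shape[0]
    cov = (1 - shrinkage) * cov + shrinkage * (np.trace(cov) / d) * np.eye(d)
    w, V = np.linalg.eigh(cov)
    P = V[:, np.argsort(w)[::-1][:rank]]   # (d, rank) consensus basis
    return P, mu, P.T @ cov @ P            # basis, center, projected covariance

def mean_cov_loss(student_embs, P, teacher_mu, teacher_cov_proj):
    """Lightweight mean-covariance objective inside the consensus subspace."""
    Z = (student_embs - student_embs.mean(axis=0)) @ P
    mu_s = (student_embs @ P).mean(axis=0)
    cov_s = Z.T @ Z / max(len(Z) - 1, 1)
    mu_t = teacher_mu @ P
    return np.sum((mu_s - mu_t) ** 2) + np.sum((cov_s - teacher_cov_proj) ** 2)
```

A student whose projected statistics already match the teacher's incurs near-zero loss; a shifted student is penalized only along the consensus directions, which is the point of avoiding rigid pointwise feature binding.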
PPGPT: Transferring Next-Token Modeling from Language to PPG Signals
The success of large language models (LLMs) in cognitive tasks prompts the question of whether their next-token prediction (NTP) paradigm can be adapted to model physiological signals from wearable devices. A key target for this adaptation is photoplethysmography (PPG), the most prevalent sensing modality in consumer wearables for non-invasive monitoring of diverse physiological conditions. Unlike in NLP, where NTP aligns with generative objectives, physiological signal analysis involves fundamentally different tasks, such as continuous parameter estimation (regression) and discrete state recognition (classification). This disparity creates a semantic mismatch between the pre-training paradigm and the downstream tasks. To bridge this gap, we propose PPGPT, the first foundation model that reformulates NTP into next-feature token prediction (NFTP), learning hierarchical feature transition probabilities to unify pre-training and downstream objectives. PPGPT features a novel dual-stream encoder that generates feature tokens by jointly modeling temporal dynamics and local-global morphological patterns. The model is developed using a two-stage training framework: it is first pre-trained on a large-scale mixed dataset of 1.6 billion data points and then validated on our newly released BioMTL benchmark, which includes data from 172 subjects over 285 days across seven different tasks. Extensive experiments show that PPGPT significantly outperforms competing methods, achieving a 16.5% improvement in F1-score and a 25.9% reduction in Mean Absolute Error (MAE). Furthermore, the model demonstrates robust few-shot learning capabilities.
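One plausible reading of the "feature token" idea above — quantizing continuous encoder features to discrete tokens so that a next-token objective applies — can be sketched as below. The codebook, its learning, and the dual-stream encoder are all omitted; names and shapes are illustrative assumptions, not the PPGPT implementation.

```python
import numpy as np

def feature_tokens(features, codebook):
    """Map each continuous feature vector to the id of its nearest codebook
    entry, yielding a discrete token sequence a next-token predictor can model.
    (Illustrative assumption; PPGPT's actual tokenization may differ.)"""
    # Pairwise squared distances between T feature vectors and K codes: (T, K)
    d = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return d.argmin(axis=1)  # token id per timestep
```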
Who's Adam? Benchmarking Hallucinations in Scientific Dialogue
Zexing Zhang · First Author
ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2026
CCF A (Main Conference) · EI · Scopus · Accepted
Abstract
LLMs and LMMs are increasingly applied in scientific dialogue, but it remains unclear whether they can reliably ground specific dialogue statements to paper-based evidence. A central challenge is paper-grounded hallucination under a paper-as-truth setting: statements that are contradicted by, not found in, or otherwise not decidable from the paper PDF. These hallucinations can be caused by both human misinterpretations and model-generated assertions, ultimately undermining the efficiency, fairness, and credibility of scientific dialogue. Existing benchmarks often overlook this issue, focusing either on subjective macro-level quality assessments or lacking cross-modal evidence localization. We introduce ADAM-Bench (Auditing Dialogue Assertions with Multimodal Evidence), a benchmark for paper-grounded hallucinations in scientific dialogue. Starting from around 27,000 papers, ADAM-Bench is a multi-layer benchmark with three tiers: Scale, Core, and Gold. ADAM-Bench pairs approximately 1 million atomic claims with over 7 million multimodal evidence objects extracted from the corresponding PDFs. We build it through a four-stage pipeline of claim atomization, candidate evidence recall, model-assisted pre-alignment, and human verification. Based on this dataset, we define two tasks: hallucination detection and minimal evidence set localization. Additionally, to avoid the brittleness introduced by single-rationale supervision, we formalize minimal evidence as a set of equivalent evidence sets and evaluate localization by best-matching against multiple gold evidence sets. We conduct a comprehensive benchmark of 34 LLMs and 10 LMMs, spanning large proprietary models (Claude-Opus-4-6, GPT-5.2) and open-source models (Qwen3-235B, GLM-4.6V 106B). Results are markedly low (25.2%–51.1%), indicating that grounding conversational hallucinations in real multimodal papers remains far from solved. 
We hope this benchmark will contribute to building scientific assistants that make calibrated judgments, cite minimal, auditable evidence, and mitigate the impact of hallucinations in scientific discovery evaluation.
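The "best-matching against multiple gold evidence sets" evaluation described above can be illustrated with a small scoring function. This is a hedged sketch: the benchmark's exact matching metric is not specified here, so set-level F1 is assumed purely for illustration.

```python
def evidence_set_f1(predicted, gold_sets):
    """Score a predicted evidence set against each equivalent gold evidence
    set and keep the best match, avoiding single-rationale brittleness.
    (Illustrative; ADAM-Bench's exact scoring may differ.)"""
    def f1(pred, gold):
        if not pred or not gold:
            return 0.0
        tp = len(pred & gold)
        if tp == 0:
            return 0.0
        p, r = tp / len(pred), tp / len(gold)
        return 2 * p * r / (p + r)
    pred = set(predicted)
    return max(f1(pred, set(g)) for g in gold_sets)
```

Because minimal evidence is a set of equivalent sets, a model is rewarded for matching any one valid rationale rather than a single annotated one.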
Integrated Channel Equally-Divided and Coordinate Attention Feature Pyramid for subtype detection of lung cancer
Lung cancer, a leading global cause of cancer incidence and mortality, demands accurate detection. Current research primarily targets early nodule detection and distinguishing benign from malignant tumors, with limited focus on lung cancer classification. Identifying various lung cancer types is vital for tailored treatments, while challenges persist in localizing small lesions and early tumors. To tackle these challenges, we propose a framework integrating the Channel Equally-Divided (CED) module and the Coordinate Attention Feature Pyramid Network (CAFPN). CAFPN, an innovative feature pyramid structure, integrates the Semantic Information Enhanced (SIE) module and Coordinate Attention. The SIE module filters redundant semantic information, enriches texture features through Coordinate Attention, emphasizes shallow semantic details, and amplifies trustworthy specifics for enhanced deep semantic information. Additionally, the CED module is devised to proficiently extract local contextual information across channels for more precise feature representations. The superior performance of the proposed method was empirically validated through comparative experiments on two mainstream datasets, with Competition Performance Metric (CPM) scores and Mean Average Precision (mAP) values reaching 0.942 and 99.18%, respectively, outperforming current state-of-the-art approaches.
Spore: Spatio-Temporal Collaborative Perception and representation space disentanglement for remote heart rate measurement
Zexing Zhang · First Author
Neurocomputing, 2025
JCR Q1 · SCI-E · EI · Scopus · INSPEC
Abstract
Remote Photoplethysmography (rPPG) leverages standard RGB cameras for contactless heart rate monitoring, overcoming the limitations of traditional PPG technology in telemedicine and offering a highly scalable, cost-effective health monitoring solution. Despite the advancements of current deep learning methods, which utilize spatiotemporal convolutional networks to capture subtle rPPG signals, these approaches often fail to fully exploit local similarities and global quasi-periodicity in both spatial and temporal dimensions. Additionally, non-physiological noise remains prevalent in the representation space, impeding the accurate estimation of physiological parameters across diverse representation domains. To address these measurement challenges, we propose Spore, a novel training strategy that integrates a Spatio-Temporal Cooperative Perception Network (STCPNet) and a Separable Network (SpNet). Spore effectively disentangles noise and extracts physiological signals through differential orthogonal disentanglement and parallel approximation techniques, ensuring precise measurement of heart rate. STCPNet meticulously aggregates semantic context across spatial and temporal dimensions, enhancing global-level and trend cross-correlations in a fine-grained manner. Meanwhile, the resource-efficient SpNet identifies and constructs target representation spaces by realigning the distribution of the source latent space, thereby adaptively capturing disentangled physiological signal patterns from the computationally intensive STCPNet. For validation, extensive experiments were conducted not only on multiple benchmark datasets but also through deployment testing in real-world scenarios. The results demonstrate that our proposed training strategy achieves state-of-the-art performance in heart rate measurement while maintaining resource efficiency. The code will be released at https://github.com/zacheryzhang/spore.
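Once an rPPG waveform has been recovered, heart rate is typically read off as the dominant spectral peak in the physiological band. The snippet below is generic post-processing of that kind, not the Spore pipeline itself; the sampling rate and band limits are illustrative assumptions.

```python
import numpy as np

def heart_rate_bpm(signal, fs=30.0, lo=0.7, hi=3.0):
    """Estimate heart rate from an rPPG waveform by locating the dominant
    spectral peak within the typical HR band (0.7-3 Hz, i.e. 42-180 bpm).
    (Generic sketch; Spore's measurement head is not reproduced here.)"""
    x = signal - np.mean(signal)                   # remove DC component
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)    # frequency grid in Hz
    power = np.abs(np.fft.rfft(x)) ** 2            # periodogram
    band = (freqs >= lo) & (freqs <= hi)           # restrict to HR band
    return 60.0 * freqs[band][np.argmax(power[band])]
```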
A general framework for generative self-supervised learning in non-invasive estimation of physiological parameters using photoplethysmography
Aligning physiological parameter labels with large-scale photoplethysmographic (PPG) data for deep learning is challenging and resource-intensive. While self-supervised representation learning (SSRL) can handle limited annotated data, the challenge lies in learning robust shared representations from vast unlabeled data and integrating various contextual cues to learn distinctive representations. To alleviate these challenges, a generative SSRL framework TS2TC is proposed to collaboratively utilize the temporal, spectrogram, and temporal-spectrogram mixed domains to explore and incorporate the unique features of PPG for universal and non-invasive physiological parameter estimation. Initially, a pretext task named Cross-Temporal Fusion Generative Anchor (CTFGA) is designed, modeling temporal dependencies and reconstructing independent segments at a coarse level to provide robust global feature extraction and local semantic contextual representation. The framework also includes sub-signals from PPG with diverse frequency scales and order derivatives reflecting hemodynamics to facilitate learning shared representations at varying semantic levels. Secondly, an advanced cognitive-inspired dual-process transfer (DPT) strategy is formulated, consisting of prior-dependent autonomous processes and posterior observation reasoning processes, to leverage the independent and integrated advantages of shared and specific representations. Furthermore, TS2TC introduces a novel bilinear temporal-spectrogram fusion method in the mixed domain, aligning latent representations from different domains, and establishing fine-grained contextual interactions at the feature level across multiple sources of information. Extensive experiments on physiological parameter estimation tasks showed that the joint performance of CTFGA and DPT outperforms standard generative learning significantly. 
TS2TC achieved an average 2.49% improvement in RMSE over the current state-of-the-art estimation methods with only 10% training data.
DeBeauty: A Joint Framework for Facial Beautification Removal Based on Spatial Collaborative Adaptation and Hyperplane Relocation
IEEE International Conference on Acoustics, Speech and Signal Processing, 2025 · DOI
CCF B · EI · Scopus
Abstract
Facial beautification removal presents a formidable inverse challenge due to the inherent diversity and unpredictability of beautification processes. Current methodologies often fall short in effectively restoring facial structural alterations and preserving texture features during makeup removal. This paper introduces DeBeauty, an innovative joint framework for facial beautification removal, comprising two primary workflows: Adversarial De-Makeup Flow (ADF) and Relocation Deformation Flow (RDF). ADF incorporates Multi-Level Perception Collaborative Discrimination (MLPCD) and Discriminator-Guided Spatial Adaptive Multi-Scale Attention (SAMA), which enhance the detection of subtle makeup and facilitate the comprehensive removal of extensive makeup while preserving facial texture. RDF introduces a Hyperplane Relocation Strategy that adjusts the latent code of the input face to align with the original structural distribution. Experimental evaluations on the newly proposed Multivariate Beautified Face dataset demonstrate that this approach effectively restores the original color and structural context of the face while preserving essential facial features, achieving state-of-the-art performance.
A survey on deep learning-based object detection for crop monitoring: pest, yield, weed, and growth applications (H. Lu et al.)
Modern agriculture faces significant challenges in enhancing crop production efficiency and management. Crop monitoring has emerged as a critical component for achieving precision and intelligent agricultural management. This paper presents a comprehensive review of the latest advancements in deep learning-based object detection techniques applied to crop monitoring. Object detection methods are categorized into single-stage and two-stage approaches, further classified based on feature extraction techniques, namely CNN-based and SSM-based methods. This analysis highlights the significant contributions and limitations of these methods across four primary application domains: pest and disease detection, crop growth monitoring, yield estimation, and weed detection. Statistical data indicates that research in these domains accounts for 84% of the total studies in crop monitoring. Additionally, challenges related to data collection and processing, model selection, and optimization are discussed, along with potential solutions. A summary of publicly available datasets, commonly used evaluation metrics, and performance comparisons of mainstream models in crop monitoring research is also provided. In the future, emphasis is placed on algorithm performance optimization, improved dataset quality, and the development of customized solutions tailored for practical agricultural applications. Addressing these challenges will further advance the modernization and intelligence of agricultural practices.
MBRSTCformer: a knowledge embedded local-global spatiotemporal transformer for emotion recognition
Emotion recognition is an essential prerequisite for realizing generalized BCI, which possesses an extensive range of applications in real life. EEG-based emotion recognition has become mainstream due to its real-time mapping of brain emotional activities, so a robust EEG-based emotion recognition model is of great interest. However, most existing deep learning emotion recognition methods perform feature extraction on the EEG signal as a whole, which destroys local stimulation differences and fails to extract local features of brain regions well. Inspired by the cognitive mechanisms of the brain, we propose the multi-brain regions spatiotemporal collaboration transformer (MBRSTCformer) framework for EEG-based emotion recognition. First, inspired by prior knowledge, we propose the Multi-Brain Regions Collaboration Network. The EEG data are processed separately after being divided by brain regions, and stimulation scores are computed to quantify the stimulation produced by different brain regions and feed the stimulation degree back to MBRSTCformer. Second, we propose a Cascade Pyramid Spatial Fusion Temporal Convolution Network for multi-brain-region EEG feature fusion. Finally, we conduct comprehensive experiments on two mainstream emotion recognition datasets to validate the effectiveness of the proposed MBRSTCformer framework. We achieved accuracies of 98.63, 98.15, and 98.58 on the three dimensions (arousal, valence, and dominance) of the DEAP dataset, and 97.66, 97.07, and 97.97 on the DREAMER dataset.
SDA-SAM: Semantic-Driven Adaptive Mixed-Precision Quantization for Segment Anything Model
Zexing Zhang · Co-corresponding Author
Proceedings of the International Joint Conference on Neural Networks, 2026
CCF C · EI · Scopus · Accepted
PPG Sensor-Based Biometric Identification and Physiological Analysis via Temporal-Frequency Disentanglement with Liquid Neural Networks
Photoplethysmography (PPG) sensors support both physiological monitoring and biometric identification, making them key components in wearable sensing systems. However, real-world applications face challenges from signal nonstationarity and physiological variability. This work proposes a temporal-frequency manifold disentanglement framework to improve the robustness and accuracy of PPG-based biometric recognition. A closed-form continuous-time (CfC) liquid neural network captures temporal and spectral features from raw PPG signals, while an orthogonal manifold projection separates identity-related and physiological representations. To support physiological analysis, we construct and release a new multiphysiological PPG dataset with synchronized annotations for body mass index (BMI), blood pressure, blood glucose, and heart rate. Our method achieves 94.12% accuracy (F1-score: 0.93), outperforming eight state-of-the-art approaches. Further analysis reveals that BMI, blood glucose, and heart rate strongly influence identity features, highlighting the need for physiologically aware modeling in sensor systems. The proposed framework enhances PPG sensor signal interpretation, offering a scalable solution for real-time biometric sensing applications.
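The orthogonal-projection idea — splitting features into an identity component and a physiological residual orthogonal to it — can be sketched as follows. The learned identity basis and the CfC network are omitted, and all names here are illustrative assumptions rather than the paper's implementation.

```python
import numpy as np

def orthogonal_split(features, identity_basis):
    """Decompose feature vectors into a component inside the identity subspace
    (spanned by the orthonormal columns of `identity_basis`) and the residual
    orthogonal to it, which would carry physiological variation.
    (Sketch of the concept only, not the paper's learned projection.)"""
    B = identity_basis                      # (d, k), orthonormal columns
    identity_part = features @ (B @ B.T)    # project onto span(B)
    physio_part = features - identity_part  # orthogonal complement
    return identity_part, physio_part
```

By construction the two parts sum back to the input, and the physiological residual has zero projection onto the identity subspace.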
PIKGMA: PrIori Knowledge-Guided Multimodal Alignment And Domain Adaptation For Emotion Recognition
IEEE International Conference on Computer Research and Development, 2025 · DOI
EI · Scopus
Abstract
Multimodal domain-adaptive emotion recognition aims to utilize the feature information of multiple modalities to achieve cross-domain emotion recognition, and properly constraining the inconsistency between inner-domain and cross-domain modalities has been a research hot spot. To address these challenges, we propose a priori knowledge-guided multimodal alignment domain-adaptive emotion recognition model, PIKGMA. PIKGMA is a semi-supervised multimodal physiological-signal domain adaptation model for emotion recognition. It aligns multimodal emotion features with the support of inner-domain and cross-domain modality alignment. Meanwhile, to better utilize the priori knowledge from the source domain (the data distribution of the source domain), we propose the priori dual-alignment bank strategy. The memory bank can be used to assist cross-domain modality alignment and pseudo-label generation for domain adaptation. We conducted detailed experiments using two physiological signals on the DREAMER dataset, and the results show that PIKGMA performs very well and is superior to existing methods.
UAVEL-YOLO: An Efficient and Lightweight Target Detection Method for Aerial Imagery Captured by UAVs
International Conference on Geology, Mapping and Remote Sensing, 2025 · DOI
EI · Scopus
Abstract
To address the challenge of target detection in UAV aerial images, which is exacerbated by large scale variations and complex backgrounds, this paper proposes a novel target detection model named UAVEL-YOLO. First, we design a lightweight Self-Adaptive Multi-Scale Contextual Perception Attention (SAMCPA) mechanism that enables the network to more effectively capture contextual information in the image and focus on more important regions, thereby enhancing the perceptual and interpretive abilities of the model. Second, we propose a Dual Branch Linear Separable Kernel (DBLSK) module, which not only suppresses the exponential growth of the parameter count but also provides richer gradient flow information. Moreover, to better detect small, dense, and variably sized objects, we incorporate a P2 detection head to enhance the model’s ability to perceive small targets. The proposed model is evaluated on the VisDrone2019 dataset. Compared with YOLO11s, our model improves mAP@.5, mAP@.5:.95, Precision, and Recall by 6.5%, 4.3%, 3.9%, and 5.5%, respectively, while reducing the model’s parameter count by 66.96%.
MILD: A Multimodal Biometric Recognition Framework Integrating Large Foundation Models
Chinese Conference on Biometric Recognition, 2024 · DOI
EI · CPCI-S · Scopus
Abstract
Traditional unimodal biometric recognition technologies, while widely applied across various fields, still face limitations such as environmental interference, spoofing attacks, and individual differences, leading to insufficient accuracy and reliability. Consequently, multimodal biometric recognition technology enhances recognition performance by integrating multiple biometric features. However, effectively merging the semantic information of different modalities remains a key challenge. This paper proposes a multimodal biometric recognition framework with integrated large models (MILD). The framework incorporates foundational large models for audio, language, and images, and innovatively designs modality adapters and multimodal decoders to address the semantic alignment issue of large models. Additionally, MILD uniquely combines voiceprints, electrocardiograms (ECG), and palm prints to enhance the anti-spoofing performance of biometric recognition. Experimental results validate the effectiveness of the MILD framework in cross-modal feature fusion and accurate recognition, demonstrating the potential of foundational large models in complex scenarios, with the highest cross-dataset recognition accuracy reaching 97.65%.
MultiBioGM: a hand multimodal biometric model combining texture prior knowledge to enhance generalization ability
Zexing Zhang · First Author
Chinese Conference on Biometric Recognition, 2023 · DOI
EI · CPCI-S · Scopus
Abstract
Authentication through hand texture features is one of the crucial directions in biometric identification, and some recognition methods based on traditional machine learning or deep learning have been proposed. However, the generalization ability of these methods is not satisfying due to the different entities, backgrounds, and sensors. In this paper, based on the three modalities of fingerprint, fingervein, and palmprint, the texture prior knowledge extractor (PKE) is innovatively designed as a unified paradigm for texture extraction, aiming to improve the model generalization ability through prior knowledge. The feature vectors of texture images are obtained for matching by a knowledge embedding extractor (KEG) based on the Siamese Network. The credibility algorithm is proposed for multimodal decision-level feature fusion. Cascading PKE and KEG is our proposed multimodal biometric generalization model MultiBioGM. Experimental results on three multimodal datasets demonstrate the effectiveness of our model for biometrics, which achieves 0.098%, 0.024%, and 0.117% EERs on unobserved data.
Pre-clustered Generative Adversarial Network Model for Mongolian Font Style Transfer
International Conference on Optimization, Simulation and Control, 2022 · DOI
Scopus
Abstract
Font style transfer has important application value in the field of data augmentation and can be used to alleviate the problem of insufficient data in fields such as handwritten character recognition, glyph inference and restoration, and ancient book restoration. The complexity of traditional Mongolian brings many challenges to character recognition and the restoration of ancient books. This paper first builds a small-scale Mongolian font-style dataset and a graph cluster aggregator algorithm. Secondly, an improved conditional generative adversarial network model with an MSE loss function is proposed, and the self-built dataset is used to train the model after image aggregation. The experimental results show that the model can learn the traditional Mongolian font style and transfer it to text with the same semantics with a small amount of training, generating images with a prominent style.
Preprints & Working Papers
Dynamic Incremental Learning for Non-invasive Blood Glucose Estimation from Wearable Physiological
Zexing Zhang · First Author
Engineering Applications of Artificial Intelligence, 2026
JCR Q1 · SCI-E · EI · Scopus
Service & Exchanges
Reviewing, conference exchanges, and academic talks
Presented and exchanged research at the AAAI Conference on Artificial Intelligence.
2024.12 · Changchun, China
Attended the CCF Jilin Graduate Academic Exchange Seminar and gave a keynote presentation; received the Outstanding Paper Award.
2024.10 · Hangzhou, China
Attended the China National Computer Congress.
2023.12 · Xuzhou, China
Presented and exchanged research at the Chinese Conference on Biometric Recognition; received a Best Paper Nomination.