Xiuying Wei
Google Scholar / GitHub / Email /
xiuying.wei [at] epfl.ch, weixiuying966 [at] gmail.com
                                  

Hi, I'm Xiuying Wei, currently a PhD student in the CLAIRE lab at EPFL. I feel fortunate to be advised by Prof. Caglar Gulcehre and to collaborate with Razvan Pascanu. My research mainly focuses on efficiency: these days on pretraining and architecture design, and during my Master's on quantization and model compression.

Publications

Augmenting Attention with Exponentially Decaying Memory Improves Query-Aware KV Sparsity
Xiuying Wei, Caglar Gulcehre. [Paper]

Step 3: In this 4-page paper, we demonstrate that our dense RAT+ also significantly benefits existing query-aware sparsity methods (such as Quest, MoBA, and SnapKV) compared to applying them to standard attention. Our work sheds new light on efficient LLM inference: instead of focusing solely on optimizing downstream inference-time algorithms, we can design upstream architectures that are inherently more compatible with sparse inference, offering native support for dilated patterns and superior results on query-aware sparse patterns.

RAT+: Train Dense, Infer Sparse - Recurrence Augmented Attention for Dilated Inference
Xiuying Wei, Caglar Gulcehre. [Paper] [Code]
International Conference on Machine Learning (ICML), 2026

Step 2: Compared to RAT, this paper goes a step further. Instead of pretraining a sparse architecture with fixed hyperparameter configurations (such as chunk size, dilation size, and compression ratio), we propose to pretrain densely once, then switch flexibly during inference to various dilated attention patterns (e.g., local windows) or hybrid layer/head compositions. Our dense architecture simply augments attention with recurrence and active learning. The reasons to use it: 1) It can drastically reduce the KV cache at comparable accuracy, a feat that existing methods like GQA (train-from-scratch) or StreamingLLM (inference-time sparsity) cannot achieve. 2) Unlike models with fixed pretrained architectures (e.g., DeepSeek-V4), ours natively supports flexible inference-time compression.

RAT: Bridging RNN Efficiency and Attention Accuracy via Chunk-based Sequence Modeling
Xiuying Wei, Anunay Yadav, Razvan Pascanu, Caglar Gulcehre. [Paper] [Code]
Neural Information Processing Systems (NeurIPS), 2025

Step 1: RNNs compress the entire sequence into a fixed-size hidden state, which is fast but lossy, whereas attention applies no compression, making it accurate but slow. RAT introduces an intermediate architecture that splits the sequence into multiple short chunks, applies recurrence within each chunk for KV compression, and then performs inter-chunk attention. RAT(L=16) achieves a good balance of accuracy and efficiency. In follow-up work, we refer to this inter-chunk attention as dilated attention.
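
To make the chunking idea concrete, here is a minimal pure-Python sketch. It is not the paper's implementation: the fixed-decay recurrence, the choice of attention slots (the final state of each earlier chunk plus the running state of the current chunk), and the names chunk_recurrence, inter_chunk_attention, chunk_len, and decay are all assumptions made for illustration.

```python
import math

def chunk_recurrence(xs, chunk_len, decay=0.5):
    """Decaying recurrence that resets at every chunk boundary (assumed form).

    Returns the running state at each position.
    """
    states, h = [], None
    for t, x in enumerate(xs):
        if t % chunk_len == 0:
            h = [0.0] * len(x)  # new chunk: start from an empty state
        h = [decay * hi + (1.0 - decay) * xi for hi, xi in zip(h, x)]
        states.append(h)
    return states

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def inter_chunk_attention(queries, keys, values, chunk_len, decay=0.5):
    """Attend over chunk-compressed keys/values instead of every token."""
    k_states = chunk_recurrence(keys, chunk_len, decay)
    v_states = chunk_recurrence(values, chunk_len, decay)
    outputs = []
    for t, q in enumerate(queries):
        # Causal slots (assumption): last state of each earlier chunk,
        # plus the current chunk's running state at position t.
        slots = list(range(chunk_len - 1, t, chunk_len)) + [t]
        scale = math.sqrt(len(q))
        scores = [sum(a * b for a, b in zip(q, k_states[s])) / scale for s in slots]
        weights = softmax(scores)
        dim = len(v_states[t])
        out = [sum(w * v_states[s][i] for w, s in zip(weights, slots)) for i in range(dim)]
        outputs.append(out)
    return outputs
```

In this sketch, chunk_len=1 with decay=0.0 reduces to standard causal attention, and a chunk_len at least as long as the sequence reduces to pure recurrence, which mirrors the two extremes described above.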

Building on Efficient Foundations: Effectively Training LLMs with Structured Feedforward Layers
Xiuying Wei, Skander Moalla, Razvan Pascanu, Caglar Gulcehre.
Neural Information Processing Systems (NeurIPS), 2024

Investigate three structured linear parameterizations in transformer language models along three axes: 1) scaling law study and model size scaling, 2) efficiency and pre-merge technique, 3) optimization and self-guided training.

[Paper] [Code]

Outlier Suppression+: Accurate quantization of large language models by equivalent and effective shifting and scaling
Xiuying Wei, Yunchen Zhang, Yuhang Li, Xiangguo Zhang, Ruihao Gong, Jinyang Guo, Xianglong Liu.
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023
[Paper] [Code]

Lossy and Lossless (L2) Post-training Model Size Compression
Yumeng Shi, Shihao Bai, Xiuying Wei, Ruihao Gong, Jianlei Yang
International Conference on Computer Vision (ICCV), 2023

Integrate lossless and lossy compression techniques in a post-training setting.

[Paper]

Outlier Suppression: Pushing the Limit of Low-bit Transformer Language Models
Xiuying Wei, Yunchen Zhang, Xiangguo Zhang, Ruihao Gong, Shanghang Zhang, Qi Zhang, Fengwei Yu, and Xianglong Liu
Neural Information Processing Systems (NeurIPS), 2022 (Spotlight)

Identify outlier phenomena (channel concentration and token discrepancy) that arise when quantizing transformer language models. Propose a framework to suppress these outliers.

[Paper] [Code]

QDrop: Randomly Dropping Quantization for Extremely Low-bit Post-Training Quantization
Xiuying Wei, Ruihao Gong, Yuhang Li, Xianglong Liu, and Fengwei Yu
International Conference on Learning Representations (ICLR), 2022

Investigate how activation quantization affects weight tuning. Establish the relationship between activation quantization and the flatness of the quantized weights. Propose to randomly drop activation quantization to reach flatter optimized weights.

[Paper] [Code]
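
As a rough illustration of randomly dropping activation quantization, here is a minimal pure-Python sketch rather than the released code. The element-wise drop, the drop probability of 0.5, the symmetric uniform quantizer, and the names fake_quantize, qdrop_activations, scale, and drop_prob are all assumptions made for this example.

```python
import random

def fake_quantize(x, scale, num_bits=4):
    """Symmetric uniform fake quantization: snap to the grid, then dequantize."""
    qmax = 2 ** (num_bits - 1) - 1
    qmin = -(2 ** (num_bits - 1))
    q = max(qmin, min(qmax, round(x / scale)))
    return q * scale

def qdrop_activations(acts, scale, drop_prob=0.5, num_bits=4, rng=None):
    """Quantize each activation, except where quantization is randomly dropped.

    With probability drop_prob (assumed value) an element keeps its
    full-precision value; otherwise it is fake-quantized.
    """
    rng = rng or random.Random()
    out = []
    for a in acts:
        if rng.random() < drop_prob:
            out.append(a)  # quantization dropped for this element
        else:
            out.append(fake_quantize(a, scale, num_bits))
    return out
```

In a post-training setup, a function like this would be applied to activations only while the weights are being tuned; at inference every activation is quantized, which corresponds to drop_prob=0.0 here.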


Honors and Awards
Misc
I love sports, including going to the gym, swimming, and cycling (as a beginner). I also love cooking and reading a lot of novels.