Di Zhang (Fudan University, [email protected])

CODE: https://github.com/trotsky1997/Let-BERT-SPEAK/blob/main/generate.py


3 Method

3.1 Overview

We propose Blockwise Diffusion Generation (BDG) — a training-free text generation framework that transforms masked language models (MLMs) such as BERT or RoBERTa into autoregressive-like generators. Instead of fine-tuning the model for next-token prediction, BDG iteratively refines masked segments of text using the model’s native masked token prediction capability.

At each iteration, the model fills in several consecutive masked tokens (a block), evaluates token confidence, and re-masks uncertain positions, thereby forming a diffusion-like refinement process over the discrete token space.
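
The following is a minimal sketch of one such refinement loop, assuming a HuggingFace-style masked language model. The helper name `refine_block`, the confidence threshold `tau`, and the step count are illustrative assumptions, not the exact implementation in the released generate.py.

```python
# Illustrative sketch of the BDG refinement loop (not the repository's exact code).
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForMaskedLM.from_pretrained("roberta-base")
model.eval()

def refine_block(input_ids, block_range, num_steps=4, tau=0.9):
    """Iteratively fill the masked block, then re-mask low-confidence positions."""
    mask_id = tokenizer.mask_token_id
    block = slice(*block_range)
    for step in range(num_steps):
        with torch.no_grad():
            logits = model(input_ids).logits          # (1, seq_len, vocab_size)
        probs = logits.softmax(dim=-1)
        conf, pred = probs.max(dim=-1)                # per-position confidence and argmax token
        # Fill every position in the block with its current best prediction.
        input_ids[0, block] = pred[0, block]
        # Re-mask uncertain positions so later passes can revise them (skip on the last pass).
        if step < num_steps - 1:
            uncertain = conf[0, block] < tau
            input_ids[0, block][uncertain] = mask_id
    return input_ids
```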


3.2 Blockwise Masked Generation

Given an input sequence \( x = [x_1, \dots, x_t] \), we append a block of \( B \) mask tokens ([MASK]) to the sequence:

$$x' = [x_1, \dots, x_t, \underbrace{[\text{MASK}], \dots, [\text{MASK}]}_{B}]$$

This extended sequence is fed into the MLM to predict the probability distribution \( P_\theta(v \mid x') \) over the vocabulary at each masked position.
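
For concreteness, the block-extension step can be sketched as below, reusing the `tokenizer` and `model` from the earlier snippet. The helper `extend_with_mask_block` and its arguments are hypothetical names introduced for illustration.

```python
# Sketch of appending a block of B [MASK] tokens and querying the MLM for P_theta(v | x').
import torch

def extend_with_mask_block(prefix_ids, B, tokenizer):
    """Append B mask tokens to the prefix; return the extended ids and the block's index range."""
    mask_block = torch.full((1, B), tokenizer.mask_token_id, dtype=torch.long)
    extended = torch.cat([prefix_ids, mask_block], dim=1)
    block_start = prefix_ids.shape[1]
    return extended, (block_start, block_start + B)

# Example usage (special-token handling is simplified here):
# prefix_ids = tokenizer("The weather today is", return_tensors="pt",
#                        add_special_tokens=False).input_ids
# x_prime, block_range = extend_with_mask_block(prefix_ids, B=8, tokenizer=tokenizer)
# logits = model(x_prime).logits    # distribution over the vocabulary at every position
```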

We then apply top-k, top-p (nucleus), and temperature-scaled sampling to select replacement tokens:

$$ p_i = \text{softmax}\left( \frac{\text{logits}_i}{T} \right)$$

where \( T \) denotes the temperature. Sampling is restricted to the top-k or top-p subset of candidate tokens to balance diversity and coherence.
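
A sketch of this sampling step for a single masked position is given below; the cutoff values are illustrative defaults rather than values prescribed by the method.

```python
# Temperature-scaled top-k / top-p sampling over one masked position's logits.
import torch

def sample_token(logits, temperature=1.0, top_k=50, top_p=0.95):
    """Sample one token id from a (vocab_size,)-shaped logits vector."""
    logits = logits / temperature
    # Top-k: keep only the k highest-scoring candidates.
    if top_k > 0:
        kth_value = torch.topk(logits, top_k).values[-1]
        logits = logits.masked_fill(logits < kth_value, float("-inf"))
    # Top-p (nucleus): keep the smallest prefix whose cumulative probability exceeds p.
    if top_p < 1.0:
        sorted_logits, sorted_idx = torch.sort(logits, descending=True)
        cum_probs = sorted_logits.softmax(dim=-1).cumsum(dim=-1)
        remove = cum_probs > top_p
        remove[1:] = remove[:-1].clone()   # shift right so the boundary token is kept
        remove[0] = False                  # always keep the most probable token
        logits[sorted_idx[remove]] = float("-inf")
    probs = logits.softmax(dim=-1)
    return torch.multinomial(probs, num_samples=1).item()
```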


3.3 Token Sampling and Constraints

To ensure controllability, we introduce a banned-token filtering mechanism that excludes undesired words (e.g., [UNK], “bot”, or user-defined terms). Before sampling, all corresponding token IDs are masked out by setting their logits to \(-\infty\).
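
The filter can be sketched as follows: banned words (plus [UNK]) are mapped to vocabulary ids and those logits are pushed to \(-\infty\) before sampling. The banned list and helper name here are only examples, not the repository's API.

```python
# Sketch of banned-token filtering applied to a (vocab_size,)-shaped logits vector.
import torch

def ban_tokens(logits, tokenizer, banned_words=("bot",)):
    """Assign -inf to every vocabulary id produced by the banned words."""
    banned_ids = set()
    if tokenizer.unk_token_id is not None:
        banned_ids.add(tokenizer.unk_token_id)
    for word in banned_words:
        # Encode both the bare word and its leading-space variant (relevant for BPE vocabularies).
        for ids in (tokenizer.encode(word, add_special_tokens=False),
                    tokenizer.encode(" " + word, add_special_tokens=False)):
            banned_ids.update(ids)
    logits[list(banned_ids)] = float("-inf")
    return logits
```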