Looped Mamba 如何模拟通用计算机

一个标准 Mamba (BiSSM) 模型，放在循环中，可以执行任意图灵机程序。
本文档用图示解释其工作原理，对应代码 mamba_subleq.py 和 mamba_read_write.py。

目录 1. 总览：16层执行一条SUBLEQ指令 2. 数据布局：状态矩阵 $X$ 的结构 3. Mamba Block 的7步计算 4. READ：广播-比较-收集（2层） 5. WRITE：广播-比较-写入（2层） 6. 减法：二进制算术（3层FFN） 7. 取整：为什么需要以及如何实现 8. 条件跳转（2层FFN） 9. 与 Transformer 的对比

1. 总览：一条SUBLEQ指令的16层流水线

SUBLEQ$(a, b, c)$ = 单指令计算机的唯一指令：

$$\texttt{mem}[b] \;\leftarrow\; \texttt{mem}[b] - \texttt{mem}[a]$$ $$\text{if } \texttt{mem}[b] \le 0 \text{ then goto } c, \quad \text{else goto next instruction}$$

任何程序都可以编译成SUBLEQ指令序列。我们的Mamba模型每次循环执行一条SUBLEQ，用16层完成。

读指令
2层

→

读$\texttt{mem}[a]$
2层

→

取整regA
1层

→

读$\texttt{mem}[b]$
2层

→

减法
3层

→

取整regB
1层

→

写回
2层

→

跳转
2层

→

纠错
1层

BiSSM扫描层 (8层) FFN层 (8层)

2. 数据布局：状态矩阵 $X$ 的结构

整个计算机的状态是一个矩阵 $X \in \mathbb{R}^{r \times n}$，其中 $n$ 是序列长度（=磁带+指令），$r$ 是每列的维度。

列号	用途	内容
$0$	暂存区 (scratchpad)	PC、指针、寄存器、flag
$1 \sim m$	内存区	$\texttt{mem}[0] \ldots \texttt{mem}[m{-}1]$，$\pm 1$ 二进制编码
$m{+}1 \sim n{-}1$	指令区	$[p_a, p_b, p_c]$ 指令三元组

每列的行结构（总行数 $r = 10L + 3D_{\text{int}} + \max(D_{\text{int}}, 3L) + 3$）：

行块	维度	用途
`cmd`	$3L$	指令: $[p_a, p_b, p_c]$ 三个地址编码
`mem`	$D_{\text{int}}$	内存值: $\pm 1$ 二进制整数
`regA`	$D_{\text{int}}$	寄存器A: 存放 $\texttt{mem}[a]$
`regB`	$D_{\text{int}}$	寄存器B: 存放 $\texttt{mem}[b]$ / 减法结果
`ptrA, ptrB, ptrC`	$3 \times L$	地址指针: $a, b, c$ 的位置编码
`tmp`	$2L$	广播缓冲区（存储广播的地址）
`tmpD`	$\max(D_{\text{int}},3L)$	数据缓冲区（存储收集的数据）
`match`	$1$	匹配标志
`PC`	$L$	程序计数器
`pos`	$L$	位置编码（固定不变）
`is_scr / is_tape`	$1+1$	暂存区/磁带标识

整数编码：使用 $\pm 1$ 二进制补码。例如 $3 = (0011)_2$ 编码为 $[\underbrace{-1, -1}_{\text{MSB}}, \underbrace{+1, +1}_{\text{LSB}}]$。

3. Mamba Block 的7步计算

每个 Mamba 层按以下顺序处理序列中的每个位置 $t$：

$$\textbf{① 输入投影：}\quad \mathbf{x}'_t = W_{\text{in}} \, \mathbf{x}_t, \quad \mathbf{z}_t = W_z \, \mathbf{x}_t$$ $$\textbf{② SiLU激活：}\quad \mathbf{u}_t = \operatorname{SiLU}(\mathbf{x}'_t) = \mathbf{x}'_t \odot \sigma(\mathbf{x}'_t)$$ $$\textbf{③ 选择性参数：}\quad \Delta_t = \operatorname{softplus}(\mathbf{w}_\Delta^\top \mathbf{u}_t + b_\Delta), \quad B_t = W_B \, \mathbf{u}_t, \quad C_t = W_C \, \mathbf{u}_t$$ $$\textbf{④ SSM递推（每通道 $j$）：}\quad h_t^{(j)} = e^{-\Delta_t}\, h_{t-1}^{(j)} + \Delta_t \, B_t \, u_t^{(j)}, \quad \hat{y}_t^{(j)} = C_t \, h_t^{(j)}$$ $$\textbf{⑤ 门控：}\quad \mathbf{y}_t = \hat{\mathbf{y}}_t \odot \operatorname{SiLU}(\mathbf{z}_t)$$ $$\textbf{⑥ 输出投影 + 残差：}\quad \mathbf{x}_t^{\text{new}} = \mathbf{x}_t + W_{\text{out}} \, \mathbf{y}_t$$ $$\textbf{⑦ FFN（可选）：}\quad \mathbf{x}_t^{\text{final}} = \mathbf{x}_t^{\text{new}} + W_2 \, \operatorname{ReLU}(W_1 \, \mathbf{x}_t^{\text{new}} + \mathbf{b}_1) + \mathbf{b}_2$$

关键性质：$\operatorname{SiLU}(0) = 0$（精确的零）
因为 $\operatorname{SiLU}(v) = v \cdot \sigma(v)$，当 $v=0$ 时 $\operatorname{SiLU}(0) = 0 \times 0.5 = 0$。
这意味着：不活跃位置的门控输出恰好为零，没有任何泄漏。这比 Transformer 的 softmax（总有 $O(e^{-\lambda})$ 的非零权重）更强。

4. READ：广播-比较-收集

核心思想

Transformer 用注意力一步完成"根据地址取值"：$Q/K$ 匹配地址，$V$ 搬运数据。Mamba 没有注意力矩阵，但可以分解为三步：

① 广播

前向扫描 →

将地址从暂存区传播到所有位置

② 比较

FFN ↔

每个位置本地比较自己的地址和广播来的地址

③ 收集

反向扫描 ←

从匹配位置取回数据到暂存区

Layer 1：前向扫描广播地址 + FFN计算匹配

问题：SiLU 无法传输负值（$\operatorname{SiLU}(-\beta) \approx 0$）。解决：用双通道编码。

正通道 $j$：$W_{\text{in}}^{(j)} = +\beta$

$$p_j = +1: \;\operatorname{SiLU}(\beta) \approx \beta$$ $$p_j = -1: \;\operatorname{SiLU}(-\beta) \approx 0$$

负通道 $L{+}j$：$W_{\text{in}}^{(L+j)} = -\beta$

$$p_j = +1: \;\operatorname{SiLU}(-\beta) \approx 0$$ $$p_j = -1: \;\operatorname{SiLU}(\beta) \approx \beta$$

重建：$\hat{p}_j \approx \frac{\text{正通道} - \text{负通道}}{\beta} = \pm 1$

选择性门控：谁向隐藏状态 $h$ 写入数据？

$$\text{暂存区 } (\texttt{is\_scr}=1): \quad \Delta_1 \approx L_{\text{sel}}\;(\text{大}), \quad B_1 = 1 \quad\Rightarrow\quad h \;\text{接收数据}$$ $$\text{磁带位置 } (\texttt{is\_scr}=0): \quad \Delta_t \approx e^{-L_{\text{sel}}}\;(\text{极小}), \quad \boxed{B_t = 0 \;\text{（精确！})} \quad\Rightarrow\quad h \;\text{不变}$$

FFN 比较：计算匹配标志（$2L$ 个隐藏神经元）

$$D_t = \sum_{j=1}^{L} \bigl|\texttt{pos}_t^{(j)} - \texttt{tmp}_t^{(j)}\bigr| = \sum_{j=1}^{L}\Bigl[\operatorname{ReLU}\!\bigl(\texttt{pos}^{(j)} - \texttt{tmp}^{(j)}\bigr) + \operatorname{ReLU}\!\bigl(\texttt{tmp}^{(j)} - \texttt{pos}^{(j)}\bigr)\Bigr]$$ $$\texttt{match}_t = 1 - D_t \qquad\text{（通过 $W_2$ 和 $b_2$ 实现）}$$

位置类型	$D_t$	$\texttt{match}_t$
匹配 ($\texttt{pos}_t = p_{\text{target}}$)	$\approx 0$	$\approx +1$
不匹配（差1维）	$\ge 2 - \delta$	$\le -1$
不匹配（差2维）	$\ge 4 - 2\delta$	$\le -3$

注意：$\texttt{match}_t$ 不是规范化的 $[0,1]$ 标志，而是一个符号正确的选择器：正数当且仅当地址匹配。不匹配时为负数，但不影响正确性—— collect 步骤用 $\operatorname{SiLU}(\beta \cdot \texttt{match}_t)$，负值被压到 $\approx 0$。

Layer 2：反向扫描收集数据 + FFN搬运到目标

收集扫描：$\texttt{match}_t$ 控制门控

$$\text{匹配位置: } \operatorname{SiLU}(\beta \cdot \underbrace{(+1)}_{\texttt{match}}) \approx \beta \;\Rightarrow\; \Delta \text{ 大}, \; B > 0 \;\Rightarrow\; \text{数据进入 } h$$ $$\text{不匹配: } \operatorname{SiLU}(\beta \cdot \underbrace{(-1)}_{\texttt{match}}) \approx 0 \;\Rightarrow\; \boxed{B = 0 \;\text{（精确零！零泄漏）}}$$

FFN 搬运：$C$-gated 条件性移动（$C_{\text{move}}=10$，仅在暂存区执行）

$$h^{+}_k = \operatorname{ReLU}\!\bigl(C \cdot \texttt{is\_scr} + \texttt{tmpD}_k - \tfrac{C}{2}\bigr), \qquad h^{-}_k = \operatorname{ReLU}\!\bigl(C \cdot \texttt{is\_scr} - \texttt{tmpD}_k - \tfrac{C}{2}\bigr)$$ $$\texttt{dest}_k \mathrel{+}= \tfrac{1}{2}(h^{+}_k - h^{-}_k)$$

位置	$\texttt{is\_scr}$	结果
暂存区	$1$	$\frac{1}{2}\bigl[(\frac{C}{2}+\texttt{tmpD}) - (\frac{C}{2}-\texttt{tmpD})\bigr] = \texttt{tmpD}$ ✓
磁带	$0$	$\frac{1}{2}[0 - 0] = 0$ ✓

5. WRITE：广播-比较-写入

为什么 WRITE 不能在1层中完成？

条件写入需要两级 ReLU：

$$\underbrace{D_t = \textstyle\sum_j |\texttt{pos}^{(j)}_t - \texttt{tmp}^{(j)}_t|}_{\text{ReLU 深度 1：计算距离}} \qquad\longrightarrow\qquad \underbrace{\operatorname{ReLU}(-C \cdot D_t + \texttt{tmpD}_k - \texttt{mem}_k)}_{\text{ReLU 深度 2：用距离做门控}}$$

标准 Mamba FFN 只有一个隐藏层（ReLU 深度 = 1），无法在同一层先算 $D_t$ 再用 $D_t$ 做 gating。

解决方案：拆成2层。Layer 1 计算 $\texttt{match}_t$ 存入状态；Layer 2 读取 $\texttt{match}_t$ 执行条件写入。

Layer 2 的条件写入（Giannou et al., 2023）

定义 $b_t = 1 - \texttt{match}_t$（不匹配时 $b \approx 1$，匹配时 $b \approx 0$），$C_{\text{gate}} = 10$：

$$\delta_k = \operatorname{ReLU}\!\bigl(-C \cdot b_t + \texttt{tmpD}_k - \texttt{mem}_k\bigr) - \operatorname{ReLU}\!\bigl(-C \cdot b_t - \texttt{tmpD}_k + \texttt{mem}_k\bigr)$$

位置	$b_t$	ReLU 参数	$\delta_k$
不匹配	$\approx 1$	$\le -C + 4 < 0$	$0$ ✓
匹配	$\approx 0$	$\texttt{tmpD}_k - \texttt{mem}_k$	$\texttt{tmpD}_k - \texttt{mem}_k$ ✓

残差更新：$\texttt{mem}_k \leftarrow \texttt{mem}_k + \delta_k = \texttt{tmpD}_k \approx v_{\text{new}}$

6. 减法：二进制算术（3层FFN）

计算 $\texttt{regB} - \texttt{regA}$（二进制补码），三步：

5a. 位翻转

$$\texttt{flip}(b_k) = 2\operatorname{ReLU}(-b_k) - 1$$

将 $\pm 1$ 位翻转：$+1 \to -1$，$-1 \to +1$

5b. 加1（进位链）

6-neuron carry chain 实现 $\texttt{flip}(\texttt{regA}) + 1 = -\texttt{regA}$

必须减去旧值（cleanup）

5c. 二进制加法

$\texttt{regB} + (-\texttt{regA})$ 通过同样的 adder

必须 cleanup 旧 regB

6-neuron ReLU 加法器（每个 bit position $i$）

读取低位 $1 \ldots i$ 的带权和 $S_i = \sum_{j=1}^{i} \frac{2^{j-1}}{2}(a_j + b_j)$，加偏置 $s_{\text{bias}} = 2^i - 1$：

$$h_0 = \operatorname{ReLU}(S_i + s_{\text{bias}} - 2^{i-1} + 1), \quad h_1 = \operatorname{ReLU}(S_i + s_{\text{bias}} - 2^{i-1}), \quad \ldots$$ $$\texttt{output}_i = 2(h_0 - h_1 + h_2 - h_3 + h_4 - h_5) - 3$$

输出值在 $\{-2, 0, +2\}$，取整后对应 $\{-1, -1, +1\}$（二进制位 $\{0, 0, 1\}$）。

7. 取整：为什么需要 + 如何实现

round_regB：为什么必须有？

加法器输出 $\{-2, 0, +2\}$。Transformer 可以用 attention 的 $V$ 矩阵线性搬运这些值。但 Mamba 的广播必须通过 $\operatorname{SiLU}$：

$$\operatorname{SiLU}(\beta \cdot (+2)) \approx 2\beta \quad\checkmark \qquad \operatorname{SiLU}(\beta \cdot (-2)) \approx 0 \quad\checkmark \qquad \boxed{\operatorname{SiLU}(\beta \cdot 0) = 0 \;\text{（精确！）}} \quad\times$$

SiLU 门控的根本限制：值 $0$（代表 bit$=0$，即 $\pm 1$ 中的 $-1$）在广播后消失，无法与"无数据"区分。

$C$-amplified sign 取整公式（$C = 100$）

$$\operatorname{clamp}_{\pm 1}(b) = \underbrace{\min\!\bigl(1,\; C \cdot \operatorname{ReLU}(b)\bigr)}_{+1 \text{ if } b > 1/C} \;-\; \underbrace{\min\!\bigl(1,\; C \cdot \operatorname{ReLU}(-b)\bigr)}_{+1 \text{ if } b < -1/C}$$

其中 $\min(1, C \cdot \operatorname{ReLU}(x)) = \operatorname{ReLU}(Cx) - \operatorname{ReLU}(Cx - 1)$（2个ReLU神经元）。

加上2个神经元消去残差中的原值，每元素共需 6个 ReLU 神经元。

输入 $b$	$\operatorname{clamp}_{\pm 1}(b)$	含义
$+2.0$	$+1$	加法器输出，代表 bit$=1$
$+0.01$	$+1$	微小正值，应为 $+1$
$0$	$0$	未初始化，保持为 $0$
$-0.01$	$-1$	微小负值，应为 $-1$
$-2.0$	$-1$	加法器输出，代表 bit$=0$

8. 条件跳转（2层FFN）

Layer 1（合并层）：计算 flag + PC自增

这两个操作读取不同的寄存器（flag 读 regB，自增读 PC），可以在同一个 FFN 层中并行执行。

Flag 计算（2个ReLU神经元）

$$f_{\text{neg}} = \operatorname{ReLU}(b_{D_{\text{int}}}) \quad\text{（MSB 为 $+1$ 则为负数）}$$ $$f_{\text{zero}} = \operatorname{ReLU}\!\bigl(1 - D_{\text{int}} - \textstyle\sum_k b_k\bigr) \quad\text{（全 $-1$ 则为零）}$$ $$\texttt{flag} = f_{\text{neg}} + f_{\text{zero}}$$

PC 自增（$6L$ 个ReLU神经元）

$$\texttt{PC}_{\text{new}} = \texttt{PC} + 1$$ $$\text{（6-neuron binary incrementer + cleanup 旧 PC）}$$

Layer 2：选择下一个 PC

$$\mathbf{p}_{\text{next}}^{(k)} = \underbrace{\operatorname{ReLU}(\texttt{ptrC}^{(k)} + \texttt{flag} - 1) - \operatorname{ReLU}(-\texttt{ptrC}^{(k)} + \texttt{flag} - 1)}_{\texttt{flag}=1 \;\Rightarrow\; \texttt{ptrC}^{(k)}}$$ $$\phantom{\mathbf{p}_{\text{next}}^{(k)} =}\; + \underbrace{\operatorname{ReLU}(\texttt{PC}_{\text{new}}^{(k)} - \texttt{flag}) - \operatorname{ReLU}(-\texttt{PC}_{\text{new}}^{(k)} - \texttt{flag})}_{\texttt{flag}=0 \;\Rightarrow\; \texttt{PC}_{\text{new}}^{(k)}} \;-\; \texttt{PC}_{\text{new}}^{(k)}$$

9. 与 Looped Transformer 的对比

指标	Transformer	Mamba	原因
层数/指令	$9$	$16$	Attention=1步寻址 vs Scan=2步
工作内存/层	$O(n^2)$	$O(r)$	无注意力矩阵
每层时间	$O(n^2 r)$	$O(n r^2)$	线性扫描 vs 二次注意力
非目标泄漏	$O(e^{-\lambda}) > 0$	精确 $0$	$\operatorname{SiLU}(0)=0$
非均匀参数	$\lambda = \Omega(\log n)$	$\beta = O(\log n)$	对等

多出的7层具体来源

差异	层数	能否消除	原因
READ $\times 3$	$+3$	不可	每次寻址 broadcast+collect=2层 vs attention=1层
WRITE	$+1$	不可	同上（ReLU 深度限制）
round_regB	$+1$	不可	$\operatorname{SiLU}(0)=0$ 导致加法器零值无法广播
round_regA	$+1$	可以	增大 $\beta \ge 10\log n$ 可消除
合并优化	$-1$	—	flag + PC自增合并到一层

不可消除的5层是用 $O(n)$ 替代 $O(n^2)$ 的本质代价。

在 $n=1024$ 时，Mamba 的工作内存是 Transformer 的 $\mathbf{1/3600}$。

代码：mamba_subleq.py + mamba_read_write.py

测试：test_subleq_comprehensive.py（35个程序，验证 memory + p_next）