🔧 SCANS Algorithm 1: Complete Visualization

Safety-Conscious Activation Steering Workflow

🎯 SCANS Algorithm: Overall Workflow

📥 Input

• Safety-aligned LLM M
• Steering multiplier α
• Steering layers [L_l, L_H]
• Anchor data Q = {Q−, Q+}
• Positive response r_pos
• Hyperparameters T, L
• Input queries {q}

📤 Output

• The steered outputs
• Safe and helpful responses
• Balanced safety behavior

🎯 Phase 1: Inducing the Refusal Steering Vectors

Goal: extract a vector that captures the activation difference between unsafe and safe queries.

v_r^l = (1/|Q−|) Σ a^l(q−) − (1/|Q+|) Σ a^l(q+)

Process:
1. Collect the hidden states at each layer
2. Compute the activation difference between harmful and benign queries
3. Build a refusal vector for each safety-critical layer

🧭 Phase 2: Identifying the Steering Direction

Goal: judge whether a new query is safe and choose the matching steering direction.

σ(q) = { −1 if s_q < T, 1 otherwise }

Process:
1. Append a positive response ("Sure") to the query
2. Analyze the hidden state transition
3. Compute the similarity to the harmful direction
4. Apply threshold-based binary classification

⚡ Phase 3: Safety-Conscious Activation Steering

Goal: shift the model's activations in the chosen direction to produce a balanced response.

ã^l(q) = a^l(q) + σ(q) · α · v_r^l

Process:
1. Intervene in real time at inference
2. Steer only at the safety-critical layers
3. Apply the steering direction and strength
4. Generate a balanced, safe response

🎯 Phase 1: Inducing the Refusal Steering Vectors

📊 Step 1-2: Collecting the Anchor Data

For each query q ∈ Q, collect the hidden state a^l(q) at the last token position from every layer l.

// Line 1-2: Initialize and collect hidden states
v_r ← ∅
For each query q ∈ Q:
    collect hidden states a^l(q) for each layer l at the last token position
Q− (harmful queries, 64 samples) + Q+ (benign queries, 64 samples) → activations collected at each layer

🧮 Step 3-5: Computing the Refusal Vectors

Over the safety-critical layer range [L_l, L_H], compute the activation difference between harmful and benign queries.

v_r^l = (1/|Q−|) Σ_{q− ∈ Q−} a^l(q−) − (1/|Q+|) Σ_{q+ ∈ Q+} a^l(q+)

// Line 3-5: Compute refusal vectors
for l ← L_l to L_H do:
    Compute v_r^l using Eq. 1
    v_r ← v_r ∪ {v_r^l}

🎯 Result: Per-Layer Refusal Steering Vectors

Each safety-critical layer gets a vector v_r^l that represents the refusal direction.

Layer 10: v_r^10
Layer 15: v_r^15
Layer 20: v_r^20
...
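
The per-layer mean difference can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the reference implementation: the names `mean_vector` and `refusal_vectors` are hypothetical, activations are plain Python lists so the example has no dependencies, and the toy anchor set has 2 queries per class instead of 64.

```python
def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def refusal_vectors(acts_harmful, acts_benign, layer_lo, layer_hi):
    """Return {layer: v_r^l} for layers layer_lo..layer_hi (inclusive).

    acts_harmful[i][l] and acts_benign[i][l] hold the last-token hidden
    state (a list of floats) of anchor query i at layer l.
    """
    vectors = {}
    for layer in range(layer_lo, layer_hi + 1):
        mean_harmful = mean_vector([query[layer] for query in acts_harmful])
        mean_benign = mean_vector([query[layer] for query in acts_benign])
        # v_r^l = mean over Q- minus mean over Q+
        vectors[layer] = [h - b for h, b in zip(mean_harmful, mean_benign)]
    return vectors


# Toy anchor data (hypothetical): 2 queries per class, 3 layers, hidden size 2.
acts_harmful = [
    [[1.0, 0.0], [2.0, 2.0], [4.0, 0.0]],
    [[1.0, 0.0], [4.0, 2.0], [2.0, 0.0]],
]
acts_benign = [
    [[1.0, 0.0], [1.0, 0.0], [1.0, 1.0]],
    [[1.0, 0.0], [1.0, 2.0], [1.0, 1.0]],
]

v_r = refusal_vectors(acts_harmful, acts_benign, layer_lo=1, layer_hi=2)
print(v_r)  # {1: [2.0, 1.0], 2: [2.0, -1.0]}
```

In practice the activations would be tensors read from the model's hidden states, but the arithmetic per layer is the same.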
🧭 Phase 2: Identifying the Steering Direction

📝 Step 6-7: Concatenating the Positive Response

Concatenate the positive response r_pos (e.g., "Sure") to each harmful query q ∈ Q−.

// Line 6-7: Concatenate positive response
for q ∈ Q− do:
    q' ← concat(q, r_pos)   // e.g., "How to hack?" + "Sure"

"How to hack?" + "Sure" → "How to hack? Sure"

🔄 Step 8-9: Computing the Hidden State Transition

The state transition vector is the difference between the hidden state at the end of the query part and the hidden state at the end of the entire input.

a_t^l(q) = a_p^l(q + r_pos) − a_e^l(q + r_pos)

// Line 8-9: Collect hidden state transition
Input q', collect two hidden states:
    - a_p: from last token of query part
    - a_e: from final token of entire input
Compute a_t(q) = {a_t^l(q)}_{l∈L} using Eq. 2
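
To make the two token positions concrete, here is a small sketch of the transition computation. It assumes the per-token hidden states of the concatenated input are already available as nested lists `hidden[l][t]` (layer l, token position t). The function name `hidden_state_transition` and the `query_len` parameter are hypothetical, and the sign follows the formula as written in this section.

```python
def hidden_state_transition(hidden, query_len):
    """Return a_t^l = a_p^l - a_e^l for every layer l.

    hidden[l][t] is the hidden state at layer l and token position t
    for the concatenated input q' = concat(q, r_pos).
    query_len is the number of tokens in the query part q.
    """
    transitions = []
    for layer_states in hidden:
        a_p = layer_states[query_len - 1]  # last token of the query part
        a_e = layer_states[-1]             # final token of the entire input
        transitions.append([p - e for p, e in zip(a_p, a_e)])
    return transitions


# Toy input (hypothetical): 2 layers, 3 tokens (2 query tokens + 1 for r_pos).
hidden = [
    [[0.0, 1.0], [2.0, 1.0], [1.0, 3.0]],  # layer 0
    [[1.0, 1.0], [0.5, 0.0], [0.0, 2.0]],  # layer 1
]

a_t = hidden_state_transition(hidden, query_len=2)
print(a_t)  # [[1.0, -2.0], [0.5, -2.0]]
```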

📊 Step 10: Computing the Similarity

Compute the cosine similarity between the state transition vector and the harmful-direction vector, averaged over the layers in L.

s_q = (1/|L|) Σ_{l ∈ L} cos(a_t^l(q), d_harm^l)

State transition a_t^l(q) vs. harmful direction d_harm^l → similarity score s_q
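
The layer-averaged cosine similarity can be sketched as follows. The helper names `cosine` and `similarity_score` are hypothetical, the vectors are toy values, and the sketch assumes every vector has a non-zero norm.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


def similarity_score(a_t, d_harm):
    """s_q: mean over layers of cos(a_t^l, d_harm^l).

    a_t and d_harm are lists with one vector per layer in L.
    """
    sims = [cosine(t, d) for t, d in zip(a_t, d_harm)]
    return sum(sims) / len(sims)


# Toy values (hypothetical): 2 layers, hidden size 2.
a_t = [[1.0, 0.0], [0.0, 2.0]]
d_harm = [[3.0, 0.0], [1.0, 0.0]]

s_q = similarity_score(a_t, d_harm)
print(s_q)  # 0.5: layer 0 is parallel (cos = 1), layer 1 is orthogonal (cos = 0)
```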

⚖️ Step 11-14: Binary Classification

Compare s_q with the threshold T to judge whether the query is safe and to set the steering direction.

// Line 11-14: Binary classification
if s_q < T then:
    σ(q) ← −1   /* query q is safe */
else:
    σ(q) ← 1    /* query q is unsafe */

s_q < T: σ(q) = −1 (safe)
s_q ≥ T: σ(q) = +1 (unsafe)
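
The threshold rule is a one-line function. In this sketch the name `steering_direction` and the example scores are hypothetical, and T = 0.7 is just one illustrative value.

```python
def steering_direction(s_q, threshold):
    """Return sigma(q): -1 if s_q < T (treated as safe), +1 otherwise (unsafe)."""
    return -1 if s_q < threshold else 1


T = 0.7  # illustrative threshold, an assumption for this example

for s_q in (0.35, 0.70, 0.92):
    print(s_q, steering_direction(s_q, T))
# 0.35 -1
# 0.7 1
# 0.92 1
```

Note that a score exactly equal to T falls on the unsafe side, since the rule uses a strict `<` for the safe branch.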
⚡ Phase 3: Safety-Conscious Activation Steering

🚀 During Inference

When an actual user query arrives, activation steering is applied in real time.

// Line 15: Real-time inference
Input queries {q} to M
each layer l outputs corresponding hidden states

🎯 Step 16: Selecting the Safety-Critical Layers

Intervene only within the predefined safety-critical layer range [L_l, L_H].

Layer 0-9: ⚪ no intervention
Layer 10-20: 🔴 intervention target
Layer 21-31: ⚪ no intervention

// Line 16: Check safety-critical layers
if l ∈ [L_l, L_H] then:
    // Apply steering only to these layers

🔧 Step 17-18: Activation Steering

Shift the hidden state at the last token position using the chosen direction and strength.

ã^l(q) = a^l(q) + σ(q) · α · v_r^l

// Line 17-18: Steer the hidden states
Steer hidden states a^l(q) at last token position towards:
    ã^l(q) = a^l(q) + σ(q) · α · v_r^l
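
The update rule above, combined with the layer check, can be sketched as a small function. This is a simplified illustration: the name `steer` and all toy values are hypothetical, and each layer's last-token state is treated as an independent list. In a real model the shift would be applied inside the forward pass, so a change at one layer also alters what later layers compute.

```python
def steer(hidden_last, v_r, sigma, alpha, layer_lo, layer_hi):
    """Apply a^l + sigma * alpha * v_r^l at layers layer_lo..layer_hi only.

    hidden_last[l] is the last-token hidden state a^l(q) at layer l.
    v_r maps a layer index to its refusal vector v_r^l.
    Layers outside the range are returned unchanged (as copies).
    """
    steered = []
    for layer, state in enumerate(hidden_last):
        if layer_lo <= layer <= layer_hi:
            state = [a + sigma * alpha * v for a, v in zip(state, v_r[layer])]
        else:
            state = list(state)
        steered.append(state)
    return steered


# Toy setup (hypothetical): 3 layers, hidden size 2, only layer 1 is steered.
hidden_last = [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]]
v_r = {1: [0.5, -0.5]}

safe = steer(hidden_last, v_r, sigma=-1, alpha=2.0, layer_lo=1, layer_hi=1)
unsafe = steer(hidden_last, v_r, sigma=+1, alpha=2.0, layer_lo=1, layer_hi=1)
print(safe)    # [[1.0, 1.0], [0.0, 2.0], [1.0, 1.0]]
print(unsafe)  # [[1.0, 1.0], [2.0, 0.0], [1.0, 1.0]]
```

With σ(q) = −1 the state moves away from the refusal direction, and with σ(q) = +1 it moves toward it, by the same distance α · ‖v_r^l‖.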

🎮 Steering Effect Simulation

Safe query (σ(q) = −1)

Suppresses the refusal direction
a^l + (−1) × α × v_r^l
→ helpful response

Unsafe query (σ(q) = +1)

Reinforces the refusal direction
a^l + (+1) × α × v_r^l
→ safe refusal

📤 Step 19: Returning the Steered Outputs

After activation steering, the model generates a balanced response that is both safe and helpful.

// Line 19: Return steered outputs
return the steered outputs after activation steering
// Result: Safe and helpful responses

📊 Algorithm Complexity

Here L counts the layers involved and d is the hidden-state dimension.

Time complexity:
• Preprocessing: O(|Q| × L × d)
• Classification: O(L × d)
• Steering: O(1)

Space complexity:
• Vector storage: O(L × d)
• Temporary memory: O(d)
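
As a rough worked example of the O(L × d) storage term, the sketch below counts the bytes needed to keep one refusal vector per steered layer. The hidden size of 4096 and the 2-byte (fp16) values are assumptions for illustration, not figures from this document. The layer range [10, 20] is the one listed for 7B models in the hyperparameters, and `vector_storage_bytes` is a hypothetical helper.

```python
def vector_storage_bytes(layer_lo, layer_hi, hidden_size, bytes_per_value):
    """Bytes needed to store one vector of size hidden_size per steered layer."""
    num_layers = layer_hi - layer_lo + 1  # the range is inclusive
    return num_layers * hidden_size * bytes_per_value


# Assumed values: hidden size 4096, fp16 storage (2 bytes per value).
size = vector_storage_bytes(10, 20, hidden_size=4096, bytes_per_value=2)
print(size)         # 90112 bytes
print(size / 1024)  # 88.0 KiB
```

Under these assumptions the stored vectors take well under a megabyte, which is negligible next to the model weights.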

⚙️ Hyperparameters

α (steering strength): 2.0 to 4.0
T (classification threshold): 0.6 to 0.8
[L_l, L_H] (safety-critical layers):
• 7B model: [10, 20]
• 13B model: [16, 26]

🎯 Key Innovations

1. Training-free: no additional training required
2. Real-time: applied immediately at inference
3. Adaptive: steering tailored to each query
4. Interpretable: a clear rationale for each steering decision