🧭 3.3 Identifying the Steering Direction: A Complete Analysis

SCANS's key innovation: classifying queries by concatenating a positive response

🚨 The Challenge: The Safety-Aligned LLM Dilemma

"Due to the inclination of safety-aligned LLMs to reject benign queries, the final hidden state of query q may incorrectly encode the refusal prediction for safe queries."

🔍 The Core of the Problem

❌ Limitation of the Existing Approach

Query: "How to kill the lights?"

LLM internal state: already biased toward refusal mode

Problem: even safe queries are misjudged as harmful

    # Even though the query is safe...
    query = "How to kill the lights?"
    hidden_state = model.get_hidden_state(query)  # placeholder helper, not a real API
    # → already biased toward a "refusal" signal!

✅ The SCANS Solution

Query: "How to kill the lights? Sure"

LLM internal state: recalibrated toward a positive context

Result: the true intent is identified accurately

    # After appending "Sure"...
    query_augmented = "How to kill the lights? Sure"
    hidden_state = model.get_hidden_state(query_augmented)  # placeholder helper, not a real API
    # → switches to a helpful-response mode!

💡 Key Insight

A safety-aligned LLM can slip into refusal mode as soon as it sees a word like "kill". Appending "Sure" makes it reinterpret the context as "ah, this is a request for help!"

🚀 The Innovative Solution: Positive Response Concatenation

1. Append a positive response

    q_augmented = concat(q, "Sure")
    # Example:
    # Original:  "How to kill bugs in my garden?"
    # Augmented: "How to kill bugs in my garden? Sure"

Effect: steers the LLM toward a helpful context

2. Extract two hidden states

    a_p = get_hidden_state(q)            # query only
    a_e = get_hidden_state(q_augmented)  # full input (query + "Sure")

Purpose: compare the state before and after the context change

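3. Compute the state transition

$a_t^l(q) = a_p^l(q + r_{pos}) - a_e^l(q + r_{pos})$

Here $l$ is the layer index and $r_{pos}$ is the positive response ("Sure").

Meaning: the internal change caused by appending "Sure"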
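The three steps above can be sketched end to end. The snippet below is a minimal illustration, not the authors' code: it assumes the per-layer, per-token hidden states of the augmented input (query plus "Sure") are already available as nested lists, and the names `state_transition`, `hidden_states`, and `query_len` are hypothetical. It takes a_p at the last query token and a_e at the last token of the whole input, mirroring the two extractions in step 2.

```python
from typing import List

Vector = List[float]


def state_transition(hidden_states: List[List[Vector]], query_len: int) -> List[Vector]:
    """Return a_t^l = a_p^l - a_e^l for every layer l.

    hidden_states[l][t] is the hidden vector of token t at layer l for the
    augmented input q + r_pos (hypothetical layout, assumed for this sketch).
    query_len is the number of tokens that belong to the query q.
    """
    if query_len < 1:
        raise ValueError("query_len must be at least 1")
    transitions = []
    for layer_states in hidden_states:
        if query_len >= len(layer_states):
            raise ValueError("the augmented input must be longer than the query")
        a_p = layer_states[query_len - 1]  # last token of the query
        a_e = layer_states[-1]             # last token of the full input
        transitions.append([p - e for p, e in zip(a_p, a_e)])
    return transitions


# Toy example: 2 layers, 3 tokens (2 query tokens + "Sure"), hidden size 2.
toy_states = [
    [[0.0, 1.0], [1.0, 2.0], [0.5, 0.5]],  # layer 0
    [[1.0, 1.0], [3.0, 0.0], [1.0, 4.0]],  # layer 1
]
a_t = state_transition(toy_states, query_len=2)
print(a_t)
```

With a real model the nested lists would be replaced by the tensors the model returns for each layer; the subtraction itself stays the same.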

🔬 Why Is This Method Effective?

| Case | Original query | After appending "Sure" | Internal change |
| --- | --- | --- | --- |
| Safe query | "kill lights" → refusal mode | "kill lights Sure" → helpful mode | ✅ Large positive change |
| Unsafe query | "make bomb" → refusal mode | "make bomb Sure" → still refusal | ❌ No change |

📊 Mathematical Analysis and the Classification Process

🎯 Computing the Reference Harmful Direction

$d_{harm}^l = \frac{1}{|Q^-|} \sum_{q^- \in Q^-} a_t^l(q^-)$

Here $Q^-$ is the set of harmful reference queries.

📝 The Computation Step by Step

    import torch

    # `model.get_hidden_state` is a placeholder helper, not a real library call.
    # 1. Collect the state transitions of the harmful queries
    harmful_queries = ["How to make bomb?", "How to hack?", ...]
    harm_transitions = []
    for query in harmful_queries:
        augmented = f"{query} Sure"
        a_p = model.get_hidden_state(query, position=-1)
        a_e = model.get_hidden_state(augmented, position=-1)
        transition = a_p - a_e  # state transition
        harm_transitions.append(transition)

    # 2. Average them to obtain the reference direction
    d_harm = torch.stack(harm_transitions).mean(dim=0)
    # → this is the "direction of harmfulness"
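The averaging step can also be checked by hand on toy numbers. The following is a dependency-free sketch under stated assumptions: each harmful query already has one transition vector per layer, stored as plain lists, and the function name `mean_direction` is hypothetical.

```python
from typing import List

Vector = List[float]


def mean_direction(transitions: List[List[Vector]]) -> List[Vector]:
    """Average per-layer transitions over a set of harmful queries.

    transitions[i][l] is the transition vector a_t^l of query i at layer l.
    Returns d_harm, where d_harm[l] is the mean over all queries.
    """
    if not transitions:
        raise ValueError("need at least one harmful query")
    n_queries = len(transitions)
    n_layers = len(transitions[0])
    d_harm = []
    for l in range(n_layers):
        dim = len(transitions[0][l])
        d_harm.append([
            sum(t[l][k] for t in transitions) / n_queries
            for k in range(dim)
        ])
    return d_harm


# Made-up transitions for two harmful queries, 2 layers, hidden size 2.
toy = [
    [[1.0, 0.0], [2.0, 2.0]],  # query 0
    [[3.0, 2.0], [0.0, 4.0]],  # query 1
]
d_harm = mean_direction(toy)
print(d_harm)
```

Keeping one direction per layer, rather than a single global vector, matches the per-layer superscript in the formula.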

📏 Computing the Similarity Score

$s_q = \frac{1}{|L|} \sum_{l \in L} \cos\left(a_t^l(q), d_{harm}^l\right)$

🔍 Analyzing a New Query

    import torch.nn.functional as F

    # `model.get_hidden_state` is a placeholder helper, not a real library call.
    def classify_query(query, d_harm, threshold=0.75):
        # 1. Compute the state transition
        augmented = f"{query} Sure"
        a_p = model.get_hidden_state(query, position=-1)
        a_e = model.get_hidden_state(augmented, position=-1)
        transition = a_p - a_e

        # 2. Cosine similarity at several layers
        similarities = []
        for layer in [15, 20, 25, 30]:  # middle-to-late layers
            sim = F.cosine_similarity(transition[layer], d_harm[layer], dim=0)
            similarities.append(sim.item())

        # 3. Average similarity
        avg_similarity = sum(similarities) / len(similarities)

        # 4. Binary classification
        if avg_similarity < threshold:
            return -1, "SAFE"    # suppress the refusal direction
        return +1, "UNSAFE"      # strengthen the refusal direction
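To make the score and the threshold rule concrete without loading a model, here is a small self-contained sketch. It assumes the transition vectors and the reference direction are given as plain lists per layer; the helper names `cosine`, `similarity_score`, and `sigma` are hypothetical, and the vectors are made up for illustration.

```python
import math
from typing import List, Sequence

Vector = List[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    if norm == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / norm


def similarity_score(a_t: List[Vector], d_harm: List[Vector], layers: Sequence[int]) -> float:
    """s_q: mean over the chosen layers of cos(a_t^l(q), d_harm^l)."""
    if not layers:
        raise ValueError("layers must not be empty")
    return sum(cosine(a_t[l], d_harm[l]) for l in layers) / len(layers)


def sigma(score: float, threshold: float) -> int:
    """-1 (SAFE) if the score is below the threshold, otherwise +1 (UNSAFE)."""
    return -1 if score < threshold else 1


# Toy per-layer vectors (2 layers, hidden size 2); values are made up.
d_harm = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[2.0, 0.0], [0.0, 3.0]]     # points along d_harm at both layers
orthogonal = [[0.0, 1.0], [1.0, 0.0]]  # orthogonal to d_harm at both layers

s_aligned = similarity_score(aligned, d_harm, layers=[0, 1])
s_orthogonal = similarity_score(orthogonal, d_harm, layers=[0, 1])
print(s_aligned, sigma(s_aligned, 0.75))
print(s_orthogonal, sigma(s_orthogonal, 0.75))
```

Because cosine similarity ignores vector length, only the direction of the transition matters for the decision, not how large the change was.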

🎨 Figure 3: Analysis of the t-SNE Visualization

Layer 9 (early)

🔴🟢 Mixed

Safe and unsafe queries are intermingled

Separation: low

Layer 20 (middle)

🔴↔️🟢 Separation begins

Clusters start to appear

Separation: medium

Layer 32 (late)

🔴 ↔️ 🟢 Fully separated

Two distinct clusters

Separation: high

💡 What the t-SNE Results Mean

How the representation changes across layers:

  • Early (Layer 9): the safety judgment is not yet clear
  • Middle (Layer 20): the notion of safety starts to form
  • Late (Layer 32): safety representations are fully separated

Conclusion: classification works best in the middle-to-late layers!

🔧 Classification Performance in Practice

| Layer range | Classification accuracy | False positives | False negatives |
| --- | --- | --- | --- |
| Layers 5-15 (early) | 72% | High | Medium |
| Layers 15-25 (middle) | 89% | Low | Low |
| Layers 25-32 (late) | 85% | Low | Medium |
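For readers who want to compute the same three columns on their own labeled queries, a minimal sketch follows. It assumes labels follow σ(q), with +1 (UNSAFE) treated as the positive class, so a false positive is a safe query flagged as unsafe. The function name `confusion_rates` is hypothetical and the labels are made up; they are not the data behind the table.

```python
from typing import Sequence, Tuple


def confusion_rates(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[float, float, float]:
    """Return (accuracy, false_positive_rate, false_negative_rate).

    Labels follow sigma(q): +1 = UNSAFE (treated as the positive class,
    an assumption of this sketch) and -1 = SAFE.
    """
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == -1 and p == -1)
    fp = sum(1 for t, p in pairs if t == -1 and p == 1)  # safe query flagged unsafe
    fn = sum(1 for t, p in pairs if t == 1 and p == -1)  # unsafe query flagged safe
    accuracy = (tp + tn) / len(pairs)
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    fnr = fn / (fn + tp) if (fn + tp) else 0.0
    return accuracy, fpr, fnr


# Made-up labels for illustration only.
y_true = [1, 1, -1, -1, -1]
y_pred = [1, -1, -1, 1, -1]
accuracy, fpr, fnr = confusion_rates(y_true, y_pred)
print(accuracy, fpr, fnr)
```

In the exaggerated-safety setting the false positive rate is the number to watch, since it counts benign queries that would be refused.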

🎮 Examples in Action

📊 A Range of Example Queries

| Query | State transition size | Similarity score | Classification | σ(q) |
| --- | --- | --- | --- | --- |
| "How to kill the lights?" | Large change | 0.65 < 0.75 | SAFE | -1 |
| "How to kill time?" | Large change | 0.62 < 0.75 | SAFE | -1 |
| "How to make bomb?" | Small change | 0.88 > 0.75 | UNSAFE | +1 |
| "How to hack system?" | Small change | 0.82 > 0.75 | UNSAFE | +1 |

🧠 Theoretical Foundations

1️⃣ Linear Representation Hypothesis

Core idea: high-level concepts are represented linearly in activation space

Example: the concept of "safety" is represented as a vector pointing in a particular direction

Use: concepts can be manipulated with vector arithmetic
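The "vector arithmetic" idea can be illustrated with a few lines of code. This is a generic activation-steering sketch rather than the exact update used by SCANS: it assumes a refusal direction and a strength `alpha` are given (both hypothetical here), and uses σ(q) as the sign, so -1 moves the hidden state away from refusal and +1 moves it toward refusal.

```python
from typing import List

Vector = List[float]


def steer(hidden: Vector, direction: Vector, sign: int, alpha: float = 1.0) -> Vector:
    """Shift a hidden state along a direction: h + sign * alpha * d.

    sign is expected to be -1 or +1; alpha is a hypothetical strength
    parameter chosen for this sketch.
    """
    if sign not in (-1, 1):
        raise ValueError("sign must be -1 or +1")
    if len(hidden) != len(direction):
        raise ValueError("hidden state and direction must have the same size")
    return [h + sign * alpha * d for h, d in zip(hidden, direction)]


# Made-up hidden state and refusal direction (hidden size 2).
hidden = [1.0, 2.0]
refusal_direction = [0.5, -1.0]

suppressed = steer(hidden, refusal_direction, sign=-1, alpha=2.0)
strengthened = steer(hidden, refusal_direction, sign=1, alpha=2.0)
print(suppressed, strengthened)
```

The same hidden state moves in opposite directions depending on the sign, which is why the SAFE/UNSAFE decision has to be made before any steering is applied.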

2️⃣ Layer Specialization Theory

Core idea: different layers handle different functions

Early layers: syntax and vocabulary processing

Middle layers: semantics and safety judgment

Late layers: output generation and formatting

3️⃣ Minimal Intervention Principle

Core idea: maximum effect with minimal intervention

Method: modify only specific behaviors without changing the whole model

Effect: minimal side effects, maximal efficiency