๐Ÿ”ฅGNGB-Young๐Ÿ”ฅ
GNGB-team-tech-blog
๐Ÿ”ฅGNGB-Young๐Ÿ”ฅ
  • ๋ถ„๋ฅ˜ ์ „์ฒด๋ณด๊ธฐ (8)
    • AWS (0)
    • ML&DL Service (1)
    • ML&DL Research (2)
    • DevOps (0)
      • Common (0)
      • Docker (0)
      • MLOps (0)
    • Frontend (4)
      • Common (0)
      • Svelte (4)
      • Flutter (0)
      • Streamlit (0)
    • Backend (0)
      • Common (0)
    • CS (0)
      • Common (0)
      • Network (0)
    • Statistics (1)
    • Project (0)
      • CoinBot (0)
      • Crawler (0)
    • Etc (0)
      • Tistory (0)
    • Introduction (0)
      • Team (0)

ํƒœ๊ทธ

  • language model
  • tailwindcss
  • Svelte
  • ๋‚œ์ด๋„(ํ•˜)
  • Fine-tuning
  • frontend
  • Knowledge Distillation
  • Editor:Paul
  • hierarchical modeling
  • netlify
  • Statistics
  • DaisyUI
  • Editor:Redstone
  • GPT
  • gpt-4
  • hosting
  • Machine Learning
  • ChatGPT
  • Editor:Young
  • bayesian
  • ML
  • nlp
  • instructGPT

์ธ๊ธฐ ๊ธ€

์ตœ๊ทผ ๊ธ€

์ตœ๊ทผ ๋Œ“๊ธ€

์ „์ฒด ๋ฐฉ๋ฌธ์ž
์˜ค๋Š˜
์–ด์ œ
hELLO ยท Designed By ์ •์ƒ์šฐ.
๐Ÿ”ฅGNGB-Young๐Ÿ”ฅ

GNGB-team-tech-blog

Big model์„ ํ™œ์šฉํ•˜๊ธฐ ์œ„ํ•ด์„œ ๋‹น์‹ ์ด ๊ณ ๋ คํ•ด์•ผํ•  ๊ฒƒ(fine-tuning, knowledge distillation)
ML&DL Research

Big model์„ ํ™œ์šฉํ•˜๊ธฐ ์œ„ํ•ด์„œ ๋‹น์‹ ์ด ๊ณ ๋ คํ•ด์•ผํ•  ๊ฒƒ(fine-tuning, knowledge distillation)

2023. 4. 15. 19:46

ํ•ด๋‹น ๊ฒŒ์‹œ๊ธ€์€ ํ”ํžˆ ๋งํ•˜๋Š” Big model(GPT-3, BERT)๋“ค์„ ํ™œ์šฉํ•˜๊ธฐ ์œ„ํ•ด์„œ ๋ฌด์—‡์„ ๊ณ ๋ คํ•ด์•ผ ํ• ์ง€, ํŠนํžˆ

(1) Fine-tuning

(2) Knowledge distillation

์— ๋Œ€ํ•œ ๋‚ด์šฉ์„ ๋‹ด๊ณ  ์žˆ์Šต๋‹ˆ๋‹ค.  


    GPT-3์˜ ์—…๊ทธ๋ ˆ์ด๋“œ ๋ฒ„์ „์ธ GPT-4๊ฐ€ ์ตœ๊ทผ์— ๋ฐœํ‘œ๋˜์—ˆ์Šต๋‹ˆ๋‹ค. ChatGPT๊ฐ€ ์„ฑ๊ณตํ•œ ์›์ธ๋„ GPT-3๋ผ๋Š” Big model์„ ํšจ๊ณผ์ ์œผ๋กœ ํ™œ์šฉํ–ˆ๊ธฐ ๋•Œ๋ฌธ์ด๋ผ๋Š” ์ƒ๊ฐ์ด ๋“œ๋Š”๋ฐ์š”. ์ด๋ ‡๋“ฏ ๊ธฐํ•˜๊ธ‰์ˆ˜์ ์ธ ์†๋„๋กœ ๋ฐœ์ „ํ•˜๊ณ  ์žˆ๋Š” pretrained big model๋“ค์„ ๋ฐ”๋กœ ์‚ฌ์šฉํ•  ์ˆ˜๋„ ์žˆ๊ฒ ์ง€๋งŒ, ์‹ค์ œ ์„œ๋น„์Šค๋‚˜ ์ ์šฉ ๋ถ„์•ผ์— ์ž˜ ํ™œ์šฉํ•  ์ค„ ์•„๋Š” ๊ฒƒ์ด ์ค‘์š”ํ•ด์ง„๋‹ค๊ณ  ๋ณผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ์ด ๋•Œ ์—ฌ๋Ÿฌ๋ถ„๋“ค์ด ์•Œ์•„์•ผ ํ•  ๊ฒƒ๋“ค, ํŠนํžˆ fine-tuning๊ณผ knowledge distillation์— ๋Œ€ํ•ด์„œ ์ด์•ผ๊ธฐ ํ•ด๋ณด๋„๋ก ํ•˜๊ฒ ์Šต๋‹ˆ๋‹ค.

ch1 - Fine-tuning

์•„๋ฌด๋ฆฌ ์ข‹์€ pre-trained model์ด๋ผ๋„ ์„œ๋น„์Šค์— ๋ฐ”๋กœ ํ™œ์šฉํ•  ์ˆ˜ ์žˆ๋Š”๊ฑด ์•„๋‹™๋‹ˆ๋‹ค

์™œ fine-tuning์„ ํ•ด์•ผ๋˜๋‚˜์š”?

    Pretrained big model๋“ค์„ ์—ฌ๋Ÿฌ downstream task์— ์ ์šฉํ• ๋•Œ fine-tuningํ•˜๋Š” ๊ฒƒ์€ ์ด๋ฏธ ๋งŽ์€ ๋ถ„๋“ค์ด ์•Œ๊ณ  ๊ณ„์‹œ๋Š” ๋ฐฉ๋ฒ•์ผ๊ฒ๋‹ˆ๋‹ค. ์ž์—ฐ์–ด์ฒ˜๋ฆฌ ๋ถ€๋ถ„์—์„œ ๊ฐ€์žฅ ๋Œ€ํ‘œ์ ์ธ pretrained model ์ธ BERT์—์„œ๋„ pretrained model์„ ์—ฌ๋Ÿฌ task์— fine-tuningํ•˜๋Š” ๋ฐฉ์‹์„ ๋ณด์—ฌ์คฌ๋Š”๋ฐ์š”.

 

๋งˆ์ง€๋ง‰ layer๋ฅผ ์ ์šฉํ•˜๋Š” task์— ๋งž๊ฒŒ ์„ค๊ณ„ํ•œ ๋’ค fine-tuning์„ ์ง„ํ–‰ํ•˜๋Š” BERT

    ๋ชจ๋ธ์˜ ๊ตฌ์กฐ์ƒ fine-tuning์ด ํ•„์š” ์—†๋Š” ๊ฒฝ์šฐ, ์ฆ‰ GPT-3๋ฅผ ๊ธฐ๋ฐ˜์œผ๋กœ fine-tuning์„ ์ง„ํ–‰ํ•œ ChatGPT์™€ ๊ฐ™์€ ๊ฒฝ์šฐ์—๋„ fine-tuning์€ ํ•„์š”ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ๊ทธ ์ด์œ ๋ฅผ OpenAI์—์„œ๋Š” ‘alignment’ ๋ผ๋Š” ๊ฐœ๋…์œผ๋กœ ์„ค๋ช…ํ•˜๋Š”๋ฐ์š”.

Capability vs alignment

    ๋ชจ๋ธ์ด ํ•™์Šต๋  ๋•Œ ์‚ฌ์šฉ๋˜๋Š” Loss function์— ๋Œ€ํ•ด์„œ ๋ชจ๋ธ์ด ์–ผ๋งˆ๋‚˜ ์ž˜ ์ตœ์ ํ™” ํ•  ์ˆ˜ ์žˆ๋Š”์ง€๋ฅผ ‘Capability๋ผ๊ณ  ํ•  ๋•Œ, Alignment๋Š” ๋ชจ๋ธ์ด ํ•™์Šต๋œ ๋ฐฉํ–ฅ์ด ์šฐ๋ฆฌ๊ฐ€ ๋ชจ๋ธ์—๊ฒŒ์„œ ์›ํ•˜๋Š” output ๋ฐฉํ–ฅ๊ณผ ์–ผ๋งˆ๋‚˜ ์ผ์น˜ํ•˜๋Š” ์ง€๋ฅผ ๋งํ•ฉ๋‹ˆ๋‹ค. ๋ชจ๋ธ์„ ํ•™์Šตํ•˜๋Š”๋ฐ ์‚ฌ์šฉ๋œ ๋ฐ์ดํ„ฐ๋‚˜ ๊ณผ์ •์— ๋”ฐ๋ผ์„œ, capability๋Š” ๋†’์ง€๋งŒ align๋˜์ง€ ๋ชปํ•˜๋Š”, mis-align๋œ ๋ชจ๋ธ์ด ๋งŒ๋“ค์–ด์งˆ ์ˆ˜ ์žˆ๋Š” ๊ฒƒ์ด์ฃ . GPT-3์˜ ๊ฒฝ์šฐ ํ•™์Šต ๋ฐฉํ–ฅ์„ ๊ณ ๋ คํ–ˆ์„ ๋•Œ ‘์‚ฌ๋žŒ์ด ๋งํ•  ๋ฒ• ํ•œ’ ๋ฌธ์žฅ์„ ๋งŒ๋“œ๋Š” ๋ฐ์—๋Š” ๋†’์€ capability๋ฅผ ๋ณด์—ฌ์ฃผ์—ˆ์ง€๋งŒ, ‘์งˆ๋ฌธ์ž๋ฅผ ๋งŒ์กฑ์‹œํ‚ค๋Š” ๋Œ€๋‹ต’์„ ํ•˜๋Š”๋ฐ๋Š” align ๋˜์ง€ ๋ชปํ•œ ๊ฒƒ์ด๋ผ๊ณ  ๋ณผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ๊ทธ๋ฆฌ๊ณ  ์ด ๋ถ€๋ถ„์„ ๊ฐœ์„ ํ•œ ๊ฒƒ์ด InstructGPT, ChatGPT์ธ ๊ฒƒ์ด๊ตฌ์š”. 

 

    ์ด๋Ÿฌํ•œ alignment ๋ฌธ์ œ๋ฅผ ํ•ด๊ฒฐํ•˜๋Š” ๋ฐฉ๋ฒ• ์ค‘ ํ•˜๋‚˜๊ฐ€ fine-tuning์ž…๋‹ˆ๋‹ค. Fine-tuning์„ ํ†ตํ•ด์„œ ๋ชจ๋ธ์˜ output๋“ค์ด ์‚ฌ์šฉ์ž์˜ ์˜๋„์™€ align๋˜๋„๋ก pre-trained model์˜ parameter๋ฅผ ์ˆ˜์ •ํ•ด์ฃผ๋Š” ๋ฐฉ์‹์œผ๋กœ alignment๋ฌธ์ œ๋ฅผ ํ•ด๊ฒฐํ•ด์ค„ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ChatGPT์ฒ˜๋Ÿผ ๊ฐ•ํ™”ํ•™์Šต(RLHF)๋ฅผ ์ด์šฉํ•ด์„œ fine-tuningํ•  ์ˆ˜๋„ ์žˆ๊ฒ ์ง€๋งŒ, ๊ฐ€์žฅ ํ”ํ•œ ๋ฐฉ๋ฒ•์€ ์ ์ ˆํ•œ pre-trained model์— domain-specificํ•œ ๋ฐ์ดํ„ฐ์…‹์œผ๋กœ ์ถ”๊ฐ€์ ์œผ๋กœ fine-tuningํ•ด์ฃผ๋Š” ๋ฐฉ์‹์ž…๋‹ˆ๋‹ค. ์˜ˆ๋ฅผ ๋“ค์–ด, BioBERT์—์„œ๋Š” ๊ธฐ์กด BERT๊ฐ€ biomedical text mining ๋ฌธ์ œ์ธ NER, RE, QA์— ๋Œ€ํ•ด์„œ ๋” ์ข‹์€ ์„ฑ๋Šฅ์„ ๋‚ด๋„๋ก Biomedical document ๋ฐ์ดํ„ฐ๋ฅผ ์ด์šฉํ•ด์„œ fine-tuning์„ ์ง„ํ–‰ํ•˜์—ฌ ๊ด€๋ จ ๋ถ„์•ผ์˜ ๋Œ€ํšŒ์—์„œ ์šฐ์ˆ˜ํ•œ ์„ฑ์ ์„ ๊ธฐ๋กํ•˜๊ธฐ๋„ ํ–ˆ์Šต๋‹ˆ๋‹ค.

์–ด๋–ป๊ฒŒ ํ•ด์•ผ fine-tuning์„ ‘์ž˜’ ํ• ์ˆ˜ ์žˆ๋‚˜์š”?

    ์‚ฌ์‹ค fine-tuning์„ ์ž˜ ํ•˜๋Š” ๋ฐฉ๋ฒ•์€ ์ •ํ•ด์ง„ ๊ฒƒ์ด ์—†์Šต๋‹ˆ๋‹ค. ์ฃผ์–ด์ง„ task์— ๋Œ€ํ•ด์„œ ์ ์ ˆํ•œ evaluation ๋ฐฉ๋ฒ•์ด ์ •์˜๋˜์—ˆ์„๋•Œ ์ข‹์€ ์„ฑ๋Šฅ์„ ๋ณด์—ฌ์ค€๋‹ค๋ฉด ์ด๋ฏธ ํšจ๊ณผ์ ์œผ๋กœ fine-tune ํ•œ ๊ฒƒ์ด๋ผ๊ณ  ๋ณผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ๋‹ค๋งŒ, fine-tuning ์ด ์ž˜ ์•ˆ๋ ๋•Œ ์ค‘์ ์ ์œผ๋กœ ๊ณ ๋ คํ•ด๋ด์•ผ ๋˜๋Š” ๋ช‡ ๊ฐ€์ง€์— ๋Œ€ํ•ด์„œ๋Š” ์ด์•ผ๊ธฐํ•ด๋ณผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

 

    ๋จผ์ € pre-trained model์„ ์ž˜ ์„ ํƒํ•˜๋Š” ๊ฒƒ์ด ์ค‘์š”ํ•ฉ๋‹ˆ๋‹ค. Pre-trained model์ด ํ•™์Šต๋œ ๋ฐ์ดํ„ฐ๊ฐ€ ์ฃผ์–ด์ง„ task์—์„œ ์‚ฌ์šฉ๋˜๋Š” ๋ฐ์ดํ„ฐ์™€ ๋น„์Šทํ•œ ๊ฒฝ์šฐ๊ฐ€ ์ด์ƒ์ ์ด๊ฒ ์ฃ . ๋˜ํ•œ, pre-trained model์ด ์–ด๋–ป๊ฒŒ ํ•™์Šต๋˜์—ˆ๋Š”์ง€, ์–ด๋–ค loss function์„ ์ตœ์ ํ™”ํ•˜๋Š” ๋ฐฉํ–ฅ์œผ๋กœ ํ•™์Šต๋˜์—ˆ๋Š”์ง€๋ฅผ ํŒŒ์•…ํ•˜๋Š” ๊ฒƒ๋„ ์ค‘์š”ํ•ฉ๋‹ˆ๋‹ค. GPT-3์™€ ChatGPT์˜ ์˜ˆ์‹œ์—์„œ ๋ณผ ์ˆ˜ ์žˆ๋“ฏ์ด ๋ง์ด์ฃ .

 

    ํ•™์Šต ๊ณผ์ •์—์„œ๋Š” Learning rate๋ฅผ ์ž˜ ์กฐ์ ˆํ•˜๋Š” ๊ฒƒ ๋˜ํ•œ ์ค‘์š”ํ•ฉ๋‹ˆ๋‹ค. BERT, GPT-3์™€ ๊ฐ™์€ Transformer ๊ธฐ๋ฐ˜์˜ ๋ชจ๋ธ๋“ค์€ ๊ฒฝํ—˜์ƒ hyper-parameter, ํŠนํžˆ learning rate ์„ค์ •์— ๊ต‰์žฅํžˆ ๋ฏผ๊ฐํ•˜๊ฒŒ ๋ฐ˜์‘ํ•˜๋Š” ํŽธ์ž…๋‹ˆ๋‹ค. ๋ณดํ†ต fine-tuning์„ ํ• ๋•Œ๋Š” learning rate๋ฅผ pre-trained model์„ ์ฒ˜์Œ๋ถ€ํ„ฐ ํ•™์Šตํ•  ๋•Œ ๋ณด๋‹ค ์ƒ๋Œ€์ ์œผ๋กœ ์ž‘๊ฒŒ ์ฃผ๊ณ  ํ•™์Šตํ•˜๋Š” ๊ฒฝ์šฐ๋„ ๋งŽ์Šต๋‹ˆ๋‹ค. Learning rate๋ฅผ ๋„ˆ๋ฌด ํฌ๊ฒŒ ์ฃผ๋ฉด ์ฃผ์–ด์ง„ task dataset์— overfitting ํ•˜๋Š” ๊ฒฝ์šฐ๋„ ๋งŽ๊ธฐ ๋•Œ๋ฌธ์— ์ฃผ์˜ํ•ด์•ผ ๋ฉ๋‹ˆ๋‹ค.

 

    ๋งˆ์ง€๋ง‰์œผ๋กœ, ์ ์ ˆํ•œ regularization์ด ํ•„์š”ํ•ฉ๋‹ˆ๋‹ค. Fine-tuning์„ ์ง„ํ–‰ํ•  ๋•Œ, ๋ชจ๋ธ์ด ์ฃผ์–ด์ง„ task dataset์— overfitting ๋˜๋Š” ๊ฒฝ์šฐ๊ฐ€ ๊ต‰์žฅํžˆ ์žฆ์€๋ฐ์š”. ๊ทธ๋ž˜์„œ learning rate์™€ ๋น„์Šทํ•˜๊ฒŒ fine-tuning์‹œ์—๋Š” ์ฒ˜์Œ๋ถ€ํ„ฐ pre-trained model์„ ํ•™์Šต์‹œํ‚ฌ ๋•Œ์™€ ๋‹ค๋ฅด๊ฒŒ L1/L2 regularization์„ ์„ค์ •ํ•˜๊ฑฐ๋‚˜ early stopping์„ ์ ๊ทน์ ์œผ๋กœ ํ™œ์šฉํ•˜๋Š” ์ถ”์„ธ๋ฅผ ๋ณด์ž…๋‹ˆ๋‹ค.


ch2 - Knowledge distillation

Pre-trained model์„ ๊ฐ€๋ณ๊ณ  ์ž‘์€ ๋ชจ๋ธ๋กœ ์˜ฎ๊ฒจ๋ณด์ž!

์™œ Knowledge distillation์ด ํ•„์š”ํ•œ๊ฐ€์š”?

 

    ์•ž์—์„œ ์„ค๋ช…ํ•œ fine-tuning์œผ๋กœ pretrained big model์„ ์ฃผ์–ด์ง„ task์—์„œ ํšจ๊ณผ์ ์œผ๋กœ ์ ์šฉํ•  ์ˆ˜๋„ ์žˆ์ง€๋งŒ, ๋ณดํ†ต ์ด๋Ÿฐ pretrained model์€ ํŒŒ๋ผ๋ฏธํ„ฐ ๊ฐœ์ˆ˜๊ฐ€ ๋งŽ๊ณ  ์‚ฌ์ด์ฆˆ๊ฐ€ ํฌ๊ธฐ ๋•Œ๋ฌธ์— ์„œ๋น„์Šค์—์„œ ํ™œ์šฉํ•˜๊ธฐ ์–ด๋ ค์šด ๊ฒฝ์šฐ๊ฐ€ ๋งŽ์Šต๋‹ˆ๋‹ค. ์˜ˆ๋ฅผ ๋“ค์–ด, ์„œ๋น„์Šค ํ•˜๊ณ  ์‹ถ์€ AI ๋ชจ๋ธ์„ ์„œ๋ฒ„ ์ƒ์—์„œ ์—ฐ์‚ฐ์„ ์ง„ํ–‰ํ•  ์ˆ˜๋„ ์žˆ์ง€๋งŒ ์‚ฌ์šฉ์ž์˜ ๊ธฐ๊ธฐ(์Šค๋งˆํŠธํฐ, PC)์—์„œ ์—ฐ์‚ฐ์„ ์ง„ํ–‰ํ•ด์•ผ ํ•˜๋Š” ๊ฒฝ์šฐ ํ•ด๋‹น AI ๋ชจ๋ธ์ด ์ฐจ์ง€ํ•˜๋Š” ๋ฉ”๋ชจ๋ฆฌ๊ฐ€ ๋„ˆ๋ฌด ํฌ๊ฑฐ๋‚˜ ํ•ด๋‹น ๊ธฐ๊ธฐ์˜ ์—ฐ์‚ฐ๋Ÿ‰์ด AI๋ชจ๋ธ์—์„œ ์š”๊ตฌํ•˜๋Š” ์ˆ˜์ค€์— ๋ฏธ์น˜์ง€ ๋ชปํ•˜๋Š” ๊ฒฝ์šฐ๊ฐ€ ์žˆ์Šต๋‹ˆ๋‹ค. ํ˜น์€, AI ๋ชจ๋ธ์˜ ์—ฐ์‚ฐ์„ ์„œ๋ฒ„์ƒ์—์„œ ์ง„ํ–‰ํ•˜๋”๋ผ๋„ ์„œ๋ฒ„ ๋น„์šฉ ๋ฌธ์ œ๋กœ ๋น„์Šทํ•œ ์„ฑ๋Šฅ์— ๋” ๊ฐ€๋ฒผ์šด AI ๋ชจ๋ธ์ด ํ•„์š”ํ•  ์ˆ˜๋„ ์žˆ๊ตฌ์š”. ์ด๋Ÿฌํ•œ ๊ฒฝ์šฐ Knowledge distillation์„ ์ด์šฉํ•ด์„œ ์ œ๊ณตํ•˜๊ณ ์ž ํ•˜๋Š” ์„œ๋น„์Šค์— ๋” ์ ํ•ฉํ•˜๋ฉด์„œ ๋” ๊ฐ€๋ฒผ์šด AI๋ชจ๋ธ์„ ๋งŒ๋“ค์–ด๋‚ผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

์–ด๋–ป๊ฒŒ ํ•ด์•ผ Knowledge distillation์„ ‘์ž˜’ ํ• ์ˆ˜ ์žˆ๋‚˜์š”?

 

Knowledge distillation์˜ ๊ธฐ๋ณธ์ ์ธ ํ‹€

    Knowledge distillation์€ ์ด๋ฆ„ ๊ทธ๋Œ€๋กœ ์ž‘์€ ๋ชจ๋ธ(student model)์— pre-trained ๋ชจ๋ธ(teacher model)์˜ knowledge๋ฅผ ์ฆ๋ฅ˜, ์ฆ‰ ํ•„์š”ํ•œ ์ง€์‹๋งŒ ์ถ”์ถœํ•ด๋‚ด๋Š” ๊ฒƒ์ด ๋ชฉ์ ์ž…๋‹ˆ๋‹ค. ์—ฌ๋Ÿฌ ๋ฐฉ๋ฒ•์ด ์žˆ๊ฒ ์ง€๋งŒ, ๊ฐ€์žฅ ๊ธฐ๋ณธ์ ์ธ ๋ฐฉ์‹์œผ๋กœ๋Š” Geoffrey Hinton์ด ์ œ์•ˆํ•œ soft label์„ ์ด์šฉํ•œ knowledge distillation์ด ์žˆ์Šต๋‹ˆ๋‹ค(Distilling the Knowledge in a Neural Network). 

 

Soft label๊ณผ hard label

    Classification์„ ์ง„ํ–‰ํ•  ๊ฒฝ์šฐ ์ฃผ์–ด์ง€๋Š” one-hot encoding์ด ๋œ label์„ hard label์ด๋ผ๊ณ  ํ•ฉ๋‹ˆ๋‹ค. ํ•˜์ง€๋งŒ teacher model์˜ ์˜ˆ์ธก๊ฐ’์€ softmax๋ฅผ ๊ฑฐ์นœ ํ˜•ํƒœ๋กœ, ์ •๋‹ต dog๊ฐ€ ์•„๋‹Œ class๋“ค์ด 0์ด ์•„๋‹Œ ๊ฐ’์„ ๊ฐ€์ง€๋Š”๋ฐ ์ด๋ฅผ soft label์ด๋ผ๊ณ  ํ•ฉ๋‹ˆ๋‹ค. Hinton์˜ ์ƒ๊ฐ์€, soft label์—์„œ ์ •๋‹ต์ด ์•„๋‹Œ class์— ๋Œ€ํ•œ teacher model์˜ ์˜ˆ์ธก๊ฐ’ ๋˜ํ•œ ์˜๋ฏธ์žˆ๋Š” ์ •๋ณด๋ฅผ ์ง€๋‹ˆ๊ณ  ์žˆ์œผ๋ฏ€๋กœ, soft label์„ ์ด์šฉํ•ด student model์„ ํ•™์Šต์‹œํ‚ด์œผ๋กœ์„œ knowledge distillation์„ ์ง„ํ–‰ํ•  ์ˆ˜ ์žˆ๋‹ค๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค. ์ •๋ฆฌํ•˜์ž๋ฉด, ์‹ค์ œ ๋ฐ์ดํ„ฐ์—์„œ ๋‚˜์˜จ hard label๊ณผ, teacher model์—์„œ ๋‚˜์˜จ soft label์„ ํ•จ๊ป˜ ์ด์šฉํ•˜์—ฌ student model์„ ํ•™์Šตํ•˜๋Š” ๊ฒƒ์ด ๊ฐ€์žฅ ๋‹จ์ˆœํ•œ ํ˜•ํƒœ์˜ knowledge distillation์ด ๋ฉ๋‹ˆ๋‹ค.

 

    ํšจ๊ณผ์ ์ธ knowledge distillation์„ ์œ„ํ•ด์„œ ์ค‘์ ์ ์œผ๋กœ ๊ณ ๋ คํ•ด๋ด์•ผ ๋˜๋Š” ๋ถ€๋ถ„๋“ค์€ fine-tuning๊ณผ ๋น„์Šทํ•œ๋ฐ์š”. ๋จผ์ € ์ข‹์€ teacher model, ์ฆ‰ pre-trained model์„ ์ž˜ ์„ ํƒํ•˜๋Š” ๊ฒƒ์ด ์ค‘์š”ํ•  ๊ฒƒ์ด๊ตฌ์š”. knowledge distillation ์—ญ์‹œ overfitting ๋ฌธ์ œ๊ฐ€ ๋ฐœ์ƒํ•  ์ˆ˜ ์žˆ๊ธฐ ๋•Œ๋ฌธ์— ์ ์ ˆํ•œ regularization์ด ํ•„์š”ํ•ฉ๋‹ˆ๋‹ค. ํ•™์Šต ๊ณผ์ •์—์„œ๋Š” ์–ด๋–ค loss function์„ ์ด์šฉํ•ด์„œ student model์„ ํ•™์Šตํ• ์ง€ ๋˜ํ•œ ์ค‘์š”ํ•œ๋ฐ์š”. Teacher model์˜ output์ด ์ฃผ๋Š” ์ •๋ณด์™€ ์‹ค์ œ ์ •๋ณด๋ฅผ ์–ด๋–ป๊ฒŒ ํšจ๊ณผ์ ์œผ๋กœ ์กฐํ•ฉํ•  ์ง€์— ๋Œ€ํ•ด์„œ ์ถฉ๋ถ„ํžˆ ๊ณ ๋ฏผํ•ด๋ด์•ผ ํ•ฉ๋‹ˆ๋‹ค.


Summary

    ์ด๋ฒˆ ํฌ์ŠคํŒ…์—์„œ๋Š” ์ตœ๊ทผ ํญ๋ฐœ์ ์œผ๋กœ ๋ฐœ์ „๋˜๊ณ  ์žˆ๋Š” pre-trained big model๋“ค์„ ์„œ๋น„์Šค์— ์ž˜ ํ™œ์šฉํ•˜๊ธฐ ์œ„ํ•œ ๋ฐฉ๋ฒ•์œผ๋กœ fine-tuning๊ณผ knowledge์— ๋Œ€ํ•ด์„œ ๊ฐ„๋‹จํ•˜๊ฒŒ ์ด์•ผ๊ธฐ ํ•ด๋ณด์•˜์Šต๋‹ˆ๋‹ค. ๋ฐœ์ „๋œ big model๋“ค์„ ์‹ค์ œ ์„œ๋น„์Šค๋‚˜ ํŠน์ • ๋„๋ฉ”์ธ์— ‘์ž˜’ ํ™œ์šฉํ•˜๋Š” ๊ฒƒ์€ big model๋“ค์ด ๋ฐœ์ „๋จ์— ๋”ฐ๋ผ ์ค‘์š”ํ•ด์งˆ ๊ฒƒ์œผ๋กœ ๋ณด์ž…๋‹ˆ๋‹ค.

 

์ž˜๋ชป๋œ ๋‚ด์šฉ์ด ์žˆ๊ฑฐ๋‚˜ ๋ฌธ์˜์‚ฌํ•ญ์ด ์žˆ์„ ๊ฒฝ์šฐ ํŽธํ•˜๊ฒŒ ๋Œ“๊ธ€๋กœ ๋‚จ๊ฒจ์ฃผ์‹œ๋ฉด ๊ฐ์‚ฌํ•˜๊ฒ ์Šต๋‹ˆ๋‹ค!

์ €์ž‘์žํ‘œ์‹œ ๋น„์˜๋ฆฌ ๋ณ€๊ฒฝ๊ธˆ์ง€ (์ƒˆ์ฐฝ์—ด๋ฆผ)
    'ML&DL Research' ์นดํ…Œ๊ณ ๋ฆฌ์˜ ๋‹ค๋ฅธ ๊ธ€
    • ChatGPT๋ฅผ ๊ฐ€๋Šฅ์ผ€ํ•œ ๊ทผ๋ณธ์€ ๋ฌด์—‡์ผ๊นŒ? ChatGPT์—๊ฒŒ ๋ฌผ์–ด๋ณด์•˜๋‹ค.(feat. InstructGPT)
    ๐Ÿ”ฅGNGB-Young๐Ÿ”ฅ
    ๐Ÿ”ฅGNGB-Young๐Ÿ”ฅ
    GNGB ํŒ€ (Paul, Redstone, Young) ์ด ๊ด€๋ฆฌํ•˜๋Š” IT ๊ธฐ์ˆ  ๋ธ”๋กœ๊ทธ์ž…๋‹ˆ๋‹ค. ํŒ€์› ๋ชจ๋‘ CS ๋น„์ „๊ณต์ž์˜€์ง€๋งŒ, ํ˜„์žฌ๋Š” IT ์—…๊ณ„์— ์žฌ์ง์ค‘์ž…๋‹ˆ๋‹ค. ๋น„์ „๊ณต์ž๋“ค๋„ ์‰ฝ๊ฒŒ ์ดํ•ดํ•  ์ˆ˜ ์žˆ๋Š” ์ „๋‹ฌ๋ ฅ์žˆ๋Š” ์ปจํ…์ธ ๋ฅผ ์ œ๊ณตํ•˜๊ฒ ์Šต๋‹ˆ๋‹ค.

    ํ‹ฐ์Šคํ† ๋ฆฌํˆด๋ฐ”