
Decision Tree: Simple and Clear

_cactus 2021. 3. 8. 00:02
๋ฐ˜์‘ํ˜•

Decision tree
: a tree-structured model of decision rules
Because it can be used for both classification and regression, it is also called CART (Classification And Regression Tree)

node

  • tree์˜ node : ์งˆ๋ฌธ/๋‹ต์„ ๋‹ด๊ณ  ์žˆ์Œ
    • root node : ์ตœ์ƒ์œ„ node
      • ์ตœ์ƒ์œ„ node์˜ ์†์„ฑ feature๊ฐ€ ๊ฐ€์žฅ ์ค‘์š”ํ•œ ํŠน์„ฑ
    • leaf node : ๋งˆ์ง€๋ง‰ node (๋ง๋‹จ๋…ธ๋“œ)
    • ๋งŒ์•ฝ tree์˜ ๋ชจ๋“  leaf node๊ฐ€ pure node๊ฐ€ ๋  ๋•Œ๊นŒ์ง€ ์ง„ํ–‰ํ•˜๋ฉด model์˜ ๋ณต์žก๋„๋Š” ๋งค์šฐ ๋†’์•„์ง€๊ณ  overfitting๋จ
  • overfitting ๋ฐฉ์ง€
    1. tree์˜ ์ƒ์„ฑ์„ ์‚ฌ์ „์— ์ค‘์ง€ : pre-prunning (=๊นŠ์ด์˜ ์ตœ๋Œ€๋ฅผ ์„ค์ •, max_depth)
    2. ๋ฐ์ดํ„ฐ๊ฐ€ ์ ์€ node ์‚ญ์ œ/๋ณ‘ํ•ฉ : post-prunning

feature importance

  • tree๋Š” ์–ด๋–ป๊ฒŒ ์ž‘๋™ํ•˜๋Š”์ง€ ์š”์•ฝํ•˜๋Š” ์†์„ฑ๋“ค์„ ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ์Œ
  • feature imporance๋Š” tree๋ฅผ ๋งŒ๋“œ๋Š” ๊ฒฐ์ •์— ๊ฐ ํŠน์„ฑ์ด ์–ผ๋งˆ๋‚˜ ์ค‘์š”ํ•œ์ง€๋ฅผ ํ‰๊ฐ€
  • 0 ~ 1 ์‚ฌ์ด ๊ฐ’ (0 : ์ „ํ˜€ ์‚ฌ์šฉ๋˜์ง€ ์•Š์Œ, 1 : ์™„๋ฒฝํ•˜๊ฒŒ ์˜ˆ์ธก) 
  • feature importance๋„ ์‹œ๊ฐํ™”๊ฐ€ ๊ฐ€๋Šฅํ•˜๋‹ค


tree ์‹œ๊ฐํ™”

# graphviz renders the .dot output produced by export_graphviz
import graphviz
from sklearn.tree import export_graphviz
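
A sketch of the full visualization flow using these imports; the fitted tree and the feature/class names from the breast cancer dataset are assumptions, any fitted tree would work.

from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

cancer = load_breast_cancer()
tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(cancer.data, cancer.target)

# write the tree structure to a .dot string and render it with graphviz
dot_data = export_graphviz(tree, out_file=None,
                           feature_names=cancer.feature_names,
                           class_names=cancer.target_names,
                           filled=True, impurity=False)
graphviz.Source(dot_data)  # displays the tree in a Jupyter notebook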

 

classification

  • ์ƒˆ๋กœ์šด ๋ฐ์ดํ„ฐ๊ฐ€ ํŠน์ • terminal node์— ์†ํ•œ๋‹ค๋Š” ์ •๋ณด๋ฅผ ํ™•์ธํ•œ ๋’ค ํ•ด๋‹น terminal node์—์„œ ๊ฐ€์žฅ ๋นˆ๋„๊ฐ€ ๋†’์€ ๋ฒ”์ฃผ์— ์ƒˆ๋กœ์šด ๋ฐ์ด์ฒ˜๋ฅผ ๋ถ„๋ฅ˜

regression

  • ํ•ด๋‹น terminal node์˜ ์ข…์†๋ณ€์ˆ˜(y)์˜ ํ‰๊ท ์„ ์˜ˆ์ธก๊ฐ’์œผ๋กœ ๋ฐ˜ํ™˜
  • ์˜ˆ์ธก๊ฐ’์˜ ์ข…๋ฅ˜ = terminal node์˜ ๊ฐœ์ˆ˜
    • ๋”ฐ๋ผ์„œ, ๋งŒ์•ฝ terminal node ์ˆ˜๊ฐ€ 3๊ฐœ ๋ฟ์ด๋ผ๋ฉด ์ƒˆ๋กœ์šด ๋ฐ์ดํ„ฐ๊ฐ€ 1000๊ฐœ ์ฃผ์–ด์ง„๋‹ค ํ•˜๋”๋ผ๋„ decision tree๋Š” ๋”ฑ 3์ข…๋ฅ˜์˜ ๋‹ต๋งŒ ์ถœ๋ ฅ
