
data mining 4

Naive Bayes Classifier

Overview. Naive rule model: a model that needs no predictors, used mainly as a baseline for comparison with more advanced models. Naive Bayes classifier. => What these techniques have in common is that they make almost no assumptions about the structure of the data (data-driven). Naive rule: a simple rule that sets all predictors aside and classifies any record into the most prevalent of the m classes (prevalent class). Naive Bayes classifier: a more refined method than the naive rule, i.e. naive rule + predictor information. Unlike other classifiers, the naive Bayes classifier in this form applies only when the predictors are categorical, so numerical predictors must first be converted to categorical ones. The naive Bayes technique, when the data set is very large..
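The contrast between the naive rule and naive Bayes with categorical predictors can be sketched in a few lines of Python. This is only an illustration: the function names, the toy records, and the class labels are made up for this example, and no smoothing is applied, so a category never seen with a class gives that class a score of zero.

```python
from collections import Counter, defaultdict

def naive_rule(labels):
    """Predict the most prevalent class, ignoring every predictor."""
    return Counter(labels).most_common(1)[0][0]

def fit_naive_bayes(rows, labels):
    """Count class frequencies and, per class, the frequency of each category."""
    class_counts = Counter(labels)
    # value_counts[(feature_index, class)][category] -> number of records
    value_counts = defaultdict(Counter)
    for row, label in zip(rows, labels):
        for j, value in enumerate(row):
            value_counts[(j, label)][value] += 1
    return class_counts, value_counts

def predict_naive_bayes(model, row):
    """Pick the class maximizing P(class) * product of P(value | class)."""
    class_counts, value_counts = model
    total = sum(class_counts.values())
    scores = {}
    for label, n in class_counts.items():
        score = n / total  # prior: the same information the naive rule uses
        for j, value in enumerate(row):
            score *= value_counts[(j, label)][value] / n  # predictor information
        scores[label] = score
    return max(scores, key=scores.get)

# Hypothetical toy data: (prior_legal_trouble, company_size) -> class
rows = [("yes", "small"), ("yes", "large"), ("no", "small"),
        ("no", "small"), ("no", "large")]
labels = ["fraud", "fraud", "truthful", "truthful", "truthful"]

baseline = naive_rule(labels)
model = fit_naive_bayes(rows, labels)
prediction = predict_naive_bayes(model, ("yes", "small"))
print(baseline, prediction)
```

On this toy data the naive rule always answers with the majority class, while naive Bayes can pick the minority class when the predictor values point that way.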

Decision Tree: Short and Clear

Decision tree: because it can be used for both classification and regression, it is also called CART (Classification And Regression Tree). Node: each node of the tree holds a question or an answer. Root node: the topmost node; the feature tested at the root is the most important feature. Leaf node: a final (terminal) node. If the tree is grown until every leaf node is a pure node, the model becomes very complex and overfits. Preventing overfitting: stop growing the tree early, i.e. pre-pruning (= set a maximum depth, max_depth); remove nodes that contain few data points..
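The effect of pre-pruning with a depth limit can be illustrated with a small hand-written tree. This is a rough sketch under stated assumptions, not how any particular library implements it: splits are chosen by Gini impurity, `build_tree`, `best_split`, and `depth_of` are names invented here, and the six training points are made up. The `max_depth` argument plays the same role as the parameter mentioned above.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 0 for a pure node."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(X, y):
    """Return the (feature, threshold) that lowers weighted Gini most, or None."""
    best, best_score = None, gini(y)
    for j in range(len(X[0])):
        for t in sorted({row[j] for row in X}):
            left = [lab for row, lab in zip(X, y) if row[j] <= t]
            right = [lab for row, lab in zip(X, y) if row[j] > t]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if score < best_score:
                best, best_score = (j, t), score
    return best

def build_tree(X, y, max_depth, depth=0):
    """Grow a tree; stop early (pre-pruning) once max_depth is reached."""
    split = None if depth >= max_depth else best_split(X, y)
    if split is None:
        return Counter(y).most_common(1)[0][0]  # leaf: majority class
    j, t = split
    left = [(r, lab) for r, lab in zip(X, y) if r[j] <= t]
    right = [(r, lab) for r, lab in zip(X, y) if r[j] > t]
    return (j, t,
            build_tree([r for r, _ in left], [lab for _, lab in left], max_depth, depth + 1),
            build_tree([r for r, _ in right], [lab for _, lab in right], max_depth, depth + 1))

def predict(tree, row):
    """Follow the questions from the root down to a leaf."""
    while isinstance(tree, tuple):
        j, t, left, right = tree
        tree = left if row[j] <= t else right
    return tree

def depth_of(tree):
    return 0 if not isinstance(tree, tuple) else 1 + max(depth_of(tree[2]), depth_of(tree[3]))

# Hypothetical one-feature data with one "noisy" point at x = 4
X = [[1], [2], [3], [4], [5], [6]]
y = ["a", "a", "b", "a", "b", "b"]

full_tree = build_tree(X, y, max_depth=10)    # grows until every leaf is pure
shallow_tree = build_tree(X, y, max_depth=1)  # pre-pruned
print(depth_of(full_tree), depth_of(shallow_tree))
```

The fully grown tree memorizes the noisy point at x = 4, while the depth-limited tree keeps a single split and treats that point as part of the "b" region.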

Random Forest: Short and Clear

Ensemble: a technique that combines several machine learning models into a more powerful one. It is effective for both classification and regression. Random forest and gradient boosting both use the decision tree as their basic building block. Random forest: a collection of many decision trees, each slightly different from the others. Background of the random forest: each tree can predict relatively well but tends to overfit part of the data. So if we build many trees that work well but overfit in different directions and average their results, we can reduce the overfitting. This keeps the predictive performance of the tree model while overf..
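The idea of building many slightly different trees and combining their answers can be sketched as follows. This is a deliberately simplified illustration, not a full random forest: each "tree" is only a depth-1 stump, the trees differ only through bootstrap sampling (no random feature selection), and for classification the "averaging" is done by majority vote. All names and the toy data are assumptions made for this example.

```python
import random
from collections import Counter

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def fit_stump(X, y):
    """Depth-1 tree: the best single threshold, or a constant leaf if none helps."""
    best, best_score = None, gini(y)
    for j in range(len(X[0])):
        for t in sorted({row[j] for row in X}):
            left = [lab for row, lab in zip(X, y) if row[j] <= t]
            right = [lab for row, lab in zip(X, y) if row[j] > t]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if score < best_score:
                best_score = score
                best = (j, t,
                        Counter(left).most_common(1)[0][0],
                        Counter(right).most_common(1)[0][0])
    return best if best is not None else Counter(y).most_common(1)[0][0]

def predict_stump(stump, row):
    if not isinstance(stump, tuple):
        return stump  # constant leaf
    j, t, left_label, right_label = stump
    return left_label if row[j] <= t else right_label

def fit_forest(X, y, n_trees, seed=0):
    """Train each stump on its own bootstrap sample (drawn with replacement)."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(X)) for _ in range(len(X))]
        forest.append(fit_stump([X[i] for i in idx], [y[i] for i in idx]))
    return forest

def predict_forest(forest, row):
    """Combine the individual predictions by majority vote."""
    votes = Counter(predict_stump(stump, row) for stump in forest)
    return votes.most_common(1)[0][0]

# Hypothetical, well-separated one-feature data
X = [[v] for v in range(10)] + [[v] for v in range(20, 30)]
y = ["a"] * 10 + ["b"] * 10

forest = fit_forest(X, y, n_trees=25)
print(predict_forest(forest, [0]), predict_forest(forest, [29]))
```

Because every stump sees a different resample, individual thresholds vary from tree to tree, and the vote smooths out those differences.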

K-means Clustering: Short and Clear

Algorithm:
1) Specify the number of clusters k and choose k initial means.
2) Compute the distance between each of the k chosen cluster centers and every data point; assign each data point to the cluster whose center is nearest.
3) Set the mean of the data points assigned to each cluster as that cluster's new center.
4) Repeat steps 2-3 until the algorithm converges (until the cluster centers no longer change).
Source: m.blog.naver.com/PostView.nhn?blogId=samsjang&logNo=221016339218&proxyReferer=https:%2F%2Fwww.google.com%2F [Part 30] k-means clustering. So far we have used data for which the answer is already given..
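The four steps above can be sketched in plain Python. This is a minimal illustration rather than a production implementation: Euclidean distance is assumed, the initial centers are passed in by hand, and the function name `kmeans` and the sample points are made up for this example.

```python
import math

def kmeans(points, initial_centers, max_iter=100):
    """Plain k-means; k is the number of initial centers supplied (step 1)."""
    centers = [tuple(c) for c in initial_centers]
    clusters = [[] for _ in centers]
    for _ in range(max_iter):
        # Step 2: assign every point to the cluster with the nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
            clusters[nearest].append(p)
        # Step 3: move each center to the mean of its assigned points
        # (an empty cluster keeps its previous center).
        new_centers = [
            tuple(sum(coord) / len(members) for coord in zip(*members)) if members else centers[i]
            for i, members in enumerate(clusters)
        ]
        # Step 4: stop once the centers no longer change.
        if new_centers == centers:
            break
        centers = new_centers
    return centers, clusters

# Hypothetical 2-D points forming two obvious groups
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
centers, clusters = kmeans(points, initial_centers=[(1.0, 1.0), (2.0, 1.0)])
print(centers)
print(clusters)
```

Even though both initial centers start inside the same group, the second center is pulled toward the far group after the first update, and the assignments settle within a few iterations.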
