使用ID3算法根据信息增益构建决策树

发布于:2025-02-11 ⋅ 阅读:(73) ⋅ 点赞:(0)

题目1

在这里插入图片描述
在这里插入图片描述
在这里插入图片描述

题目2

根据ID3(信息增益)方法利用以下数据集构建决策树。

election season oil price stock price
no winter rise rise
no summer fall fall
no summer rise rise
yes winter rise fall
no winter fall fall
yes summer rise rise
yes summer fall fall
yes winter fall fall

设数据集为S,

step1,计算熵
H ( S ) = − ( 3 8 log ⁡ 2 3 8 + 5 8 log ⁡ 2 5 8 ) = 0.954 H(S) = - \left( \frac{3}{8} \log_2 \frac{3}{8} + \frac{5}{8} \log_2 \frac{5}{8} \right)= 0.954 H(S)=(83log283+85log285)=0.954
step2:计算条件熵

  • step2.1 计算特征election的条件熵
    • g ( S , e l e c t i o n ) = H ( S ) − H ( S ∣ e l e c t i o n ) g(S, election)=H(S)-H(S|election) g(S,election)=H(S)H(Selection)
    • H ( S ∣ e l e c t i o n ) = 4 8 H ( S ∣ e l e c t i o n = y e s ) + 4 8 H ( S ∣ e l e c t i o n = n o ) H(S|election) = \frac{4}{8} H(S|election=yes) + \frac{4}{8} H(S|election=no) H(Selection)=84H(Selection=yes)+84H(Selection=no)
    • H ( S ∣ e l e c t i o n = y e s ) = − ( 1 4 log ⁡ 2 1 4 + 3 4 log ⁡ 2 3 4 ) H(S|election=yes) = - \left( \frac{1}{4} \log_2 \frac{1}{4} + \frac{3}{4} \log_2 \frac{3}{4} \right) H(Selection=yes)=(41log241+43log243)
    • H ( S ∣ e l e c t i o n = n o ) = − ( 2 4 log ⁡ 2 2 4 + 2 4 log ⁡ 2 2 4 ) H(S|election=no) = - \left( \frac{2}{4} \log_2 \frac{2}{4} + \frac{2}{4} \log_2 \frac{2}{4} \right) H(Selection=no)=(42log242+42log242)
    • H ( S ∣ e l e c t i o n ) = 0.9055 H(S|election) = 0.9055 H(Selection)=0.9055
    • g ( S , e l e c t i o n ) = 0.954 − 0.9055 = 0.0485 g(S, election)=0.954 - 0.9055 = 0.0485 g(S,election)=0.9540.9055=0.0485
  • step2.2 计算特征season的条件熵
    • g ( S , s e a s o n ) = H ( S ) − H ( S ∣ s e a s o n ) g(S, season)=H(S)-H(S|season) g(S,season)=H(S)H(Sseason)
    • H ( S ∣ s e a s o n ) = 4 8 H ( S ∣ s e a s o n = w i n t e r ) + 4 8 H ( S ∣ s e a s o n = s u m m e r ) H(S|season) = \frac{4}{8} H(S|season=winter) + \frac{4}{8} H(S|season=summer) H(Sseason)=84H(Sseason=winter)+84H(Sseason=summer)
    • H ( S ∣ s e a s o n = w i n t e r ) = − ( 1 4 log ⁡ 2 1 4 + 3 4 log ⁡ 2 3 4 ) = 0.811 H(S|season=winter) = - \left( \frac{1}{4} \log_2 \frac{1}{4} + \frac{3}{4} \log_2 \frac{3}{4} \right) = 0.811 H(Sseason=winter)=(41log241+43log243)=0.811
    • H ( S ∣ s e a s o n = s u m m e r ) = − ( 2 4 log ⁡ 2 2 4 + 2 4 log ⁡ 2 2 4 ) = 1 H(S|season=summer) = - \left( \frac{2}{4} \log_2 \frac{2}{4} + \frac{2}{4} \log_2 \frac{2}{4} \right) = 1 H(Sseason=summer)=(42log242+42log242)=1
    • H ( S ∣ s e a s o n ) = 0.9055 H(S|season) = 0.9055 H(Sseason)=0.9055
    • g ( S , s e a s o n ) = 0.954 − 0.9055 = 0.0485 g(S, season)=0.954 - 0.9055 = 0.0485 g(S,season)=0.9540.9055=0.0485
  • step2.3 计算特征oil price 的条件熵
    • g ( S , o i l p r i c e ) = H ( S ) − H ( S ∣ o i l p r i c e ) g(S, oil price)=H(S)-H(S|oil price) g(S,oilprice)=H(S)H(Soilprice)
    • H ( S ∣ o i l p r i c e ) = 4 8 H ( S ∣ o i l p r i c e = r i s e ) + 4 8 H ( S ∣ o i l p r i c e = f a l l ) H(S|oil price) = \frac{4}{8} H(S|oil price=rise) + \frac{4}{8} H(S|oil price=fall) H(Soilprice)=84H(Soilprice=rise)+84H(Soilprice=fall)
    • H ( S ∣ o i l p r i c e = r i s e ) = − ( 3 4 log ⁡ 2 3 4 + 1 4 log ⁡ 2 1 4 ) H(S|oil price=rise) = - \left( \frac{3}{4} \log_2 \frac{3}{4} + \frac{1}{4} \log_2 \frac{1}{4} \right) H(Soilprice=rise)=(43log243+41log241)
    • H ( S ∣ o i l p r i c e = f a l l ) = − ( 0 4 log ⁡ 2 0 4 + 4 4 log ⁡ 2 4 4 ) = 0 H(S|oil price=fall) = - \left( \frac{0}{4} \log_2 \frac{0}{4} + \frac{4}{4} \log_2 \frac{4}{4} \right) = 0 H(Soilprice=fall)=(40log240+44log244)=0
    • H ( S ∣ o i l p r i c e ) = 0.4055 H(S|oil price)=0.4055 H(Soilprice)=0.4055
    • g ( S , o i l p r i c e ) = 0.954 − 0.4055 = 0.5485 g(S, oil price)= 0.954 - 0.4055 = 0.5485 g(S,oilprice)=0.9540.4055=0.5485

step3:选择信息增益最大的特征为根节点,即oil price为根节点。
step4:将 S S S根据特征oil price=rise和fall,划分为 S 1 S_1 S1 S 2 S_2 S2,其中oil price=fall,stock price全部为fall,则 S 2 S_2 S2为叶节点。对 S 1 S_1 S1从特征election 和season中选择新的特征。

  • step4.1 计算 g ( S 1 , e l e c t i o n ) = H ( S 1 ) − H ( S 1 ∣ e l e c t i o n ) g(S_1, election)=H(S_1)-H(S_1|election) g(S1,election)=H(S1)H(S1election)

    • H ( S 1 ∣ e l e c t i o n ) = 2 4 H ( S 1 ∣ e l e c t i o n = n o ) + 2 4 H ( S 1 ∣ e l e c t i o n = y e s ) H(S_1|election)=\frac{2}{4}H(S_1|election=no)+\frac{2}{4}H(S_1|election=yes) H(S1election)=42H(S1election=no)+42H(S1election=yes)
    • H ( S 1 ∣ e l e c t i o n = n o ) = − 2 2 log ⁡ 2 2 2 − 0 2 log ⁡ 2 0 2 = 0 H(S_1|election=no)=-\frac{2}{2}\log_2 \frac{2}{2}-\frac{0}{2}\log_2 \frac{0}{2} = 0 H(S1election=no)=22log22220log220=0
    • H ( S 1 ∣ e l e c t i o n = y e s ) = − 1 2 log ⁡ 2 1 2 − 1 2 log ⁡ 2 1 2 = 1 H(S_1|election=yes)=-\frac{1}{2}\log_2 \frac{1}{2}-\frac{1}{2}\log_2 \frac{1}{2} = 1 H(S1election=yes)=21log22121log221=1
    • H ( S 1 ∣ e l e c t i o n ) = 0.5 H(S_1|election) = 0.5 H(S1election)=0.5
    • g ( S 1 , e l e c t i o n ) = H ( S 1 ) − H ( S 1 ∣ e l e c t i o n ) = 0.811 − 0.5 = 0.311 g(S_1, election)=H(S_1)-H(S_1|election)=0.811-0.5=0.311 g(S1,election)=H(S1)H(S1election)=0.8110.5=0.311
  • step4.2 计算 g ( S 1 , s e a s o n ) = H ( S 1 ) − H ( S 1 ∣ s e a s o n ) g(S_1, season)=H(S_1)-H(S_1|season) g(S1,season)=H(S1)H(S1season)

    • H(S_1)=H(S|oil price=rise)=
    • H ( S 1 ∣ s e a s o n ) = 2 4 H ( S 1 ∣ s e a s o n = w i n t e r ) + 2 4 H ( S 1 ∣ s e a s o n = s u m m e r ) H(S_1|season) = \frac{2}{4}H(S_1|season=winter)+ \frac{2}{4} H(S_1|season=summer) H(S1season)=42H(S1season=winter)+42H(S1season=summer)
    • H ( S 1 ∣ s e a s o n = w i n t e r ) = − 1 2 log ⁡ 2 1 2 − 1 2 log ⁡ 2 1 2 = 1 H(S_1|season=winter)=-\frac{1}{2}\log_2 \frac{1}{2}-\frac{1}{2}\log_2 \frac{1}{2} = 1 H(S1season=winter)=21log22121log221=1
    • H ( S 1 ∣ s e a s o n = s u m m e r ) = − 2 2 log ⁡ 2 2 2 − 0 2 log ⁡ 2 0 2 = 0 H(S_1|season=summer)=-\frac{2}{2}\log_2 \frac{2}{2}-\frac{0}{2}\log_2 \frac{0}{2} = 0 H(S1season=summer)=22log22220log220=0
    • H ( S 1 ∣ s e a s o n ) = 0.5 H(S_1|season) = 0.5 H(S1season)=0.5
    • g ( S 1 , s e a s o n ) = H ( S 1 ) − H ( S 1 ∣ s e a s o n ) = 0.811 − 0.5 = 0.311 g(S_1, season)=H(S_1)-H(S_1|season)=0.811-0.5=0.311 g(S1,season)=H(S1)H(S1season)=0.8110.5=0.311

step5: g ( S 1 , e l e c t i o n ) g(S_1, election) g(S1,election) g ( S 1 , s e a s o n ) g(S_1, season) g(S1,season)一样,这里选择election作为节点划分。election=no,stock price全为rise,为叶子节点。对数据集election=yes,设为 H ( S 11 ) H(S_{11}) H(S11),只有season特征,选择season作为划分特征,两子集都为纯。当 season = winter 时,stock price = fall。当 season = summer 时,stock price = rise

oil price?
├── fall: stock price = fall
└── rise:
    ├── election?
    │   ├── no: stock price = rise
    │   └── yes:
    │       ├── season?
    │       │   ├── winter: stock price = fall
    │       │   └── summer: stock price = rise

log计算:

  • − 1 2 log ⁡ 2 1 2 − 1 2 log ⁡ 2 1 2 = 1 \displaystyle -\frac{1}{2}\log_2 \frac{1}{2}-\frac{1}{2}\log_2 \frac{1}{2} = 1 21log22121log221=1
  • − 2 2 log ⁡ 2 2 2 − 0 2 log ⁡ 2 0 2 = 0 \displaystyle -\frac{2}{2}\log_2 \frac{2}{2}-\frac{0}{2}\log_2 \frac{0}{2} = 0 22log22220log220=0
  • − 2 3 log ⁡ 2 2 3 − 1 3 log ⁡ 2 1 3 = 0.918 \displaystyle -\frac{2}{3}\log_2 \frac{2}{3}-\frac{1}{3}\log_2 \frac{1}{3} = 0.918 32log23231log231=0.918
  • − 3 4 log ⁡ 2 3 4 − 1 4 log ⁡ 2 1 4 = 0.811 \displaystyle - \frac{3}{4} \log_2 \frac{3}{4} - \frac{1}{4} \log_2 \frac{1}{4}=0.811 43log24341log241=0.811
  • − 3 5 log ⁡ 2 3 5 − 2 5 log ⁡ 2 2 5 = 0.971 \displaystyle - \frac{3}{5} \log_2 \frac{3}{5} - \frac{2}{5} \log_2 \frac{2}{5}=0.971 53log25352log252=0.971
  • − 1 5 log ⁡ 2 1 5 − 4 5 log ⁡ 2 4 5 = 0.722 \displaystyle - \frac{1}{5} \log_2 \frac{1}{5} - \frac{4}{5} \log_2 \frac{4}{5}=0.722 51log25154log254=0.722
  • − 1 6 log ⁡ 2 1 6 − 5 6 log ⁡ 2 5 6 = 0.650 \displaystyle - \frac{1}{6} \log_2 \frac{1}{6} - \frac{5}{6} \log_2 \frac{5}{6}=0.650 61log26165log265=0.650