Understanding the data


Some knowledge discovered

Iris - different aspects of rules

Mushrooms - large number of symbolic features

The Mushroom Guide clearly states that there is no simple rule for determining the edibility of these mushrooms; no rule like "leaflets three, let it be" for Poisonous Oak and Ivy.

8124 cases, 22 symbolic attributes with up to 12 values each, equivalent to 118 logical features.
2480 missing values for attribute 11.
51.8% of the cases represent edible mushrooms, the rest non-edible ones.
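
The expansion of symbolic attributes into logical (binary) features can be made concrete with one-hot encoding. A minimal sketch (not from the original), assuming pandas and the standard UCI file name:

    # Expand the 22 symbolic mushroom attributes into binary features.
    import pandas as pd

    df = pd.read_csv("agaricus-lepiota.data", header=None, na_values="?")
    X = pd.get_dummies(df.iloc[:, 1:])  # column 0 is the edible/poisonous label
    print(X.shape[1])                   # roughly 118 binary (logical) features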

Safe rule for edible mushrooms:

odor = (almond ∨ anise ∨ none) ∧ spore-print-color ≠ green    48 errors, 99.41% correct
This is why animals have such a good sense of smell!
Other odors: creosote, fishy, foul, musty, pungent or spicy
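
Stated as a predicate, the safe rule is a one-liner. A sketch, assuming attribute values are decoded to names (the raw UCI file uses one-letter codes):

    def looks_edible(odor, spore_print_color):
        # odor in {almond, anise, none} AND spore-print-color != green
        return (odor in ("almond", "anise", "none")
                and spore_print_color != "green")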


Rules for poisonous mushrooms - 6 attributes only
R1) odor ≠ (almond ∨ anise ∨ none)    120 errors, 98.52% correct
R2) spore-print-color = green    48 errors, 99.41% correct
R3) odor = none ∧ stalk-surface-below-ring = scaly ∧
    stalk-color-above-ring ≠ brown    8 errors, 99.90% correct
R4) habitat = leaves ∧ cap-color = white    no errors!

R1 + R2 are quite stable, found even with 10% of data;
R3 and R4 may be replaced by other rules:

R'3) gill-size = narrow ∧ stalk-surface-above-ring = (silky ∨ scaly)
R'4) gill-size = narrow ∧ population = clustered

Only 5 attributes are used! These are the simplest rules found so far.
100% accuracy also in crossvalidation tests - the structure of this data is completely understandable.
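
As a sketch, the four rules R1-R4 amount to a single disjunction over a feature dictionary (attribute values again assumed decoded to names):

    def poisonous(m):
        return (m["odor"] not in ("almond", "anise", "none")      # R1
                or m["spore-print-color"] == "green"              # R2
                or (m["odor"] == "none"                           # R3
                    and m["stalk-surface-below-ring"] == "scaly"
                    and m["stalk-color-above-ring"] != "brown")
                or (m["habitat"] == "leaves"                      # R4
                    and m["cap-color"] == "white"))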

What chemical receptors in the nose realize such discrimination? What does this tell us about evolution?

Other methods:

Method                      Acc.%  Rules/Cond./Features  Type  Reference
RULENEG                     91.0   300/8087/?            C     Hayward et al.
HILLARY                     95.0   ?                     C     ML induction, Iba et al.
STAGGER                     95.0   ?                     C     ML induction, Schlimmer
REAL                        98.0   155/6603/?            C     Craven & Shavlik
RULEX                       98.5   1/3/?                 C     Andrews & Geva
DEDEC                       99.8   26/26/?               C     Tickle et al.
C4.5                        99.8   3/3/?                 C     Quinlan
Successive Regularization   99.4   1/4/2                 C     Ishikawa
                            99.9   2/22/4                C     Ishikawa
                            100    3/24/6                C     Ishikawa
TREX                        100    3/13/?                F     Geva
C-MLP2LN, SSV               98.5   1/3/1                 C     Duch et al.
                            99.4   2/4/2                 C     Duch et al.
                            99.9   3/7/4                 C     Duch et al.
C-MLP2LN                    100    4/9/6                 C     Duch et al.
SSV                         100    4/9/5                 C     Duch et al.

(Type: C = crisp, F = fuzzy rules)


3 Monk problems

Small artificial problems designed to test machine learning algorithms (Thrun et al. 1991).
6 features, 432 possible combinations.

Problem Monk 1:
head shape = body shape OR jacket color = red
124 cases randomly selected for training.

Problem Monk 2:
exactly two of the six features have their first values
169 cases randomly selected for training.

Problem Monk 3:
NOT (body shape = octagon OR jacket color = blue) OR (holding = sword AND jacket color = green)
122 cases randomly selected for training, 5% misclassifications added.
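
To make the three target concepts concrete, a minimal sketch in Python, assuming the standard integer coding of the six attributes (value 1 is the "first" value; octagon = 3, sword = 1, red = 1, green = 3, blue = 4):

    def monk1(head, body, smiling, holding, jacket, tie):
        return head == body or jacket == 1           # jacket color = red

    def monk2(*features):
        return sum(f == 1 for f in features) == 2    # exactly two first values

    def monk3(head, body, smiling, holding, jacket, tie):
        return (not (body == 3 or jacket == 4)       # NOT (octagon OR blue)
                or (holding == 1 and jacket == 3))   # sword AND green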

Such artificial data are difficult to handle.
In the C-MLP2LN network, 2 neurons must be trained simultaneously for Monk 1, and 4 neurons for Monk 2.
The initial rules are too general, covering cases from the wrong class.
Exceptions to the general rules correspond to neurons with a negative contribution to the output.
Hierarchical rules: first check the exceptions; if none apply, use the rules.
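
A minimal sketch of this exception-first evaluation order (the rule and exception predicates are hypothetical placeholders):

    def classify(x, exceptions, rules):
        # an exception firing overrides the general rules, like a neuron
        # with a negative contribution to the output
        if any(exc(x) for exc in exceptions):
            return False
        return any(rule(x) for rule in rules)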

Monk-1: 4 rules and 2 exceptions, 14 atomic formulae.
Monk-2: 16 rules and 8 exceptions, 132 atomic formulae.
Monk-3: 3 rules and 4 exceptions, 33 atomic formulae, 100% accuracy.

Fuzzy methods give poor results here.

Method                Monk-1  Monk-2  Monk-3  Remarks
AQ17-DCI              100     100     94.2    Michalski
AQ17-HCI              100     93.1    100     Michalski
AQ17-GA               100     86.8    100     Michalski
Assistant Pro.        100     81.5    100     Monk paper
mFOIL                 100     69.2    100     Monk paper
ID5R                  79.7    69.2    95.2    Monk paper
IDL                   97.2    66.2    --      Monk paper
ID5R-hat              90.3    65.7    --      Monk paper
TDIDT                 75.7    66.7    --      Monk paper
ID3                   98.6    67.9    94.4    Monk paper
AQR                   95.9    79.7    87.0    Monk paper
CLASSWEB 0.10         71.8    64.8    80.8    Monk paper
CLASSWEB 0.15         65.7    61.6    85.4    Monk paper
CLASSWEB 0.20         63.0    57.2    75.2    Monk paper
PRISM                 86.3    72.7    90.3    Monk paper
ECOWEB                82.7    71.3    68.0    Monk paper
Neural methods:
MLP                   100     100     93.1    Monk paper
MLP+reg.              100     100     97.2    Monk paper
Cascade correlation   100     100     97.2    Monk paper
FSM, Gaussians        94.5    79.3    95.5    Duch et al.
SSV                   100     80.6    97.2    Duch et al.
C-MLP2LN              100     100     100     Duch et al.


Ljubljana breast cancer

286 cases: 201 no-recurrence cancer events (70.3%), 85 recurrence events (29.7%).
9 attributes, symbolic with 2 to 13 values.

A single rule with an ELSE condition gives over 77% accuracy in crossvalidation;
the best systems do not exceed 78% accuracy (an insignificant difference).
All the knowledge contained in the data is:

IF more than 2 nodes were involved AND cancer is highly malignant THEN there will be recurrence.

More accurate C-MLP2LN rules, 78% overall accuracy:
R1: deg_malig = 3 ∧ breast = left ∧ node_caps = yes
R2: (deg_malig = 3 ∨ breast = left) ∧ inv_nodes ∉ [0,2] ∧ age ∉ [50,59]
1% gained - a statistically insignificant difference - but with much more complex rules.
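
The single rule with its ELSE condition, written out as a predicate (a sketch; deg_malig = 3 is the highest malignancy grade in this data):

    def recurrence(inv_nodes, deg_malig):
        # IF more than 2 nodes involved AND highly malignant THEN recurrence
        return inv_nodes > 2 and deg_malig == 3      # ELSE: no recurrence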

Method      Accuracy % (test)   Reference
C-MLP2LN    77.4                our group
CART        77.1                Weiss, Kapouleas
PVM         77.1                Weiss, Kapouleas
AQ15        66-72               Michalski et al.
Inductive   65-72               Clark, Niblett


Wisconsin breast cancer

699 cases: 458 benign (65.5%), 241 malignant (34.5%).
9 features (properties of cells), integers 1-10; one attribute value is missing in 16 cases.


Simplest rules from C-MLP2LN, large regularization:

IF f2 ≥ 7 ∨ f7 ≥ 6    THEN malignant     (95.6%)

Overall accuracy (including ELSE condition) is 94.9%.
f2 - uniformity of cell size; f7 - bland chromatin

Hierarchical sets of rules with increasing accuracy may be built.
A more accurate set of rules:

R1: f2 < 6 ∧ f4 < 3 ∧ f8 < 8    (99.8%)
R2: f2 < 9 ∧ f5 < 4 ∧ f7 < 2 ∧ f8 < 5    (100%)
R3: f2 < 10 ∧ f4 < 4 ∧ f5 < 4 ∧ f7 < 3    (100%)
R4: f2 < 7 ∧ f4 < 9 ∧ f5 < 3 ∧ f7 ∈ [4,9] ∧ f8 < 4    (100%)
R5: f2 ∈ [3,4] ∧ f4 < 9 ∧ f5 < 10 ∧ f7 < 6 ∧ f8 < 8    (99.8%)

R1 and R5 misclassify the same single benign vector.

The ELSE condition makes 6 errors, giving an overall reclassification accuracy of 99.00%.
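
The hierarchical set R1-R5 with the ELSE condition amounts to an ordered disjunction. A sketch, with f a dict mapping feature index (1-9) to its integer value:

    def benign(f):
        return ((f[2] < 6 and f[4] < 3 and f[8] < 8)                   # R1
                or (f[2] < 9 and f[5] < 4 and f[7] < 2 and f[8] < 5)   # R2
                or (f[2] < 10 and f[4] < 4 and f[5] < 4 and f[7] < 3)  # R3
                or (f[2] < 7 and f[4] < 9 and f[5] < 3                 # R4
                    and 4 <= f[7] <= 9 and f[8] < 4)
                or (3 <= f[2] <= 4 and f[4] < 9 and f[5] < 10          # R5
                    and f[7] < 6 and f[8] < 8))
    # ELSE: malignant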

In all cases features f3 and f6 (uniformity of cell shape and bare nuclei) are not important; f2 (uniformity of cell size) and f7 (bland chromatin) are the most important.

A 100% reliable set of rules rejects 51 cases (7.3%).

Results from 10-fold (stratified) crossvalidation; the accuracy of rules is hard to compare without a common test set:

Method                            % accuracy
IncNet                            97.1
3-NN, Manhattan                   97.1 ± 0.1
Fisher LDA                        96.8
MLP + backpropagation             96.7
LVQ (vector quantization)         96.6
Bayes (pairwise dependent)        96.6
FSM, 12 fuzzy Gaussian rules      96.5
Naive Bayes                       96.4
SSV, 3 crisp rules                96.3 ± 0.2
DB-CART                           96.2
Linear Discriminant Analysis      96.0
RBF                               95.9
CART (decision tree)              94.2
LFC, ASI, ASR (decision trees)    94.4-95.6
Quadratic Discriminant Analysis   34.5
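
For reference, a minimal sketch of the stratified 10-fold protocol behind this table, using present-day scikit-learn (not the software behind the original results); the loaded data is the related WDBC variant, not the original 9-feature set:

    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import StratifiedKFold, cross_val_score
    from sklearn.neighbors import KNeighborsClassifier

    X, y = load_breast_cancer(return_X_y=True)
    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
    knn = KNeighborsClassifier(n_neighbors=3, metric="manhattan")
    scores = cross_val_score(knn, X, y, cv=cv)   # cf. the 3-NN Manhattan row
    print(scores.mean(), scores.std())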

Reclassification results are only about 1% better than 10-fold CV:

Method      Accuracy   Rules, type
C-MLP2LN    99.0       5 crisp
C-MLP2LN    97.7       4 crisp
SSV         97.4       3 crisp
NEFCLASS    96.5       4 fuzzy
C-MLP2LN    94.9       2 crisp
NEFCLASS    92.7       3 fuzzy


The Hypothyroid dataset

Data from the UCI Machine Learning Database repository.
3 classes: primary hypothyroid, compensated hypothyroid, normal;
# training vectors 3772 = 93 + 191 + 3488
# test vectors 3428 = 73 + 177 + 3178
21 attributes (medical tests), 6 of them continuous.

Optimized rules: 4 errors on the training set (99.89%), 22 errors on the test set (99.36%)

primary hypothyroid:  TSH > 30.48 ∧ FTI < 64.27                               97.06%
primary hypothyroid:  TSH ∈ [6.02, 29.53] ∧ FTI < 64.27 ∧ T3 < 23.22          100%
compensated:          TSH > 6.02 ∧ FTI ∈ [64.27, 186.71] ∧ TT4 ∈ [50, 150.5)
                      ∧ on_thyroxine = no ∧ surgery = no                      98.96%
no hypothyroid:       ELSE                                                    100%

4 continuous and 2 binary attributes are used.
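
The optimized rules form a short decision list. A sketch with the thresholds copied from above (attribute names follow the UCI hypothyroid data):

    def diagnose(TSH, FTI, T3, TT4, on_thyroxine, surgery):
        if TSH > 30.48 and FTI < 64.27:
            return "primary hypothyroid"
        if 6.02 <= TSH <= 29.53 and FTI < 64.27 and T3 < 23.22:
            return "primary hypothyroid"
        if (TSH > 6.02 and 64.27 <= FTI <= 186.71
                and 50 <= TT4 < 150.5
                and not on_thyroxine and not surgery):
            return "compensated hypothyroid"
        return "no hypothyroid"                  # the ELSE condition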

Method                              % training   % test   Reference
C-MLP2LN rules + ASA                99.9         99.36    our group
CART                                99.8         99.36    Weiss
PVM                                 99.8         99.33    Weiss
IncNet                              99.7         99.24    our group
MLP init + a,b opt.                 99.5         99.1     our group
C-MLP2LN rules                      99.7         99.0     our group
Cascade correlation                 100.0        98.5     Schiffmann
BP + local adapt. rates             99.6         98.5     Schiffmann
BP + genetic opt.                   99.4         98.4     Schiffmann
Quickprop                           99.6         98.3     Schiffmann
RPROP                               99.6         98.0     Schiffmann
3-NN, Euclidean, 3 features used    98.7         97.9     our group
1-NN, Euclidean, 3 features used    98.4         97.7     our group
Best backpropagation                99.1         97.6     Schiffmann
1-NN, Euclidean, 8 features used    --           97.3     our group
Bayesian classifier                 97.0         96.1     Weiss
BP + conjugate gradient             94.6         93.8     Schiffmann
1-NN Manhattan, std data            --           93.8     our group
default: 250 test errors            --           92.7
1-NN Manhattan, raw data            --           92.2     our group

Why are logical rules the most accurate here?
Probably because doctors assigned patients to crisp classes (primary hypothyroid, compensated, normal) on the basis of sharp threshold decisions.
An MLP is not able to describe such sharp rectangular decision borders unless very large weights, i.e. steep slopes, are used.
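
A small numerical illustration of this point (a sketch, not from the original): across a border at x = 0, the logistic output changes sharply only when the weight is large.

    import numpy as np

    def sigmoid(x, w):
        # the weight w controls the slope of the transfer function
        return 1.0 / (1.0 + np.exp(-w * x))

    for w in (1, 10, 100):
        print(w, sigmoid(-0.1, w), sigmoid(0.1, w))
    # w=1: 0.475 vs 0.525 (soft); w=100: ~0 vs ~1 (a crisp, rule-like border)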


NASA Shuttle

Training set 43500, test set 14500, 9 attributes, 7 classes
Approximately 80% of the data belongs to class 1; class 6 contains only 6 vectors.

Rules from FSM after optimization: 15 rules, train 99.89%, test 99.81% accuracy.

32 rules obtained from SSV give 100% train, 99.99% test accuracy (1 error).

Method                 % training   % test   Reference
SSV, 32 rules          100          99.99    our result, 1 test error
NewID decision tree    100          99.99    Statlog
Baytree decision tree  100          99.98    Statlog
CN2 decision tree      100          99.97    Statlog
FSM, 17 rules          99.98        99.97    our group; 1 test error, 3 unclassified
CART                   99.96        99.92    Statlog
C4.5                   99.96        99.90    Statlog
FSM, 15 rules          99.89        99.81    our group
MLP                    95.50        99.57    Statlog
k-NN                   99.61        99.56    Statlog
RBF                    98.40        98.60    Statlog
Logistic DA            96.06        96.17    Statlog
LDA                    95.02        95.17    Statlog
Naive Bayes            95.40        95.50    Statlog
Default                78.41        79.16

FSM: 17 crisp rules make 3 errors on the training set (99.99%) with 8 vectors unclassified; no errors on the test set, with 9 vectors unclassified (99.94%).
Gaussian fuzzification (0.05%): 3 errors + 5 unclassified on training; 1 error (with p of the correct class close to 50%) and 3 unclassified on test.
NewID was never the best in the StatLog project, so this result is probably good luck.


More examples of discovered logical rules are on our rule-extraction WWW page and the SSV results page.

Włodzisław Duch