21.14 Remove Own Stop Words
docs <- tm_map(docs, removeWords, c("department", "email"))
viewDocs(docs, 16)
## hybrid weighted random forests
## classifying highdimensional data
## baoxun xu joshua zhexue huang graham williams
## yunming ye
##
##
## computer science harbin institute technology shenzhen graduate
## school shenzhen china
##
## shenzhen institutes advanced technology chinese academy sciences shenzhen
## china
## amusing gmailcom
## random forests popular classification method based ensemble
## single type decision trees subspaces data literature
## many different types decision tree algorithms including c cart
## chaid type decision tree algorithm may capture different information
## structure paper proposes hybrid weighted random forest algorithm
## simultaneously using feature weighting method hybrid forest method
## classify high dimensional data hybrid weighted random forest algorithm
## can effectively reduce subspace size improve classification performance
## without increasing error bound conduct series experiments eight
## high dimensional datasets compare method traditional random forest
## methods classification methods results show method
## consistently outperforms traditional methods
## keywords random forests hybrid weighted random forest classification decision tree
##
##
##
## introduction
##
## random forests popular classification
## method builds ensemble single type
## decision trees different random subspaces
## data decision trees often either built using
## c cart one type within
## single random forest recent years random
## forests attracted increasing attention due
## competitive performance compared
## classification methods especially highdimensional
## data algorithmic intuitiveness simplicity
## important capability ensemble using
## bagging stochastic discrimination
## several methods proposed grow random
## forests subspaces data
## methods popular forest construction
## procedure proposed breiman first use
## bagging generate training data subsets building
## individual trees
## subspace features
## randomly selected node grow branches
## decision tree trees combined
## ensemble forest ensemble learner
## performance random forest highly dependent
## two factors performance tree
## diversity trees forests breiman
## formulated overall performance set trees
## average strength proved generalization
##
## error random forest bounded ratio
## average correlation trees divided square
## average strength trees
## high dimensional data text data
## usually large portion features
## uninformative classes forest building
## process informative features large
## chance missed randomly select small
## subspace breiman suggested selecting log m
## features subspace m number
## independent features data high dimensional
## data result weak trees created
## subspaces average strength trees reduced
## error bound random forest enlarged
## therefore large proportion weak
## trees generated random forest forest
## large likelihood make wrong decision mainly
## results weak trees classification power
## address problem aim optimize decision
## trees random forest two strategies one
## straightforward strategy enhance classification
## performance individual trees feature weighting
## method subspace sampling
## method feature weights computed respect
## correlations features class feature
## regarded probabilities feature
## selected subspaces method obviously
## increases classification performance individual
##
## trees subspaces will biased contain
## informative features however chance
## correlated trees also increased since features
## large weights likely repeatedly selected
## second strategy straightforward use
## several different types decision trees training
## data subset increase diversity trees
## select optimal tree individual
## tree classifier random forest model work
## presented extends algorithm developed
## specifically build three different types tree
## classifiers c cart chaid
## training data subset evaluate performance
## three classifiers select best tree
## way build hybrid random forest may
## include different types decision trees ensemble
## added diversity decision trees can effectively
## improve accuracy tree forest
## hence classification performance ensemble
## however use method build best
## random forest model classifying high dimensional
## data can sure subspace size best
## paper propose hybrid weighted random
## forest algorithm simultaneously using new feature
## weighting method together hybrid random
## forest method classify high dimensional data
## new random forest algorithm calculate feature
## weights use weighted sampling randomly select
## features subspaces node building different
## types trees classifiers c cart chaid
## training data subset select best tree
## individual tree final ensemble model
## experiments performed high dimensional
## text datasets dimensions ranging
## compared performance eight random
## forest methods wellknown classification methods
## c random forest cart random forest chaid
## random forest hybrid random forest c weighted
## random forest cart weighted random forest chaid
## weighted random forest hybrid weighted random
## forest support vector machines naive bayes
## knearest neighbors
## experimental
## results show hybrid weighted random forest
## achieves improved classification performance
## ten competitive methods
## remainder paper organized follows
## section introduce framework building
## hybrid weighted random forest describe new
## random forest algorithm section summarizes four
## measures evaluate random forest models present
## experimental results high dimensional text datasets
## section section contains conclusions
##
## table contingency table input feature class
## feature y
##
## hybrid weighted random forests
##
## section first introduce feature weighting
## method subspace sampling present
## general framework building hybrid random forests
## integrating two methods propose novel
## hybrid weighted random forest algorithm
##
## notation
##
## let y class target feature q distinct
## class labels yj j q purposes
## discussion consider single categorical feature
## dataset d p distinct category values
## denote distinct values ai p
## numeric features can discretized p intervals
## supervised discretization method
## assume d val objects size subset
## d satisfying condition ai y yj
## denoted ij considering combinations
## categorical values labels y can
## obtain contingency table y shown
## table far right column contains marginal
## totals feature
##    A_{i.} = \sum_{j=1}^{q} A_{ij}   for i = 1, ..., p
## bottom row marginal totals class
## feature y
##    A_{.j} = \sum_{i=1}^{p} A_{ij}   for j = 1, ..., q
## grand total total number samples
## bottom right corner
##    A = \sum_{i=1}^{p} \sum_{j=1}^{q} A_{ij}
## given training dataset d feature first
## compute contingency table feature weights
## computed using two methods discussed
## following subsection
##
## feature weighting method
##
## subsection give details feature
## weighting method subspace sampling random
## forests consider mdimensional feature space
## present compute
##
## weights w w wm every feature space
## weights used improved algorithm
## grow decision tree random forest
## feature weight computation
## weight feature represents correlation
## values feature values
## class feature y larger weight will indicate
## class labels objects training dataset
## correlated values feature indicating
## informative class objects thus
## suggested stronger power predicting
## classes new objects
## following propose use chisquare
## statistic compute feature weights
## method can quantify correspondence two
## categorical variables
## given contingency table input feature
## class feature y dataset d chisquare statistic
## two features computed
##    corr(A, Y) = \sum_{i=1}^{p} \sum_{j=1}^{q} (A_{ij} - t_{ij})^2 / t_{ij}
## ij observed frequency
## contingency table tij expected frequency
## computed
##    t_{ij} = A_{i.} A_{.j} / A
## larger measure corra y
## informative feature predicting class y
## normalized feature weight
## practice feature weights normalized feature
## subspace sampling use corra y measure
## informativeness features consider
## feature weights however treat weights
## probabilities features normalize measures
## ensure sum normalized feature weights
## equal let corrai y m set
## m feature measures compute normalized
## weights
##    w_i = \sqrt{corr(A_i, Y)} / \sum_{j=1}^{m} \sqrt{corr(A_j, Y)}
## use square root smooth values
## measures wi can considered probability
## feature ai randomly sampled subspace
## informative feature larger weight
## higher probability feature selected
##
## diversity commonly obtained using bagging
## random subspace sampling introduce
## element diversity using different types trees
## considering analogy forestry different data subsets bagging represent soil structures different decision tree algorithms represent different tree species approach two key aspects
## one use three types decision tree algorithms
## generate three different tree classifiers training data subset evaluate accuracy
## tree measure tree importance
## paper use outofbag accuracy assess importance tree
## following breiman use bagging generate
## series training data subsets build
## trees tree data subset used grow
## tree called inofbag iob data
## remaining data subset called outofbag oob
## data since oob data used building trees
## can use data objectively evaluate trees
## accuracy importance oob accuracy gives
## unbiased estimate true accuracy model
## given n instances training dataset d tree
## classifier hk iobk built kth training data
## subset iobk define oob accuracy tree
##    OOBAcc_k = \sum_{i=1}^{n} I(h_k(d_i) = y_i; d_i \in OOB_k) /
##               \sum_{i=1}^{n} I(d_i \in OOB_k)
## indicator function larger
## oobacck better classification quality tree
## use outofbag data subset oobi calculate
## outofbag accuracies three types trees
## c cart chaid evaluation values e
## e e respectively
##
## framework building hybrid random
## forest
##
## ensemble learner performance random
## forest highly dependent two factors diversity
## among trees accuracy tree
## fig illustrates procedure building hybrid
## random forest model firstly series iob oob
## datasets generated entire training dataset
## bagging three types tree classifiers c
## cart chaid built using iob dataset
## corresponding oob dataset used calculate
## oob accuracies three tree classifiers finally
## select tree highest oob accuracy
## final tree classifier included hybrid
## random forest
## building hybrid random forest model
## way will increase diversity among trees
## classification performance individual tree
## classifier also maximized
##
##
##
##
##
##
## decision tree algorithms
##
## core approach diversity decision
## tree algorithms random forest different decision
## tree algorithms grow structurally different trees
## training data selecting good decision tree
## algorithm grow trees random forest critical
##
## difference lies way split node
## split functions binary branches multibranches work use different decision
## tree algorithms build hybrid random forest
##
##
##
## figure hybrid random forests framework
##
## performance random forest studies
## considered different decision tree algorithms
## affect random forest paper
## common decision tree algorithms follows
## classification trees c supervised
## learning classification algorithm used construct
## decision trees given set preclassified objects
## described vector attribute values construct
## mapping attribute values classes c uses
## divideandconquer approach grow decision trees
## beginning entire dataset tree constructed
## considering predictor variable dividing
## dataset best predictor chosen node
## using impurity diversity measure goal
## produce subsets data homogeneous
## respect target variable c selects test
## maximizes information gain ratio igr
## classification regression tree cart
## recursive partitioning method can used
## regression classification main difference
## c cart test selection
## evaluation process
## chisquared automatic interaction detector
## chaid method based chisquare test
## association chaid decision tree constructed
## repeatedly splitting subsets space two
## nodes determine best split
## node allowable pair categories predictor
## variables merged statistically
## significant difference within pair respect
## target variable
## decision tree algorithms can see
##
## hybrid weighted random forest algorithm
##
## subsection present hybrid weighted
## random forest algorithm simultaneously using
## feature weights hybrid method classify high
## dimensional data benefits algorithm
## two aspects firstly compared hybrid forest
## method can use small subspace size
## create accurate random forest models
## secondly
## compared building random forest using feature
## weighting can use several different types
## decision trees training data subset increase
## diversities trees added diversity
## decision trees can effectively improve classification
## performance ensemble model detailed steps
## introduced algorithm
## input parameters algorithm include training
## dataset d set features class feature y
## number trees random forest k
## size subspaces m output random forest
## model m lines form loop building k
## decision trees loop line samples training
## data d sampling replacement generate
## inofbag data subset iobi building decision tree
## line build three types tree classifiers c
## cart chaid procedure line calls
## function createtreej build tree classifier
## line calculates outofbag accuracy tree
## classifier procedure line selects tree
## classifier maximum outofbag accuracy k
## decision tree trees thus generated form hybrid
## weighted random forest model m
## generically function createtreej first creates
## new node tests stopping criteria decide
## whether return upper node split
## node choose split node feature
## weighting method used randomly select m features
## subspace node splitting features
## used candidates generate best split
## partition node subset partition
## createtreej called create new node
## current node leaf node created returns
## parent node recursive process continues
## full tree generated
##
## algorithm new random forest algorithm
## input
## d training dataset
## features space
## y class features space y y yq
## k number trees
## m size subspaces
## output random forest m
## method
## for i = 1 to k
##    draw bootstrap sample inofbag data subset
##    iobi outofbag data subset oobi
##    training dataset d
##    for j = 1 to 3
##       hij iobi = createtreej
##       use outofbag data subset oobi calculate
##       outofbag accuracy oobaccij tree
##       classifier hij iobi equation
##    end for
##    select hi iobi highest outofbag
##    accuracy oobacci optimal tree
## end for
## combine k tree classifiers
## h iob h iob hk iobk random
## forest m
##
## function createtree
##    create new node n
##    if stopping criteria met
##       return n leaf node
##    else
##       for j = 1 to m
##          compute informativeness measure
##          corraj y equation
##       end for
##       compute feature weights w w wm
##       equation
##       use feature weighting method randomly
##       select m features
##       use m features candidates generate
##       best split node partitioned
##       call createtree split
##    end if
##    return n
## evaluation measures
##
## paper use five measures ie strength
## correlation error bound c s test accuracy f
## metric evaluate random forest models strength
## measures collective performance individual trees
## random forest correlation measures
## diversity trees ratio correlation
## square strength c s indicates
## generalization error bound random forest model
## three measures introduced
## accuracy measures performance random forest
## model unseen test data f metric
##
##
##
## commonly used measure classification performance
##
##
## strength correlation measures
##
## follow breimans method described
## calculate strength correlation ratio c s
## following breimans notation denote strength
## s correlation let hk iobk kth
## tree classifier grown kth training data iobk
## sampled d replacement
## assume
## random forest model contains k trees outofbag
## proportion votes di d class j
##    Q(d_i, j) = \sum_{k=1}^{K} I(h_k(d_i) = j; d_i \notin IOB_k) /
##                \sum_{k=1}^{K} I(d_i \notin IOB_k)
## number trees random forest
## trained without di classify di class
## j divided number training datasets
## containing di
## strength s computed
##    s = (1/n) \sum_{i=1}^{n} ( Q(d_i, y_i) - \max_{j \ne y_i} Q(d_i, j) )
## n number objects d yi indicates
## true class di
## correlation computed
##    \bar{\rho} = [ (1/n) \sum_{i=1}^{n} ( Q(d_i, y_i) - \max_{j \ne y_i} Q(d_i, j) )^2 - s^2 ] /
##                 [ (1/K) \sum_{k=1}^{K} \sqrt{ p_k + \hat{p}_k + (p_k - \hat{p}_k)^2 } ]^2
##    p_k = \sum_{i=1}^{n} I(h_k(d_i) = y_i; d_i \notin IOB_k) /
##          \sum_{i=1}^{n} I(d_i \notin IOB_k)
##    \hat{p}_k = \sum_{i=1}^{n} I(h_k(d_i) = \hat{j}(d_i, Y); d_i \notin IOB_k) /
##                \sum_{i=1}^{n} I(d_i \notin IOB_k)
##    \hat{j}(d_i, Y) = \arg\max_{j \ne y_i} Q(d, j)
## class obtains maximal number votes
## among classes true class
##
##
## general error bound measure c s
##
## given strength correlation outofbag
## estimate c s measure can computed
## important theoretical result breimans method
## upper bound generalization error
## random forest ensemble derived
##    PE^{*} \le \bar{\rho} (1 - s^2) / s^2
##
##
##
## mean value correlations
## pairs individual classifiers s strength
## set individual classifiers estimated
##
##
## average accuracy individual classifiers d
## outofbag evaluation inequality shows
## generalization error random forest affected
## strength individual classifiers mutual
## correlations therefore breiman defined c s ratio
## measure random forest
##    c/s^2 = \bar{\rho} / s^2
##
##
##
## smaller ratio better performance
## random forest c s gives guidance
## reducing generalization error random forests
##
##
## test accuracy
##
## test accuracy measures classification performance random forest test data set let
## dt test data yt class labels given
## di dt number votes di class j
##    N(d_i, j) = \sum_{k=1}^{K} I(h_k(d_i) = j)
##
## table summary statistic highdimensional
## datasets
## name features instances classes minority
## fbis re re tr wap tr las las
## test accuracy calculated
##    Acc = (1/n) \sum_{i=1}^{n} I( N(d_i, y_i) - \max_{j \ne y_i} N(d_i, j) > 0 )
## n number objects dt yi indicates
## true class di
##
## f metric
##
## evaluate performance classification methods
## dealing unbalanced class distribution use
## f metric introduced yang liu
## measure equal harmonic mean recall
## precision overall f score entire
## classification problem can computed microaverage macroaverage
## microaveraged f computed globally
## classes emphasizes performance classifier
## common classes define follows
##    MicroF1 = 2 \sum_{i} TP_i / ( \sum_{i} (TP_i + FP_i) + \sum_{i} (TP_i + FN_i) )
## q number classes t pi true positives
## number objects correctly predicted class
## f pi false positives number objects
## predicted belong class
## macroaveraged f first computed locally
## class average classes taken
## emphasizes performance classifier rare
## categories define follows
##    F1_i = 2 TP_i / ( (TP_i + FP_i) + (TP_i + FN_i) )
## f category macroaveraged f
## computed
##    MacroF1 = (1/q) \sum_{i=1}^{q} F1_i
## larger microf macrof values
## higher classification performance classifier
##
##
## experiments
##
## section present two experiments
## demonstrate effectiveness new random
## forest algorithm classifying high dimensional data
## high dimensional datasets various sizes
## characteristics used experiments
## first experiment designed show proposed
## method can reduce generalization error bound
## c s improve test accuracy size
## selected subspace large second
## experiment used demonstrate classification
## performance proposed method comparison
## classification methods ie svm nb knn
##
##
## datasets
##
## experiments used eight realworld high
## dimensional datasets datasets selected
## due diversities number features
## number instances number classes
## dimensionalities vary instances
## vary minority class rate varies
## dataset randomly
## select instances training dataset
## remaining data test dataset detailed
## information eight datasets listed table
## fbis re re tr wap tr las
## las datasets classical text classification
## benchmark datasets carefully selected
##
## preprocessed han karypis dataset fbis
## compiled foreign broadcast information
## service trec datasets re re
## selected reuters text categorization test
## collection distribution datasets tr
## tr derived trec trec
## trec dataset wap webace
## project wap datasets las las
## selected los angeles times trec
## classes datasets generated
## relevance judgment provided collections
##
##
## performance comparisons random forest methods
##
## purpose experiment evaluate
## effect hybrid weighted random forest
## method h w rf strength correlation c s
## test accuracy
## eight high dimensional
## datasets analyzed results compared
## seven random forest methods ie c
## random forest c rf cart random forest
## cart rf chaid random forest chaid rf
## hybrid random forest h rf c weighted random
## forest c w rf cart weighted random forest
## cart w rf chaid weighted random forest
## chaid w rf dataset ran
## random forest algorithm different sizes
## feature subspaces since number features
## datasets large started subspace
## features increased subspace
## features time given subspace size built
## trees random forest model order
## obtain stable result built random forest models
## subspace size dataset algorithm
## computed average values four measures
## strength correlation c s test accuracy
## final results comparison performance
## eight random forest algorithms four measures
## datasets shown figs
##
## fig plots strength eight methods
## different subspace sizes datasets
## subspace higher strength
## better result curves can see
## new algorithm h w rf consistently performs
## better seven random forest algorithms
## advantages obvious small subspaces
## new algorithm quickly achieved higher strength
## subspace size increases
## seven
## random forest algorithms require larger subspaces
## achieve higher strength results indicate
## hybrid weighted random forest algorithm enables
## random forest models achieve higher strength
## small subspace sizes compared seven
## random forest algorithms
## fig plots curves correlations
## eight random forest methods datasets
##
##
##
## small subspace sizes h rf c rf cart rf
## chaid rf produce higher correlations
## trees datasets correlation decreases
## subspace size increases random forest
## models lower correlation trees
## better final model
## new
## random forest algorithm h w rf low correlation
## level achieved small subspaces
## datasets also note subspace size
## increased correlation level increased well
## understandable subspace size increases
## informative features likely
## selected repeatedly subspaces increasing
## similarity decision trees therefore feature
## weighting method subspace selection works well
## small subspaces least point view
## correlation measure
## fig shows error bound indicator c s
## eight methods datasets figures
## can observe subspace size increases c s
## consistently reduces behaviour indicates
## subspace size larger log m benefits eight
## algorithms however new algorithm h w rf
## achieved lower level c s subspace size
## log m seven algorithms
## fig plots curves showing accuracy
## eight random forest models test datasets
## datasets can clearly see new random
## forest algorithm h w rf outperforms seven
## random forest algorithms eight data sets
## can seen new method stable
## classification performance methods
## figures observed highest test
## accuracy often obtained default subspace size
## log m implies practice large
## size subspaces necessary grow highquality
## trees random forests
##
##
## performance comparisons
## classification methods
##
##
##
##
##
## conducted experimental comparison
## three widely used text classification
## methods support vector machines svm naive
## bayes nb knearest neighbor knn
## support vector machine used linear kernel
## regularization parameter often
## used text categorization naive bayes
## adopted multivariate bernoulli event model
## frequently used text classification knearest neighbor knn set number k
## neighbors experiments used wekas
## implementation three text classification
## methods used single subspace size
## features eight datasets run random forest
## algorithms h rf c rf cart rf
## chaid rf used subspace size features
## first datasets ie fbis re re tr wap
##
## figure strength changes number features subspace high dimensional datasets
##
##
## tr run random forest algorithms used
## subspace size features last datasets
## las las run random forest algorithms
## h w rf c w rf cart w rf
## chaid w rf used breimans subspace size
##
## log m run random forest algorithms
## number features provided consistent result
## shown fig order obtain stable results
## built random forest models random forest
## algorithm dataset present average
##
## figure correlation changes number features subspace high dimensional datasets
##
##
## results noting range values less
## hybrid trees always accurate
## comparison results classification performance
## eleven methods shown table
##
## performance estimated using test accuracy acc
##
## micro f mic macro f mac boldface
## denotes best results eleven classification
## methods
## improvement often quite
## small always improvement demonstrated
## observe proposed method h w rf
##
## figure c s changes number features subspace high dimensional datasets
##
##
## outperformed classification methods
## datasets
##
##
##
## conclusions
##
## paper presented hybrid weighted random
## forest algorithm simultaneously using feature
## weighting method hybrid forest method classify
##
## figure test accuracy changes number features subspace high dimensional datasets
##
##
## high dimensional data algorithm retains
## small subspace size breimans formula log m
## determining subspace size create accurate
## random forest models also effectively reduces
## upper bound generalization error
##
## improves classification performance results
## experiments various high dimensional datasets
## random forest generated new method superior
## classification methods can use default
## log m subspace size generally guarantee
##
##
## table comparison results
## datasets
## best accuracy micro f macro f results eleven methods
## measures acc mic mac
## methods svm knn nb h rf c rf cart rf chaid rf
## h w rf c w rf cart w rf chaid w rf
## datasets fbis re re tr wap tr las las
## always produce best models variety
## measures using hybrid weighted random forest
## algorithm
## acknowledgements
## research supported part nsfc
## grant shenzhen new industry development fund grant nocxba
## references
## breiman l random forests machine learning
## ho t random subspace method constructing decision forests ieee transactions pattern analysis machine intelligence
## quinlan j c programs machine learning morgan kaufmann
## breiman l classification regression trees chapman hall crc
## breiman l bagging predictors machine learning
## ho t random decision forests proceedings third international conference document analysis recognition pp ieee
## dietterich t experimental comparison three methods constructing ensembles decision trees bagging boosting randomization machine learning
## banfield r hall l bowyer k kegelmeyer w comparison decision tree ensemble creation techniques ieee transactions pattern analysis machine intelligence
## robniksikonja m improving random forests proceedings th european conference machine learning pp springer
## ho t c decision forests proceedings fourteenth international conference pattern recognition pp ieee
## dietterrich t machine learning research four current direction artificial intelligence magzine
## amaratunga d cabrera j lee y enriched random forests bioinformatics
## ye y li h deng x huang j feature weighting random forest detection hidden web search interfaces journal computational linguistics chinese language processing
## xu b huang j williams g wang q ye y classifying highdimensional data random forests built small subspaces international journal data warehousing mining
## xu b huang j williams g li j ye y hybrid random forests advantages mixed trees classifying text data proceedings th pacificasia conference knowledge discovery data mining springer
## biggs d de ville b suen e method choosing multiway partitions classification decision trees journal applied statistics
## ture m kurt turhan kurum ozdamar k comparing classification techniques predicting essential hypertension expert systems applications
## begum n ma f ren f automatic text summarization using support vector machine international journal innovative computing information control
## chen j huang h tian s qu y feature selection text classification naive bayes expert systems applications
## tan s neighborweighted knearest neighbor unbalanced text corpus expert systems applications
## pearson k theory contingency relation association normal correlation cambridge university press
## yang y liu x reexamination text categorization methods proceedings th international conference research development information retrieval pp acm
## han e karypis g centroidbased document classification analysis experimental results proceedings th european conference principles data mining knowledge discovery pp springer
## trec text retrieval conference http trecnistgov
## lewis d reuters text categorization test collection distribution http wwwresearchattcom lewis
## han e boley d gini m gross r hastings k karypis g kumar v mobasher b moore j webace web agent document categorization exploration proceedings nd international conference autonomous agents pp acm
## mccallum nigam k comparison event models naive bayes text classification aaai workshop learning text categorization pp
## witten frank e hall m data mining practical machine learning tools techniques morgan kaufmann
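The viewDocs() helper used above is the little utility we rely on for displaying a single document from the corpus. As a reminder, a minimal sketch of such a helper, assuming the magrittr pipe is available:

library(magrittr)

# Display the n'th document of the corpus as plain text.
viewDocs <- function(docs, n)
{
  docs %>% extract2(n) %>% as.character() %>% writeLines()
}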
Previously we used the English stopwords provided by the tm package. We could instead, or in addition, remove our own stop words, as we have done above. We have chosen here two words, simply for illustration. The choice might depend on the domain of discourse, and might not become apparent until we’ve done some analysis.
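Both lists can also be removed in a single pass. A minimal sketch, assuming docs is the corpus built earlier and tm is loaded; own.stops is just an illustrative name for our vector of domain words:

# Our own stop words, in addition to the standard English list.
own.stops <- c("department", "email")

# Remove the standard English stopwords and our own words in one pass.
docs <- tm_map(docs, removeWords, c(stopwords("english"), own.stops))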
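One such analysis is simply to rank the terms of the corpus by frequency, since words that dominate every document of a collection often carry little discriminating content. Again a minimal sketch, assuming docs and the tm package as above:

# Build a document term matrix from the processed corpus.
dtm <- DocumentTermMatrix(docs)

# Total frequency of each term across all documents.
freq <- sort(colSums(as.matrix(dtm)), decreasing=TRUE)

# The most frequent terms are candidate stop words for this domain.
head(freq, 20)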
