## 21.12 Remove Punctuation

docs <- tm_map(docs, removePunctuation)
viewDocs(docs, 16)
## hybrid weighted random forests for
## classifying very highdimensional data
## baoxun xu  joshua zhexue huang  graham williams and
## yunming ye
##
##
## department of computer science harbin institute of technology shenzhen graduate
## school shenzhen  china
##
## shenzhen institutes of advanced technology chinese academy of sciences shenzhen
##  china
## email amusing gmailcom
## random forests are a popular classification method based on an ensemble of a
## single type of decision trees from subspaces of data in the literature there
## are many different types of decision tree algorithms including c cart and
## chaid each type of decision tree algorithm may capture different information
## and structure this paper proposes a hybrid weighted random forest algorithm
## simultaneously using a feature weighting method and a hybrid forest method to
## classify very high dimensional data the hybrid weighted random forest algorithm
## can effectively reduce subspace size and improve classification performance
## without increasing the error bound we conduct a series of experiments on eight
## high dimensional datasets to compare our method with traditional random forest
## methods and other classification methods the results show that our method
## consistently outperforms these traditional methods
## keywords random forests hybrid weighted random forest classification decision tree
##
##
##
## introduction
##
## random forests   are a popular classification
## method which builds an ensemble of a single type
## of decision trees from different random subspaces of
## data the decision trees are often either built using
## c  or cart  but only one type within
## a single random forest in recent years random
## forests have attracted increasing attention due to
##  its competitive performance compared with other
## classification methods especially for highdimensional
## data  algorithmic intuitiveness and simplicity and
##  its most important capability  ensemble using
## bagging  and stochastic discrimination
## several methods have been proposed to grow random
## forests from subspaces of data        in
## these methods the most popular forest construction
## procedure was proposed by breiman  to first use
## bagging to generate training data subsets for building
## individual trees
## a subspace of features is then
## randomly selected at each node to grow branches of
## a decision tree the trees are then combined as an
## ensemble into a forest as an ensemble learner the
## performance of a random forest is highly dependent
## on two factors the performance of each tree and the
## diversity of the trees in the forests  breiman
## formulated the overall performance of a set of trees as
## the average strength and proved that the generalization
##
## error of a random forest is bounded by the ratio of the
## average correlation between trees divided by the square
## of the average strength of the trees
## for very high dimensional data such as text data
## there are usually a large portion of features that are
## uninformative to the classes during this forest building
## process informative features would have the large
## chance to be missed if we randomly select a small
## subspace breiman suggested selecting log m
## features in a subspace where m is the number of
## independent features in the data from high dimensional
## data  as a result weak trees are created from these
## subspaces the average strength of those trees is reduced
## and the error bound of the random forest is enlarged
## therefore when a large proportion of such weak
## trees are generated in a random forest the forest has a
## large likelihood to make a wrong decision which mainly
## results from those weak trees classification power
## to address this problem we aim to optimize decision
## trees of a random forest by two strategies one
## straightforward strategy is to enhance the classification
## performance of individual trees by a feature weighting
## method for subspace sampling    in this
## method feature weights are computed with respect
## to the correlations of features to the class feature
## and regarded as the probabilities of the feature to
## be selected in subspaces this method obviously
## increases the classification performance of individual
##
## the computer journal vol
##
## no
##
##
##
##
##
## baoxun xu joshua zhexue huang graham williams yunming ye
##
## trees because the subspaces will be biased to contain
## more informative features however the chance of more
## correlated trees is also increased since the features with
## large weights are likely to be repeatedly selected
## the second strategy is more straightforward use
## several different types of decision trees for each training
## data subset to increase the diversity of the trees
## and then select the optimal tree as the individual
## tree classifier in the random forest model the work
## presented here extends the algorithm developed in
## specifically we build three different types of tree
## classifiers c cart and chaid   for each
## training data subset we then evaluate the performance
## of the three classifiers and select the best tree in
## this way we build a hybrid random forest which may
## include different types of decision trees in the ensemble
## the added diversity of the decision trees can effectively
## improve the accuracy of each tree in the forest and
## hence the classification performance of the ensemble
## however when we use this method to build the best
## random forest model for classifying high dimensional
## data we can not be sure of what subspace size is best
## in this paper we propose a hybrid weighted random
## forest algorithm by simultaneously using a new feature
## weighting method together with the hybrid random
## forest method to classify high dimensional data in
## this new random forest algorithm we calculate feature
## weights and use weighted sampling to randomly select
## features for subspaces at each node in building different
## types of trees classifiers c cart and chaid for
## each training data subset and select the best tree as
## the individual tree in the final ensemble model
## experiments were performed on  high dimensional
## text datasets with dimensions ranging from  to
##  we compared the performance of eight random
## forest methods and wellknown classification methods
## c random forest cart random forest chaid
## random forest hybrid random forest c weighted
## random forest cart weighted random forest chaid
## weighted random forest hybrid weighted random
## forest support vector machines  naive bayes
## and knearest neighbors
## the experimental
## results show that our hybrid weighted random forest
## achieves improved classification performance over the
## ten competitive methods
## the remainder of this paper is organized as follows
## in section  we introduce a framework for building
## a hybrid weighted random forest and describe a new
## random forest algorithm section  summarizes four
## measures to evaluate random forest models we present
## experimental results on  high dimensional text datasets
## in section  section  contains our conclusions
##
## table  contingency table of input feature a and class
## feature y
## y  y
## y  yj
## y  yq total
## a  a
##
##
## j
##
## q
##
##
##
##
##
##
##
##
##
##
##
##
##
##
## a  ai
## i
##
## ij
##
## iq
## i
##
##
##
##
##
##
##
##
##
##
##
##
##
## a  ap
## p
##
## pj
##
## pq
## p
## total
##
##
## j
##
## q
##
##
## general framework for building hybrid random forests
## by integrating these two methods we propose a novel
## hybrid weighted random forest algorithm
##
##
## let y be the class or target feature with q distinct
## class labels yj for j       q for the purposes of
## our discussion we consider a single categorical feature
## a in dataset d with p distinct category values we
## denote the distinct values by ai for i       p
## numeric features can be discretized into p intervals with
## a supervised discretization method
## assume d has val objects the size of the subset of
## d satisfying the condition that a  ai and y  yj is
## denoted by ij  considering all combinations of the
## categorical values of a and the labels of y  we can
## obtain a contingency table  of a against y as shown
## in table  the far right column contains the marginal
## totals for feature a
##
## hybrid
## forests
##
## weighted
##
## random
##
## in this section we first introduce a feature weighting
## method for subspace sampling then we present a
##
## q
##
##
## i
##
## ij
##
## for i       p
##
##
##
## j
##
## and the bottom row is the marginal totals for class
## feature y
## j
##
## p
##
##
## ij
##
## for j       q
##
##
##
## i
##
## the grand total the total number of samples is in
## the bottom right corner
##
##
## q
## p
##
##
## ij
##
##
##
## j i
##
## given a training dataset d and feature a we first
## compute the contingency table the feature weights are
## then computed using the two methods to be discussed
## in the following subsection
##
##
##
##
## notation
##
## feature weighting method
##
## in this subsection we give the details of the feature
## weighting method for subspace sampling in random
## forests consider an mdimensional feature space
## a  a      am  we present how to compute the
##
## the computer journal vol
##
## no
##
##
##
## hybrid weighted random forests for classifying very highdimensional data
## weights w  w      wm  for every feature in the space
## these weights are then used in the improved algorithm
## to grow each decision tree in the random forest
##  feature weight computation
## the weight of feature a represents the correlation
## between the values of feature a and the values of the
## class feature y  a larger weight will indicate that the
## class labels of objects in the training dataset are more
## correlated with the values of feature a indicating that
## a is more informative to the class of objects thus it
## is suggested that a has a stronger power in predicting
## the classes of new objects
## in the following we propose to use the chisquare
## statistic to compute feature weights because this
## method can quantify the correspondence between two
## categorical variables
## given the contingency table of an input feature a and
## the class feature y of dataset d the chisquare statistic
## of the two features is computed as
## corra y
##
## q
## p
##
## ij  tij
## tij
## i j
##
##
##
## where ij is the observed frequency from the
## contingency table and tij is the expected frequency
## computed as
## i x j
## tij
##
##
##
##
## the larger the measure corra y  the more
## informative the feature a is in predicting class y
##  normalized feature weight
## in practice feature weights are normalized for feature
## subspace sampling we use corra y  to measure the
## informativeness of these features and consider them
## as feature weights however to treat the weights as
## probabilities of features we normalize the measures to
## ensure the sum of the normalized feature weights is
## equal to  let corrai  y    i  m  be the set
## of m feature measures we compute the normalized
## weights as
##
## corrai  y
## wi  n
##
## i corrai  y
## here we use the square root to smooth the values of
## the measures wi can be considered as the probability
## that feature ai is randomly sampled in a subspace the
## more informative a feature is the larger the weight and
## the higher the probability of the feature being selected
##
## diversity is commonly obtained by using bagging and
## random subspace sampling we introduce a further
## element of diversity by using different types of trees
## considering an analogy with forestry the different data subsets from bagging represent the soil structures different decision tree algorithms represent different tree species our approach has two key aspects
## one is to use three types of decision tree algorithms to
## generate three different tree classifiers for each training data subset the other is to evaluate the accuracy
## of each tree as the measure of tree importance in this
## paper we use the outofbag accuracy to assess the importance of a tree
## following breiman  we use bagging to generate
## a series of training data subsets from which we build
## trees for each tree the data subset used to grow
## the tree is called the inofbag iob data and the
## remaining data subset is called the outofbag oob
## data since oob data is not used for building trees
## we can use this data to objectively evaluate each trees
## accuracy and importance the oob accuracy gives an
## unbiased estimate of the true accuracy of a model
## given n instances in a training dataset d and a tree
## classifier hk iobk  built from the kth training data
## subset iobk  we define the oob accuracy of the tree
## hk iobk  for di  d as
## n
## oobacck
##
## framework for building a hybrid random
## forest
##
## as an ensemble learner the performance of a random
## forest is highly dependent on two factors the diversity
## among the trees and the accuracy of each tree
##
## i
##
## ihk di   yi  di  oobk
## n
## i idi  oobk
##
##
##
## where i is an indicator function the larger the
## oobacck  the better the classification quality of a tree
## we use the outofbag data subset oobi to calculate
## the outofbag accuracies of the three types of trees
## c cart and chaid with evaluation values e
## e and e respectively
## fig  illustrates the procedure for building a hybrid
## random forest model firstly a series of iob oob
## datasets are generated from the entire training dataset
## by bagging then three types of tree classifiers c
## cart and chaid are built using each iob dataset
## the corresponding oob dataset is used to calculate the
## oob accuracies of the three tree classifiers finally
## we select the tree with the highest oob accuracy as
## the final tree classifier which is included in the hybrid
## random forest
## building a hybrid random forest model in this
## way will increase the diversity among the trees
## the classification performance of each individual tree
## classifier is also maximized
##
##
##
##
##
##
## decision tree algorithms
##
## the core of our approach is the diversity of decision
## tree algorithms in our random forest different decision
## tree algorithms grow structurally different trees from
## the same training data selecting a good decision tree
## algorithm to grow trees for a random forest is critical
##
## the computer journal vol
##
## no
##
##
##
##
##
## baoxun xu joshua zhexue huang graham williams yunming ye
## the difference lies in the way to split a node such
## as the split functions and binary branches or multibranches in this work we use these different decision
## tree algorithms to build a hybrid random forest
##
##
##
## figure  the hybrid random forests framework
##
## for the performance of the random forest few studies
## have considered how different decision tree algorithms
## affect a random forest we do so in this paper
## the common decision tree algorithms are as follows
## classification trees  c is a supervised
## learning classification algorithm used to construct
## decision trees given a set of preclassified objects each
## described by a vector of attribute values we construct
## a mapping from attribute values to classes c uses
## a divideandconquer approach to grow decision trees
## beginning with the entire dataset a tree is constructed
## by considering each predictor variable for dividing the
## dataset the best predictor is chosen at each node
## using a impurity or diversity measure the goal is
## to produce subsets of the data which are homogeneous
## with respect to the target variable c selects the test
## that maximizes the information gain ratio igr
## classification and regression tree cart is
## a recursive partitioning method that can be used for
## both regression and classification the main difference
## between c and cart is the test selection and
## evaluation process
## chisquared automatic interaction detector
## chaid method is based on the chisquare test of
## association a chaid decision tree is constructed
## by repeatedly splitting subsets of the space into two
## or more nodes to determine the best split at any
## node any allowable pair of categories of the predictor
## variables is merged until there is no statistically
## significant difference within the pair with respect to the
## target variable
## from these decision tree algorithms we can see that
##
## hybrid weighted random forest algorithm
##
## in this subsection we present a hybrid weighted
## random forest algorithm by simultaneously using the
## feature weights and a hybrid method to classify high
## dimensional data the benefits of our algorithm has
## two aspects firstly compared with hybrid forest
## method  we can use a small subspace size to
## create accurate random forest models
## secondly
## compared with building a random forest using feature
## weighting  we can use several different types of
## decision trees for each training data subset to increase
## the diversities of trees the added diversity of the
## decision trees can effectively improve the classification
## performance of the ensemble model the detailed steps
## are introduced in algorithm
## input parameters to algorithm  include a training
## dataset d the set of features a the class feature y
## the number of trees in the random forest k and the
## size of subspaces m the output is a random forest
## model m  lines  form the loop for building k
## decision trees in the loop line  samples the training
## data d by sampling with replacement to generate an
## inofbag data subset iobi for building a decision tree
## line  build three types of tree classifiers c
## cart and chaid in this procedure line  calls
## the function createt reej  to build a tree classifier
## line  calculates the outofbag accuracy of the tree
## classifier after this procedure line  selects the tree
## classifier with the maximum outofbag accuracy k
## decision tree trees are thus generated to form a hybrid
## weighted random forest model m
## generically function createt reej  first creates a
## new node then it tests the stopping criteria to decide
## whether to return to the upper node or to split this
## node if we choose to split this node then the feature
## weighting method is used to randomly select m features
## as the subspace for node splitting these features
## are used as candidates to generate the best split to
## partition the node for each subset of the partition
## createt reej  is called again to create a new node under
## the current node if a leaf node is created it returns to
## the parent node this recursive process continues until
## a full tree is generated
##
## the computer journal vol
##
## no
##
##
##
## hybrid weighted random forests for classifying very highdimensional data
## algorithm  new random forest algorithm
##  input
##   d  the training dataset
##   a  the features space a  a   am
##   y  the class features space y  y   yq
##   k  the number of trees
##   m  the size of subspaces
##  output a random forest m
##  method
##  for i   to k do
##
## draw a bootstrap sample inofbag data subset
## iobi and outofbag data subset oobi from
## training dataset d
##
## for j   to  do
##
## hij iobi   createt reej
## use outofbag data subset oobi to calculate
##
## the outofbag accuracy oobacci j of the tree
## classifier hij iobi  by equation
##
## end for
##
## select hi iobi  with the highest outofbag
## accuracy oobacci as optimal tree i
##  end for
##  combine
## the
## k
## tree
## classifiers
## h iob  h iob   hk iobk  into a random
## forest m
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
## function createtree
## create a new node n
## if stopping criteria is met then
## return n as a leaf node
## else
## for j   to m do
## compute
## the
## informativeness
## measure
## corraj  y  by equation
## end for
## compute feature weights w  w   wm  by
## equation
## use the feature weighting method to randomly
## select m features
## use these m features as candidates to generate
## the best split for the node to be partitioned
## call createtree for each split
## end if
## return n
## evaluation measures
##
## in this paper we use five measures ie strength
## correlation error bound c s  test accuracy and f
## metric to evaluate our random forest models strength
## measures the collective performance of individual trees
## in a random forest and the correlation measures the
## diversity of the trees the ratio of the correlation
## over the square of the strength c s indicates the
## generalization error bound of the random forest model
## these three measures were introduced in  the
## accuracy measures the performance of a random forest
## model on unseen test data the f metric is a
##
##
##
## commonly used measure of classification performance
##
##
## strength and correlation measures
##
## we follow breimans method described in  to
## calculate the strength correlation and the ratio c s
## following breimans notation we denote strength as
## s and correlation as  let hk iobk  be the kth
## tree classifier grown from the kth training data iobk
## sampled from d with replacement
## assume the
## random forest model contains k trees the outofbag
## proportion of votes for di  d on class j is
## k
## ihk di   j di
##   iobk
## qdi  j  kk
##
##   iobk
## k idi
## this is the number of trees in the random forest
## which are trained without di and classify di into class
## j divided by the number of training datasets not
## containing di
## the strength s is computed as
##
## qdi  yi   maxjyi qdi  j
## n i
## n
##
## s
##
##
##
## where n is the number of objects in d and yi indicates
## the true class of di
## the correlation  is computed as
## n
##
##
##
## i qdi  yi   maxjyi qdi  j  s
## n
##
##
##
##
## k
##
## k
## k  pk  pk
## k pk  p
## where
##
## n
## pk
##
## i
##
## ihk di   yi  di
##   iobk
## n
##   iobk
## i idi
##
##
##
## and
## n
## pk
##
## i
##
## ihk di   jdi  y  di
##   iobk
## n
## id
##
##
## iob
##
## i
## k
## i
##
##
##
## where
## jdi  y   argmaxjyi qd j
##
##
##
## is the class that obtains the maximal number of votes
## among all classes but the true class
##
##
## general error bound measure c s
##
## given the strength and correlation the outofbag
## estimate of the c s measure can be computed
## an important theoretical result in breimans method
## is the upper bound of the generalization error of the
## random forest ensemble that is derived as
## p e    s  s
##
##
##
## where  is the mean value of correlations between all
## pairs of individual classifiers and s is the strength of
## the set of individual classifiers that is estimated as the
##
## the computer journal vol
##
## no
##
##
##
##
##
## baoxun xu joshua zhexue huang graham williams yunming ye
##
## average accuracy of individual classifiers on d with
## outofbag evaluation this inequality shows that the
## generalization error of a random forest is affected by
## the strength of individual classifiers and their mutual
## correlations therefore breiman defined the c s ratio
## to measure a random forest as
## c s   s
##
##
##
## the smaller the ratio the better the performance of
## the random forest as such c s gives guidance for
## reducing the generalization error of random forests
##
##
## test accuracy
##
## the test accuracy measures the classification performance of a random forest on the test data set let
## dt be a test data and yt be the class labels given
## di  dt  the number of votes for di on class j is
## n di  j
##
## k
##
##
## ihk di   j
##
##
##
## table
## summary statistic of  highdimensional
## datasets
## name
## features
## instances
## classes  minority
## fbis
##
##
##
##
## re
##
##
##
##
## re
##
##
##
##
## tr
##
##
##
##
## wap
##
##
##
##
## tr
##
##
##
##
## las
##
##
##
##
## las
##
##
##
##
##
## it emphasizes the performance of a classifier on rare
## categories define  and  as follows
##
## i
##
## t pi
## t pi
##  i
## t pi  f pi
## t pi  f ni
##
##
##
## f  for each category i and the macroaveraged f
## are computed as
##
## k
##
## the test accuracy is calculated as
## f i
##
## in di  yi   maxjyi n di  j
## n i
##
## i i
##  m acrof
## i  i
##
## q
## i
##
## q
##
## f i
##
##
##
## n
##
## acc
##
## where n is the number of objects in dt and yi indicates
## the true class of di
##
##
## f metric
##
## to evaluate the performance of classification methods
## in dealing with an unbalanced class distribution we use
## the f metric introduced by yang and liu  this
## measure is equal to the harmonic mean of recall
## and precision  the overall f score of the entire
## classification problem can be computed by a microaverage and a macroaverage
## microaveraged f is computed globally over all
## classes and emphasizes the performance of a classifier
## on common classes define  and  as follows
## q
##
## q
## t pi
## i t pi
##   q i
##    q
##
## i t pi  f pi
## i t pi  f ni
## where q is the number of classes t pi true positives
## is the number of objects correctly predicted as class i
## f pi false positives is the number of objects that are
## predicted to belong to class i but do not the microaveraged f is computed as
## m icrof
##
##
##
##
##
##
## macroaveraged f is first computed locally over
## each class and then the average over all classes is taken
##
## the larger the microf and macrof values are the
## higher the classification performance of the classifier
##
##
## experiments
##
## in this section we present two experiments that
## demonstrate the effectiveness of the new random
## forest algorithm for classifying high dimensional data
## high dimensional datasets with various sizes and
## characteristics were used in the experiments the
## first experiment is designed to show how our proposed
## method can reduce the generalization error bound
## c s  and improve test accuracy when the size of
## the selected subspace is not too large the second
## experiment is used to demonstrate the classification
## performance of our proposed method in comparison to
## other classification methods ie svm nb and knn
##
##
## datasets
##
## in the experiments we used eight realworld high
## dimensional datasets these datasets were selected
## due to their diversities in the number of features the
## number of instances and the number of classes their
## dimensionalities vary from  to  instances
## vary from  to  and the minority class rate varies
## from  to  in each dataset we randomly
## select  of instances as the training dataset and
## the remaining data as the test dataset detailed
## information of the eight datasets is listed in table
## the fbis re re tr wap tr las
## and las datasets are classical text classification
## benchmark datasets which were carefully selected and
##
## the computer journal vol
##
## no
##
##
##
## hybrid weighted random forests for classifying very highdimensional data
## preprocessed by han and karypis  dataset fbis
## was compiled from the foreign broadcast information
## service trec  the datasets re and re were
## selected from the reuters text categorization test
## collection distribution   the datasets tr and
## tr were derived from trec  trec
## and trec  dataset wap is from the webace
## project wap  the datasets las and las were
## selected from the los angeles times for trec
## the classes of these datasets were generated from the
## relevance judgment provided in these collections
##
##
## performance comparisons between random forest methods
##
## the purpose of this experiment was to evaluate
## the effect of the hybrid weighted random forest
## method h w rf on strength correlation c s
## and test accuracy
## the eight high dimensional
## datasets were analyzed and results were compared
## with seven other random forest methods ie c
## random forest c rf cart random forest
## cart rf chaid random forest chaid rf
## hybrid random forest h rf c weighted random
## forest c w rf cart weighted random forest
## cart w rf chaid weighted random forest
## chaid w rf for each dataset we ran each
## random forest algorithm against different sizes of the
## feature subspaces since the number of features in these
## datasets was very large we started with a subspace
## of  features and increased the subspace by  more
## features each time for a given subspace size we built
##  trees for each random forest model in order to
## obtain a stable result we built  random forest models
## for each subspace size each dataset and each algorithm
## and computed the average values of the four measures
## of strength correlation c s  and test accuracy as the
## final results for comparison the performance of the
## eight random forest algorithms on the four measures
## for each of the  datasets is shown in figs    and
##
## fig  plots the strength for the eight methods against
## different subspace sizes on each of the  datasets
## in the same subspace the higher the strength the
## better the result from the curves we can see that
## the new algorithm h w rf consistently performs
## better than the seven other random forest algorithms
## the advantages are more obvious for small subspaces
## the new algorithm quickly achieved higher strength
## as the subspace size increases
## the seven other
## random forest algorithms require larger subspaces to
## achieve a higher strength these results indicate that
## the hybrid weighted random forest algorithm enables
## random forest models to achieve a higher strength
## for small subspace sizes compared to the seven other
## random forest algorithms
## fig  plots the curves for the correlations for the
## eight random forest methods on the  datasets for
##
##
##
## small subspace sizes h rf c rf cart rf
## and chaid rf produce higher correlations between
## the trees on all datasets the correlation decreases
## as the subspace size increases for the random forest
## models the lower the correlation between the trees
## then the better the final model
## with our new
## random forest algorithm h w rf a low correlation
## level was achieved with very small subspaces in all
##  datasets we also note that as the subspace size
## increased the correlation level increased as well this is
## understandable because as the subspace size increases
## the same informative features are more likely to be
## selected repeatedly in the subspaces increasing the
## similarity of the decision trees therefore the feature
## weighting method for subspace selection works well for
## small subspaces at least from the point of view of the
## correlation measure
## fig  shows the error bound indicator c s for the
## eight methods on the  datasets from these figures
## we can observe that as the subspace size increases c s
## consistently reduces the behaviour indicates that a
## subspace size larger than log m  benefits all eight
## algorithms however the new algorithm h w rf
## achieved a lower level of c s at subspace size of
## log m    than the seven other algorithms
## fig  plots the curves showing the accuracy of the
## eight random forest models on the test datasets from
## the  datasets we can clearly see that the new random
## forest algorithm h w rf outperforms the seven
## other random forest algorithms in all eight data sets
## it can be seen that the new method is more stable
## in classification performance than other methods in
## all of these figures it is observed that the highest test
## accuracy is often obtained with the default subspace size
## of log m    this implies that in practice large
## size subspaces are not necessary to grow highquality
## trees for random forests
##
##
## performance comparisons
## classification methods
##
## with
##
## other
##
## we conducted a further experimental comparison
## against three other widely used text classification
## methods support vector machines svm naive
## bayes nb and knearest neighbor knn the
## support vector machine used a linear kernel with a
## regularization parameter of  which was often
## used in text categorization for naive bayes we
## adopted the multivariate bernoulli event model that
## is frequently used in text classification  for knearest neighbor knn we set the number k of
## neighbors to  in the experiments we used wekas
## implementation for these three text classification
## methods  we used a single subspace size of
## features in all eight datasets to run the random forest
## algorithms for h rf c rf cart rf and
## chaid rf we used a subspace size of  features in
## the first  datasets ie fbis re re tr wap and
##
## the computer journal vol
##
## no
##
##
##
##
##
## baoxun xu joshua zhexue huang graham williams yunming ye
## fbis
##
## re
##
##
##
##
##
##
##
##
##
##
##
## strength
##
## strength
##
##
##
##
##
##
##
## hwrf
##
##
##
## cwrf
## cartwrf
##
##
##
## hwrf
## cwrf
##
##
##
## cartwrf
##
## chaidwrf
##
## chaidwrf
##
##
##
## hrf
##
##
##
## hrf
##
## crf
##
## crf
##
##
##
## cartrf
##
##
##
## cartrf
##
## chaidrf
##
## chaidrf
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
## number of features
##
## number of features
##
## re
##
##
##
##
##
## tr
##
##
##
##
##
##
##
##
##
##
##
##
## hwrf
## cwrf
##
##
##
## cartwrf
##
## strength
##
## strength
##
##
##
## hwrf
## cwrf
##
##
##
## cartwrf
##
## chaidwrf
##
##
##
## chaidwrf
##
## hrf
##
## hrf
##
##
##
## crf
##
##
##
## crf
##
## cartrf
##
## cartrf
##
##
##
## chaidrf
##
## chaidrf
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
## number of features
##
## number of features
##
## wap
##
## tr
##
##
##
##
##
##
##
##
##
##
## strength
##
## hwrf
##
##
##
## cwrf
## cartwrf
##
##
##
## strength
##
##
##
##
##
##
## hwrf
## cwrf
##
##
##
## cartwrf
## chaidwrf
##
## chaidwrf
## hrf
##
##
##
## hrf
##
##
##
## crf
##
## crf
##
## cartrf
##
## cartrf
##
##
##
##
##
## chaidrf
##
## chaidrf
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
## las
##
##
##
##
##
##
##
##
##
##
##
## las
##
##
##
##
##
##
##
##
##
##
##
##
##
##
## hwrf
##
##
##
## cwrf
## cartwrf
## chaidwrf
##
##
##
## strength
##
## strength
##
##
##
## number of features
##
## number of features
##
##
##
##
##
## hwrf
##
##
##
## cwrf
## cartwrf
##
##
##
## chaidwrf
##
## hrf
## crf
##
##
##
## hrf
##
##
##
## crf
##
## cartrf
## chaidrf
##
## cartrf
##
##
##
## chaidrf
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
## number of features
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
## number of features
##
## figure  strength changes against the number of features in the subspace on the  high dimensional datasets
##
## tr to run the random forest algorithms and used
## a subspace size of  features in the last  datasets
## las and las to run these random forest algorithms
## for h w rf c w rf cart w rf and
## chaid w rf we used breimans subspace size of
##
## log m    to run these random forest algorithms
## this number of features provided a consistent result as
## shown in fig  in order to obtain stable results we
## built  random forest models for each random forest
## algorithm and each dataset and present the average
##
## the computer journal vol
##
## no
##
##
##
## hybrid weighted random forests for classifying very highdimensional data
## fbis
##
##
##
## re
##
##
##
##
##
##
##
##
##
## correlation
##
## correlation
##
##
##
##
##
##
##
##
## hwrf
## cwrf
##
##
##
## cartwrf
## chaidwrf
##
##
##
## hwrf
##
##
##
## cwrf
## cartwrf
##
##
##
## chaidwrf
##
## hrf
## crf
##
##
##
## hrf
##
##
##
## crf
##
## cartrf
## chaidrf
##
## cartrf
##
##
##
## chaidrf
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
## number of features
##
##
##
##
##
##
##
##
##
##
##
##
##
## number of features
##
## re
##
##
##
##
##
## tr
##
##
##
##
##
##
##
##
##
## hwrf
## cwrf
##
##
##
## cartwrf
## chaidwrf
##
##
##
## correlation
##
## correlation
##
##
##
##
##
## hwrf
##
##
##
## cwrf
## cartwrf
##
##
##
## chaidwrf
## hrf
##
## hrf
## crf
##
##
##
## crf
##
##
##
## cartrf
##
## cartrf
##
##
##
## chaidrf
##
## chaidrf
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
## number of features
##
## number of features
##
## wap
##
## tr
##
##
##
##
##
##
##
##
## correlation
##
##
##
##
##
## hwrf
## cwrf
##
##
##
## cartwrf
## chaidwrf
##
##
##
## correlation
##
##
##
##
##
## hwrf
## cwrf
##
##
##
## cartwrf
## chaidwrf
##
##
##
## hrf
##
## hrf
##
## crf
##
##
##
## crf
##
## cartrf
##
## cartrf
##
##
##
## chaidrf
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
## chaidrf
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
## number of features
##
## number of features
##
## las
##
## las
##
##
##
##
##
##
##
##
##
##
##
##
##
## hwrf
##
##
##
## cwrf
## cartwrf
##
##
##
## chaidwrf
##
##
##
## correlation
##
## correlation
##
##
##
##
##
##
##
## hwrf
## cwrf
##
##
##
## cartwrf
## chaidwrf
##
##
##
## hrf
##
## hrf
## crf
##
##
##
## crf
##
##
##
## cartrf
##
## cartrf
## chaidrf
##
## chaidrf
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
## number of features
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
## number of features
##
## figure  correlation changes against the number of features in the subspace on the  high dimensional datasets
##
## results noting that the range of values are less than
##  and the hybrid trees are always more accurate
## the comparison results of classification performance
## of eleven methods are shown in table
## the
## performance is estimated using test accuracy acc
##
## micro f mic and macro f mac boldface
## denotes best results between eleven classification
## methods
## while the improvement is often quite
## small there is always an improvement demonstrated
## we observe that our proposed method h w rf
##
## the computer journal vol
##
## no
##
##
##
##
##
## baoxun xu joshua zhexue huang graham williams yunming ye
## fbis
##
##
##
## re
##
##
##
##
## log m
##
##
##
##
##
##
##
##
##
##
##
## cwrf
##
##
##
##
##
##
##
## hwrf
##
## c s
##
## c s
##
##
##
##
##
## cartwrf
##
## hwrf
## cwrf
##
##
##
## cartwrf
##
## chaidwrf
##
##
##
## chaidwrf
##
##
##
## hrf
##
## hrf
##
## log m
##
##
## crf
##
##
##
## crf
##
##
##
## cartrf
##
## cartrf
##
## chaidrf
##
##
##
## chaidrf
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
## number of features
##
##
##
##
##
##
##
##
##
##
##
##
##
## log m
##
##
##
##
##
##
##
## c s
##
## hwrf
##
##
##
##
##
##
##
##
##
##
##
## tr
##
##
##
##
##
## c s
##
##
##
## number of features
##
## re
##
##
##
##
##
## cwrf
##
##
##
##
##
## hwrf
## cwrf
##
##
##
## cartwrf
##
## cartwrf
##
## chaidwrf
##
##
##
## chaidwrf
##
##
##
## hrf
##
## hrf
##
## crf
##
## log m
##
##
##
##
##
## crf
##
##
##
## cartrf
##
## cartrf
##
## chaidrf
##
##
##
## chaidrf
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
## number of features
##
## number of features
##
## wap
##
## tr
##
##
##
##
##
##
##
## log m
##
## log m
##
##
##
##
##
##
##
##
## cwrf
##
## c s
##
##
##
## c s
##
## hwrf
##
##
##
##
##
##
##
##
##
## hwrf
##
##
##
## cwrf
## cartwrf
##
## cartwrf
##
##
## chaidwrf
##
##
##
## chaidwrf
## hrf
##
## hrf
## crf
##
##
##
## crf
##
##
##
## cartrf
##
## cartrf
## chaidrf
##
## chaidrf
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
## number of features
##
## number of features
##
## las
##
## las
##
##
##
##
##
##
##
##
##
##
##
## cwrf
## cartwrf
##
##
##
##
##
## hrf
##
##
##
## crf
## cartwrf
##
##
##
## chaidrf
##
##
##
##
##
##
##
##
##
## hwrf
##
##
##
## cwrf
## cartwrf
##
##
##
## chaidwrf
##
## log m
##
##
##
## c s
##
## c s
##
##
##
##
## hwrf
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
## chaidwrf
##
## log m
##
##
##
##
## hrf
## crf
##
##
##
## cartrf
## chaidrf
##
##
##
##
##
##
##
##
##
## number of features
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
## number of features
##
## figure  c s changes against the number of features in the subspace on the  high dimensional datasets
##
## outperformed the other classification methods in all
## datasets
##
##
##
## conclusions
##
## in this paper we presented a hybrid weighted random
## forest algorithm by simultaneously using a feature
## weighting method and a hybrid forest method to classify
## the computer journal vol
##
## no
##
##
##
## hybrid weighted random forests for classifying very highdimensional data
## fbis
##
##
##
##
##
## re
##
##
##
##
##
##
##
##
##
##
## hwrf
## cwrf
##
##
##
## cartwrf
##
## accuracy
##
## accuracy
##
##
##
## chaidwrf
## hrf
##
##
##
##
## hwrf
##
##
##
## cwrf
## cartwrf
##
##
##
## chaidwrf
## hrf
##
##
##
## crf
##
## crf
##
## log m
##
## cartrf
##
##
##
##
##
## cartrf
##
## log m
##
##
##
##
##
## chaidrf
##
## chaidrf
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
## number of features
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
## number of features
##
## re
##
## tr
##
##
##
##
##
##
##
##
## log m
##
##
##
##
##
##
##
##
## hwrf
## cwrf
##
##
##
## cartwrf
##
##
##
## accuracy
##
## accuracy
##
##
##
## chaidwrf
##
##
##
## hwrf
##
##
##
## cwrf
## cartwrf
##
##
##
## chaidwrf
## hrf
##
## hrf
##
##
##
## log m
##
##
##
## crf
##
## crf
##
##
##
## cartrf
##
## cartrf
##
##
##
##
##
## chaidrf
##
## chaidrf
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
## number of features
##
## number of features
##
## wap
##
##
##
##
##
## tr
##
##
##
##
##
##
##
##
##
##
##
##
##
## hwrf
## cwrf
##
##
##
## cartwrf
##
## accuracy
##
## accuracy
##
##
##
##
##
##
##
## hwrf
## cwrf
##
##
##
## cartwrf
## chaidwrf
##
## chaidwrf
##
##
##
## hrf
##
## log m
##
##
##
##
## hrf
##
## crf
##
##
##
## cartrf
##
##
##
## crf
##
## log m
##
##
##
##
## cartrf
## chaidrf
##
## chaidrf
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
## number of features
##
## number of features
##
## las
##
##
##
##
##
## las
##
##
##
##
##
##
##
## accuracy
##
##
## hwrf
## cwrf
##
##
##
## cartwrf
## chaidwrf
##
##
##
## accuracy
##
##
##
##
##
##
##
## hwrf
## cwrf
## cartwrf
##
##
##
## chaidwrf
## hrf
##
## hrf
##
## log m
##
##
##
## crf
##
##
##
## crf
##
## log m
##
##
##
##
##
## cartrf
##
## cartrf
## chaidrf
##
## chaidrf
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
## number of features
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
## number of features
##
## figure  test accuracy changes against the number of features in the subspace on the  high dimensional datasets
##
## high dimensional data our algorithm not only retains
## a small subspace size breimans formula log m
## for determining the subspace size to create accurate
## random forest models but also effectively reduces
## the upper bound of the generalization error and
##
## improves classification performance from the results of
## experiments on various high dimensional datasets the
## random forest generated by our new method is superior
## to other classification methods we can use the default
## log m    subspace size and generally guarantee
##
## the computer journal vol
##
## no
##
##
##
##
##
## baoxun xu joshua zhexue huang graham williams yunming ye
##
## table  the comparison of results
## datasets
## dataset
## fbis
## measures
## acc
## mic
## svm
##
## knn
##
##
## nb
##
## h rf
##
## c rf
##
## cart rf
##
## chaid rf
##
## h w rf
##
## c w rf
##
## cart w rf
##
## chaid w rf
##
## dataset
## wap
## measures
## acc
## mic
## svm
##
##
## knn
##
## nb
##
## h rf
##
## c rf
##
## cart rf
##
## chaid rf
##
## h w rf
##
## c w rf
##
## cart w rf
##
##
## chaid w rf
##
##
## best accuracy micro f and macro f results of the eleven methods on the
## re
## mic
##
##
##
##
##
##
##
##
##
##
##
## tr
## mac
## acc
## mic
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
## mac
##
##
##
##
##
##
##
##
##
##
##
##
## acc
##
##
##
##
##
##
##
##
##
##
##
##
## to always produce the best models on a variety of
## measures by using the hybrid weighted random forest
## algorithm
## acknowledgements
## this research is supported in part by nsfc under
## grant no and shenzhen new industry development fund under grant nocxba
## references
##  breiman l  random forests machine learning
##
##  ho t  random subspace method for constructing decision forests ieee transactions on pattern
## analysis and machine intelligence
##  quinlan j  c programs for machine
## learning morgan kaufmann
##  breiman l  classification and regression trees
## chapman  hall crc
##  breiman l  bagging predictors
## machine
## learning
##  ho t  random decision forests proceedings
## of the third international conference on document
## analysis and recognition pp  ieee
##  dietterich t  an experimental comparison of
## three methods for constructing ensembles of decision
## trees bagging boosting and randomization machine
## learning
##
## mac
##
##
##
##
##
##
##
##
##
##
##
## mac
##
##
##
##
##
##
##
##
##
##
##
##
## re
## mic
##
##
##
##
##
##
##
##
##
##
##
## las
## acc
## mic
##
##
##
##
##
##
##
##
##
##
##
##
##
##
## acc
##
##
##
##
##
##
##
##
##
##
##
##
## tr
## mic
##
##
##
##
##
##
##
##
##
##
##
## las
## mac
## acc
## mic
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
## mac
##
##
##
##
##
##
##
##
##
##
##
##
## acc
##
##
##
##
##
##
##
##
##
##
##
##
## mac
##
##
##
##
##
##
##
##
##
##
##
## mac
##
##
##
##
##
##
##
##
##
##
##
##
##  banfield r hall l bowyer k and kegelmeyer w
##  a comparison of decision tree ensemble creation
## techniques ieee transactions on pattern analysis
## and machine intelligence
##
##  robniksikonja
## m  improving random forests
## proceedings of the th european conference on
## machine learning pp  springer
##  ho t  c decision forests proceedings of
## the fourteenth international conference on pattern
## recognition pp  ieee
##  dietterrich t  machine learning research four
## current direction artificial intelligence magzine
##
##  amaratunga d cabrera j and lee y
## enriched random forests bioinformatics
##
##  ye y li h deng x and huang j
## feature weighting random forest for detection of hidden
## web search interfaces the journal of computational
## linguistics and chinese language processing
##
##  xu b huang j williams g wang q and
## ye y  classifying very highdimensional data
## with random forests built from small subspaces
## international journal of data warehousing and
## mining
##  xu b huang j williams g li j and ye y
##  hybrid random forests advantages of mixed
## trees in classifying text data proceedings of the th
## pacificasia conference on knowledge discovery and
## data mining springer
##
## the computer journal vol
##
## no
##
##
##
## hybrid weighted random forests for classifying very highdimensional data
##  biggs d de ville b and suen e  a method
## of choosing multiway partitions for classification and
## decision trees journal of applied statistics
##  ture m kurt i turhan kurum a and ozdamar
## k  comparing classification techniques for
## predicting essential hypertension expert systems with
## applications
##  begum n ma f and ren f  automatic text summarization using support vector machine
## international journal of innovative computing information and control
##  chen j huang h tian s and qu y
## feature selection for text classification with naive
## bayes expert systems with applications
##
##  tan s  neighborweighted knearest neighbor
## for unbalanced text corpus
## expert systems with
## applications
##  pearson k  on the theory of contingency and
## its relation to association and normal correlation
## cambridge university press
##  yang y and liu x  a reexamination of
## text categorization methods proceedings of the th
## international conference on research and development
## in information retrieval pp  acm
##  han e and karypis g  centroidbased
## document classification analysis and experimental
## results proceedings of the th european conference on
## principles of data mining and knowledge discovery
## pp  springer
##  trec
##
## text
## retrieval
## conference
## http  trecnistgov
##  lewis
## d
##
## reuters
## text
## categorization
## test
## collection
## distribution
##
## http  wwwresearchattcom  lewis
##  han e boley d gini m gross r hastings
## k karypis g kumar v mobasher b and
## moore j  webace a web agent for document
## categorization and exploration proceedings of the nd
## international conference on autonomous agents pp
##  acm
##  mccallum a and nigam k  a comparison of
## event models for naive bayes text classification aaai workshop on learning for text categorization pp
##
##  witten i frank e and hall m  data mining
## practical machine learning tools and techniques
## morgan kaufmann
##
## the computer journal vol
##
## no

Punctuation can provide gramatical context which supports understanding. Often for initial analyses we ignore the punctuation. Later we will use punctuation to support the extraction of meaning.

Your donation will support ongoing availability and give you access to the PDF version of this book. Desktop Survival Guides include Data Science, GNU/Linux, and MLHub. Books available on Amazon include Data Mining with Rattle and Essentials of Data Science. Popular open source software includes rattle, wajig, and mlhub. Hosted by Togaware, a pioneer of free and open source software since 1984. Copyright © 1995-2021 Graham.Williams@togaware.com Creative Commons Attribution-ShareAlike 4.0