## 21.7 Exploring the Corpus

We can (and should) inspect the documents using tm::inspect(). This will assure us that data has been loaded properly and as we expect.

`inspect(docs[16])`

```
## <<SimpleCorpus>>
## Metadata: corpus specific: 1, document level (indexed): 0
## Content: documents: 1
##
## hwrf12.txt
## Hybrid weighted random forests for\nclassifying very high-dimensional data\nBaoxun Xu1 , Joshua Zhexue Huang2 , Graham Williams2 and\nYunming Ye1\n1\n\nDepartment of Computer Science, Harbin Institute of Technology Shenzhen Graduate\nSchool, Shenzhen 518055, China\n2\nShenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen\n518055, China\nEmail: amusing002@gmail.com\nRandom forests are a popular classification method based on an ensemble of a\nsingle type of decision trees from subspaces of data. In the literature, there\nare many different types of decision tree algorithms, including C4.5, CART, and\nCHAID. Each type of decision tree algorithm may capture different information\nand structure. This paper proposes a hybrid weighted random forest algorithm,\nsimultaneously using a feature weighting method and a hybrid forest method to\nclassify very high dimensional data. The hybrid weighted random forest algorithm\ncan effectively reduce subspace size and improve classification performance\nwithout increasing the error bound. We conduct a series of experiments on eight\nhigh dimensional datasets to compare our method with traditional random forest\nmethods and other classification methods. The results show that our method\nconsistently outperforms these traditional methods.\nKeywords: Random Forests; Hybrid Weighted Random Forest; Classification; Decision tree;\n\n1.\n\nINTRODUCTION\n\nRandom forests [1, 2] are a popular classification\nmethod which builds an ensemble of a single type\nof decision trees from different random subspaces of\ndata. The decision trees are often either built using\nC4.5 [3] or CART [4], but only one type within\na single random forest. In recent years, random\nforests have attracted increasing attention due to\n(1) its competitive performance compared with other\nclassification methods, especially for high-dimensional\ndata, (2) algorithmic intuitiveness and simplicity, and\n(3) its most important capability - "ensemble" using\nbagging [5] and stochastic discrimination [2].\nSeveral methods have been proposed to grow random\nforests from subspaces of data [1, 2, 6, 7, 8, 9, 10]. In\nthese methods, the most popular forest construction\nprocedure was proposed by Breiman [1] to first use\nbagging to generate training data subsets for building\nindividual trees.\nA subspace of features is then\nrandomly selected at each node to grow branches of\na decision tree. The trees are then combined as an\nensemble into a forest. As an ensemble learner, the\nperformance of a random forest is highly dependent\non two factors: the performance of each tree and the\ndiversity of the trees in the forests [11]. Breiman\nformulated the overall performance of a set of trees as\nthe average strength and proved that the generalization\n\nerror of a random forest is bounded by the ratio of the\naverage correlation between trees divided by the square\nof the average strength of the trees.\nFor very high dimensional data, such as text data,\nthere are usually a large portion of features that are\nuninformative to the classes. During this forest building\nprocess, informative features would have the large\nchance to be missed, if we randomly select a small\nsubspace (Breiman suggested selecting log2 (M ) + 1\nfeatures in a subspace, where M is the number of\nindependent features in the data) from high dimensional\ndata [12]. As a result, weak trees are created from these\nsubspaces, the average strength of those trees is reduced\nand the error bound of the random forest is enlarged.\nTherefore, when a large proportion of such "weak"\ntrees are generated in a random forest, the forest has a\nlarge likelihood to make a wrong decision which mainly\nresults from those "weak" trees' classification power.\nTo address this problem, we aim to optimize decision\ntrees of a random forest by two strategies. One\nstraightforward strategy is to enhance the classification\nperformance of individual trees by a feature weighting\nmethod for subspace sampling [12, 13, 14]. In this\nmethod, feature weights are computed with respect\nto the correlations of features to the class feature\nand regarded as the probabilities of the feature to\nbe selected in subspaces. This method obviously\nincreases the classification performance of individual\n\nThe Computer Journal, Vol. ??,\n\nNo. ??,\n\n????\n\n2\n\nBaoxun Xu, Joshua Zhexue Huang, Graham Williams, Yunming Ye\n\ntrees because the subspaces will be biased to contain\nmore informative features. However, the chance of more\ncorrelated trees is also increased since the features with\nlarge weights are likely to be repeatedly selected.\nThe second strategy is more straightforward: use\nseveral different types of decision trees for each training\ndata subset, to increase the diversity of the trees,\nand then select the optimal tree as the individual\ntree classifier in the random forest model. The work\npresented here extends the algorithm developed in [15].\nSpecifically, we build three different types of tree\nclassifiers (C4.5, CART, and CHAID [16, 17]) for each\ntraining data subset. We then evaluate the performance\nof the three classifiers and select the best tree. In\nthis way, we build a hybrid random forest which may\ninclude different types of decision trees in the ensemble.\nThe added diversity of the decision trees can effectively\nimprove the accuracy of each tree in the forest, and\nhence the classification performance of the ensemble.\nHowever, when we use this method to build the best\nrandom forest model for classifying high dimensional\ndata, we can not be sure of what subspace size is best.\nIn this paper, we propose a hybrid weighted random\nforest algorithm by simultaneously using a new feature\nweighting method together with the hybrid random\nforest method to classify high dimensional data. In\nthis new random forest algorithm, we calculate feature\nweights and use weighted sampling to randomly select\nfeatures for subspaces at each node in building different\ntypes of trees classifiers (C4.5, CART, and CHAID) for\neach training data subset, and select the best tree as\nthe individual tree in the final ensemble model.\nExperiments were performed on 8 high dimensional\ntext datasets with dimensions ranging from 2000 to\n13195. We compared the performance of eight random\nforest methods and well-known classification methods:\nC4.5 random forest, CART random forest, CHAID\nrandom forest, hybrid random forest, C4.5 weighted\nrandom forest, CART weighted random forest, CHAID\nweighted random forest, hybrid weighted random\nforest, support vector machines [18], naive Bayes [19],\nand k-nearest neighbors [20].\nThe experimental\nresults show that our hybrid weighted random forest\nachieves improved classification performance over the\nten competitive methods.\nThe remainder of this paper is organized as follows.\nIn Section 2, we introduce a framework for building\na hybrid weighted random forest, and describe a new\nrandom forest algorithm. Section 3 summarizes four\nmeasures to evaluate random forest models. We present\nexperimental results on 8 high dimensional text datasets\nin Section 4. Section 5 contains our conclusions.\n\nTABLE 1. Contingency table of input feature A and class\nfeature Y\nY = y1 . . .\nY = yj . . .\nY = yq Total\nA = a1\n11\n...\n1j\n...\n1q\n1*\n..\n..\n..\n..\n..\n..\n.\n.\n...\n.\n.\n.\n.\nA = ai\ni1\n...\nij\n...\niq\ni*\n..\n..\n..\n..\n..\n..\n...\n.\n.\n.\n.\n.\n.\nA = ap\np1\n...\npj\n...\npq\np*\nTotal\n*1\n...\n*j\n...\n*q\n\n\ngeneral framework for building hybrid random forests.\nBy integrating these two methods, we propose a novel\nhybrid weighted random forest algorithm.\n2.1.\n\nLet Y be the class (or target) feature with q distinct\nclass labels yj for j = 1, * * * , q. For the purposes of\nour discussion we consider a single categorical feature\nA in dataset D with p distinct category values. We\ndenote the distinct values by ai for i = 1, * * * , p.\nNumeric features can be discretized into p intervals with\na supervised discretization method.\nAssume D has val objects. The size of the subset of\nD satisfying the condition that A = ai and Y = yj is\ndenoted by ij . Considering all combinations of the\ncategorical values of A and the labels of Y , we can\nobtain a contingency table [21] of A against Y as shown\nin Table 1. The far right column contains the marginal\ntotals for feature A:\n\nHYBRID\nFORESTS\n\nWEIGHTED\n\nRANDOM\n\nIn this section, we first introduce a feature weighting\nmethod for subspace sampling. Then we present a\n\nq\n\n\ni. =\n\nij\n\nfor i = 1, * * * , p\n\n(1)\n\nj=1\n\nand the bottom row is the marginal totals for class\nfeature Y :\n.j =\n\np\n\n\nij\n\nfor j = 1, * * * , q\n\n(2)\n\ni=1\n\nThe grand total (the total number of samples) is in\nthe bottom right corner:\n=\n\nq \np\n\n\nij\n\n(3)\n\nj=1 i=1\n\nGiven a training dataset D and feature A we first\ncompute the contingency table. The feature weights are\nthen computed using the two methods to be discussed\nin the following subsection.\n2.2.\n\n2.\n\nNotation\n\nFeature Weighting Method\n\nIn this subsection, we give the details of the feature\nweighting method for subspace sampling in random\nforests. Consider an M-dimensional feature space\n{A1 , A2 , . . . , AM }. We present how to compute the\n\nThe Computer Journal, Vol. ??,\n\nNo. ??,\n\n????\n\nHybrid weighted random forests for classifying very high-dimensional data\nweights {w1 , w2 , . . . , wM } for every feature in the space.\nThese weights are then used in the improved algorithm\nto grow each decision tree in the random forest.\n2.2.1. Feature Weight Computation\nThe weight of feature A represents the correlation\nbetween the values of feature A and the values of the\nclass feature Y . A larger weight will indicate that the\nclass labels of objects in the training dataset are more\ncorrelated with the values of feature A, indicating that\nA is more informative to the class of objects. Thus it\nis suggested that A has a stronger power in predicting\nthe classes of new objects.\nIn the following, we propose to use the chi-square\nstatistic to compute feature weights because this\nmethod can quantify the correspondence between two\ncategorical variables.\nGiven the contingency table of an input feature A and\nthe class feature Y of dataset D, the chi-square statistic\nof the two features is computed as:\ncorr(A, Y ) =\n\nq\np \n\n(ij - tij )2\ntij\ni=1 j=1\n\n(4)\n\nwhere ij is the observed frequency from the\ncontingency table and tij is the expected frequency\ncomputed as\ni* x *j\ntij =\n\n\n(5)\n\nThe larger the measure corr(A, Y ), the more\ninformative the feature A is in predicting class Y .\n2.2.2. Normalized Feature Weight\nIn practice, feature weights are normalized for feature\nsubspace sampling. We use corr(A, Y ) to measure the\ninformativeness of these features and consider them\nas feature weights. However, to treat the weights as\nprobabilities of features, we normalize the measures to\nensure the sum of the normalized feature weights is\nequal to 1. Let corr(Ai , Y ) (1 i M ) be the set\nof M feature measures. We compute the normalized\nweights as\n\ncorr(Ai , Y )\nwi = N \n(6)\ni=1 corr(Ai , Y )\nHere, we use the square root to smooth the values of\nthe measures. wi can be considered as the probability\nthat feature Ai is randomly sampled in a subspace. The\nmore informative a feature is, the larger the weight and\nthe higher the probability of the feature being selected.\n\nDiversity is commonly obtained by using bagging and\nrandom subspace sampling. We introduce a further\nelement of diversity by using different types of trees.\nConsidering an analogy with forestry, the different data subsets from bagging represent the "soil structures." Different decision tree algorithms represent "different tree species". Our approach has two key aspects:\none is to use three types of decision tree algorithms to\ngenerate three different tree classifiers for each training data subset; the other is to evaluate the accuracy\nof each tree as the measure of tree importance. In this\npaper, we use the out-of-bag accuracy to assess the importance of a tree.\nFollowing Breiman [1], we use bagging to generate\na series of training data subsets from which we build\ntrees. For each tree, the data subset used to grow\nthe tree is called the "in-of-bag" (IOB) data and the\nremaining data subset is called the "out-of-bag" (OOB)\ndata. Since OOB data is not used for building trees\nwe can use this data to objectively evaluate each tree's\naccuracy and importance. The OOB accuracy gives an\nunbiased estimate of the true accuracy of a model.\nGiven n instances in a training dataset D and a tree\nclassifier hk (IOBk ) built from the k'th training data\nsubset IOBk , we define the OOB accuracy of the tree\nhk (IOBk ), for di D, as:\nn\nOOBAcck =\n\nFramework for Building a Hybrid Random\nForest\n\nAs an ensemble learner, the performance of a random\nforest is highly dependent on two factors: the diversity\namong the trees and the accuracy of each tree [11].\n\ni=1\n\nI(hk (di ) = yi ; di OOBk )\nn\ni=1 I(di OOBk )\n\n(7)\n\nwhere I(.) is an indicator function. The larger the\nOOBAcck , the better the classification quality of a tree.\nWe use the out-of-bag data subset OOBi to calculate\nthe out-of-bag accuracies of the three types of trees\n(C4.5, CART and CHAID) with evaluation values E1 ,\nE2 and E3 respectively.\nFig. 1 illustrates the procedure for building a hybrid\nrandom forest model. Firstly, a series of IOB/OOB\ndatasets are generated from the entire training dataset\nby bagging. Then, three types of tree classifiers (C4.5,\nCART and CHAID) are built using each IOB dataset.\nThe corresponding OOB dataset is used to calculate the\nOOB accuracies of the three tree classifiers. Finally,\nwe select the tree with the highest OOB accuracy as\nthe final tree classifier, which is included in the hybrid\nrandom forest.\nBuilding a hybrid random forest model in this\nway will increase the diversity among the trees.\nThe classification performance of each individual tree\nclassifier is also maximized.\n2.4.\n\n2.3.\n\n3\n\nDecision Tree Algorithms\n\nThe core of our approach is the diversity of decision\ntree algorithms in our random forest. Different decision\ntree algorithms grow structurally different trees from\nthe same training data. Selecting a good decision tree\nalgorithm to grow trees for a random forest is critical\n\nThe Computer Journal, Vol. ??,\n\nNo. ??,\n\n????\n\n4\n\nBaoxun Xu, Joshua Zhexue Huang, Graham Williams, Yunming Ye\nthe difference lies in the way to split a node, such\nas the split functions and binary branches or multibranches. In this work we use these different decision\ntree algorithms to build a hybrid random forest.\n\n2.5.\n\nFIGURE 1. The Hybrid Random Forests framework.\n\nfor the performance of the random forest. Few studies\nhave considered how different decision tree algorithms\naffect a random forest. We do so in this paper.\nThe common decision tree algorithms are as follows:\nClassification Trees 4.5 (C4.5) is a supervised\nlearning classification algorithm used to construct\ndecision trees. Given a set of pre-classified objects, each\ndescribed by a vector of attribute values, we construct\na mapping from attribute values to classes. C4.5 uses\na divide-and-conquer approach to grow decision trees.\nBeginning with the entire dataset, a tree is constructed\nby considering each predictor variable for dividing the\ndataset. The best predictor is chosen at each node\nusing a impurity or diversity measure. The goal is\nto produce subsets of the data which are homogeneous\nwith respect to the target variable. C4.5 selects the test\nthat maximizes the information gain ratio (IGR) [3].\nClassification and Regression Tree (CART) is\na recursive partitioning method that can be used for\nboth regression and classification. The main difference\nbetween C4.5 and CART is the test selection and\nevaluation process.\nChi-squared Automatic Interaction Detector\n(CHAID) method is based on the chi-square test of\nassociation. A CHAID decision tree is constructed\nby repeatedly splitting subsets of the space into two\nor more nodes. To determine the best split at any\nnode, any allowable pair of categories of the predictor\nvariables is merged until there is no statistically\nsignificant difference within the pair with respect to the\ntarget variable [16, 17].\nFrom these decision tree algorithms, we can see that\n\nHybrid Weighted Random Forest Algorithm\n\nIn this subsection we present a hybrid weighted\nrandom forest algorithm by simultaneously using the\nfeature weights and a hybrid method to classify high\ndimensional data. The benefits of our algorithm has\ntwo aspects: Firstly, compared with hybrid forest\nmethod [15], we can use a small subspace size to\ncreate accurate random forest models.\nSecondly,\ncompared with building a random forest using feature\nweighting [14], we can use several different types of\ndecision trees for each training data subset to increase\nthe diversities of trees. The added diversity of the\ndecision trees can effectively improve the classification\nperformance of the ensemble model. The detailed steps\nare introduced in Algorithm 1.\nInput parameters to Algorithm 1 include a training\ndataset D, the set of features A, the class feature Y ,\nthe number of trees in the random forest K and the\nsize of subspaces m. The output is a random forest\nmodel M . Lines 9-16 form the loop for building K\ndecision trees. In the loop, Line 10 samples the training\ndata D by sampling with replacement to generate an\nin-of-bag data subset IOBi for building a decision tree.\nLine 11-14 build three types of tree classifiers (C4.5,\nCART, and CHAID). In this procedure, Line 12 calls\nthe function createT reej () to build a tree classifier.\nLine 13 calculates the out-of-bag accuracy of the tree\nclassifier. After this procedure, Line 15 selects the tree\nclassifier with the maximum out-of-bag accuracy. K\ndecision tree trees are thus generated to form a hybrid\nweighted random forest model M .\nGenerically, function createT reej () first creates a\nnew node. Then, it tests the stopping criteria to decide\nwhether to return to the upper node or to split this\nnode. If we choose to split this node, then the feature\nweighting method is used to randomly select m features\nas the subspace for node splitting. These features\nare used as candidates to generate the best split to\npartition the node. For each subset of the partition,\ncreateT reej () is called again to create a new node under\nthe current node. If a leaf node is created, it returns to\nthe parent node. This recursive process continues until\na full tree is generated.\n\nThe Computer Journal, Vol. ??,\n\nNo. ??,\n\n????\n\nHybrid weighted random forests for classifying very high-dimensional data\nAlgorithm 1 New Random Forest Algorithm\n1: Input:\n2: - D : the training dataset,\n3: - A : the features space {A1 , A2 , ..., AM },\n4: - Y : the class features space {y1 , y2 , ..., yq },\n5: - K : the number of trees,\n6: - m : the size of subspaces.\n7: Output: A random forest M ;\n8: Method:\n9: for i = 1 to K do\n10:\ndraw a bootstrap sample in-of-bag data subset\nIOBi and out-of-bag data subset OOBi from\ntraining dataset D;\n11:\nfor j = 1 to 3 do\n12:\nhi,j (IOBi ) = createT reej ();\nuse out-of-bag data subset OOBi to calculate\n13:\nthe out-of-bag accuracy OOBAcci, j of the tree\nclassifier hi,j (IOBi ) by Equation(1);\n14:\nend for\n15:\nselect hi (IOBi ) with the highest out-of-bag\naccuracy OOBAcci as Optimal tree i;\n16: end for\n17: combine\nthe\nK\ntree\nclassifiers\nh1 (IOB1 ), h2 (IOB2 ), ..., hK (IOBK ) into a random\nforest M ;\n18:\n19:\n20:\n21:\n22:\n23:\n24:\n25:\n26:\n27:\n28:\n29:\n30:\n31:\n32:\n\n3.\n\nFunction createTree()\ncreate a new node N ;\nif stopping criteria is met then\nreturn N as a leaf node;\nelse\nfor j = 1 to M do\ncompute\nthe\ninformativeness\nmeasure\ncorr(Aj , Y ) by Equation (4);\nend for\ncompute feature weights {w1 , w2 , ..., wM } by\nEquation (6);\nuse the feature weighting method to randomly\nselect m features;\nuse these m features as candidates to generate\nthe best split for the node to be partitioned;\ncall createTree() for each split;\nend if\nreturn N ;\nEVALUATION MEASURES\n\nIn this paper, we use five measures, i.e., strength,\ncorrelation, error bound c/s2 , test accuracy, and F1\nmetric, to evaluate our random forest models. Strength\nmeasures the collective performance of individual trees\nin a random forest and the correlation measures the\ndiversity of the trees. The ratio of the correlation\nover the square of the strength c/s2 indicates the\ngeneralization error bound of the random forest model.\nThese three measures were introduced in [1]. The\naccuracy measures the performance of a random forest\nmodel on unseen test data. The F1 metric is a\n\n5\n\ncommonly used measure of classification performance.\n3.1.\n\nStrength and Correlation Measures\n\nWe follow Breiman's method described in [1] to\ncalculate the strength, correlation and the ratio c/s2 .\nFollowing Breiman's notation, we denote strength as\ns and correlation as . Let hk (IOBk ) be the kth\ntree classifier grown from the kth training data IOBk\nsampled from D with replacement.\nAssume the\nrandom forest model contains K trees. The out-of-bag\nproportion of votes for di D on class j is\nK\nI(hk (di ) = j; di \n/ IOBk )\nQ(di , j) = k=1K\n(8)\n/ IOBk )\nk=1 I(di \nThis is the number of trees in the random forest\nwhich are trained without di and classify di into class\nj, divided by the number of training datasets not\ncontaining di .\nThe strength s is computed as:\n1\n(Q(di , yi ) - maxj=yi Q(di , j))\nn i=1\nn\n\ns=\n\n(9)\n\nwhere n is the number of objects in D and yi indicates\nthe true class of di .\nThe correlation is computed as:\nn\n1\n2\n2\ni=1 (Q(di , yi ) - maxj=yi Q(di , j)) - s\nn\n(10)\n =\n\n\nK\n1\n(K\nk + (pk - pk )2 )2\nk=1 pk + p\nwhere\n\nn\npk =\n\ni=1\n\nI(hk (di ) = yi ; di \n/ IOBk )\nn\n/ IOBk )\ni=1 I(di \n\n(11)\n\nand\nn\npk =\n\ni=1\n\nI(hk (di ) = j(di , Y ); di \n/ IOBk )\nn\nI(d\n\n/\nIOB\n)\ni\nk\ni=1\n\n(12)\n\nwhere\nj(di , Y ) = argmaxj=yi Q(d, j)\n\n(13)\n\nis the class that obtains the maximal number of votes\namong all classes but the true class.\n3.2.\n\nGeneral Error Bound Measure c/s2\n\nGiven the strength and correlation, the out-of-bag\nestimate of the c/s2 measure can be computed.\nAn important theoretical result in Breiman's method\nis the upper bound of the generalization error of the\nrandom forest ensemble that is derived as\nP E (1 - s2 )/s2\n\n(14)\n\nwhere is the mean value of correlations between all\npairs of individual classifiers and s is the strength of\nthe set of individual classifiers that is estimated as the\n\nThe Computer Journal, Vol. ??,\n\nNo. ??,\n\n????\n\n6\n\nBaoxun Xu, Joshua Zhexue Huang, Graham Williams, Yunming Ye\n\naverage accuracy of individual classifiers on D with\nout-of-bag evaluation. This inequality shows that the\ngeneralization error of a random forest is affected by\nthe strength of individual classifiers and their mutual\ncorrelations. Therefore, Breiman defined the c/s2 ratio\nto measure a random forest as\nc/s2 = /s2\n\n(15)\n\nThe smaller the ratio, the better the performance of\nthe random forest. As such, c/s2 gives guidance for\nreducing the generalization error of random forests.\n3.3.\n\nTest Accuracy\n\nThe test accuracy measures the classification performance of a random forest on the test data set. Let\nDt be a test data and Yt be the class labels. Given\ndi Dt , the number of votes for di on class j is\nN (di , j) =\n\nK\n\n\nI(hk (di ) = j)\n\n(16)\n\nTABLE 2.\nSummary statistic of 8 high-dimensional\ndatasets\nName\nFeatures\nInstances\nClasses % Minority\nFbis\n2000\n2463\n17\n1.54\nRe0\n2886\n1504\n13\n0.73\nRe1\n3758\n1657\n25\n0.6\nTr41\n7454\n878\n10\n1.03\nWap\n8460\n1560\n20\n0.32\nTr31\n10,128\n927\n7\n0.22\nLa2s\n12,432\n3075\n6\n8.07\nLa1s\n13,195\n3204\n6\n8.52\n\nIt emphasizes the performance of a classifier on rare\ncategories. Define and as follows:\n\ni =\n\nT Pi\nT Pi\n, i =\n(T Pi + F Pi )\n(T Pi + F Ni )\n\n(20)\n\nF 1 for each category i and the macro-averaged F1\nare computed as:\n\nk=1\n\nThe test accuracy is calculated as\nF 1i =\n1\nI(N (di , yi ) - maxj=yi N (di , j) > 0) (17)\nn i=1\n\n2i i\n, M acroF 1 =\ni + i\n\nq\ni=1\n\nq\n\nF 1i\n\n(21)\n\nn\n\nAcc =\n\nwhere n is the number of objects in Dt and yi indicates\nthe true class of di .\n3.4.\n\nF1 Metric\n\nTo evaluate the performance of classification methods\nin dealing with an unbalanced class distribution, we use\nthe F1 metric introduced by Yang and Liu [22]. This\nmeasure is equal to the harmonic mean of recall ()\nand precision (). The overall F1 score of the entire\nclassification problem can be computed by a microaverage and a macro-average.\nMicro-averaged F1 is computed globally over all\nclasses, and emphasizes the performance of a classifier\non common classes. Define and as follows:\nq\n\nq\nT Pi\ni=1 T Pi\n = q i=1\n, = q\n(18)\ni=1 (T Pi + F Pi )\ni=1 (T Pi + F Ni )\nwhere q is the number of classes. T Pi (True Positives)\nis the number of objects correctly predicted as class i,\nF Pi (False Positives) is the number of objects that are\npredicted to belong to class i but do not. The microaveraged F1 is computed as:\nM icroF 1 =\n\n2\n+\n\n(19)\n\nMacro-averaged F1 is first computed locally over\neach class, and then the average over all classes is taken.\n\nThe larger the MicroF1 and MacroF1 values are, the\nhigher the classification performance of the classifier.\n4.\n\nEXPERIMENTS\n\nIn this section, we present two experiments that\ndemonstrate the effectiveness of the new random\nforest algorithm for classifying high dimensional data.\nHigh dimensional datasets with various sizes and\ncharacteristics were used in the experiments. The\nfirst experiment is designed to show how our proposed\nmethod can reduce the generalization error bound\nc/s2 , and improve test accuracy when the size of\nthe selected subspace is not too large. The second\nexperiment is used to demonstrate the classification\nperformance of our proposed method in comparison to\nother classification methods, i.e. SVM, NB and KNN.\n4.1.\n\nDatasets\n\nIn the experiments, we used eight real-world high\ndimensional datasets. These datasets were selected\ndue to their diversities in the number of features, the\nnumber of instances, and the number of classes. Their\ndimensionalities vary from 2000 to 13,195. Instances\nvary from 878 to 3204 and the minority class rate varies\nfrom 0.22% to 8.52%. In each dataset, we randomly\nselect 70% of instances as the training dataset, and\nthe remaining data as the test dataset. Detailed\ninformation of the eight datasets is listed in Table 2.\nThe Fbis, Re0, Re1, Tr41, Wap, Tr31, La2s\nand La1s datasets are classical text classification\nbenchmark datasets which were carefully selected and\n\nThe Computer Journal, Vol. ??,\n\nNo. ??,\n\n????\n\nHybrid weighted random forests for classifying very high-dimensional data\npreprocessed by Han and Karypis [23]. Dataset Fbis\nwas compiled from the Foreign Broadcast Information\nService TREC-5 [24]. The datasets Re0 and Re1 were\nselected from the Reuters-21578 text categorization test\ncollection Distribution 1.0 [25]. The datasets Tr41 and\nTr31 were derived from TREC-5 [24], TREC-6 [24],\nand TREC-7 [24]. Dataset Wap is from the WebACE\nproject (WAP) [26]. The datasets La2s and La1s were\nselected from the Los Angeles Times for TREC-5 [24].\nThe classes of these datasets were generated from the\nrelevance judgment provided in these collections.\n4.2.\n\nPerformance Comparisons between Random Forest Methods\n\nThe purpose of this experiment was to evaluate\nthe effect of the hybrid weighted random forest\nmethod (H W RF) on strength, correlation, c/s2 ,\nand test accuracy.\nThe eight high dimensional\ndatasets were analyzed and results were compared\nwith seven other random forest methods, i.e., C4.5\nrandom forest (C4.5 RF), CART random forest\n(CART RF), CHAID random forest (CHAID RF),\nhybrid random forest (H RF), C4.5 weighted random\nforest (C4.5 W RF), CART weighted random forest\n(CART W RF), CHAID weighted random forest\n(CHAID W RF). For each dataset, we ran each\nrandom forest algorithm against different sizes of the\nfeature subspaces. Since the number of features in these\ndatasets was very large, we started with a subspace\nof 10 features and increased the subspace by 5 more\nfeatures each time. For a given subspace size, we built\n100 trees for each random forest model. In order to\nobtain a stable result, we built 80 random forest models\nfor each subspace size, each dataset and each algorithm,\nand computed the average values of the four measures\nof strength, correlation, c/s2 , and test accuracy as the\nfinal results for comparison. The performance of the\neight random forest algorithms on the four measures\nfor each of the 8 datasets is shown in Figs. 2, 3, 4, and\n5.\nFig. 2 plots the strength for the eight methods against\ndifferent subspace sizes on each of the 8 datasets.\nIn the same subspace, the higher the strength, the\nbetter the result. From the curves, we can see that\nthe new algorithm (H W RF) consistently performs\nbetter than the seven other random forest algorithms.\nThe advantages are more obvious for small subspaces.\nThe new algorithm quickly achieved higher strength\nas the subspace size increases.\nThe seven other\nrandom forest algorithms require larger subspaces to\nachieve a higher strength. These results indicate that\nthe hybrid weighted random forest algorithm enables\nrandom forest models to achieve a higher strength\nfor small subspace sizes compared to the seven other\nrandom forest algorithms.\nFig. 3 plots the curves for the correlations for the\neight random forest methods on the 8 datasets. For\n\n7\n\nsmall subspace sizes, H RF, C4.5 RF, CART RF,\nand CHAID RF produce higher correlations between\nthe trees on all datasets. The correlation decreases\nas the subspace size increases. For the random forest\nmodels the lower the correlation between the trees\nthen the better the final model.\nWith our new\nrandom forest algorithm (H W RF) a low correlation\nlevel was achieved with very small subspaces in all\n8 datasets. We also note that as the subspace size\nincreased the correlation level increased as well. This is\nunderstandable because as the subspace size increases,\nthe same informative features are more likely to be\nselected repeatedly in the subspaces, increasing the\nsimilarity of the decision trees. Therefore, the feature\nweighting method for subspace selection works well for\nsmall subspaces, at least from the point of view of the\ncorrelation measure.\nFig. 4 shows the error bound indicator c/s2 for the\neight methods on the 8 datasets. From these figures\nwe can observe that as the subspace size increases, c/s2\nconsistently reduces. The behaviour indicates that a\nsubspace size larger than log2 (M )+1 benefits all eight\nalgorithms. However, the new algorithm (H W RF)\nachieved a lower level of c/s2 at subspace size of\nlog2 (M ) + 1 than the seven other algorithms.\nFig. 5 plots the curves showing the accuracy of the\neight random forest models on the test datasets from\nthe 8 datasets. We can clearly see that the new random\nforest algorithm (H W RF) outperforms the seven\nother random forest algorithms in all eight data sets.\nIt can be seen that the new method is more stable\nin classification performance than other methods. In\nall of these figures, it is observed that the highest test\naccuracy is often obtained with the default subspace size\nof log2 (M ) + 1. This implies that in practice, large\nsize subspaces are not necessary to grow high-quality\ntrees for random forests.\n4.3.\n\nPerformance Comparisons\nClassification Methods\n\nwith\n\nOther\n\nWe conducted a further experimental comparison\nagainst three other widely used text classification\nmethods: support vector machines (SVM), Naive\nBayes (NB), and k-nearest neighbor (KNN). The\nsupport vector machine used a linear Kernel with a\nregularization parameter of 0.03125, which was often\nused in text categorization. For Naive Bayes, we\nadopted the multi-variate Bernoulli event model that\nis frequently used in text classification [27]. For knearest neighbor (KNN), we set the number k of\nneighbors to 13. In the experiments, we used WEKA's\nimplementation for these three text classification\nmethods [28]. We used a single subspace size of\nfeatures in all eight datasets to run the random forest\nalgorithms. For H RF, C4.5 RF, CART RF, and\nCHAID RF, we used a subspace size of 90 features in\nthe first 6 datasets (i.e., Fbis, Re0, Re1, Tr41, Wap, and\n\nThe Computer Journal, Vol. ??,\n\nNo. ??,\n\n????\n\n8\n\nBaoxun Xu, Joshua Zhexue Huang, Graham Williams, Yunming Ye\nFbis\n\nRe0\n\n0.52\n\n0.52\n\n0.48\n\n0.48\n\n0.44\n\nStrength\n\nStrength\n\n0.44\n\n0.40\n\n0.40\n\nH_W_RF\n\n0.36\n\nC4.5_W_RF\nCART_W_RF\n\n0.32\n\nH_W_RF\nC4.5_W_RF\n\n0.36\n\nCART_W_RF\n\nCHAID_W_RF\n\nCHAID_W_RF\n\n0.32\n\nH_RF\n\n0.28\n\nH_RF\n\nC4.5_RF\n\nC4.5_RF\n\n0.28\n\nCART_RF\n\n0.24\n\nCART_RF\n\nCHAID_RF\n\nCHAID_RF\n\n0.24\n10\n\n20\n\n30\n\n40\n\n50\n\n60\n\n70\n\n80\n\n90\n\n100\n\n10\n\n20\n\n30\n\n50\n\n60\n\n70\n\n80\n\n90\n\n100\n\nNumber of features\n\nNumber of features\n\nRe1\n\n0.60\n\n40\n\nTr41\n\n0.8\n\n0.55\n\n0.7\n\n0.50\n0.6\n\n0.40\n\nH_W_RF\nC4.5_W_RF\n\n0.35\n\nCART_W_RF\n\nStrength\n\nStrength\n\n0.45\n0.5\nH_W_RF\nC4.5_W_RF\n\n0.4\n\nCART_W_RF\n\nCHAID_W_RF\n\n0.30\n\nCHAID_W_RF\n\nH_RF\n\nH_RF\n\n0.3\n\nC4.5_RF\n\n0.25\n\nC4.5_RF\n\nCART_RF\n\nCART_RF\n\n0.2\n\nCHAID_RF\n\nCHAID_RF\n\n0.20\n10\n\n20\n\n30\n\n40\n\n50\n\n60\n\n70\n\n80\n\n90\n\n100\n\n10\n\n20\n\n30\n\n40\n\n50\n\n60\n\n70\n\n80\n\n90\n\n100\n\nNumber of features\n\nNumber of features\n\nWap\n\nTr31\n\n0.9\n\n0.44\n0.8\n\n0.40\n\n0.36\n\nStrength\n\nH_W_RF\n\n0.28\n\nC4.5_W_RF\nCART_W_RF\n\n0.24\n\nStrength\n\n0.7\n\n0.32\n\n0.6\nH_W_RF\nC4.5_W_RF\n\n0.5\n\nCART_W_RF\nCHAID_W_RF\n\nCHAID_W_RF\nH_RF\n\n0.20\n\nH_RF\n\n0.4\n\nC4.5_RF\n\nC4.5_RF\n\nCART_RF\n\nCART_RF\n\n0.16\n\n0.3\n\nCHAID_RF\n\nCHAID_RF\n\n0.12\n10\n\n20\n\n30\n\n40\n\n50\n\n60\n\n70\n\n80\n\n90\n\n100\n\n10\n\n20\n\n30\n\nLa2s\n\n60\n\n70\n\n80\n\n90\n\n100\n\nLa1s\n\n0.60\n\n0.55\n\n0.55\n\n0.50\n\n0.50\n\n0.45\n\n0.45\nH_W_RF\n\n0.40\n\nC4.5_W_RF\nCART_W_RF\nCHAID_W_RF\n\n0.35\n\nStrength\n\nStrength\n\n50\n\nNumber of features\n\nNumber of features\n\n0.60\n\n40\n\nH_W_RF\n\n0.40\n\nC4.5_W_RF\nCART_W_RF\n\n0.35\n\nCHAID_W_RF\n\nH_RF\nC4.5_RF\n\n0.30\n\nH_RF\n\n0.30\n\nC4.5_RF\n\nCART_RF\nCHAID_RF\n\nCART_RF\n\n0.25\n\nCHAID_RF\n\n0.25\n10\n\n20\n\n30\n\n40\n\n50\n\n60\n\n70\n\n80\n\n90\n\n100\n\n110\n\n120\n\n130\n\n10\n\n20\n\nNumber of features\n\n30\n\n40\n\n50\n\n60\n\n70\n\n80\n\n90\n\n100\n\n110\n\n120\n\n130\n\nNumber of features\n\nFIGURE 2. Strength changes against the number of features in the subspace on the 8 high dimensional datasets\n\nTr31) to run the random forest algorithms, and used\na subspace size of 120 features in the last 2 datasets\n(La2s and La1s) to run these random forest algorithms.\nFor H W RF, C4.5 W RF, CART W RF, and\nCHAID W RF, we used Breiman's subspace size of\n\nlog2 (M ) + 1 to run these random forest algorithms.\nThis number of features provided a consistent result as\nshown in Fig. 5. In order to obtain stable results, we\nbuilt 20 random forest models for each random forest\nalgorithm and each dataset and present the average\n\nThe Computer Journal, Vol. ??,\n\nNo. ??,\n\n????\n\nHybrid weighted random forests for classifying very high-dimensional data\nFbis\n\n9\n\nRe0\n\n0.216\n\n0.285\n\n0.208\n\n0.270\n\nCorrelation\n\nCorrelation\n\n0.255\n\n0.200\n\n0.240\n\n0.192\nH_W_RF\nC4.5_W_RF\n\n0.184\n\nCART_W_RF\nCHAID_W_RF\n\n0.176\n\nH_W_RF\n\n0.225\n\nC4.5_W_RF\nCART_W_RF\n\n0.210\n\nCHAID_W_RF\n\nH_RF\nC4.5_RF\n\n0.168\n\nH_RF\n\n0.195\n\nC4.5_RF\n\nCART_RF\nCHAID_RF\n\nCART_RF\n\n0.180\n\nCHAID_RF\n\n0.160\n10\n\n20\n\n30\n\n40\n\n50\n\n60\n\n70\n\n80\n\n90\n\n100\n\n10\n\n20\n\n30\n\nNumber of features\n\n40\n\n50\n\n70\n\n80\n\n90\n\n100\n\nNumber of features\n\nRe1\n\n0.27\n\n60\n\nTr41\n0.18\n\n0.26\n\n0.16\n\n0.25\n\n0.23\nH_W_RF\nC4.5_W_RF\n\n0.22\n\nCART_W_RF\nCHAID_W_RF\n\n0.21\n\nCorrelation\n\nCorrelation\n\n0.24\n\n0.14\n\nH_W_RF\n\n0.12\n\nC4.5_W_RF\nCART_W_RF\n\n0.10\n\nCHAID_W_RF\nH_RF\n\nH_RF\nC4.5_RF\n\n0.20\n\nC4.5_RF\n\n0.08\n\nCART_RF\n\nCART_RF\n\n0.19\n\nCHAID_RF\n\nCHAID_RF\n\n0.06\n10\n\n20\n\n30\n\n40\n\n50\n\n60\n\n70\n\n80\n\n90\n\n100\n\n10\n\n20\n\n30\n\n40\n\n50\n\n60\n\n70\n\n80\n\n90\n\n100\n\nNumber of features\n\nNumber of features\n\nWap\n\nTr31\n\n0.27\n\n0.14\n\n0.26\n0.12\n\nCorrelation\n\n0.24\n\n0.23\n\nH_W_RF\nC4.5_W_RF\n\n0.22\n\nCART_W_RF\nCHAID_W_RF\n\n0.21\n\nCorrelation\n\n0.25\n\n0.10\n\nH_W_RF\nC4.5_W_RF\n\n0.08\n\nCART_W_RF\nCHAID_W_RF\n\n0.06\n\nH_RF\n\nH_RF\n\nC4.5_RF\n\n0.20\n\nC4.5_RF\n\nCART_RF\n\nCART_RF\n\n0.04\n\nCHAID_RF\n\n0.19\n\n10\n\n20\n\n30\n\n40\n\n50\n\n60\n\n70\n\n80\n\n90\n\n100\n\nCHAID_RF\n\n10\n\n20\n\n30\n\n40\n\n50\n\n60\n\n70\n\n80\n\n90\n\n100\n\nNumber of features\n\nNumber of features\n\nLa2s\n\nLa1s\n\n0.162\n\n0.165\n\n0.156\n\n0.160\n\n0.155\n\n0.150\n\nH_W_RF\n\n0.138\n\nC4.5_W_RF\nCART_W_RF\n\n0.132\n\nCHAID_W_RF\n\n0.126\n\nCorrelation\n\nCorrelation\n\n0.150\n\n0.144\n\n0.145\n\nH_W_RF\nC4.5_W_RF\n\n0.140\n\nCART_W_RF\nCHAID_W_RF\n\n0.135\n\nH_RF\n\nH_RF\nC4.5_RF\n\n0.120\n\nC4.5_RF\n\n0.130\n\nCART_RF\n\nCART_RF\nCHAID_RF\n\nCHAID_RF\n\n0.125\n\n0.114\n10\n\n20\n\n30\n\n40\n\n50\n\n60\n\n70\n\n80\n\n90\n\n100\n\n110\n\n120\n\n130\n\n10\n\n20\n\nNumber of features\n\n30\n\n40\n\n50\n\n60\n\n70\n\n80\n\n90\n\n100\n\n110\n\n120\n\n130\n\nNumber of features\n\nFIGURE 3. Correlation changes against the number of features in the subspace on the 8 high dimensional datasets\n\nresults, noting that the range of values are less than\n0.005 and the hybrid trees are always more accurate.\nThe comparison results of classification performance\nof eleven methods are shown in Table 3.\nThe\nperformance is estimated using test accuracy (Acc),\n\nMicro F1 (Mic), and Macro F1 (Mac). Boldface\ndenotes best results between eleven classification\nmethods.\nWhile the improvement is often quite\nsmall, there is always an improvement demonstrated.\nWe observe that our proposed method (H W RF)\n\nThe Computer Journal, Vol. ??,\n\nNo. ??,\n\n????\n\n10\n\nBaoxun Xu, Joshua Zhexue Huang, Graham Williams, Yunming Ye\nFbis\n\n3.50\n\nRe0\n4.0\n\n3.15\n\nlog (M)+1\n\n3.5\n\n2\n\n2.80\n\n3.0\n\n2.45\n\nC4.5_W_RF\n\n1.75\n\n2.5\n\n2\n\nH_W_RF\n\nC/S\n\nC/S\n\n2\n\n2.10\n\nCART_W_RF\n\nH_W_RF\nC4.5_W_RF\n\n2.0\n\nCART_W_RF\n\nCHAID_W_RF\n\n1.40\n\nCHAID_W_RF\n\n1.5\n\nH_RF\n\nH_RF\n\nlog (M)+1\n2\n\nC4.5_RF\n\n1.05\n\nC4.5_RF\n\n1.0\n\nCART_RF\n\nCART_RF\n\nCHAID_RF\n\n0.70\n\nCHAID_RF\n\n0.5\n10\n\n20\n\n30\n\n40\n\n50\n\n60\n\n70\n\n80\n\n90\n\n100\n\n10\n\n20\n\n30\n\nNumber of features\n\n70\n\n80\n\n90\n\n100\n\n3.5\n\n4.9\n\nlog (M)+1\n\n3.0\n\n4.2\n\n2\n\nC/S\n\nH_W_RF\n\n2\n\n2.5\n\n3.5\n\n2\n\n60\n\nTr41\n\n4.0\n\n5.6\n\nC/S\n\n50\n\nNumber of features\n\nRe1\n\n6.3\n\n40\n\nC4.5_W_RF\n\n2.8\n\n2.0\n\nH_W_RF\nC4.5_W_RF\n\n1.5\n\nCART_W_RF\n\nCART_W_RF\n\nCHAID_W_RF\n\n2.1\n\nCHAID_W_RF\n\n1.0\n\nH_RF\n\nH_RF\n\nC4.5_RF\n\nlog (M)+1\n\n1.4\n\n2\n\nC4.5_RF\n\n0.5\n\nCART_RF\n\nCART_RF\n\nCHAID_RF\n\n0.7\n\nCHAID_RF\n\n0.0\n10\n\n20\n\n30\n\n40\n\n50\n\n60\n\n70\n\n80\n\n90\n\n100\n\n10\n\n20\n\n30\n\n40\n\n50\n\n60\n\n70\n\n80\n\n90\n\n100\n\nNumber of features\n\nNumber of features\n\nWap\n\nTr31\n\n14\n\n1.50\n\n12\n\nlog (M)+1\n\nlog (M)+1\n2\n\n1.25\n\n2\n\n10\n\nC4.5_W_RF\n\nC/S\n\n2\n\nC/S\n\nH_W_RF\n\n6\n\n2\n\n1.00\n\n8\n\nH_W_RF\n\n0.75\n\nC4.5_W_RF\nCART_W_RF\n\nCART_W_RF\n0.50\n\nCHAID_W_RF\n\n4\n\nCHAID_W_RF\nH_RF\n\nH_RF\nC4.5_RF\n\n2\n\nC4.5_RF\n\n0.25\n\nCART_RF\n\nCART_RF\nCHAID_RF\n\nCHAID_RF\n\n0\n\n10\n\n20\n\n30\n\n40\n\n50\n\n60\n\n70\n\n80\n\n90\n\n0.00\n\n100\n\n10\n\n20\n\n30\n\n40\n\n50\n\n60\n\n70\n\n80\n\n90\n\n100\n\nNumber of features\n\nNumber of features\n\nLa1s\n\nLa2s\n2.2\n\n1.8\n\n2.0\n1.6\n1.8\n1.4\n1.6\n1.2\n\nC4.5_W_RF\nCART_W_RF\n\n0.8\n\n2\n\nH_RF\n\n0.6\n\nC4.5_RF\nCART_W_RF\n\n0.4\n\nCHAID_RF\n\n20\n\n30\n\n40\n\n2\n\nH_W_RF\n\n1.2\n\nC4.5_W_RF\nCART_W_RF\n\n1.0\n\nCHAID_W_RF\n\nlog (M)+1\n\n10\n\nC/S\n\nC/S\n\n2\n\n1.4\nH_W_RF\n\n1.0\n\n50\n\n60\n\n70\n\n80\n\n90\n\n100\n\n110\n\n120\n\nCHAID_W_RF\n\nlog (M)+1\n2\n\n0.8\n\nH_RF\nC4.5_RF\n\n0.6\n\nCART_RF\nCHAID_RF\n\n0.4\n\n130\n\n10\n\n20\n\nNumber of features\n\n30\n\n40\n\n50\n\n60\n\n70\n\n80\n\n90\n\n100\n\n110\n\n120\n\n130\n\nNumber of features\n\nFIGURE 4. c/s2 changes against the number of features in the subspace on the 8 high dimensional datasets\n\noutperformed the other classification methods in all\ndatasets.\n\n5.\n\nCONCLUSIONS\n\nIn this paper, we presented a hybrid weighted random\nforest algorithm by simultaneously using a feature\nweighting method and a hybrid forest method to classify\nThe Computer Journal, Vol. ??,\n\nNo. ??,\n\n????\n\nHybrid weighted random forests for classifying very high-dimensional data\nFbis\n\n0.86\n\n11\n\nRe0\n\n0.88\n\n0.84\n0.84\n0.80\n\n0.76\n\n0.80\nH_W_RF\nC4.5_W_RF\n\n0.78\n\nCART_W_RF\n\nAccuracy\n\nAccuracy\n\n0.82\n\nCHAID_W_RF\nH_RF\n\n0.76\n\n0.72\nH_W_RF\n\n0.68\n\nC4.5_W_RF\nCART_W_RF\n\n0.64\n\nCHAID_W_RF\nH_RF\n\n0.60\n\nC4.5_RF\n\nC4.5_RF\n\nlog (M)+1\n\nCART_RF\n\n2\n\n0.74\n\nCART_RF\n\nlog (M)+1\n\n0.56\n\n2\n\nCHAID_RF\n\nCHAID_RF\n\n0.52\n10\n\n20\n\n30\n\n40\n\n50\n\n60\n\n70\n\n80\n\n90\n\n100\n\n10\n\n20\n\n30\n\nNumber of features\n\n40\n\n50\n\n60\n\n70\n\n80\n\n90\n\n100\n\nNumber of features\n\nRe1\n\nTr41\n\n1.00\n\n0.86\n0.95\n\n0.84\n\nlog (M)+1\n\n0.82\n\n0.90\n\n2\n\n0.78\nH_W_RF\nC4.5_W_RF\n\n0.76\n\nCART_W_RF\n\n0.74\n\nAccuracy\n\nAccuracy\n\n0.80\n\nCHAID_W_RF\n\n0.85\n\nH_W_RF\n\n0.80\n\nC4.5_W_RF\nCART_W_RF\n\n0.75\n\nCHAID_W_RF\nH_RF\n\nH_RF\n\n0.72\n\nlog (M)+1\n\n0.70\n\nC4.5_RF\n\nC4.5_RF\n\n2\n\nCART_RF\n\nCART_RF\n\n0.70\n\n0.65\n\nCHAID_RF\n\nCHAID_RF\n\n0.68\n10\n\n20\n\n30\n\n40\n\n50\n\n60\n\n70\n\n80\n\n90\n\n100\n\n10\n\n20\n\n30\n\n50\n\n60\n\n70\n\n80\n\n90\n\n100\n\nNumber of features\n\nNumber of features\n\nWap\n\n0.84\n\n40\n\nTr31\n\n1.000\n\n0.81\n\n0.975\n\n0.78\n\n0.950\n\n0.75\n\nH_W_RF\nC4.5_W_RF\n\n0.69\n\nCART_W_RF\n\nAccuracy\n\nAccuracy\n\n0.925\n\n0.72\n\n0.900\n\nH_W_RF\nC4.5_W_RF\n\n0.875\n\nCART_W_RF\nCHAID_W_RF\n\nCHAID_W_RF\n\n0.66\n\nH_RF\n\nlog (M)+1\n2\n\n0.850\n\nH_RF\n\nC4.5_RF\n\n0.63\n\nCART_RF\n\n0.60\n\nC4.5_RF\n\nlog (M)+1\n2\n\n0.825\n\nCART_RF\nCHAID_RF\n\nCHAID_RF\n\n0.800\n10\n\n20\n\n30\n\n40\n\n50\n\n60\n\n70\n\n80\n\n90\n\n100\n\n10\n\n20\n\n30\n\n50\n\n60\n\n70\n\n80\n\n90\n\n100\n\nNumber of features\n\nNumber of features\n\nLa2s\n\n0.900\n\n40\n\nLa1s\n\n0.88\n\n0.885\n0.86\n0.870\n\nAccuracy\n\n0.840\nH_W_RF\nC4.5_W_RF\n\n0.825\n\nCART_W_RF\nCHAID_W_RF\n\n0.810\n\nAccuracy\n\n0.84\n\n0.855\n\n0.82\n\nH_W_RF\nC4.5_W_RF\nCART_W_RF\n\n0.80\n\nCHAID_W_RF\nH_RF\n\nH_RF\n\nlog (M)+1\n\n0.795\n\nC4.5_RF\n\n2\n\nC4.5_RF\n\nlog (M)+1\n\n0.78\n\n2\n\nCART_RF\n\nCART_RF\nCHAID_RF\n\nCHAID_RF\n\n0.780\n\n0.76\n10\n\n20\n\n30\n\n40\n\n50\n\n60\n\n70\n\n80\n\n90\n\n100\n\n110\n\n120\n\n130\n\n10\n\n20\n\nNumber of features\n\n30\n\n40\n\n50\n\n60\n\n70\n\n80\n\n90\n\n100\n\n110\n\n120\n\n130\n\nNumber of features\n\nFIGURE 5. Test Accuracy changes against the number of features in the subspace on the 8 high dimensional datasets\n\nhigh dimensional data. Our algorithm not only retains\na small subspace size (Breiman's formula log2 (M ) + 1\nfor determining the subspace size) to create accurate\nrandom forest models, but also effectively reduces\nthe upper bound of the generalization error and\n\nimproves classification performance. From the results of\nexperiments on various high dimensional datasets, the\nrandom forest generated by our new method is superior\nto other classification methods. We can use the default\nlog2 (M ) + 1 subspace size and generally guarantee\n\nThe Computer Journal, Vol. ??,\n\nNo. ??,\n\n????\n\n12\n\nBaoxun Xu, Joshua Zhexue Huang, Graham Williams, Yunming Ye\n\nTABLE 3. The comparison of results\ndatasets\nDataset\nFbis\nMeasures\nAcc\nMic\nSVM\n0.834 0.799\nKNN\n0.78\n0.752\nNB\n0.776 0.74\nH RF\n0.853 0.816\nC4.5 RF\n0.836 0.806\nCART RF\n0.829 0.797\nCHAID RF\n0.842 0.805\nH W RF\n0.856 0.825\nC4.5 W RF\n0.841 0.809\nCART W RF\n0.835 0.805\nCHAID W RF\n0.839 0.815\nDataset\nWap\nMeasures\nAcc\nMic\nSVM\n0.81\n0.772\nKNN\n0.752 0.622\nNB\n0.797 0.742\nH RF\n0.815 0.805\nC4.5 RF\n0.797 0.795\nCART RF\n0.793 0.793\nCHAID RF\n0.805 0.805\nH W RF\n0.815 0.805\nC4.5 W RF\n0.805 0.795\nCART W RF\n0.8\n0.792\nCHAID W RF\n0.811 0.795\n\n(best accuracy, Micro F1, and Macro F1 results) of the eleven methods on the 8\nRe0\nMic\n0.795\n0.752\n0.741\n0.82\n0.802\n0.798\n0.8\n0.825\n0.815\n0.81\n0.812\nTr31\nMac\nAcc\nMic\n0.663 0.955 0.907\n0.622 0.905 0.82\n0.559 0.925 0.832\n0.735 0.965 0.925\n0.732 0.962 0.902\n0.73\n0.958 0.892\n0.732 0.96\n0.9\n0.735 0.965 0.925\n0.732 0.962 0.911\n0.73\n0.96\n0.902\n0.73\n0.96\n0.905\n\nMac\n0.76\n0.722\n0.706\n0.816\n0.806\n0.787\n0.805\n0.82\n0.815\n0.81\n0.812\n\nAcc\n0.804\n0.779\n0.784\n0.845\n0.836\n0.826\n0.832\n0.855\n0.845\n0.839\n0.842\n\nto always produce the best models, on a variety of\nmeasures, by using the hybrid weighted random forest\nalgorithm.\nACKNOWLEDGEMENTS\nThis research is supported in part by NSFC under\nGrant NO.61073195, and Shenzhen New Industry Development Fund under Grant NO.CXB201005250021A\nREFERENCES\n[1] Breiman, L. (2001) Random forests. Machine learning,\n45, 5-32.\n[2] Ho, T. (1998) Random subspace method for constructing decision forests. IEEE Transactions on Pattern\nAnalysis and Machine Intelligence, 20, 832-844.\n[3] Quinlan, J. (1993) C4.5: Programs for machine\nlearning. Morgan Kaufmann.\n[4] Breiman, L. (1984) Classification and regression trees.\nChapman & Hall/CRC.\n[5] Breiman, L. (1996) Bagging predictors.\nMachine\nlearning, 24, 123-140.\n[6] Ho, T. (1995) Random decision forests. Proceedings\nof the Third International Conference on Document\nAnalysis and Recognition, pp. 278-282. IEEE.\n[7] Dietterich, T. (2000) An experimental comparison of\nthree methods for constructing ensembles of decision\ntrees: Bagging, boosting, and randomization. Machine\nlearning, 40, 139-157.\n\nMac\n0.756\n0.752\n0.619\n0.82\n0.802\n0.798\n0.8\n0.822\n0.812\n0.805\n0.815\nMac\n0.87\n0.762\n0.81\n0.88\n0.87\n0.86\n0.852\n0.88\n0.87\n0.865\n0.855\n\nRe1\nMic\n0.826\n0.668\n0.732\n0.832\n0.811\n0.808\n0.815\n0.836\n0.826\n0.818\n0.83\nla2s\nAcc\nMic\n0.89\n0.832\n0.841 0.805\n0.896 0.815\n0.89\n0.84\n0.878 0.83\n0.882 0.832\n0.88\n0.83\n0.896 0.848\n0.886 0.835\n0.887 0.835\n0.887 0.833\nAcc\n0.829\n0.788\n0.816\n0.841\n0.825\n0.825\n0.838\n0.848\n0.838\n0.835\n0.84\n\nTr41\nMic\n0.915\n0.813\n0.856\n0.926\n0.92\n0.891\n0.903\n0.926\n0.922\n0.91\n0.915\nla1s\nMac\nAcc\nMic\n0.807 0.875 0.82\n0.786 0.827 0.798\n0.79\n0.87\n0.802\n0.82\n0.862 0.825\n0.81\n0.855 0.82\n0.81\n0.84\n0.815\n0.803 0.845 0.816\n0.825 0.875 0.836\n0.816 0.866 0.825\n0.812 0.87\n0.825\n0.81\n0.865 0.825\nMac\n0.706\n0.638\n0.58\n0.8\n0.781\n0.783\n0.795\n0.81\n0.795\n0.79\n0.8\n\nAcc\n0.95\n0.915\n0.935\n0.953\n0.948\n0.917\n0.926\n0.953\n0.95\n0.935\n0.942\n\nMac\n0.87\n0.765\n0.782\n0.895\n0.89\n0.88\n0.88\n0.895\n0.892\n0.88\n0.88\nMac\n0.803\n0.761\n0.775\n0.805\n0.798\n0.792\n0.795\n0.82\n0.81\n0.81\n0.805\n\n[8] Banfield, R., Hall, L., Bowyer, K., and Kegelmeyer, W.\n(2007) A comparison of decision tree ensemble creation\ntechniques. IEEE Transactions on Pattern Analysis\nand Machine Intelligence, 29, 173-180.\n\n[9] Robnik-Sikonja,\nM. (2004) Improving random forests.\nProceedings of the 15th European Conference on\nMachine Learning, pp. 359-370. Springer.\n[10] Ho, T. (1998) C4.5 decision forests. Proceedings of\nthe Fourteenth International Conference on Pattern\nRecognition, pp. 545-549. IEEE.\n[11] Dietterrich, T. (1997) Machine learning research: four\ncurrent direction. Artificial Intelligence Magzine, 18,\n97-136.\n[12] Amaratunga, D., Cabrera, J., and Lee, Y. (2008)\nEnriched random forests. Bioinformatics, 24, 2010-\n2014.\n[13] Ye, Y., Li, H., Deng, X., and Huang, J. (2008)\nFeature weighting random forest for detection of hidden\nweb search interfaces. The Journal of Computational\nLinguistics and Chinese Language Processing, 13, 387-\n404.\n[14] Xu, B., Huang, J., Williams, G., Wang, Q., and\nYe, Y. (2012) Classifying very high-dimensional data\nwith random forests built from small subspaces.\nInternational Journal of Data Warehousing and\nMining, 8, 45-62.\n[15] Xu, B., Huang, J., Williams, G., Li, J., and Ye, Y.\n(2012) Hybrid random forests: Advantages of mixed\ntrees in classifying text data. Proceedings of the 16th\nPacific-Asia Conference on Knowledge Discovery and\nData Mining. Springer.\n\nThe Computer Journal, Vol. ??,\n\nNo. ??,\n\n????\n\nHybrid weighted random forests for classifying very high-dimensional data\n[16] Biggs, D., De Ville, B., and Suen, E. (1991) A method\nof choosing multiway partitions for classification and\ndecision trees. Journal of Applied Statistics, 18, 49-62.\n[17] Ture, M., Kurt, I., Turhan Kurum, A., and Ozdamar,\nK. (2005) Comparing classification techniques for\npredicting essential hypertension. Expert Systems with\nApplications, 29, 583-588.\n[18] Begum, N., M.A., F., and Ren, F. (2009) Automatic text summarization using support vector machine.\nInternational Journal of Innovative Computing, Information and Control, 5, 1987-1996.\n[19] Chen, J., Huang, H., Tian, S., and Qu, Y. (2009)\nFeature selection for text classification with naive\nbayes. Expert Systems with Applications, 36, 5432-\n5435.\n[20] Tan, S. (2005) Neighbor-weighted k-nearest neighbor\nfor unbalanced text corpus.\nExpert Systems with\nApplications, 28, 667-671.\n[21] Pearson, K. (1904) On the Theory of Contingency and\nIts Relation to Association and Normal Correlation.\nCambridge University Press.\n[22] Yang, Y. and Liu, X. (1999) A re-examination of\ntext categorization methods. Proceedings of the 22th\nInternational Conference on Research and Development\nin Information Retrieval, pp. 42-49. ACM.\n[23] Han, E. and Karypis, G. (2000) Centroid-based\ndocument classification: Analysis and experimental\nresults. Proceedings of the 4th European Conference on\nPrinciples of Data Mining and Knowledge Discovery,\npp. 424-431. Springer.\n[24] TREC.\n(2011)\nText\nretrieval\nconference,\nhttp://trec.nist.gov.\n[25] Lewis,\nD.\n(1999)\nReuters-21578\ntext\ncategorization\ntest\ncollection\ndistribution\n1.0,\nhttp://www.research.att.com/ lewis.\n[26] Han, E., Boley, D., Gini, M., Gross, R., Hastings,\nK., Karypis, G., Kumar, V., Mobasher, B., and\nMoore, J. (1998) Webace: A web agent for document\ncategorization and exploration. Proceedings of the 2nd\nInternational Conference on Autonomous Agents, pp.\n408-415. ACM.\n[27] McCallum, A. and Nigam, K. (1998) A comparison of\nevent models for naive bayes text classification. AAAI98 workshop on learning for text categorization, pp. 41-\n48.\n[28] Witten, I., Frank, E., and Hall, M. (2011) Data Mining:\nPractical machine learning tools and techniques.\nMorgan Kaufmann.\n\nThe Computer Journal, Vol. ??,\n\nNo. ??,\n\n????\n\n13\n
```

```
<- function(d, n) {d %>% extract2(n) %>% as.character() %>% writeLines()}
viewDocs viewDocs(docs, 16)
```

```
## Hybrid weighted random forests for
## classifying very high-dimensional data
## Baoxun Xu1 , Joshua Zhexue Huang2 , Graham Williams2 and
## Yunming Ye1
## 1
##
## Department of Computer Science, Harbin Institute of Technology Shenzhen Graduate
## School, Shenzhen 518055, China
## 2
## Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen
## 518055, China
## Email: amusing002@gmail.com
## Random forests are a popular classification method based on an ensemble of a
## single type of decision trees from subspaces of data. In the literature, there
## are many different types of decision tree algorithms, including C4.5, CART, and
## CHAID. Each type of decision tree algorithm may capture different information
## and structure. This paper proposes a hybrid weighted random forest algorithm,
## simultaneously using a feature weighting method and a hybrid forest method to
## classify very high dimensional data. The hybrid weighted random forest algorithm
## can effectively reduce subspace size and improve classification performance
## without increasing the error bound. We conduct a series of experiments on eight
## high dimensional datasets to compare our method with traditional random forest
## methods and other classification methods. The results show that our method
## consistently outperforms these traditional methods.
## Keywords: Random Forests; Hybrid Weighted Random Forest; Classification; Decision tree;
##
## 1.
##
## INTRODUCTION
##
## Random forests [1, 2] are a popular classification
## method which builds an ensemble of a single type
## of decision trees from different random subspaces of
## data. The decision trees are often either built using
## C4.5 [3] or CART [4], but only one type within
## a single random forest. In recent years, random
## forests have attracted increasing attention due to
## (1) its competitive performance compared with other
## classification methods, especially for high-dimensional
## data, (2) algorithmic intuitiveness and simplicity, and
## (3) its most important capability - "ensemble" using
## bagging [5] and stochastic discrimination [2].
## Several methods have been proposed to grow random
## forests from subspaces of data [1, 2, 6, 7, 8, 9, 10]. In
## these methods, the most popular forest construction
## procedure was proposed by Breiman [1] to first use
## bagging to generate training data subsets for building
## individual trees.
## A subspace of features is then
## randomly selected at each node to grow branches of
## a decision tree. The trees are then combined as an
## ensemble into a forest. As an ensemble learner, the
## performance of a random forest is highly dependent
## on two factors: the performance of each tree and the
## diversity of the trees in the forests [11]. Breiman
## formulated the overall performance of a set of trees as
## the average strength and proved that the generalization
##
## error of a random forest is bounded by the ratio of the
## average correlation between trees divided by the square
## of the average strength of the trees.
## For very high dimensional data, such as text data,
## there are usually a large portion of features that are
## uninformative to the classes. During this forest building
## process, informative features would have the large
## chance to be missed, if we randomly select a small
## subspace (Breiman suggested selecting log2 (M ) + 1
## features in a subspace, where M is the number of
## independent features in the data) from high dimensional
## data [12]. As a result, weak trees are created from these
## subspaces, the average strength of those trees is reduced
## and the error bound of the random forest is enlarged.
## Therefore, when a large proportion of such "weak"
## trees are generated in a random forest, the forest has a
## large likelihood to make a wrong decision which mainly
## results from those "weak" trees' classification power.
## To address this problem, we aim to optimize decision
## trees of a random forest by two strategies. One
## straightforward strategy is to enhance the classification
## performance of individual trees by a feature weighting
## method for subspace sampling [12, 13, 14]. In this
## method, feature weights are computed with respect
## to the correlations of features to the class feature
## and regarded as the probabilities of the feature to
## be selected in subspaces. This method obviously
## increases the classification performance of individual
##
## The Computer Journal, Vol. ??,
##
## No. ??,
##
## ????
##
## 2
##
## Baoxun Xu, Joshua Zhexue Huang, Graham Williams, Yunming Ye
##
## trees because the subspaces will be biased to contain
## more informative features. However, the chance of more
## correlated trees is also increased since the features with
## large weights are likely to be repeatedly selected.
## The second strategy is more straightforward: use
## several different types of decision trees for each training
## data subset, to increase the diversity of the trees,
## and then select the optimal tree as the individual
## tree classifier in the random forest model. The work
## presented here extends the algorithm developed in [15].
## Specifically, we build three different types of tree
## classifiers (C4.5, CART, and CHAID [16, 17]) for each
## training data subset. We then evaluate the performance
## of the three classifiers and select the best tree. In
## this way, we build a hybrid random forest which may
## include different types of decision trees in the ensemble.
## The added diversity of the decision trees can effectively
## improve the accuracy of each tree in the forest, and
## hence the classification performance of the ensemble.
## However, when we use this method to build the best
## random forest model for classifying high dimensional
## data, we can not be sure of what subspace size is best.
## In this paper, we propose a hybrid weighted random
## forest algorithm by simultaneously using a new feature
## weighting method together with the hybrid random
## forest method to classify high dimensional data. In
## this new random forest algorithm, we calculate feature
## weights and use weighted sampling to randomly select
## features for subspaces at each node in building different
## types of trees classifiers (C4.5, CART, and CHAID) for
## each training data subset, and select the best tree as
## the individual tree in the final ensemble model.
## Experiments were performed on 8 high dimensional
## text datasets with dimensions ranging from 2000 to
## 13195. We compared the performance of eight random
## forest methods and well-known classification methods:
## C4.5 random forest, CART random forest, CHAID
## random forest, hybrid random forest, C4.5 weighted
## random forest, CART weighted random forest, CHAID
## weighted random forest, hybrid weighted random
## forest, support vector machines [18], naive Bayes [19],
## and k-nearest neighbors [20].
## The experimental
## results show that our hybrid weighted random forest
## achieves improved classification performance over the
## ten competitive methods.
## The remainder of this paper is organized as follows.
## In Section 2, we introduce a framework for building
## a hybrid weighted random forest, and describe a new
## random forest algorithm. Section 3 summarizes four
## measures to evaluate random forest models. We present
## experimental results on 8 high dimensional text datasets
## in Section 4. Section 5 contains our conclusions.
##
## TABLE 1. Contingency table of input feature A and class
## feature Y
## Y = y1 . . .
## Y = yj . . .
## Y = yq Total
## A = a1
## 11
## ...
## 1j
## ...
## 1q
## 1*
## ..
## ..
## ..
## ..
## ..
## ..
## .
## .
## ...
## .
## .
## .
## .
## A = ai
## i1
## ...
## ij
## ...
## iq
## i*
## ..
## ..
## ..
## ..
## ..
## ..
## ...
## .
## .
## .
## .
## .
## .
## A = ap
## p1
## ...
## pj
## ...
## pq
## p*
## Total
## *1
## ...
## *j
## ...
## *q
##
##
## general framework for building hybrid random forests.
## By integrating these two methods, we propose a novel
## hybrid weighted random forest algorithm.
## 2.1.
##
## Let Y be the class (or target) feature with q distinct
## class labels yj for j = 1, * * * , q. For the purposes of
## our discussion we consider a single categorical feature
## A in dataset D with p distinct category values. We
## denote the distinct values by ai for i = 1, * * * , p.
## Numeric features can be discretized into p intervals with
## a supervised discretization method.
## Assume D has val objects. The size of the subset of
## D satisfying the condition that A = ai and Y = yj is
## denoted by ij . Considering all combinations of the
## categorical values of A and the labels of Y , we can
## obtain a contingency table [21] of A against Y as shown
## in Table 1. The far right column contains the marginal
## totals for feature A:
##
## HYBRID
## FORESTS
##
## WEIGHTED
##
## RANDOM
##
## In this section, we first introduce a feature weighting
## method for subspace sampling. Then we present a
##
## q
##
##
## i. =
##
## ij
##
## for i = 1, * * * , p
##
## (1)
##
## j=1
##
## and the bottom row is the marginal totals for class
## feature Y :
## .j =
##
## p
##
##
## ij
##
## for j = 1, * * * , q
##
## (2)
##
## i=1
##
## The grand total (the total number of samples) is in
## the bottom right corner:
## =
##
## q
## p
##
##
## ij
##
## (3)
##
## j=1 i=1
##
## Given a training dataset D and feature A we first
## compute the contingency table. The feature weights are
## then computed using the two methods to be discussed
## in the following subsection.
## 2.2.
##
## 2.
##
## Notation
##
## Feature Weighting Method
##
## In this subsection, we give the details of the feature
## weighting method for subspace sampling in random
## forests. Consider an M-dimensional feature space
## {A1 , A2 , . . . , AM }. We present how to compute the
##
## The Computer Journal, Vol. ??,
##
## No. ??,
##
## ????
##
## Hybrid weighted random forests for classifying very high-dimensional data
## weights {w1 , w2 , . . . , wM } for every feature in the space.
## These weights are then used in the improved algorithm
## to grow each decision tree in the random forest.
## 2.2.1. Feature Weight Computation
## The weight of feature A represents the correlation
## between the values of feature A and the values of the
## class feature Y . A larger weight will indicate that the
## class labels of objects in the training dataset are more
## correlated with the values of feature A, indicating that
## A is more informative to the class of objects. Thus it
## is suggested that A has a stronger power in predicting
## the classes of new objects.
## In the following, we propose to use the chi-square
## statistic to compute feature weights because this
## method can quantify the correspondence between two
## categorical variables.
## Given the contingency table of an input feature A and
## the class feature Y of dataset D, the chi-square statistic
## of the two features is computed as:
## corr(A, Y ) =
##
## q
## p
##
## (ij - tij )2
## tij
## i=1 j=1
##
## (4)
##
## where ij is the observed frequency from the
## contingency table and tij is the expected frequency
## computed as
## i* x *j
## tij =
##
##
## (5)
##
## The larger the measure corr(A, Y ), the more
## informative the feature A is in predicting class Y .
## 2.2.2. Normalized Feature Weight
## In practice, feature weights are normalized for feature
## subspace sampling. We use corr(A, Y ) to measure the
## informativeness of these features and consider them
## as feature weights. However, to treat the weights as
## probabilities of features, we normalize the measures to
## ensure the sum of the normalized feature weights is
## equal to 1. Let corr(Ai , Y ) (1 i M ) be the set
## of M feature measures. We compute the normalized
## weights as
##
## corr(Ai , Y )
## wi = N
## (6)
## i=1 corr(Ai , Y )
## Here, we use the square root to smooth the values of
## the measures. wi can be considered as the probability
## that feature Ai is randomly sampled in a subspace. The
## more informative a feature is, the larger the weight and
## the higher the probability of the feature being selected.
##
## Diversity is commonly obtained by using bagging and
## random subspace sampling. We introduce a further
## element of diversity by using different types of trees.
## Considering an analogy with forestry, the different data subsets from bagging represent the "soil structures." Different decision tree algorithms represent "different tree species". Our approach has two key aspects:
## one is to use three types of decision tree algorithms to
## generate three different tree classifiers for each training data subset; the other is to evaluate the accuracy
## of each tree as the measure of tree importance. In this
## paper, we use the out-of-bag accuracy to assess the importance of a tree.
## Following Breiman [1], we use bagging to generate
## a series of training data subsets from which we build
## trees. For each tree, the data subset used to grow
## the tree is called the "in-of-bag" (IOB) data and the
## remaining data subset is called the "out-of-bag" (OOB)
## data. Since OOB data is not used for building trees
## we can use this data to objectively evaluate each tree's
## accuracy and importance. The OOB accuracy gives an
## unbiased estimate of the true accuracy of a model.
## Given n instances in a training dataset D and a tree
## classifier hk (IOBk ) built from the k'th training data
## subset IOBk , we define the OOB accuracy of the tree
## hk (IOBk ), for di D, as:
## n
## OOBAcck =
##
## Framework for Building a Hybrid Random
## Forest
##
## As an ensemble learner, the performance of a random
## forest is highly dependent on two factors: the diversity
## among the trees and the accuracy of each tree [11].
##
## i=1
##
## I(hk (di ) = yi ; di OOBk )
## n
## i=1 I(di OOBk )
##
## (7)
##
## where I(.) is an indicator function. The larger the
## OOBAcck , the better the classification quality of a tree.
## We use the out-of-bag data subset OOBi to calculate
## the out-of-bag accuracies of the three types of trees
## (C4.5, CART and CHAID) with evaluation values E1 ,
## E2 and E3 respectively.
## Fig. 1 illustrates the procedure for building a hybrid
## random forest model. Firstly, a series of IOB/OOB
## datasets are generated from the entire training dataset
## by bagging. Then, three types of tree classifiers (C4.5,
## CART and CHAID) are built using each IOB dataset.
## The corresponding OOB dataset is used to calculate the
## OOB accuracies of the three tree classifiers. Finally,
## we select the tree with the highest OOB accuracy as
## the final tree classifier, which is included in the hybrid
## random forest.
## Building a hybrid random forest model in this
## way will increase the diversity among the trees.
## The classification performance of each individual tree
## classifier is also maximized.
## 2.4.
##
## 2.3.
##
## 3
##
## Decision Tree Algorithms
##
## The core of our approach is the diversity of decision
## tree algorithms in our random forest. Different decision
## tree algorithms grow structurally different trees from
## the same training data. Selecting a good decision tree
## algorithm to grow trees for a random forest is critical
##
## The Computer Journal, Vol. ??,
##
## No. ??,
##
## ????
##
## 4
##
## Baoxun Xu, Joshua Zhexue Huang, Graham Williams, Yunming Ye
## the difference lies in the way to split a node, such
## as the split functions and binary branches or multibranches. In this work we use these different decision
## tree algorithms to build a hybrid random forest.
##
## 2.5.
##
## FIGURE 1. The Hybrid Random Forests framework.
##
## for the performance of the random forest. Few studies
## have considered how different decision tree algorithms
## affect a random forest. We do so in this paper.
## The common decision tree algorithms are as follows:
## Classification Trees 4.5 (C4.5) is a supervised
## learning classification algorithm used to construct
## decision trees. Given a set of pre-classified objects, each
## described by a vector of attribute values, we construct
## a mapping from attribute values to classes. C4.5 uses
## a divide-and-conquer approach to grow decision trees.
## Beginning with the entire dataset, a tree is constructed
## by considering each predictor variable for dividing the
## dataset. The best predictor is chosen at each node
## using a impurity or diversity measure. The goal is
## to produce subsets of the data which are homogeneous
## with respect to the target variable. C4.5 selects the test
## that maximizes the information gain ratio (IGR) [3].
## Classification and Regression Tree (CART) is
## a recursive partitioning method that can be used for
## both regression and classification. The main difference
## between C4.5 and CART is the test selection and
## evaluation process.
## Chi-squared Automatic Interaction Detector
## (CHAID) method is based on the chi-square test of
## association. A CHAID decision tree is constructed
## by repeatedly splitting subsets of the space into two
## or more nodes. To determine the best split at any
## node, any allowable pair of categories of the predictor
## variables is merged until there is no statistically
## significant difference within the pair with respect to the
## target variable [16, 17].
## From these decision tree algorithms, we can see that
##
## Hybrid Weighted Random Forest Algorithm
##
## In this subsection we present a hybrid weighted
## random forest algorithm by simultaneously using the
## feature weights and a hybrid method to classify high
## dimensional data. The benefits of our algorithm has
## two aspects: Firstly, compared with hybrid forest
## method [15], we can use a small subspace size to
## create accurate random forest models.
## Secondly,
## compared with building a random forest using feature
## weighting [14], we can use several different types of
## decision trees for each training data subset to increase
## the diversities of trees. The added diversity of the
## decision trees can effectively improve the classification
## performance of the ensemble model. The detailed steps
## are introduced in Algorithm 1.
## Input parameters to Algorithm 1 include a training
## dataset D, the set of features A, the class feature Y ,
## the number of trees in the random forest K and the
## size of subspaces m. The output is a random forest
## model M . Lines 9-16 form the loop for building K
## decision trees. In the loop, Line 10 samples the training
## data D by sampling with replacement to generate an
## in-of-bag data subset IOBi for building a decision tree.
## Line 11-14 build three types of tree classifiers (C4.5,
## CART, and CHAID). In this procedure, Line 12 calls
## the function createT reej () to build a tree classifier.
## Line 13 calculates the out-of-bag accuracy of the tree
## classifier. After this procedure, Line 15 selects the tree
## classifier with the maximum out-of-bag accuracy. K
## decision tree trees are thus generated to form a hybrid
## weighted random forest model M .
## Generically, function createT reej () first creates a
## new node. Then, it tests the stopping criteria to decide
## whether to return to the upper node or to split this
## node. If we choose to split this node, then the feature
## weighting method is used to randomly select m features
## as the subspace for node splitting. These features
## are used as candidates to generate the best split to
## partition the node. For each subset of the partition,
## createT reej () is called again to create a new node under
## the current node. If a leaf node is created, it returns to
## the parent node. This recursive process continues until
## a full tree is generated.
##
## The Computer Journal, Vol. ??,
##
## No. ??,
##
## ????
##
## Hybrid weighted random forests for classifying very high-dimensional data
## Algorithm 1 New Random Forest Algorithm
## 1: Input:
## 2: - D : the training dataset,
## 3: - A : the features space {A1 , A2 , ..., AM },
## 4: - Y : the class features space {y1 , y2 , ..., yq },
## 5: - K : the number of trees,
## 6: - m : the size of subspaces.
## 7: Output: A random forest M ;
## 8: Method:
## 9: for i = 1 to K do
## 10:
## draw a bootstrap sample in-of-bag data subset
## IOBi and out-of-bag data subset OOBi from
## training dataset D;
## 11:
## for j = 1 to 3 do
## 12:
## hi,j (IOBi ) = createT reej ();
## use out-of-bag data subset OOBi to calculate
## 13:
## the out-of-bag accuracy OOBAcci, j of the tree
## classifier hi,j (IOBi ) by Equation(1);
## 14:
## end for
## 15:
## select hi (IOBi ) with the highest out-of-bag
## accuracy OOBAcci as Optimal tree i;
## 16: end for
## 17: combine
## the
## K
## tree
## classifiers
## h1 (IOB1 ), h2 (IOB2 ), ..., hK (IOBK ) into a random
## forest M ;
## 18:
## 19:
## 20:
## 21:
## 22:
## 23:
## 24:
## 25:
## 26:
## 27:
## 28:
## 29:
## 30:
## 31:
## 32:
##
## 3.
##
## Function createTree()
## create a new node N ;
## if stopping criteria is met then
## return N as a leaf node;
## else
## for j = 1 to M do
## compute
## the
## informativeness
## measure
## corr(Aj , Y ) by Equation (4);
## end for
## compute feature weights {w1 , w2 , ..., wM } by
## Equation (6);
## use the feature weighting method to randomly
## select m features;
## use these m features as candidates to generate
## the best split for the node to be partitioned;
## call createTree() for each split;
## end if
## return N ;
## EVALUATION MEASURES
##
## In this paper, we use five measures, i.e., strength,
## correlation, error bound c/s2 , test accuracy, and F1
## metric, to evaluate our random forest models. Strength
## measures the collective performance of individual trees
## in a random forest and the correlation measures the
## diversity of the trees. The ratio of the correlation
## over the square of the strength c/s2 indicates the
## generalization error bound of the random forest model.
## These three measures were introduced in [1]. The
## accuracy measures the performance of a random forest
## model on unseen test data. The F1 metric is a
##
## 5
##
## commonly used measure of classification performance.
## 3.1.
##
## Strength and Correlation Measures
##
## We follow Breiman's method described in [1] to
## calculate the strength, correlation and the ratio c/s2 .
## Following Breiman's notation, we denote strength as
## s and correlation as . Let hk (IOBk ) be the kth
## tree classifier grown from the kth training data IOBk
## sampled from D with replacement.
## Assume the
## random forest model contains K trees. The out-of-bag
## proportion of votes for di D on class j is
## K
## I(hk (di ) = j; di
## / IOBk )
## Q(di , j) = k=1K
## (8)
## / IOBk )
## k=1 I(di
## This is the number of trees in the random forest
## which are trained without di and classify di into class
## j, divided by the number of training datasets not
## containing di .
## The strength s is computed as:
## 1
## (Q(di , yi ) - maxj=yi Q(di , j))
## n i=1
## n
##
## s=
##
## (9)
##
## where n is the number of objects in D and yi indicates
## the true class of di .
## The correlation is computed as:
## n
## 1
## 2
## 2
## i=1 (Q(di , yi ) - maxj=yi Q(di , j)) - s
## n
## (10)
## =
##
##
## K
## 1
## (K
## k + (pk - pk )2 )2
## k=1 pk + p
## where
##
## n
## pk =
##
## i=1
##
## I(hk (di ) = yi ; di
## / IOBk )
## n
## / IOBk )
## i=1 I(di
##
## (11)
##
## and
## n
## pk =
##
## i=1
##
## I(hk (di ) = j(di , Y ); di
## / IOBk )
## n
## I(d
##
## /
## IOB
## )
## i
## k
## i=1
##
## (12)
##
## where
## j(di , Y ) = argmaxj=yi Q(d, j)
##
## (13)
##
## is the class that obtains the maximal number of votes
## among all classes but the true class.
## 3.2.
##
## General Error Bound Measure c/s2
##
## Given the strength and correlation, the out-of-bag
## estimate of the c/s2 measure can be computed.
## An important theoretical result in Breiman's method
## is the upper bound of the generalization error of the
## random forest ensemble that is derived as
## P E (1 - s2 )/s2
##
## (14)
##
## where is the mean value of correlations between all
## pairs of individual classifiers and s is the strength of
## the set of individual classifiers that is estimated as the
##
## The Computer Journal, Vol. ??,
##
## No. ??,
##
## ????
##
## 6
##
## Baoxun Xu, Joshua Zhexue Huang, Graham Williams, Yunming Ye
##
## average accuracy of individual classifiers on D with
## out-of-bag evaluation. This inequality shows that the
## generalization error of a random forest is affected by
## the strength of individual classifiers and their mutual
## correlations. Therefore, Breiman defined the c/s2 ratio
## to measure a random forest as
## c/s2 = /s2
##
## (15)
##
## The smaller the ratio, the better the performance of
## the random forest. As such, c/s2 gives guidance for
## reducing the generalization error of random forests.
## 3.3.
##
## Test Accuracy
##
## The test accuracy measures the classification performance of a random forest on the test data set. Let
## Dt be a test data and Yt be the class labels. Given
## di Dt , the number of votes for di on class j is
## N (di , j) =
##
## K
##
##
## I(hk (di ) = j)
##
## (16)
##
## TABLE 2.
## Summary statistic of 8 high-dimensional
## datasets
## Name
## Features
## Instances
## Classes % Minority
## Fbis
## 2000
## 2463
## 17
## 1.54
## Re0
## 2886
## 1504
## 13
## 0.73
## Re1
## 3758
## 1657
## 25
## 0.6
## Tr41
## 7454
## 878
## 10
## 1.03
## Wap
## 8460
## 1560
## 20
## 0.32
## Tr31
## 10,128
## 927
## 7
## 0.22
## La2s
## 12,432
## 3075
## 6
## 8.07
## La1s
## 13,195
## 3204
## 6
## 8.52
##
## It emphasizes the performance of a classifier on rare
## categories. Define and as follows:
##
## i =
##
## T Pi
## T Pi
## , i =
## (T Pi + F Pi )
## (T Pi + F Ni )
##
## (20)
##
## F 1 for each category i and the macro-averaged F1
## are computed as:
##
## k=1
##
## The test accuracy is calculated as
## F 1i =
## 1
## I(N (di , yi ) - maxj=yi N (di , j) > 0) (17)
## n i=1
##
## 2i i
## , M acroF 1 =
## i + i
##
## q
## i=1
##
## q
##
## F 1i
##
## (21)
##
## n
##
## Acc =
##
## where n is the number of objects in Dt and yi indicates
## the true class of di .
## 3.4.
##
## F1 Metric
##
## To evaluate the performance of classification methods
## in dealing with an unbalanced class distribution, we use
## the F1 metric introduced by Yang and Liu [22]. This
## measure is equal to the harmonic mean of recall ()
## and precision (). The overall F1 score of the entire
## classification problem can be computed by a microaverage and a macro-average.
## Micro-averaged F1 is computed globally over all
## classes, and emphasizes the performance of a classifier
## on common classes. Define and as follows:
## q
##
## q
## T Pi
## i=1 T Pi
## = q i=1
## , = q
## (18)
## i=1 (T Pi + F Pi )
## i=1 (T Pi + F Ni )
## where q is the number of classes. T Pi (True Positives)
## is the number of objects correctly predicted as class i,
## F Pi (False Positives) is the number of objects that are
## predicted to belong to class i but do not. The microaveraged F1 is computed as:
## M icroF 1 =
##
## 2
## +
##
## (19)
##
## Macro-averaged F1 is first computed locally over
## each class, and then the average over all classes is taken.
##
## The larger the MicroF1 and MacroF1 values are, the
## higher the classification performance of the classifier.
## 4.
##
## EXPERIMENTS
##
## In this section, we present two experiments that
## demonstrate the effectiveness of the new random
## forest algorithm for classifying high dimensional data.
## High dimensional datasets with various sizes and
## characteristics were used in the experiments. The
## first experiment is designed to show how our proposed
## method can reduce the generalization error bound
## c/s2 , and improve test accuracy when the size of
## the selected subspace is not too large. The second
## experiment is used to demonstrate the classification
## performance of our proposed method in comparison to
## other classification methods, i.e. SVM, NB and KNN.
## 4.1.
##
## Datasets
##
## In the experiments, we used eight real-world high
## dimensional datasets. These datasets were selected
## due to their diversities in the number of features, the
## number of instances, and the number of classes. Their
## dimensionalities vary from 2000 to 13,195. Instances
## vary from 878 to 3204 and the minority class rate varies
## from 0.22% to 8.52%. In each dataset, we randomly
## select 70% of instances as the training dataset, and
## the remaining data as the test dataset. Detailed
## information of the eight datasets is listed in Table 2.
## The Fbis, Re0, Re1, Tr41, Wap, Tr31, La2s
## and La1s datasets are classical text classification
## benchmark datasets which were carefully selected and
##
## The Computer Journal, Vol. ??,
##
## No. ??,
##
## ????
##
## Hybrid weighted random forests for classifying very high-dimensional data
## preprocessed by Han and Karypis [23]. Dataset Fbis
## was compiled from the Foreign Broadcast Information
## Service TREC-5 [24]. The datasets Re0 and Re1 were
## selected from the Reuters-21578 text categorization test
## collection Distribution 1.0 [25]. The datasets Tr41 and
## Tr31 were derived from TREC-5 [24], TREC-6 [24],
## and TREC-7 [24]. Dataset Wap is from the WebACE
## project (WAP) [26]. The datasets La2s and La1s were
## selected from the Los Angeles Times for TREC-5 [24].
## The classes of these datasets were generated from the
## relevance judgment provided in these collections.
## 4.2.
##
## Performance Comparisons between Random Forest Methods
##
## The purpose of this experiment was to evaluate
## the effect of the hybrid weighted random forest
## method (H W RF) on strength, correlation, c/s2 ,
## and test accuracy.
## The eight high dimensional
## datasets were analyzed and results were compared
## with seven other random forest methods, i.e., C4.5
## random forest (C4.5 RF), CART random forest
## (CART RF), CHAID random forest (CHAID RF),
## hybrid random forest (H RF), C4.5 weighted random
## forest (C4.5 W RF), CART weighted random forest
## (CART W RF), CHAID weighted random forest
## (CHAID W RF). For each dataset, we ran each
## random forest algorithm against different sizes of the
## feature subspaces. Since the number of features in these
## datasets was very large, we started with a subspace
## of 10 features and increased the subspace by 5 more
## features each time. For a given subspace size, we built
## 100 trees for each random forest model. In order to
## obtain a stable result, we built 80 random forest models
## for each subspace size, each dataset and each algorithm,
## and computed the average values of the four measures
## of strength, correlation, c/s2 , and test accuracy as the
## final results for comparison. The performance of the
## eight random forest algorithms on the four measures
## for each of the 8 datasets is shown in Figs. 2, 3, 4, and
## 5.
## Fig. 2 plots the strength for the eight methods against
## different subspace sizes on each of the 8 datasets.
## In the same subspace, the higher the strength, the
## better the result. From the curves, we can see that
## the new algorithm (H W RF) consistently performs
## better than the seven other random forest algorithms.
## The advantages are more obvious for small subspaces.
## The new algorithm quickly achieved higher strength
## as the subspace size increases.
## The seven other
## random forest algorithms require larger subspaces to
## achieve a higher strength. These results indicate that
## the hybrid weighted random forest algorithm enables
## random forest models to achieve a higher strength
## for small subspace sizes compared to the seven other
## random forest algorithms.
## Fig. 3 plots the curves for the correlations for the
## eight random forest methods on the 8 datasets. For
##
## 7
##
## small subspace sizes, H RF, C4.5 RF, CART RF,
## and CHAID RF produce higher correlations between
## the trees on all datasets. The correlation decreases
## as the subspace size increases. For the random forest
## models the lower the correlation between the trees
## then the better the final model.
## With our new
## random forest algorithm (H W RF) a low correlation
## level was achieved with very small subspaces in all
## 8 datasets. We also note that as the subspace size
## increased the correlation level increased as well. This is
## understandable because as the subspace size increases,
## the same informative features are more likely to be
## selected repeatedly in the subspaces, increasing the
## similarity of the decision trees. Therefore, the feature
## weighting method for subspace selection works well for
## small subspaces, at least from the point of view of the
## correlation measure.
## Fig. 4 shows the error bound indicator c/s2 for the
## eight methods on the 8 datasets. From these figures
## we can observe that as the subspace size increases, c/s2
## consistently reduces. The behaviour indicates that a
## subspace size larger than log2 (M )+1 benefits all eight
## algorithms. However, the new algorithm (H W RF)
## achieved a lower level of c/s2 at subspace size of
## log2 (M ) + 1 than the seven other algorithms.
## Fig. 5 plots the curves showing the accuracy of the
## eight random forest models on the test datasets from
## the 8 datasets. We can clearly see that the new random
## forest algorithm (H W RF) outperforms the seven
## other random forest algorithms in all eight data sets.
## It can be seen that the new method is more stable
## in classification performance than other methods. In
## all of these figures, it is observed that the highest test
## accuracy is often obtained with the default subspace size
## of log2 (M ) + 1. This implies that in practice, large
## size subspaces are not necessary to grow high-quality
## trees for random forests.
## 4.3.
##
## Performance Comparisons
## Classification Methods
##
## with
##
## Other
##
## We conducted a further experimental comparison
## against three other widely used text classification
## methods: support vector machines (SVM), Naive
## Bayes (NB), and k-nearest neighbor (KNN). The
## support vector machine used a linear Kernel with a
## regularization parameter of 0.03125, which was often
## used in text categorization. For Naive Bayes, we
## adopted the multi-variate Bernoulli event model that
## is frequently used in text classification [27]. For knearest neighbor (KNN), we set the number k of
## neighbors to 13. In the experiments, we used WEKA's
## implementation for these three text classification
## methods [28]. We used a single subspace size of
## features in all eight datasets to run the random forest
## algorithms. For H RF, C4.5 RF, CART RF, and
## CHAID RF, we used a subspace size of 90 features in
## the first 6 datasets (i.e., Fbis, Re0, Re1, Tr41, Wap, and
##
## The Computer Journal, Vol. ??,
##
## No. ??,
##
## ????
##
## 8
##
## Baoxun Xu, Joshua Zhexue Huang, Graham Williams, Yunming Ye
## Fbis
##
## Re0
##
## 0.52
##
## 0.52
##
## 0.48
##
## 0.48
##
## 0.44
##
## Strength
##
## Strength
##
## 0.44
##
## 0.40
##
## 0.40
##
## H_W_RF
##
## 0.36
##
## C4.5_W_RF
## CART_W_RF
##
## 0.32
##
## H_W_RF
## C4.5_W_RF
##
## 0.36
##
## CART_W_RF
##
## CHAID_W_RF
##
## CHAID_W_RF
##
## 0.32
##
## H_RF
##
## 0.28
##
## H_RF
##
## C4.5_RF
##
## C4.5_RF
##
## 0.28
##
## CART_RF
##
## 0.24
##
## CART_RF
##
## CHAID_RF
##
## CHAID_RF
##
## 0.24
## 10
##
## 20
##
## 30
##
## 40
##
## 50
##
## 60
##
## 70
##
## 80
##
## 90
##
## 100
##
## 10
##
## 20
##
## 30
##
## 50
##
## 60
##
## 70
##
## 80
##
## 90
##
## 100
##
## Number of features
##
## Number of features
##
## Re1
##
## 0.60
##
## 40
##
## Tr41
##
## 0.8
##
## 0.55
##
## 0.7
##
## 0.50
## 0.6
##
## 0.40
##
## H_W_RF
## C4.5_W_RF
##
## 0.35
##
## CART_W_RF
##
## Strength
##
## Strength
##
## 0.45
## 0.5
## H_W_RF
## C4.5_W_RF
##
## 0.4
##
## CART_W_RF
##
## CHAID_W_RF
##
## 0.30
##
## CHAID_W_RF
##
## H_RF
##
## H_RF
##
## 0.3
##
## C4.5_RF
##
## 0.25
##
## C4.5_RF
##
## CART_RF
##
## CART_RF
##
## 0.2
##
## CHAID_RF
##
## CHAID_RF
##
## 0.20
## 10
##
## 20
##
## 30
##
## 40
##
## 50
##
## 60
##
## 70
##
## 80
##
## 90
##
## 100
##
## 10
##
## 20
##
## 30
##
## 40
##
## 50
##
## 60
##
## 70
##
## 80
##
## 90
##
## 100
##
## Number of features
##
## Number of features
##
## Wap
##
## Tr31
##
## 0.9
##
## 0.44
## 0.8
##
## 0.40
##
## 0.36
##
## Strength
##
## H_W_RF
##
## 0.28
##
## C4.5_W_RF
## CART_W_RF
##
## 0.24
##
## Strength
##
## 0.7
##
## 0.32
##
## 0.6
## H_W_RF
## C4.5_W_RF
##
## 0.5
##
## CART_W_RF
## CHAID_W_RF
##
## CHAID_W_RF
## H_RF
##
## 0.20
##
## H_RF
##
## 0.4
##
## C4.5_RF
##
## C4.5_RF
##
## CART_RF
##
## CART_RF
##
## 0.16
##
## 0.3
##
## CHAID_RF
##
## CHAID_RF
##
## 0.12
## 10
##
## 20
##
## 30
##
## 40
##
## 50
##
## 60
##
## 70
##
## 80
##
## 90
##
## 100
##
## 10
##
## 20
##
## 30
##
## La2s
##
## 60
##
## 70
##
## 80
##
## 90
##
## 100
##
## La1s
##
## 0.60
##
## 0.55
##
## 0.55
##
## 0.50
##
## 0.50
##
## 0.45
##
## 0.45
## H_W_RF
##
## 0.40
##
## C4.5_W_RF
## CART_W_RF
## CHAID_W_RF
##
## 0.35
##
## Strength
##
## Strength
##
## 50
##
## Number of features
##
## Number of features
##
## 0.60
##
## 40
##
## H_W_RF
##
## 0.40
##
## C4.5_W_RF
## CART_W_RF
##
## 0.35
##
## CHAID_W_RF
##
## H_RF
## C4.5_RF
##
## 0.30
##
## H_RF
##
## 0.30
##
## C4.5_RF
##
## CART_RF
## CHAID_RF
##
## CART_RF
##
## 0.25
##
## CHAID_RF
##
## 0.25
## 10
##
## 20
##
## 30
##
## 40
##
## 50
##
## 60
##
## 70
##
## 80
##
## 90
##
## 100
##
## 110
##
## 120
##
## 130
##
## 10
##
## 20
##
## Number of features
##
## 30
##
## 40
##
## 50
##
## 60
##
## 70
##
## 80
##
## 90
##
## 100
##
## 110
##
## 120
##
## 130
##
## Number of features
##
## FIGURE 2. Strength changes against the number of features in the subspace on the 8 high dimensional datasets
##
## Tr31) to run the random forest algorithms, and used
## a subspace size of 120 features in the last 2 datasets
## (La2s and La1s) to run these random forest algorithms.
## For H W RF, C4.5 W RF, CART W RF, and
## CHAID W RF, we used Breiman's subspace size of
##
## log2 (M ) + 1 to run these random forest algorithms.
## This number of features provided a consistent result as
## shown in Fig. 5. In order to obtain stable results, we
## built 20 random forest models for each random forest
## algorithm and each dataset and present the average
##
## The Computer Journal, Vol. ??,
##
## No. ??,
##
## ????
##
## Hybrid weighted random forests for classifying very high-dimensional data
## Fbis
##
## 9
##
## Re0
##
## 0.216
##
## 0.285
##
## 0.208
##
## 0.270
##
## Correlation
##
## Correlation
##
## 0.255
##
## 0.200
##
## 0.240
##
## 0.192
## H_W_RF
## C4.5_W_RF
##
## 0.184
##
## CART_W_RF
## CHAID_W_RF
##
## 0.176
##
## H_W_RF
##
## 0.225
##
## C4.5_W_RF
## CART_W_RF
##
## 0.210
##
## CHAID_W_RF
##
## H_RF
## C4.5_RF
##
## 0.168
##
## H_RF
##
## 0.195
##
## C4.5_RF
##
## CART_RF
## CHAID_RF
##
## CART_RF
##
## 0.180
##
## CHAID_RF
##
## 0.160
## 10
##
## 20
##
## 30
##
## 40
##
## 50
##
## 60
##
## 70
##
## 80
##
## 90
##
## 100
##
## 10
##
## 20
##
## 30
##
## Number of features
##
## 40
##
## 50
##
## 70
##
## 80
##
## 90
##
## 100
##
## Number of features
##
## Re1
##
## 0.27
##
## 60
##
## Tr41
## 0.18
##
## 0.26
##
## 0.16
##
## 0.25
##
## 0.23
## H_W_RF
## C4.5_W_RF
##
## 0.22
##
## CART_W_RF
## CHAID_W_RF
##
## 0.21
##
## Correlation
##
## Correlation
##
## 0.24
##
## 0.14
##
## H_W_RF
##
## 0.12
##
## C4.5_W_RF
## CART_W_RF
##
## 0.10
##
## CHAID_W_RF
## H_RF
##
## H_RF
## C4.5_RF
##
## 0.20
##
## C4.5_RF
##
## 0.08
##
## CART_RF
##
## CART_RF
##
## 0.19
##
## CHAID_RF
##
## CHAID_RF
##
## 0.06
## 10
##
## 20
##
## 30
##
## 40
##
## 50
##
## 60
##
## 70
##
## 80
##
## 90
##
## 100
##
## 10
##
## 20
##
## 30
##
## 40
##
## 50
##
## 60
##
## 70
##
## 80
##
## 90
##
## 100
##
## Number of features
##
## Number of features
##
## Wap
##
## Tr31
##
## 0.27
##
## 0.14
##
## 0.26
## 0.12
##
## Correlation
##
## 0.24
##
## 0.23
##
## H_W_RF
## C4.5_W_RF
##
## 0.22
##
## CART_W_RF
## CHAID_W_RF
##
## 0.21
##
## Correlation
##
## 0.25
##
## 0.10
##
## H_W_RF
## C4.5_W_RF
##
## 0.08
##
## CART_W_RF
## CHAID_W_RF
##
## 0.06
##
## H_RF
##
## H_RF
##
## C4.5_RF
##
## 0.20
##
## C4.5_RF
##
## CART_RF
##
## CART_RF
##
## 0.04
##
## CHAID_RF
##
## 0.19
##
## 10
##
## 20
##
## 30
##
## 40
##
## 50
##
## 60
##
## 70
##
## 80
##
## 90
##
## 100
##
## CHAID_RF
##
## 10
##
## 20
##
## 30
##
## 40
##
## 50
##
## 60
##
## 70
##
## 80
##
## 90
##
## 100
##
## Number of features
##
## Number of features
##
## La2s
##
## La1s
##
## 0.162
##
## 0.165
##
## 0.156
##
## 0.160
##
## 0.155
##
## 0.150
##
## H_W_RF
##
## 0.138
##
## C4.5_W_RF
## CART_W_RF
##
## 0.132
##
## CHAID_W_RF
##
## 0.126
##
## Correlation
##
## Correlation
##
## 0.150
##
## 0.144
##
## 0.145
##
## H_W_RF
## C4.5_W_RF
##
## 0.140
##
## CART_W_RF
## CHAID_W_RF
##
## 0.135
##
## H_RF
##
## H_RF
## C4.5_RF
##
## 0.120
##
## C4.5_RF
##
## 0.130
##
## CART_RF
##
## CART_RF
## CHAID_RF
##
## CHAID_RF
##
## 0.125
##
## 0.114
## 10
##
## 20
##
## 30
##
## 40
##
## 50
##
## 60
##
## 70
##
## 80
##
## 90
##
## 100
##
## 110
##
## 120
##
## 130
##
## 10
##
## 20
##
## Number of features
##
## 30
##
## 40
##
## 50
##
## 60
##
## 70
##
## 80
##
## 90
##
## 100
##
## 110
##
## 120
##
## 130
##
## Number of features
##
## FIGURE 3. Correlation changes against the number of features in the subspace on the 8 high dimensional datasets
##
## results, noting that the range of values are less than
## 0.005 and the hybrid trees are always more accurate.
## The comparison results of classification performance
## of eleven methods are shown in Table 3.
## The
## performance is estimated using test accuracy (Acc),
##
## Micro F1 (Mic), and Macro F1 (Mac). Boldface
## denotes best results between eleven classification
## methods.
## While the improvement is often quite
## small, there is always an improvement demonstrated.
## We observe that our proposed method (H W RF)
##
## The Computer Journal, Vol. ??,
##
## No. ??,
##
## ????
##
## 10
##
## Baoxun Xu, Joshua Zhexue Huang, Graham Williams, Yunming Ye
## Fbis
##
## 3.50
##
## Re0
## 4.0
##
## 3.15
##
## log (M)+1
##
## 3.5
##
## 2
##
## 2.80
##
## 3.0
##
## 2.45
##
## C4.5_W_RF
##
## 1.75
##
## 2.5
##
## 2
##
## H_W_RF
##
## C/S
##
## C/S
##
## 2
##
## 2.10
##
## CART_W_RF
##
## H_W_RF
## C4.5_W_RF
##
## 2.0
##
## CART_W_RF
##
## CHAID_W_RF
##
## 1.40
##
## CHAID_W_RF
##
## 1.5
##
## H_RF
##
## H_RF
##
## log (M)+1
## 2
##
## C4.5_RF
##
## 1.05
##
## C4.5_RF
##
## 1.0
##
## CART_RF
##
## CART_RF
##
## CHAID_RF
##
## 0.70
##
## CHAID_RF
##
## 0.5
## 10
##
## 20
##
## 30
##
## 40
##
## 50
##
## 60
##
## 70
##
## 80
##
## 90
##
## 100
##
## 10
##
## 20
##
## 30
##
## Number of features
##
## 70
##
## 80
##
## 90
##
## 100
##
## 3.5
##
## 4.9
##
## log (M)+1
##
## 3.0
##
## 4.2
##
## 2
##
## C/S
##
## H_W_RF
##
## 2
##
## 2.5
##
## 3.5
##
## 2
##
## 60
##
## Tr41
##
## 4.0
##
## 5.6
##
## C/S
##
## 50
##
## Number of features
##
## Re1
##
## 6.3
##
## 40
##
## C4.5_W_RF
##
## 2.8
##
## 2.0
##
## H_W_RF
## C4.5_W_RF
##
## 1.5
##
## CART_W_RF
##
## CART_W_RF
##
## CHAID_W_RF
##
## 2.1
##
## CHAID_W_RF
##
## 1.0
##
## H_RF
##
## H_RF
##
## C4.5_RF
##
## log (M)+1
##
## 1.4
##
## 2
##
## C4.5_RF
##
## 0.5
##
## CART_RF
##
## CART_RF
##
## CHAID_RF
##
## 0.7
##
## CHAID_RF
##
## 0.0
## 10
##
## 20
##
## 30
##
## 40
##
## 50
##
## 60
##
## 70
##
## 80
##
## 90
##
## 100
##
## 10
##
## 20
##
## 30
##
## 40
##
## 50
##
## 60
##
## 70
##
## 80
##
## 90
##
## 100
##
## Number of features
##
## Number of features
##
## Wap
##
## Tr31
##
## 14
##
## 1.50
##
## 12
##
## log (M)+1
##
## log (M)+1
## 2
##
## 1.25
##
## 2
##
## 10
##
## C4.5_W_RF
##
## C/S
##
## 2
##
## C/S
##
## H_W_RF
##
## 6
##
## 2
##
## 1.00
##
## 8
##
## H_W_RF
##
## 0.75
##
## C4.5_W_RF
## CART_W_RF
##
## CART_W_RF
## 0.50
##
## CHAID_W_RF
##
## 4
##
## CHAID_W_RF
## H_RF
##
## H_RF
## C4.5_RF
##
## 2
##
## C4.5_RF
##
## 0.25
##
## CART_RF
##
## CART_RF
## CHAID_RF
##
## CHAID_RF
##
## 0
##
## 10
##
## 20
##
## 30
##
## 40
##
## 50
##
## 60
##
## 70
##
## 80
##
## 90
##
## 0.00
##
## 100
##
## 10
##
## 20
##
## 30
##
## 40
##
## 50
##
## 60
##
## 70
##
## 80
##
## 90
##
## 100
##
## Number of features
##
## Number of features
##
## La1s
##
## La2s
## 2.2
##
## 1.8
##
## 2.0
## 1.6
## 1.8
## 1.4
## 1.6
## 1.2
##
## C4.5_W_RF
## CART_W_RF
##
## 0.8
##
## 2
##
## H_RF
##
## 0.6
##
## C4.5_RF
## CART_W_RF
##
## 0.4
##
## CHAID_RF
##
## 20
##
## 30
##
## 40
##
## 2
##
## H_W_RF
##
## 1.2
##
## C4.5_W_RF
## CART_W_RF
##
## 1.0
##
## CHAID_W_RF
##
## log (M)+1
##
## 10
##
## C/S
##
## C/S
##
## 2
##
## 1.4
## H_W_RF
##
## 1.0
##
## 50
##
## 60
##
## 70
##
## 80
##
## 90
##
## 100
##
## 110
##
## 120
##
## CHAID_W_RF
##
## log (M)+1
## 2
##
## 0.8
##
## H_RF
## C4.5_RF
##
## 0.6
##
## CART_RF
## CHAID_RF
##
## 0.4
##
## 130
##
## 10
##
## 20
##
## Number of features
##
## 30
##
## 40
##
## 50
##
## 60
##
## 70
##
## 80
##
## 90
##
## 100
##
## 110
##
## 120
##
## 130
##
## Number of features
##
## FIGURE 4. c/s2 changes against the number of features in the subspace on the 8 high dimensional datasets
##
## outperformed the other classification methods in all
## datasets.
##
## 5.
##
## CONCLUSIONS
##
## In this paper, we presented a hybrid weighted random
## forest algorithm by simultaneously using a feature
## weighting method and a hybrid forest method to classify
## The Computer Journal, Vol. ??,
##
## No. ??,
##
## ????
##
## Hybrid weighted random forests for classifying very high-dimensional data
## Fbis
##
## 0.86
##
## 11
##
## Re0
##
## 0.88
##
## 0.84
## 0.84
## 0.80
##
## 0.76
##
## 0.80
## H_W_RF
## C4.5_W_RF
##
## 0.78
##
## CART_W_RF
##
## Accuracy
##
## Accuracy
##
## 0.82
##
## CHAID_W_RF
## H_RF
##
## 0.76
##
## 0.72
## H_W_RF
##
## 0.68
##
## C4.5_W_RF
## CART_W_RF
##
## 0.64
##
## CHAID_W_RF
## H_RF
##
## 0.60
##
## C4.5_RF
##
## C4.5_RF
##
## log (M)+1
##
## CART_RF
##
## 2
##
## 0.74
##
## CART_RF
##
## log (M)+1
##
## 0.56
##
## 2
##
## CHAID_RF
##
## CHAID_RF
##
## 0.52
## 10
##
## 20
##
## 30
##
## 40
##
## 50
##
## 60
##
## 70
##
## 80
##
## 90
##
## 100
##
## 10
##
## 20
##
## 30
##
## Number of features
##
## 40
##
## 50
##
## 60
##
## 70
##
## 80
##
## 90
##
## 100
##
## Number of features
##
## Re1
##
## Tr41
##
## 1.00
##
## 0.86
## 0.95
##
## 0.84
##
## log (M)+1
##
## 0.82
##
## 0.90
##
## 2
##
## 0.78
## H_W_RF
## C4.5_W_RF
##
## 0.76
##
## CART_W_RF
##
## 0.74
##
## Accuracy
##
## Accuracy
##
## 0.80
##
## CHAID_W_RF
##
## 0.85
##
## H_W_RF
##
## 0.80
##
## C4.5_W_RF
## CART_W_RF
##
## 0.75
##
## CHAID_W_RF
## H_RF
##
## H_RF
##
## 0.72
##
## log (M)+1
##
## 0.70
##
## C4.5_RF
##
## C4.5_RF
##
## 2
##
## CART_RF
##
## CART_RF
##
## 0.70
##
## 0.65
##
## CHAID_RF
##
## CHAID_RF
##
## 0.68
## 10
##
## 20
##
## 30
##
## 40
##
## 50
##
## 60
##
## 70
##
## 80
##
## 90
##
## 100
##
## 10
##
## 20
##
## 30
##
## 50
##
## 60
##
## 70
##
## 80
##
## 90
##
## 100
##
## Number of features
##
## Number of features
##
## Wap
##
## 0.84
##
## 40
##
## Tr31
##
## 1.000
##
## 0.81
##
## 0.975
##
## 0.78
##
## 0.950
##
## 0.75
##
## H_W_RF
## C4.5_W_RF
##
## 0.69
##
## CART_W_RF
##
## Accuracy
##
## Accuracy
##
## 0.925
##
## 0.72
##
## 0.900
##
## H_W_RF
## C4.5_W_RF
##
## 0.875
##
## CART_W_RF
## CHAID_W_RF
##
## CHAID_W_RF
##
## 0.66
##
## H_RF
##
## log (M)+1
## 2
##
## 0.850
##
## H_RF
##
## C4.5_RF
##
## 0.63
##
## CART_RF
##
## 0.60
##
## C4.5_RF
##
## log (M)+1
## 2
##
## 0.825
##
## CART_RF
## CHAID_RF
##
## CHAID_RF
##
## 0.800
## 10
##
## 20
##
## 30
##
## 40
##
## 50
##
## 60
##
## 70
##
## 80
##
## 90
##
## 100
##
## 10
##
## 20
##
## 30
##
## 50
##
## 60
##
## 70
##
## 80
##
## 90
##
## 100
##
## Number of features
##
## Number of features
##
## La2s
##
## 0.900
##
## 40
##
## La1s
##
## 0.88
##
## 0.885
## 0.86
## 0.870
##
## Accuracy
##
## 0.840
## H_W_RF
## C4.5_W_RF
##
## 0.825
##
## CART_W_RF
## CHAID_W_RF
##
## 0.810
##
## Accuracy
##
## 0.84
##
## 0.855
##
## 0.82
##
## H_W_RF
## C4.5_W_RF
## CART_W_RF
##
## 0.80
##
## CHAID_W_RF
## H_RF
##
## H_RF
##
## log (M)+1
##
## 0.795
##
## C4.5_RF
##
## 2
##
## C4.5_RF
##
## log (M)+1
##
## 0.78
##
## 2
##
## CART_RF
##
## CART_RF
## CHAID_RF
##
## CHAID_RF
##
## 0.780
##
## 0.76
## 10
##
## 20
##
## 30
##
## 40
##
## 50
##
## 60
##
## 70
##
## 80
##
## 90
##
## 100
##
## 110
##
## 120
##
## 130
##
## 10
##
## 20
##
## Number of features
##
## 30
##
## 40
##
## 50
##
## 60
##
## 70
##
## 80
##
## 90
##
## 100
##
## 110
##
## 120
##
## 130
##
## Number of features
##
## FIGURE 5. Test Accuracy changes against the number of features in the subspace on the 8 high dimensional datasets
##
## high dimensional data. Our algorithm not only retains
## a small subspace size (Breiman's formula log2 (M ) + 1
## for determining the subspace size) to create accurate
## random forest models, but also effectively reduces
## the upper bound of the generalization error and
##
## improves classification performance. From the results of
## experiments on various high dimensional datasets, the
## random forest generated by our new method is superior
## to other classification methods. We can use the default
## log2 (M ) + 1 subspace size and generally guarantee
##
## The Computer Journal, Vol. ??,
##
## No. ??,
##
## ????
##
## 12
##
## Baoxun Xu, Joshua Zhexue Huang, Graham Williams, Yunming Ye
##
## TABLE 3. The comparison of results
## datasets
## Dataset
## Fbis
## Measures
## Acc
## Mic
## SVM
## 0.834 0.799
## KNN
## 0.78
## 0.752
## NB
## 0.776 0.74
## H RF
## 0.853 0.816
## C4.5 RF
## 0.836 0.806
## CART RF
## 0.829 0.797
## CHAID RF
## 0.842 0.805
## H W RF
## 0.856 0.825
## C4.5 W RF
## 0.841 0.809
## CART W RF
## 0.835 0.805
## CHAID W RF
## 0.839 0.815
## Dataset
## Wap
## Measures
## Acc
## Mic
## SVM
## 0.81
## 0.772
## KNN
## 0.752 0.622
## NB
## 0.797 0.742
## H RF
## 0.815 0.805
## C4.5 RF
## 0.797 0.795
## CART RF
## 0.793 0.793
## CHAID RF
## 0.805 0.805
## H W RF
## 0.815 0.805
## C4.5 W RF
## 0.805 0.795
## CART W RF
## 0.8
## 0.792
## CHAID W RF
## 0.811 0.795
##
## (best accuracy, Micro F1, and Macro F1 results) of the eleven methods on the 8
## Re0
## Mic
## 0.795
## 0.752
## 0.741
## 0.82
## 0.802
## 0.798
## 0.8
## 0.825
## 0.815
## 0.81
## 0.812
## Tr31
## Mac
## Acc
## Mic
## 0.663 0.955 0.907
## 0.622 0.905 0.82
## 0.559 0.925 0.832
## 0.735 0.965 0.925
## 0.732 0.962 0.902
## 0.73
## 0.958 0.892
## 0.732 0.96
## 0.9
## 0.735 0.965 0.925
## 0.732 0.962 0.911
## 0.73
## 0.96
## 0.902
## 0.73
## 0.96
## 0.905
##
## Mac
## 0.76
## 0.722
## 0.706
## 0.816
## 0.806
## 0.787
## 0.805
## 0.82
## 0.815
## 0.81
## 0.812
##
## Acc
## 0.804
## 0.779
## 0.784
## 0.845
## 0.836
## 0.826
## 0.832
## 0.855
## 0.845
## 0.839
## 0.842
##
## to always produce the best models, on a variety of
## measures, by using the hybrid weighted random forest
## algorithm.
## ACKNOWLEDGEMENTS
## This research is supported in part by NSFC under
## Grant NO.61073195, and Shenzhen New Industry Development Fund under Grant NO.CXB201005250021A
## REFERENCES
## [1] Breiman, L. (2001) Random forests. Machine learning,
## 45, 5-32.
## [2] Ho, T. (1998) Random subspace method for constructing decision forests. IEEE Transactions on Pattern
## Analysis and Machine Intelligence, 20, 832-844.
## [3] Quinlan, J. (1993) C4.5: Programs for machine
## learning. Morgan Kaufmann.
## [4] Breiman, L. (1984) Classification and regression trees.
## Chapman & Hall/CRC.
## [5] Breiman, L. (1996) Bagging predictors.
## Machine
## learning, 24, 123-140.
## [6] Ho, T. (1995) Random decision forests. Proceedings
## of the Third International Conference on Document
## Analysis and Recognition, pp. 278-282. IEEE.
## [7] Dietterich, T. (2000) An experimental comparison of
## three methods for constructing ensembles of decision
## trees: Bagging, boosting, and randomization. Machine
## learning, 40, 139-157.
##
## Mac
## 0.756
## 0.752
## 0.619
## 0.82
## 0.802
## 0.798
## 0.8
## 0.822
## 0.812
## 0.805
## 0.815
## Mac
## 0.87
## 0.762
## 0.81
## 0.88
## 0.87
## 0.86
## 0.852
## 0.88
## 0.87
## 0.865
## 0.855
##
## Re1
## Mic
## 0.826
## 0.668
## 0.732
## 0.832
## 0.811
## 0.808
## 0.815
## 0.836
## 0.826
## 0.818
## 0.83
## la2s
## Acc
## Mic
## 0.89
## 0.832
## 0.841 0.805
## 0.896 0.815
## 0.89
## 0.84
## 0.878 0.83
## 0.882 0.832
## 0.88
## 0.83
## 0.896 0.848
## 0.886 0.835
## 0.887 0.835
## 0.887 0.833
## Acc
## 0.829
## 0.788
## 0.816
## 0.841
## 0.825
## 0.825
## 0.838
## 0.848
## 0.838
## 0.835
## 0.84
##
## Tr41
## Mic
## 0.915
## 0.813
## 0.856
## 0.926
## 0.92
## 0.891
## 0.903
## 0.926
## 0.922
## 0.91
## 0.915
## la1s
## Mac
## Acc
## Mic
## 0.807 0.875 0.82
## 0.786 0.827 0.798
## 0.79
## 0.87
## 0.802
## 0.82
## 0.862 0.825
## 0.81
## 0.855 0.82
## 0.81
## 0.84
## 0.815
## 0.803 0.845 0.816
## 0.825 0.875 0.836
## 0.816 0.866 0.825
## 0.812 0.87
## 0.825
## 0.81
## 0.865 0.825
## Mac
## 0.706
## 0.638
## 0.58
## 0.8
## 0.781
## 0.783
## 0.795
## 0.81
## 0.795
## 0.79
## 0.8
##
## Acc
## 0.95
## 0.915
## 0.935
## 0.953
## 0.948
## 0.917
## 0.926
## 0.953
## 0.95
## 0.935
## 0.942
##
## Mac
## 0.87
## 0.765
## 0.782
## 0.895
## 0.89
## 0.88
## 0.88
## 0.895
## 0.892
## 0.88
## 0.88
## Mac
## 0.803
## 0.761
## 0.775
## 0.805
## 0.798
## 0.792
## 0.795
## 0.82
## 0.81
## 0.81
## 0.805
##
## [8] Banfield, R., Hall, L., Bowyer, K., and Kegelmeyer, W.
## (2007) A comparison of decision tree ensemble creation
## techniques. IEEE Transactions on Pattern Analysis
## and Machine Intelligence, 29, 173-180.
##
## [9] Robnik-Sikonja,
## M. (2004) Improving random forests.
## Proceedings of the 15th European Conference on
## Machine Learning, pp. 359-370. Springer.
## [10] Ho, T. (1998) C4.5 decision forests. Proceedings of
## the Fourteenth International Conference on Pattern
## Recognition, pp. 545-549. IEEE.
## [11] Dietterrich, T. (1997) Machine learning research: four
## current direction. Artificial Intelligence Magzine, 18,
## 97-136.
## [12] Amaratunga, D., Cabrera, J., and Lee, Y. (2008)
## Enriched random forests. Bioinformatics, 24, 2010-
## 2014.
## [13] Ye, Y., Li, H., Deng, X., and Huang, J. (2008)
## Feature weighting random forest for detection of hidden
## web search interfaces. The Journal of Computational
## Linguistics and Chinese Language Processing, 13, 387-
## 404.
## [14] Xu, B., Huang, J., Williams, G., Wang, Q., and
## Ye, Y. (2012) Classifying very high-dimensional data
## with random forests built from small subspaces.
## International Journal of Data Warehousing and
## Mining, 8, 45-62.
## [15] Xu, B., Huang, J., Williams, G., Li, J., and Ye, Y.
## (2012) Hybrid random forests: Advantages of mixed
## trees in classifying text data. Proceedings of the 16th
## Pacific-Asia Conference on Knowledge Discovery and
## Data Mining. Springer.
##
## The Computer Journal, Vol. ??,
##
## No. ??,
##
## ????
##
## Hybrid weighted random forests for classifying very high-dimensional data
## [16] Biggs, D., De Ville, B., and Suen, E. (1991) A method
## of choosing multiway partitions for classification and
## decision trees. Journal of Applied Statistics, 18, 49-62.
## [17] Ture, M., Kurt, I., Turhan Kurum, A., and Ozdamar,
## K. (2005) Comparing classification techniques for
## predicting essential hypertension. Expert Systems with
## Applications, 29, 583-588.
## [18] Begum, N., M.A., F., and Ren, F. (2009) Automatic text summarization using support vector machine.
## International Journal of Innovative Computing, Information and Control, 5, 1987-1996.
## [19] Chen, J., Huang, H., Tian, S., and Qu, Y. (2009)
## Feature selection for text classification with naive
## bayes. Expert Systems with Applications, 36, 5432-
## 5435.
## [20] Tan, S. (2005) Neighbor-weighted k-nearest neighbor
## for unbalanced text corpus.
## Expert Systems with
## Applications, 28, 667-671.
## [21] Pearson, K. (1904) On the Theory of Contingency and
## Its Relation to Association and Normal Correlation.
## Cambridge University Press.
## [22] Yang, Y. and Liu, X. (1999) A re-examination of
## text categorization methods. Proceedings of the 22th
## International Conference on Research and Development
## in Information Retrieval, pp. 42-49. ACM.
## [23] Han, E. and Karypis, G. (2000) Centroid-based
## document classification: Analysis and experimental
## results. Proceedings of the 4th European Conference on
## Principles of Data Mining and Knowledge Discovery,
## pp. 424-431. Springer.
## [24] TREC.
## (2011)
## Text
## retrieval
## conference,
## http://trec.nist.gov.
## [25] Lewis,
## D.
## (1999)
## Reuters-21578
## text
## categorization
## test
## collection
## distribution
## 1.0,
## http://www.research.att.com/ lewis.
## [26] Han, E., Boley, D., Gini, M., Gross, R., Hastings,
## K., Karypis, G., Kumar, V., Mobasher, B., and
## Moore, J. (1998) Webace: A web agent for document
## categorization and exploration. Proceedings of the 2nd
## International Conference on Autonomous Agents, pp.
## 408-415. ACM.
## [27] McCallum, A. and Nigam, K. (1998) A comparison of
## event models for naive bayes text classification. AAAI98 workshop on learning for text categorization, pp. 41-
## 48.
## [28] Witten, I., Frank, E., and Hall, M. (2011) Data Mining:
## Practical machine learning tools and techniques.
## Morgan Kaufmann.
##
## The Computer Journal, Vol. ??,
##
## No. ??,
##
## ????
##
## 13
```

Your

**donation**will support ongoing availability and give you access to the

**PDF version of this book**. Desktop Survival Guides include Data Science, GNU/Linux, and MLHub. Books available on Amazon include Data Mining with Rattle and Essentials of Data Science. Popular open source software includes rattle, wajig, and mlhub. Hosted by Togaware, a pioneer of free and open source software since 1984. Copyright © 1995-2021 Graham.Williams@togaware.com Creative Commons Attribution-ShareAlike 4.0