{"id":43,"date":"2013-09-07T18:19:10","date_gmt":"2013-09-08T00:19:10","guid":{"rendered":"http:\/\/inductivebias.com\/Blog\/?p=43"},"modified":"2015-10-10T23:59:58","modified_gmt":"2015-10-11T05:59:58","slug":"naive-bayes","status":"publish","type":"post","link":"http:\/\/inductivebias.com\/Blog\/naive-bayes\/","title":{"rendered":"Naive Bayes"},"content":{"rendered":"<div style=\"word-wrap: normal; -webkit-hyphens: none; -moz-hyphens: none; hyphens: none;\"><a href=\"http:\/\/inductivebias.com\/Blog\/wp-content\/uploads\/2013\/08\/NormalDistriubtions.jpg\"><img loading=\"lazy\" decoding=\"async\" class=\"size-thumbnail wp-image-44 alignleft\" src=\"http:\/\/inductivebias.com\/Blog\/wp-content\/uploads\/2013\/08\/NormalDistriubtions-150x150.jpg\" alt=\"Normal Distriubtions\" width=\"150\" height=\"150\" srcset=\"http:\/\/inductivebias.com\/Blog\/wp-content\/uploads\/2013\/08\/NormalDistriubtions-150x150.jpg 150w, http:\/\/inductivebias.com\/Blog\/wp-content\/uploads\/2013\/08\/NormalDistriubtions-300x300.jpg 300w, http:\/\/inductivebias.com\/Blog\/wp-content\/uploads\/2013\/08\/NormalDistriubtions.jpg 330w\" sizes=\"auto, (max-width: 150px) 100vw, 150px\" \/><\/a><span style=\"font-size: small;\"><strong>Assume a distribution and conditional independence, calculate means and standard deviations, and use it to make predictions. It&#8217;s about the simplest thing that qualifies as machine learning. I&#8217;m not really a fan, but we&#8217;ve got to start somewhere, right?<\/strong><\/span><!--more--><\/div>\n<div style=\"word-wrap: normal; -webkit-hyphens: none; -moz-hyphens: none; hyphens: none;\">Today we&#8217;re talking about the Naive Bayes Classifier. Not to be confused with the <em>everything else<\/em> that has Bayes&#8217; name on it.\u00a0 There&#8217;s a whole branch of Bayesian statistics. It&#8217;s all very interesting, but today we&#8217;re just talking about the parts it takes to build the most basic of classifiers, why it&#8217;s done the way it is and when it&#8217;s likely to work.<\/p>\n<h2><span style=\"color: #000080;\">Building the Classifier<\/span><\/h2>\n<p>The Naive Bayes classifier is a generative model (as opposed to a discriminative model), which means it works by building an approximate statistical model, and then using that model to calculate probabilities which are then used to make predictions. What makes the Naive Bayes model so naive is that all the inputs are assumed to be independent of each other (i.e. knowing one input doesn&#8217;t give you any information about any of the others), and typically the distributions are assumed to be a fixed shape.\u00a0 With these assumptions the &#8220;training&#8221; portion of the algorithm just consists of calculating the means and standard deviations for each input separately for each class.<\/p>\n<p>For the sake of brevity I&#8217;m going to assume you can figure out how to calculate variances. If not there&#8217;s <a href=\"http:\/\/en.wikipedia.org\/wiki\/Algorithms_for_calculating_variance\">a nice Wikipedia article<\/a> devoted to it. The important part here is that you calculate a separate distribution for each class, so that we can compare their likelihoods on new inputs. For my example I use the binary version of the algorithm, but there&#8217;s nothing stopping you from making a version that works on an arbitrary number of classes.<\/p>\n<script src=\"https:\/\/gist.github.com\/6482644.js\"><\/script><noscript><pre><code class=\"language-java java\">\/\/constructs a naive Bayes binary classifier\npublic NaiveBayes(double in[][], boolean out[]){\n   int inputs = in[0].length ;\n   \/\/initialize sums and sums of squares for each class\n   double[] poss = new double[inputs], poss2 = new double[inputs];\n   double[] negs = new double[inputs], negs2 = new double[inputs];\n   \/\/calculate amount of each class, sums, and sums of squares\n   for(int k=0;k&lt;in.length;k++){\/\/for each data point\n      if(out[k]){\n         positives++;\/\/keep track of total positives\n         for(int j=0;j&lt;inputs;j++){\/\/for each input\n            poss[j] += in[k][j] ;\n            poss2[j] += in[k][j]*in[k][j] ;\n         }\n      }else{\n         negatives++;\/\/keep track of total negatives\n         for(int j=0;j&lt;inputs;j++){\/\/for each input\n            negs[j] += in[k][j] ;\n            negs2[j] += in[k][j]*in[k][j] ;\n         }\n      }\n   }\n   \/\/initialize means and variances\n   posmean = new double[inputs] ;\n   posvariance = new double[inputs] ;\n   negmean = new double[inputs] ;\n   negvariance = new double[inputs] ;\n   \/\/calculate means and variances\n   for(int j=0;j&lt;inputs;j++){\/\/for each input\n      posmean[j] = poss[j]\/positives;\n      negmean[j] = negs[j]\/negatives;\n      posvariance[j] = (poss2[j] - ((poss[j]*poss[j])\/positives))\/(positives - 1) ;\n      negvariance[j] = (negs2[j] - ((negs[j]*negs[j])\/negatives))\/(negatives - 1) ;\n   }\n}<\/code><\/pre><\/noscript>\n<p>Alright, so we&#8217;ve determined the means and variances, but we still have to decide what type of distribution to use before we can calculate any probabilities.\u00a0 We could use our data to determine what type of distributions to use, or we could just pick a winner and hope for the best.\u00a0 Let&#8217;s go with the <a href=\"http:\/\/en.wikipedia.org\/wiki\/Normal_distribution\">normal distribution<\/a>. This algorithm is called naive after all, and I&#8217;m feeling lucky.<\/p>\n<figure id=\"attachment_48\" aria-describedby=\"caption-attachment-48\" style=\"width: 432px\" class=\"wp-caption aligncenter\"><a href=\"http:\/\/inductivebias.com\/Blog\/wp-content\/uploads\/2013\/09\/normalpdf.jpg\"><img loading=\"lazy\" decoding=\"async\" class=\"size-full wp-image-48  \" src=\"http:\/\/inductivebias.com\/Blog\/wp-content\/uploads\/2013\/09\/normalpdf.jpg\" alt=\"Exponential decay, bitches.\" width=\"432\" height=\"154\" srcset=\"http:\/\/inductivebias.com\/Blog\/wp-content\/uploads\/2013\/09\/normalpdf.jpg 432w, http:\/\/inductivebias.com\/Blog\/wp-content\/uploads\/2013\/09\/normalpdf-300x106.jpg 300w\" sizes=\"auto, (max-width: 432px) 100vw, 432px\" \/><\/a><figcaption id=\"caption-attachment-48\" class=\"wp-caption-text\">The probability density of a normal distribution.<\/figcaption><\/figure>\n<p>&nbsp;<\/p>\n<p>Why the normal distribution you say? Normal distributions tend to pop up frequently in &#8220;natural&#8221; data. To understand this we need look no further than <a href=\"http:\/\/en.wikipedia.org\/wiki\/Central_limit_theorem\">the central limit theorem,<\/a> which states &#8220;given certain conditions, the arithmetic mean of a sufficiently large number of iterates of independent random variables, each with a well-defined expected value and well-defined variance, will be approximately normally distributed&#8221;. That is, if the thing you&#8217;re interested in is actually the result of adding a whole bunch of smaller random things together, no matter what their distributions are, then it&#8217;ll be close to a normal distribution.<\/p>\n<h2 style=\"word-wrap: normal; -webkit-hyphens: none; -moz-hyphens: none; hyphens: none;\"><span style=\"color: #000080;\">Applying the Classifier<br \/>\n<\/span><\/h2>\n<p>Assuming a normal distribution all we have to do is plug our new x&#8217;s into the above formula with our mean (\u00b5) and our variance ( \u03c3^2 is variance since \u03c3 is standard deviation in that formula).\u00a0 Since each of our variables is assumed to be independent from the others the probability of seeing the entire x vector is the multiple of seeing each piece.<\/p>\n<script src=\"https:\/\/gist.github.com\/6482692.js\"><\/script><noscript><pre><code class=\"language-java java\">public double ProbabilityOfInputIfPositive(double in[]){\n   double prob = 1\/Math.sqrt(2 * Math.PI) ;\n   for(int j=0; j&lt;in.length;j++){\n      prob*= Math.exp(- (in[j]-posmean[j])*(in[j]-posmean[j]) \/ (2*posvariance[j]) ) ;\n   }\n   return prob ;\n}<\/code><\/pre><\/noscript>\n<p>That&#8217;s all well and good, but we&#8217;re not interested in the probability of seeing this X. We&#8217;re interested in the probability of this X being in each class.\u00a0 We can get the probability we actually want by applying <a href=\"http:\/\/en.wikipedia.org\/wiki\/Bayes%27_theorem\">Bayes&#8217; Theorem<\/a>:<\/p>\n<p style=\"text-align: center;\"><a href=\"http:\/\/inductivebias.com\/Blog\/wp-content\/uploads\/2013\/09\/BayesTheorem.png\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-50\" src=\"http:\/\/inductivebias.com\/Blog\/wp-content\/uploads\/2013\/09\/BayesTheorem.png\" alt=\"BayesTheorem\" width=\"216\" height=\"48\" \/><\/a><\/p>\n<p>&nbsp;<\/p>\n<p style=\"text-align: left;\">In this case A is the probability of being positive, and B is the probability of this input vector. Thus the probability of being positive given this input vector ( what we want) is the probability of this input vector given that we&#8217;re positive (P( B|A) = what we calculated above) times the probability of being positive overall divided by the probability of this input vector overall.\u00a0 We can get the probability of being positive overall by counting up what fraction of the total training data was positive, but what about the overall probability of the input data point? We didn&#8217;t calculate that.<\/p>\n<p style=\"text-align: left;\">Here&#8217;s the trick: we&#8217;re going to assume all data points are either positive or negative cases. That is P(positive) + P(negative) = 1.\u00a0 Given that assumption we only have to care about the relative probability of positive versus negative. When we get to the end we can just divide out by the sum to get the absolute probabilities. Since P(B) above is the same for both positive and negative cases it doesn&#8217;t effect the final result and we can safely ignore it. Likewise, that 1\/sqrt(2*pi) term on the normal is the same for both and has no effect on the final result.\u00a0 All we really need to do is calculate the chance of the input being in each class and then multiply that by the overall chance of that class.<\/p>\n<p style=\"text-align: left;\">One final trick before I get to the final code: move the exponential outside of the loop. The multiple of a bunch of exponentials\u00a0 is the exponential of a sum: e^a * e^b = e^(a+b).\u00a0 It&#8217;s not a big deal, but it saves a few unnecessary computations.<\/p>\n<script src=\"https:\/\/gist.github.com\/6482703.js\"><\/script><noscript><pre><code class=\"language-java java\">\/\/Calculate the probability that the given input is in the positive class\npublic double probability(double in[]){\n   double relativepositive=0,relativenegative=0;\n   for(int j=0; j&lt;in.length; j++){\n      relativepositive += (in[j]-posmean[j])*(in[j]-posmean[j]) \/ posvariance[j] ;\n      relativenegative += (in[j]-negmean[j])*(in[j]-negmean[j]) \/ negvariance[j] ;\n   }\n   relativepositive = positives*Math.exp(0.5*relativepositive) ;\n   relativenegative = negatives*Math.exp(0.5*relativenegative) ;   \n   return relativepositive \/ (relativepositive + relativenegative) ;\n}<\/code><\/pre><\/noscript>\n<h2 style=\"word-wrap: normal; -webkit-hyphens: none; -moz-hyphens: none; hyphens: none;\"><span style=\"color: #000080;\">Conclusion<\/span><\/h2>\n<p>This function and the constructor at the top are all you need to make a Naive Bayes Classifier. Congratulations, you can now predictively model things! Unfortunately, this is one of the simpler, and generally weaker algorithms. That being said, <del><\/del> it&#8217;s a lot better than nothing, and sometimes it&#8217;s &#8220;good enough&#8221;. Let&#8217;s talk about when that is.<\/p>\n<p style=\"text-align: center;\"><a href=\"http:\/\/inductivebias.com\/Blog\/wp-content\/uploads\/2013\/09\/naive_bayes_graph.png\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-medium wp-image-52\" src=\"http:\/\/inductivebias.com\/Blog\/wp-content\/uploads\/2013\/09\/naive_bayes_graph-300x184.png\" alt=\"naive_bayes_graph\" width=\"300\" height=\"184\" srcset=\"http:\/\/inductivebias.com\/Blog\/wp-content\/uploads\/2013\/09\/naive_bayes_graph-300x184.png 300w, http:\/\/inductivebias.com\/Blog\/wp-content\/uploads\/2013\/09\/naive_bayes_graph.png 888w\" sizes=\"auto, (max-width: 300px) 100vw, 300px\" \/><\/a><\/p>\n<p style=\"text-align: left;\">This is a picture of a Naive Bayesian network. I&#8217;m not covering Bayesian networks in this article, but what&#8217;s important is that each arrow in a Bayesian network represents a dependence. In a Naive Bayesian network each input is treated as being dependent on only the output class or label. This and the assumption of normal distributions make up the inductive bias of the Naive Bayes classifier.<\/p>\n<p style=\"text-align: left;\">If your inputs are dependent on the class but not dependent on each other and those inputs are random variables that could be made by adding up lots of other little random variables then this Naive Bayes Classifier will work pretty well. Naive Bayes also has one other significant advantage: missing data. Since all the variables are considered independently, removing one from the probability calculation produces the same result as if you had trained without that variable to begin with; It&#8217;s pretty straightforward to build a Naive Bayes classifier that handles data with lots of null values.\u00a0 Of course, the primary reason these get used as much as they do is because they&#8217;re easy to understand and quick to program and train. Training, if you can call it that, can be done in a single loop over the data, which is to say this is one of the fastest &#8220;machine learning&#8221; algorithms in existence.<\/p>\n<p style=\"text-align: left;\">The reason not to use a Naive Bayes classifier? Accuracy.\u00a0 More sophisticated techniques can get much higher accuracy and deal with much more complex problems. I will be getting into all of the fancier techniques I can find as well as delving a bit more into general topics such as optimization, numerical and data structure tricks, and testing (so you don&#8217;t have to take my word on the accuracy).<\/p>\n<\/div>\n","protected":false},"excerpt":{"rendered":"<p>Assume a distribution and conditional independence, calculate means and standard deviations, and use it to make predictions. It&#8217;s about the simplest thing that qualifies as machine learning. I&#8217;m not really a fan, but we&#8217;ve got to start somewhere, right?<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[4,5],"tags":[],"class_list":["post-43","post","type-post","status-publish","format-standard","hentry","category-machine-learning-algorithm","category-tutorial"],"_links":{"self":[{"href":"http:\/\/inductivebias.com\/Blog\/wp-json\/wp\/v2\/posts\/43","targetHints":{"allow":["GET"]}}],"collection":[{"href":"http:\/\/inductivebias.com\/Blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"http:\/\/inductivebias.com\/Blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"http:\/\/inductivebias.com\/Blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"http:\/\/inductivebias.com\/Blog\/wp-json\/wp\/v2\/comments?post=43"}],"version-history":[{"count":31,"href":"http:\/\/inductivebias.com\/Blog\/wp-json\/wp\/v2\/posts\/43\/revisions"}],"predecessor-version":[{"id":432,"href":"http:\/\/inductivebias.com\/Blog\/wp-json\/wp\/v2\/posts\/43\/revisions\/432"}],"wp:attachment":[{"href":"http:\/\/inductivebias.com\/Blog\/wp-json\/wp\/v2\/media?parent=43"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"http:\/\/inductivebias.com\/Blog\/wp-json\/wp\/v2\/categories?post=43"},{"taxonomy":"post_tag","embeddable":true,"href":"http:\/\/inductivebias.com\/Blog\/wp-json\/wp\/v2\/tags?post=43"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}