Candidate Elimination implemented in Haskell

by Emanuel Berg <embe8573@student.uu.se>
Updated: Sat 24 Nov 2012, 06:49

http://user.it.uu.se/~embe8573 (general)
http://user.it.uu.se/~embe8573/candelim/ce.hs (Haskell implementation)
http://user.it.uu.se/~embe8573/candelim/docu.txt (this document)
http://user.it.uu.se/~embe8573/candelim/docu_wrap.txt (ditto with line breaks)

Candidate Elimination (CE) is an AI algorithm. Its purpose is to detect patterns (concepts) as answers to yes-or-no questions. It reaches a conclusion based on a training set of uniform examples: property vectors, all tagged with "yes" or "no".

Yes-or-no question

Start with a yes-or-no question. For example, say you have some controlled environment, like an ecological system inside a big bottle. Your question is: "What organisms will survive in this bottle?" To make it work with CE (which only deals with yes-or-no questions), you rephrase it like this: "Will [some organism] survive?"

Describe the entities of your world

Second, you must find a syntax for describing organisms in a uniform way, as vectors. For example, [a, b, c] and [i, j, k] may describe two organisms in terms of three properties. Note that in the real world, distinct organisms may share hundreds of properties; still, as far as CE is concerned, two identical vectors represent the same organism.

AI note 1: Inductive bias of CE

CE only considers training set examples in terms of the properties that you define. As a consequence, CE can only learn concepts in terms of those properties. There is one exception: although examples are completely described property-by-property, concepts may also include * for "don't care". Concepts are patterns: they are generalizations of training sets (which contain *specific* examples). The purpose of the generalization is to include all positive examples, and exclude the negative ones. So, if you ask the CE what organisms will survive, and you get the answer [a, *, *], this should be read as "all organisms with a as their first property will survive. The second and third properties do not influence whether an organism will survive".

AI note 2: Search space reduction

Most AI methods have some form of search at their center. This is fine as long as it is possible to cover as much of the search space as is necessary to find an answer (within reasonable time, and with reasonable computer resources). If this is not possible, the search space is said to be intractable. In our CE case, if we end up with an intractable search space, we could reduce its size by changing the organism representation. Apart from reducing the search space size, this could also work to our advantage by streamlining the search to focus on relevant concepts (although a lot of them will probably still be meaningless). But it may also work to our disadvantage: if we pick properties carelessly or without enough understanding, we might exclude concepts (making them impossible to find) that *do* matter to our investigation.

Tag your entities

Third, you transplant the organisms into the bottle. When you feel confident whether they live or die, you tag their respective descriptions with "yes" or "no", indicating "lives" or "dies". Note that the same organism cannot be tagged both "lives" and "dies", although that would probably be common in your real-life bottle. In CE, that would represent inconsistent data, and any run on it would crash the algorithm.
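To make the representation concrete, here is a minimal Haskell sketch of how examples, concepts, and tags could be encoded. The names are my own, hypothetical ones -- not necessarily those used in ce.hs:

  -- A concrete property value, or the * ("don't care") of a concept.
  data Value a = Exactly a
               | DontCare
               deriving (Eq, Show)

  type Example a = [a]                -- a fully specified property vector
  type Concept a = [Value a]          -- may contain DontCare
  type Tagged  a = (Bool, Example a)  -- True = "yes" (lives), False = "no" (dies)

  -- Does a concept include an example? Every position must either be
  -- DontCare or match the example's value exactly.
  includes :: Eq a => Concept a -> Example a -> Bool
  includes c e = and (zipWith ok c e)
    where ok DontCare    _ = True
          ok (Exactly v) x = v == x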
Although CE cannot recover from an inconsistent (crash) state, it can detect when it cannot proceed and notify the user.

The annotation of the examples makes the yes-or-no question implicit.

The *order* in which you extract examples from the training set and feed them to the CE algorithm matters. Notably, the first example must be positive. (Apart from that, I did not examine how the order matters.) So, strictly speaking, the training "set" is not a set but a list.

AI note: CE learning is supervised

CE is an example of "supervised learning": we annotate our examples' property vectors with "yes" or "no" based on external (non-CE) observations (or calculations).

Run the training set

The first organism in the training set must be a survivor. Also, as said, all organisms must be completely described property-by-property (no unknown or don't-care values). Also, the vectors must be uniform for all organisms: the same properties, the same number of properties, in the same order. If all is good, you feed the examples (in order) into the CE, and it gives you an answer.

How it works

Besides the training set, you use a concept S and a set of concepts G. The S concept represents the *inner* bound, and the concepts of G the *outer* bound. A concept differs from a property vector only in that a concept may include an * for "don't care". In CE, the search for the concept that answers your question is done *bidirectionally*: from S to G, and from G to S, simultaneously, until they intersect (or you run out of examples): what is left is your answer. (S is for specific, G for general.)

To initialize CE, you remove the first example from the training set. You set S to the properties of that example (i.e., you remove the plus sign). This will be the starting point for S's movement outward, toward G. At the same moment, you set G to the so-called top concept, which includes everything. So, if your examples look like this: (sign, [property_a, property_b, property_c]), then G looks like this: [*, *, *]. This will be the starting point for G's movement inward, toward S.

In general, S could be thought of as a set. But I implemented it as a single concept, because: "If concepts are vectors of independent properties and * (don't care), then two concepts always have a unique most specific generalization. In that case S will always have 1 element. Thus if you remove a concept from S you have inconsistency." (http://www.it.uu.se/edu/course/homepage/ai/ht11/learning-logic.html)

What CE does on S on a positive example

On a positive example, S must be generalized to include it, if S indeed excludes it. This is done by replacing with * (don't care) every property in S that differs from the respective property of the example. E.g.,

  e = +[a, b, c]
  S =  [a, *, d]

The first property is fine. As it is the same in S and e, it does not make S exclude e. The second property is also fine. As it is * (don't care) in S, it does not make S exclude e ("don't care" includes e's second property value, b, as well as all other possible values for that property). But the third property value must be generalized in S. As it is d in S, this is equivalent to saying that e cannot be a positive example (an organism that lives), as e has value c for its third property. But e *is* a positive example (indicated by the plus sign), and this is an indisputable observation made externally of the CE algorithm. (That is, the training set examples have the final say.) So, S has to drop its insistence on d. Thus, we set S to [a, *, *], making S more general.

Last, we must make sure that the new S does not include any previous negative examples. If it does, the data is inconsistent.
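Using the types sketched above, the generalization step could look like this in Haskell (again a sketch under my hypothetical names, not necessarily the code of ce.hs):

  -- Generalize S to include a positive example: keep a position if it
  -- already matches the example's value; otherwise widen it to *.
  generalize :: Eq a => Concept a -> Example a -> Concept a
  generalize = zipWith widen
    where widen (Exactly v) x | v == x = Exactly v
          widen _           _          = DontCare

For the example above, generalize [Exactly 'a', DontCare, Exactly 'd'] "abc" evaluates to [Exactly 'a', DontCare, DontCare], i.e., [a, *, *].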
What CE does on G on a positive example

On a positive example, we must iterate over G and drop all concepts in G that do not include our positive example. E.g.,

  e = +[a, b, c]
  G = { [a, b, *], [*, *, d] }

The first concept in G is fine, as it includes e. But the second concept of G must be dropped as it, by insisting on a third property value of d, excludes e. Note the difference here compared with the action on S. As we are dropping (not modifying) concepts, we don't have to check for consistency.

What CE does on S on a negative example

On a negative example, we need to check whether it is included by S. If it is, the data (the entire training set, collectively) is inconsistent.

What CE does on G on a negative example

On a negative example, all concepts in G must be specialized to exclude the negative example (that is, those that indeed include it). In addition, checks must make sure that new concepts in G do not exclude any previous positive examples. Those that do are dropped. (I'm unsure if this is wise. Perhaps, if a drop has to be made here, the data is inconsistent.) Also, concepts that are too specific -- those overlapped by more general concepts in G -- should be removed.

Specialization of concepts in G is done in terms of S. E.g.,

  S =  [a, b, d]
  e = -[a, b, c]
  G = { [*, *, d], [*, b, *] }

Here, the first concept of G is fine: it already excludes e. The second concept of G does include e. We need to find an * in it and replace it with the corresponding value in S. The first such instance -- the first property -- does not help us. As it is the same -- a -- in both S and e, it cannot help the second concept of G exclude e. But the second (and last) such instance, the third property -- * in the G concept, d in S -- differs from the corresponding third property value of e, namely c. So, the second concept of G is redefined into [*, b, d]. However, this is a specialization of an existing concept in G, namely [*, *, d]. So, [*, b, d] should be dropped altogether, leaving G = { [*, *, d] }. As we did not introduce any new concepts in G this time around (we did, but we instantly dropped it), there is no need to check whether the concepts of G exclude any previous positive examples.

What just happened?

Recall, we just made S more *general*. Now, we made G less general, thus more *specific*. We are encircling our target (the answer to our question) from two directions: from the specific to the general, as well as the other way around. This can't be stressed enough, as it is the core of CE.

When CE terminates

When the last example has been fed (and CE is in a consistent state), the answer is all concepts that lie between the S concept (the "inside") and the "outside", whose borders are defined by the concepts of G. If G contains a single concept, and that concept is equal to the S concept, then no more narrowing down is possible.
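Putting the pieces together, here is a sketch of the specialization step and the overall loop, using the hypothetical names from the earlier sketches. It is simplified: the consistency checks and the pruning of overlapped concepts in G (as in the [*, b, d] example above) are omitted:

  -- Specialize each concept in G against a negative example e, guided
  -- by S: a concept that already excludes e is kept as-is; a concept
  -- that includes e is replaced by one new concept per * position
  -- where S holds a concrete value differing from e's value there.
  specializeAll :: Eq a => Concept a -> Example a -> [Concept a] -> [Concept a]
  specializeAll s e = concatMap spec
    where
      spec c
        | not (c `includes` e) = [c]
        | otherwise =
            [ replace i v c
            | (i, (DontCare, Exactly v)) <- zip [0..] (zip c s)
            , v /= e !! i ]
      replace i v c = take i c ++ [Exactly v] ++ drop (i + 1) c

  -- The overall loop: the first (positive) example initializes S,
  -- G starts as the top concept, and every further example moves
  -- one of the two bounds.
  candElim :: Eq a => [Tagged a] -> (Concept a, [Concept a])
  candElim ((True, e0) : rest) = foldl step (s0, g0) rest
    where
      s0 = map Exactly e0                    -- S := the first positive example
      g0 = [replicate (length e0) DontCare]  -- G := { [*, *, ..., *] }
      step (s, g) (True,  e) = (generalize s e, filter (`includes` e) g)
      step (s, g) (False, e) = (s, specializeAll s e g)
  candElim _ = error "the first example must be positive"

For instance, candElim [(True, "abc"), (False, "abd")] evaluates to S = [a, b, c] and G = { [*, *, c] }: the negative example forces the top concept to commit to c in the third position.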