I- Introduction:

II- Basic concepts:

For example, for the shopping-basket application, the list of items corresponds to the set of items available in the supermarket: {Milk, Bread, Cheese, etc.}.

For example, two possible transactions that describe purchases in a supermarket are: t1 = {Milk, Cheese, Bread} and t2 = {Milk, Cheese, Chocolate}.

Example:

The support of {Bread, Milk, Cheese} is equal to 2.

 

Example:

We have the association rule below:

                                                 R1: if {Bread, Milk} then {Cheese}           

 

In this rule, 100% is the confidence of the rule and 40% is its support:
"100% of the transactions that contain the items Bread and Milk also contain Cheese, and 40% of the transactions contain these three items."

"Good" rule = rule with high support and confidence

III- Apriori Algorithm:

Definition:

Apriori is a classical algorithm for mining frequent itemsets and deducing association rules from a transaction database.

Inputs of the Apriori algorithm: the transaction database and the minimum support threshold (Minsup).

Principle of the algorithm:


By using the frequency property of itemsets (every subset of a frequent itemset must itself be frequent), we can build the candidate itemsets level by level:

Example:

We consider the set of frequent 3-itemsets below:

S = {(A,B,C), (A,B,D), (A,C,D), (A,C,E), (B,C,D)}
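One possible reading of this example, sketched in Python: two 3-itemsets that share their first two items are joined into a 4-candidate, and a candidate is pruned when one of its 3-subsets is not in S (here {A,C,D,E} is pruned because {A,D,E} is not frequent):

    from itertools import combinations

    # The frequent 3-itemsets S from the example above, as sorted tuples.
    S = [("A","B","C"), ("A","B","D"), ("A","C","D"), ("A","C","E"), ("B","C","D")]

    def apriori_gen(frequent_k, k):
        """Join two k-itemsets sharing their first k-1 items, then prune
        every candidate that has an infrequent k-subset."""
        frequent_set = set(frequent_k)
        candidates = []
        for i in range(len(frequent_k)):
            for j in range(i + 1, len(frequent_k)):
                a, b = frequent_k[i], frequent_k[j]
                if a[:k-1] == b[:k-1]:                     # join condition
                    cand = tuple(sorted(set(a) | set(b)))  # candidate (k+1)-itemset
                    if all(sub in frequent_set for sub in combinations(cand, k)):
                        candidates.append(cand)            # survives the prune step
        return candidates

    print(apriori_gen(S, 3))  # [('A', 'B', 'C', 'D')]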

Functioning of the algorithm:

Example:

I = {bread, milk, cheese, jam, chocolate}

In this table, we give each item a number in order to simplify the next steps.

Iteration 1:

We have the following transaction database, and we suppose that the minimum support count = 2. Since the support of the itemset {4} is 1, which is lower than 2, we remove it in order to create L1.

Iteration 2:

We will generate the list of itemsets of length 2, then remove the itemsets whose support is lower than the minimum support, in order to obtain the list of frequent itemsets of length 2.

Since the support of the itemset {1,2} is 1, which is lower than 2, we remove it in order to create L2.

Iteration 3:

We will generate the list of itemsets of length 3, this time in order to create the list of frequent itemsets of length 3.

We will apply the pruning method. First, we divide each candidate set into subsets of length 2; then we check whether one of these subsets is missing from the list of frequent itemsets of length 2, in which case the candidate set is eliminated. We keep only the candidates that contain no rare subset, namely {1, 3, 5} and {2, 3, 5}; finally, we compute their supports from the transaction database.

Iteration 4:

We will generate the list of itemsets of length 4, in order to create the list of frequent itemsets of length 4.

By pruning, this candidate must be eliminated, since it has two rare subsets. And even if we compared its support with the minimum support, it would also be eliminated.

Now, we don't have any more frequent itemsets so the algorithm is finished. 
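The four iterations can be replayed with a compact Apriori sketch. The transaction database below is an assumption (the original table is shown only as a figure); it was chosen to be consistent with the supports quoted above (sup({4}) = 1, sup({1,2}) = 1, and the frequent itemsets listed in the next paragraph):

    from itertools import combinations

    # Assumed transaction database, consistent with the supports quoted above.
    transactions = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}, {1, 3, 5}]
    MIN_SUPPORT = 2

    def apriori(transactions, min_support):
        """Return every frequent itemset together with its support count."""
        items = sorted({i for t in transactions for i in t})
        frequent = {}
        k, candidates = 1, [(i,) for i in items]
        while candidates:
            # count the support of each candidate in one pass over the database
            counts = {c: sum(1 for t in transactions if set(c) <= t) for c in candidates}
            level = {c: n for c, n in counts.items() if n >= min_support}
            frequent.update(level)
            # join frequent k-itemsets sharing their first k-1 items, then prune
            keys = sorted(level)
            joined = [tuple(sorted(set(a) | set(b)))
                      for i, a in enumerate(keys) for b in keys[i+1:]
                      if a[:k-1] == b[:k-1]]
            candidates = [c for c in joined
                          if all(sub in level for sub in combinations(c, k))]
            k += 1
        return frequent

    print(apriori(transactions, MIN_SUPPORT))
    # {(1,): 3, (2,): 3, (3,): 4, (5,): 4, (1, 3): 3, (1, 5): 2, (2, 3): 2,
    #  (2, 5): 3, (3, 5): 3, (1, 3, 5): 2, (2, 3, 5): 2}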

Frequent itemsets:

The set L of the frequent itemsets is the union of the sets L1, ..., Lk.

L1 = {1, 2, 3, 5}        L2 = {(1,3), (1,5), (2,3), (2,5), (3,5)}        L3 = {(1,3,5), (2,3,5)}

                     L = {1, 2, 3, 5, (1,3), (1,5), (2,3), (2,5), (3,5), (1,3,5), (2,3,5)}

 

Construction of rules:

For each frequent itemset, rules are constructed and only those that satisfy the confidence threshold (Minconf) are kept.

We suppose that Minconf=100%

These are the rules generated:
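This construction can be sketched in Python from the support counts found above; with Minconf = 100%, it yields the rules {1} → {3}, {2} → {5}, {1,5} → {3} and {2,3} → {5}:

    from itertools import combinations

    # Support counts of the frequent itemsets found by the iterations above.
    support = {
        (1,): 3, (2,): 3, (3,): 4, (5,): 4,
        (1, 3): 3, (1, 5): 2, (2, 3): 2, (2, 5): 3, (3, 5): 3,
        (1, 3, 5): 2, (2, 3, 5): 2,
    }
    MINCONF = 1.0  # 100%

    def rules(support, minconf):
        """For each frequent itemset, emit every rule X -> Y with X u Y equal
        to the itemset and conf = sup(itemset) / sup(X) >= minconf."""
        out = []
        for itemset, sup in support.items():
            for r in range(1, len(itemset)):
                for x in combinations(itemset, r):
                    if sup / support[x] >= minconf:
                        y = tuple(i for i in itemset if i not in x)
                        out.append((x, y, sup / support[x]))
        return out

    for x, y, conf in rules(support, MINCONF):
        print(f"{x} -> {y}  (confidence {conf:.0%})")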

In other words, the products that can be bought together are: bread and cheese; milk and chocolate; bread, chocolate and cheese; milk, cheese and chocolate.

The disadvantages of the Apriori algorithm:

IV- Frequent-Pattern Growth (FP-Growth) Algorithm:

Definition:

FP-Growth is an algorithm that finds the frequent itemsets without generating candidate itemsets, unlike Apriori.

Principle of the algorithm:

It consists first of compressing the database into a compact structure called an FP-tree (Frequent-Pattern tree), then dividing it into projections of the database called conditional databases.

An FP-tree is a compact structure consisting of:

  1. Tree: it is composed of a null root; each node of the tree contains three pieces of information: the name of the item, a support count, and a node-link pointing to the next node of the tree that carries the same item.

  2. Index: it is a header table that contains the list of frequent items and points to the first occurrence of each item. Each entry in this table contains the name of an item, its support, and a link to its first occurrence in the tree.

The advantage of this representation of the data is that it is sufficient to follow the inter-node links to find all the frequent associations in which a frequent item appears.
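A minimal sketch of this structure in Python (the class and field names are illustrative, not a fixed API). The insert method implements the rule used in the walk-through below: reuse and increment an existing child, or create a new node with a support of 1 and chain it into the node-links of the header table:

    class FPNode:
        def __init__(self, item, parent):
            self.item = item        # name of the item (None for the null root)
            self.count = 1          # support count of the path through this node
            self.parent = parent    # parent node, used to read prefix paths back
            self.children = {}      # item -> child FPNode
            self.node_link = None   # next node of the tree carrying the same item

    class FPTree:
        def __init__(self):
            self.root = FPNode(None, None)  # null root
            self.root.count = 0
            self.header = {}                # item -> first occurrence in the tree

        def insert(self, transaction):
            """Insert one transaction whose items are already sorted by
            descending support count."""
            node = self.root
            for item in transaction:
                if item in node.children:
                    node.children[item].count += 1   # article already present
                else:
                    child = FPNode(item, node)       # new article, support = 1
                    node.children[item] = child
                    # chain the new node into the node-link list of its item
                    if item not in self.header:
                        self.header[item] = child
                    else:
                        link = self.header[item]
                        while link.node_link is not None:
                            link = link.node_link
                        link.node_link = child
                node = node.children[item]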

Example:

Let's now consider the following transaction table:

We will count the frequency of each candidate, then order the items by descending frequency, and finally compare the support count of each candidate with min_supp.

Now we will build an FP tree from this database.

For the transaction I2,I1,I5:

The items of each transaction are considered in descending order of their support counts.

For the transaction I2,I4:

We will add the articles of the transaction I2,I4: if an article is already present in the FP-tree, we increment its support; otherwise, we insert the new article and give it a support of 1.

In the FP-tree below, we want to add the article I2; since it already exists, we increment its support from 1 to 2, then we add the article I4.

For the transaction I2,I3:

Now we add the transaction I2,I3: we increment the support of I2, because it exists at the beginning of the FP-tree, then we insert the article I3, as shown in the FP-tree below.

 

For the transaction I2,I1,I4:

Here we increment the supports of the articles I2 and I1, then we insert the article I4 into the FP-tree, because it doesn't exist after the article I1.

For the transaction I1,I3:

As shown below, we add the article I1 as a new branch, because it doesn't exist at the beginning of the FP-tree, then we insert I3, because it doesn't exist after I1.

For the transaction I2,I3:

Since the article I2 exists at the beginning of the FP-tree, we just increment its support; and since the article I3 exists after I2, we also increment its support.

For the transaction I1,I3:

Now we want to add the transaction I1,I3: the article I1 exists at the beginning of the FP-tree, so we increment its support to 2; then, after I1, the article I3 already exists, so we increment its support as well.

For the transaction I2,I1,I3,I5:

We apply the same procedure to the transaction I2,I1,I3,I5: we increment the support of the article I2 from 5 to 6; after I2 we find I1, so we also increment its support, from 2 to 3; then we insert the article I3, because it doesn't exist after I1, and finally we insert the article I5 after I3.

For the transaction I2,I1,I3:

As shown in the FP-tree below, we increment the support of the article I2 from 6 to 7, because we find it at the beginning of the FP-tree; after I2 we find the article I1, so we increment its support from 3 to 4; and the same for the article I3: we increment its support from 1 to 2.
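Reusing the FPTree sketch above, the nine insertions of this walk-through can be replayed as follows (each transaction is written with its items already in descending support order):

    transactions = [
        ["I2", "I1", "I5"], ["I2", "I4"], ["I2", "I3"],
        ["I2", "I1", "I4"], ["I1", "I3"], ["I2", "I3"],
        ["I1", "I3"], ["I2", "I1", "I3", "I5"], ["I2", "I1", "I3"],
    ]
    tree = FPTree()
    for t in transactions:
        tree.insert(t)
    # The children of the root are now I2:7 and I1:2, as in the final FP-tree.
    print({item: node.count for item, node in tree.root.children.items()})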

Now we need to find the conditional pattern base and the conditional FP-tree of each item.

To make the tree traversal easier, an item header table is built so that each item points to its occurrences in the tree via a chain of node-links.
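Building on the FPTree sketch above, the conditional pattern base of an item can be collected by following its chain of node-links and reading each prefix path back to the root:

    def conditional_pattern_base(tree, item):
        """Follow the node-links of `item`: for each of its occurrences,
        collect the prefix path leading to it, weighted by the node's count."""
        base = []
        node = tree.header.get(item)
        while node is not None:
            path, parent = [], node.parent
            while parent is not None and parent.item is not None:
                path.append(parent.item)
                parent = parent.parent
            if path:
                base.append((list(reversed(path)), node.count))
            node = node.node_link
        return base

    print(conditional_pattern_base(tree, "I5"))
    # [(['I2', 'I1'], 1), (['I2', 'I1', 'I3'], 1)], i.e. the base listed below for I5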

For I5:

Conditional pattern base: {{I2,I1: 1}, {I2,I1,I3: 1}}

Conditional FP-tree: {I2: 2, I1: 2}

Frequent patterns generated: {I2,I5: 2}, {I1,I5: 2}, {I2,I1,I5: 2}

For I4:

Conditional pattern base: {{I2,I1: 1}, {I2: 1}}

Conditional FP-tree: {I2: 2}

Frequent patterns generated: {I2,I4: 2}

For I3:

Conditional pattern base: {{I2,I1: 2}, {I2: 2}, {I1: 2}}

Conditional FP-tree: {I2: 4, I1: 2}, {I1: 2}

Frequent patterns generated: {I2,I3: 4}, {I1,I3: 4}, {I2,I1,I3: 2}

For I1:

Conditional pattern base: {{I2: 4}}

Conditional FP-tree: {I2: 4}

Frequent patterns generated: {I2,I1: 4}

The table below contains the conditional pattern base, the conditional FP-tree and the frequent patterns generated of each item.

Advantages of the algorithm:

Disadvantages of the algorithm:

V- Evaluation of the rules:

Most association-rule mining algorithms use support and confidence measures with minimum thresholds in order to avoid uninteresting rules, but rule generation can still produce some uninteresting rules. So how can we evaluate the association rules found?

There are several correlation measures; among them is the lift test.

 Lift:

    Lift calculation

      lift(X → Y) = (sup(X ∪ Y) / N) / ((sup(X) / N) × (sup(Y) / N)), where N is the total number of transactions.
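A small helper for this formula, evaluated on the hypothetical supermarket database used for R1 in section II (sup({Bread, Milk} ∪ {Cheese}) = 2, sup({Bread, Milk}) = 2, sup({Cheese}) = 3, N = 5):

    def lift(sup_xy, sup_x, sup_y, n):
        """lift(X -> Y) = (sup(X u Y)/N) / ((sup(X)/N) * (sup(Y)/N))."""
        return (sup_xy / n) / ((sup_x / n) * (sup_y / n))

    print(lift(2, 2, 3, 5))  # ~1.67 > 1: Bread+Milk and Cheese are positively correlated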

Lift Interpretation

If lift(X → Y) > 1, then X and Y are positively correlated; if it is equal to 1, X and Y are independent; and if it is lower than 1, they are negatively correlated.

VI- Conclusion:

 

Author:

Kawtar Najmani

PhD Candidate at the Faculty of Science Ben M'Sik