September 16, 2008

Introduction to Association Rules Mining

Association rules are part of every data miner's arsenal. Haven't heard about it? I am pretty sure you've had. Association rules are a substantial part of every e-shop, of every supermarket and every tool that aims to analyze data.


Does the picture look familiar? If you've ever bought something at amazon, you might have noticed that they are kinda obsessed with showing you items related to your order. Where do they get this information? It is not stored statically in the database, instead it is computed from the overall orders using the association rules mining algorithms.

Do you think that items in your favorite supermarket are organized randomly? No, they are organized in a way that maximizes a chance that the items are bought. Again, this is information, that can be easily discovered using association rules mining algorithms.

From the previous paragraphs you might suspect, that association rules express relations (associations) between items. More formally, association rule is an implication of form A -> B, where the left side, A, is called premise and it represents a condition which must be true, for the right side, B (conclusion) to hold. A rule A->B can be interpreted as

If A happens, than B happens.
This is a very generic interpretation, because the true interpretation depends on the domain.I am now a supermarket employee and I got following rule from the mining software:

Bread -> Milk
The rule can be translated as:

Customers, who bought bread, also bought milk
Now I magically transform into a website traffic analyst and see a rule

/news/obama.html -> /sport/tour-de-france.html
and I instantly know that

those, who read news about Barrack Obama, also read news about le Tour and not only that, I know that those who are interested in Barrack Obama are interested in Tour de France.
Woosh, flash of light and I am now a doctor, looking at the rule

vasculitis -> paraneoplastic syndrome
and I see that there is a serious chance that my vasculitis patients will suffer paraneoplastic syndrome.

The important thing is, that association rules helped me to discover hidden knowledge (that's why they call it data mining), but the more important thing is, that I can act based on the knowledge. I can move the milk closer to bread to sell more of it together and generate more income. I can recommend stuff to my e-shop visitors, I can treat my vasculitis patients and run some tests to detect paraneoplastic syndrome early and maybe save lives.

So what do you need to get started? You need data of course, but not just any data, you need data in a form of transactions. These transactions have nothing to do with the database transactions. Instead, the transaction is a logical group of somehow related items. You might have groups of market basket items, groups of links clicked on one web page visit, group of one patient's diseases.. Such groups are then called transactions.

When I said, that rule interpretation depends on domain, it was only half of the truth. The other half is, that the interpretation also depends on your transactions. The interpretation simply depends on what you are mining, and what you are mining is based on how you define your transaction.

I'll now do a simple, manual association rule mining, using the classical market basket analysis example. We define our transaction as a content of a basket.

Transaction IdItems
1bread, milk, butter, cocoa, cheese
2bread, butter, milk, cheese
3bread, butter, olives
4milk, sugar, butter, cheese

We have four baskets, four customers and their data. Looking at the items, we see, that transactions 1,2,3 contain bread and butter. We have just found our very first rule.

bread -> butter
There are other rules in our data, for example rule

milk -> cheese
found in transactions 1,2,4. Although association rule mining may seem like a very trivial task at the first look, imagine finding the rules in dataset of billions of transactions.

The rules presented so far have all one big downside. There is no way to tell which rule is better, it is impossible to compare them. To get past this limitation, we can add several classifiers to the rule, which will represent the strength of the rule. They are commonly known as interestingness measures, because the strength of the rule is equal to its interestingness.

The two classical measures, which were introduced by R. Agrawal, an association rule pioneer, are called support and confidence. Support is a measure, which represents how often did the rule apply. It is a percentage of all transaction, where the items in the rule were found.

Confidence is a percentage of all transactions, which contain items on the left and on the right side of the rule.

IdTransactionsbread + cheesebread -> cheesecheese -> bread
1bread, cheese, honey, applesOOO
2milk, bread, cheese, pastaOOO
3milk, bread, apples
X
4bread, milk
X
5milk, pasta, cheese

X
6milk, bread, cheeseOOO

Look at the table above. Bread and cheese can be found in transactions 1,2,6, we have total six transactions, so the support of the rule bread -> cheese and cheese -> bread is 3/6 or 50%. Now take the rule bread -> cheese. Our customers bought bread in transactions 1,2,3,4,6, but bought cheese only in transactions 1,2 and 6. So five customers bought bread, but only three of them bought also a cheese, so the support of the rule is 3/5.



It should be pretty clear from the examples, that both interestingness measures are important, because they both quantify the rule and express its strength. But not only that, the interestingness measure are the key concept, that actually enables the mining.

Association rule mining is formally defined as a process of finding the rules, where the support and confidence of the rule are greater than the user provided values of minimum support and minimum confidence, further referred to as minconf and minsup. The two values actually prune the search space and make mining possible.

Take the last example. There is a rule milk -> apples [1/6, 1/1], which can be found in only one transaction. Is this rule interesting? It isn't and yet it is there. This is not a problem if we have six items in six transactions, but it would be great problem, had we thousands of items in billions of transactions. If you specify minsup=0.5, minconf=0.8 you will effectively filter out all uninteresting items. If you specify the values too low, you will end up with tons of rules, because the items will be associated with each other in all possible ways. On the other side, if you specify the values too high, you might not find a single rule. There is no universal advice as to what values should you set, the best way is to experiment.

What we'll be talking about next time? In the next posts I will show some practical examples using RapidMiner mining tool, explain the algorithm behind, tell about the problems this model has and explain why support and confidence are bad measures.


This is the first post in the Association Rule Mining series. Interested? Consider subscribing to my feed to catch up with updates.