What is the Apriori Algorithm?
The Apriori algorithm is used for frequent itemset mining and association rule learning over relational databases. It proceeds by identifying the frequent individual items in the database and extending them to larger and larger itemsets as long as those itemsets appear sufficiently often in the database. The frequent itemsets determined by Apriori can be used to derive association rules that highlight general trends in the database; this has applications in domains such as market basket analysis.
Apriori uses a "bottom-up" approach, where frequent subsets are extended one item at a time (a step known as candidate generation), and groups of candidates are tested against the data. The algorithm terminates when no further successful extensions are found.
Using breadth-first search and a Hash tree structure, Apriori counts candidate item sets efficiently. It generates candidate item sets of length k from item sets of length k-1. Then it prunes the candidates which have an infrequent sub-pattern. According to the downward closure lemma, the candidate set contains all frequent k-length item sets. After that, it scans the transaction database to determine frequent item sets among the candidates.
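The join and prune steps described above can be sketched in Python as follows. This is a minimal illustration rather than a reference implementation: the function name generate_candidates and the sample itemsets are hypothetical, and plain Python sets stand in for the hash tree.

```python
from itertools import combinations


def generate_candidates(frequent_prev, k):
    """Build candidate k-itemsets from the frequent (k-1)-itemsets (hypothetical helper)."""
    frequent_prev = set(frequent_prev)
    candidates = set()
    # Join step: union pairs of (k-1)-itemsets whose union has exactly k items.
    for a in frequent_prev:
        for b in frequent_prev:
            union = a | b
            if len(union) == k:
                candidates.add(union)
    # Prune step (downward closure): drop any candidate that has an
    # infrequent (k-1)-subset, because it cannot be frequent itself.
    return {
        c for c in candidates
        if all(frozenset(s) in frequent_prev for s in combinations(c, k - 1))
    }


# Made-up frequent 2-itemsets for illustration.
frequent_2 = {frozenset(p) for p in [("a", "b"), ("a", "c"), ("b", "c"), ("b", "d")]}
candidates_3 = generate_candidates(frequent_2, 3)
print(candidates_3)
```

Here only {a, b, c} survives: {a, b, d} and {b, c, d} are produced by the join but pruned because {a, d} and {c, d} are not frequent. The surviving candidates would still have to be counted against the transaction database.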
What are the steps of the Apriori Algorithm?
The following are the main steps of the algorithm:
- Calculate the support of each itemset of size k = 1 in the transactional database. Support is the frequency of occurrence of an itemset.
- In the first iteration of the algorithm, each item is taken as a candidate 1-itemset, and the algorithm counts the occurrences of each item. This is called generating the candidate set.
- Choose a minimum support threshold, min_sup (e.g. 2). Only candidates whose count is greater than or equal to min_sup are frequent and are carried into the next iteration; the others are pruned.
- Next, the frequent 2-itemsets are discovered. In the join step, candidate 2-itemsets are generated by combining the frequent 1-itemsets with each other.
- The 2-itemset candidates are pruned using the min_sup threshold, so that only 2-itemsets meeting min_sup remain.
- The next iteration forms 3-itemsets using the join and prune steps. It relies on the antimonotone property: every subset of a frequent itemset must itself be frequent. A candidate 3-itemset is kept only if all of its 2-itemset subsets are frequent; otherwise it is pruned. The surviving candidates are then counted against the database and compared with min_sup.
- The following iteration builds 4-itemsets by joining the frequent 3-itemsets with each other and pruning any candidate that has a subset not meeting min_sup.
- In general, join the frequent k-itemsets to form candidates of size k + 1 and repeat the steps above until no more frequent itemsets can be formed, that is, until every new candidate has a support below the given threshold.
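Putting the steps above together, a compact sketch might look like this. It assumes min_sup is an absolute count (as in the example value of 2); the function name apriori and the four sample transactions are made up for illustration, and no attempt is made at efficiency.

```python
from collections import Counter
from itertools import combinations


def apriori(transactions, min_sup):
    """Return {itemset: count} for every itemset whose count is >= min_sup."""
    transactions = [frozenset(t) for t in transactions]
    # First pass: count single items and keep the frequent 1-itemsets.
    counts = Counter(frozenset([item]) for t in transactions for item in t)
    frequent = {s: c for s, c in counts.items() if c >= min_sup}
    result = dict(frequent)
    k = 2
    while frequent:
        prev = set(frequent)
        # Join: merge frequent (k-1)-itemsets into candidate k-itemsets.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in prev for s in combinations(c, k - 1))
        }
        # Scan: count each surviving candidate in the transactions.
        counts = Counter(c for t in transactions for c in candidates if c <= t)
        frequent = {s: c for s, c in counts.items() if c >= min_sup}
        result.update(frequent)
        k += 1
    return result


# Hypothetical toy database of four transactions.
transactions = [
    {"A", "C", "D"},
    {"B", "C", "E"},
    {"A", "B", "C", "E"},
    {"B", "E"},
]
frequent_itemsets = apriori(transactions, min_sup=2)
for itemset, count in sorted(frequent_itemsets.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```

With min_sup = 2, item D is dropped in the first pass, four 2-itemsets survive the second pass, and {B, C, E} is the only frequent 3-itemset. The loop stops at k = 4 because no candidates can be formed.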
What are the advantages of the Apriori Algorithm?
The Apriori algorithm advantages are as follows:
- It is one of the simplest and easiest-to-understand association rule learning algorithms
- The resulting rules are intuitive and easy to communicate to an end-user
- It doesn't require labeled data as it is fully unsupervised; as a result, you can use it in many different situations because unlabeled data is often more accessible
- Many extensions were proposed for different use cases based on this implementation—for example, there are association learning algorithms that take into account the ordering of items, their number, and associated timestamps
- The algorithm is exhaustive, so it finds all the rules with the specified support and confidence
What are the disadvantages of the Apriori Algorithm?
One of the biggest limitations of the Apriori algorithm is that it is slow. Its running time is driven mainly by:
- A large number of items and itemsets in the dataset.
- A low minimum support threshold, which lets many itemsets qualify as frequent.
- The time and memory needed to hold a large number of candidate sets when there are many frequent itemsets.
As a result, it is inefficient when used on large datasets.
As an example, suppose there are 10^4 frequent 1-itemsets. The Apriori algorithm then needs to generate more than 10^7 candidate 2-itemsets, each of which must be counted and tested. To discover a frequent pattern of size 100 (containing v1, v2, …, v100), the algorithm has to generate on the order of 2^100 candidates in total, because every subset of a frequent itemset is itself frequent.
Hence, costs escalate and a lot of time is spent on candidate generation. In addition, to check the many candidate itemsets, the algorithm scans the database many times, which is expensive. This hurts performance further when system memory is insufficient and there are a large number of transactions. That’s why the algorithm becomes inefficient and slow with large databases.
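The candidate counts quoted in the example above can be checked with a few lines of Python. This is only back-of-the-envelope arithmetic under the assumption that every pair of frequent items becomes a candidate; the variable names are illustrative.

```python
from math import comb

# Number of candidate 2-itemsets when there are 10^4 frequent 1-itemsets:
# every unordered pair of frequent items is a candidate.
n_frequent_items = 10**4
candidate_pairs = comb(n_frequent_items, 2)

# Number of non-empty subsets of a 100-item pattern. Each of them is
# frequent, so each must be generated as a candidate at some level.
subsets_of_100 = 2**100 - 1

print(f"{candidate_pairs:,}")    # 49,995,000
print(f"{subsets_of_100:.2e}")   # 1.27e+30
```

The first figure is roughly 5 x 10^7, consistent with "more than 10^7", and the second shows why long frequent patterns are out of reach for plain candidate generation.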
How can we improve the Apriori Algorithm's efficiency?
Many methods are available for improving the efficiency of the algorithm.
- Hash-Based Technique: This method uses a hash-based structure called a hash table for generating the k-itemsets and their corresponding count. It uses a hash function for generating the table.
- Transaction Reduction: This method reduces the number of transactions scanned in iterations. The transactions which do not contain frequent items are marked or removed.
- Partitioning: This method requires only two database scans to mine the frequent itemsets. It says that for any itemset to be potentially frequent in the database, it should be frequent in at least one of the partitions of the database.
- Sampling: This method picks a random sample S from database D and then searches for frequent itemsets in S instead of D. Some globally frequent itemsets may be missed this way; the risk can be reduced by lowering min_sup when mining the sample.
- Dynamic Itemset Counting: This technique can add new candidate itemsets at any marked start point of the database during the scanning of the database.
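As one illustration of these ideas, the partitioning technique can be sketched as follows. Several things here are assumptions made for the sketch: min_frac expresses min_sup as a fraction of transactions so that it can be applied to partitions of any size, the function names are hypothetical, and local_frequent is a brute-force stand-in for a real local miner that is only practical on tiny examples.

```python
from collections import Counter
from itertools import combinations
from math import ceil


def local_frequent(partition, min_count):
    """Brute-force stand-in for a local miner: count every subset of every transaction."""
    counts = Counter(
        frozenset(s)
        for t in partition
        for r in range(1, len(t) + 1)
        for s in combinations(sorted(t), r)
    )
    return {s for s, c in counts.items() if c >= min_count}


def partitioned_frequent(transactions, min_frac, n_parts):
    """Two-scan mining: local candidates per partition, then one global count."""
    transactions = [frozenset(t) for t in transactions]
    size = ceil(len(transactions) / n_parts)
    parts = [transactions[i:i + size] for i in range(0, len(transactions), size)]
    # Scan 1: collect itemsets that are frequent in at least one partition.
    candidates = set()
    for part in parts:
        candidates |= local_frequent(part, ceil(min_frac * len(part)))
    # Scan 2: count those candidates over the whole database.
    counts = Counter(c for t in transactions for c in candidates if c <= t)
    threshold = ceil(min_frac * len(transactions))
    return {s: c for s, c in counts.items() if c >= threshold}


# Hypothetical toy database, split into two partitions.
db = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
globally_frequent = partitioned_frequent(db, min_frac=0.5, n_parts=2)
print(len(globally_frequent))
```

The second scan is what removes false positives: an itemset that is frequent in one partition only becomes part of the result if it also meets the threshold over the whole database.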
What are the components of the Apriori algorithm?
There are three major components of the Apriori algorithm: support, confidence, and lift.
For example, suppose you have 5000 customer transactions from a Zara store and want to calculate the support, confidence, and lift for two products, say men's wear and women's wear.
Out of the 5000 transactions, 300 contain men's wear and 700 contain women's wear, and 250 transactions contain both men's and women's wear.
Support denotes how popular a product or data item is in the dataset. To compute it, divide the number of transactions containing that product by the total number of transactions.
Support (men's wear) = (transactions containing men's wear) / (total transactions)
= 300 / 5000
= 6 %
Confidence is the likelihood that a transaction containing one item also contains another. To find the confidence that women's wear is bought when men's wear is bought, divide the number of transactions that contain both men's and women's wear by the number of transactions that contain men's wear.
Confidence (men's wear → women's wear) = (transactions with both men's and women's wear) / (transactions with men's wear)
= 250 / 300
= 83.33 %
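Lift is the ratio of the sales of women's wear when you sell men's wear to the sales of women's wear overall. A lift above 1 means the two products are bought together more often than would be expected if they were independent. The mathematical equation of lift is given below.
Lift = Confidence (men's wear → women's wear) / Support (women's wear)
= (250 / 300) / (700 / 5000)
≈ 5.95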
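The three measures can be computed directly from the counts in the example. The sketch below assumes the standard definitions, namely confidence(A → B) = support(A and B) / support(A) and lift(A → B) = confidence(A → B) / support(B); the variable names are illustrative.

```python
# Counts taken from the example: 5000 transactions in total.
total = 5000
men = 300      # transactions containing men's wear
women = 700    # transactions containing women's wear
both = 250     # transactions containing both

support_men = men / total        # share of transactions with men's wear
support_women = women / total    # share of transactions with women's wear
support_both = both / total      # share of transactions with both

# Confidence of the rule "men's wear -> women's wear".
confidence = both / men

# Lift: confidence relative to the baseline popularity of women's wear.
lift = confidence / support_women

print(f"support(men's wear)   = {support_men:.2%}")
print(f"support(women's wear) = {support_women:.2%}")
print(f"confidence            = {confidence:.2%}")
print(f"lift                  = {lift:.2f}")
```

A lift of about 5.95 means that, in this example, a customer who buys men's wear is roughly six times as likely to buy women's wear as a customer picked at random.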
What are the applications of the Apriori Algorithm?
The Apriori algorithm has gained traction in recent years and is used in different industries for data mining and data handling.
Some fields where Apriori is used:
1. Medical
Hospitals are flooded with data every day and need to retrieve a lot of past data for existing patients. The Apriori algorithm helps hospitals manage their patient database without mixing up one patient's records with another's.
2. Education
Educational institutions can use the Apriori algorithm to store and monitor student data such as age, gender, traits, characteristics, parents' details, etc.
3. Forestry
Along the same lines as the education and medical industries, forestry can also use the Apriori algorithm to store, analyze, and manage details of the flora and fauna of a given territory.
4. New Tech Firms
Tech firms use the Apriori algorithm to keep track of which products are purchased together by different customers, and feed this into recommender systems.
5. Mobile Commerce
Big data can help mobile e-commerce companies to deliver an easy, convenient and personalized shopping experience. With the Apriori algorithm, the real-time product recommendation accuracy increases, which creates an excellent customer experience and increases sales for the company.