Explaining the Calculations of Probability and Importance for Complex Association Rules in SQL Server 2012 Analysis Services

By:   |   Comments (2)   |   Related: > Analysis Services Development


Problem

The output from the Association Rules data mining model in SQL Server Analysis Services can be difficult to understand in terms of the calculations named probability and importance, especially when the generated rules become more complex.  In this tip we go through an example to provide a better understanding.

Solution

In this tip, we show how the Association Rules mining model calculates the probability and importance displayed in the mining model viewer for more complex association rules. In a previous tip, I presented a SQL Server 2012 Analysis Services Association Rules data mining example along with how the probability and importance are calculated for a simple rule. The aforementioned tip will be our starting point for this demonstration. The solution presented here uses data from a banking example provided by Bamshad Mobasher at DePaul University.

To calculate the importance value, we need the count of four different sets of records.

  • The count where the antecedent is A and the consequent is positive.
  • The count where the antecedent is A and the consequent is negative.
  • The count where the antecedent is NOT A and the consequent is positive.
  • The count where the antecedent is NOT A and the consequent is negative.

    (The antecedent is also known as the precondition and the consequent is also known as the conclusion.)

    In the image of the Mining Model Viewer below, we see two sets of rules with their values of Probability and Importance. In the first rule, the antecedent A is Income = High and Number of Children >= 2. The consequent for the first rule is Did Buy A Personal Equity Plan = YES, which we will treat as positive for all rules. For this first rule the antecedent NOT A is Income <> High OR Number of Children < 2. Please note the OR in the previous sentence because it will be very important in our calculations below. The negative consequent is Buy A Personal Equity Plan = NO, which we will treat as negative for all rules.

    Explaining the Calculations of Probability and Importance for Complex Association Rules in SQL Server 2012 Analysis Services

    The T-SQL below shows how the Importance is calculated.

    declare @HiIncomeAnd2PlusKidsYesCount float
    declare @NotHiIncomeOr0To1KidsNoCount float
    declare @HiIncomeAnd2PlusKidsNoCount float
    declare @NotHiIncomeOr0To1KidsYesCount float
    declare @Numerator float
    declare @Denominator float
    -- The algorithm adds one to each count to prevent divide by zero errors
    -- The count where the antecedent is A and the consequent is positive.
    set @HiIncomeAnd2PlusKidsYesCount=
    (
       select cast(count(*)+1 as float) 
       from dbo.tblAssociationRulesExample 
       where Income='HIGH' 
       and NumberOfChildren>=2 
       and DidBuyAPersonalEquityPlan='YES'
    )
    --The count where the antecedent is NOT A and the consequent is negative.
    set @NotHiIncomeOr0To1KidsNoCount=
    (
       select cast(count(*)+1 as float) 
       from dbo.tblAssociationRulesExample 
       where 
       (Income<>'HIGH' OR NumberOfChildren<2) 
       and DidBuyAPersonalEquityPlan='NO'
    )
    --The count where the antecedent is A and the consequent is negative.
    set @HiIncomeAnd2PlusKidsNoCount=
    (
       select cast(count(*)+1 as float) 
       from dbo.tblAssociationRulesExample 
       where Income='HIGH' 
       and NumberOfChildren>=2 
       and DidBuyAPersonalEquityPlan='NO'
    )
    --The count where the antecedent is NOT A and the consequent is positive.
    set @NotHiIncomeOr0To1KidsYesCount=
    (
       select cast(count(*)+1 as float) 
       from dbo.tblAssociationRulesExample 
       where (Income<>'HIGH' OR NumberOfChildren<2)  
       and DidBuyAPersonalEquityPlan='YES'
    )
    --echo the values
    select @HiIncomeAnd2PlusKidsYesCount as HiIncomeAnd2PlusKidsYes, 
           @NotHiIncomeOr0To1KidsNoCount as NotHiIncomeOr0To1KidsNo, 
           @HiIncomeAnd2PlusKidsNoCount as HiIncomeAnd2PlusKidsNo,
           @NotHiIncomeOr0To1KidsYesCount as NotHiIncomeOr0To1KidsYes
    set @Numerator=(select @HiIncomeAnd2PlusKidsYesCount/
                          (@HiIncomeAnd2PlusKidsYesCount+@HiIncomeAnd2PlusKidsNoCount))
    set @Denominator=(select @NotHiIncomeOr0To1KidsYesCount/
                          (@NotHiIncomeOr0To1KidsYesCount+@NotHiIncomeOr0To1KidsNoCount))
    select @Numerator as Numerator, 
           @Denominator as Denominator, 
           @Numerator/@Denominator as Quotient
    select round(LOG10(@Numerator/@Denominator),3) as Importance
    

    The output of the above T-SQL code is shown here. Notice how the Importance calculation matches the value in the Mining Model Viewer screen shot above.

    The output of the above T-SQL code

    The T-SQL below shows how the Probability is calculated for the first rule shown.

    declare @AntecedentA_PositiveCount float
    declare @AntecedentACount float
    --Get the total count for the rule AntecedentA implies DidBuyAPersonalEquityPlan='YES'
    set @AntecedentA_PositiveCount=
    (
       select cast(count(*) as float) 
       from dbo.tblAssociationRulesExample 
       where Income='HIGH' 
       and NumberOfChildren>=2 
       and DidBuyAPersonalEquityPlan='YES'
    )
    --Get the total count for all AntecedentA
    set @AntecedentACount=
    (
       select cast(count(*) as float) 
       from dbo.tblAssociationRulesExample 
       where Income='HIGH' 
       and NumberOfChildren>=2 
    )
    select @AntecedentA_PositiveCount as AntecedentAAndYesCount, 
           @AntecedentACount as AntecedentACount,
           round(@AntecedentA_PositiveCount/@AntecedentACount,3) as Probability
    

    The output of this query is shown here. Notice how the probability calculated here matches the probability shown in the Mining Model Viewer above.

    Notice how the probability calculated here matches the probability shown in the Mining Model Viewer.

    To validate the accuracy of the calculations, we run the following T-SQL queries for the second rule displayed above. In the second rule, the antecedent A is Income = High and Region = Suburban. The antecedent NOT A is Income <> High OR Region <> Suburban. Again, please note the OR in the previous sentence because it will be very important in our calculations below.

    declare @HiIncomeAndSuburbanYesCount float
    declare @NotHiIncomeOrNotSuburbanNoCount float
    declare @HiIncomeAndSuburbanNoCount float
    declare @NotHiIncomeOrNotSuburbanYesCount float
    declare @Numerator float
    declare @Denominator float
    -- The algorithm adds one to each count to prevent divide by zero errors
    -- The count where the antecedent is A and the consequent is positive.
    set @HiIncomeAndSuburbanYesCount=
    (
       select cast(count(*)+1 as float) 
       from dbo.tblAssociationRulesExample 
       where Income='HIGH' 
       and Region='Suburban' 
       and DidBuyAPersonalEquityPlan='YES'
    )
    --The count where the antecedent is NOT A and the consequent is negative.
    set @NotHiIncomeOrNotSuburbanNoCount=
    (
       select cast(count(*)+1 as float) 
       from dbo.tblAssociationRulesExample 
       where 
       (Income<>'HIGH' OR Region<>'Suburban') 
       and DidBuyAPersonalEquityPlan='NO'
    )
    --The count where the antecedent is A and the consequent is negative.
    set @HiIncomeAndSuburbanNoCount=
    (
       select cast(count(*)+1 as float) 
       from dbo.tblAssociationRulesExample 
       where Income='HIGH' 
       and Region='Suburban' 
       and DidBuyAPersonalEquityPlan='NO'
    )
    --The count where the antecedent is NOT A and the consequent is positive.
    set @NotHiIncomeOrNotSuburbanYesCount=
    (
       select cast(count(*)+1 as float) 
       from dbo.tblAssociationRulesExample 
       where (Income<>'HIGH' OR Region<>'Suburban')  
       and DidBuyAPersonalEquityPlan='YES'
    )
    --echo the values
    select @HiIncomeAndSuburbanYesCount as HiIncomeAndSuburbanYes, 
           @NotHiIncomeOrNotSuburbanNoCount as NotHiIncomeOrNotSuburbanNo, 
           @HiIncomeAndSuburbanNoCount as HiIncomeAndSuburbanNo,
           @NotHiIncomeOrNotSuburbanYesCount as NotHiIncomeOrNotSuburbanYes
    set @Numerator=(select @HiIncomeAndSuburbanYesCount/
                   (@HiIncomeAndSuburbanYesCount+@HiIncomeAndSuburbanNoCount))
    set @Denominator=(select @NotHiIncomeOrNotSuburbanYesCount/
                     (@NotHiIncomeOrNotSuburbanYesCount+@NotHiIncomeOrNotSuburbanNoCount))
    select @Numerator as Numerator, 
           @Denominator as Denominator, 
           @Numerator/@Denominator as Quotient
    select round(LOG10(@Numerator/@Denominator),3) as Importance
    

    The output of this query is shown here.

    The T-SQL shows how the Probability is calculated for the second rule shown.

    The T-SQL below shows how the Probability is calculated for the second rule shown.

    declare @AntecedentA_PositiveCount float
    declare @AntecedentACount float
    --Get the total count for the rule Antecedent A implies DidBuyAPersonalEquityPlan='YES'
    set @AntecedentA_PositiveCount=
       (
          select cast(count(*) as float) from dbo.tblAssociationRulesExample 
          where Income='HIGH' 
          and Region='Suburban'
          and DidBuyAPersonalEquityPlan='YES'
       )
    --Get the total count for all Antecedent A
    set @AntecedentACount=
       (
          select cast(count(*) as float) from dbo.tblAssociationRulesExample 
          where Income='HIGH' 
          and Region='Suburban' 
       )
    select @AntecedentA_PositiveCount as AntecedentAAndYesCount, 
           @AntecedentACount as AntecedentACount,
           round(@AntecedentA_PositiveCount/@AntecedentACount,3) as Probability
    

    The output of this query is shown here.

    The output of this query

    Please take note that the above examples hold true because the HoldoutMaxCases and HoldoutMaxPercent attributes are set to zero for the mining structure. The image below is from the Visual Studio Properties Window for the Mining Structure.

    The Visual Studio Properties Window for the Mining Structure.

    If the values for these attributes are not zero, then the query results will likely not match the figures in the Model Mining Viewer. When we set the value for HoldoutMaxPercent to 30 and then rebuild and deploy the mining structure, we see that our values for probability and importance have changed. With larger volumes of data, this change might not be as noticeable.

    The Model Mining Viewer


    With larger volumes of data, this change might not be as noticeable.
    Next Steps

    Explore the different association rules in the Mining Model Viewer to discover interesting patterns in the data. Also, be sure to read the following tips to find out more about data mining functionality in SQL Server 2012 Analysis Services.



  • sql server categories

    sql server webinars

    subscribe to mssqltips

    sql server tutorials

    sql server white papers

    next tip



    About the author
    MSSQLTips author Dallas Snider Dr. Dallas Snider is an Assistant Professor in the Computer Science Department at the University of West Florida and has 18+ years of SQL experience.

    This author pledges the content of this article is based on professional experience and not AI generated.

    View all my tips



    Comments For This Article




    Sunday, July 13, 2014 - 7:01:56 AM - mouse Back To Top (32676)

    Oh, and since I'm here soliciting help, I would like offer input in return. Consider windows functions to reduce repetitive queries. On a large dataset, the power of one query that uses window functions for aggregations can greatly improve performance. While reducing your object pulls, you can still implement case filters to handle the alernate where clauses, and your select statement can include in-line variable assignment.

    select @this = sum(case when dog = cat then 1 else 0 end) over (), @that = count(1) over() from dbo.Something


    Sunday, July 13, 2014 - 6:47:50 AM - mouse Back To Top (32675)

    Pardon my lack of data mining/statistics education, but I'm curiously confounded about the holdout - is there a gernally accepted standard for determining best holdout counts or percents; which I assume would be based on some factor of total population? Or is it very much up to the miner to determine based on the goal, where perhaps overfitting is acceptable, and/or some known/determined generalizability of the population? There must be a cheatsheet somewhere for beginners like myself?















    get free sql tips
    agree to terms