Learn more about SQL Server tools

mssqltips logo
giveaway
 

Tutorials          DBA          Dev          BI          Career          Categories          Webcasts          Whitepapers          Today's Tip          Join

Tutorials      DBA      Dev      BI      Categories      Webcasts

DBA    Dev    BI    Categories

 

Explaining the Calculations of Probability and Importance for Complex Association Rules in SQL Server 2012 Analysis Services


By:   |   Read Comments (2)   |   Related Tips: > Analysis Services Development

Attend a SQL Server Conference for FREE >> click to learn more


Problem

The output from the Association Rules data mining model in SQL Server Analysis Services can be difficult to understand in terms of the calculations named probability and importance, especially when the generated rules become more complex.  In this tip we go through an example to provide a better understanding.

Solution

In this tip, we show how the Association Rules mining model calculates the probability and importance displayed in the mining model viewer for more complex association rules. In a previous tip, I presented a SQL Server 2012 Analysis Services Association Rules data mining example along with how the probability and importance are calculated for a simple rule. The aforementioned tip will be our starting point for this demonstration. The solution presented here uses data from a banking example provided by Bamshad Mobasher at DePaul University.

To calculate the importance value, we need the count of four different sets of records.

  • The count where the antecedent is A and the consequent is positive.
  • The count where the antecedent is A and the consequent is negative.
  • The count where the antecedent is NOT A and the consequent is positive.
  • The count where the antecedent is NOT A and the consequent is negative.

    (The antecedent is also known as the precondition and the consequent is also known as the conclusion.)

    In the image of the Mining Model Viewer below, we see two sets of rules with their values of Probability and Importance. In the first rule, the antecedent A is Income = High and Number of Children >= 2. The consequent for the first rule is Did Buy A Personal Equity Plan = YES, which we will treat as positive for all rules. For this first rule the antecedent NOT A is Income <> High OR Number of Children < 2. Please note the OR in the previous sentence because it will be very important in our calculations below. The negative consequent is Buy A Personal Equity Plan = NO, which we will treat as negative for all rules.

    Explaining the Calculations of Probability and Importance for Complex Association Rules in SQL Server 2012 Analysis Services

    The T-SQL below shows how the Importance is calculated.

    declare @HiIncomeAnd2PlusKidsYesCount float
    declare @NotHiIncomeOr0To1KidsNoCount float
    declare @HiIncomeAnd2PlusKidsNoCount float
    declare @NotHiIncomeOr0To1KidsYesCount float
    declare @Numerator float
    declare @Denominator float
    -- The algorithm adds one to each count to prevent divide by zero errors
    -- The count where the antecedent is A and the consequent is positive.
    set @HiIncomeAnd2PlusKidsYesCount=
    (
       select cast(count(*)+1 as float) 
       from dbo.tblAssociationRulesExample 
       where Income='HIGH' 
       and NumberOfChildren>=2 
       and DidBuyAPersonalEquityPlan='YES'
    )
    --The count where the antecedent is NOT A and the consequent is negative.
    set @NotHiIncomeOr0To1KidsNoCount=
    (
       select cast(count(*)+1 as float) 
       from dbo.tblAssociationRulesExample 
       where 
       (Income<>'HIGH' OR NumberOfChildren<2) 
       and DidBuyAPersonalEquityPlan='NO'
    )
    --The count where the antecedent is A and the consequent is negative.
    set @HiIncomeAnd2PlusKidsNoCount=
    (
       select cast(count(*)+1 as float) 
       from dbo.tblAssociationRulesExample 
       where Income='HIGH' 
       and NumberOfChildren>=2 
       and DidBuyAPersonalEquityPlan='NO'
    )
    --The count where the antecedent is NOT A and the consequent is positive.
    set @NotHiIncomeOr0To1KidsYesCount=
    (
       select cast(count(*)+1 as float) 
       from dbo.tblAssociationRulesExample 
       where (Income<>'HIGH' OR NumberOfChildren<2)  
       and DidBuyAPersonalEquityPlan='YES'
    )
    --echo the values
    select @HiIncomeAnd2PlusKidsYesCount as HiIncomeAnd2PlusKidsYes, 
           @NotHiIncomeOr0To1KidsNoCount as NotHiIncomeOr0To1KidsNo, 
           @HiIncomeAnd2PlusKidsNoCount as HiIncomeAnd2PlusKidsNo,
           @NotHiIncomeOr0To1KidsYesCount as NotHiIncomeOr0To1KidsYes
    set @Numerator=(select @HiIncomeAnd2PlusKidsYesCount/
                          (@HiIncomeAnd2PlusKidsYesCount+@HiIncomeAnd2PlusKidsNoCount))
    set @Denominator=(select @NotHiIncomeOr0To1KidsYesCount/
                          (@NotHiIncomeOr0To1KidsYesCount+@NotHiIncomeOr0To1KidsNoCount))
    select @Numerator as Numerator, 
           @Denominator as Denominator, 
           @Numerator/@Denominator as Quotient
    select round(LOG10(@Numerator/@Denominator),3) as Importance
    

    The output of the above T-SQL code is shown here. Notice how the Importance calculation matches the value in the Mining Model Viewer screen shot above.

    The output of the above T-SQL code

    The T-SQL below shows how the Probability is calculated for the first rule shown.

    declare @AntecedentA_PositiveCount float
    declare @AntecedentACount float
    --Get the total count for the rule AntecedentA implies DidBuyAPersonalEquityPlan='YES'
    set @AntecedentA_PositiveCount=
    (
       select cast(count(*) as float) 
       from dbo.tblAssociationRulesExample 
       where Income='HIGH' 
       and NumberOfChildren>=2 
       and DidBuyAPersonalEquityPlan='YES'
    )
    --Get the total count for all AntecedentA
    set @AntecedentACount=
    (
       select cast(count(*) as float) 
       from dbo.tblAssociationRulesExample 
       where Income='HIGH' 
       and NumberOfChildren>=2 
    )
    select @AntecedentA_PositiveCount as AntecedentAAndYesCount, 
           @AntecedentACount as AntecedentACount,
           round(@AntecedentA_PositiveCount/@AntecedentACount,3) as Probability
    

    The output of this query is shown here. Notice how the probability calculated here matches the probability shown in the Mining Model Viewer above.

    Notice how the probability calculated here matches the probability shown in the Mining Model Viewer.

    To validate the accuracy of the calculations, we run the following T-SQL queries for the second rule displayed above. In the second rule, the antecedent A is Income = High and Region = Suburban. The antecedent NOT A is Income <> High OR Region <> Suburban. Again, please note the OR in the previous sentence because it will be very important in our calculations below.

    declare @HiIncomeAndSuburbanYesCount float
    declare @NotHiIncomeOrNotSuburbanNoCount float
    declare @HiIncomeAndSuburbanNoCount float
    declare @NotHiIncomeOrNotSuburbanYesCount float
    declare @Numerator float
    declare @Denominator float
    -- The algorithm adds one to each count to prevent divide by zero errors
    -- The count where the antecedent is A and the consequent is positive.
    set @HiIncomeAndSuburbanYesCount=
    (
       select cast(count(*)+1 as float) 
       from dbo.tblAssociationRulesExample 
       where Income='HIGH' 
       and Region='Suburban' 
       and DidBuyAPersonalEquityPlan='YES'
    )
    --The count where the antecedent is NOT A and the consequent is negative.
    set @NotHiIncomeOrNotSuburbanNoCount=
    (
       select cast(count(*)+1 as float) 
       from dbo.tblAssociationRulesExample 
       where 
       (Income<>'HIGH' OR Region<>'Suburban') 
       and DidBuyAPersonalEquityPlan='NO'
    )
    --The count where the antecedent is A and the consequent is negative.
    set @HiIncomeAndSuburbanNoCount=
    (
       select cast(count(*)+1 as float) 
       from dbo.tblAssociationRulesExample 
       where Income='HIGH' 
       and Region='Suburban' 
       and DidBuyAPersonalEquityPlan='NO'
    )
    --The count where the antecedent is NOT A and the consequent is positive.
    set @NotHiIncomeOrNotSuburbanYesCount=
    (
       select cast(count(*)+1 as float) 
       from dbo.tblAssociationRulesExample 
       where (Income<>'HIGH' OR Region<>'Suburban')  
       and DidBuyAPersonalEquityPlan='YES'
    )
    --echo the values
    select @HiIncomeAndSuburbanYesCount as HiIncomeAndSuburbanYes, 
           @NotHiIncomeOrNotSuburbanNoCount as NotHiIncomeOrNotSuburbanNo, 
           @HiIncomeAndSuburbanNoCount as HiIncomeAndSuburbanNo,
           @NotHiIncomeOrNotSuburbanYesCount as NotHiIncomeOrNotSuburbanYes
    set @Numerator=(select @HiIncomeAndSuburbanYesCount/
                   (@HiIncomeAndSuburbanYesCount+@HiIncomeAndSuburbanNoCount))
    set @Denominator=(select @NotHiIncomeOrNotSuburbanYesCount/
                     (@NotHiIncomeOrNotSuburbanYesCount+@NotHiIncomeOrNotSuburbanNoCount))
    select @Numerator as Numerator, 
           @Denominator as Denominator, 
           @Numerator/@Denominator as Quotient
    select round(LOG10(@Numerator/@Denominator),3) as Importance
    

    The output of this query is shown here.

    The T-SQL shows how the Probability is calculated for the second rule shown.

    The T-SQL below shows how the Probability is calculated for the second rule shown.

    declare @AntecedentA_PositiveCount float
    declare @AntecedentACount float
    --Get the total count for the rule Antecedent A implies DidBuyAPersonalEquityPlan='YES'
    set @AntecedentA_PositiveCount=
       (
          select cast(count(*) as float) from dbo.tblAssociationRulesExample 
          where Income='HIGH' 
          and Region='Suburban'
          and DidBuyAPersonalEquityPlan='YES'
       )
    --Get the total count for all Antecedent A
    set @AntecedentACount=
       (
          select cast(count(*) as float) from dbo.tblAssociationRulesExample 
          where Income='HIGH' 
          and Region='Suburban' 
       )
    select @AntecedentA_PositiveCount as AntecedentAAndYesCount, 
           @AntecedentACount as AntecedentACount,
           round(@AntecedentA_PositiveCount/@AntecedentACount,3) as Probability
    

    The output of this query is shown here.

    The output of this query

    Please take note that the above examples hold true because the HoldoutMaxCases and HoldoutMaxPercent attributes are set to zero for the mining structure. The image below is from the Visual Studio Properties Window for the Mining Structure.

    The Visual Studio Properties Window for the Mining Structure.

    If the values for these attributes are not zero, then the query results will likely not match the figures in the Model Mining Viewer. When we set the value for HoldoutMaxPercent to 30 and then rebuild and deploy the mining structure, we see that our values for probability and importance have changed. With larger volumes of data, this change might not be as noticeable.

    The Model Mining Viewer


    With larger volumes of data, this change might not be as noticeable.
    Next Steps

    Explore the different association rules in the Mining Model Viewer to discover interesting patterns in the data. Also, be sure to read the following tips to find out more about data mining functionality in SQL Server 2012 Analysis Services.



  • Last Update:


    signup button

    next tip button



    About the author
    MSSQLTips author Dallas Snider Dr. Dallas Snider is an Assistant Professor in the Computer Science Department at the University of West Florida and has 18+ years of SQL experience.

    View all my tips





    Post a comment or let the author know this tip helped.

    All comments are reviewed, so stay on subject or we may delete your comment. Note: your email address is not published. Required fields are marked with an asterisk (*).

    *Name    *Email    Notify for updates 


    SQL tips:

    *Enter Code refresh code     



    Sunday, July 13, 2014 - 7:01:56 AM - mouse Back To Top

    Oh, and since I'm here soliciting help, I would like offer input in return. Consider windows functions to reduce repetitive queries. On a large dataset, the power of one query that uses window functions for aggregations can greatly improve performance. While reducing your object pulls, you can still implement case filters to handle the alernate where clauses, and your select statement can include in-line variable assignment.

    select @this = sum(case when dog = cat then 1 else 0 end) over (), @that = count(1) over() from dbo.Something


    Sunday, July 13, 2014 - 6:47:50 AM - mouse Back To Top

    Pardon my lack of data mining/statistics education, but I'm curiously confounded about the holdout - is there a gernally accepted standard for determining best holdout counts or percents; which I assume would be based on some factor of total population? Or is it very much up to the miner to determine based on the goal, where perhaps overfitting is acceptable, and/or some known/determined generalizability of the population? There must be a cheatsheet somewhere for beginners like myself?


    Learn more about SQL Server tools