Selecting a Simple Random Sample from a SQL Server Database

By:   |   Comments   |   Related: > TSQL


Problem

In the business world, useful information about a population usually is gathered by studying a selected portion of the population. The portion is called a sample, and the process of selecting the sample is called sampling. Sampling is a fundamental operation for auditing and statistical analysis of large databases [1]. Many people in the database community are required to select a sample from a SQL server database. A simple solution on the web is to use the SQL statement "ORDER BY NEWID()". This solution may not fit all populations. For example, in FM radio markets, a station considers the age group of the target audience as an important determinant for the type of programming [2]. If 30 listeners are chosen randomly from a SQL server database that contains all listeners through using this method, the sample may not include listeners of a certain age subgroup. Thus, the sample might misrepresent the population. Additional sampling techniques are required to reduce this sample selection bias. Those database professionals who are unfamiliar with statistics may want to know more sampling techniques and the nature of the uncertainties created by these sampling techniques.

Solution

To sensitize that sampling is not merely to select some random items from a population, I quoted a statement from William Edwards Deming [3]: Sampling is not a mere substitution of a partial coverage for total coverage. Sampling is the science and art of controlling and measuring the reliability of useful statistical information through the theory of probability. To explore this science and art, the tip is devoted to two topics: sampling techniques, and sampling distributions.

The focus of the first part is to introduce sampling techniques. Section 1.1 covers some basic concepts of sampling. Then, two categories of sampling techniques are briefly introduced in Section 1.2. Next, Section 1.3 adopts the lottery method of the simple random sampling to select a sample from a SQL server database.

The second part is devoted to sampling distributions. Section 2.1 explains the expected value and variance of a random variable. Sections 2.2 studies one of the most important theorems in statistics: The Central Limit Theorem (CLT). Section 2.3 introduces the well-known normal distributions. Section 2.4, then, studies variation among many samples. Finally, Section 2.5 briefly introduces the sample size determination technique.

All the source codes used in this tip were tested with SQL Server Management Studio V18.3.1, Microsoft Visual Studio Community 2017 and Microsoft R Client 3.4.3 on Windows 10 Home 10.0 <X64>. The DBMS is Microsoft SQL Server 2017 Enterprise Edition (64-bit).

1 – Sampling Techniques

When we study the characteristics of an entire population, because of physical constraints, economical constraints, time constraints or other constraints, it is usually impractical to gather information from every unit within a population. We use sampling techniques to estimate the characteristics of the entire population.

1.1 Basic Concepts of Sampling

To precisely explain sampling techniques, let’s study some definitions that were excerpted from [4,5,6].

Population: A population is a collection of data measured on all experimental units of interest to the researcher. The concept of experimental units herein refers to those objects upon which the measurements (or observations) are made. The collection of data, which is typically large, possibly infinite, either exists in fact or is part of an ongoing operation and hence is conceptual [4]. For example, the population could be "all customers of a manufacturer over the recent past and in the future".

Finite Population: A finite population is a population that consists of a finite number of experimental units.

Infinite Population: An infinite population is a population in which it is theoretically impossible to measure all the experimental units. In practice, a finite population with many experiment units is considered to be an infinite population.

Target Population: The complete collection of observations that the researcher wants to study [5]. The research objective determines the choice of the target population.  For example, when we study customers of a manufacturer, the target population could be all individual customers, all corporate customers, or all registered customers. Each target population possesses its characteristics.

Sample: A sample is a subset of data selected from a population [4], which is representative of the population.

Sample Size: The number of experimental units to be included in a sample [6].

Sampling: The process of selecting a sample from the population is called sampling [6].

Sampling Unit: An experimental unit that is selected for a sample [5].

Sampled Population: The population from which the sample was taken [5]. Ideally, the sampled population is the same as the target population. In practice, the sampled population is usually smaller than the target population. For example, we want to study all customers of a retailer, i.e. target population. But not all customers have registered their personal information into the Customer Relationship Management (CRM) system. All those customers in the CRM system form the sample population.

Sampling Frame: A list, map, or other specification of sampling units in the population from which a sample may be selected [5]. The sampling frame differs from the population in that the sampling frame is more specific. For example, the sampling frame of the manufacturer customers is an actual list of all customers in the database tables. The population, on the other side, is abstract, for example, all customers of the manufacturer. We select a sample from the sampling frame rather than the population.

Parameters: Numerical descriptive measures of the population [4], for example, the population mean statistical formula and the population standard deviation statistical formula.

Statistics: Numerical descriptive measures calculated from sample data [4], for example, the sample mean statistical formula and the sample standard deviation statistical formula.

1.2 Sampling Techniques

Based on the method of selecting a sample, various sampling techniques are broadly categorized into two groups: probability sampling and non-probability sampling. When sampling techniques of probability sampling are used, sampling units are selected randomly by known probabilities from a sampling frame. On the other side, in non-probability sampling, each unit is selected without the use of probability. Some other factors, for example, the researcher’s judgment, convenience, etc. determine the selection of a sample.

Table 1 shows the broad classification of probability sampling and non-probability sampling. Most sampling solutions on the web provided by the database community are using the simple random sampling method. Database professionals can find an appropriate method in this table according to the nature of the population.

statistical formula

One purpose of this tip is to present a list of sampling techniques to database professionals. All these techniques bring risks of sampling errors. Depending on the nature of populations, some techniques might carry smaller risks of sampling errors. Furthermore, some techniques can predict the risks of sampling errors. A detailed exploration of each sampling technique is beyond the scope of this tip. In Section 1.3, we will give practice in the use of simple random sampling for selecting a sample.

1.3 Selecting a Simple Random Sample Without Replacement

A simple random sample (hereinafter referred to as the "SRS") is one of the simplest forms of probability sample, and it is the foundation for more complex sampling designs [5]. There are two ways of selecting a unit for a simple random sample: with replacement (hereinafter referred to as the "SRSWR") and without replacement (hereinafter referred to as the "SRSWOR").  In this tip, we will take a SRSWR. That means the units once chosen are not placed back in the sampling frame.

If we take samples of size statistical formula from a population of size is statistical formula, the total number of possible samples is computed by the combinations rule:

statistical formula

In a SRSWOR, every possible sample should have an equal chance of being selected as the representative of the population. Therefore, the probability of one sampling unit, such as unit statistical formula, being selected into any samples is obtained by the following equation:

statistical formula

There are two methods of randomly selecting a sampling unit [6]:

  1. The lottery method;
  2. Using random numbers;

In the lottery method, each sampling unit is assigned a number. We select numbers one by one and all selected numbers will not be selected again. The process is analogous to drawing lottery numbers in a box. We can use a computer program to choose these sampling units randomly.

AdventureWorks is a fictional company that sells bicycles and cycling accessories. Microsoft provides a transactional database [8] with 27,659 online orders placed by individual customers. I have used the lottery method to take a SRSWOR of size 30 from this population.

The first step is to assign a number to each sampling unit. I created a temporary table with an identity column, then inserted all online orders placed by individual customers into the temporary table. This process assigned all orders with consecutive whole numbers:

CREATE TABLE #all_online_orders
(
  order_id      int identity(1,1) primary key,
  order_no      nvarchar(25) not null,
  customer_name nvarchar(150) not null,
  order_date    datetime not null,
  total_amount  money not null,
);
 
INSERT INTO #all_online_orders
SELECT so.[SalesOrderNumber]
         ,CONCAT(p.FirstName, ' '
         ,p.LastName) AS [Customer Name]
         ,so.[OrderDate]
         ,so.[SubTotal]
FROM [Sales].[SalesOrderHeader] so 
INNER JOIN [Sales].[Customer] sc ON so.CustomerID = sc.CustomerID
INNER JOIN [Person].[Person] p ON sc.PersonID = p.BusinessEntityID
WHERE so.[OnlineOrderFlag] = 1
  and so.[Status] = 5
  and sc.StoreID IS NULL
order by [OrderDate] 

The second step is to generate random numbers. I used the 4-step procedure [9] for generating 30 random integer numbers from the range [1, 27659]. Note that duplicates were discarded. Comparing to the population size 27659, the sample size 30 is so small that the chance of generating numbers with duplicates is low.

CREATE TABLE #lottery_number 
( 
   lottery_number int primary key 
); 

WHILE (SELECT COUNT(*) FROM #lottery_number) < 30 
BEGIN 
   BEGIN TRY  
      INSERT INTO #lottery_number 
      SELECT CEILING(RAND(CHECKSUM(NEWID())) * 27659.0) 
   END TRY  
   BEGIN CATCH  
      PRINT 'discards duplicates' 
   END CATCH 
END 

The last step is to select sampling units according to these random numbers. I used these random numbers to find their corresponding online orders, and therefore a SRSWOR has been selected:

SELECT 
    order_id, 
    order_no, 
    customer_name, 
    order_date, 
    total_amount 
FROM #all_online_orders o 
INNER JOIN #lottery_number l ON o.order_id = l.lottery_number 			

Table 2 shows the SRSWOR of size 30. It is noteworthy that we must resist a temptation of making any adjustment of the sample, even though some units in the sample does not look random.

sample data

Bajpai has provided a 5-step sampling design process in his book [6]: (1) define the target population; (2) determine the sampling frame; (3) select an appropriate sampling technique; (4) determine sample size; (5) execute the sampling process. Kabir also has summarized a 7-step sampling design process [10]. By following these systematic design processes, we have more chances to get a representative sample.

2 – Sampling Distribution

Table 2 showed the selected sample of size 30. I assume that the characteristic of interest is the total amount of each order. Using the sample mean and sample standard deviation definitions introduced in [11], the sample mean is 1,267.56 and the sample standard deviation is 1,244.96. When we repeatedly take a SRSWOR of size 30, we get different sample means and sample standard deviations. The following procedure was used to select 20,000 samples, calculate sample means, and then produce a histogram of the sample means:

Step 1: Run the following T-SQL script to compute sample means and save them into a global temporary table:

CREATE TABLE #all_online_orders 
( 
   order_id int identity(1,1) primary key, 
   order_no nvarchar(25) not null, 
   customer_name nvarchar(150) not null, 
   order_date datetime not null, 
   total_amount money not null, 
); 

CREATE TABLE #lottery_number 
( 
   lottery_number int primary key 
); 

CREATE TABLE  ##order_avg_amount  
( 
   avg_amount money not null 
); 

INSERT INTO #all_online_orders 
SELECT so.[SalesOrderNumber] 
      ,CONCAT(p.FirstName, ' ', p.LastName) AS [Customer Name] 
      ,so.[OrderDate] 
      ,so.[SubTotal] 
FROM [Sales].[SalesOrderHeader] so 
INNER JOIN [Sales].[Customer] sc ON so.CustomerID = sc.CustomerID 
INNER JOIN [Person].[Person] p ON sc.PersonID = p.BusinessEntityID 
WHERE so.[OnlineOrderFlag] = 1 and so.[Status] = 5 and sc.StoreID IS NULL 
order by [OrderDate] 

WHILE (select COUNT(*) from ##order_avg_amount) < 20000 
BEGIN 
   TRUNCATE TABLE #lottery_number 
   WHILE (select COUNT(*) from #lottery_number) < 30 
   BEGIN 
      BEGIN TRY  
         INSERT INTO #lottery_number 
         SELECT CEILING(RAND(CHECKSUM(NEWID())) * 27659.0) 
      END TRY  
      BEGIN CATCH  
         print 'disregards duplicates' 
      END CATCH 
   END 
   INSERT INTO ##order_avg_amount 
   SELECT AVG(total_amount) as avg_amount 
   FROM #all_online_orders o 
   INNER JOIN #lottery_number l ON o.order_id = l.lottery_number 
END 
DROP TABLE #lottery_number 
DROP TABLE #all_online_orders 

Step 2: Run the following R script to retrieve data from the global temporary table and then plot a histogram:

# Reset the compute context to your local workstation. 
rxSetComputeContext("local") 
# Define connection string to SQL Server 
sql.server.conn.string <- "Driver=SQL Server;Server=.;Database=AdventureWorks2017;Trusted_Connection={Yes}" 
# Create an RxSqlServerData data source object 
data.source.object <- RxSqlServerData(sqlQuery = "                                     
                                select avg_amount from ##order_avg_amount 
                                ", 
                                connectionString = sql.server.conn.string) 
# Read the data into a data frame in the local R session. 
data.frame.avg.amount <- rxImport(data.source.object) 
# Create a Histogram of Random Numbers 
hist(data.frame.avg.amount$avg_amount, main = "The Histogram of Sample Means", 
     xlab = "Sample Mean", freq = TRUE) 

Step 3: Drop the global temporary table:

DROP TABLE ##order_avg_amount			
histogram

Figure 1 The Histogram of Sample Means

Figure 1 exhibits the frequency distribution of sample means. The histogram is close to a bell-shaped curve. To have a better understanding of the variability in these sample statistics, and then to estimate population parameters, we need to study the probability distributions of these statistics, called sampling distributions. In this tip, I place my focus on the sampling distribution of the means.

2.1 Expected Value and Variance

I have explored discrete probability distributions and continuous probability distributions in [9]. Through probability distributions, we can compute the chance of a random variable taking a value within a range. This section studies some measures of the probability distributions: expected value, variance and standard deviation.

When we arrange a set of observations into an array, one measure of the array is the mean of the array. The mean is a central value around which the data tend to cluster. The probability distribution of a continuous random variable also has some descriptive measures such as expected value. The expected value is the mean of a random variable that represents the mean outcome when we repeat a random experiment many times. The expected value of a discrete random variable is defined by:

statistical formula

where statistical formula is the random variable with distinct possible values statistical formula; statistical formula is the size of the support, possibly infinitely;statistical formula is the probability mass function(PMF).

The discrete uniform distribution, the Bernoulli distribution, and the binomial distribution have been discussed in [9]. We use PMFs of these distributions to compute expected values of these random variables:

(1), statistical formula has a discrete uniform distribution, denoted by statistical formula.

statistical formula

(2), statistical formula has a Bernoulli distribution, denoted by statistical formula.

statistical formula

(3), statistical formula has a binomial distribution, denoted by statistical formula.

statistical formula
statistical formula

Let statistical formula and statistical formula, then

statistical formula

The probability distribution of a continuous random variable does not have a PMF, instead, it has a probability density function (PDF). We interpret the expected value of a continuous random variable in the same way as we did for the discrete random variables. The expected value is the mean value of the continuous random variable over a large number of experiments. The expected value of the continuous random variable is computed by

statistical formula

where statistical formula is the random variable with PDF statistical formula.

The PDFs of uniform distribution and the standard distribution have been given in [9]. We use these PDFs to calculate expected values of these random variables:

(1), statistical formula has a uniform distribution, denoted by statistical formula.

statistical formula

(2), statistical formula has the standard normal distribution, denoted by statistical formula.

Before computing the expected value, let’s review some properties of a function statistical formula. If statistical formula is an odd function and the integral of the odd function exists, the area under the function from -a to 0 cancels the area under the function from 0 to a:

statistical formula

If statistical formula is an even function, and the integral of the even function exists, the area under the function from -a to 0 is equal to the area under the function from 0 to a:

statistical formula

Since statistical formula is an odd function, therefore the expected value of the standard normal random variable is 0:

statistical formula

The most important property of expected value is linearity, which is represented by the following equations [7]:

statistical formula
statistical formula

Variance and standard deviation, the most widely used measures of variability, describe the dispersion of a random variable. The variance is the average of the squared deviations about the expected value of the random variable, and the standard deviation is the positive square root of the variance:

statistical formula
statistical formula

The variance of random variables has some useful properties. If statistical formula and statistical formula are independent continuous random variables, the variance of the random variable statistical formula is obtained by the following equation [7]:

statistical formula

If statistical formula is a continuous random variable and statistical formula is a constant, the variance of the random variable statistical formula is obtained by the following equation [7]:

statistical formula

Since we have already calculated expected values of some random variables, we use these expected values to compute variances of random variables:

(1), statistical formula has a discrete uniform distribution, denoted by statistical formula.

statistical formula
statistical formula

(2), statistical formula has a Bernoulli distribution, denoted by statistical formula.

statistical formula
statistical formula

(3), statistical formula has a binomial distribution, denoted by statistical formula.

The process to compute the statistical formula is not straightforward. We have already known that a binomial distribution with parameters statistical formula and statistical formula can be considered to perform statistical formula independent Bernoulli trials with the probability of success statistical formula in each trial. We use statistical formula to denote a success of the statistical formula trial, then statistical formula can be represented by:

statistical formula

It is noting that all random variables from statistical formula have the same value of variance, that is statistical formula. Then, we can use the property of variance to compute the variance of the random variable statistical formula:

statistical formula

(4), statistical formula has a uniform distribution, denoted by statistical formula.

statistical formula
statistical formula

(5), statistical formula has the standard normal distribution, denoted by statistical formula.

Assuming that we have already known the following two equations, we use them to compute the variance of the standard normal distribution. The proof of the first equation is very tricky and requires more mathematical background. The second equation uses the fact that statistical formula decays much faster than statistical formula grows [7].

statistical formula
statistical formula

Then, we compute the term statistical formula in the variance equation:

statistical formula

Let statistical formula and statistical formula, then

statistical formula
statistical formula
statistical formula

Therefore,

statistical formula
statistical formula

2.2 The Central Limit Theorem

Let’s represent the characteristic of each unit in a population by a random variable statistical formula, and assume statistical formula are independent, identically distributed with finite mean statistical formula and finite variance statistical formula. The mean of the sample of size statistical formula is computed by

statistical formula

The sample mean is a random variable because a function of random variables is a random variable. When we repeatedly take a sample of size statistical formula from the population, we obtain different sample means and sample standard deviations. We compute expected value and variance of the sample means:

statistical formula
statistical formula
statistical formula

We usually denote the expected value of the sample means by statistical formula and denote the standard deviation of the sample means by statistical formula, which is also referred to as the standard error of the mean. We denote the standard error of the mean by statistical formula. Then, we have these equations:

statistical formula
statistical formula
statistical formula

The law of large numbers (LLN) says that, as the sample size statistical formula, the sample mean statistical formula converges to the true mean statistical formula pointwise with probability 100% [7]. LLN has two versions, and the proof of the weak law of large numbers is provided in book [7]. We have implicitly applied LLN in many simulations already. For example, I repeatedly ran programming scripts in [9] to find the probability of event occurrences.

The central limit theorem (CLT) describes the distribution of the random variable statistical formula:

Random variables statistical formula are independent and identically distributed with mean statistical formula and variance statistical formula. If these statistical formula random variables constitute a random sample from an infinite population, as statistical formula, the distribution of statistical formula  approaches standard normal.

To prove this theorem, we need to have some knowledge of the moment generating function (MGF), which is not required in this tip. In practice, we often use the approximation form of the CLT: for a large statistical formula, the distribution of statistical formula is approximately normal distributed, denoted as statistical formula. Usually, we consider sample size statistical formula to be large if statistical formula. The approximation form of the CLT states that even if statistical formula independent observations from a population that is not normally distributed, the sample mean of the observations is approximately normally distributed if statistical formula is large.

2.3 Normal Distributions

The three main reasons make normal distributions be a keystone in statistics [12]:

  1. Many random variables in science, business and industry are approximately normally distributed;
  2. The shape of some discrete probability distributions approximates a normal distribution when the size of the support is large enough;
  3. The central limit theorem provides the basis for statistical inference;

Normal Distributions have the well-known bell-shaped curve shown in Figure 2. They have the probability density function (PDF) with two parameters statistical formula and statistical formula:

statistical formula

When a random variable has this PDF, the variable has a normal distribution and it is referred to as a normal random variable.

normal probablitity density

Figure 2 Normal Probability Density Function

The following are some of the important characteristics of the normal curve [12,13]:

  • The curve is bell-shaped and has one peak at the center of the distribution, on which the mean, median and mode locate;
  • The curve is symmetrical about the mean statistical formula. The area under the curve to the left of the mean equals to the right of the mean;
  • The curve has inflection points at statistical formula and statistical formula, where statistical formula is the standard deviation;
  • The area under the curve represents probability. The total area under the curve is 1. The area lies within the interval from statistical formula to statistical formula is about 68%. The area lies within the interval from statistical formula to statistical formula is about 95%. The area lies within the interval from statistical formula to statistical formula is about 99.7%.

Since both PDF and CDF of normal distributions are not closed-form expressions, it is not effective to compute the area under the curve by using these two functions. In practice, we transform a normal distribution statistical formula into the standard normal distribution, denoted by statistical formula, which has a mean of 0 and a variance of 1:

statistical formula

where statistical formula is called z-score or z-value that represents the distance from the mean in standard deviation units, and statistical formula is the value of the random variable statistical formula.

The areas under the standardized normal curve have been tabulated. We also can compute the areas by using a built-in function in programming languages, for example, R language. Let’s look at an example in [13]:

Let statistical formula be statistical formula, where the random variable represents the length of life (in years) of an electric can opener. If this can opener has a 1 yr warranty, what fraction of original purchases will require replacement? (Panik, 2012, Example 6.3)   

I would like to point out that statistical formula in book [13] denotes that normal distribution has a mean of 2.9 and a standard deviation of 0.9. Other books, for example [7], interprets that the normal distribution has a variance of 0.9. In this example, I consider that the distribution has a standard deviation of 0.9 so that the answer is the same as the one in book [13].

First, we transform the normal distribution statistical formula into the standard normal distribution statistical formula:

statistical formula

The question asked to find the probability when statistical formula. Let’s express statistical formula in terms of statistical formula:

statistical formula
statistical formula

The gray area in Figure 3 represents the probability. We can either use the table of probabilities and Z-scores or use a computer program to compute the gray area. I used the following R function, and the function returned a value of 0.0174. It seems about 1.74% of original sales will require replacement because of the 1-year warranty.

> pnorm(-2.11, lower.tail = TRUE) 
[1] 0.01742918 
normal probablitity density

Figure 3 The Area to the Left of (z=-2.21)

2.4 Sampling Distribution of the Means

Figure 1 plotted means of 20,000 samples and illustrated the sampling distribution of the means. We usually cannot produce all possible sample means to study the sample distributions of the means. Statistical theory concludes three important characteristics of the sampling distribution of the means [14]:

  1. Because of the CLT, if the sample size statistical formula is large (rule of thumb statistical formula), the sample means are approximately normally distributed, even though the population does not have a normal distribution;
  2. The mean of the sampling distribution of the means is equal to the population mean:
statistical formula
  1. The standard deviation of the sampling distribution of the means, commonly referred to as the standard error, is computed as the population standard deviation divided by the square root of the sample size when the population is infinite:
statistical formula

The finite correction factor is used for a finite population:

statistical formula

In practice, the finite correction factor usually is ignored unless statistical formula[12].

Because the population standard deviation statistical formula usually is unknow, statistical formula is estimated by the sample standard deviation statistical formula if the sample size is greater than 30 [14,15]:

statistical formula

We have defined a population with 27,659 online orders placed by individual customers. I used the following T-SQL statement to find the populations parameters:

SELECT AVG([SubTotal])  AS Mean 
   ,VAR([SubTotal]) AS Variance 
   ,COUNT(*) AS N 
FROM [Sales].[SalesOrderHeader] so JOIN [Sales].[Customer] sc ON so.CustomerID = sc.CustomerID 
WHERE so.[OnlineOrderFlag] = 1 
and so.[Status] = 5 
and sc.StoreID IS NULL 

The population has a mean of 1061.45, a variance of 1320260.75 and a standard deviation of 1149.03. When we select an SRSWOR of size 30 from the population, Let’s find the probability that the difference between the sample mean and the population mean is not over 5% of the population mean:

We have already known

statistical formula
statistical formula
statistical formula
statistical formula

Compute the range of the z-scores

statistical formula
statistical formula
statistical formula
statistical formula

The probability of the event statistical formula as shown in Figure 4 was computed by the following R commands:

> 1-2*pnorm(-0.26, lower.tail = TRUE) 
[1] 0.2051362 
> 
normal probablitity density

Figure 4 The Probability of the Event P(-0.26≤z≤0.26)

There are 20.5% of the chance that the sample mean is in the interval statistical formula, i.e. [1008.377, 1114.523]. To verify the theorical calculation, I use T-SQL script to select 20,000 samples, then compute the probability of the sample mean being in this interval. The following script returned the probability of 20.7%, which is close to the theoretical calculation.

CREATE TABLE #all_online_orders 
( 
   order_id int identity(1,1) primary key, 
   order_no nvarchar(25) not null, 
   customer_name nvarchar(150) not null, 
   order_date datetime not null, 
   total_amount money not null, 
);
 
CREATE TABLE #lottery_number 
( 
   lottery_number int primary key 
);
 
CREATE TABLE  ##order_avg_amount  
( 
   avg_amount money not null, 
   within_interval smallint not null 
); 

INSERT INTO #all_online_orders 
SELECT so.[SalesOrderNumber] 
      ,CONCAT(p.FirstName, ' ', p.LastName) AS [Customer Name] 
      ,so.[OrderDate] 
      ,so.[SubTotal] 
FROM [Sales].[SalesOrderHeader] so INNER JOIN [Sales].[Customer] sc ON so.CustomerID = sc.CustomerID 
INNER JOIN [Person].[Person] p ON sc.PersonID = p.BusinessEntityID 
WHERE so.[OnlineOrderFlag] = 1 and so.[Status] = 5 and sc.StoreID IS NULL 
order by [OrderDate] 

WHILE (select COUNT(*) from ##order_avg_amount) < 20000 
BEGIN 
   TRUNCATE TABLE #lottery_number 
   WHILE (select COUNT(*) from #lottery_number) < 30 
   BEGIN 
      BEGIN TRY  
         INSERT INTO #lottery_number 
         SELECT CEILING(RAND(CHECKSUM(NEWID())) * 27659.0) 
      END TRY  
      BEGIN CATCH  
         print 'disregards duplicates' 
      END CATCH 
   END 
   INSERT INTO ##order_avg_amount 
   SELECT AVG(total_amount) as avg_amount, 
      IIF(AVG(total_amount) >= 1008.377 and AVG(total_amount) <= 1114.523, 1, 0) 
   FROM #all_online_orders o 
   INNER JOIN #lottery_number l ON o.order_id = l.lottery_number 
END 

SELECT CAST(SUM(within_interval) as float)/CAST(COUNT(*) as float) AS probability 
FROM ##order_avg_amount 

DROP TABLE #lottery_number 
DROP TABLE #all_online_orders 
DROP TABLE ##order_avg_amount 

2.5 Determining Sample Size

The sample size is the number of units selected for a sample. Based on the definition of the standard error of the sample mean, as the sample size increases, the standard error decreases. However, a larger sample size means more cost of sampling. Determination of sample size is one of the most important steps in the sampling process. To determine the sample size, we should at least know these two criteria:

  1. The level of precision, which is a range to include the true value of the population. For example, we may need to determine a sample size so that the estimate mean should be within 5% error of the true mean;
  2. The confidence level, which describes the uncertainty associated with a sampling method. We cannot guarantee a single sample to be representative of the population. The true mean may not in the estimated interval by using the single sample. A confidence level, for example, 95%, means that, by repeating the procedure over and over again, we obtain many computed intervals, and 95% of these intervals contain the true mean.

When determining the size of the sample, a typical requirement is that the estimate should be within a 5% error of the true parameter. [16, 17] review other criteria: the purpose of the study, the population size, the degree of variability in the attributes being measured [18], the statistical power and one- or two-tailed statistical analysis.

[16,17] also have covered several approaches to determine the sample size: using a census for small populations, imitating a sample size of similar studies, using published tables, and applying formulas to calculate the sample size.

Summary

A population is a collection of data measured on all experimental units of interest to the researcher. A sample is a subset of data selected from a population. The researcher usually studies samples to gather useful information about the population. The two main types of sampling techniques are probability sampling and non-probability sampling. In the probability sampling process, the probability of units being selected into the sample is known. In non-probability sampling methods, the chance of units being selected within the population is unknown. A typical sampling technique is the simple random sampling without replacement.

The expected value represents the mean outcome when we repeat a random experiment many times. In the inferential process, sample means are used to estimate population means because of the central limit theorem. The central limit theorem states that for a population with any distribution shape if a sample of sufficiently large sample size n (n ≥30) is drawn from the population, the sample means are approximately normally distributed.

The probability density function (PDF) of a normal distribution is the well-known bell-shaped curve, which is symmetrical about its mean and asymptotic to the horizontal axis. The two parameters mean and standard deviation together determine a normal probability distribution. The mean of the sampling distribution of the means is equal to the population mean, and the standard deviation of the sampling distribution of the means is computed as the population standard deviation divided by the square root of the sample size.

Sample size determination is the technique of determining the number of units selected for a sample. The level of precision and the confidence level need to be specified to determine the appropriate sample size. Several strategies are available to determine sample size, such as using a census for small populations, imitating a sample size of similar studies, using published tables, and applying formulas to calculate the sample size.

References

[1] Olken, F., & Rotem, D. (1986). Simple Random Sampling from Relational Databases. VLDB.

[2] Black, K. (2013). Business Statistics: For Contemporary Decision Making, 8th Edition. Hoboken, NJ: Wiley.

[3] Deming, W. E. (1950). Some Theory of Sampling. Mineola, NY: Dover.

[4] William, M., & Sincich, T. (2012).  A Second Course in Statistics: Regression Analysis (7th Edition). Boston, MA: Prentice Hall.

[5] Lohr, L. S. (2019). Sampling: Design and Analysis. Boca Raton, FL: Chapman and Hall/CRC.

[6] Bajpai, N. (2018).  Business Research Methods Second Edition. India: Pearson Education India.

[7] Hwang, J. & Blitzstein, K. J. (2015). Introduction to Probability. Boca Raton, FL: CRC Press.  

[8] Kess, B. (2017). AdventureWorks sample databases. Retrieved from https://github.com/Microsoft/sql-server-samples/releases/tag/adventureworks/.

[9] Zhou, N. (2020). Using SQL Server RAND Function Deep Dive. Retrieved from https://www.mssqltips.com/sqlservertip/6301/using-sql-server-rand-function-deep-dive/.

[10] Kabir, S. M. (2016). Sample and Sampling Design. Retrieve from Research Gate: https://www.researchgate.net/publication/325846982.

[11] Zhou, N. (2019). Numerically Describing Dispersion of a Data Set with SQL Server and R. Retrieved from https://www.mssqltips.com/sqlservertip/6058/numerically-describing-dispersion-of-a-data-set-with-sql-server-and-r/.

[12] Hummelbrunner, S. A., Rak, L. J., Fortura, P., & Taylor, P. (2003). Contemporary Business Statistics with Canadian Applications (3rd Edition). Toronto, ON: Prentice Hall.  

[13] Panik, I. M. (2012). Statistics Inference: A Short Course. Hoboken, NJ: John Wiley & Sons.  

[14] Freed, N., Jones, S., & Bergquist, T. (2013). Understanding Business Statistics. Hoboken, NJ: John Wiley & Sons.  

[15] Jost, S. (2017). CSC 423: Data Analysis and Regression. Retrieved from DePaul University Website: http://facweb.cs.depaul.edu/sjost/csc423/

[16] Israel, G. D. (1992). Determining Sample Size. University of Florida, FL: EDIS

[17] Singh, A. S. & Masuku, M. B. (2014). Sampling Techniques & Determination of Sample Size in Applied Statistics Research: An Overview. International Journal of Economics, Commerce and Management, Vol. II, Issue 11, Nov 2014. ISSN 2348 0386.

[18] Miaoulis, G. & Michener, R. D. (1976). An Introduction to Sampling. Dubuque, IA: Kendall Hunt Publishing

Next Steps


sql server categories

sql server webinars

subscribe to mssqltips

sql server tutorials

sql server white papers

next tip



About the author
MSSQLTips author Nai Biao Zhou Nai Biao Zhou is a Senior Software Developer with 20+ years of experience in software development, specializing in Data Warehousing, Business Intelligence, Data Mining and solution architecture design.

This author pledges the content of this article is based on professional experience and not AI generated.

View all my tips



Comments For This Article

















get free sql tips
agree to terms