OPTICS Clustering Algorithm in SQL Server

Problem

Ordering points to identify the clustering structure (OPTICS) is an algorithm for finding density-based clusters very similar do DBSCAN. However, OPTICS handles it more effectively in the case of a cluster with varying densities gaining deeper insights exploring the hierarchical structure of your data. This algorithm is generally more computationally intensive.

Is it possible to have the OPTICS clustering algorithm implemented in SQL Server without using an external solution?

Solution

Optics Clustering Algorithm

The OPTICS clustering algorithm is useful in geospatial data and maps, market and customer segmentation, anomaly detection, image and pattern analysis, biology, medicine, general exploratory analysis, and more. It adapts to different cluster shapes and densities, which happens a lot in real-world data.

In a dataset of points, the OPTICS algorithm walks through these points, one by one, trying to keep track of how close they are to each other, saving this number called reachability distance. This represents how easily you can reach that point from another cluster, creating an ordering of points. Those clusters will appear as a valley where low reachability means inside a cluster and high reachability means gaps between clusters. Later you can decide where to cut the valleys to define the clusters. From the ordering points you can extract clusters of different densities, without fixing the cluster size in advance.

It is necessary to supply two parameters. The first one is the epsilon which is the maximum distance to consider and the minimum points required to form a cluster inside this radius.

Create SQL Server Objects

First we will create some tables, views, functions, and stored procedures in SQL Server for our solution.

Table

We need just one table to hold our data and calculations.

-- MSSQLTips (TSQL)
 
CREATE TABLE [dbo].[OpticsData](
   [PointId] [int] IDENTITY(1,1) NOT NULL,
   [X] [float] NOT NULL,
   [Y] [float] NOT NULL,
   [ClusterId] [int] NULL,
   [Ordering] [int] NULL,
   [CoreDist] [float] NULL,
   [ReachDist] [float] NULL,
   [IsSeed] [bit] DEFAULT 0,
   [IsProcessed] [bit] DEFAULT 0,
   [SpatialPoint]  AS ([geometry]::Point([x],[y],(0))),
 CONSTRAINT [PK_OpticsData] PRIMARY KEY CLUSTERED 
(
   [X] ASC,
   [Y] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON, OPTIMIZE_FOR_SEQUENTIAL_KEY = OFF) ON [PRIMARY]
) ON [PRIMARY]
GO

View

To show only the main columns.

-- MSSQLTips (TSQL)
 
CREATE OR ALTER VIEW [dbo].[vOpticsData]
AS
SELECT PointId, Ordering, ClusterId, CoreDist, ReachDist, X, Y
FROM dbo.OpticsData
GO

User defined functions

This function will calculate the distances using the Euclidean method.

-- =============================================
-- Author:      SCP - MSSQLTips
-- Create date: 20280828
-- Description: Optics Euclidean distance
-- =============================================
CREATE OR ALTER FUNCTION [dbo].[ufnOpticsEuclideanDist] 
            (@PointId1 int
            ,@PointId2 int)
RETURNS float
AS
BEGIN
   DECLARE   @Dist float
            ,@x1 float
            ,@y1 float;
 
   SELECT    @x1 = [X]
            ,@y1 = [Y]
      FROM  [dbo].[OpticsData] 
      WHERE [PointId] IN (@PointId1);
 
   SELECT    @Dist = SQRT(SQUARE(@x1 - [X]) + SQUARE(@y1 - [Y]))
      FROM   [dbo].[OpticsData] 
      WHERE  [PointId] IN (@PointId2);
 
   RETURN @Dist;
END
GO

This function will calculate the distances using the Manhattan method.

-- =============================================
-- Author:      SCP - MSSQLTips
-- Create date: 20280828
-- Description: Optics Euclidean distance
-- =============================================
CREATE OR ALTER FUNCTION [dbo].[ufnOpticsManhattanDist] 
            (@PointId1 int
            ,@PointId2 int)
RETURNS float
AS
BEGIN
   DECLARE  @Dist float
           ,@x1 float
           ,@y1 float;
 
   SELECT    @x1 = [X]
            ,@y1 = [Y]
      FROM  [dbo].[OpticsData] 
      WHERE  [PointId] IN (@PointId1);
 
   SELECT   @Dist = (ABS(@x1 - [X]) + ABS(@y1 - [Y]))
      FROM  [dbo].[OpticsData] 
      WHERE [PointId] IN (@PointId2);
 
   RETURN @Dist;
END
GO

Stored procedures

The first store procedure is for the OPTICS algorithm. I used the value of 1E50 as infinitum.

-- =============================================
-- Author:      SCP - MSSQLTips
-- Create date: 20250609
-- Description: OPTICS clustering
-- =============================================
CREATE OR ALTER PROCEDURE [dbo].[uspOpticsClustering] 
                (@Epsilon decimal(18,6)
                ,@MinPts int
                ,@Threshold float
                ,@E_Or_M_Calc char(1) = 'E')
WITH EXECUTE AS CALLER
AS
BEGIN
   SET NOCOUNT ON;
 
    IF (SELECT COUNT(*) FROM [dbo].[OpticsData]) < 1
        RETURN;
 
   BEGIN TRY
 
      --  E_Or_M_Calc: (E) for Euclidian and (M) for Manhattan distances
 
      DECLARE   @i int
               ,@j int
               ,@cd float
               ,@rd float
               ,@ordering int = 0
               ,@clusterId int = 0;
 
      UPDATE    [dbo].[OpticsData]
         SET    [ClusterId] = NULL
               ,[CoreDist] = NULL
               ,[ReachDist] = 1E50
               ,[IsProcessed] = 0
               ,[IsSeed] = 0;
 
      -- Core distance ==========================================================
 
      DECLARE     @Neighbors
         TABLE (p1 int NOT NULL
               ,p2 int NOT NULL
               ,dist float
               ,pos int
               ,PRIMARY KEY (p1, p2));
 
      IF @E_Or_M_Calc = 'E' 
         INSERT INTO  @Neighbors
            SELECT    O1.PointId
                     ,O2.PointId
                     ,[dbo].[ufnOpticsEuclideanDist] (O1.PointId, O2.PointId)
                     ,ROW_NUMBER() OVER(PARTITION BY O1.PointId ORDER BY [dbo].[ufnOpticsManhattanDist] (O1.PointId, O2.PointId), O2.PointId)
               FROM  [dbo].[OpticsData] O1 CROSS JOIN 
                      [dbo].[OpticsData] O2 
               WHERE  O1.PointId <> O2.PointId AND
                      [dbo].[ufnOpticsEuclideanDist] (O1.PointId, O2.PointId) <= @epsilon;
      ELSE
         INSERT INTO  @Neighbors
            SELECT    O1.PointId
                     ,O2.PointId
                     ,[dbo].[ufnOpticsManhattanDist] (O1.PointId, O2.PointId)
                     ,ROW_NUMBER() OVER(PARTITION BY O1.PointId ORDER BY [dbo].[ufnOpticsManhattanDist] (O1.PointId, O2.PointId), O2.PointId)
               FROM  [dbo].[OpticsData] O1 CROSS JOIN 
                      [dbo].[OpticsData] O2 
               WHERE  O1.PointId <> O2.PointId AND
                      [dbo].[ufnOpticsManhattanDist] (O1.PointId, O2.PointId) <= @epsilon;
 
      UPDATE        [dbo].[OpticsData]
         SET        [CoreDist] = dist
         FROM     @Neighbors N
         WHERE     [PointId] = N.p1 AND
                   N.pos = 1;
 
      -- Reachability distance ==================================================
 
      WHILE EXISTS 
         (SELECT    1 
            FROM   [dbo].[OpticsData] 
            WHERE  [IsProcessed] = 0) BEGIN
 
         UPDATE     [dbo].[OpticsData]
            SET     [IsSeed] = 0;
 
         SELECT TOP 1 @i = [PointId]
                  ,@cd = [CoreDist]
            FROM  [dbo].[OpticsData]
            WHERE  [IsProcessed] = 0
            ORDER BY [PointId];
 
         UPDATE     [dbo].[OpticsData]
            SET     [IsProcessed] = 1
                  ,[Ordering] = @ordering
            WHERE  [PointId] = @i;
      
         SET @ordering += 1;
 
         WITH        cteReach AS
            (SELECT     p2
                     ,CASE WHEN @cd > dist 
                          THEN @cd
                          ELSE dist
                      END AS NewReach
               FROM  [dbo].[OpticsData] LEFT OUTER JOIN
                      @Neighbors ON  
                      [PointId] = p1
               WHERE  p1 = @i)
            UPDATE     [dbo].[OpticsData]
               SET     [ReachDist] = NewReach
                     ,[IsSeed] = 1
               FROM  cteReach
               WHERE  [ReachDist] > NewReach AND
                      [PointId] = p2 AND
                      [IsProcessed] = 0;
 
         WHILE EXISTS 
            (SELECT     1 
               FROM  [dbo].[OpticsData] 
               WHERE  [IsSeed] = 1) BEGIN
 
            SELECT TOP 1 @j = [PointId]
                     ,@cd = [CoreDist]
               FROM  [dbo].[OpticsData]
               WHERE  [IsProcessed] = 0
               ORDER BY [ReachDist];
 
            WITH        cteReach AS
               (SELECT     p2
                        ,CASE WHEN @cd > dist 
                             THEN @cd
                             ELSE dist
                         END AS NewReach
                  FROM  [dbo].[OpticsData] LEFT OUTER JOIN
                         @Neighbors ON  
                         [PointId] = p1
                  WHERE  p1 = @j)
               UPDATE     [dbo].[OpticsData]
                  SET     [ReachDist] = NewReach
                        ,[IsSeed] = 1
                  FROM  cteReach
                  WHERE  [ReachDist] > NewReach AND
                         [PointId] = p2 AND
                         [IsProcessed] = 0;
 
            UPDATE     [dbo].[OpticsData]
               SET     [IsProcessed] = 1
                     ,[IsSeed] = 0
                     ,[Ordering] = @ordering
               WHERE  [PointId] = @j;
 
            SET @ordering += 1;
 
         END
 
      END;
 
      -- Clustering Id ==========================================================
 
      DECLARE crsCluster CURSOR FOR 
         SELECT        [CoreDist]
                     ,[ReachDist]
            FROM     [dbo].[OpticsData]
            ORDER BY  [Ordering]
         FOR UPDATE OF  [ClusterId];
 
      OPEN crsCluster
      FETCH NEXT FROM crsCluster INTO @cd,@rd
         
      WHILE @@FETCH_STATUS = 0 BEGIN
 
         IF @rd = 1E50
            SET @clusterId += 1;
         
         IF @cd <= @threshold BEGIN
            UPDATE    [dbo].[OpticsData]
               SET    [ClusterId] = @clusterId
               WHERE CURRENT OF crsCluster;
         END ELSE
            SET @clusterId += 1;
 
         FETCH NEXT FROM crsCluster INTO @cd,@rd
      END
 
      CLOSE crsCluster
      DEALLOCATE crsCluster
 
      -- Results ================================================================
 
      SELECT    [PointId]
               ,[Ordering]
               ,[ClusterId]
               ,[CoreDist]
               ,[ReachDist]
               ,[X]
               ,[Y]
         FROM  [dbo].[vOpticsData]
         ORDER BY [Ordering];
 
   END TRY
   BEGIN CATCH
      IF @@TRANCOUNT > 0
         BEGIN
            ROLLBACK TRANSACTION;
         END
         
      -- Print error information. 
      PRINT 'Error: ' + CONVERT(varchar(50), ERROR_NUMBER()) +
            ', Severity: ' + CONVERT(varchar(5), ERROR_SEVERITY()) +
            ', State: ' + CONVERT(varchar(5), ERROR_STATE()) + 
            ', Procedure: ' + ISNULL(ERROR_PROCEDURE(), '-') + 
            ', Line: ' + CONVERT(varchar(5), ERROR_LINE());
      PRINT ERROR_MESSAGE();
   END CATCH;
END
GO

The second stored procedure is to visualize the plotted clusters.

-- =============================================
-- Author:      SCP - MSSQLTips
-- Create date: 20250904
-- Description: Optics clustering show
-- =============================================
CREATE OR ALTER PROCEDURE [dbo].[uspOpticsDataShow] 
            (@Scale float = 0.01)
WITH EXECUTE AS CALLER 
AS
BEGIN;
   BEGIN TRY
 
      DECLARE  @OverallExtent geometry
            ,@ProportionalBuffer float;
 
        SELECT       @OverallExtent = geometry::EnvelopeAggregate([SpatialPoint])
            FROM     [dbo].[OpticsData];
 
        DECLARE @MinX float = @OverallExtent.STPointN(1).STX;
        DECLARE @MinY float = @OverallExtent.STPointN(1).STY;
        DECLARE @MaxX float = @OverallExtent.STPointN(3).STX;
        DECLARE @MaxY float = @OverallExtent.STPointN(3).STY;
 
        DECLARE @ExtentWidth float = @MaxX - @MinX;
        DECLARE @ExtentHeight float = @MaxY - @MinY;
 
        DECLARE @MinDimension float = IIF(@ExtentWidth < @ExtentHeight, @ExtentWidth, @ExtentHeight);
 
        DECLARE @Ratio float = @MinDimension * @scale; 
 
        -- Returning the clusters
 
        SELECT       [ClusterId]
                    ,COUNT(*) AS N
                    ,geometry::CollectionAggregate(SpatialPoint) AS geom
            FROM    (SELECT      ClusterId
                                ,CASE 
                                    WHEN ClusterId IS NULL THEN  SpatialPoint.STBuffer(@Ratio / 2) 
                                    ELSE SpatialPoint.STBuffer(@Ratio * 0.75 ) 
                                    END AS SpatialPoint 
                        FROM     [dbo].[OpticsData]) AS X
            GROUP BY [ClusterId];
 
   END TRY
   BEGIN CATCH
      PRINT 'Error: ' + CONVERT(varchar(50), ERROR_NUMBER()) +
            ', Severity: ' + CONVERT(varchar(5), ERROR_SEVERITY()) +
            ', State: ' + CONVERT(varchar(5), ERROR_STATE()) + 
            ', Procedure: ' + ISNULL(ERROR_PROCEDURE(), '-') + 
            ', Line: ' + CONVERT(varchar(5), ERROR_LINE());
      PRINT ERROR_MESSAGE();
   END CATCH;
END
GO

Examples of using the algorithm

Below are three different examples of using the algorithm using the SQL code we created.

Example 1

-- MSSQLTips (TSQL)
 
TRUNCATE TABLE [dbo].[OpticsData];
INSERT INTO [dbo].[OpticsData]
         ([X]
         ,[Y])
      VALUES (1.0, 1.0),(1.2, 1.1),(0.8, 0.9),(5.0, 5.0),
         (5.2, 5.1),(4.9, 5.2),(10.0, 10.0),(10.1, 9.9),
         (9.8, 10.2),(13.0, 13.0);
 
EXECUTE [dbo].[uspOpticsClustering] 2,2,0.5;
EXECUTE [dbo].[uspOpticsDataShow] 0.04;
GO

The first stored procedure results:

Example 1 summary

Visualize the plotted clusters:

Example 1 graph

Example 2

-- MSSQLTips (TSQL)
 
TRUNCATE TABLE  [dbo].[OpticsData];
INSERT INTO  [dbo].[OpticsData]
         ([X]
         ,[Y])
    VALUES   ( 1, 1),( 2, 1),( 1, 2),( 2, 2),( 3, 5)
            ,( 3, 9),( 3,10),( 4,10),( 4,11),( 5,10)
            ,( 7,10),(10, 9),(10, 6),( 9, 5),(10, 5)
            ,(11, 5),( 9, 4),(10, 4),(11, 4),(10, 3); 
 
EXECUTE  [dbo].[uspOpticsClustering] 1.75,4,1;
EXECUTE [dbo].[uspOpticsDataShow] 0.03;
GO

The first stored procedure results:

Example 2 summary

The ClusterId does not follow a numeric sequence order, but defines the cluster.

Visualize the plotted clusters:

Example 2 graph

Example 3 – Happy face

Load to the Optics data table the happy face data using my article Generating shaped-bounded random points in SQL Server, and run

-- MSSQLTips (TSQL)
 
EXECUTE  [dbo].[uspOpticsClustering] 0.2,10,0.1;
EXECUTE [dbo].[uspOpticsDataShow] 0.02;
GO

Visualize the plotted clusters:

Happy face graph

Next Steps

As you can see Optics does not produce clusters itself and will depends on how you cut the reachability plot.

One improvement that can be done in the code above is to implement the Xi method to steep downward changes in reachability and starts or ends clusters based on those compared slopes of adjacent values.

Leave a Reply

Your email address will not be published. Required fields are marked *