Populate dimension tables dynamically in SQL Server Part 1

By:   |   Updated: 2022-04-08   |   Comments (2)   |   Related: More > Database Design


Problem

We often build logging or other insert-only tables where we store large strings like URLs, host names, or error messages. It's usually not until the table has terabytes of data that we realize there might have been a better way. If we are logging traffic or exceptions for our own application, it's likely that we record the same URL, host name, or error message on millions of rows. What if we only had to write that URL or host name or message text once, the first time we saw it? In this tip, I want to share one idea for abstracting away recurring values, reducing storage, and making search queries faster (especially those with wildcards) without requiring immediate changes in the application layer.

Solution

I mentioned dynamic dimension tables in the title because, typically, a dimension table is one we generate in advance with a known (or at least slowly changing) data domain. These are data sets that remain relatively static over time, but we don't always know all possible values up front. Examples include order status codes, lists of cities and states, an enumeration of major SQL Server versions, or – one of my favorites – a calendar table.

When there are new values coming into the system regularly, we can generate such a dimension on the fly, populating it with only new dimension values as we encounter them. Logging tables that can grow exponentially and unpredictably are exactly where it is worth investing in this kind of optimization.

Let's say we start with a logging table like this:

CREATE TABLE dbo.ExceptionLog
(
  LogID         bigint IDENTITY(1,1),
  Hostname      nvarchar(64),
  URI           nvarchar(255),
  ErrorMessage  nvarchar(255),
  EventTime     datetime2(0) NOT NULL DEFAULT sysutcdatetime(),
  CONSTRAINT    PK_ExceptionLog PRIMARY KEY (LogID)
);

After some time, if the application is busy, we can imagine there will be a lot of rows in this table. To simulate this, without writing a low quality app that generates years' worth of exceptions quickly, we can generate some realistic data manually. I may have mentioned in previous tips that data generation is often my favorite part of these exercises.

First, we need a variety of host names (let's say a bunch of machines acting as specific types of application servers). One way to do that is to pull some rows from sys.all_objects and use something that varies a little, but not much; one attribute with a small range of variety is the length of object names:

SELECT TOP (5) HostName = CONCAT(CASE LEN(o.name) % 2 
                          WHEN 1 THEN N'WebApp' 
                                 ELSE N'FileLoad' END, 
                          N'Server', LEN(o.name))
FROM sys.all_objects AS o 
ORDER BY NEWID();

Results (yours will absolutely vary):

HostName
----------------
WebAppServer25
FileLoadServer16   
FileLoadServer20   
WebAppServer19     
WebAppServer21     

If we expand that, we can add fake URI and EventTime values:

SELECT TOP (5) HostName  = CONCAT(CASE LEN(o.name) % 2 
                           WHEN 1 THEN N'WebApp' 
                                  ELSE N'FileLoad' END, 
                           N'Server', LEN(o.name)),
               URI       = N'/' + REPLACE(o.name, N'_', N'/'),
               EventTime = DATEADD(MINUTE, ABS(o.object_id) % 10000, '20220101')
FROM sys.all_objects AS o 
ORDER BY NEWID();

Results:

HostName           URI                          EventTime
----------------   --------------------------   ----------------
WebAppServer25     /fn/GetRowsetIdFromRowDump   2022-01-02 08:16
FileLoadServer16   /sp/MSdeleteretry            2022-01-01 06:44
FileLoadServer20   /trace/event/bindings        2022-01-05 22:52
WebAppServer19     /default/constraints         2022-01-01 06:48
WebAppServer21     /sp/addtabletocontents       2022-01-02 01:36

Now, all that's left is the error messages. Again, we want to have some variety here, but not too much. It is also nice to have realistic text without having to use our imagination. We can cross join to sys.messages:

SELECT TOP (5) HostName  = CONCAT(CASE LEN(o.name) % 2 
                           WHEN 1 THEN N'WebApp' 
                                  ELSE N'FileLoad' END, 
                           N'Server', LEN(o.name)),
               URI       = N'/' + REPLACE(o.name, N'_', N'/'),
               MsgText   = LEFT(msg.Text, 255),
               EventTime = DATEADD(MINUTE, ABS(o.object_id) % 10000, '20220101')
FROM sys.all_objects AS o 
CROSS JOIN
(
  SELECT TOP (5) [text] FROM sys.messages 
  WHERE language_id = 1033 ORDER BY NEWID()
) AS msg
ORDER BY NEWID();

Results (with some truncation for page width):

HostName           URI            MsgText               EventTime
----------------   ------------   -------------------   ----------------
WebAppServer25     /fn/GetRo...   A corrupted mess...   2022-01-02 08:16
FileLoadServer16   /sp/MSdel...   Invalid property...   2022-01-01 06:44
FileLoadServer20   /trace/ev...   A corrupted mess...   2022-01-05 22:52
WebAppServer19     /default/...   %.*ls: Page %d:%...   2022-01-01 06:48
WebAppServer21     /sp/addta...   Warning: the flo...   2022-01-02 01:36

Next, we can go for scale. Increase the objects to 10,000 rows, the messages to 100 rows, and then repeat 100 times (producing a million rows into our logging table):

INSERT dbo.ExceptionLog(Hostname, Uri, ErrorMessage, EventTime)

SELECT TOP (10000) 
               HostName  = CONCAT(CASE LEN(o.name) % 2 
                           WHEN 1 THEN N'WebApp' 
                                  ELSE N'FileLoad' END, 
                           N'Server', LEN(o.name)),
               URI       = N'/' + REPLACE(o.name, N'_', N'/'),
               MsgText   = LEFT(msg.Text, 255),
               EventTime = DATEADD(MINUTE, ABS(o.object_id) % 10000, '20220101')
FROM sys.all_objects AS o 
CROSS JOIN
(
  SELECT TOP (100) [text] FROM sys.messages 
  WHERE language_id = 1033 ORDER BY NEWID()
) AS msg
ORDER BY NEWID();
GO 100

On my machine, this took a little over a minute, and we have our million rows. Partial output of sp_spaceused:

rows      reserved    data 
-------   ---------   ---------
1000000   351368 KB   335920 KB

And if we look at the breakdown of distinct values in our three important columns:

SELECT Hosts     = COUNT(DISTINCT HostName),
       HostSize  = SUM(DATALENGTH(HostName))/1024,     -- MB
       URIs      = COUNT(DISTINCT URI),   
       URISize   = SUM(DATALENGTH(URI))/1024,          -- MB
       Errors    = COUNT(DISTINCT ErrorMessage),
       ErrorSize = SUM(DATALENGTH(ErrorMessage))/1024, -- MB
       AllThree  = SUM(DATALENGTH(HostName) + 
                       DATALENGTH(URI)+
                       DATALENGTH(ErrorMessage))/1024  -- MB
  FROM dbo.ExceptionLog;

Results:

Hosts   HostSize   URIs   URISize   Errors   ErrorSize   AllThree
-----   --------   ----   -------   ------   ---------   --------
48      29244      2507   46334     7288     207840      283418

That shows us the not-so-shocking news that these wide columns take up almost 90% of the table's 300+MB.

Now what?

We shouldn't need to keep roughly 20,000 copies of each host name, 400 copies of each URI, and 125 copies of each error message, especially on a very write-heavy table. As I mentioned in the opening, this is wasteful in terms of storage (and, by extension, memory and I/O), and makes for expensive searches for specific values (never mind substring and wildcard searches).

To address this, we can create separate dimension tables that just contain one copy of each values, and a corresponding integer ("surrogate key") that can be used in the logging table instead.

CREATE TABLE dbo.ExceptionHostnames
(
  HostnameID  bigint IDENTITY(1,1),
  Hostname    nvarchar(64),
  CONSTRAINT  PK_ExceptionHostnames PRIMARY KEY (HostnameID)
);

CREATE TABLE dbo.ExceptionURIs
(
  URIID       bigint IDENTITY(1,1),
  URI         nvarchar(255),
  CONSTRAINT  PK_ExceptionURIs PRIMARY KEY (URIID)
);

CREATE TABLE dbo.ExceptionErrorMessages
(
  ErrorMessageID  bigint IDENTITY(1,1),
  ErrorMessage    nvarchar(255),
  CONSTRAINT      PK_ExceptionErrorMessages PRIMARY KEY (ErrorMessageID)
);

Why bigint? I have made it a personal policy to always use bigint for identity columns because of seeing int exhaustion one time too many.

Then we can add nullable foreign key columns to the fact table:

ALTER TABLE dbo.ExceptionLog ADD
  HostnameID     bigint NULL FOREIGN KEY
                 REFERENCES dbo.ExceptionHostnames(HostnameID),
  URIID          bigint NULL FOREIGN KEY
                 REFERENCES dbo.ExceptionURIs(URIID),
  ErrorMessageID bigint NULL FOREIGN KEY
                 REFERENCES dbo.ExceptionErrorMessages(ErrorMessageID);

Next, we add an AFTER INSERT trigger so that we don't miss any new hosts, URIs, or errors (and it's important we do this before back-filling the old rows). It needs to do two things: (1) generate new IDs for new values we haven't seen before, and (2) update the table with those new IDs.

CREATE TRIGGER dbo.ExceptionLog_DivertDimensions
ON dbo.ExceptionLog
AFTER INSERT
AS
BEGIN
  SET NOCOUNT ON;

  -- generate new IDs
  INSERT dbo.ExceptionHostnames(Hostname) 
    SELECT Hostname FROM inserted AS i WHERE NOT EXISTS 
    (
      SELECT 1 FROM dbo.ExceptionHostnames 
        WHERE Hostname = i.Hostname
    ) GROUP BY Hostname;

  INSERT dbo.ExceptionURIs(URI) 
    SELECT URI FROM inserted AS i WHERE NOT EXISTS 
    (
      SELECT 1 FROM dbo.ExceptionURIs 
        WHERE URI = i.URI
    ) GROUP BY URI;

  INSERT dbo.ExceptionErrorMessages(ErrorMessage) 
    SELECT ErrorMessage FROM inserted AS i WHERE NOT EXISTS 
    (
      SELECT 1 FROM dbo.ExceptionErrorMessages 
        WHERE ErrorMessage = i.ErrorMessage
    ) GROUP BY ErrorMessage;

  -- update IDs
  UPDATE l SET l.HostnameID = h.HostnameID
    FROM dbo.ExceptionLog AS l
    INNER JOIN dbo.ExceptionHostnames AS h
    ON l.Hostname = h.Hostname
    INNER JOIN inserted AS i ON i.LogID = l.LogID
    WHERE l.HostnameID IS NULL;

  UPDATE l SET l.URIID = u.URIID
    FROM dbo.ExceptionLog AS l
    INNER JOIN dbo.ExceptionURIs AS u
    ON l.URI = u.URI
    INNER JOIN inserted AS i ON i.LogID = l.LogID
    WHERE l.URIID IS NULL;

  UPDATE l SET l.ErrorMessageID = m.ErrorMessageID
    FROM dbo.ExceptionLog AS l
    INNER JOIN dbo.ExceptionErrorMessages AS m
    ON l.ErrorMessage = m.ErrorMessage
    INNER JOIN inserted AS i ON i.LogID = l.LogID
    WHERE l.ErrorMessageID IS NULL;
END
GO

Once the trigger is in place, we can populate the dimension tables with all the values we've seen so far. We can take our time with this, and for larger tables we could use log-friendly, non-disruptive batching, since the application has no idea there's been a change yet. For now, we'll continue populating both sets of columns, and clear out the wider values later.

INSERT dbo.ExceptionHostnames(Hostname) 
   SELECT Hostname FROM dbo.ExceptionLog AS src WHERE NOT EXISTS 
   (
     SELECT 1 FROM dbo.ExceptionHostnames 
       WHERE Hostname = src.Hostname
   ) GROUP BY Hostname;

 INSERT dbo.ExceptionURIs(URI) 
   SELECT URI FROM dbo.ExceptionLog AS src WHERE NOT EXISTS 
   (
     SELECT 1 FROM dbo.ExceptionURIs 
      WHERE URI = src.URI
   ) GROUP BY URI;

 INSERT dbo.ExceptionErrorMessages(ErrorMessage) 
   SELECT ErrorMessage FROM dbo.ExceptionLog AS src WHERE NOT EXISTS 
   (
     SELECT 1 FROM dbo.ExceptionErrorMessages 
       WHERE ErrorMessage = src.ErrorMessage
   ) GROUP BY ErrorMessage; 

After that, we can start backfilling the ID columns in the fact table with the keys we generated above. This looks remarkably like the second half of the trigger:

UPDATE l SET l.HostnameID = h.HostnameID
  FROM dbo.ExceptionLog AS l
  INNER JOIN dbo.ExceptionHostnames AS h
  ON l.Hostname = h.Hostname
  WHERE l.HostnameID IS NULL;

UPDATE l SET l.URIID = u.URIID
  FROM dbo.ExceptionLog AS l
  INNER JOIN dbo.ExceptionURIs AS u
  ON l.URI = u.URI
  WHERE l.URIID IS NULL;

UPDATE l SET l.ErrorMessageID = m.ErrorMessageID
  FROM dbo.ExceptionLog AS l
  INNER JOIN dbo.ExceptionErrorMessages AS m
  ON l.ErrorMessage = m.ErrorMessage
  WHERE l.ErrorMessageID IS NULL;

Note: If the error messages that come in from the app contain tokens, like you see in sys.messages, this is probably fine. If the error messages are materialized with instance-specific inline values before logging happens, it might not be such a good idea, because you'll probably never have the same error message twice. A different approach in that case could be to get the pre-tokenized error string and the individual token values from the application at logging time, and store the individual values separately (it's always much easier to assemble a string from parts than to extract parts of a string).

The types of queries your users run against the logging table will help drive your indexing strategy for these columns, so I'll leave that part out for now. Being integers, though, they certainly open up a lot more doors than wide Unicode strings.

But an after trigger is not ideal

This will work in the short term: the application can keep inserting rows into ExceptionLog, and the trigger will sync everything together. Since we're still populating the wider columns, any readers will still be able to see those values without needing to know anything about the new dimension tables we're populating.

But we don't want to keep this logic in the trigger for long, because it's hiding some of the data logic, and the nature of the movement is relatively expensive. Eventually we'll want to update the application (or the stored procedures it calls), remove the trigger, and empty the wider columns in the log table. How we do that depends on how quickly we can change the application.

If we can change the app right now

For any portions of the application code that read from ExceptionLog, we can create a view that handles the joins for them and produces the same output columns as before, so the change we need to make to the app is minimal:

CREATE VIEW dbo.vExceptionLog
AS
  SELECT l.LogID,
         h.Hostname,
         u.URI,
         m.ErrorMessage,
         l.EventTime
  FROM dbo.ExceptionLog AS l
  INNER JOIN dbo.ExceptionHostnames AS h
    ON l.HostnameID = h.HostnameID
  INNER JOIN dbo.ExceptionURIs AS u
    ON l.URIID = u.URIID
  INNER JOIN dbo.ExceptionErrorMessages AS m
    ON l.ErrorMessageID = m.ErrorMessageID;

Once that view is in place, we can change the applications so that, when they write exceptions here, they are calling a stored procedure that deals with our new dimension tables without the use of a trigger (and without needlessly populating the wider columns). This is a slightly more invasive change to the application code if it's currently performing singleton inline inserts but, with any luck, we're just changing the stored procedure the application is already calling. This procedure will generate the IDs first so there is only one insert (leaving out error handling and proper transaction handling for brevity):

CREATE PROCEDURE dbo.LogException
  @Hostname      nvarchar(64),
  @URI           nvarchar(255),
  @ErrorMessage  nvarchar(255)
AS
BEGIN
  SET NOCOUNT ON;

  DECLARE @h table(HostnameID     int);
  DECLARE @u table(URIID          int);
  DECLARE @m table(ErrorMessageID int);

  INSERT @h(HostnameID) SELECT HostnameID 
    FROM dbo.ExceptionHostnames
    WHERE Hostname = @Hostname;

  IF @@ROWCOUNT = 0
  BEGIN
    INSERT dbo.ExceptionHostnames(Hostname) 
      OUTPUT inserted.HostnameID INTO @h(HostnameID) 
      VALUES(@Hostname);
  END

  INSERT @u(URIID) SELECT URIID 
    FROM dbo.ExceptionURIs
    WHERE URI = @URI;

  IF @@ROWCOUNT = 0
  BEGIN
    INSERT dbo.ExceptionURIs(URI) 
      OUTPUT inserted.URIID INTO @u(URIID) 
      VALUES(@URI);
  END

  INSERT @m(ErrorMessageID) SELECT ErrorMessageID 
    FROM dbo.ExceptionErrorMessages
    WHERE ErrorMessage = @ErrorMessage;

  IF @@ROWCOUNT = 0
  BEGIN
    INSERT dbo.ExceptionErrorMessages(ErrorMessage) 
      OUTPUT inserted.ErrorMessageID INTO @m(ErrorMessageID) 
      VALUES(@ErrorMessage);
  END

  INSERT dbo.ExceptionLog(HostnameID, URIID, ErrorMessageID)
    SELECT (SELECT HostnameID FROM @h),
           (SELECT URIID FROM @u),
           (SELECT ErrorMessageID FROM @m);
END
GO

(If any of these dimension columns are non-NULLable in the ExceptionLog table, you'll have to fix that first, or continue populating them until you can change the table.)

We can test this with the following:

EXEC dbo.LogException @Hostname     = N'My Laptop', 
                      @Uri          = N'/fake/path/', 
                      @ErrorMessage = N'404 for fake path';

SELECT TOP (1) * FROM dbo.ExceptionLog  ORDER BY LogID DESC;
SELECT TOP (1) * FROM dbo.vExceptionLog ORDER BY LogID DESC;

Output (leaving out EventTime):

LogID     Hostname    URI    ErrorMessage   HostnameID   URIID   ErrorMessageID
-------   ---------   ----   ------------   ----------   -----   --------------
1000002   NULL        NULL   NULL           50           2509    7290

LogID     Hostname    URI           ErrorMessage
-------   ---------   -----------   -----------------
1000002   My Laptop   /fake/path/   404 for fake path

Once this is the common path for inserts, we can disable or drop the trigger (it won't impact any logic, but it still is going to perform additional, wasteful reads to determine that we already took care of the dimension values).

And when the application is writing and reading in a way that the old columns are no longer relevant, we can drop them. Of course, we'll first need to handle any dependent indexes or constraints. If you want to play it a little bit safer, you could set them all to NULL or empty string (in batches if you need to be less intrusive), but I'd be cautious here. I'd rather get an invalid column error from some obscure part of the app, and know about the problem immediately, than for sudden blank strings to go unnoticed.

If we can't change the app(s) right now

Quite commonly we have development cycles for the database side that are not in sync with the application side, or there are multiple applications that can't all be changed together, or one or more apps aren't even under our control. In these cases, we won't necessarily have the freedom to use our new dimension tables exclusively, but in part 2 I'll talk about ways we can minimize the write impact and shield users from the underlying change as we wait (and wait, and wait) to change the application(s).

Next Steps



Get Started Now - Click here to get your free 14 day trial of SolarWinds Database Insights




get scripts

next tip button



About the author
MSSQLTips author Aaron Bertrand Aaron Bertrand (@AaronBertrand) is a passionate technologist with industry experience dating back to Classic ASP and SQL Server 6.5. He is editor-in-chief of the performance-related blog, SQLPerformance.com, and also blogs at sqlblog.org.

View all my tips


Article Last Updated: 2022-04-08

Comments For This Article




Tuesday, April 12, 2022 - 2:14:07 AM - Barak 810 Back To Top (89993)
I use a lot of dimension tables, but not for logs.
I prefer archiving process for logs.
1. The log is used by DBA the same as for Developers I prefer everybody to be able to read it straight forward.
2. The real volume though is with the large texts. An error message returned by the payment provider contains the User ID, the time and variables all as text. Those can never ever be handled by a dimension tables.

I just keep X days in the logs and get all the rest to long-term archiving using Clustered Columnstore Index on the main table and varchar(max) into varbinary with Compress for the texts.
It turns out that Host Names, IPs and other repeated values are compressed really well with columnstore.

Monday, April 11, 2022 - 5:37:26 AM - Onur Back To Top (89990)
Tip Comments Pending Approval


download














get free sql tips
agree to terms