SQL Server 2008 MERGE More than UPSERT

So, more than likely almost everyone has already heard that Sql 2008 will introduce a MERGE syntax, making it much easier to deal with scenarios where you need to perform 'upsert' type logic (i.e., you have a source data set and you want to put records from that source set into a target set (table, updateable view), but some of the records already exist in the target, and some don't – for those that exist, you want to perform an update, for those that don't exist, you want to insert…hence, upsert :-)). A simple bullet list of what you do when using the MERGE statement would include:

Specify the source and target data sets
Specify joining conditions between the source and target data set
Specify 1 or more 'matching' clauses
Specify the action to occur when matches do/don't exist

However, there is much to the MERGE statement that far exceeds simply 'upsert' operations, for example:

There are 3 separate matching clauses you can specify, not only 2 (so it's not just an upsert if you want). The 3 matching clauses are:
- WHEN MATCHED
- WHEN [TARGET] NOT MATCHED
- WHEN SOURCE NOT MATCHED

The 'WHEN MATCHED' clause fires for records that match based on the joining conditions you specify between the source and target sets (typically where you might perform an update). The 'WHEN [TARGET] NOT MATCHED' fires when a record exists in the source data set that doesn't exist in the target (typically where you might perform an insert) – note also that the [TARGET] clause is optional, and if you simply write 'WHEN NOT MATCHED' you are implying the [TARGET] clause. The 'WHEN SOURCE NOT MATCHED' clause fires when a record exists in the target data set that doesn't exist in the source data set (typically where you might not do anything at all, or where you might perform a delete on the target) – some people get a little cross-eyed at this and wonder when it would ever be useful. Well, a big place where it will be useful is for managing data warehouse merges for example, and we'll show an example of this further down in the post…

You can include additional arguments in the matching clauses

You don't have to leave it at 'WHEN MATCHED' or 'WHEN TARGET NOT MATCHED'. You can also say something like 'WHEN MATCHED AND columnId > 10' or 'WHEN MATCHED AND columnQuantity = 0' or anything else you can come up with that is a valid join-type argument. We'll also see an example of this a bit further down in the post…

You can include multiple instances of the same matching clause

You aren't restricted to having only a single 'WHEN MATCHED' clause, or a single 'WHEN TARGET NOT MATCHED' clause – you can have as many of them as you like. This makes logical sense given #2, but it's worth stating as well. If you have multiple instances of the same matching clause, they are processed in the order they are listed in your query. Yep, you guessed it, we'll see an example of this later down the post…

You don't have to specify the typical operations with the typical clauses

What I mean by this is you don't have to use an update with the 'WHEN MATCHED' clause, and you don't have to use an insert with the 'WHEN TARGET NOT MATCHED' clause. Most likely, you'll frequently use the 'typical' operations with each clause, but you might find this useful particularly in cases where you have multiple instances of the same clause. And yes, we'll see an example of this later down the post…

You can use a new $action function in conjunction with the OUTPUT

This new $action function will output values of 'INSERT', 'UPDATE', or 'DELETE', giving you insight into what type of operation occurred on the record in question to cause it to be OUTPUTed…

Put all of these points together and you've got a very robust statement with plenty of possible uses. Obviously, it's also still great for the 'simple' scenarios of just wanting to 'upsert' data from one set to another. Particularly if you combine these features with other new features in Sql 2008 (such as the INSERT over DML feature, which I blogged about just the other day). One place much of this additional functionality will be extremely beneficial for is when loading datamarts/warehouses, particularly in a Slowly Changing Dimension scenario. (I've attached to this blog a simple MERGE usage scenario as well if you haven't had the opportunity to work with it yet).

If you're not familiar with a Type 2 Slowly Changing Dimension scenario, this next block will try to explain the main parts of it that you'd be concerned with for the following script/demo – if you are familiar with the semantics of it, simply skip this block and go straight to the script below.

In a type 2 slowly changing dimension, you can think of having a source data set that contains records that identify something you care to track the historical record of (could be inventory parts, or contacts, etc., etc.). The historical stuff you want to track could include changes in pricing for inventory parts, or phone # changes for a contact, etc. In your OLTP system, you most likely have only the 'current' reflection of the part or contact, but in your data warehouse, you may want/need to keep all versions of each part/contact so you can correlate that back to other associated items at a particular point in time (orders for the parts for example, or what geographic region a person lived in). There's many ways to do this, but in a Type 2 SCD, you basically have a table that mimics the OLTP table in terms of structure with the addition of columns for tracking the start/end dates/times of when the given version of a record was 'valid' or 'current'. The current version of the record would contain no value for the end time, since it's currently valid, but you may have 15 other copies of the same logical record in the table, each with the same key values but differing in non-key values in some way (i.e. the same part number, but different prices, or descriptions, or cost, etc., etc.). When an existing current record for some logical entity needs to be replaced with a new current version of the record, the existing record is simply left in place and updated to have an end time of the current date/time value, then the new version of the record is inserted and left with a null end date/time value. If you end up in the scenario where you have a copy (or multiple copies) of a given logical record in your warehouse, but suddenly no longer carry that item, or the contact no longer exists (i.e. you no longer carry this item in inventory perhaps, or a person passes away perhaps), you have a few possible options. You can simply purge/delete all the records for that logical record from your data warehouse; more likely however, you'll want to keep the history for that item/thing around, but simply indicate that no current data exists for the given thing – in this case, you would simply expire the current version of the record (by updating the end date/time value for the existing current record) and not put in a new version of the record.
This is vastly over-simplified, but enough to get the idea on how you might handle this type of scenario today if you had to with existing Sql technology and to see the benefit of the following sample…

So, in a Type 2 SCD scenario today, to get data from your source system to your target system might require anywhere from 3-6 or more different operations (compare the source against the target to insert non-existing records, do the same to update existing records, do the same again to insert new copies of existing records, do the same again to expire records from the target that don't exist in the source, etc.) depending on the functionality you want (obviously, you could use SSIS and the SCD task as well, but it's got to figure out this stuff somehow as well). With Sql 2008's MERGE and INSERT over DML functionality, you can perform this task in a single atomic operation (which is a little tough to understand until you review it a few times, but I've commented the heck out of it here, so hopefully it will be fairly easy to consume). Once you work through what's going on, it opens up the potential for lots of other places it would be handy as well in each environment.

Enjoy!

————————————————————————————————————-

—— CODE ONLY BELOW ——

————————————————————————————————————-

use tempdb;
go
— Cleanup…
if object_id('tempdb.dbo.scd2_source') > 0
drop table dbo.scd2_source;
if object_id('tempdb.dbo.scd2_target') > 0
drop table dbo.scd2_target;

— Setup the source table…
create table dbo.scd2_source (
    id                        int not null,
    category_id                int not null,
    short_desc                varchar(25) not null,
    long_desc                varchar(250) not null
go
alter table dbo.scd2_source add constraint pk_scd2_source
    primary key clustered (id)
go

— Setup the target table…
create table dbo.scd2_target (
    id                        int not null,
    category_id                int not null,
    short_desc                varchar(25) not null,
    long_desc                varchar(250) not null,
    date_effective_begin    datetime not null,
    date_effective_end        datetime null,
    date_created            datetime not null default getdate(),
    date_modified            datetime null
go
— Given this is a type 2 SCD, uniqueifier is the ID plus the effective begin date…
alter table dbo.scd2_target add constraint pk_scd2_target
    primary key clustered (id, date_effective_begin)
go

/*
    In this scenario, we will have a record that exists in the source table but not
    in the target table, 2 records that exist in both the source and the target (however
    1 of these records will match completely, 1 will match on key values but not on
    descriptive values), and 1 recrod that exists in the target table, but not the
    source table. We're simulating a slowly changing type 2 dimension scenario here.
*/

— Input some sample source records…
insert    dbo.scd2_source
        (id, category_id, short_desc, long_desc)
values    (1,1,'ISNT','in the SOURCE, not in the TARGET'),
        (2,2,'ISIT_FULLMATCH','in the SOURCE, in the TARGET, match all columns'),
        (3,3,'ISIT_KEYMATCHONLY_A','in SOURCE, in the TARGET, keys match, descriptions do NOT match – this is record A (in the source)');

— Input some sample target records…
insert    dbo.scd2_target
        (id, category_id, short_desc, long_desc,
        date_effective_begin, date_effective_end, date_created, date_modified)
values    (4,4,'NSIT','not in the SOURCE, in the TARGET', getdate(), null, getdate(), null),
        (2,2,'ISIT_FULLMATCH','in the SOURCE, in the TARGET, match all columns', getdate(), null, getdate(), null),
        (3,3,'ISIT_KEYMATCHONLY_B','in SOURCE, in the TARGET, keys match, descriptions do NOT match – this is record B (in the target)', getdate(), null, getdate(), null);
go

/*
    At this point, we've got the structure setup with the initial data as well…
    Should notice that we have the exact dataset we described above…
*/
— Show the before snapshot of the source…
select    id, category_id, short_desc, long_desc
from    dbo.scd2_source;
— Show the before snapshot of the source…
select    id, category_id, short_desc, long_desc,
        date_effective_begin, date_effective_end, date_created, date_modified
from    dbo.scd2_target;
go

/*
    Now for the single operation that will be run to merge the source dataset into the target
    dateset. However, since this is a Slowly Changing Dimension Type 2 scenario, we aren't
    simply going to do a simple MERGE. We've got to:

        1)    INSERT any records from the source that aren't already in the TARGET, based on a key-match
        2)    UPDATE any records in the TARGET that match records from the SOURCE, based on a key-match, but
            ONLY if the non-key columns don't match (if all columns match, we won't do anything, since nothing
            has changed). However, again since this is an SCD2 scenario, we aren't going to UPDATE the target
            record to match the SOURCE record, we're simply going to expire the existing TARGET record by
            changing the date_effective_end column value to the current date/time and put another record with
            the same key values into the table with a current date_effective_begin value and no date_effective_end
            value, since this will be the new 'current' record for the given entity.
        3)    INSERT any records from #2 that have matching key-values but non-matched non-key values – these are the
            new 'current' records for entity's that changed…
        4)    Either UPDATE or DELETE records in the TARGET that no longer exist in the SOURCE. What you would do here
            would depend on your business and how you want to track data across time. This scenario would be something
            like having parts on your shelves at a store for a few years and tracking the changes in price/etc. to
            those items, then one day deciding to pull that item from the shelves and not carry it any longer. At this
            point you could purge all the history from your data warehouse if you so choose, however you may also want
            to keep that history for, well, historical purposes. In this case, instead of deleting the data in the
            target, we'll simply UPDATE the date_effective_end for the current record to the current date/time, and
            then not re-insert a new record for the given item…

    Again, the key thing here is that this will ALL be performed in a single operation with Sql 2008 (most likely this
    would be 4-6 different operations or more in Sql 2005 and earlier (unless you use the SCD task with SSIS, which by
    the way will begin using similar type functionality as this in 2008 and later). The 2 primary new features in SQL
    2008 that allow all this to occur in a single-operation are:

1) MERGE
2) INSERT over DML using OUTPUT

Ok, onward to the statement…
*/

— This is the outer insert statement that will directly consume records from the inner MERGE statement via the use
— of the OUTPUT clause along with the MERGE. The OUTPUT data set here will contain any records from the inner MERGE
— statement that were inserted/updated. However, all we want to INSERT here in the outer insert statement are
— records that were UPDATED due to matching key values but mis-matched non-key values (i.e. a case where we have an
— existing record matching on keys in both the source and target, and we need to update the target record to become
— expired (see the inner MERGE statement for that), but then also INSERT a new 'current' copy of that same logical
— record). To achieve this, you will notice way down at the end of this statement that we have a WHERE clause on
— the SELECT that is feeding this insert – this WHERE clause is what is actually filtering the OUTPUT dataset to
— include only the records we want to INSERT new copies of…
insert    dbo.scd2_target
        (id, category_id, short_desc, long_desc,
        date_effective_begin, date_effective_end, date_created, date_modified)
select    scd2.source_id, scd2.source_category_id, scd2.source_short_desc, scd2.source_long_desc,
        scd2.target_date_effective_end, null, getdate(), null
from    (    — Here is the inner MERGE statment. This MERGE will use the OUTPUT clause (down a bit) to output
            — a data-set that will feed the outer SELECT statment, which feeds the top-most INSERT…
            merge    dbo.scd2_target t    — Merging into the target…
            using    dbo.scd2_source    s    — From the source…
            on        t.id = s.id                        — Joining on ID…
            and        t.category_id = s.category_id    — and category…
            and        t.date_effective_end is null    — only join to TARGET records that make up the 'current'
                                                    — data set, i.e. are not yet expired…
            when    matched                            — Ok, start the MERGE…when we have a match based on the above conditions…
            and        ((isnull(t.short_desc,'') <> isnull(s.short_desc,''))    — AND either the short_desc values do NOT match…
                    or (isnull(t.long_desc,'') <> isnull(s.long_desc,'')))    — OR the long_desc values do NOT match…
            then    update set    — This is a record that has a key-matching record in the source but with different non-key values…
                    t.date_effective_end = getdate(),    — Expire the current target record – we will have to INSERT a new copy
                    t.date_modified = getdate()            — of this logical record later above in the outer INSERT statment…
                    — NOTE: We will ignore any records from the SOURCE that match the TARGET completely – no changes, so nothing to do…
            when    target not matched        — A record in the SOURCE, but not in the TARGET…
            then    insert                    — Simple scenario – insert the new record from the SOURCE into the TARGET…
                    (id, category_id, short_desc, long_desc,
                    date_effective_begin, date_effective_end, date_created, date_modified)
                    values (s.id, s.category_id, s.short_desc, s.long_desc, getdate(), null, getdate(), null)
            when    source not matched        — A record in the TARGET, but not in the SOURCE…
            then    update set                — This is the case outlined above – we will not delete, we will simply…
                    t.date_effective_end = getdate(),    — expire the current record in the TARGET and
                    t.date_modified = getdate()            — not put a new record back in its place (later)
            output    — Ahhh…the OUTPUT clause – this will OUTPUT any records that were inserted/updated by our MERGE above…
                    $action,    — hmmm…this is new…will tell us if the record in the OUTPUT was either inserted, updated, deleted per row…
                    s.id,        — Notice that we're OUTPUTing the SOURCE values – this is because these source values will be used
                    s.category_id,    — to insert a new copy of a pre-existing record in the TARGET that had matching key-values but
                    s.short_desc,    — non-matching non-key values – the old record will have been expired by our WHEN MATCHED clause
                    s.long_desc,    — above in the MERGE, then these values will be used by the outer INSERT statement to put in new copies…
                    inserted.date_effective_end    — Notice we can access the special 'inserted' and 'deleted' tables…
        ) as scd2 (    output_action, source_id, source_category_id, source_short_desc,    — Define the column names for
                    source_long_desc, target_date_effective_end)                        — our OUTPUT set…
where    scd2.output_action = 'UPDATE'    — We only want to INSERT records from the MERGE statement that were UPDATEd…
and        scd2.source_id is not null;    — Need to include this clause because we want to exclude any records that were in the TARGET but not
                                — in the SOURCE and were UDPATEd to be expired, but no replacement 'current' record should be put back…

/*
Take a look at the after state of the target table now…should notice that:

        1)    Record ID 1 has now been INSERTed into the TARGET – single record exists for ID 1 with no date_effective_end value
        2)    Record ID 2 has not changed at all, was not affected by the operation at all (since everything matched)
        3)    Record ID 3 has 2 records – 1 (with a short_desc value of 'ISIT_KEYMATCHONLY_B', which is the record that was already
            in the TARGET table prior to the operation) with a date_effective_end value that is non-null and a date_created value
            that is earlier than the 2nd logical record for ID 3. The 2nd ID 3 record (with a short_desc value of 'ISIT_KEYMATCHONLY_A',
            which is the record that was in the SOURCE table prior to the operation) has a null date_effective_end value and a
            date_created value that is later than the 1st ID 3 record. These records are the result of the scenario where we needed
            to expire the existing record, then INSERT a new copy of that record.
        4)    Record ID 4 has been UPDATEd to have a non-null date_effective_end (which should match the date_modified value for the
            same record). Prior to the operation, this record existed in the TARGET but not in the SOURCE – during the operation,
            since we had a record that no longer was 'on the shelves' so to speak, we simply expired the TARGET record and did not
            replace it with anything new (no new record from the SOURCE to put in its place)
*/
— Show the 'after' state of the target table…
select    *
from    dbo.scd2_target t
order by t.id, t.date_created;
go

Chad Boyd

Chad is an Architect, Administrator, and Developer with technologies such as Sql Server (and related technologies), .NET, Windows Server, and the Microsoft Clustering stack (MSCS, NLB, HPC). After nearly 5 years at Microsoft in a variety of roles related to pushing the limits of the Microsoft stack in many of the largest, most complex Sql Server installations in the world, he now provides consulting to many different customers and clients ranging from startup to Fortune 50. Most recently he has been spending much of his time as an architect at SpruceMedia, Inc., a startup firm in the Facebook/Social Advertising space (http://sprucemedia.com) and resides in the Redmond, WA area where he also spends a significant amount of time writing, talking, presenting and blogging about the data-related technologies on the Microsoft stack. Chad regularly posts Sql Server related content, tools, and advice with the MSSQLTips team and on stackoverflow.com at http://stackoverflow.com/users/169012/chadhoc. Chad can be contacted via email at chad dot boyd dot tips at gmail dot com.

Leave a ReplyCancel Reply