1 Billion

My boss was stomping around agitated because The Business has some heavy data requests. He had written something some time ago that got them something close to what HE thought THEY thought they needed.

I knew that this wasn’t what they really needed, but let him go ahead and send a sample of the data off anyway. Most times when people senior to you have their minds made up, there’s no point in arguing.

“It’s pretty fast,” he said, “if you need the code. It takes about an hour per partition.”

On our system, some of the really big tables are pseudo-partitions – separate tables with identical structures and similar names. Orders1, Orders2, Orders3, etc. He meant it took about an hour to process each one of these tables.

What I really needed was a subset of that data. About 12%, as it turns out. So I politely acknowledged his offer of Free Code! and wrote something that processed the first 16 partitions. It completed in about 2 hours 20 minutes.

As a measure of how much data this is, this is the rowcount of the largest table after the first 16 partitions were done. Remember, this rowcount is about 12% of the data.

———–
109734329

I have another set of partitions to process today, which will a bit more than double this total.

Yes, if take the time to do the math you will see that the base data is more than 1 billion rows.

Posted in Performance, T-SQL | Leave a comment

Procedure Query Plans in Cache

If you remember, SQL Server takes your stored procedure and rewrites it, then prepares a plan. This process is called ‘compilation’. If it can, it then executes the prepared plan rather than interpreting the query anew each time.

These plans are stored in an area called the ‘procedure cache’. Plans are retained in the cache based on a number of factors such as amount of server memory, plan cost, and current memory pressure. The engine will get rid of infrequently used plans before frequently used ones, and cheap plans before expensive ones.

You can, and often will, get multiple plans for some procedures in cache. Predicting how many plans there will be is not an exact science, but in general the more complex the arguments to your procedure, the more plans will be in cache, because the engine will create a different plan if you’re looking for all orders from Chicago than if you’re looking for all orders with total > $1000.

Here is a query that will let you see all plans in cache for a given object (a stored procedure is an object in SQL Server), as well as some useful and interesting information about those plans.

Apologies for the style – the mix of camelcase and under_score names is due to my schizophrenia on this point. I used to be exclusively one, but this place uses the other, and when I write queries that use DMVs I get all split-personality like.

Usecounts = number of times the plan has been used.
Plan_comp_time = the time that this plan was compiled.
Last_execution_time = the time that this plan was last executed.
HoursOld = length of time, in hours, that this plan has been in cache.
HoursUntouched = length of time, in hours, since this plan was last executed.
Execution_count = number of times this plan has been executed.
AvgReads = average number of pages read during an execution.
Max_logical_reads = the highest number of pages read during an execution.
Avg_time = average time, in milliseconds, that this plan took to complete.
Total_elapsed_time = total time – an indicator of total cost, since this is the sum of all times for this plan.
last_elapsed_time = the time for the last execution.
min_elapsed_time = the shortest time for an execution
max_elapsed_time = the longest time for an execution

SELECT  CONVERT(VARCHAR(36),o.name) AS ROUTINE,
        cp.usecounts,
        qs.creation_time AS plan_comp_time,
        qs.last_execution_time,
        CONVERT(VARCHAR(5), DATEDIFF(hh, qs.creation_time, GETDATE())) AS HoursOld,
        CONVERT(VARCHAR(5), DATEDIFF(hh, last_execution_time, GETDATE())) AS HoursUntouched,
        qs.execution_count,
        qs.total_logical_reads/qs.execution_count AS AvgReads,
        qs.max_logical_reads,
        qs.total_elapsed_time/qs.execution_count AS AvgTime,
        qs.total_elapsed_time,
        qs.last_elapsed_time,
        qs.min_elapsed_time,
        qs.max_elapsed_time
FROM    sys.dm_exec_cached_plans    cp
JOIN    sys.dm_exec_query_stats qs
ON      cp.plan_handle = qs.plan_handle
CROSS APPLY sys.dm_exec_sql_text(cp.plan_handle) st
JOIN    sys.objects o
ON      o.OBJECT_ID = st.objectid
WHERE   o.name LIKE ‘%SomeProcedureName%’
– ORDER BY hoursuntouched DESC
 
Posted in Performance, T-SQL | 1 Comment

Dean Yells At People

We had a mini-change go in. It was small, a change to an interface to a third party application. I had an email in hand detailing my part in it, which was to produce a list of old users to new users.

The external application had been notified, and email had gone out to the users. We would go down at 8:00 AM, and be back up in 20 minutes.

Note that this is a change to an interface to an external application that processes actual money for people.

6:45 – I review the project plan, retrieve the query I will use, check the query plan, make sure all is good. I am ready to go.
7:56 – the project manager makes his appearance and boots his laptop.
7:59 – the web dev manager appears.
8:01 – project manager calls out ‘are you guys ready to rock?’
8:04 – the web dev manager informs us that we are waiting for another developer.
8:05 – after some discussion in which someone says that there is some other database code to go in, I blow my stack, because as usual nobody has said anything to me or any other member of the database team. I hate surprises.
8:07 – PM phones the other developer, who apparently we need. He is on the train, and will be about 20 minutes.
8:10 – I calm down a bit, and search the code. I find some changes I made about 3 weeks ago (apparently I am supposed to keep all of these things in my head) and cue them up in my text editor, presuming that these are the necessary changes.
8:28 – I am asked to run the mapping query I had originally been asked for. I run it and supply the results to those who need them.
8:35 – the developer we are waiting for shows up and confirms that yes, we are putting in the code changes I located.
8:36 – I apply the code changes to the production database.
8:37 – I go back to other work, half expecting issues to surface at some point.
8:41 – the interface is back up.

Posted in IT Follies | Leave a comment

I May Be a Day Late, But I Don’t Think I’m a Dollar Short

I sometimes take part in SQL challenges or contests. I don’t think I’ve ever won one, but that could be due to the fact that I tend to come to the party late and wind up rushing something in. Part of that problem, too, is that I am a fairly busy person and spare time to tackle brain-twisting SQL problems is in short supply.

I may have been too late with my entry for Celko’s spring SQL Stumper: Eggs in one Basket on Simple-Talk, Redgate Software’s very useful site. Plus I don’t know how to format code in Simple-Talk’s comment form – I’m sure it’s on the site somewhere, but I couldn’t find it.

If you don’t want to read the problem, the gist of it is that you must take a Basket containing any number of eggs and split that Basket into cartons of 12. If you’ve ever read Celko (and I suggest you do), you’ll know that he abhors looping constructions (cursors and while/for each loops), and so the correct solution had damn well better be a set-based one.

My solution is included below. It does not include any of the stuff I put into production code: parameters are not checked, there are no TRY CATCH blocks, transactions are implicit, and there is no error-handling.

However, the statements are (mostly!) written so that they will not create crap. If the arguments provided will result in something that breaks the business rules, then the insert/update/delete simply won’t happen.

This only runs on SQL Server 2008. It makes use of the VALUES table constructor, and of the powerful MERGE statement. I’m trying to talk higher ups at work into going to SQL Server 2K8 just so that I can use the MERGE.

The first part is the DDL. Good DDL makes good DML possible.

CREATE TABLE Locations
    (LocationID         VARCHAR(20)     NOT NULL,
    LocationDescription VARCHAR(100)    NOT NULL,
    CONSTRAINT PK_Locations PRIMARY KEY (LocationID))
       
CREATE TABLE Baskets
    (BasketID           VARCHAR(20) NOT NULL,
    BasketEggCount      INT NOT NULL,
    CONSTRAINT PK_Baskets PRIMARY KEY (BasketID))

CREATE TABLE Cartons(
    BasketID            VARCHAR(20) NOT NULL,
    CartonSequenceNo    INT         NOT NULL,
    CartonEggCount      TINYINT     NOT NULL DEFAULT (0),
    CONSTRAINT  PK_Cartons PRIMARY KEY (BasketID, CartonSequenceNo),
    CONSTRAINT  FK_CartonsBaskets FOREIGN KEY (BasketID) REFERENCES dbo.Baskets(BasketID),
    CONSTRAINT  CK_CartonEggCount CHECK (CartonEggCount BETWEEN 0 AND 12)
    )
   
CREATE TABLE BasketLocations(
    LocationID          VARCHAR(20) NOT NULL,
    BasketID            VARCHAR(20) NOT NULL,
    CONSTRAINT  PK_BasketLocations PRIMARY KEY (LocationID, BasketID),
    CONSTRAINT  FK_BasketLocations_Locations FOREIGN KEY (LocationID) REFERENCES Locations(LocationID),
    CONSTRAINT  FK_BasketLocations_Baskets FOREIGN KEY (BasketID) REFERENCES Baskets(BasketID)
    )
 

I frequently use a table of integers. Various people call this by different names, but they are all essentially the same, a list of whole integers. For this exercise I created it with 500 numbers in it. The SplitBaskets stored procedure will break if you create a Basket with more than 500 eggs in it, but extending the size of the table is a trivial matter if you should want to try splitting a Basket with 10,000 eggs in it.

CREATE TABLE Integers(I INT NOT NULL,
    CONSTRAINT PK_Integers PRIMARY KEY (I))

INSERT Integers(I)
SELECT  (COALESCE(MAX(I), 0) + 1)
FROM    Integers   
GO 500 
 

This is the test data I used.

INSERT Locations (LocationID, LocationDescription)
SELECT  LOC.LocationID, LOC.LocationDescription
FROM (  VALUES  (‘BARN1′, ‘Barn #1′),
                (‘BARN2′, ‘Barn #2′),
                (‘PLANT1′, ‘Processing Plant #1′)
        ) AS LOC(LocationID, LocationDescription)
WHERE NOT EXISTS(
        SELECT  *
        FROM    Locations
        WHERE   LocationID = LOC.LocationID)
       
INSERT Baskets(BasketID, BasketEggCount)
SELECT  BAS.BasketID, BAS.BasketEggCount
FROM    (VALUES (‘BASKET1′, 0),
                (‘BASKET2′, 11),
                (‘BASKET3′, 25),
                (‘BASKET4′, 144),
                (‘BASKET5′, 0)
        ) AS BAS (BasketID, BasketEggCount) 
WHERE NOT EXISTS (
        SELECT  *
        FROM    Baskets
        WHERE   BasketID = BAS.BasketID)       

INSERT BasketLocations(BasketID, LocationID)
SELECT  B.BasketID,
        L.LocationID
FROM    Baskets B
CROSS JOIN Locations L
WHERE   B.BasketID <> ‘BASKET5′
AND     L.LocationID = ‘PLANT1′ 
AND NOT EXISTS (
        SELECT  *
        FROM    BasketLocations
        WHERE   BasketID    = B.BasketID
        AND     LocationID  = L.LocationID  )
       
INSERT BasketLocations(BasketID, LocationID)
SELECT  B.BasketID,
        L.LocationID
FROM    Baskets B
CROSS JOIN Locations L
WHERE   B.BasketID = ‘BASKET5′
AND     L.LocationID = ‘BARN1′ 
AND NOT EXISTS (
        SELECT  *
        FROM    BasketLocations
        WHERE   BasketID    = B.BasketID
        AND     LocationID  = L.LocationID  )      
GO
 

This first procedure, SplitBaskets, is the heart of the challenge. This code uses the Integers table provided above to create a set of rows, each with 12 eggs, and one row with the remainder (if any). It uses the ROW_NUMBER() window function to provide a sequence number, which is part of the primary key in the Cartons table.

CREATE PROCEDURE dbo.SplitBaskets(
    @LocationID     VARCHAR(20)
    )
AS
BEGIN
– Splits all Baskets at @Location into Cartons for any Basket
– where the BasketEggCount > 0

DECLARE @CartonSize     TINYINT

SET     @CartonSize     = 12

INSERT  Cartons(BasketID, CartonSequenceNo, CartonEggCount)
SELECT  X.BasketID,
        X.SequenceNo,
        CASE    WHEN X.CumEggs <= X.BasketEggCount THEN @CartonSize
                ELSE X.Remainder END    
        AS CartonEggCount
FROM    (– produces a set of rows with CartonEggCount = @CartonSize
        – the last row gets CartonEggCount = Remainder
        – sequence numbers are generated
        – doesn’t insert rows with BasketEggCount = 0  
        SELECT  ROW_NUMBER() OVER (PARTITION BY B.BasketID ORDER BY B.BasketID)
                            AS SequenceNo,
                B.BasketEggCount,
                I.i         AS CumEggs,
                B.basketeggcount % @CartonSize
                            AS Remainder,
                B.BasketID
        FROM    Baskets         B
        JOIN    BasketLocations L
        ON      B.BasketID      = L.BasketID
        CROSS JOIN Integers     I
        WHERE   I.I <= (B.BasketEggCount + @CartonSize)
        AND     I.I % @CartonSize = 0
        AND     L.LocationID    = @LocationID
        AND     B.BasketEggCount > 0
        AND     I.I – B.BasketEggCount <> @CartonSize
        )X – X, get it? Eggs/X?
WHERE NOT EXISTS(
    SELECT  *
    FROM    Cartons
    WHERE   BasketID    = X.BasketID
    AND     SequenceNo  = X.SequenceNo) 
END
GO
 

The next procedure takes a Carton and splits it, creating a new Carton with a specified number of eggs in it. Note that the MERGE statement allowed me to create a single statement that does this.

CREATE PROCEDURE dbo.SplitCarton(
        @BasketID           VARCHAR(20),
        @CartonSequenceNo   INT,
        @EggsToMove         INT )

AS
BEGIN
– Takes a Carton identified by @BasketID and @CartonSequenceNo
– and removes @EggsToMove eggs from it.
– creates a new Carton with @EggsToMove eggs in it.

MERGE INTO Cartons C
USING ( SELECT  C.BasketID,
                (SELECT MAX(CartonSequenceNo) + 1
                    FROM Cartons
                    WHERE BasketID = C.BasketID) AS CartonSequenceNo,
                @EggsToMove AS CartonEggCount   
        FROM    Cartons     C
        WHERE   C.BasketID          = @BasketID
        AND     C.CartonSequenceNo  = @CartonSequenceNo
        AND     C.CartonEggCount    > @EggsToMove
        AND     @EggsToMove         BETWEEN 1 AND (C.CartonEggCount – 1)
        UNION ALL
        SELECT  BasketID,
                CartonSequenceNo,
                CartonEggCount – @EggsToMove AS NewCartonEggCount
        FROM    Cartons    
        WHERE   BasketID            = @BasketID
        AND     CartonSequenceNo    = @CartonSequenceNo
        AND     CartonEggCount      > @EggsToMove
        AND     @EggsToMove         BETWEEN 1 AND (CartonEggCount – 1)
    ) NEW
ON  C.BasketID          = NEW.BasketID
AND C.CartonSequenceNo  = NEW.CartonSequenceNo
WHEN MATCHED THEN UPDATE
    SET C.CartonEggCount = NEW.CartonEggCount
WHEN NOT MATCHED BY TARGET THEN INSERT –(BasketID, CartonSequenceNo, CartonEggCount)
    VALUES(NEW.BasketID, NEW.CartonSequenceNo, NEW.CartonEggCount)
    ;   
END
GO
 

The last procedure takes two cartons and makes them into one. The implied rule is that the sum of the egg counts in the two cartons must be less than or equal to 12. Again, the MERGE statement allowed me to do this in a single statement.

CREATE PROCEDURE dbo.ConsolidateCartons(
        @BasketIDFrom           VARCHAR(20),
        @CartonSequenceNoFrom   INT,
        @BasketIDTo             VARCHAR(20),
        @CartonSequenceNoTo     INT)
AS     
BEGIN
– Takes all the eggs from Carton identified by @BasketIDFrom and @CartonSequenceNoFrom
– and adds them to the count for Carton identified by @BasketIDTo and @CartonSequenceNoTo
– then deletes the empty Carton

– If this were production code, there would be parameter checking, TRY CATCH blocks, transactions, etc.

DECLARE @CartonSize             INT
SET     @CartonSize             = 12
   
MERGE INTO Cartons C
USING(  – builds a set of rows that contain what the database
        – should look like after the move
        – both parts of the query check that the sum of the two
        – egg counts do not exceed the maximum carton size – if they
        – do, this is an invalid consolidation and no rows will
        – be updated or deleted
        SELECT  CTo.BasketID,
                CTo.CartonSequenceNo,
                CTo.CartonEggCount + CFrom.CartonEggCount AS CartonEggCount
        FROM    Cartons                 CFrom
        JOIN    Cartons                 CTo
        ON      CFrom.BasketID          = @BasketIDFrom
        AND     CFrom.CartonSequenceNo  = @CartonSequenceNoFrom
        AND     CTo.BasketID            = @BasketIDTo
        AND     CTo.CartonSequenceNo    = @CartonSequenceNoTo
        WHERE   CTo.CartonEggCount + CFrom.CartonEggCount <= @CartonSize
        UNION ALL
        – because the rule is that all eggs must be moved from the FROM
        – Carton, this set contains 0 as the CartonEggCount
        SELECT  CFrom.BasketID,
                CFrom.CartonSequenceNo,
                0 AS CartonEggCount
        FROM    Cartons                 CFrom
        JOIN    Cartons                 CTo
        ON      CFrom.BasketID          = @BasketIDFrom
        AND     CFrom.CartonSequenceNo  = @CartonSequenceNoFrom    
        AND     CTo.CartonSequenceNo    = @CartonSequenceNoTo
        WHERE   CTo.CartonEggCount + CFrom.CartonEggCount <= @CartonSize
        ) NEW
ON      C.BasketID          = NEW.BasketID
AND     C.CartonSequenceNo  = NEW.CartonSequenceNo
– if NEW.CartonEggCount > 0, this must be the Carton that the eggs are moving to          
WHEN MATCHED AND NEW.CartonEggCount > 0 THEN UPDATE
    SET     C.CartonEggCount = NEW.CartonEggCount
– if NEW.CartonEggCount = 0, this must be the Carton that is being removed 
WHEN MATCHED AND NEW.CartonEggCount = 0 THEN DELETE;

END
GO
 

Posted in T-SQL | Tagged , , | 1 Comment

Guerilla Database Support

One of the jobs I had in the middle-recent past involved a fair amount of time supporting IBM DB2.

This was also a job where the database design and code were horrifying, of the omigod I can’t believe that none of the clients are suing us level of horror. Adding (yet another) index that dropped the time it took to delete a bill from 30 to 20 seconds was considered a major coup.

Application speed is a major, major, MAJOR problem at that place. I would guess that 2 FTEs are spent just dealing with speed and performance issues.

One of those FTEs was my colleague, who I have stayed in touch with and who has become a good friend. She has been under a lot of pressure lately as the new version of the Monumental Shitware has entered beta and has been installed on some beta clients, with the attendant (predictable) savage performance problems.

This morning she is dealing with a customer who hasn’t returned voicemail and when called on it (after he complained to higher ups) in an email that was cc’d to his boss claimed that he never got any voicemail.

I suggested that when she does connect, she turn his bufferpools way down, 1000 pages each or something. Then leave him like that for a week…

bet the fucker will answer his voicemail then!

Posted in IT Follies, Performance | Tagged , | Leave a comment

So True, So True

If Comedy is born of Pain, and Pain is born of Truth, then this post by the enigmatic Phil Factor should have you laughing through your tears. Figuratively speaking, of course. There’s no crying in software development.

If you’re one of those people who doesn’t like to click through without a taste of what is being clicked through to, Phil (not his real name, I hope) has broken down the phases of a software development project into the following:

  1. The Euphoric Phase
  2. The Chill Wind
  3. The Slough of Despond
  4. The Search for the Guilty
  5. The Depressive Position
  6. The Death March
  7. Recriminations
Posted in IT Follies | Tagged , | Leave a comment

Code Formatting In 4/4 Time

I like to listen to blues on internet radio while I work.

I also format my code as I go. I use a lot of whitespace- I generally find that I save time overall by always using clearly formatted code. It is much easier to read, and to see the structure of the query.

I realized this morning that if I like the song that is playing, I hammer the TAB and arrow keys in time, so it’s arrow-arrow-tab-tab, tab-tab-arrow-arrow, tab-tab-tab-tab (etc).

I also have to check this blog more often. Graham left a comment a month ago that I just approved. Sorry Graham!

Posted in Site Notes, T-SQL | Leave a comment

ISNULL vs COALESCE

One of the ways you can tell code that I have written from code that most other developers have written is that I use the ANSI92 compliant function COALESCE() instead of the more common proprietary ISNULL().

I do this for several reasons. First, COALESCE is not proprietary, and so it is highly unlikely that it will ever be dropped. Second, COALESCE is more flexible. In case you haven’t encountered these two functions, they have similar functionality.

ISNULL(arg1, arg2) returns the second argument if the first is NULL. If your database is rigidly designed and normalized to 3NF, you probably have little use for ISNULL and COALESCE. If, on the other hand, your database is like 99% of databases and it contains a mess of columns that allow NULL, you probably make extensive use of ISNULL.

COALESCE() takes a list of arguments of arbitrary length and returns the first non-NULL value it finds. This means that instead of long chains of ISNULL(ISNULL(ISNULL(col_a, col_b),col_c),col_d), you can write COALESCE (col_a, col_b, col_c, col_d). This is much easier to parse and leads to more logical, maintainable code.

This morning I encountered another reason to use COALESCE.

In my experience, genuine bugs (as opposed to behavior somebody doesn’t like) are rare in SQL Server. I don’t know whether this is a bug, or merely an allowed deficiency but:

SELECT ISNULL(‘abcd’ + null,‘1234567890′)

I would expect this to return ‘1234567890′. However, it does not. On SQL Server 2005, this returns ‘12345′.

On the other hand,

SELECT COALESCE(‘abcd’ + null,‘1234567890′)

returns, as expected, ‘1234567890′.

This is a subtle deficiency, one that could be difficult to track down, and is yet one more reason to use COALESCE.

Posted in ANSI Standard, T-SQL | 3 Comments

Uh Oh

You know you’re in trouble when the stored procedure you wrote has been modifed by another developer, and you find that what had been an elegant INSERT has been tarted up with something like this at the end:

WHERE NOT EXISTS(
    SELECT  *
    FROM    History
    WHERE   UserID  = STACKS.UserID
    AND     LevelID     =   STACKS.LevelID
    AND     StackNumber = STACKS.StackNumber
)
) A  INNER JOIN RewardLevel RL
    ON A.LevelID = RL.LevelID
WHERE
(
    EXISTS(SELECT 1 FROM CurrentLevel GS
            INNER JOIN RewardLevel GSR
                ON GS.LevelID = GSR.LevelID
            WHERE GS.UserID = A.UserID
            AND A.LevelNumber <= GSR.LevelNumber AND A.StackNumber < GS.CurrentStackNumber)
    OR
    EXISTS(SELECT 1 FROM CurrentLevel GS
            INNER JOIN RewardLevel GSR
                ON GS.LevelID = GSR.LevelID
            WHERE GS.UserID = A.UserID
            AND A.LevelNumber < GSR.LevelNumber)
)

It’s actually much worse. The original insert was a fairly complex one of about 90 lines (that’s deceptive, because I format code long rather than wide – other developers might format this to 50 lines) but due to other developers taking the short way round and doing the quickest patch possible, it has ballooned to a nested set of statements over 500 lines in length, of which this particular add-on is only a part.

Somehow, the structure of the original SELECT has been altered such that date calculations that depend on the ordinal position of the rows no longer work.

This is how code begins to approach unmaintainability.

Posted in SQL | Leave a comment

SQL Blog

This is intended to be a place to put SQL and SQL Server related material. A blog like this should be simple and needs a different set of categories and tools than a regular blog.

I don’t know how much posting will be done here. Guess we’ll see.

Posted in Site Notes | Leave a comment