Sunday, February 24, 2013

SQL Server Data Compression – Beauty or Beast?

Data compression is an Enterprise (and Developer) Edition feature available from SQL Server 2008 onward. Its main benefit is that it reduces disk space and memory requirements; on the other hand, it incurs some CPU cost for compressing and decompressing data pages repeatedly. It is therefore recommended to have some CPU headroom on any server where it is implemented.
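Since this is an edition-specific feature, it is worth confirming the edition before planning anything; SERVERPROPERTY is a built-in function that reports it:

-- Data compression requires Enterprise or Developer Edition
SELECT SERVERPROPERTY('Edition') AS edition ,
       SERVERPROPERTY('EngineEdition') AS engine_edition;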

We can easily estimate the benefits in disk space and memory, but how do we measure the CPU cost associated with data compression? To date there is no specific indicator that can measure the associated CPU headroom or cost.
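That said, the disk-space benefit, at least, can be estimated up front with the built-in procedure sp_estimate_data_compression_savings; a minimal sketch (the table name is illustrative):

-- Estimate PAGE-compression savings without actually compressing anything
EXEC sp_estimate_data_compression_savings
    @schema_name = 'dbo' ,
    @object_name = 'TableA' ,       -- illustrative table name
    @index_id = NULL ,              -- NULL = all indexes
    @partition_number = NULL ,      -- NULL = all partitions
    @data_compression = 'PAGE';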

What can we compress?
Unfortunately, we cannot compress an entire database in one shot; there is no such option. We compress data table by table. We have the following options:
1.       A HEAP or a clustered index
2.       One or all non-clustered indexes
3.       One or more table partitions, each with a different compression option
4.       An indexed view

We have three options for data compression:
1.       ROW: compresses individual rows by storing fixed-length data types in a variable-length format
2.       PAGE: applies ROW compression plus prefix and dictionary compression to all leaf-level pages
3.       NONE: the default; no compression
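Before changing anything, the current compression setting can be checked per partition via the data_compression_desc column of sys.partitions; a small helper query (the table name below is illustrative):

SELECT  OBJECT_NAME(p.[object_id]) AS table_name ,
        i.name AS index_name ,
        p.partition_number ,
        p.data_compression_desc
FROM    sys.partitions AS p
        INNER JOIN sys.indexes AS i ON i.[object_id] = p.[object_id]
                                       AND i.index_id = p.index_id
WHERE   p.[object_id] = OBJECT_ID('dbo.TableA');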

Basic syntax of data compression:
The following command changes the compression setting of a table; the heap or clustered index will be rebuilt in the process. Please note that if the table is a HEAP (has no clustered index), all non-clustered indexes will also be rebuilt.
ALTER TABLE TableA REBUILD WITH (DATA_COMPRESSION = PAGE);

(a)    To rebuild an existing Clustered Index:
CREATE CLUSTERED INDEX [PK_TableA] ON [dbo].[TableA]
([xID] ASC)
WITH (DATA_COMPRESSION = PAGE, DROP_EXISTING = ON) 

(b)   To rebuild an existing non-clustered index:
CREATE NONCLUSTERED INDEX [IX_TableA_Col1] ON [dbo].[TableA]
([Col1] ASC)
WITH (DATA_COMPRESSION = PAGE, DROP_EXISTING = ON) 

(c)    To rebuild all indexes:
ALTER INDEX ALL ON TableA
       REBUILD WITH ( DATA_COMPRESSION = PAGE)

(d)   To rebuild partition number 1 (one):
ALTER TABLE TableA
REBUILD PARTITION = 1 WITH (DATA_COMPRESSION = PAGE)

(e)   To rebuild all partitions in the table, applying PAGE compression to partition 1 only:
ALTER TABLE TableA
REBUILD PARTITION = ALL
WITH (DATA_COMPRESSION = PAGE ON PARTITIONS(1) ) ;
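(f)    To compress an indexed view, rebuild its clustered index. This is a sketch only; the view name [vwTableASummary] and index name [IX_vwTableASummary] are illustrative:
ALTER INDEX [IX_vwTableASummary] ON [dbo].[vwTableASummary]
       REBUILD WITH ( DATA_COMPRESSION = PAGE )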

Our sample table and testing approach:
We will be creating a sample table and populating it with 1,000,000 rows. Then we will:

1.       Review data compression on a HEAP with the default fill factor
2.       Compare the physical size versus the cached size
3.       Compare IO and CPU

DMV and script for comparing compression and data buffer:

1.       We will use the following script to get buffer and disk pages:

/*************************************************************
** Script to check buffer and disk pages
*************************************************************/
-- dbcc dropcleanbuffers

SELECT  DB_NAME(bd.database_id) AS database_name ,
        obj.[name] AS 'table_name' ,
        obj.index_id ,
        si.name index_name ,
        obj.type_desc ,
        obj.total_pages AS total_pages_in_disk ,
        obj.size_in_disk_mb ,
        CONVERT(NUMERIC(10), ( COUNT(*) )) AS total_cached_pages ,
        CONVERT(NUMERIC(10, 2), ( COUNT(*) * 8 ) / 1024.0) AS 'cached_size_mb'
FROM    sys.dm_os_buffer_descriptors AS bd
        INNER JOIN ( SELECT OBJECT_NAME(object_id) AS name ,
                            type_desc ,
                            index_id ,
                            allocation_unit_id ,
                            total_pages ,
                            CONVERT(NUMERIC(10, 2), ( total_pages * 8 ) / 1024.0) AS 'size_in_disk_mb'
                     FROM   sys.allocation_units AS au
                            INNER JOIN sys.partitions AS p ON au.container_id = p.hobt_id
                                                              AND ( au.type = 1
                                                                    OR au.type = 3 )
                     UNION ALL
                     SELECT OBJECT_NAME(object_id) AS name ,
                            type_desc ,
                            index_id ,
                            allocation_unit_id ,
                            total_pages ,
                            CONVERT(NUMERIC(10, 2), ( total_pages * 8 ) / 1024.0) AS 'size_in_disk_mb'
                     FROM   sys.allocation_units AS au
                            INNER JOIN sys.partitions AS p ON au.container_id = p.hobt_id
                                                              AND au.type = 2
                   ) AS obj ON bd.allocation_unit_id = obj.allocation_unit_id
        JOIN sys.indexes si ON si.index_id = obj.index_id
                               AND si.[object_id] = OBJECT_ID(obj.name)
WHERE   bd.database_id = DB_ID()
        AND obj.[name] = 'tblLarge'
GROUP BY obj.name ,
        bd.database_id ,
        obj.index_id ,
        obj.type_desc ,
        si.name ,
        size_in_disk_mb ,
        obj.total_pages

Let’s get started:

1.       Create a table named “tblLarge” with the default fill factor (which is 0, equivalent to 100) and the default data compression (which is NONE):

SET NOCOUNT ON
IF OBJECT_ID('tblLarge') IS NOT NULL
    DROP TABLE tblLarge
GO

CREATE TABLE tblLarge
    (
      xID INT IDENTITY(1, 1) ,
      sName1 VARCHAR(100) ,
      sName2 VARCHAR(1000) ,
      sName3 VARCHAR(400) ,
      sIdentifier CHAR(100) ,
      dDOB DATETIME ,
      nWage NUMERIC(20, 2) ,
      sLicense VARCHAR(25)
     )
GO

2.       Populate the table with 1,000,000 rows:

INSERT  INTO tblLarge
        ( sName1 ,
          sName2 ,
          sName3 ,
          sIdentifier ,
          dDOB ,
          nWage ,
          sLicense
        )
VALUES  ( LEFT(CAST(NEWID() AS VARCHAR(36)), RAND()*50) ,          -- sName1
          LEFT(CAST(NEWID() AS VARCHAR(36)), RAND()*60) ,          -- sName2
          LEFT(CAST(NEWID() AS VARCHAR(36)), RAND()*70) ,          -- sName3
          LEFT(CAST(NEWID() AS VARCHAR(36)), 2) ,                  -- sIdentifier    
          DATEADD(dd, -RAND() * 20000, GETDATE()) ,                -- dDOB
          ( RAND() * 1000 ) ,                                      -- nWage
          SUBSTRING(CAST(NEWID() AS VARCHAR(36)), 6, 7)            -- sLicense
         )

GO 1000000
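As a side note, GO 1000000 simply re-executes the preceding batch one million times, which is easy to read but slow. A faster, set-based sketch (an assumption on my part, not the original approach) is shown below; it uses CHECKSUM(NEWID()) for per-row randomness because RAND() is evaluated only once per statement:

INSERT  INTO tblLarge
        ( sName1, sName2, sName3, sIdentifier, dDOB, nWage, sLicense )
        SELECT TOP ( 1000000 )
                LEFT(CAST(NEWID() AS VARCHAR(36)), ABS(CHECKSUM(NEWID())) % 36) ,  -- sName1
                LEFT(CAST(NEWID() AS VARCHAR(36)), ABS(CHECKSUM(NEWID())) % 36) ,  -- sName2
                LEFT(CAST(NEWID() AS VARCHAR(36)), ABS(CHECKSUM(NEWID())) % 36) ,  -- sName3
                LEFT(CAST(NEWID() AS VARCHAR(36)), 2) ,                            -- sIdentifier
                DATEADD(dd, -( ABS(CHECKSUM(NEWID())) % 20000 ), GETDATE()) ,      -- dDOB
                ( ABS(CHECKSUM(NEWID())) % 100000 ) / 100.0 ,                      -- nWage
                SUBSTRING(CAST(NEWID() AS VARCHAR(36)), 6, 7)                      -- sLicense
        FROM    sys.all_columns AS a
                CROSS JOIN sys.all_columns AS b;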

3.       We need to clear the data buffer cache with the following DBCC command:
DBCC DROPCLEANBUFFERS
4.       We will also capture the STATISTICS IO and STATISTICS TIME output
5.       We will use the following query to perform the test:
SELECT * FROM tblLarge WHERE sIdentifier='AB'
6.       Our preferred data compression setting will be PAGE

Reviewing performance on the HEAP with 0 Fill Factor without Data Compression:

We will run the following query at least ten times and average the ten outputs.

SET STATISTICS IO ON
SET STATISTICS TIME ON
DBCC DROPCLEANBUFFERS
SELECT * FROM tblLarge WHERE sIdentifier='AB'

After running the above query ten times on my system, the STATISTICS TIME outputs were as follows:

-- Output from SET STATISTICS TIME ON
 SQL Server Execution Times:
   CPU time = 358 ms,  elapsed time = 3190 ms.
   CPU time = 406 ms,  elapsed time = 2926 ms.
   CPU time = 421 ms,  elapsed time = 3074 ms.
   CPU time = 297 ms,  elapsed time = 3291 ms.
   CPU time = 546 ms,  elapsed time = 3218 ms.
   CPU time = 359 ms,  elapsed time = 3169 ms.
   CPU time = 437 ms,  elapsed time = 3057 ms.
   CPU time = 437 ms,  elapsed time = 3007 ms.
   CPU time = 312 ms,  elapsed time = 3011 ms.
   CPU time = 327 ms,  elapsed time = 2931 ms.

Average CPU time = 390 ms
Average elapsed time = 3087.4 ms.

The STATISTICS IO output was as follows:
Table 'tblLarge'. Scan count 3, logical reads 28832, physical reads 0, read-ahead reads 28810, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

Buffer review from the given DMV query:

Reviewing performance on the HEAP with 0 Fill Factor with Data Compression:

Now compress the HEAP “tblLarge” table using the PAGE-level setting:

DBCC DROPCLEANBUFFERS
ALTER TABLE tblLarge REBUILD WITH (DATA_COMPRESSION = PAGE);

The STATISTICS TIME output for the rebuild itself is:

SQL Server Execution Times:
   CPU time = 6302 ms,  elapsed time = 18245 ms.

Once the data compression is done, we run the same script we ran before:
SET STATISTICS IO ON
SET STATISTICS TIME ON
DBCC DROPCLEANBUFFERS
SELECT * FROM tblLarge WHERE sIdentifier='AB' 

From the STATISTICS TIME output we got the following:

-- Output from SET STATISTICS TIME ON
 SQL Server Execution Times:
   CPU time = 342 ms,  elapsed time = 643 ms.
   CPU time = 313 ms,  elapsed time = 470 ms.
   CPU time = 281 ms,  elapsed time = 630 ms.
   CPU time = 375 ms,  elapsed time = 478 ms.
   CPU time = 359 ms,  elapsed time = 666 ms.
   CPU time = 327 ms,  elapsed time = 475 ms.
   CPU time = 375 ms,  elapsed time = 643 ms.
   CPU time = 312 ms,  elapsed time = 470 ms.
   CPU time = 297 ms,  elapsed time = 544 ms.
   CPU time = 328 ms,  elapsed time = 471 ms.

Average CPU time = 330.9 ms
Average elapsed time = 549 ms.

From the STATISTICS IO output we got the following:
Table 'tblLarge'. Scan count 3, logical reads 13981, physical reads 0, read-ahead reads 1640, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.


Buffer review from the given DMV query:

Comparison Without vs. With:

STATISTICS IO            Without Compression    With Compression    Difference
Scan count               3                      3                   0
Logical reads            28832                  13981               14851
Physical reads           0                      0                   0
Read-ahead reads         28810                  1640                27170

STATISTICS TIME          Without Compression    With Compression    Difference
Average CPU time         390 ms                 330.9 ms            59.1 ms
Average elapsed time     3087.4 ms              549 ms              2538.4 ms

Data Pages               Without Compression    With Compression    Difference
# of pages on disk       28841 (225.32 MB)      14001 (109.38 MB)   115.94 MB
# of pages in memory     28833 (225.26 MB)      13982 (109.23 MB)   116.03 MB


Comparison Summary:
From the above comparison it is clear that the data retrieval cost has improved dramatically, and the average CPU time has also been reduced. In our example there is a one-time CPU cost for compressing the data (about 6.3 seconds, as shown above). There is also some ongoing CPU cost for DML operations once data compression is enabled on a table, partition, or index, which, I suspect, is much less than the overall benefit.

Benefits of data compression:
1.       Reduces disk space overhead
2.       Reduces memory requirements
3.       Fewer reads – improves performance
4.       Less worker time – increases concurrency
5.       Reduces lock contention

Disadvantages:
Other than some CPU overhead, there are no significant side effects mentioned in Microsoft Books Online (BOL) or MSDN articles. However, there are guidelines and further information worth reading for a better understanding before implementing data compression:

Data Compression: Strategy, Capacity Planning and Best Practices
(Microsoft technical article)

“Microsoft SQL Server 2008 Internals” - Kalen Delaney
(Microsoft Press book)
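On a practical note, if the ongoing CPU overhead proves too high for a particular workload, compression can be backed out the same way it was applied, for example:

ALTER TABLE tblLarge REBUILD WITH (DATA_COMPRESSION = NONE);
ALTER INDEX ALL ON tblLarge REBUILD WITH (DATA_COMPRESSION = NONE);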

Conclusion:
Data compression is an Enterprise Edition feature that can improve performance in many ways. We should take advantage of it wherever the CPU headroom allows.

So, beauty or beast? The answer is “The best way to predict your future is to create it” - Abraham Lincoln
