PRPASS Home
  PASS Summit Unite 2009
The Puerto Rico's Professional Association of SQL Server 
Skip Navigation LinksHome : SQL World
  
Skip Navigation Links.     
Membership Info

Latest: omar_a
New Today: 0
New Yesterday: 2
Overall: 382
People Online

Visitors: 4
Forum Last Activity

alan_koo posted at 1/7/2010 4:40:00 PM
 
SQL World

This section brings to you the best live content of the SQL around the World. Select the category and source from the topics that you are looking for and enjoy it.

 Category:     Source:
From MSQL Development Customer Advisory Team
Enabling Partition Level Locking in SQL Server 2008
Published: Wed, 03 Mar 2010 23:54:00 GMT


Author: Thomas Kejser
Reviewers: Juergen Thomas, Sanjay Mishra, Stuart Ozer, Lubor Kollar, Kevin Cox, Kun Cheng

In this blog, we will provide additional details on a well-hidden feature in SQL Server 2008 – partition-level locking. But before we illustrate the feature, here’s a bit of background about lock escalation and lock granularities.

Background

SQL Server has a lock manager that tracks which database objects are locked. Typically, a query will take locks at the row or page level, but after the numbers of locks become too large to manage, SQL Server will try to escalate the locks to the table level to preserve memory. Lock escalation can sometimes cause blocking or deadlock situations – so it is feature that DBAs should be aware of. You can trace lock escalation in SQL Server profiler with the Lock:Escalation event.

In SQL Server 2005, you can disable lock escalation completely, at the instance level, using trace flag 1211 – this is described in KB 323630. However, in SQL Server 2008, you get a new option: disabling lock escalation at the table level using the command:

ALTER TABLE SET (LOCK_ECALATION = AUTO | TABLE | DISABLE)

DISABLE mimics the behavior of Trace flag 1211. The TABLE setting is the default, to maintain backward compatibility with SQL Server 2005. However, consulting SQL Server Books Online, we see that there is an interesting behavior for the AUTO setting, when the table is partitioned:

“If the table is partitioned, lock escalation will be allowed to partition. After the lock is escalated to the partition level, the lock will not be escalated later to TABLE granularity.”

Lock Escalation on a Partitioned Table with the AUTO Setting

Let’s put the AUTO setting to the test with this partitioned table:

CREATE PARTITION FUNCTION pf_year(INT)
AS RANGE RIGHT FOR VALUES (2007,2008,2009,2010)

CREATE PARTITION SCHEME ps_year
AS PARTITION pf_year ALL TO ([PRIMARY])

CREATE TABLE Data (    
 
[year] INT NOT NULL
, UpdateValue FLOAT NOT NULL
, PayLoad CHAR(200) NOT NULL
) ON ps_year([year])

-- Cluster the table (also test on heap)
CREATE CLUSTERED INDEX CIX ON Data(Year) ON ps_year(Year)

The attached file has a script that will populate this table with data. You need around 10,000 rows in each partition to make sure lock escalation will happen.

We can now run a statement like this to test the escalation behavior:

ALTER TABLE Data SET (LOCK_ESCALATION = AUTO) /* Vary this */
BEGIN TRAN
  UPDATE Data SET Payload = RAND()
  WHERE Year = 2008 /* Touch only one partition */ 
  /* Query locks here */
COMMIT TRAN

On both cluster indexes and heaps we get the following results.

Lock Level TAB HOBT PAGE RID/KEY Escalation Event?
TABLE / OBJECT (default) X None IU None Yes
DISABLE None None IX X No
AUTO IX X IU None Yes

The HOBT granularity is a “Heap or B-tree”. This means that in the AUTO case, only the partition touched is exclusively locked. This means that other partitions can be read or updated concurrently, while the update on the year 2008 partition is running.

Impact of LOCK_ESCALATION Setting on the Locking Behavior in SERIALIZABLE Isolation Level

So far, we have seen how the AUTO setting will give you partition-level locking when lock escalation kicks in. This is particularly interesting in data warehouse scenario.

However, this is not the end of the story. There is an interesting side effect of the AUTO setting. Try turning the partitioned table into a heap:

DROP INDEX Data.CIX

Now, execute this statement on the heap:

ALTER TABLE Data SET (LOCK_ESCALATION = TABLE)  /* default */
SET TRANSACTION ISOLATION LEVEL SERIALIZABLE

BEGIN TRAN
 
SELECT COUNT(*) FROM Data 
WHERE Year = 2008 /* Touch one partition only */ 
   /* Query locks here */

COMMIT TRAN

Output:

image

Notice the S-lock on the table (OBJECT). Check your profiler for Lock:Escalation event– no lock escalation happened – yet there is still a lock on the table level? What is going on here?

Notice we are asking for the SERIALIZABLE isolation level. Unlike a cluster index – where a range lock can be taken to ensure SERIALIZABLE reads – heaps have no order of their rows.  Hence, SQL Server realizes that a table-level lock is the only strategy feasible to enforce the isolation level. This is not lock escalation; it is a behavior of the optimizer under the given isolation level.

Now, try changing the LOCK_ESCALATION setting to AUTO or DISABLE. DISABLE will still take the table (TAB) level lock. But with the AUTO setting, a partition-level (HOBT) lock is taken. So even though there is this no lock escalation in this case, the LOCK_ESCALATION setting of the table still affects the locking behavior when the isolation level is SERIALIZABLE.

Summary

If you have a partitioned table with concurrent access to different partitions, you may want to experiment with the LOCK_ESCALATION = AUTO setting in SQL Server 2008. Because this setting enables SQL Server to use a finer grain of lock escalation, it may improve concurrency in your environment.




SQL Server 2008 R2 UNICODE Compression – what happens in the background?
Published: Wed, 03 Mar 2010 00:06:00 GMT


SQL Server 2008 R2 added a much requested feature: Unicode compression. It addresses the need to compress Unicode strings. It is implemented as part of ROW compression, which was added in SQL 2008. That is; if ROW compression (on SQL 2008 R2) is enabled on a table that contains NCHAR / NVARCHAR datatypes, then the Unicode compression algorithm will kick in for each individual Unicode column.

Note: since PAGE compression is a ‘superset’, which also includes ROW compression, it will also enable Unicode compression.

There is another Blog that talks about the basics of Unicode Compression including specific compression numbers on various Locales. It can be found here: http://blogs.msdn.com/sqlserverstorageengine/archive/2009/08/17/unicode-compression-in-sql-server-2008r2.aspx

However, I recently ran an ISV test on Unicode compression and it didn’t produce the expected results, so I decided to look at the actual compression implementation in more detail.

The current implementation of ROW compression is very simple to understand; all columns are implicitly converted to variable length columns under the covers. I assumed that Unicode compression would be equally intuitive. Today all SQL Server Unicode data is stored in UCS-2 format. This means that a single byte ASCII character such as ‘a’ would be stored as two bytes (0x0061) in NCHAR vs. a single byte in CHAR (0x61). With Unicode compression I assumed that these single byte ASCII characters would be compressed as single byte, and ‘traditional’ double-byte characters (such as Russian; where the leading byte is not 00) would continue to be stored as two bytes (as they are today).

Well, this assumption is not really true… to understand why, one needs to understand that the actual algorithm (SCSU) employed by SQL Server 2008 R2.

The algorithm used to implement Unicode compression (SCSU) is described in the following spec: http://unicode.org/reports/tr6/.

Looking at the spec, one can see some things that completely make sense, but are not entirely intuitive. For example (from the spec):

9.1 German

German can be written using only Basic Latin and the Latin-1 supplement, so all characters above 0x0080 use the default position of dynamically positioned window 0.

Sample text (9 characters)

Öl fließt

Unicode code points (9 code points):  

00D6 006C 0020 0066 006C 0069 0065 00DF 0074

NOTE: In SQL Server 2008 and prior versions this would translate into 18 Bytes in a NCHAR column (using UCS-2 encoding).

Unicode Compressed (9 bytes):

D6 6C 20 66 6C 69 65 DF 74

9.2 Russian

Sample text (6 characters)

Москва

Unicode code points (6 code points):

041C 043E 0441 043A 0432 0430

NOTE: In SQL Server 2008 and prior versions this would translate into 12 Bytes.

Unicode Compressed (7 bytes):

12 9C BE C1 BA B2 B0

9.3 Japanese

Japanese text almost always profits from the multiple predefined windows in SCSU. For more details on this sample see below.

Sample text (116 characters)

 リンゴ可愛いや可愛いやリンゴ。半世紀も前に流行した「リンゴの歌」がぴったりするかもしれない。米アップルコンピュータ社のパソコン「マック(マッキントッシュ)」を、こよなく愛する人たちのことだ。「アップル信者」なんて言い方まである。

Unicode code points (116 code points)

3000 266A 30EA 30F3 30B4 53EF 611B
3044 3084 53EF 611B 3044 3084 30EA 30F3
30B4 3002 534A 4E16 7D00 3082 524D 306B
6D41 884C 3057 305F 300C 30EA 30F3 30B4
306E 6B4C 300D 304C 3074 3063 305F 308A
3059 308B 304B 3082 3057 308C 306A 3044
3002 7C73 30A2 30C3 30D7 30EB 30B3 30F3
30D4 30E5 30FC 30BF 793E 306E 30D1 30BD
30B3 30F3 300C 30DE 30C3 30AF FF08 30DE
30C3 30AD 30F3 30C8 30C3 30B7 30E5 FF09
300D 3092 3001 3053 3088 306A 304F 611B
3059 308B 4EBA 305F 3061 306E 3053 3068
3060 3002 300C 30A2 30C3 30D7 30EB 4FE1
8005 300D 306A 3093 3066 8A00 3044 65B9
307E 3067 3042 308B 3002

Unicode Compressed (178 bytes)

08 00 1B 4C EA 16 CA D3 94 0F 53 EF 61 1B E5 84
C4 0F 53 EF 61 1B E5 84 C4 16 CA D3 94 08 02 0F
53 4A 4E 16 7D 00 30 82 52 4D 30 6B 6D 41 88 4C
E5 97 9F 08 0C 16 CA D3 94 15 AE 0E 6B 4C 08 0D
8C B4 A3 9F CA 99 CB 8B C2 97 CC AA 84 08 02 0E
7C 73 E2 16 A3 B7 CB 93 D3 B4 C5 DC 9F 0E 79 3E
06 AE B1 9D 93 D3 08 0C BE A3 8F 08 88 BE A3 8D
D3 A8 A3 97 C5 17 89 08 0D 15 D2 08 01 93 C8 AA
8F 0E 61 1B 99 CB 0E 4E BA 9F A1 AE 93 A8 A0 08
02 08 0C E2 16 A3 B7 CB 0F 4F E1 80 05 EC 60 8D
EA 06 D3 E6 0F 8A 00 30 44 65 B9 E4 FE E7 C2 06
CB 82

Looking at the examples above, one can see that the ‘00’ leading bytes were stripped out of the compressed German data. This will be true of all commonly used ascii characters. So far, so good; my assumption holds true.

Looking at the Russian example, we see something slightly different:

The ‘04’ leading bytes were completely stripped out and replaced by a single additional leading ‘tag’: ‘12’ to identify the current code page. All other trailing characters are offset by  0x80. To follow along; in the example above the leading value ‘041C‘ becomes ‘12 9C ‘.  12 replaces the 0x0400 ‘code page identifier’ and ‘1c’ becomes ‘0x1c + 0x80 = 0x9c’, since the following characters are all of the same code page they can be compressed and don’t need any additional leading characters. Here we can see very clearly that compression can work efficiently for even traditional double-byte characters.

However, the Russian example exposes another interesting detail regarding the algorithm; For the characters in the 0x400 code page that are offset by 0x80, what happens when we get to the character that causes it to ‘go over’ 0xFF? Well, the algorithm accounts for this by ‘splitting’ the code page into two (0x0401-0x047F and 0x0480-0x04FF); therefore some characters will get ‘tagged’ with 12 and the rest will be tagged by another unique number. For example: the Cyrillic ‘small letter Ghe with upturn ґ ‘  =  0x0491, this will be represented as:  ‘91’ with a different leading tag identifier.

 

The Japanese example is much more complex, because the datapoints come from many different Japanese ‘code pages’, analogous to the Russian example above. Therefore, there is no consistent leading character as in the Russian and German examples above. So, the algorithm was required to use ‘tags’ as described above, but also ‘special-case delimiters’ to signal the end or start of a codepage. That is; a tag, an ‘I am ending data from this tag’ delimiter, then a new tag, and then the start of the newly compressed trailing characters. One can conclude that certain very large ‘character sets’ like Japanese are difficult to compress using this algorithm.

One can also conclude that a single given Unicode string that contains characters from many different code pages (ie. Russian, Thai, Japanese, Chinese, etc…) could easily end up being larger than the original after the algorithm is done making the required substitutions and additions. 

The SQL Server implementation of the algorithm checks to see whether the compressed value is larger than the original and if so, stores the ‘uncompressed value’ instead.

This again, brings up an interesting question; how does one tell a Unicode compressed string from an uncompressed Unicode string, given that both are now possible in the same row (in different columns)?

The SQL Server implementation makes sure that every compressed Unicode string has an uneven number of bytes. And every uncompressed Unicode string has an even number of bytes (as was the case before compression). That means that additional special case characters are used to maintain this ‘uneven’ rule.

There are other SQL Server implementation details such as; there must be at least three consecutive characters from the same code page that will trigger the compression ‘tagging’.

In summary, one can see that there is much more to Unicode compression than just saying: Unicode ‘traditional ascii single-byte’ characters will be converted to storing only one character. In fact, many multi-byte code pages such as Thai, compress down as far as 40%.

For more details please see the algorithm specs at the website posted above.

I have presented a very simplistic view what is happening, but I believe it is enough to gain a rudimentary understanding of how SQL Server 2008 R2 Unicode compression is implemented.

Oh, and if you are wondering why the ISV workload results I was looking at didn’t produce the expected result… This was related to the fact that most of the Unicode data was stored in NVARCHAR(max) and or NTEXT dataypes. These column types do not participate in Unicode compression at this time. This may change in a future version of SQL Server.

 




Performance tips of using XML data in SQL Server
Published: Mon, 01 Mar 2010 23:52:00 GMT


Author: Kun Cheng
Reviewers: Peter Carlin, Mike Ruthruff, Thomas Kejser, Nicholas Dritsas

XML data type is usually used to store semi-structured data with great flexibility and query capabilities. It’s a good choice for developing platform agnostic applications to separate storage of complex data from processing at application layer, which understands the complex data relationship. To achieve optimal performance of querying XML data in SQL Server, extra steps need to be taken to ensure query access pattern matches how XML data is stored and indexed.

Here are 3 quick tips from query performance perspective based on real customer experience.

1.       Promoting key XML attribute/property to relational column. Promoting to relational column will get you to on-par performance with relational queries, the best you can get out of modern RDBMS. To achieve optimal performance and scalability with XML data, I strongly recommend you first consider this option. The idea is identifying hot elements or attributes that your workload queries on the most. Then add computed columns using user functions to promote the elements or attributes out of XML data. Note it does require application code change to modify queries to take advantage of the promoted columns. A simple example below is a table with “XMLDATA” that has a hot element “a1”. Here are steps to add a computed column to promote the element (Note adding the computed column is offline operation for the table). 

CREATE FUNCTION udf_get_a1 (@xData xml)

RETURNS varchar(30)

WITH SCHEMABINDING

BEGIN

   DECLARE @a1  varchar(30)

   SELECT @a1 = @xData.value('(/Dept[1]/orders/a1/text())[1]', 'varchar(30)')

   RETURN @a1

END

Go

ALTER TABLE Orders

ADD a1 AS dbo.udf_get_a1(XMLDATA) PERSISTED

Go

CREATE INDEX ind_a1 ON Orders(a1)

Go

 

Now the original query below

 

SELECT *

FROM Orders t

WHERE t.XMLDATA.exist(N'(/Dept[1]/orders/a1/text())[1][.="Company10001Special"]') = 1

 

Needs to be rewritten as

 

SELECT *

FROM Orders

WHERE a1='Company10001Special'

In a customer lab, we observed exponential performance gain by using this method (1000+ times faster!!). To caution against over-promotion, when you add too many promoted columns, the overhead of index change could reduce the performance gain.

2.       Typed or Untyped XML data? Typed XML means there is validating schema defined against the data. For untyped XML, the whole XML data is treated as a big string. To decide whether or not to define typed XML, you need to examine your XML query access pattern. Generally speaking, schema speeds up lookups since the data types of XML elements/attributes are known. But it would slow down INSERT due to overhead of validating new data. For UPDATE, it would benefit from faster lookups like SELECT, but incur same schema validation overhead like INSERT.

 

 

Better off

Worse off

Typed (w/ schema)

Faster SELECT

Slower INSERT

Untyped (w/o schema)

Faster INSERT

Slower SELECT

 

3.       What XML indexes to add? Again depending on query access pattern, you may choose different XML indexes. Note XML index strategy is different than relational index. Popular DTA (Database Tuning Advisor) tool wouldn’t be able to recommend XML indexes. Fortunately there are only 3 types of secondary XML indexes to consider in addition to the required primary XML index.

 

Secondary XML index type

Benefits this type of queries

Path

XPath queries  with explicit path expression (XMLcol.exist ('/Dept/Orders/[@id="10001"]') = 1)

Value

XPath queries  with no explicit path expression but with value predicate (XMLcol.exist ('//[@id="10001"]') = 1)

Property

XPath queries  with multiple row results

 

As straight forward as it sounds, you may find it difficult to examine your application and find out what appropriate indexes to add. The best way I recommend is to add all 3 types of XML indexes and test your workload to find out which XML indexes were used to determine the ones to keep (Be aware of rebuilding index affecting system availability). You can build XML indexes on either untyped or typed data. But you will get more gain with indexes on typed data. I see up to 50 times faster query response (SELECT queries) compared to plain untyped XML! However for DML queries, they could be significantly impacted by the overhead of XML index change plus schema validation. So again the key is to test your workload to find out if it makes sense to add XML indexes.

 

In conclusion, Promotion to relational column is the primary option you should consider when working with XML data. Whether or not to use XML schema and XML indexes, you need to be careful about your decision. Test your workload and identify the XQueries, which are critical to the overall performance. There is no DTA to help you, no query hints apply to XQueries either. Examine the query plans of those XQueries thoroughly to eliminate inefficiency.




Full Text Indexing Terabytes of Files with SQL Server and Cloud Storage
Published: Wed, 03 Feb 2010 02:52:00 GMT


Author: Darko Sancanin, Nicholas Dritsas

Reviewers and contributors: Lubor Kollar, Stuart Ozer, Michael Thomassy

 Business Case

There are currently over 50 million files (over 10 Terabytes of data) that the customer is migrating into a custom project management application (for international customers) that can be accessed via the application and can be searched upon.  SQL Server 2008 full text indexing is used to index the content of these files which gives them rich searching capabilities within their application.

 

The initial solution was to store the files within SQL Server but maintaining Terabytes of distributed data proved to be unsustainable both from a performance, maintenance and storage cost point of view.

 

It was decided to move the files into cloud storage, this allowed to drastically reduce the hosting costs, reduce the load on the servers from storing and serving out this data, increase the redundancy plan by easily replicating the data across cloud instances and to increase the speed at which users can access their files by allowing them to select which region to serve the data from based on the users geographic location.

 

One of the major challenges faced with this approach was maintaining the full text index on these files while moving the files out of the local file system and into the cloud.

 

Technical Background

Originally the files were stored as part of the database using file stream pointers with an incremental full text index over these files and file metadata, such as file size and content encoding, stored in the database.

 

Moving these files out to the cloud meant that sql server had no way to access these files to index them, so it was decided to use the same underlying technology that SQL Server uses to index the files, IFilters. IFilters are a plug-in architecture implemented via COM components used by Windows applications to index files based on file format, they are used by Windows Desktop Search and SQL Server to extract the content from the files for indexing.

 

The process of storing a file in the application first starts with a user uploading the file to the servers, the software then loads the appropriate IFilters and extracts the text from the documents, concurrently create thumbnails for the file and upload the file and thumbnails to cloud storage. The text that was extracted from the file is then stored in the database along with metadata for the file and the URI to the file in cloud storage.

 

A slightly modified sample of the code that achieves this is shown below:

 

string fileExtension = Path.GetExtension(filePath);

                   

var filterLoader = new FilterLoader();

var iFilter = filterLoader.LoadFilter(fileExtension);

                   

var filterReader = new FilterReader(iFilter);

var fileTextContent = filterReader.ReadAllText();

 

var thumbnailCreatorFactory = new ThumbnailCreatorFactory();

var thumbnailCreator = thumbnailCreatorFactory.GetThumbnailCreator(fileExtension);

var thumbnail = thumbnailCreator.GetThumbnail(filePath);

 

SaveFileInformation(fileId, fileTextContent, thumbnail);

 

The path to the COM component that implements the IFilter interface is stored in the registry based on the file extension, the FilterLoader class traverses the registry to find the appropriate filter based on the file extension. The next step is to read the contents of the file using the methods of the IFilter interface.


To create the thumbnails we use a 3rd party component to generate thumbnails for the PDF documents, and for other document types we use the Windows API to get the shell to generate the thumbnails, finding the appropriate thumbnail creator class is abstracted behind the ThumbnailCreatorFactory. Then, once we have the text and thumbnail information. we save it to the database.

 

The schema for the table looks something like the following:

 

CREATE TABLE [dbo].[FileStorage](

                [FileStorageId] [bigint] IDENTITY(1,1),

                [FileName] [nvarchar](250) NOT NULL,

                [FileExtension] [nvarchar](10) NULL,

                [FileTextContent] [nvarchar](max) NULL,

                [CloudPath] [nvarchar](max) NULL,

                [CloudLargeThumbnailPath] [nvarchar](1000) NULL,

                [CloudLargeThumbnailSize] [int] NULL,

                [CloudSmallThumbnailPath] [nvarchar](1000) NULL,

                [CloudSmallThumbnailSize] [int] NULL,

                [Encoding] [nvarchar](max) NULL,

                [Size] [bigint] NULL

)

 

An example of a subset of this data is attached.  

 

A full text index is created over the FileTextContent field, which allows us to find particular files, display the file name and thumbnails to the user.  If the user chooses to download the file, then the user is given a link to the cloud storage file location.

 

The contents of the file are still stored in the database so there is still some overhead with the size of the database, but we are seeing over 15 to 1 reduction in file storage necessary to store just the text from the documents in the database. This approach also gives the flexibility to increment or regenerate the full text index anytime that we want.

 

References

The IFilter interface on MSDN: http://msdn.microsoft.com/en-us/library/ms691105(VS.85).aspx

IFilter.org, which contains links to some IFilters that are not included with  Windows, including the Office 2007 filters and PDF filters : http://www.ifilter.org

On MSDN code gallery there is an example of loading an IFilter and extracting the text using the filter in C# : http://code.msdn.microsoft.com/ifiltersample




Whitepaper About Star Join Optimizations
Published: Wed, 16 Dec 2009 20:20:03 GMT


When we talk to Data Warehouse customers – they often raise questions about the star joins improvements SQL Server 2008. Star join is a set of optimizations that provide some interesting performance gains in data warehouse workload. Instead of digging into details here on SQL CAT we want to draw your attention to this nice whitepaper about the subject:

Happy reading.




      
 
PASS Official Chapter 
 

PRPASS is a Ineta Latam Member
 
 
© 2006 Puerto Rico PASS. All rights reserved.
Designed for resolution 1024x768 or higher