Monday, February 20, 2012

online compression building

I read in a blog I was pointed to that compressed indexes are built online -- how is this done? Live, page by page, so the impact on a query is only a single page? Is it done like a shadow approach, using a snapshot and then switching over?|||SQL Server will compress pages when they become full. Normally, when a page becomes full in a B-Tree, the page gets split. With compression, before a page is split, SQL Server checks whether the page can be compressed, and if so, it compresses it to avoid the page split.

This means that different pages can have different compression information, which is good for B-Trees and data that is clustered together.

When you query the data, it is decompressed as it is read.

Thanks,|||But if I load a table, will the data be compressed on load as it comes in, or only once a page splits?|||The algorithm works as follows:

1) When a page is full and compression is not enabled, a page split is done.
2) When a page is full and compression is enabled, the page is compressed first (see the sketch below). After the page is compressed, a check is done to see whether the new row fits. If the row fits, it is added to the page; if it does not fit, a page split is done.
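
For illustration, a minimal T-SQL sketch of what this means in practice (the table and data are made up; DATA_COMPRESSION = PAGE is the option that turns this per-page behavior on):

CREATE TABLE dbo.OrderDetail
(
    OrderID     int          NOT NULL,
    LineNumber  int          NOT NULL,
    ProductName varchar(200) NOT NULL,
    CONSTRAINT PK_OrderDetail PRIMARY KEY CLUSTERED (OrderID, LineNumber)
)
WITH (DATA_COMPRESSION = PAGE);

-- Ordinary inserts; nothing special is required at DML time. Each page is compressed
-- only once it fills up, and it is split only if the new row still does not fit afterwards.
INSERT INTO dbo.OrderDetail (OrderID, LineNumber, ProductName)
VALUES (1, 1, 'Widget'), (1, 2, 'Gadget');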

Thanks,|||From what I gather, then, you never compress a page until it fills up... I assume the compression cost hits the transaction that triggers it -- any idea how long this takes? Have you ever looked at just compressing the data on the way in, or is this what gives you the 'forever' online aspect?|||The reason to compress only when a page is full is that you can only save space a page at a time. For instance, suppose I have two rows that together take up 4000 bytes. Since they fit on a single page, we have to allocate one 8K page anyway. Even though we could compress the 4000 bytes, we would still need that complete 8K page, so nothing would be saved.

As for the online aspect, the transaction that hits the page-full boundary takes the compression hit. However, this is not much different from page splits today: without compression, when a page is full, a page split is done, and the transaction that ended up in that situation pays the cost of the split.

The compression algorithm we use is pretty fast and cheaper than doing a page split.

Thanks,|||Marcel -- I really appreciate your dedication to your client set -- it gives us knowledge of the technology ahead of time, which makes it easier to consume.

So I think I have a pretty good idea of how this works. I just have two more questions, then I won't bug you until I see you at PASS next week in person :)

(a) I read the extensive blog referenced in these posts on how compression works, and I mostly get it. Is the dictionary only used to replace prefixes for columns? Or could there be entries in the dictionary that aren't column prefixes, and why would that be? Perhaps to do cross-column compression?

(b) From the above post, I see when compression happens now. So my only question is: what do I do with an existing table that has full pages, or loads of pages 'to be split'? Do I have to wait for activity to split them, or can I call some maintenance routine or a DBCC tool to force it to happen? If I have a large historical table, I want to get these benefits, but if there is no change activity, it seems I won't. I could also see that being beneficial on a staging load, where I could compress after the initial load.

Thanks!
|||Unfortunately, I won't be at PASS, as I have to finish the last pieces on compression.

Regarding your questions:

The way page compression works is as follows:
1) We try to insert a row on a page and find out that the page is full.
2) We compress the page by doing the following:
a) For each column, we calculate a column prefix. After the column prefix is calculated, we go through all the rows and, for each column value, apply the prefix and store what remains of the value.
b) After that is done, we go through all the values again and see whether there are duplicate values that can be stored in a dictionary. If there are, we build a dictionary, go through each column value again, and replace duplicates with references to the dictionary.

The actual algorithm is not exactly as described above (we actually do only a single pass over the data to calculate the prefix and dictionary at the same time), but the above describes how it works at the high level.
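
To make steps a) and b) concrete, here is a simplified, notional illustration (the real on-page format stores prefixes and the dictionary in special structures, so treat the layout below as a sketch, not the actual byte format):

-- Notional values of one column on a page:   'aaabb', 'aaabc', 'aaabb', 'xyz'
-- a) Column prefix 'aaab' is stored once; each value keeps only the part that differs:
--       [4]'b', [4]'c', [4]'b', [0]'xyz'      ([n] = number of prefix bytes reused)
-- b) The rewritten value [4]'b' occurs twice, so it moves into the page dictionary,
--    and both rows store a small reference to that dictionary entry instead.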

So when do we compress? There are several options:

1) For existing indexes / tables, you can do ALTER TABLE / ALTER INDEX and specify a new option that lets you choose ROW or PAGE compression (a T-SQL sketch follows this list). ALTER INDEX can be run in two modes: in offline mode, we take a table lock and compress all the pages in the index (under the covers we are really rebuilding the index from scratch, doing compression for each page that gets full). If you rebuild in ONLINE mode, we use the same algorithm as for DML (see below).

2) When you create a new table, you can specify that you want the table to be compressed. Any DML happening after that will use the compression algorithm (see below)

3) If you already have a table / index that is compressed, and you start doing DML operations (insert / update / delete) against it, before we split pages, we see if the page can be compressed. There are two cases:

a) The page is not compressed yet, in which case we compress the page.
b) The page is already compressed. In this case, we recalculate the prefix and dictionary and see whether a different prefix / dictionary would give better compression.
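
As referenced in option 1, a hedged T-SQL sketch (the object names are hypothetical; DATA_COMPRESSION is the option in question, and ONLINE = ON chooses the online rebuild path):

-- Compress an existing table (its clustered index or heap) offline:
ALTER TABLE dbo.SalesHistory
    REBUILD WITH (DATA_COMPRESSION = PAGE);

-- Compress a single index, rebuilding it online so the table stays available:
ALTER INDEX IX_SalesHistory_OrderDate ON dbo.SalesHistory
    REBUILD WITH (DATA_COMPRESSION = ROW, ONLINE = ON);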

One thing to note is that this DML-time compression is only done for indexes and not for heaps. There is an internal reason for this, but I don't want to discuss it in detail, as it requires a deep understanding of how the engine works internally.

The cool thing is that you can set different compression settings for different partitions of an index, and that different indexes in a table can have different compression settings.

For instance, suppose I have a datawarehouse where I frequently query data from the last month, but data that is older than a month does not get queried often.

You could create a partition for each day's data, keep the most recent 7 days uncompressed, and compress everything older than 7 days. Every day, you could have a job that takes the oldest non-compressed partition / index and compresses it.
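
A rough sketch of that daily job, assuming a partitioned table called dbo.FactSales and that partition 3 happens to be the oldest uncompressed day:

-- Page-compress only the oldest uncompressed partition; newer partitions keep their settings.
ALTER TABLE dbo.FactSales
    REBUILD PARTITION = 3 WITH (DATA_COMPRESSION = PAGE);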

Another scenario is where you compress the clustered index on a table to save space on the data, but keep the non-clustered indexes uncompressed for performance reasons.
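
For that second scenario, something along these lines (again with made-up names):

-- Page-compress the clustered index (the table data itself)...
ALTER INDEX PK_Orders ON dbo.Orders
    REBUILD WITH (DATA_COMPRESSION = PAGE);

-- ...but leave a frequently used nonclustered index uncompressed for lookup speed.
ALTER INDEX IX_Orders_CustomerID ON dbo.Orders
    REBUILD WITH (DATA_COMPRESSION = NONE);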

If you need more information, let me know,

Thanks,
