Summary: This document
discusses things to consider when managing a large, high-performance relational
data warehouse, especially one that hosts direct queries from users and
reports. The focus is on the efficient operation of management functions so
that the data warehouse will be highly available to service query requests. The
discussion includes some of the new features of SQL Server 2005 and
considerations to take into account when using these features.
Copyright
The information contained in this document represents
the current view of Microsoft Corporation on the issues discussed as of the
date of publication. Because Microsoft must respond to changing market
conditions, it should not be interpreted to be a commitment on the part of
Microsoft, and Microsoft cannot guarantee the accuracy of any information
presented after the date of publication.
This White Paper is for informational purposes
only. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO
THE INFORMATION IN THIS DOCUMENT.
Complying with all applicable copyright laws is the
responsibility of the user. Without limiting the rights under copyright,
no part of this document may be reproduced, stored in or introduced into a
retrieval system, or transmitted in any form or by any means (electronic,
mechanical, photocopying, recording, or otherwise), or for any purpose, without
the express written permission of Microsoft Corporation.
Microsoft may have patents, patent applications,
trademarks, copyrights, or other intellectual property rights covering subject
matter in this document. Except as expressly provided in any written
license agreement from Microsoft, the furnishing of this document does not give
you any license to these patents, trademarks, copyrights, or other intellectual
property.
Unless otherwise noted, the companies, organizations,
products, domain names, e-mail addresses, logos, people, places, and events
depicted in examples herein are fictitious. No association with any real
company, organization, product, domain name, e-mail address, logo, person,
place, or event is intended or should be inferred.<INCLUDE THIS DISCLAIMER
ONLY WHEN APPLICABLE TO YOUR CONTENT>
A number of activities may need to occur to effectively manage a data
warehouse on the Microsoft® SQL Server™ platform. These activities
include:
·ETL (extraction, transformation, and loading of incremental data)
·Database backups
·Index defragmentation
·Database health checks
·Update statistics
·Summary table updating
·Moving aged data to less expensive disk
·Archiving historic data
·Dimension pruning (removal of obsolete dimension members)
There are costs associated with these activities that must be considered
and mitigated. Costs usually come in one of the following forms:
·Availability. Most of
these operations can be performed when users are online but at the potential
cost of increased query execution time. User query response time is
particularly a consideration if incremental updates of the data warehouse are
performed throughout the day. If updates can be performed during a daily batch
process, outside of the active query time frame, emphasis is usually on the
speed of the batch execution in order to meet user availability requirements.
·Server
resources. Excessive use of server resources can have an impact
on users, if management activities are performed outside of a predefined batch
process. If these activities can take place offline, perhaps at night or on the
weekends, the primary consideration is disk space. Server resources should also
be monitored in relation to parallel execution of management activities. The
goal is to perform as much work at a time as possible, without thrashing any
one of the server resources.
·Administration
costs. The more complex the process, the more highly skilled the
administrative staff needs to be. Generally speaking, these processes should be
highly automated to lessen the recurring human resource cost to manage the
system. This automation should be simplified to minimize the resource costs
associated with the diagnosis and resolution of occasional problems that might
occur or with resource turnover. In addition, the tasks covered in this
document should be addressed in order to avoid slow performance and/or
unplanned outages.
Dealing with these management issues in a mid- or large-sized relational
data warehouse can seem to be a daunting task. When database administrators
(DBAs) are responsible for database sizes over 500 GB, and certainly when
the size is in the multi-terabyte range, they can wonder whether it is possible
to do all that must be done to ensure a healthy, well-performing database
within the required batch window. The thing to remember is that the vast majority
of data warehouse fact tables, which are the tables that represent the largest
proportion of the database, are mostly stable. That is, they represent history
that is rarely, if ever, updated. Often times, only 5-10% (or less) of the
database is actively updated. For instance, a 2-terabyte database only requires
active maintenance on 100-200 GB or less. You also need to take another
look at how traditional maintenance is performed. Instead of broad stroke
maintenance, analyze where the maintenance truly needs to occur and when.
This white paper provides insight into other ways of looking at
maintenance and how it is performed. First, we discuss how to partition, or
breakup, very large tables so they can be more easily managed. Then we discuss
the various management activities required to ensure a healthy,
high-performance data warehouse. For each of these, we look at some ways to
minimize the various costs mentioned above to make the entire process more
efficient. Many of these techniques can be applied in OLTP environments, though
the data warehouse database is the focus of this document. Finally, we present
a sample data warehouse environment scenario and propose a high-level schedule
for management activities.
This paper assumes basic knowledge of data warehousing concepts,
dimensional modeling, SQL Server 2005 table partitioning, and
SQL Server database maintenance activities. For additional information on
these topics, see the References section at the end
of this white paper.
Partitioning for Management
One of the primary tools for managing data warehouse operations is table
partitioning. Partitioning is a means of breaking up data in tables into more
manageable sections. Almost all management operations in SQL Server can be
done at the partition level, so partitioning provides the ability to break up
the management of a very large table into more manageable units. As we will see
later, partitioning can also be used to segregate sections of the table that no
longer require management so we are only left with those sections that do.
Partitioning can be either vertical or horizontal. Vertical partitioning
refers to a method of breaking out columns, which logically belong to a single
table, across multiple partitions which are stored as physical tables. This is
an effective method in an OLTP environment but not in a data warehouse. In a
data warehouse, the very large tables are fact tables with a tremendous number
of rows but few columns. Horizontal partitioning involves segmenting tables
based on values of a column in the table. Horizontal partitioning is almost a
given in a data warehouse environment where the tables are of a significant
size, say 300 GB or more.
Microsoft SQL Server has provided partitioning capability through
partitioned views since version 7.0. SQL Server 2005 provides a new
capability, table and index partitioning. Our focus on partitioning will use
this new feature.
We can use table and index partitioning to segment data through a life
cycle. For instance, we can decide to partition a Sales fact table by month of
sale. As a set of monthly sales records age, say after three years, they
lose their value and may be moved to an archival system or deleted. Thus,
partitioning by a date value is extremely common in a data warehouse system.
From the standpoint of management, any column that represents the age and/or
segregates dynamic data from stable data is a potentially good choice.
One thing to note is that we typically use partitioning to divide a very
large table, usually fact tables. The size boundary that indicates partitioning
varies but is generally around the 300‑GB range. When determining whether
a table should be partitioned, consider the projected size, not just the
current size. Partitioning adds a level of complexity to the environment, so
only consider this option for these large tables. Very large dimensions are
rarely so large as to benefit greatly from partitioning. However, a very large
dimension, such as a Customer dimension, can be partitioned to minimize the
portion of the dimension table that requires maintenance at a given time
period.
A helpful piece of information to obtain at the onset of a data
warehousing project is the point at which data becomes stabilized, or is no
longer updated. This may be obtainable from the source systems or from the
business users. Late-arriving facts are not uncommon, but there is usually a
point at which you can no longer expect them. Even if you cannot absolutely
distinguish such a point of time, you can build some intelligence within the
ETL (extraction, transformation, and loading) application to automate this.
This is discussed in detail later in this paper.
An ultimate goal is to use this information to establish a time period
after which fact data becomes stable. This can then be used to create a much
more efficient data management strategy, as explained throughout the remainder
of this white paper. A strong recommendation is to segment these partitions
onto their own filegroup(s) which are then marked as read-only filegroups.
Marking these filegroups as read-only is not a requirement, but doing so
provides some significant benefits, such as the following:
·It
guarantees that no changes have been made that could require additional
maintenance.
·It
eliminates the need for associated transaction log backups when restoring
read-only filegroups.
·It
provides more flexible piecemeal restore scenarios when the database is set to
the simple recovery model.[1]
With this goal in mind, there are two high-level strategies for laying
out partitions on filegroups to segregate active data from latent data:
Create two filegroups, one for partitions that are
still updated and one for read-only partitions
·Advantages
·Fewer filegroups to maintain
·Simplicity
·Disadvantages
·Less flexibility in piecemeal restore.
·Data must move when going from active to inactive.
·Data “freeze” date must be firm, or else partitions may have to move
back and forth between the two filegroups.
·Can restrict management options, such as backup and consistency checking
granularity.
Create one filegroup per partition, flag each relevant
filegroup as read-only when the partition is no longer updated
·Advantages
·Data does not have to be moved when going between active and inactive.
·Full piecemeal restore capabilities.
·ETL is more efficient if data is updated after the established “freeze”
date.
·Provides flexible management options.
·Disadvantages
·More filegroups to maintain
For these reasons, the second strategy carries a strong recommendation
when implementing the management strategy outlined in the remainder of this
white paper. The simplicity of the first option, or a combination of both
options, may be desirable in a smaller data warehouse where downtime for
management is less of an issue.
A final partitioning decision has to do with granularity. It is very
common to set up partition boundaries to be uniform across the entire table,
for instance, by month. There is nothing wrong with this strategy, but remember
that boundaries do not have to be uniform. It may be beneficial to have more
granular partitions for current data and less granular as data ages. For
instance, data can be added as daily partitions, and merged into weekly or
monthly partitions as they age. The usefulness of this strategy varies from
environment to environment. The main thing to consider is if there are
occasional times, perhaps on the weekend, to devote to partition merging and
whether this is worth the time savings observed during data loads. Note that
additional partitions are not the issue that they were with partitioned views.
The number of partitions in a partitioned table has no affect on query compile
time and the support for the number of partitions has been greatly increased
with this feature (from 256 partitions with partitioned views to
1,000 partitions with partitioned tables).
There are several activities that occur during the ETL batch
cycle—inserting new data comprises a large part. Data inserts (and updates)
have repercussions beyond the insert itself, most of which we discuss here.
First, it is important to determine the goals for data updates during
the batch cycle. A common goal is to minimize the batch window by making the
inserts faster. It is also important to simplify the ETL process to make it
easier to maintain. Server resources should be used efficiently to reduce costs
and affect the first goal of speeding the update process. In other words,
parallelize to the extent possible without introducing bottlenecks. Server
resources can include, in this case, CPU, disk, tempdb, network,
transaction logs, memory, and more. It is also important to understand the side-effects
of the updates and the additional management activities that need to be
performed as a result. Index fragmentation can be managed, to an extent, by the
design of the index. Row updates and late-arriving facts can still result in
fragmentation both at the index and extent level.
Bulk inserts continue to be the fastest means of inserting new data.
There are a number of conditions that must exist in order to achieve a true
bulk insert, all of which are identified in SQL Server Books Online. The
quickest way to determine if you’ve actually met all of the criteria is to
examine the locks that are taken while the insert runs by running a query
similar to the following:
WHERE
resource_database_id = db_id(“REAL_Warehouse_Sample”)
ORDER BY
request_session_id DESC
You should see locks that look like the following:
Bulk insert locks – minimal logging and locking
Bulk loading is certainly a worthwhile goal on the initial loading of
data, but it is not out of reach even during incremental loads. Aside from a few
BULK INSERT (or BCP) options, the destination table must be empty or have no
indexes. This can be accomplished in two ways: drop the indexes on the table
prior to the load, or use a new partition for the new data. Dropping indexes is
a common recommendation anyway when inserting any significant amount of data.
This only makes sense on smaller tables, or externally loaded partitions of a
larger table.
Common insert strategies
The strategy for inserting data is dependent on a few factors, namely
the availability requirements of the data, the simplicity of the load process,
and whether there is a combination of updates and inserts. Again, this is
primarily an issue with fact tables, not only because of their size but also
because of the number of inserts/updates that occur in a single batch cycle.
Inserting new data into the data warehouse can offer some challenges. Table
partitioning provides the capability to load data outside of the table that
users or applications access so that the availability of the data is greatly
maximized. This was somewhat true of SQL Server 2000 and partitioned
views, but has been improved with partitioned tables.
It is preferred that fact table updates be limited to inserts, but there
are scenarios that require a combination of inserts and updates. There are a
few high-level strategies for efficiently updating partitioned fact tables
during the ETL process. They are mentioned here primarily due to their impact
on partitioning design.
·Insert/update
rows directly in the partitioned table. This is the most straightforward
method, but will usually only perform well when there are no indexes on the
partitioned table, the number of rows to be inserted/updated are evenly
scattered across several partitions, or the number of inserted/updated rows are
small. Evaluate the performance of this method against the next one if you are
unclear which is better in your environment.
·SWITCH out
the partition to be updated, drop all indexes, update the data, recreate the
indexes, then SWITCH the partition back in. This option is obviously more
complex to implement. It is highly typical that most fact table updates will be
highly concentrated to just one partition—the current partition. You can
combine this method and the previous one, using this method to update the
current partition and the previous method for all other inserts/updates. Again,
evaluate the performance of either method or a combination to develop the best
approach for your data loads.
·Create a
new partition to cover the span of time since the last load, insert the data,
create the indexes, SPLIT the partitioned table (function) to make a home for
the new partition, and SWITCH the partition in. The only benefit this approach
has to the second option is that the data that was previously in the
partitioned table is never unavailable to users. This might be a good approach
to near real-time loads, allowing updates to occur while business users are
accessing the data. The downside of this approach is that the partitions will
probably need to be merged to make larger partitions at some point, which can
be time-consuming (perhaps during the weekend).
We’ve mentioned a few strategies for managing the time it takes to load new
data into the warehouse while mitigating the time data is unavailable for
queries. The remarks regarding relative impact on the environment is based on
experience, but could vary from environment to environment. It is always best
to compare methods within your own environment if you are unsure of which is
best for you.
Maintaining pre-aggregated data
Data warehouses are usually supplemented with some form of
pre-aggregated data, which removes the need to repeatedly recalculate commonly
queried aggregated data. This concept is what allows Analysis Services to
perform so well. In SQL Server, these pre-aggregations can take one of
three forms, all of which require either implicit or explicit management after
the data warehouse is updated. If indexed views can be used for most or all
pre-aggregated summary tables, there is overhead when the underlying fact
tables are updated. If the fact tables are sizeable and, therefore,
partitioned, indexed views are probably not an option since they would need to
be rebuilt every time a partition was switched in or out. User-defined summary
tables or OLAP cubes require their own maintenance when the detailed fact data
is updated. This can be minimized by building partition awareness into the
summary tables and/or cubes and building intelligence in the ETL process to
indicate which partitions are updated during a load.
This is usually accomplished in Analysis Service cubes by partitioning
the cube’s measure groups in the same way as the underlying fact table. This
way, only the measure group partitions that are based on the fact table
partitions that were updated are reprocessed. This greatly diminishes the cube
processing time.
When maintaining several partitions in a partitioned table, it is
practical to maintain a table that is updated during the ETL process to keep
track of which partitions are updated during each batch. This can be used to
determine during scheduled maintenance which partitions have been updated since
the last maintenance. This can be used, of course, to kick off the tasks to
update cube partitions or to update segments of user-defined summary tables.
Likewise, user-defined summary tables should have a component that
identifies what rows are associated with a partition. The partitioning key will
most likely be highly significant in the data warehouse, so it will naturally
become part of any summary table. For example, if order date is used as the
partitioning key, some aspect of order date will reside in any summary tables
dependent on the granularity of this date dimension in the summary table. Let’s
say the fact table is partitioned on order date by calendar month. The summary
table date granularity may be higher, lower, or equal, but it will be enough to
identify the rows that are changed. If the October partition is updated, all
rows in the summary table that represent all or part of October should be
updated. This can be used to only recalculate pre-aggregated data in the
summary table just like Analysis Services does.
Aging fact data
In most data warehouses, the value of data diminishes over time. A
certain amount of data must be kept online and accessible for queries, but the
frequency of the query hits on data tends to go down as the data ages. This
fact can be used to implement more cost-conscious storage methods by moving
older data to less expensive disk subsystems. The cutoff points for aging data
vary from environment to environment. This information can be gathered from
business users at the beginning of the data warehouse project and monitored in
production.
Usually, two or three storage subsystems are used to handle Tier 1,
Tier 2, and potentially Tier 3 data. In a three-tier scenario,
Tier 1 data storage may represent RAID 1+0 on the fastest and most
highly-available storage subsystem (redundant, multipath). Tier 2 may be
RAID 5 on the fastest storage (this assumes that data on Tier 2 is
older than the known potential update window and won’t incur the write overhead
of RAID 5). Tier 3 may represent RAID 5 on a less expensive disk
storage technology. These are just examples; your configuration depends on
budget, storage vendors, update windows, and other factors.
Part of the regular maintenance plan would be to move data from one tier
to another as it ages. This is true data movement, which requires an extended
batch window, such as over a weekend. It is best to minimize the number of
partitions that move at any given time to manage the time required for the data
movement. Also, each tier implies additional movement of data, that is, three
tiers means that data will move twice in its lifetime where as two tiers will
only require one move.
To elaborate, imagine a data warehouse where the most actively queried
data is 13 rolling months. Year-to-date queries are common, as well as
monthly comparisons to the same month last year. These queries represent
roughly 75% of the total queries. We also have another cutoff that includes
data from the last three fiscal years. This represents an additional 20%
of the queries, so Tier 1 and Tier 2 storage will serve up data for
95% of the query load. All data that is prior to the beginning of this fiscal
year minus 2 will move to Tier 3. We currently have data starting from
January 1st, 2000. If the fiscal year begins on July 1st (FY
2007) and ends on June 30th, the data is represented as follows for this
scenario on October 15th, 2006:
Note: The number of months does not
necessarily directly correlate to the amount of the data in each tier. As business
grows, the number of fact rows generally grows proportionally. In this case,
Tier 1 space may be larger than Tier 2 space, even though there are
more months stored on Tier 2.
Remember, that this is just a hypothetical scenario. Two tiers of storage
are likely to be more common, in order to minimize data movement and due to the
storage options available in a particular environment.
Partitioning makes the data movement more effective. Tables large enough
to warrant a storage tier strategy should almost certainly be partitioned for
other management reasons.
Some storage vendors also provide a storage system feature to perform
such movement behind the scenes. This occurs at the physical disk level and is
undetected by SQL Server. This greatly minimizes the impact of many tiers
and the amount of data that can move at one time. The database files will
probably need to be specific to a partition, since entire disk file(s) will be
moving.
After implementing an aging strategy, it is important to monitor the
files to make sure that the anticipated activity is accurate. This can be
accomplished by reviewing the disk activity in the System Monitor. The ability
to monitor the files is somewhat related to the way that the database files are
mapped to logical disks. You can ascertain whether data is moving too quickly
by comparing disk activity on the Tier 2 storage to that of Tier 1,
for example. If a high proportion of disk activity is regularly occurring on
Tier 2 as opposed to Tier 1, data may be moving too early.
For additional information on data aging, see the Project REAL:
Data Lifecycle – Partitioning white paper on the Project REAL Web site
(http://www.microsoft.com/sql/solutions/bi/projectreal.mspx).
Archiving fact data
To manage the size of a data warehouse over time it is important to
establish a time after which data is no longer needed. This should be asked of
the business community at the beginning of the project so it can be built into
the management process during development. Waiting until the data warehouse is
bursting at the seams puts the DBA in a crisis mode where the process may not
be executed as effectively. Also, this information is important to have when
designing the partitioning strategy. Partition boundaries should not exceed the
portion of data to be removed at any point in time. For instance, if data is to
be removed a fiscal month at a time, the partition boundary should be no larger
than fiscal month. This allows the use of partition switching as a means of
removing the data from the partitioned fact table. The alternative is to use a
DELETE command which takes much longer and has a heavy impact on the log.
If the preceding conditions have been met, the sliding window
implementation to remove data is easily performed.
1.Create a
new table that mimics the partitioned table in terms of metadata and indexes.
2.SWITCH out
the oldest partition to the new table.
3.Perform
any operations necessary prior to completely removing the data, such as a final
backup.
4.TRUNCATE
the new table.
5.MERGE the
oldest partition with the next oldest in the partitioned table (function).
6.If more than
one partition is to be archived, repeat this process from step #2.
7.When all
data has been removed, DROP the table that was used for the switch out process.
For more details, see the partitioning white papers in References at the end of this whitepaper.
Maintaining dimensions (pruning)
A management activity that is often overlooked is dimension pruning, or
the removal of old dimension members. Again, business rules dictate when
dimension members can be removed. There are benefits to pruning dimensions on a
regular basis, including:
·Improved
end user experience. It can be cumbersome to browse large dimensions.
Oftentimes, users do not want to see dimension members that have no related
fact table rows.
·Query
time. Joins to dimension tables are more efficient on smaller tables.
·Potential
benefits to cube processing and query performance
One criterion is to remove members that are not referenced by any fact
table rows, although care must be taken to not remove new dimension members for
which no facts have arrived. It is also possible that the business may opt to
categorize dimension members in a generic member for very old references.
Another consideration is with slowly changing dimensions—you may want to keep
the first version of the dimension, even if it is no longer referenced by any
fact rows. If a parent key is used to tie all versions of a dimension member
together, be sure that this key is updated to the earliest represented version
of the member if any of the initial versions are deleted.
Dimension pruning is usually an infrequent maintenance activity
(quarterly or yearly), though this depends on the growth rate of a dimension.
Be sure to rebuild indexes after dimensions are pruned to compact the index
pages. Also run Update Statistics, if Auto Update Statistics is turned off.
Finally, the best way to manage the size of dimensions is to minimize
the number of new inserts into the dimension. This is especially relevant with
Type 2 slowly changing dimensions. Be sure to determine which dimension
columns truly require versioning and insert new version rows only when any one
of those columns is modified. Also, confirm whether dimension row versioning
needs to occur before facts arrive that are related to them. For instance, if a
product is created and multiple changes are made to add attributes to the
product, it may not be necessary to track all of those changes unless or until
the product starts selling (sales fact records start to arrive).
Backup and
Restore Strategies
Database backups are often one of the first database maintenance
activities that come to mind. Backups can seem to be less critical in a data
warehouse database because the data can often be reproduced by reprocessing ETL.
However, this is usually a more time-intensive proposition than simply
restoring backups, especially in a severe disaster where the entire database
must be recovered. It is also likely that the data is archived off of the
source system long before it is removed from the data warehouse, making the
data warehouse the only place where this historical data is available. If the
database is strategically partitioned, the amount of data that must be
initially restored in a recovery scenario can be greatly minimized. The ability
to perform piecemeal restores make critical data available much faster, as we
explain in this section. Backups can also be used to create a mirror of a
database for load balancing or to perform additional maintenance, such as
checking for index fragmentation or consistency checking on a separate server.
New functionality in SSIS can also be used to create checkpoints
throughout the ETL/batch process. If this process takes five hours every night,
it is a good idea to take occasional checkpoints after significant activities
so that a problem that occurs well down the line does not require that the
entire batch process be rerun! If a problem occurs in the final hour and
a checkpoint was taken after four hours of processing, the ETL process can be
restarted (after the problem is resolved) at the point of the last checkpoint
and only less than an hour of batch processing remains.
Backups
for business continuity
A key thing to remember is that only the data that has been changed
requires a new backup. We can utilize the partitioning strategies mentioned in
the previous section to minimize the amount of data that requires backups.
Specifically, there usually comes a time in any data warehouse where older
partitions have little or no potential to be updated. This fact can be utilized
to greatly minimize the required maintenance window. A data warehouse can
extend into history for 5–7 years or more of fact data, but if only the last
two months have the potential to be updated, those are the only partitions that
require maintenance. This can be further guaranteed by setting the read-only
attribute on filegroups related to stable partitions. As partitions age, simply
flag the respective filegroup(s) as read-only, backup those filegroup(s), and
no further backups are required unless very late fact data arrives.
The partitioning strategy suggested in this document supports this
backup strategy. A best practice is to create one (or more) filegroups for
dimension tables, another one for summary tables (if used), and to isolate
table partitions to their own filegroup(s). The primary filegroup is used only
for system tables. With this setup, only the primary, dimension, summary and
active fact table partition filegroups require maintenance.
An additional benefit to this strategy is the ability to flexibly
restore a data warehouse database in the event of a disaster.
SQL Server 2005 Enterprise Edition provides an online piecemeal
restore capability which makes the database available even while additional
filegroups are restored. Only the primary filegroup must be restored in order
to bring the database online. The remaining filegroups can be restored in order
of query priority. This order usually requires restoring the filegroup(s) where
the dimensions reside, followed by the summary tables, followed by each fact
partition starting with the current partition and going backward. Note that any
query that references partitions that have not yet been restored will fail. In
other words, the query must include a filter on the partitioning key to
eliminate nonloaded partitions.
For example, visualize a data warehouse database that has been laid out
as follows:
Filegroup
Tables
Read Write/Read-only
Primary
System objects
Read write
Dimensions
All dimension tables
Read write
Summary
All summary tables
Read write
Active01
10/01/2006 - 10/31/2006
Read write
Active02
09/01/2006 - 09/30/2006
Read write
Dormant01
08/01/2006 - 08/31/2006
Read-only
Dormant02
07/01/2006 - 07/31/2006
Read-only
Dormant02
06/01/2006 - 06/30/2006
Read-only
Dormant03
05/01/2006 - 05/31/2006
Read-only
Dormant04
04/01/2006 - 04/30/2006
Read-only
…
…
…
Dormant xx
Oldest partition
Read-only
In this case, we’ll restore the filegroups in the order represented in
the table. This means that the active filegroups (read write) will be available
first. When restoring these filegroups individually, the database will be available
as soon as the primary filegroup has successfully completed. This will not be
very beneficial to the business users, though. We will continue to restore the
dimension tables, the summary tables, the current partition, and last month’s
partition. At this point, a query like the following can be run successfully:
SELECT Subject,
SUM(Sale_Amt)
FROM
Tbl_Fact_Store_Sales f
JOIN Tbl_Dim_Item i
ON f.SK_Item_ID = i.SK_Item_ID
WHERE
SK_Date_ID BETWEEN 20061001 and 20061031
GROUP BY Subject
This next query will not run successfully because it requests
August 2006 data, which has not yet been restored:
SELECT Subject,
SUM(Sale_Amt)
FROM
Tbl_Fact_Store_Sales f
JOIN Tbl_Dim_Item i
ON f.SK_Item_ID = i.SK_Item_ID
WHERE
SK_Date_ID BETWEEN 20060801 and 20061031
GROUP BY Subject
The following error is returned:
Msg 679,
Level 16, State 1, Line 1
One of the
partitions of index 'CIX_Tbl_Fact_Store_Sales_DateItemStore' for table
'dbo.Tbl_Fact_Store_Sales'(partition ID 72057594080067584)
resides on
a filegroup that cannot be accessed because it is offline, restoring, or
defunct. This may limit the query result.
The following query will also not work because partition elimination
does not occur due to a join to another table:
SELECT Subject,
SUM(Sale_Amt)
FROM
Tbl_Fact_Store_Sales f
JOIN Tbl_Dim_Item i
ON f.SK_Item_ID = i.SK_Item_ID
JOIN Tbl_Dim_Date d
ON f.SK_Date_ID = d.SK_Date_ID
WHERE
d.Calendar_Year_ID = 2006
AND
d.Calendar_Month_Desc = 'October'
GROUP BY Subject
This may be a good reason for using a smart key for the partitioning
key. If the partitioning key is a surrogate key to a date dimension, consider
using the date in lieu of a true surrogate key. If possible, use smalldatetime
instead of datetime to compact the column from 8 bytes to 4. The
valid values from smalldatetime are from January 1, 1900 through
June 6, 2079, which is usually more than adequate. Time granularity
is also to the minute, which is usually not an issue in a data warehouse. Using
a true date type instead of an integer enables the use of date functions in
queries and reports.
In general, the data type on all columns should be as compact as
possible, especially on the fact table. A savings of 4 bytes may not seem
significant, but when multiplied by one billion rows it can become so (that’s
just over 3.7 GB). Columns that exist in a clustered index or partitioning
key also are proliferated to all nonclustered indexes on the same table.
Smaller columns mean more data per page, which results in more efficient use of
memory and less disk I/O.
In SQL Server 2005, read-only filegroup restores no longer
require that a transaction log be applied as long as the filegroup backup is
made after the filegroup was designated as read-only. This may not be an issue
in a data warehouse where transaction log backups are rare, but the flexibility
is there. Also, be sure that all regular database maintenance is performed on
the partition(s) in filegroups prior to making them read-only.
Note that online piecemeal restores are only available with the
Enterprise Edition of SQL Server 2005. Much of the functionally that
we have just described depends on the table partitioning feature of
SQL Server 2005, which requires the use of Enterprise Edition.
Smaller data warehouses that are running on Standard Edition will likely not
require the backup/restore complexity that is suggested here.
ETL checkpoints
Checkpoints should be performed periodically in a long
running batch to minimize the amount of work to be redone if the ETL process
fails. SQLServer Integration Services (SSIS)
provides a new checkpoint capability in packages so that a restart resumes work
where it left off. When the package is run, package state information is
written to a checkpoint file. The restart of a package whose run was
interrupted begins at the last checkpoint in the file (the package must be
configured to use this feature). This can be used in combination with package
transactions to implement a rollback to the last checkpoint, which should be
the beginning of the transaction. Package restartability through checkpoints
must be configured by setting the package’s CheckpointFileName, CheckpointUsage,
and SaveCheckpoints properties. You must also set the FailPackageOnFailure
property to True on all containers that you want to designate as restart
points.
It is important to understand the underlying workings
of SSIS checkpoints as it may affect your package design. First of all,
checkpoints are taken at the control flow container level. A data flow is a
single unit of execution and, therefore, is rerun in its entirety if the
package fails during the data flow’s execution. For that reason, you may wish
to break up long data flows and call them individually from a control flow.
Also note that ForEach loop tasks and transacted
containers constitute a unit of work. If a ForEach loop task is interrupted
near the end of its iterations, it starts from the beginning. Likewise, a group
of tasks that are grouped in a container that has been designated as a transaction
are rolled back to the beginning of the transaction if a failure occurs during
the transaction. This is by design and should be the desired outcome, but it
can affect the package design.
Transactions should be used in conjunction with
checkpoints to provide consistency in the database. Transactions in a package
are very much the same as database transactions. They provide a means of
designating a group of actions that should be performed as a unit. If there is
a failure in the middle of the transaction, any updates that occurred from the
beginning of the transaction up to the failure are rolled back. Transactions
and checkpoints are used together to preserve database integrity in a restart
scenario.
Transaction can be designated at the packages, container,
For Loop, Foreach Loop, or Sequence container level to provide a high degree of
control for the package developer. Transactions are started by setting the TransactionOption
property for the package or container to Required. Subsequent
containers can join a transaction started by a preceding container by setting
the container’s TransactionOption property to Supported.
For a full discussion on implementing checkpoints and
transactions, see SQL Server 2005 Books Online or one of the white
papers in the References section in this white paper.
Recovery model
A detailed discussion of recovery models is best
reviewed in SQLServer Books Online and other
sources. Just a few remarks will be made here as they relate to backups of the
data warehouse. The simple recovery model is often used in production data
warehouses to manage the size of the transaction log during the huge bulk
update activity that occurs during ETL processing. This can seem a much easier
choice in a fully managed update process. The simple recovery model has its
benefits but continues to be a dangerous choice unless you are willing to
absorb the cost of re-executing ETL. The simple recovery model also removes
much of the flexibility of piecemeal restores. If recovering from backups made
while the database is set to the simple recovery model, all read-write
filegroups must be restored in the initial piecemeal restore sequence. In other
words, the database is not available to users until all read-write filegroups
have been restored.
The recommended approach is to use the bulk-logged
recovery model throughout the ETL batch processing. The data warehouse can be
backed up using either differential or filegroup backups. This assumes that a
point-in-time recovery is not relevant. After batch processing is complete and
the database is readied for the business day, place the database in the simple
recovery model. Ideally, make the database read-only to minimize locking, as
well.
If the data warehouse is updated more frequently, such
as in a near real-time environment, use either the bulk-logged or full recovery
model based on your tolerance for point-in-time recovery.
Database Health
As with all SQL Server databases, there are a handful of health
diagnostics that is a best practice to perform on a data warehouse database.
These are often omitted in data warehouses due to limited batch windows and
their very large size. Maintenance activities such as DBCC CHECKDB may be
viewed as a nicety that can be put off, sometimes permanently. This, and the
other database health assessment operations described in this section, should
be performed on a regular basis for the very reasons they are so important on
an OLTP system.
Database consistency checking
SQL Server database administrators are usually well aware of the
importance of database consistency checking, especially in the OLTP
environment. We have seen this operation ignored frequently in the data
warehouse environment due to the amount of time that it requires on these very large
databases. This is further impacted by the fact that it is mostly an
all-or-nothing proposition. The granularity of consistency checking is at the
table level. Fact tables can easily grow into the multiterabyte range, making
consistency checking very time consuming. Note that DBCC CHECKTABLE and its
related commands now use an internal snapshot of the database in order to
maximize concurrency (not block users). These commands can consume tremendous
server resources, though, and are preferably run outside of the active query
window.
A common trick in OLTP is to restore a backup of the OLTP database to a
secondary server, where time-consuming maintenance operations can be performed
without affecting users. This has the additional benefits of testing the backup,
providing a warm backup for disaster recovery purposes, and creating an
additional low-availability source for reporting. This trick has not been a
good option for data warehouses because of the disk resources that are required
to retain two copies of the database. Partial restores make this a far more
viable option.
Partial restores can be used to restore only the active part of the data
warehouse, which is usually a small fraction of the full size. Options are
available in the BACKUP and RESTORE commands to easily backup and restore only
filegroups that are in read/write mode:
FROM DISK =
'C:\Program Files\Microsoft SQL Server\MSSQL.1\MSSQL\Backup\Partial.bak'
WITH RECOVERY,
REPLACE, PARTIAL
At this point, DBCC CHECKTABLE can be used to check for consistency of tables
that reside entirely within the restored filegroups, such as system, dimension,
and summary tables. In SQL Server 2005 SP1, consistency checking
is unavailable for tables that span filegroups where one or more of the
filegroups were not recovered. This is the case with large fact tables, where
many partitions reside on read-only filegroups that were not recovered during
this partial restore. The ability to perform consistency checking of individual
partitions of an index contained within a single filegroup by using the DBCC
CHECKFILEGROUP command will be available in a future service pack. If
partitions are isolated on their own filegroup, this can be viewed as a way of
making the CHECKTABLE command partition-aware.
Index fragmentation
Index fragmentation will undoubtedly occur over a period of time as new
data is introduced into the data warehouse. This can be mitigated somewhat by
insert methods and how indexes are defined, but it is next to impossible to
entirely prevent. Creating a chronological clustered index is helpful, but
usually cannot entirely prevent fragmentation. For instance, dimension tables
often have one or more nonclustered indexes on a nonsequential key.
Late-arriving fact records are inserted nonsequentially. Also, updates to dimensions
rows will often inflate the row to where a page split occurs. Indexes should be
regularly monitored for fragmentation and re-organized when necessary.
A good way to deal with fragmentation is to prevent it from happening,
to the extent possible. For instance, when adding inferred or incomplete
dimension members, use column values that are indicative of the anticipated
eventual size of the column. This can also be mitigated with Type 2 slowly
changing dimensions, where changes in dimension columns result in a new row
insert.
Since fragmentation is not completely unavoidable, minimize the
maintenance tasks to defragment indexes. It is all too common to find that a
single maintenance plan is run, perhaps over the weekend, to rebuild every
index in every table in the database. Though this is thorough, it is rarely
necessary. Use the new DMVs in SQL Server 2005 to determine which indexes
in which partitions of which tables are fragmented enough to require an index
rebuild. As always, remember that data that does not change cannot become
fragmented. It is unnecessary to check for fragmentation on indexes on table
partition filegroups that are read-only, let alone rebuild them. Maintenance
overhead is be greatly minimized if you check only the active partitions. Also,
defragment indexes in order by most used. If all fragmented indexes cannot be
managed in a single batch window, at least the next day’s query load will
benefit from this strategy.
The previous means of determining index fragmentation (versions previous
to SQL Server 2005) was to use DBCC SHOWCONTIG. Though this command
is still operational in SQL Server 2005, use the new dynamic
management view (DMV), sys.dm_db_index_physical_stats, to assess index
fragmentation. This DMV is partition-aware and can be run against a partially
restored database as previously described. The avg_fragmentation_in_percent
column reports the level of fragmentation of a partition of a table index. The
following query can be modified but is a good start. It returns a list of all
partitions of all fragmented indexes (greater than 10%) in descending order by
usage and fragmentation percent:
-- List of fragmented indexes by usage
SELECT OBJECT_NAME(f.object_id) AS 'Table name', i.name AS 'Index',
f.index_type_desc, f.partition_number,
f.avg_fragmentation_in_percent,
ISNULL(user_seeks +
user_scans + user_lookups, 0) AS 'Usage count',
f.page_count
FROM sys.dm_db_index_physical_stats(DB_ID(),NULL,NULL,NULL,'LIMITED') f
JOIN sys.indexes i
ON i.object_id =
f.object_id AND i.index_id = f.index_id
JOIN sys.tables t
ON t.object_id =
f.object_id
LEFT OUTER JOIN
sys.dm_db_index_usage_stats s
ON s.database_id =
f.database_id
AND s.object_id =
f.object_id
AND s.index_id =
f.index_id
WHERE
f.avg_fragmentation_in_percent > 10.0
AND f.index_id > 0
AND f.page_count >
0
ORDER BY 'Usage
count' DESC, f.avg_fragmentation_in_percent DESC
Likewise, the DBCC INDEXDEFRAG and DBCC DBREBUILD commands are being
deprecated. Use the REORGANIZE and REBUILD options of the ALTER INDEX command,
respectively, to replace these commands. They are equivalent to their
predecessors.
In addition to checking for indexes that are fragmented, look into those
that are rarely (or never) used. They may not be worth the upkeep (such as
update overhead, space requirements, maintenance overhead) The Customer
Advisory Team has some great information posted on one of their blogs - “How can SQL
Server 2005 help me evaluate and manage indexes?”
(http://blogs.msdn.com/sqlcat/archive/2006/02/13/531339.aspx). The blog gives
additional insight into useful DMVs and sample queries for comparing unused (or
rarely used) indexes against their maintenance costs.
Update statistics
Index statistics are a critical input into the query optimizer. They are
used to determine a high-level representation of data, which allows the
optimizer to quickly determine the best indexes and access methods to use for
fast query execution. The size of a data warehouse database may seem to warrant
reconsideration of best practices regarding update statistics. Statistics are
table-specific, not partition-specific, so it may seem unreasonable to allow
SQL Server to update statistics whenever it deems necessary (after a
significant proportion of the data has been updated). We briefly review key
characteristics of data warehousing and SQL Server 2005 as they
pertain to maintaining statistics.
In SQL Server, the first query to execute following a significant number
of changes to one of the underlying tables can result in an update of the
statistics prior to query compilation. This can add a few seconds (or sometimes
minutes) to the execution time as the query will not run until the statistics
update is complete. There is a new database option in SQL Server 2005
called AUTO_UPDATE_STATISTICS_ASYNC. If this option is turned on, statistics
are updated in the background in the above scenario. The query that triggered
the update statistics will compile, or re-use an existing query plan, based on
the previous statistics. Only queries that run after the completion of the
statistics update will compile using the updated statistics information.
One way to detect if you need to turn on the async option is to monitor
the Recompile event in SQL Profiler. Under the EventSubClass you
can see the cause for recompile (2 = Statistics changed). If the
cause is outdated statistics, that means you often get recompiles due to stale
statistics.
Though it is important to be aware of this behavior, the default
database options (AUTO_UPDATE_STATISTICS ON, AUTO_UPDATE_STATISTICS_ASYNC OFF)
are usually appropriate in a data warehouse environment. Long-running queries
could benefit from the new statistics. A few seconds could result in an execution
plan that removes minutes or more from the query execution time. Manually
updating statistics requires diligence and a thorough understanding of when to
do so. If daily reports are run off of the data warehouse, it might be a good
idea to run them right after ETL processing if they don’t already. It is likely
that the reports will absorb the minimal impact of updating statistics and it
will also warm the cache. Otherwise, consider running some typical user queries
at the end of ETL to warm the buffer cache. Warming the buffer cache can have
the very positive affect of lowering the physical I/O incurred by first-run
queries. This also updates statistics.
Depending on the size of your incremental updates, you might observe that
statistics are not updated enough. The leftmost column on a statistics object
must have changed for 500 rows plus 20% of the rows that existed when the
statistics were last updated (for tables that had more than 500 rows when
statistics were updated). On a large fact table that had 1 billion rows
when statistics were last updated, the leftmost column had to have changed on
200,000,500 rows (including inserts). Since good query plans depend so
much on good statistics, it is a good idea to update statistics manually on
very large tables with relatively small incremental updates.
Now that we’ve discussed the various maintenance activities to perform,
let’s look at a high-level system for maintaining a moderate- to large-sized
data warehouse with some of the suggestions from the preceding sections of this
white paper. The scenario is intended to represent most maintenance challenges
resulting from complex business requirements.
Scenario
You are architecting the management strategy for a retail data
warehouse:
·The
database is estimated to initially be 7 terabytes (after three years of
historical data is loaded), with growth to 15 terabytes in the next
two years due to the opening of new stores and new, innovative marketing
strategies.
·Fact
tables are partitioned by month.
·Data that
is older than 7 fiscal years is to be archived (written to permanent media and
stored offsite).
·Analysis
of the source system shows that fact data is very rarely received for days
prior to six weeks ago.
·85% of
queries are expected to access data no older that 13 months ago (include
the full month corresponding to this year’s month).
·Data is
loaded on a daily basis.
·Data is
not available from the source until midnight EST or later. Some business users
are located in the UK, so service level agreements (SLAs) require that the data
be available by 3:00 a.m. EST.
·Due to the
short batch window, the business users have agreed that the system can be
unavailable on weekends (7:00 EST Friday to 3:00 a.m. EST Monday).
1.If the
database is in READ_ONLY mode, toggle it to READ_WRITE.
2.Load
updates to dimension tables.
3.Create a
new external table for the daily load and bulk insert data into the new table. Split
the partition function, add indexes and check constraints to the external
table, and switch into the partitioned table.
4.Load
updates prior to the previous day directly into the partitioned fact tables;
implement a mechanism to track which partitions are updated either in the
pipeline or through Service Broker (asynchronous).
5.Update
summary tables or cubes based on the partitions that are updated.
6.Back up
filegroups associated with the updated partitions.
1.After
database backups, perform a partial restore of the primary and read-write
filegroups on a secondary server.
1.Execute a
query on the restored backup server to establish fragmentation levels of the
read-write filegroups. Generate a script to reorganize or rebuild fragmented
indexes during the weekend.
1.Merge
daily partitions from the previous week into the current month partition.
2.Run DBCC CHECKFILEGROUP
on read-write filegroups[2].
3.Reorganize
or rebuild fragmented indexes using the information from Step 2 from Thursday
night. This is performed directly on the data warehouse and can be performed at
the same time as Step 2.
1.Move aged
partitions per the business requirements (older than 13 months).
2.Remove
partitions, per business requirements, using the sliding window method. Copy
filegroup backups to an archival media, as appropriate (this most likely should
be done when the filegroup is made read-only).
3.Finalize
the new partition minus 3, for example, when starting December, finalize
September:
a.Validate
that there is minimal index fragmentation from the previous Thursday
fragmentation check.
b.Update
statistics if necessary (if not auto update, or if many changes have been made
since last update statistics).
c.Confirm
that the filegroup is healthy by either re-running DBCC CHECKFILEGROUP or ensuring
that no updates have been made to the partition since the last run of this
command.
d.Make the
partition filegroup read-only.
e.Make a
final backup of the filegroup. Note that it is important to perform a backup of
the filegroup after making it read-only.
It may be impossible to determine an exact cutoff point for when data
will no longer be updated. If this is the case, you can implement ETL
functionality to make a read-only filegroup back into a read-write filegroup
when an update is identified that will reside in this filegroup. The management
schedule can then check when the filegroup (partition) has been stable (no
updates) for a number of days and go through the process of finalization again.
It is important to perform all final maintenance prior to re-flagging the
filegroup as read-only.
Filegroups cannot be changed from read-write to read-only, or vice
versa, unless no other users are in the database. Since this property should be
changed in a managed environment (ETL or other) this should not be an issue. If
user connections remain even after batch has started, they can be terminated:
USE master;
GO
-- Change the
database to single user mode - rollback any
-- transactions and
disconnect all users in the database
ALTER DATABASE
[DataWarehouse]
SET SINGLE_USER
WITH ROLLBACK
IMMEDIATE;
GO
-- Change the
filegroup to read-only
ALTER DATABASE
[DataWarehouse]
MODIFY FILEGROUP
[ReadOnly] READONLY;
GO
-- Change the
database back to multiuser mode
ALTER DATABASE
[DataWarehouse]
SET MULTI_USER;
GO
This is also required to make the database read-only. You will likely
want to communicate to your business users that any connections that remain at
the end of the day are likely to be terminated after a certain time.
Conclusion
There are several techniques that can be incorporated to minimize the
time and resources required for database maintenance of very large data
warehouse databases.
·Isolate
(partition) the data that actually changes and perform maintenance only on
those partitions as it is necessary.
·Instead of
a broadstroke approach to index re-organization, evaluate index fragmentation
and reorganize the most heavily used indexes that are highly fragmented, first.
·Occasionally
verify managed objects, such as indexes and summary tables, to validate their
continued use. Removing unused objects saves both management and hardware
resources.
·Build
intelligence into the ETL processing to keep track of the changed partitions in
order to minimize maintenance.
·Offload
management to another server, if possible.
·When in
doubt, test different data life cycle strategies, such as data inserts, to determine
which method best suits your environment based on business requirements.
Several features of SQL Server 2005, such as table
partitioning, dynamic management views, SSIS checkpoints, and online piecemeal
restores greatly enhance your ability to monitor and manage data, while keeping
it highly available to the applications and business users who rely on it.
References
Data warehousing concepts and dimensional modeling
·“The Data
Warehouse Lifecycle Toolkit” by Ralph Kimball, Laura Reeves, Margy Ross and
Warren Thornthwaite
·“The
Microsoft Data Warehouse Toolkit” by Joy Mundy and Warren Thornthwaite with
Ralph Kimball
Did this paper help you? Please give us your feedback. On a scale of 1
(poor) to 5 (excellent), how
would you rate this paper?!href(mailto:
sqlfback@microsoft.com?subject=Feedback: Developing Custom Components in SQL
Server Integration Services for Project REAL)
[1] Partial recovery of a database in Simple
recovery model restores the Primary and all read/write filegroups in the
initial step. If many of the filegroups are read-only, they can be
restored separately.