…existing operational database(s), cleansing or scrubbing the data, denormalizing the data, and then loading the data into the data warehouse database. (This data population process is also known as the data transformation process.) This database then becomes the place for top executives, managers, analysts, and other end users to mine a rich source of company information. They can ask compelling business questions and find answers in their data, enabling them to make key and timely business decisions from their desktops using GUI On-line Analytical Processing (OLAP) tools.

Attributes Of A Data Warehouse

According to W.H.
Inmon, who is considered the father of data warehousing, "A Data Warehouse is a subject-oriented, integrated, time-variant, nonvolatile collection of data in support of management's decision-making process." These fundamental attributes of a data warehouse are explained further below.

Subject Oriented

Operational data, such as order processing and manufacturing databases, is organized around business activities or functional areas. It is typically optimized to serve a single, static application. The functional separation of applications causes companies to store identical information in multiple locations, and the format and currency of the duplicated information are usually inconsistent.
For example, in a delivery database, the customer list will have very detailed information on customer addresses and is typically indexed by customer number concatenated with a zip code. The same customer list in the invoicing system will contain a potentially different billing address and be indexed by an accounting Customer Account Number. In both instances the customer name is the same, but is identified and stored differently. Deriving any correlation between data extracted from those two databases presents a challenge. In contrast, a data warehouse is organized around subjects.
Subject orientation presents the data in a format that is consistent and much clearer for end users to understand. For example, subjects could be Product, Customer, or Order, as opposed to applications such as Purchasing or Payroll.

Integrated

Integration of data within a warehouse is accomplished by dictating consistency in format, naming, and so on. Operational databases, for historical reasons, often have major inconsistencies in data representation. For example, one set of operational databases may represent male and female by m and f, another by 1 and 2, and a third by x and y. Frequently the inconsistencies are more complex and subtle.
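The integration step above can be sketched in a few lines of code. This is a minimal, illustrative example, not any particular product's transformation engine; the source-system names and code mappings are assumptions made up for the sketch.

```python
# A minimal sketch of integration-time cleansing: three hypothetical source
# systems encode gender inconsistently ("m"/"f", "1"/"2", "x"/"y"); the
# transformation step maps them all to one warehouse convention.

GENDER_MAP = {
    "orders_db":  {"m": "M", "f": "F"},
    "payroll_db": {"1": "M", "2": "F"},
    "legacy_db":  {"x": "M", "y": "F"},
}

def normalize_gender(source: str, raw_code: str) -> str:
    """Translate a source-specific gender code to the warehouse standard."""
    return GENDER_MAP[source].get(raw_code.lower(), "UNKNOWN")

print(normalize_gender("orders_db", "f"))   # F
print(normalize_gender("payroll_db", "1"))  # M
print(normalize_gender("legacy_db", "z"))   # UNKNOWN (unmapped code)
```

In a real warehouse load, every duplicated attribute (codes, units, date formats, customer identifiers) would go through a mapping of this kind before the data is written to the warehouse.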
By definition, data in a data warehouse is always maintained in a consistent fashion.

Time Variant

Data warehouses are time variant in the sense that they maintain both historical and (nearly) current data. Operational databases, in contrast, contain only the most current, up-to-date data values, and they generally maintain this information for no more than a year (often much less). By comparison, data warehouses are generally loaded from the operational databases daily, weekly, or monthly, and the data is then typically maintained for a period of 3 to 5 years.
This aspect marks a major difference between the two types of environments. Historical information is highly important to decision-makers, who often want to understand trends and relationships in the data. For example, the product manager for a soft drink maker may want to see the relationship between coupon promotions and sales. This type of information is typically impossible to determine with an operational database that contains only current data.

Nonvolatile

Nonvolatility, another primary aspect of data warehouses, means that after the informational data is loaded into the warehouse, changes, inserts, or deletes are rarely performed.
The loaded data is transformed data that originated in the operational databases. The data warehouse is subsequently reloaded or, more likely, appended on a periodic basis with new, transformed, or summarized data from the operational databases. Apart from this loading process, the information contained in the data warehouse generally remains static. The property of nonvolatility permits a data warehouse to be heavily optimized for query processing.

Built From Scratch

Because each company has its own business needs and business queries, a data warehouse database is normally built from scratch using the available data warehousing enabling tools. Determining what kinds of questions or queries end users need is the first, though time-consuming, step.
Data modeling for such a customized data warehouse database can then be developed. Identifying what data is needed from the operational database(s) and then populating the data warehouse are the subsequent steps. The entire process can be repeated as additional refinement is needed over time. From the attributes described above, it is apparent that the purpose and usage of an operational database and a data warehouse differ greatly. The chart below summarizes these differences:

Category     Operational Database                               Data Warehouse
Function     Data processing, support of business operations    Decision support
Data         Process oriented, current values, highly detailed  Subject oriented, current and historical values, summarized and sometimes detailed
Usage        Structured, repetitive                             Ad hoc, some repetitive reports and structured applications
Processing   Data entry, batch, OLTP                            End-user initiated queries

Figure 1: Operational Databases vs. Data Warehouses

Deviation from the Traditional Data Warehouse Attributes

As data warehousing becomes a mainstream technology, some of the traditional attributes are being deviated from in order to meet users' increasing demands.
The most noticeable ones are time variant, nonvolatile, and built from scratch.

Deviation from time variant & nonvolatile

As data warehouses grow larger and larger (e.g., into terabytes), reloading or appending data can become very tedious and time consuming. Furthermore, users are demanding that more up-to-date data be included in the data warehouse. Instead of adhering to the traditional attributes of keeping the data nonvolatile and time variant, new data is being added to the data warehouse on a daily, if not real-time, basis. Thus, new approaches are emerging to tackle this task. Two possible methods are:

1. Perform hourly/daily batch updates from shadowed log files. Transformation rules are executed in this process, so when the data reaches the target data warehouse database, it is already transformed and summarized.

2. Perform real-time updates from shadowed log files. Again, transformation rules are executed in this process. Instead of batch updates, loading takes place on a per-transaction basis for transactions that meet certain business selection criteria.

Deviation from built from scratch

For customers in a horizontal industry, meaning their applications are unique to their own businesses, it is essential to build a data warehouse from scratch.
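The first method can be sketched as a small batch job. This is a hedged illustration only: the log-record layout, the transformation rule, and the selection criterion are assumptions invented for the example, not the format of any real log shadowing product.

```python
# Sketch of an hourly/daily batch update: read shadowed log records, apply
# transformation rules, and append only rows that meet a business selection
# criterion to the warehouse. Record fields are illustrative assumptions.

def transform(record):
    # Example rules: standardize the region code and compute a derived total.
    return {
        "region": record["region"].upper(),
        "total": record["qty"] * record["unit_price"],
    }

def batch_update(log_records, warehouse, min_total=0.0):
    """Apply transformation rules; load only rows meeting the criterion."""
    for rec in log_records:
        row = transform(rec)
        if row["total"] >= min_total:  # business selection criterion
            warehouse.append(row)
    return warehouse

log = [
    {"region": "ca", "qty": 3, "unit_price": 2.5},
    {"region": "ny", "qty": 1, "unit_price": 0.0},
]
print(batch_update(log, [], min_total=1.0))
# [{'region': 'CA', 'total': 7.5}]
```

The real-time variant would run the same `transform` and selection logic per transaction as log records arrive, rather than over an accumulated batch.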
However, for customers in a vertical industry, meaning their applications either come from the same vendor or the functionality of applications from various vendors is similar in nature, it is possible to leverage an off-the-shelf, pre-packaged MART. The MART is a data-modeling template designed with a certain set of queries in mind for that specific vertical industry. Instead of designing data models from scratch, leveraging these MARTs can reduce development time and cost. According to a prediction by Frederick Rook, a Senior VP of Platinum Technology Inc., approximately 80% of the data warehouses or data marts for the vertical industries will be pre-packaged in the next two years or so. This approach definitely deviates from the traditional one.
What Is a Star/Snowflake Schema?

As mentioned earlier, the data warehouse database adopts a star or snowflake schema to maximize performance. A star or snowflake schema design is very different from an operational database schema design. In an operational database, the data is highly normalized to support consistent updates and to maintain referential integrity. In a data warehouse, the data is highly denormalized to provide fast access without having to perform a large number of joins. A star or snowflake schema represents the data as a multidimensional array in which each dimension is a subject around which analysis is performed.
As the name implies, the star schema is a modeling paradigm that has a single object in the middle radially connected to other surrounding objects like a star. The star schema mirrors the end user’s view of a business query such as a sales fact that is qualified by one or more dimensions (e.g., product, store, time, region, etc.). The object in the center of the star is called the fact table. This fact table contains the basic business measurements and can consist of millions of rows. The objects surrounding the fact table (which appear as the points of the star) are called the dimension tables.
These dimension tables contain business attributes that can be used as SQL search criteria, and they are relatively small. The star schema itself can be simple or complex. A simple star schema consists of one fact table and several dimension tables; a complex star schema can have more than one fact table and hundreds of dimension tables. Figure 2 depicts a simple star schema.

[Figure 2: Star Schema]

The snowflake schema is an extension of the star schema in which each point of the star explodes into more points.
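A simple star schema of this kind can be sketched concretely in SQL. The example below uses Python's built-in `sqlite3` module; the table names, columns, and data (a sales fact with product and store dimensions) are illustrative assumptions, not a schema from the text.

```python
# A minimal star schema in SQLite: one fact table surrounded by two
# dimension tables, plus a typical "star join" query against it.
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.executescript("""
    CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE dim_store   (store_id   INTEGER PRIMARY KEY, state TEXT);
    -- The fact table holds the business measurements, keyed by dimensions.
    CREATE TABLE fact_sales (
        product_id INTEGER REFERENCES dim_product,
        store_id   INTEGER REFERENCES dim_store,
        amount     REAL
    );
    INSERT INTO dim_product VALUES (1, 'Cola'), (2, 'Lemonade');
    INSERT INTO dim_store   VALUES (10, 'CA'), (20, 'NY');
    INSERT INTO fact_sales  VALUES (1, 10, 5.0), (1, 20, 3.0), (2, 10, 2.0);
""")
# A typical star join: qualify the fact by two dimensions and aggregate.
cur.execute("""
    SELECT p.name, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_product p ON f.product_id = p.product_id
    JOIN dim_store   s ON f.store_id   = s.store_id
    WHERE s.state = 'CA'
    GROUP BY p.name
    ORDER BY p.name
""")
print(cur.fetchall())  # [('Cola', 5.0), ('Lemonade', 2.0)]
```

Note how the query mirrors the end user's view described above: a sales fact qualified by product and store dimensions.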
In this schema, the dimension tables are more normalized. The snowflake schema can improve query performance by minimizing the disk storage needed for the data and by joining smaller normalized tables rather than large denormalized ones. It also increases the flexibility of the application, because the normalization lowers the granularity of the dimensions. However, since the snowflake schema has more tables, it also increases the complexity of some of the queries that must be mapped onto it. Figure 3 below depicts a snowflake schema.
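Snowflaking a dimension can be illustrated by normalizing the product dimension from the earlier star example into separate product and category tables. Again, the table names and data are assumptions made up for the sketch.

```python
# Sketch of a snowflaked dimension: the product dimension is normalized
# into product and category tables, so reaching a product's category
# requires one extra join compared with a denormalized star dimension.
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.executescript("""
    CREATE TABLE dim_category (category_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE dim_product  (product_id  INTEGER PRIMARY KEY,
                               name        TEXT,
                               category_id INTEGER REFERENCES dim_category);
    INSERT INTO dim_category VALUES (1, 'Beverage');
    INSERT INTO dim_product  VALUES (100, 'Cola', 1);
""")
# The extra join introduced by the snowflake:
cur.execute("""
    SELECT c.name
    FROM dim_product p
    JOIN dim_category c ON p.category_id = c.category_id
    WHERE p.name = 'Cola'
""")
print(cur.fetchone()[0])  # Beverage
```

The trade-off described above is visible here: the category name is stored once (less storage, smaller tables), at the cost of an additional join in queries that need it.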
Query Optimization

Performance in data retrieval can be greatly enhanced through the use of multidimensional and aggregation indexes in a star or snowflake environment. Over 90% of data warehousing queries are multidimensional in nature, using multiple criteria against multiple columns. End users rarely want to access data by only one column or dimension, such as finding the number of customers in the state of CA. More commonly, they ask complex questions such as how many customers in the state of CA purchased products B and C in the last year, and how does that compare to the year before. To optimize such queries, an index can be placed on each column of the dimension tables that end users want to query. When an end user issues a query, a qualifying count based on index access alone can be returned without touching the actual data. According to Bill Inmon, it is much more efficient to service a query request by simply looking in an index or indexes instead of going to the primary source of data.
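The indexing idea can be sketched as follows. This is a simplified illustration: the customer table, its columns, and the data are assumptions, and real warehouse optimizers decide for themselves when an index-only qualifying count is possible.

```python
# Sketch of indexing dimension attributes so a multidimensional qualifying
# count can be answered largely from the indexes rather than the base data.
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.executescript("""
    CREATE TABLE dim_customer (customer_id INTEGER PRIMARY KEY,
                               state TEXT, product TEXT, year INTEGER);
    -- One index per queried dimension column, as described above.
    CREATE INDEX idx_state   ON dim_customer(state);
    CREATE INDEX idx_product ON dim_customer(product);
    INSERT INTO dim_customer VALUES
        (1, 'CA', 'B', 2023), (2, 'CA', 'C', 2023),
        (3, 'NY', 'B', 2023), (4, 'CA', 'B', 2022);
""")
# A multidimensional qualifying count: three criteria on three columns.
cur.execute("""
    SELECT COUNT(*) FROM dim_customer
    WHERE state = 'CA' AND product = 'B' AND year = 2023
""")
print(cur.fetchone()[0])  # 1
```

The count query touches only small dimension attributes, which is what makes index-based qualification so much cheaper than scanning a multimillion-row fact table.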
In addition to multidimensional queries, end users often want to see the data aggregated. A data aggregation is usually a COUNT or SUM, but can also be an AVERAGE, MINIMUM, or MAXIMUM, such as the number of customers, total sales dollars, or average quantity. An aggregation is typically organized or grouped by another column, such as the sum of sales by region, or the average quantity of product line B sold by sales rep. By placing an index on aggregated values, performance can be enhanced.

Summary

Data is the building block for useful information.
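One common way to exploit this is to precompute the aggregation into an indexed summary table, so the grouped totals are served without scanning the detail rows. The sketch below assumes a sales-by-region example; the table names and figures are illustrative.

```python
# Sketch of a precomputed, indexed aggregation: total sales by region is
# materialized into a summary table so queries avoid the detail rows.
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.executescript("""
    CREATE TABLE fact_sales (region TEXT, sales REAL);
    INSERT INTO fact_sales VALUES
        ('WEST', 100.0), ('WEST', 50.0), ('EAST', 75.0);
    -- Materialize the aggregation once, then index it for fast lookup.
    CREATE TABLE summary_sales AS
        SELECT region, SUM(sales) AS total
        FROM fact_sales
        GROUP BY region;
    CREATE INDEX idx_summary_region ON summary_sales(region);
""")
# The aggregate query now reads the small summary table via its index.
cur.execute("SELECT total FROM summary_sales WHERE region = 'WEST'")
print(cur.fetchone()[0])  # 150.0
```

Because the warehouse is nonvolatile between loads, such summary tables only need to be refreshed as part of the periodic load, not on every transaction.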
With access to accurate and timely information, appropriate business decisions can be made to maximize profit and gain a competitive advantage. Most companies today have no shortage of data; however, the data exists in a form that is difficult for humans to access or interpret. The challenge lies in transforming the data into useful information. Data warehousing technology makes this possible.