Handling Big Dimensions in Distributed Data Warehouses using the DWS Technique Semina
11-22-2010, 12:52 AM
Post: #1
ABSTRACT
The DWS (Data Warehouse Striping) technique distributes large data warehouses across a cluster of computers. Its data partitioning approach partitions the fact tables across all nodes and replicates the dimension tables. Replicating the dimension tables limits the applicability of the DWS technique when the data warehouse has big dimensions. This paper proposes a strategy to handle large dimensions in a distributed DWS system and evaluates the proposed strategy experimentally. With the proposed strategy, the speed up and scale up obtained with the DWS technique are not affected by the presence of big dimensions. Furthermore, the strategy extends the scope of the technique to queries that browse big dimensions, which can also benefit from the performance increase of the DWS technique.

Keywords: Data warehousing, distributed query execution.

1. INTRODUCTION
Data warehousing applications typically involve massive amounts of data that push database management technology to the limit. A scalable architecture is crucial, not only to handle very large amounts of data but also to assure interactive response times to OLAP (On-Line Analytical Processing) users. In fact, the decision making process using OLAP is often based on a sequence of interactive queries. That is, the answer to one query immediately sets the need for a second query, the answer to this second query raises another query, and so on in an ad hoc manner. To assure response times that allow this interactive OLAP querying style, even when the data warehouse becomes extremely large, data warehouse implementations normally use very expensive platforms, typically based on high-end servers or high-performance clusters.
The use of classical parallel processing techniques proposed for relational database systems is also common in big data warehouses. Two types of parallelism can be exploited at the query level: inter-query parallelism, wherein multiple transactions are executed in parallel in a multiprocessor environment, and intra-query parallelism, where several processors cooperate to concurrently execute a single SQL statement. The latter is particularly interesting for the complex queries executed in a data warehouse, as the parallelism is used to improve performance through parallel implementation of the various operations of the query execution plan. However, the use of parallelism in complex data warehouse queries is clearly more difficult and less effective than the parallel execution of the multiple small transactions that characterize typical database applications in an on-line transaction processing (OLTP) environment.

Another possibility for high volumes of data is to distribute the data across multiple data warehouses in such a way that each individual data warehouse cooperates to provide the user with a single, global view of the data. In spite of the potential advantages of distributed data warehouses, especially when the organization has a clearly distributed nature, these systems are always very complex and difficult to manage. Furthermore, the performance of many distributed queries is normally poor, mainly due to load balancing problems and the volume of data exchanged between servers.

The data warehouse striping (DWS) approach aims to provide a cost-effective alternative to the very expensive servers typically used in large data warehouses by implementing a data warehouse over an arbitrary number of inexpensive computers (typically cheap workstations, server blades, or standard PCs) and, at the same time, integrating this approach with the data warehousing technology available in the market.
That is, DWS can be used with the database management systems (DBMS) available today (without changes), including small and cheap ones such as open source DBMS. The DWS approach is based on the combination of two simple ideas: 1) uniform data striping to partition the data warehouse facts over an arbitrary number of computers, in such a way that queries can be executed in a truly parallel fashion (a query is actually split into many partial queries), and 2) an approximate query answering (AQA) strategy to deal with the momentary unavailability of one or more computers in the cluster. The experimental evaluation of the DWS technique shows that this approach assures nearly optimal speed up and scale up, and that the momentary unavailability of one of the computers (which is plausible, as a DWS system may consist of a large number of small computers) does not force the system to stop, as the answers can be approximated with a small error using the data in the remaining computers of the DWS system.

Recently, the company Critical Software, SA developed a middle-layer implementation of the DWS technique targeted at several commercial DBMS and OLAP tools, allowing the transparent use of the DWS technique with the data warehousing technology available in the market (i.e., no changes are required to either the DBMS or the OLAP tools).

However, the DWS technique has an important limitation: it was specifically designed for typical data warehouses organized in an ideal star schema consisting of a large fact table surrounded by a set of small dimension tables, as proposed by Kimball. In a DWS system the fact rows are uniformly distributed across all the available machines, while the dimensions (supposedly small in size) are replicated on all the computers in the DWS system.
This means that DWS is not effective (or cannot be used at all) in data warehouses with big dimensions, which is an important limitation, as a significant number of businesses have big dimensions as part of their business model. This paper proposes a new approach called selective loading to deal with data warehouses with big dimensions in DWS systems and evaluates the proposal using the TPC-H performance benchmark, whose data schema is quite far from a typical star schema and includes big dimensions.

The paper is organized as follows. The next section summarizes the key aspects of the DWS technique. Section 3 presents the selective loading proposal for handling big dimensions. Section 4 presents the experimental evaluation using the TPC-H schema, and finally Section 5 concludes the paper.
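To make the basic DWS scheme from the introduction more concrete, here is a minimal Python sketch of fact striping, replicated dimensions, partial queries, and an approximate answer when a node is unavailable. It is only an illustration under stated assumptions, not the paper's implementation: the function names, the round-robin placement, the toy product dimension, and the simple scaling estimator (multiply the partial total by total nodes over available nodes) are all assumptions made here. The selective loading strategy itself is not shown, since the text above does not describe its details.

```python
from collections import defaultdict


def stripe_facts(fact_rows, num_nodes):
    """Assumed placement rule: fact row i is stored on node i % num_nodes."""
    nodes = [[] for _ in range(num_nodes)]
    for i, row in enumerate(fact_rows):
        nodes[i % num_nodes].append(row)
    return nodes


def partial_sum_by_category(node_facts, dimension):
    """Partial query on one node: join local facts with the replicated
    dimension (a full copy exists on every node) and SUM per category."""
    totals = defaultdict(float)
    for product_id, amount in node_facts:
        totals[dimension[product_id]] += amount
    return dict(totals)


def merge_partials(partials, num_nodes):
    """Merge partial results. If fewer than num_nodes partials arrived,
    scale the sums by num_nodes / len(partials) as a rough estimate
    (an assumed estimator, used here only for illustration)."""
    if not partials:
        raise ValueError("no node returned a partial result")
    scale = num_nodes / len(partials)
    merged = defaultdict(float)
    for partial in partials:
        for category, value in partial.items():
            merged[category] += value
    return {category: value * scale for category, value in merged.items()}


def rows_per_node(num_fact_rows, num_dim_rows, num_nodes):
    """Rows stored on each node: a 1/N share of the facts plus a full
    copy of the dimension. Adding nodes does not shrink the second term."""
    return num_fact_rows / num_nodes + num_dim_rows


# Hypothetical toy data: product_id -> category, and (product_id, amount) facts.
dimension = {1: "books", 2: "games"}
facts = [(1, 10.0), (2, 20.0), (1, 30.0), (2, 40.0),
         (1, 50.0), (2, 60.0), (1, 70.0), (2, 80.0)]

nodes = stripe_facts(facts, 3)
partials = [partial_sum_by_category(n, dimension) for n in nodes]

exact = merge_partials(partials, 3)        # all three nodes answered
approx = merge_partials(partials[:2], 3)   # third node unavailable
print(exact, approx)
print(rows_per_node(1_000_000, 500_000, 10))
```

The last function shows why big dimensions are a problem for plain DWS: the fact share per node shrinks as nodes are added, but the replicated dimension does not, so a large dimension can dominate both the storage and the local join work on every node.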