Data Warehouse/Analytic Appliances – What to Consider

Why was Teradata able to become the leader in data warehousing at the very high end (e.g. greater than 25 TB)? Why was Netezza able to become only the second pure-play data warehousing company to go public by focusing on opportunities in the 10 to 25 TB range? Why did Oracle, after so many years of denial, finally announce a joint hardware/software product for data warehousing with HP, the Exadata data warehouse server? Why did Microsoft acquire DATAllegro, one of the earlier data warehouse appliance vendors? Why are there now dozens of data warehouse appliances on the market, and, more importantly, how should a customer choose which one to purchase?

In all these cases, the vendors have listened to the market and concluded that the best way to serve the customer is through a true data warehouse appliance. Given that there are so many flavors of appliances, though, here are some things to consider when making a purchase:

1) Why is an appliance better than an on-demand offering or open-source download? For most experienced data warehouse shops, the prevailing favorite is the appliance model. It offers ease and speed of deployment and total cost of ownership benefits without sacrifices around security or integration. While on-demand can seem easier and less expensive in the beginning, there continue to be issues with security, integration and bandwidth that keep most shops from going down this route. Open source software can also seem less expensive up front, but once annual support fees and the cost of the hardware and IT resources needed to pull a complete solution together are added in, the total far outweighs the initial savings on software licenses.

2) Why should someone choose a “true” appliance vs. a “bundle”? The difference between a “true” appliance, with the hardware, software and storage all pre-configured and optimized, and a “bundle” that is more of an open menu of components should be clear: it is simply far more expensive to deal with components that haven’t really been architected to work together. A true appliance can also offer features that a bundle can’t. For example, Kickfire has an Active System Monitor that monitors all levels of the appliance (software, hardware, storage, and OS) and proactively alerts the administrator via email to potential issues. Furthermore, while vendors of both approaches will usually offer a single point of support initially, only a true appliance vendor is really set up to solve support issues without pointing fingers at component suppliers.

3) When is a column-store architecture better than a row-store one? A column store that has solved the incremental-update and user-concurrency issues inherent in most column-store architectures is always better than a row store. Some more modern vendors have solved these inherent bottlenecks with row caches and some with other approaches (e.g. Kickfire leverages its parallel-processing SQL chip to get high-speed incremental loads and high user concurrency). Additionally, it is important to make sure the column store can be accessed through standard third-party tools and is not going to require significant training. This is why Kickfire chose to implement its column store as a MySQL pluggable storage engine: to users and third-party tools, Kickfire’s column store looks just like standard MySQL.

4) Why is TPC-H important? TPC-H is a very rigorous benchmark with an independent audit process and governance council to ensure transparency and adherence to the rules. As such, any vendor who can even run the benchmark, let alone hold one of the world records, should be considered for a short list. Of course, there is no substitute for proven, production customer references, but TPC-H should not be underestimated as a tough, mixed-workload data warehouse test.

5) Why is price/performance more important than simply pricing by the terabyte? Most of the larger vendors price by the terabyte in order to seem less expensive. The problem is that a customer needs to buy a lot of terabytes in order to get volume discounts. According to IDC, most data warehouse implementations are actually below 5 TB in size, which is far below the number of terabytes required to get low per-terabyte pricing. More importantly, pricing by terabyte tells the customer nothing about performance against those terabytes. This is another reason why the TPC-H benchmark is so important: it is a standardized way for vendors to publish both price and price-performance.
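
To make the price-per-terabyte vs. price/performance point concrete, here is a minimal sketch in Python. The two vendors and all of their numbers are hypothetical and chosen only for illustration; the one real-world element is that TPC-H reports throughput as a composite queries-per-hour figure (QphH) and price/performance as dollars per QphH.

```python
from dataclasses import dataclass


@dataclass
class Quote:
    """A vendor quote. All values used below are made-up illustration data."""
    name: str
    system_price_usd: float  # total price of the configured system
    capacity_tb: float       # usable capacity in terabytes
    qphh: float              # TPC-H composite queries-per-hour result

    def price_per_tb(self) -> float:
        return self.system_price_usd / self.capacity_tb

    def price_per_qphh(self) -> float:
        return self.system_price_usd / self.qphh


# Hypothetical quotes: A is a large system, B is a small but fast one.
quotes = [
    Quote("Vendor A (hypothetical)", 500_000, 10, 20_000),
    Quote("Vendor B (hypothetical)", 300_000, 3, 40_000),
]

# Rank the same quotes by the two different yardsticks.
cheapest_per_tb = min(quotes, key=Quote.price_per_tb)
best_price_performance = min(quotes, key=Quote.price_per_qphh)

for q in quotes:
    print(f"{q.name}: ${q.price_per_tb():,.0f}/TB, ${q.price_per_qphh():,.2f}/QphH")
print("Cheapest per TB:", cheapest_per_tb.name)
print("Best price/performance:", best_price_performance.name)
```

With these made-up figures, the vendor that looks cheapest per terabyte is not the one with the best price/performance, which is exactly why the two measures should not be confused.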

With these considerations in mind, customers can decide for themselves or consult industry analysts such as IDC to determine which vendors meet their criteria and then make a specific vendor selection or run an evaluation process.
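
For readers less familiar with the column-store idea from point 3, the following Python sketch shows the basic layout difference. It is a toy illustration only, not how Kickfire or any other product is implemented; the table, field names and values are invented. The "fields read" counter stands in for I/O: an analytic query that needs one column has to walk every field of every row in a row layout, but only the one column in a columnar layout.

```python
# The same three-row fact table in two layouts (hypothetical data).
row_store = [
    {"order_id": 1, "region": "EMEA", "amount": 120.0},
    {"order_id": 2, "region": "APAC", "amount": 80.0},
    {"order_id": 3, "region": "EMEA", "amount": 200.0},
]

column_store = {
    "order_id": [1, 2, 3],
    "region": ["EMEA", "APAC", "EMEA"],
    "amount": [120.0, 80.0, 200.0],
}


def total_amount_row_store(table):
    """Sum 'amount' from a row layout; whole rows are fetched to get one field."""
    total = 0.0
    fields_read = 0
    for row in table:
        fields_read += len(row)  # every field of the row is touched
        total += row["amount"]
    return total, fields_read


def total_amount_column_store(table):
    """Sum 'amount' from a column layout; only that column is touched."""
    values = table["amount"]
    return sum(values), len(values)


row_total, row_fields = total_amount_row_store(row_store)
col_total, col_fields = total_amount_column_store(column_store)

print("row store:   ", row_total, "fields read:", row_fields)
print("column store:", col_total, "fields read:", col_fields)
```

The flip side, and the reason incremental updates are called out above as a traditional column-store weakness, is that inserting a single new order means appending to every column list rather than writing one row.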
