COMPARISON OF ANALYTICAL MPP DATABASES

Research article in the specialty "Computer and Information Sciences"

CC BY
Keywords
Massively parallel processing / analytical databases / database management system / online transaction processing / online analytical processing / JavaScript Object Notation / column-oriented

Abstract of the research article in computer and information sciences. Author: Kargassekov Y.M.

Parallel database systems are beginning to replace traditional mainframe computers because they allow much larger databases to be operated in a transactional manner. This article analyzes databases with a massively parallel processing (MPP) architecture.


Analytical databases with massively parallel processing (MPP) are databases optimized for analytical workloads: aggregation and processing of large data sets. MPP databases tend to be columnar: instead of storing each row of a table as an object (as transactional databases do), they typically store each column as an object. This layout lets them process complex analytical queries much faster and more efficiently.

These analytical databases distribute their datasets across many machines, or nodes, to process large amounts of data in parallel (hence the name). Each node has its own storage and computing capacity, which allows it to perform its part of a query.

The proliferation and declining cost of MPP analytical databases over the past decade have opened up a huge opportunity for data-driven organizations to store and analyze larger datasets than ever before. These databases are a valuable addition to the ever-growing set of tools for analysts, but they also add complexity to the architecture. [1]
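The two ideas described above, column-wise storage and per-node processing of a query, can be illustrated with a small sketch. It is a toy model under stated assumptions, not the implementation of any product discussed here: the sample sales rows, the round-robin placement, and all function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical sample data: (region, amount) sales rows.
ROWS = [
    ("north", 10.0),
    ("south", 5.0),
    ("north", 7.5),
    ("east", 2.5),
    ("south", 1.0),
]


def to_columns(rows):
    """Convert row-oriented tuples into one list per column."""
    regions, amounts = zip(*rows)
    return {"region": list(regions), "amount": list(amounts)}


def split_across_nodes(rows, node_count):
    """Assign rows to nodes round-robin; each node keeps only its own slice."""
    return [rows[i::node_count] for i in range(node_count)]


def local_sum_by_region(columns):
    """Work done on one node: a partial aggregate over its local columns."""
    partial = defaultdict(float)
    for region, amount in zip(columns["region"], columns["amount"]):
        partial[region] += amount
    return dict(partial)


def merge_partials(partials):
    """Work done by a coordinator: combine the per-node partial results."""
    total = defaultdict(float)
    for partial in partials:
        for region, subtotal in partial.items():
            total[region] += subtotal
    return dict(total)


def distributed_sum_by_region(rows, node_count=3):
    """Scatter rows to nodes, aggregate locally, then gather and merge."""
    chunks = split_across_nodes(rows, node_count)
    nodes = [to_columns(chunk) for chunk in chunks if chunk]  # skip empty nodes
    return merge_partials(local_sum_by_region(node) for node in nodes)
```

Calling `distributed_sum_by_region(ROWS, node_count=3)` gives the same totals per region as a single-machine aggregation would. In this toy model each node reads only the two columns the query needs and touches only its own slice of the data.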

Teradata is a scalable MPP database solution with extensive customization options. It has been on the market for more than 30 years and has an extremely mature feature set that meets the most stringent requirements of some of the largest organizations in the world. Its shared-nothing architecture makes Teradata a powerful database with linear scalability. While Teradata has traditionally been installed on premises on proprietary hardware, the company has recently started offering its database technology as a managed service in AWS or in its own private cloud. [2]
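As a rough illustration of why a shared-nothing design can scale close to linearly, the sketch below places each row on exactly one node by hashing a distribution key, so that nodes never share storage. The function names, the choice of SHA-256, and the sample rows are assumptions made for illustration; this is not Teradata's actual distribution algorithm.

```python
import hashlib


def node_for_key(key, node_count):
    """Map a distribution key to a node index using a stable hash."""
    digest = hashlib.sha256(str(key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % node_count


def distribute(rows, key_index, node_count):
    """Place every row on the single node that owns its key."""
    nodes = [[] for _ in range(node_count)]
    for row in rows:
        owner = node_for_key(row[key_index], node_count)
        nodes[owner].append(row)
    return nodes


# Hypothetical example: 100 order rows keyed by customer name.
orders = [(order_id, f"customer-{order_id % 7}") for order_id in range(100)]
placement = distribute(orders, key_index=1, node_count=4)
```

Because rows with the same key always land on the same node, a lookup or aggregation by that key can be answered by one node without consulting the others. In this simplified model, adding nodes spreads both the data and the work.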

Vertica is a high-concurrency MPP database platform that prioritizes high-performance, advanced analytical workflows for massive datasets. Like many self-managed MPP data platforms, the Vertica platform is mature and designed to fit into the enterprise data stack. Vertica's core strength lies in its new architecture, which provides high performance and efficiency for extremely large data sets. Vertica is particularly well suited to advanced analytics and data processing workflows because of its close integration with R and its ability to combine data with Hadoop. [3]

SingleStore is a proprietary in-memory relational DBMS that distributes databases across several nodes, supports ACID properties and the SQL language, and is notable for generating C++ code to execute SQL queries. It is positioned as a NewSQL system, combining the horizontal scalability of NoSQL systems with the properties and functions of classical relational DBMSs. It is written in C++ and runs on Linux on x86-64 platforms. The database is stored in the random-access memory of the nodes in non-blocking (lock-free) structures; both classic row storage and column storage are supported. MySQL syntax is implemented, and a JSON type and spatial types and operations are additionally supported. Write-ahead logging is supported, and replication is implemented by applying the log on the replica nodes. [4]
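The link between write-ahead logging and replication mentioned above can be shown with a minimal sketch. It is a toy key-value model with hypothetical class and method names, and it assumes that a replica simply replays the primary's log records it has not yet seen; it does not reflect SingleStore's internal log format or protocol.

```python
class Node:
    """A hypothetical in-memory key-value node with a write-ahead log."""

    def __init__(self):
        self.log = []   # append-only sequence of (key, value) records
        self.data = {}  # in-memory state rebuilt from the log

    def _apply(self, record):
        """Append the record to the log first, then update in-memory state."""
        key, value = record
        self.log.append(record)
        self.data[key] = value

    def write(self, key, value):
        """Handle a client write on the primary node."""
        self._apply((key, value))

    def catch_up(self, primary_log):
        """Replay the records of the primary's log that this node lacks."""
        missing = primary_log[len(self.log):]
        for record in missing:
            self._apply(record)
```

For example, after `primary.write("a", 1)` followed by `replica.catch_up(primary.log)`, the replica holds the same data as the primary. The replica's position in the log is simply the number of records it has already applied.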

BigQuery is a data warehouse that uses the large-scale architecture of Google Cloud to distribute data across thousands of nodes, using as many nodes as necessary to execute any query efficiently. Unlike other database systems, where you buy or rent individual machines or space on machines, there is only one BigQuery instance, consisting of thousands of nodes shared by all of its users. This scale allows BigQuery to execute even very large and complex queries in a relatively short time. As a result, even if your datasets grow from gigabytes to petabytes, BigQuery will remain responsive. [5]

© Y.M. Kargassekov, 2021.

Scientific supervisor: Sapakova Saya Zamanbekovna - Candidate of Physical and Mathematical Sciences, Assistant Professor, International Information Technology University, Kazakhstan.

Table 1

Comparative analysis of databases

Criterion | Teradata | Vertica | SingleStore | BigQuery
Hardware | Custom MPP, shared nothing | Custom hybrid MPP, shared everything | Distributed, shared-nothing architecture on commodity hardware | REST web service, IaaS with MapReduce
Type of processing | OLTP or OLAP; can handle high user load | OLAP optimized for large fact tables | Hybrid OLTP and OLAP | Google's Cloud Storage Platform
Inception | 1979, Caltech | 2005, MIT | 2013, SingleStore startup | 2011, Google
Performance and maintenance | Auto-recommended optimization, columnar compression | Column-oriented optimization for ingestion, storage, compression, and access | Acts as both a transactional (rowstore) database and an analytical (columnstore) data warehouse | System itself chooses the best way to store data
Distribution type | Proprietary | Commodity | Proprietary | Web service

Our analysis shows that many vendors now offer data storage solutions, and each has its own advantages and disadvantages. For example, Vertica does not support stored procedures, functions, or multiple languages. Teradata is sold as a pre-configured hardware and software appliance and is comparable in cost to Oracle Exadata, i.e., it is in the upper price range. Teradata's high price is due to the fact that the company does not standardize its solution and instead tailors it to each new customer. SingleStore is a new technology used for specific tasks, so it will not suit everyone. BigQuery is a cloud-based solution: small amounts of data can fit within its free tier, but if you store terabytes and query them often, the costs will be high. Therefore, before choosing a database, you should carefully weigh your workload, data volume, and budget against these trade-offs.

References

1. Fajar Ciputra Daeng Bani, Suharjito, Diana, Abba Suganda Girsang. Implementation of Database Massively Parallel Processing System to Build Scalability on Process Data Warehouse // Procedia Computer Science, Volume 135, 2018, Pages 68-79.

2. McKenna Brian. Teradata Universe 2016: MPP architecture recast as 'Intelliflex' // Computer Weekly, March 30, 2017.

3. Jean Alexandre. Vertica Announces Community Edition Version of Vertica Analytic Database // www.vertica.com, October 2011.

4. Frenkiel Eric. Scales in-memory database across hundreds of nodes, thousands of cores // SingleStore Blog, April 2013.

5. Iain Thomson. Google opens BigQuery for cloud analytics: Dangles free trial to lure doubters // www.theregister.com, November 2011.

KARGASSEKOV YERNUR MURATOVICH - master's student, International Information Technology University, Kazakhstan.
