Big Data is both a very popular term and a major development trend in the information technology industry, and it now appears across many different industries. One topic that therefore attracts great interest is Big Data components and architecture.
This article introduces Big Data components and architecture as a reference for this technology.
Big Data architecture
Big Data architecture is built from a set of Big Data components that together help develop a reliable, scalable, and automated data processing flow.
Big Data components of the system
Building a hardware cluster is a complex task: the design is usually done after the problem requirements have been determined, and those requirements are often unclear at first. Most vendors publish specific guidelines for choosing the most appropriate hardware.
A typical recommended cluster node has 2 CPUs with 4 to 8 cores each, between 48 GB and 512 GB of RAM for temporary data, and at least 6 to 12 hard drives for data storage.
When ingesting data from a structured source such as an RDBMS, Apache Sqoop is often chosen. Sqoop handles transfers from an RDBMS into Hadoop well, from partial (incremental) imports to full imports.
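As a sketch of both styles of import, the commands below use hypothetical connection details (host, database, table, and paths are placeholders to adjust for a real cluster):

```shell
# Full import of one table from MySQL into HDFS.
sqoop import \
  --connect jdbc:mysql://db.example.com/sales \
  --username etl_user -P \
  --table orders \
  --target-dir /data/raw/orders \
  --num-mappers 4

# Partial (incremental) import: only rows whose id exceeds the last value seen.
sqoop import \
  --connect jdbc:mysql://db.example.com/sales \
  --username etl_user -P \
  --table orders \
  --target-dir /data/raw/orders \
  --incremental append \
  --check-column id \
  --last-value 100000
```

Running an incremental import as a saved Sqoop job lets Sqoop track the last imported value automatically between runs.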
For event-based data ingestion, Apache Flume is a good tool: its agents move event data from one system to another, with destinations such as HDFS, Spark, or HBase.
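A Flume agent is wired together in a properties file as a source, a channel, and a sink. The fragment below is a minimal sketch for a hypothetical agent named `a1` that tails an application log and writes events into date-partitioned HDFS directories (file paths are illustrative):

```properties
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Source: follow a local log file as events arrive.
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /var/log/app/events.log
a1.sources.r1.channels = c1

# Channel: buffer events in memory between source and sink.
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000

# Sink: write events to HDFS, one directory per day.
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /data/events/%Y-%m-%d
a1.sinks.k1.hdfs.useLocalTimeStamp = true
a1.sinks.k1.channel = c1
```

The memory channel trades durability for speed; a file channel is the usual choice when events must survive an agent restart.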
Storage is not only a question of where to put the data: the data must also be in an appropriate format, of an appropriate size, and accessible in an appropriate way.
The appropriate format depends on whether the application does batch processing or real-time processing. For batch processing, file formats such as SequenceFile or Avro are popular and appropriate.
For real-time applications, there is Apache Parquet, a columnar format whose structure resembles a database table.
It is very efficient for accessing and processing large data sets.
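Parquet's efficiency comes largely from its columnar layout. As a rough illustration in plain Python (not using Parquet itself), storing records column-by-column lets an aggregation read only the one field it needs, while a row layout forces it to touch every record:

```python
# Row-oriented layout: one dict per record.
rows = [
    {"user": "a", "amount": 10, "country": "VN"},
    {"user": "b", "amount": 25, "country": "US"},
    {"user": "c", "amount": 7,  "country": "VN"},
]

# Column-oriented layout: one list per field.
columns = {
    "user": ["a", "b", "c"],
    "amount": [10, 25, 7],
    "country": ["VN", "US", "VN"],
}

# Row layout: every whole record is visited just to sum one field.
total_rows = sum(r["amount"] for r in rows)

# Column layout: only the "amount" column is read at all.
total_cols = sum(columns["amount"])

print(total_rows, total_cols)  # → 42 42
```

On disk, the columnar version also compresses better, because values of the same type and similar range sit next to each other.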
Hadoop is best at managing a small number of large files rather than many small ones. Small files are still a usable path, but they require an intermediate step that merges the small files together.
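A minimal sketch of that merge step, using plain Python and local files for illustration (on a real cluster this would be an HDFS compaction job rather than local I/O):

```python
import glob
import os
import tempfile

def merge_small_files(input_dir, output_path, pattern="part-*.txt"):
    """Concatenate many small files into one large file, in sorted name order."""
    with open(output_path, "w") as out:
        for path in sorted(glob.glob(os.path.join(input_dir, pattern))):
            with open(path) as f:
                out.write(f.read())

# Demo: three small files merged into one.
work = tempfile.mkdtemp()
for i in range(3):
    with open(os.path.join(work, f"part-{i}.txt"), "w") as f:
        f.write(f"record {i}\n")

merged = os.path.join(work, "merged.txt")
merge_small_files(work, merged)
with open(merged) as f:
    print(f.read())  # → record 0 / record 1 / record 2, one per line
```

The same idea scales down the file count so that each Hadoop task works on one large block instead of thousands of tiny ones.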
Access can be controlled by creating a time-stamped directory for each individual process run, which ensures that processes, even when run in parallel, do not interfere with one another's data.
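That convention can be sketched in plain Python, with a local filesystem standing in for HDFS (the directory layout and process name here are assumptions for illustration):

```python
import os
import tempfile
from datetime import datetime

def make_run_dir(base, process_name, now=None):
    """Create a unique time-stamped working directory for one process run."""
    now = now or datetime.now()
    stamp = now.strftime("%Y%m%d_%H%M%S_%f")  # microseconds keep parallel runs apart
    path = os.path.join(base, process_name, stamp)
    os.makedirs(path)  # raises if the directory already exists, so runs never collide
    return path

base = tempfile.mkdtemp()
d1 = make_run_dir(base, "daily_aggregation")
d2 = make_run_dir(base, "daily_aggregation")
# Two runs of the same process get separate directories.
print(d1 != d2)  # → True
```

Because each run writes only inside its own directory, a failed run can also be cleaned up or retried without touching any other run's output.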
Once the data has been stored, the next step in the overall flow is to process it automatically.
Transforming data here does not mean that part of the data is lost; it is a step that helps the system process the data more effectively.
The most talked-about topic in Big Data analysis is machine learning: the process of building mathematical models that solve recommendation, clustering, or classification problems for new data.
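As a toy illustration of that idea in plain Python (a real system would use a distributed library such as Spark MLlib), a nearest-neighbour classifier "trains" by storing labelled examples and classifies new data by distance to them; the data points and labels below are invented for the example:

```python
import math

def nearest_neighbor(train, point):
    """Classify `point` with the label of the closest training example.

    `train` is a list of ((x, y), label) pairs.
    """
    def dist(a, b):
        return math.hypot(a[0] - b[0], a[1] - b[1])
    return min(train, key=lambda ex: dist(ex[0], point))[1]

# Hypothetical labelled data: two clusters of points.
train = [((0, 0), "small"), ((1, 1), "small"),
         ((10, 10), "large"), ((11, 11), "large")]

print(nearest_neighbor(train, (0.5, 0.5)))   # → small
print(nearest_neighbor(train, (9.0, 10.5)))  # → large
```

The "model" here is simply the stored training set; more realistic methods compress the training data into a compact mathematical model instead.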
Before the data can be processed automatically, the tools and processes in each component of the system must be combined into a unified data flow. Such data flows come in two types: micro and macro.
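A minimal sketch of combining individual processing steps into one unified flow, in plain Python (a real deployment would use a workflow tool such as Apache Oozie; the step functions here are assumptions for illustration):

```python
def ingest(raw):
    # Step 1: split raw text into individual records.
    return raw.strip().split("\n")

def transform(records):
    # Step 2: normalize each record.
    return [r.strip().lower() for r in records]

def load(records):
    # Step 3: "store" the result (here, just summarize it).
    return {"count": len(records), "records": records}

def run_pipeline(raw, steps):
    # Chain the steps so each one's output feeds the next.
    data = raw
    for step in steps:
        data = step(data)
    return data

result = run_pipeline("  Alpha\n BETA \ngamma ", [ingest, transform, load])
print(result["count"])  # → 3
```

A micro flow would live inside one such step; the macro flow is the chain that connects ingestion, transformation, and storage end to end.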
Hopefully this article has given you more detailed knowledge of Big Data components and architecture, a model for future development.