This topic is an issue that we have been discussing with our clients in the past few months. It is also a question that many big companies are thinking about. Let's share our views and experiences here.
Twenty years ago, large-scale storage typically used a proprietary hardware solution (NAS) that provided access to other applications through special high-performance communication hardware. This kind of solution is expensive and hard to expand, cannot meet the ultra-large-scale data storage requirements of the high-speed internet.
In 2001, Google's GFS pioneered the first time to build large-scale storage with ordinary x86 machines and ordinary hard drives. At that time, the throughput of HDD was about 50MB/s, and the single-machine performance could increase to 1GB/s by connecting multiple hard disks. But the mainstream network at that time was only 100Mb, and remote access to data over the network was very slow. To solve the data access speed problem, Google creatively proposed a computing and storage coupled architecture, implementing computing and storage functions in the same cluster, and moving the computing code to where the data is stored instead of transferring the data to the computing node. This architecture effectively solves the difficulty of accessing massive data between the weakly connected storage nodes. Hadoop and other successors have completely implemented with this architecture. Data localization is one of the most important features to ensure overall performance. Several optimizations have also been made to reduce network bandwidth consumption between machines and cabinets.
After 10 years of development, the performance of the network has undergone tremendous changes, from the previous mainstream 100Mb to 10Gb, which has increased by 100 times, while the performance of HDD hard drives at the same time has not changed much, but the capacity of single discs has remarkably increased. Due to the high bandwidth requirements of various social networking applications and the reliable support of core switches and SDN, many companies have implemented a peer-to-peer 10Gb network architecture (10Gb bandwidth guarantee between any two machines). Besides, various efficient compression algorithms and column storage formats further reduce the amount of I/O data, gradually turning the bottleneck of big data from I/O to CPU. In a big data computing cluster where data localization is well optimized, a large amount of network bandwidth is idle, and since storage and computation are coupled in a cluster, there are some other issues occurred:
- In different applications or development stages, various storage spaces and computing power ratios are required, which makes the selection of machines more sophisticated and confused;
- When the storage capacity or computing resources are insufficient, they need to expand at the same time, resulting in lower economic efficiency (another expanded resource is wasted);
- In a cloud computing scenario, pure elastic computing cannot be achieved since there is also data stored in the computing cluster, and turning off idle compute clusters will result in data loss.
Because of the above issues caused by storage and computational couplings, many companies are beginning to think about the necessity for such coupling and data localization. When I first joined Facebook in 2013, my colleagues in other team researched how much impact on performance was produced by turning off Hadoop's data localization optimization. Measured data show that the overall impact on computing tasks is less than 2%, and local read optimization of data is trivial. Later, Facebook gradually moved to the separate architecture of computing and storage. They also made some improvements to the big data software to adapt to this new architecture. They made a detailed sharing at this year's Apache Spark & AI Summit named Taking Advantage of a Disaggregated Storage and Compute Architecture, they called this architecture DisAgg. Google should be doing this thing earlier, but there is not much public information available for reference.
On public clouds such as AWS, network-based block storage gradually replaces stand-alone local storage, making computing and storage-coupling architectures on public clouds more unreasonable (data localization is not real localization, local reads in DataNodes are actually remote read in physics layer). Databricks, a big data analytics service designed for the public cloud, began with a separate computing and storage architecture (using S3 directly as storage), giving the product a great deal of flexibility, on-demand creation and automatic elastic scaling of Spark clusters. It is a big selling point (and can also maximize the use of Spot nodes to greatly reduce the cost of computing), very popular with customers. Since S3 is just object storage, there are many obstacles with big data computations, and Databricks and its customers have been trapped many times. Databricks took much effort to improve and adapt it, making Spark tasks on Databricks faster and more reliable. Netflix, the pioneer on AWS, also uses S3 as big data storage. They have made various changes for Hive to make it stable running (not the open source S3mper). JuiceFS further abstracts and commercializes these improvements, enabling them to better serve more scenarios, including big data, helping more companies improve the cloud big data experience without having to solve the same problems with S3 and other object storages.
With the rapid development of the network and the optimization of I/O by the big data computing framework, data localization is no longer necessary, and the architecture of separating storage and computing is the future. JuiceFS follow this trend and is a better alternative to the architected outdated HDFS, providing a fully elastic storage solution for cloud big data, giving real flexibility to cloud big data (completely pay-as-you-go).