
Storage requirements for AI, ML and analytics in 2022

 

Artificial intelligence (AI) and machine learning (ML) promise to transform whole areas of the economy and society, if they are not already doing so. From driverless cars to customer service “bots”, AI and ML-based systems are driving the next wave of business automation.

They are also massive consumers of data. After a decade or so of relatively steady growth, the data used by AI and ML models has grown exponentially as scientists and engineers strive to improve the accuracy of their systems. This puts new and sometimes extreme demands on IT systems, including storage.

AI, ML and analytics require large volumes of data, mostly in unstructured formats. “All these environments are leveraging vast amounts of unstructured data,” says Patrick Smith, field CTO for Europe, the Middle East and Africa (EMEA) at supplier Pure Storage. “It is a world of unstructured data, not blocks or databases.”

Training AI and ML models in particular uses larger datasets to make more accurate predictions. As Vibin Vijay, an AI and ML specialist at OCF, points out, a basic proof-of-concept model trained on a single server might achieve around 80% accuracy.

Trained on a cluster of servers, that figure can rise to 98% or even 99.99%. But this puts its own demands on IT infrastructure. Almost all developers work on the basis that more data is better, especially in the training phase. “This results in massive collections, at least petabytes, of data that the organisation is forced to manage,” says Scott Baker, CMO at IBM Storage.
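
As an illustration of the single-server versus cluster contrast Vijay draws, the sketch below shows the general shape of multi-node training using PyTorch’s DistributedDataParallel. The toy model, random dataset and torchrun launch are illustrative assumptions, not anything prescribed in this article.

```python
# Minimal sketch of multi-server training with PyTorch DistributedDataParallel (DDP).
# Assumes one process per GPU, launched via `torchrun --nnodes=<N> --nproc_per_node=<GPUs>`;
# the toy model and random dataset stand in for a real workload.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

def main():
    dist.init_process_group(backend="nccl")            # join the training cluster
    local_rank = int(os.environ["LOCAL_RANK"])          # set by torchrun
    torch.cuda.set_device(local_rank)

    model = DDP(torch.nn.Linear(128, 10).cuda(local_rank), device_ids=[local_rank])

    data = TensorDataset(torch.randn(10_000, 128), torch.randint(0, 10, (10_000,)))
    sampler = DistributedSampler(data)                   # each rank trains on its own shard
    loader = DataLoader(data, batch_size=256, sampler=sampler)

    optimiser = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = torch.nn.CrossEntropyLoss()

    for epoch in range(3):
        sampler.set_epoch(epoch)                          # reshuffle shards each epoch
        for x, y in loader:
            x, y = x.cuda(local_rank), y.cuda(local_rank)
            optimiser.zero_grad()
            loss_fn(model(x), y).backward()               # gradients averaged across all nodes
            optimiser.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```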

Storage systems can become a bottleneck. The latest advanced analytics applications make heavy use of CPUs and especially GPU clusters, connected via high-speed interconnects such as Nvidia’s InfiniBand. Developers are even looking at connecting storage directly to GPUs.

“In AI and ML workloads, the learning phase typically employs powerful GPUs that are expensive and in high demand,” says Brad King, co-founder and field CTO at supplier Scality. “They can chew through massive volumes of data and can often wait idly for more data due to storage limitations.

“Data volumes are generally large. Large is a relative term, of course, but in general, for extracting usable insights from data, the more pertinent data available, the better the insights.”
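
One common way to address the idle-GPU problem King describes is to overlap storage reads with computation in the data pipeline. The sketch below uses PyTorch’s DataLoader; the dataset class, worker count and batch size are illustrative assumptions rather than recommendations.

```python
# Sketch: overlapping storage I/O with GPU compute so the accelerator is not starved.
# The dataset class, worker count and batch size are illustrative assumptions.
import torch
from torch.utils.data import DataLoader, Dataset

class FileBackedDataset(Dataset):
    """Hypothetical dataset that reads samples from shared storage on demand."""
    def __init__(self, num_samples=100_000):
        self.num_samples = num_samples

    def __len__(self):
        return self.num_samples

    def __getitem__(self, idx):
        # In a real pipeline this would read from NFS, a parallel file system or object storage.
        return torch.randn(3, 224, 224), idx % 10

loader = DataLoader(
    FileBackedDataset(),
    batch_size=256,
    num_workers=8,           # parallel reader processes hide storage latency
    pin_memory=True,         # page-locked buffers speed up host-to-GPU copies
    prefetch_factor=4,       # each worker keeps several batches in flight
    persistent_workers=True,
)

device = "cuda" if torch.cuda.is_available() else "cpu"
for images, labels in loader:
    images = images.to(device, non_blocking=True)  # copy can overlap with compute
    # ... forward/backward pass would run here ...
    break
```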

The challenge is to provide high-performance storage at scale and within budget. As OCF’s Vijay points out, designers might want all storage on high-performance tier 0 flash, but this is rarely, if ever, practical. And because of the way AI and ML work, especially in the training phases, it might not be needed.

Instead, organisations are deploying tiered storage, moving data up and down through the tiers all the way from flash to the cloud and even tape. “You’re looking for the right data, in the right place, at the right cost,” says Vijay.
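
Tiering of the kind Vijay describes is usually policy-driven. The following is a purely hypothetical sketch of such a policy, with made-up thresholds, to show how placement can follow access recency and frequency; it is not any vendor’s actual implementation.

```python
# Hypothetical tiering policy: place data on flash, object storage, cloud archive or tape
# according to how recently and how often it is accessed. Thresholds are illustrative only.
from datetime import datetime, timedelta

def choose_tier(last_access: datetime, accesses_last_30_days: int) -> str:
    age = datetime.utcnow() - last_access
    if age < timedelta(days=7) or accesses_last_30_days > 100:
        return "flash"            # hot training data stays on the performance tier
    if age < timedelta(days=90):
        return "object"           # warm data moves to scale-out object storage
    if age < timedelta(days=365):
        return "cloud_archive"    # cold data goes to cheap cloud capacity
    return "tape"                 # long-term retention for reproducibility or compliance

# Example: a dataset last touched 200 days ago lands on the cloud archive tier
print(choose_tier(datetime.utcnow() - timedelta(days=200), accesses_last_30_days=2))
```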

Firms also need to think about data retention. Data scientists cannot predict which information is needed for future models, and analytics improve with access to historical data. Cost-effective, long-term data archiving remains important.

 


What kinds of storage are best?

There is no single option that meets all the storage needs for AI, ML and analytics. The conventional idea that analytics is a high-throughput, high-I/O workload best suited to block storage has to be balanced against data volumes, data types, the speed of decision-making and, of course, budgets. An AI training environment makes different demands to a web-based recommendation engine working in real time.

“Block storage has traditionally been well suited for high-throughput and high-I/O workloads, where low latency is important,” says Tom Christensen, global technology adviser at Hitachi Vantara. “However, with the advent of modern data analytics workloads, including AI, ML and even data lakes, traditional block-based platforms have been found lacking in the ability to meet the scale-out demand that the computational side of these platforms create. As such, a file and object-based approach must be adopted to support these modern workloads.”

 

Block-access storage

Block-based systems retain the edge in raw performance, and support data centralisation and advanced features. According to IBM’s Scott Baker, block storage arrays support application programming interfaces (APIs) that AI and ML developers can use to improve repeated operations or even offload storage-specific processing to the array. It would be wrong to rule out block storage completely, especially where the need is for high IOPS and low latency.

Against this, there is the need to build specific storage area networks for block storage – usually Fibre Channel – and the overheads that come with block storage relying on an off-array (host-based) file system. As Baker points out, this becomes even more difficult if an AI system uses more than one OS.

 

File and object

As a result, system architects favour file or object-based storage for AI and ML. Object storage is built with petabyte-scale capacity in mind and scales out readily. It is also designed to support applications such as the internet of things (IoT).

Erasure coding provides data protection, and the advanced metadata support in object systems can benefit AI and ML applications.

Against this, object storage lags behind block systems on raw performance, although the gap is closing with newer, high-performance object technologies. And application support varies, with not all AI, ML or analytics tools supporting AWS’s S3 interface, the de facto standard for object storage.
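
For the tools that do support it, the S3 interface makes object storage straightforward to consume from a training pipeline. Below is a minimal sketch using boto3; the bucket, prefix and endpoint are invented for illustration, and any S3-compatible store could be targeted the same way by changing the endpoint.

```python
# Minimal sketch: reading training data from S3-compatible object storage with boto3.
# Bucket, prefix and key names are hypothetical.
import boto3

s3 = boto3.client("s3")  # or boto3.client("s3", endpoint_url="https://objects.example.com")

# List the objects under a (hypothetical) training-data prefix
resp = s3.list_objects_v2(Bucket="ml-training-data", Prefix="images/2022/")
for obj in resp.get("Contents", []):
    print(obj["Key"], obj["Size"])

# Fetch one object into memory (or stream it to local disk for large files)
body = s3.get_object(Bucket="ml-training-data", Key="images/2022/batch-0001.tar")["Body"]
data = body.read()
```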

 

Cloud storage

Cloud storage is largely object-based, but offers other advantages for AI and ML projects. Chief among these are flexibility and low up-front costs.

The principal disadvantages of cloud storage are latency and potential data egress costs. Cloud storage is a good choice for cloud-based AI and ML systems, but it is harder to justify where data needs to be extracted and loaded onto local servers for processing, because this increases cost. The cloud is, however, economical for long-term data archiving.
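
That archiving case can be automated with object-storage lifecycle rules, so cold training sets drift to cheaper tiers without manual intervention. The sketch below uses the standard S3 lifecycle API via boto3; the bucket name, prefix and day thresholds are assumptions for illustration.

```python
# Sketch: an S3 lifecycle rule that pushes ageing training data to cheaper archive tiers.
# Bucket name, prefix and day thresholds are illustrative assumptions.
import boto3

s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
    Bucket="ml-training-data",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archive-old-training-sets",
                "Filter": {"Prefix": "images/"},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": 90, "StorageClass": "GLACIER"},        # cold after three months
                    {"Days": 365, "StorageClass": "DEEP_ARCHIVE"},  # long-term retention
                ],
            }
        ]
    },
)
```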

 


What do storage suppliers recommend?

Unsurprisingly, suppliers do not recommend a single solution for AI, ML or analytics – the range of applications is too broad. Instead, they recommend looking at the business requirements behind the project, as well as looking to the future.

“Understanding what outcomes or business purpose you need should always be your first thought when choosing how to manage and store your data,” says Paul Brook, director of data analytics and AI for EMEA at Dell. “Sometimes the same data may be needed on different occasions and for different purposes.”

Brook points to convergence between block and file storage in single appliances, and systems that can bridge the gap between file and object storage through a single file system. This will help AI and ML developers by providing a more common storage architecture.

HPE, for example, recommends on-premises, cloud and hybrid options for AI, and sees convergence between AI and high-performance computing. NetApp promotes its cloud-connected, all-flash ONTAP storage for AI.

At Cloudian, CTO Gary Ogasawara expects to see convergence between the high-performance batch processing of the data warehouse and streaming data processing architectures. This will push users toward object solutions.

“Block and file storage have architectural limitations that make scaling beyond a certain point cost-prohibitive,” he says. “Object storage provides limitless, highly cost-effective scalability. Object storage’s advanced metadata capabilities are another key advantage in supporting AI/ML workloads.”
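
Ogasawara’s point about metadata can be illustrated with the S3 API, where custom key-value pairs travel with each object, letting a pipeline carry labels or provenance alongside the data itself rather than in a separate database. The bucket, keys and metadata fields below are hypothetical.

```python
# Sketch: attaching and reading custom metadata on objects, a common way to keep
# labels and provenance with training data. Names and fields are hypothetical.
import boto3

s3 = boto3.client("s3")

# Write an object with user-defined metadata (stored as x-amz-meta-* headers)
s3.put_object(
    Bucket="ml-training-data",
    Key="images/2022/cat-0001.jpg",
    Body=b"<image bytes>",  # placeholder for the actual image payload
    Metadata={"label": "cat", "source": "camera-07", "ingest-date": "2022-03-01"},
)

# Read the metadata back without downloading the object itself
head = s3.head_object(Bucket="ml-training-data", Key="images/2022/cat-0001.jpg")
print(head["Metadata"])  # {'label': 'cat', 'source': 'camera-07', 'ingest-date': '2022-03-01'}
```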

It is also vital to plan for storage at the outset, because without adequate storage, project performance will suffer.

“In order to successfully implement advanced AI and ML workloads, a proper storage strategy is as important as the advanced computation platform you choose,” says Hitachi Vantara’s Christensen. “Underpowering a complex distributed, and very expensive, computation platform will net lower performing results, diminishing the quality of your outcome, ultimately reducing the time to value.”

© 2022 LeackStat.com