Developing AI workloads is complex

Sponsored Feature If artificial intelligence (AI) has been sending shockwaves through the technology world in recent years, the onset of generative AI over the last 18 months has been a veritable earthquake.

For IT leaders looking to harness its potential for their own organisations, the pace of development can feel bewildering. Enterprises are racing to make the best use of their own data, either to build their own AI models or to repurpose publicly available ones. But this can pose a significant challenge for the dev and data science teams involved.

It can also present something of a conundrum for companies that want to keep control of the HPC infrastructure needed to support their AI workloads. AI-enabled applications and services require a far more complex mix of silicon than traditional computing, as well as accompanying storage capacity and connectivity bandwidth to handle the vast amounts of data needed in both the training and inference stages.

London data centres reflect AI trends

The potential for enterprise AI innovation, and the challenges it presents, is reflected in what is happening across colocation giant Digital Realty’s data centre estate in and around London as AI shifts to the top of the hosting company’s customers’ agendas.

The UK capital and its surrounding areas have a high density of headquarters buildings and R&D offices, not just in financial services, but in other key industry verticals such as pharma, manufacturing, retail, and tech.

London is attractive because of the UK’s political and legal stability, skilled workforce, and advanced tech infrastructure, explains Digital Realty CTO Chris Sharp, making it a superb base both for innovation and for deploying AI applications and workloads.

Many enterprises will be acutely aware of the general importance of data and IP, as well as of specific issues around data sovereignty and regulation, he adds.

“There’s a bit of nuance with training,” Sharp explains. “Nobody knows if it’s going to be able to be done anywhere and then inference has to abide by the [local] compliance [rules].” In addition, there’s an increasing understanding that one model cannot necessarily serve the world: “There’s going to be some regionality, so that will then also dictate the requirement for training facilities.”

At the same time, these organisations face the same technology challenges as other companies worldwide, particularly when it comes to putting in place and powering the infrastructure needed for AI.

It’s not enough to simply throw more CPUs at these workloads. One of the challenges with AI and HPC pipelines is the range of purpose-built hardware needed to support these complex applications efficiently.

These range from CPUs to GPUs and even application-specific tensor processing units (TPUs) designed for neural networks, all with subtly different requirements, and all potentially playing a role in a customer’s AI pipeline. “Being able to support the full deployment of that infrastructure is absolutely top of mind,” points out Sharp.

Moreover, the balance between these platforms is set to change as AI projects move beyond development and into production. “If you take a snapshot, it’s 85 percent training, 15 percent inference today. But over the course of maybe 24 months, it’s 10 times more of a requirement to support inference,” he adds.

Flexing your AI smarts

So, the ability to flex and rebalance the underlying architecture as models evolve is paramount.

There is also the challenge of connecting this vast amount of data and compute together to deliver the required AI workload performance. While customers in the UK will have data sovereignty very much in mind, they may still need to process workloads internationally and to tap data oceans around the world. As Sharp says, “How do you connect these things together, because you’re not going to own all the data.”

But connectivity is not simply an external concern. “Within the four walls of the data centre we’re seeing six times the cable requirements [as] customers are connecting their GPUs, the CPUs, the network nodes. …. so, where we had one cable tray for fibre runs, now we have six times those cable trays, just to enable that.”

Hanging over all of this are the challenges associated with housing and powering this infrastructure. Just the density of technology required raises floor loading issues, Sharp explains. “The simple weight of these capabilities is massive.” And, as Digital Realty has found working with hyperscale cloud providers, floor loading requirements can increase incredibly quickly as projects scale up and AI technology advances.

Cooling, too, is always a challenge in data centres, and as far as Sharp is concerned there is no longer a debate as to whether to focus on liquid or air cooling. “You need the ability to support both efficiently.”

When combined with the sheer density of processing power demanded by AI workloads, this is all having a dramatic effect on power demand across the sector. Estimates published by Schneider Electric last year suggest AI currently accounts for 4.5 GW of data centre power demand, predicted to increase at a compound annual growth rate (CAGR) of 25-33 percent to reach between 14 GW and 18.7 GW by 2028. That is two to three times the growth rate forecast for overall data centre power demand, which is expected to see a 10 percent CAGR over the same period.
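As a rough sanity check, those figures can be reproduced by compounding the 4.5 GW baseline forward at the quoted growth rates. The minimal sketch below assumes a 2023 baseline and a five-year horizon to 2028, which the article does not state explicitly.

```python
# Rough check of the Schneider Electric projection cited above.
# Assumption: 4.5 GW baseline in 2023, compounded over five years to 2028.
def project(baseline_gw: float, cagr: float, years: int) -> float:
    """Compound the baseline demand forward at a fixed annual growth rate."""
    return baseline_gw * (1 + cagr) ** years

low = project(4.5, 0.25, 5)   # ~13.7 GW, close to the ~14 GW low estimate
high = project(4.5, 0.33, 5)  # ~18.7 GW, matching the upper estimate
print(f"Projected 2028 AI power demand: {low:.1f} GW to {high:.1f} GW")
```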

All of which means that data centre operators must account for “more and more new hardware coming to market, requiring more power density, increasing in square footage required to support these burgeoning deployments.”

A state of renewal

That daunting array of challenges has informed the development of Digital Realty’s infrastructure in and around London, and its ongoing retrofitting and optimisation as enterprises scale up their AI operations.

The company has six highly connected campuses in the greater London area, offering almost a million square feet of colo space. Nor does that capacity exist in isolation: it sits alongside over 320 different cloud and network service providers across the city. “What we’re seeing today is that customers need that full product spectrum to be successful,” Sharp says.

Liquid cooling is a key element of its London infrastructure. As liquid is around 800 times denser than air, it can have a profound impact on efficiency. Digital Realty’s Cloud House data centre in London draws water from the Millwall dock for cooling, in a system that is up to 20 times more efficient than traditional cooling. Sensors ensure that only the required amount of water is used, and that it is returned to the dock unchanged.

But this ability to match the demands of corporations in and around London today and for the future also depends on Digital Realty’s broader vision.

All the power consumed by Digital Realty’s European operation is matched with renewable energy through power purchase agreements and other initiatives, while the company as a whole is contracted for over 1 GW of new renewable energy worldwide.

At a hardware level, it has developed technologies such as its HD Colo product, which supports 70 kW per rack, three times the certification requirement for the Nvidia H100 systems which currently underpin cutting-edge HPC and AI architectures.

At a macro level, as Sharp explains, Digital Realty plans its facilities years in advance. This includes “master planning the real estate, doing land banks and doing substations, making sure we pre-planned the power for five to six years.”

This requires close coordination from the outset with local authorities and utility providers, including investing in substations itself.

“We work extensively with the utility to make sure that not only the generation is there, but the distribution, and that they fortify the grid accordingly. I think that really allows customers of ours and our up the line suppliers, a lot of time to align to that demand.”

Cooling, power and infrastructure management complexities

It might be difficult to decide which is more complex: developing cooling technologies and power management platforms that keep ahead of rapidly evolving AI infrastructure, or dealing with utilities and municipalities over a multi-year time horizon.

But tackling both is crucial as organisations look to stand up and expand their own AI capacity both quickly and sustainably.

Sharp cites the example of one European education and research institution that needed to ramp up its own HPC infrastructure to support its AI ambitions, and knew it needed to use direct liquid-to-chip cooling. It would certainly have had the technical know-how to build out its own infrastructure. But once it began planning the project, it became clear that starting from scratch would have meant a five-to-six-year buildout. And that is an age in the current environment. Moreover, local regulations demanded it reduce its energy footprint by 25 percent over five years.

By partnering with Digital Realty, Sharp explains, it was able to deploy in one year, and by using 100 percent liquid cooling it improved its energy efficiency by 30 percent. As Sharp puts it, “It really helped them out rather quickly.”

Given how quickly the world has changed over the last 18 months, the ability to get an AI project up and running and into production that quickly is much more than a nice to have. For many enterprises, it’s going to be existential.

“Many AI deployments have failed, because there’s a lot of science and complexity to it,” says Sharp. But he continues, “We spend a lot of time removing complexity.”

Sponsored by Digital Realty.
