Analysis: Will data centre cooling failures become more common?
Source: tt | Author: DeepKnowledge (深知社) | Updated: 2023-02-26

Will data centre cooling failures become more common?


Luke Neville, December 28, 2022

Managing Director at i3 Solutions Group


Translator's Note

The summer of 2022 saw a heatwave across much of the world: London in the UK and Chongqing in China both recorded their highest summer temperatures in recent years, which posed a serious challenge to the stable operation of data centres. This is no accident; it is a change in the global climate driven inevitably by global warming, and temperatures are expected to keep rising in the short term until a peak is reached.


The two most common types of terminal cooling equipment in data centres today are precision air conditioners and AHUs, which cool effectively when rack aisles are properly contained. With careful management, an annualised PUE below 1.2 is achievable. Seen from the rack, or even the server, however, both are still forms of centralised cooling, and some cooling capacity is inevitably wasted. Unless cooling equipment performance makes a real breakthrough, it will be hard to achieve another qualitative improvement in data centre cooling efficiency. By contrast, the steadily maturing liquid cooling technology is a highly effective form of distributed cooling, removing heat at the server or even the processor level. With outdoor summer temperatures continuing to climb as the climate warms, liquid cooling looks like a new way to cope with rising temperatures and improve operating efficiency.
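As a quick aside on the PUE figure quoted above, here is a minimal sketch of how an annualised PUE could be computed from energy readings; all numbers are made up for illustration and do not come from any real facility.

```python
# Minimal sketch: annualised PUE from monthly energy readings.
# All figures are illustrative, not measurements from a real facility.

def annualised_pue(total_facility_kwh, it_kwh):
    """PUE = total facility energy / IT equipment energy over the same period."""
    if it_kwh <= 0:
        raise ValueError("IT energy must be positive")
    return total_facility_kwh / it_kwh

# Hypothetical monthly readings (kWh): a flat 1 MW IT load, with cooling and
# other overheads rising in the warmer half of the year.
monthly_it = [720_000] * 12
monthly_overhead = [120_000] * 6 + [170_000] * 6

total_facility = sum(it + oh for it, oh in zip(monthly_it, monthly_overhead))
print(f"Annualised PUE: {annualised_pue(total_facility, sum(monthly_it)):.2f}")
```

With these illustrative numbers the result lands at roughly 1.2, the level cited above for a well-managed facility.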


More extreme weather patterns resulting in higher temperature peaks, such as the record 40°C experienced in parts of the UK in the summer of 2022, will cause more data centre failures. However, while it is inevitable that data centre failures will become more common, establishing a direct cause and effect is difficult, as the factors to consider include the growing number of sites and an ageing data centre stock, both of which will statistically increase the number of outages.


What the increasing peak summer temperatures are doing is shifting the needle and changing conversations, both about how data centre cooling should be designed and about what constitutes 'safe' design and operating temperatures. Since the beginning of the modern data centre industry over two decades ago, the design of data centre cooling system capacity has always been a compromise between installation cost and risk.


Designers sought to achieve a balance whereby a peak ambient temperature and a level of plant redundancy are selected so that, should that temperature be reached, the system has the capacity to continue to support operations. The higher the peak ambient design temperature selected, the greater the size and cost of the plant, with greater resilience meaning further cost for redundant plant. It came down to the appetite for risk versus the cost for the owner and operator. It is a fact that whenever the chosen ambient design temperature is exceeded, the risk of a failure will always be present and increases with the temperature.


So, what is the right ambient and peak temperature set point?


ASHRAE publishes temperatures for numerous weather station locations based on expected peaks over 5-, 10-, 20- and 50-year periods. Typically, the data for the 20-year period is used for data centre ambient design.
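To make the idea of an n-year design ambient concrete, here is a minimal sketch of deriving a rough design temperature from historical hourly weather data. The file name, column names and the simple annual-maxima quantile method are assumptions for illustration only and are not ASHRAE's actual methodology.

```python
# Minimal sketch: rough n-year design ambient from hourly weather records,
# assuming a CSV with columns "timestamp" and "temp_c". This uses a crude
# annual-maxima quantile, NOT ASHRAE's published method.
from collections import defaultdict
import csv

def annual_maxima(csv_path):
    """Return the maximum recorded temperature for each calendar year."""
    maxima = defaultdict(lambda: float("-inf"))
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            year = row["timestamp"][:4]        # "2022-07-19T15:00" -> "2022"
            maxima[year] = max(maxima[year], float(row["temp_c"]))
    return dict(maxima)

def n_year_design_temp(maxima, n_years):
    """Crude n-year return level: the (1 - 1/n) quantile of annual maxima."""
    temps = sorted(maxima.values())
    idx = min(len(temps) - 1, int((1 - 1 / n_years) * len(temps)))
    return temps[idx]

# Usage (hypothetical file name):
# maxima = annual_maxima("heathrow_hourly.csv")
# print("20-year design ambient:", n_year_design_temp(maxima, 20))
```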


However, this is a guideline only, and each owner/operator chooses their own limit based on what they feel will reduce risks to acceptable levels without increasing costs too much. Hotter summers have seen design conditions trending upwards over the last twenty years.


Legacy data centres were traditionally aligned with much lower temperatures, from say 28–30°C, through to the latterly accepted standard design conditions of 35–38°C. Systems were often selected to operate past these points, even up to 45°C (these figures are based on the UK; other regions will have temperatures selected to suit the local climate).

The new record temperatures of +40°C in the UK will sound a warning bell for some data centre operators, who may find themselves in a situation where dated design conditions, ageing plant and high installed capacity result in servers running at the limits of their design envelope. All systems have a reduced capacity to reject heat as the ambient temperature increases, and also have a fixed limit, irrespective of load, at which they will be unable to reject heat. Should these conditions be reached, failure is guaranteed.
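The derating behaviour described above can be illustrated with a simple linear model. The capacities, derating coefficient and cut-off ambient below are made-up values; real curves come from the plant manufacturer's selection data.

```python
# Minimal sketch: available heat rejection vs. ambient temperature, using a
# simple linear derating model with illustrative (made-up) coefficients.

RATED_CAPACITY_KW = 1000.0   # capacity at the rated ambient
RATED_AMBIENT_C = 35.0       # ambient at which the rated capacity applies
DERATE_PER_DEG_C = 0.025     # fraction of capacity lost per °C above rated
MAX_AMBIENT_C = 50.0         # above this, assume no useful heat rejection

def available_capacity_kw(ambient_c):
    """Heat rejection capacity remaining at a given ambient temperature."""
    if ambient_c >= MAX_AMBIENT_C:
        return 0.0
    if ambient_c <= RATED_AMBIENT_C:
        return RATED_CAPACITY_KW
    derate = DERATE_PER_DEG_C * (ambient_c - RATED_AMBIENT_C)
    return RATED_CAPACITY_KW * max(0.0, 1.0 - derate)

it_load_kw = 800.0  # hypothetical heat load to be rejected
for ambient in (30, 35, 40, 43, 46):
    cap = available_capacity_kw(ambient)
    status = "OK" if cap >= it_load_kw else "OVERHEAT RISK"
    print(f"{ambient:>2}°C ambient -> {cap:6.0f} kW available ({status})")
```

The point of the sketch is simply that the margin shrinks as ambient rises, and that beyond some temperature the load can no longer be rejected at all.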


More commonly, low levels of actual load demand versus the system's design capability mean that data centres typically never experience conditions which stress the systems. However, that requires confidence that IT workloads are either constant, 100% predictable, or both.


For now, the failure of data centre cooling is most likely to be the result of plant condition impacting heat rejection capacity rather than of design parameter limitations. This was cited as the root cause of one failure during the UK's summer heatwave, when it was stated that cooling infrastructure within a London data centre had experienced an issue. Coupled with an increase in data centre utilisation, should temperatures outside the data centre continue to rise, this will change.


Know your limits


Inside the data centre, the increasing power requirements of modern chip and server designs also mean that heat could become more of an issue. Whatever server manufacturers say about acceptable ranges, data centres and IT departments have traditionally remained nervous about running their rooms at the top end of the temperature envelope. Typically, data centre managers like their facilities to feel cool.


To reduce power consumption and the need to oversize plant, it is common to integrate evaporative cooling into the heat rejection system. However, there has been much focus recently on the quantity of water used by data centres and its impact on the sustainability of such systems. While modern designs can allow for vast water storage systems and rainwater collection and use, should summers continue to get longer and drier, more mains water will be required to compensate, and the risk of supply issues affecting the operation of the facility will increase.
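To give a feel for the water quantities involved, here is a rough sketch based only on the latent heat of evaporation; the blowdown factor and facility load are illustrative assumptions, not figures from any real site.

```python
# Minimal sketch: rough make-up water estimate for evaporative heat rejection.
# Evaporating water absorbs roughly 2,400 kJ per kg near typical wet-bulb
# temperatures; blowdown is approximated with a simple multiplier.
# All facility numbers below are illustrative, not from any real site.

LATENT_HEAT_KJ_PER_KG = 2400.0   # approximate latent heat of vaporisation
BLOWDOWN_FACTOR = 1.3            # extra water dumped to control dissolved solids

def makeup_water_litres(heat_rejected_kwh):
    """Water evaporated (plus blowdown) to reject the given heat evaporatively."""
    evaporated_kg = heat_rejected_kwh * 3600.0 / LATENT_HEAT_KJ_PER_KG
    return evaporated_kg * BLOWDOWN_FACTOR   # ~1 kg of water ~ 1 litre

# Hypothetical: 1 MW of heat rejected evaporatively for 24 hours.
heat_kwh = 1000.0 * 24
print(f"~{makeup_water_litres(heat_kwh) / 1000:.0f} m3 of make-up water per day")
```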

In every sense, a paradigm shift away from cooling the technical space towards a focus on cooling the computing equipment itself may provide the answer. The adoption of liquid cooling systems, for example, can eliminate the need for both mechanical refrigeration and evaporative cooling solutions. In addition to some environmental benefits and a reduction in the number of fans at both room and server level, liquid cooling will help increase reliability and reduce failures, both generally and at times of extremely high temperatures.


Liquid cooling seems to be gaining traction. For example, at this year's Open Compute Summit, Meta (formerly Facebook) outlined its roadmap for a shift to direct-to-chip liquid-cooled infrastructure in its data centres to support the much-heralded metaverse. However, one of the limitations of liquid cooling designs is that they leave little room for manoeuvre during a failure, as resilience can be more challenging to incorporate into these systems.


But for now, without retrofitting new cooling systems, many existing data centres will have to find ways to use air and water to keep equipment cool. And as temperatures continue to rise inside and outside the facility, so too will the risks of failure.


DeepKnowledge (深知社)


Translated by:

Jia Xu (贾旭)

CFM, Tianjin Chayora Data Centre (天津朝亚数据中心)

Member of the DKV (DeepKnowledge Volunteer) programme


Proofread by:

Wang Mourui (王谋锐)

Electrical Architect, Chengdi Xiangjiang (Shanghai) Cloud Computing Co., Ltd. (城地香江(上海)云计算有限公司)

Elite member of the DKV (DeepKnowledge Volunteer) programme


Public account statement:

This is not an officially endorsed Chinese edition; it is provided for readers' study and reference only and may not be used for any commercial purpose. The English original prevails for the content of the article, and this article does not represent the views of DeepKnowledge. The translation may not be reproduced without written authorisation from the DeepKnowledge public account.
