All Calculations by AI: A Breakthrough in Open Computing

The discussion surrounding artificial intelligence (AI) continues to deepen, especially concerning the significance of openness in AI development. The renowned technology futurist Kevin Kelly made a striking statement during a 2024 presentation, claiming that one of OpenAI's most significant missteps was not making its large models open source. His words resonate strongly within the tech community as the shift towards open-source AI models disrupts traditional structures and ignites new forms of collaboration and innovation across industries.

Open-source large models have unleashed remarkable vitality in the AI sector in recent years. As reported in 2023, two-thirds of the foundational models released globally were open-source, and more than 80% of AI projects employed open-source frameworks. The movement has seen over 300 million downloads of open-source large models, resulting in the emergence of around 30,000 new models.

These impressive statistics are not mere numbers; they underline a movement marked by shared knowledge and collective engagement, propelling both innovation and application across various sectors.

This surge in open-source initiatives is significantly supported by advancements in the computing industry, particularly through the framework of open computing. The integration of "open-source large models and open computing" serves as a powerful combination that is profoundly influencing the trajectory of the AI and computing industries. Zhao Shuai, General Manager of Inspur Information's server product line, articulated this notion, emphasizing that the critical value of open computing in the AI era lies in addressing the diverse challenges posed by computational power through industry collaboration, facilitating both scale and innovation in AI applications.

The transformative power of AI on the computing industry became evident in 2020, when the concept of the Scaling Law emerged and established a golden rule for training large models.

This principle posits that increasing a model's parameters, dataset size, and computational power will enhance its performance, and that once models surpass a certain threshold, emergent intelligence becomes observable. For instance, Meta's open-source Llama 3.1 boasts an impressive parameter scale of 405 billion, showcasing considerable advancements and even surpassing some proprietary models across various domains.
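The quantitative shape of such a scaling law can be sketched as a power law in parameters and training data. The sketch below uses a Chinchilla-style functional form; the constants are illustrative assumptions for the sketch, not figures from this article.

```python
# Illustrative sketch of a power-law scaling law: loss falls as a power
# law in parameter count N and training-token count D.
# The constants are illustrative assumptions, not fitted values from
# any model discussed in the article.

def predicted_loss(n_params: float, n_tokens: float,
                   e: float = 1.69, a: float = 406.4, b: float = 410.7,
                   alpha: float = 0.34, beta: float = 0.28) -> float:
    """Power-law loss estimate: L = E + A / N^alpha + B / D^beta."""
    return e + a / n_params**alpha + b / n_tokens**beta

# Scaling up both parameters and data lowers the predicted loss.
small = predicted_loss(1e9, 1e11)   # ~1B params, ~100B tokens
large = predicted_loss(1e11, 1e13)  # ~100B params, ~10T tokens
print(f"small model loss estimate: {small:.2f}")
print(f"large model loss estimate: {large:.2f}")
```

The irreducible term `e` captures why returns diminish: past a certain scale, each doubling of compute buys a smaller drop in loss.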

The implications of this Scaling Law indicate that the scale, complexity, and data demands of AI large models will continue to rise, presenting significant challenges to foundational infrastructure. At the 2024 Open Computing China Summit, Zhao noted that AI large models introduce a full spectrum of new challenges for infrastructure, necessitating innovative solutions.

To meet the scale and complexity inherent in AI models, computational infrastructures must advance on two fronts: vertical scaling, which improves individual system performance, and horizontal scaling, which expands cluster capabilities.

Vertical scaling can be achieved by deploying more powerful AI acceleration cards, processors, and faster interconnect communications to boost the computational efficiency of single nodes. Horizontal scaling, by contrast, hinges on continuously adding computational nodes to construct expansive clusters that can meet the demanding needs of AI models.

Zhao further stated, "Relying on horizontal scaling brings a series of new challenges, including cluster network bandwidth, rapid deployment of infrastructure, computation resource management, and efficient power and cooling solutions." Currently, vertical and horizontal scaling coexist and are rapidly evolving to address these challenges.
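The trade-off behind horizontal scaling can be illustrated with a toy throughput model: adding nodes raises aggregate compute, but per-step communication (for example, gradient all-reduce across the cluster network) erodes per-node efficiency. Every number below is an illustrative assumption, not a measurement.

```python
# Toy model of horizontal scaling: aggregate throughput rises with node
# count, but a crude linear communication-overhead term erodes per-node
# efficiency. per_node_tflops and comm_fraction are assumed values.

def effective_throughput(nodes: int,
                         per_node_tflops: float = 100.0,
                         comm_fraction: float = 0.0005) -> float:
    """Aggregate cluster throughput, discounted by communication
    overhead that grows with cluster size."""
    efficiency = max(0.0, 1.0 - comm_fraction * nodes)
    return nodes * per_node_tflops * efficiency

for n in (8, 64, 512):
    total = effective_throughput(n)
    print(f"{n:4d} nodes: {total:9.1f} TFLOPS "
          f"({total / n:.1f} TFLOPS per node)")
```

Even in this crude model, total throughput keeps rising while per-node throughput falls, which is exactly why network bandwidth and resource management become first-order concerns at cluster scale.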

In addition to infrastructure, the market applications of AI large models have reached a critical juncture, propelling diverse and nuanced computational demands. According to IDC, the application of large models in China is expected to transition into a practical phase by 2024, particularly in vertical sectors where commercialization is accelerating, especially given the rise of multi-modal large models.

This will create a rich tapestry of application scenarios, alongside varied and urgent demands for AI computational power.

The comprehensive demands placed on computational infrastructure by AI large models show that relying solely on traditional industry paradigms and a few leading firms is insufficient. It is essential to foster collaboration and innovation within the industry ecosystem. Hence, open computing has regained visibility and demonstrated its immense value through collaborative practice and innovation. David Ramku, a board member of the Open Compute Project (OCP), emphasized, "The rapid growth of artificial intelligence is reconstructing the ecosystem of data centers, and the globalization of open computing projects can maximize innovative potential."

In recent years, the OCP community has grown to over 360 members, reflecting a nearly 50% increase, with over 40 projects and sub-projects emerging.

Initiatives like the Open Accelerator Module (OAM), open liquid cooling specifications, and OpenBMC have yielded significant outcomes in enhancing AI computational quality and promoting innovation. The Open Computing Summit recently heralded the formal launch of the Open Computing Module (OCM) specifications, with initial members including the China Electronics Standardization Institute, Inspur Information, Intel, AMD, and Baidu, aimed at solving a series of challenges posed by diverse computational needs in the AI era.

The spark generated by large models has led to an unprecedented acceleration in AI application innovations, turning AI acceleration chips into highly sought-after components. However, the influx of diverse AI acceleration chip companies and products has led to a somewhat chaotic marketplace, increasing challenges related to compatibility and adaptability and complicating the user experience of AI computational products.

Achieving uniformity in the interface standards of various AI acceleration cards has thus become critical. This paved the way for the creation of the Open Accelerator Infrastructure (OAI) project in 2019, which aims to address the disparate forms and interfaces of AI acceleration cards within servers, as well as limitations related to interconnect efficiency and lengthy R&D cycles.

Among the OAI project initiatives, the OAM design specifications have progressed remarkably, gaining widespread backing from major AI chip companies and cloud providers, including Nvidia, Intel, AMD, Microsoft, Alibaba, Google, and Inspur Information. The initiative has highlighted the significant industrial value of hardware openness: OAM has become the prevailing unified design standard followed by the majority of high-end AI acceleration chips globally, with over 20 chip companies supporting the OAM standards.

In the realm of AI system development, the typical two- to three-year iteration cycles of AI chips pose significant challenges in designing and developing AI systems.

OAM specifications have transformed this dynamic, enabling AI chip vendors to save over six months in R&D time while accelerating product innovation among system vendors such as Inspur Information. Statistical analyses indicate that OAM design has saved billions of yuan in overall industry R&D expenditure in recent years, lowering the barriers to innovation in the AI computational industry and significantly meeting market demands.

Inspur Information was among the first system vendors to embrace and actively participate in the OAM specifications, defining the industry's first eight-card interconnect hardware system that complies with OAM standards. Its pioneering open computing system, MX1, supports a range of different AI acceleration chips, allowing various accelerators to share a unified server. This design gives users the flexibility to swap out AI acceleration chips according to their needs without replacing the entire system, greatly reducing the entry threshold for AI technology.

Last year, Inspur launched its NF5698G7 open acceleration computing platform based on the OAM v1.5 specifications, supporting multiple models of OAM-compliant acceleration chips, thereby enriching the OAM industrial ecosystem.

According to Zhao, "The OAM-based standardized platform not only accelerates the compatibility process of AI chips significantly but also facilitates the iterative upgrades of AI chip products. Consequently, this expedites the deployment and use of computational power, quickly supporting the innovative demands of large models and AI-generated content (AIGC) applications." Looking ahead, Inspur plans to introduce a new topology based on UBB 2.0 next year that will support the adaptation of dozens of OAM 2.0 products currently in development.

It is evident that OAM stands as a model of successful openness and collaboration in the open computing industrial chain.

As the AI wave continues to rise, OAM has harnessed AI-driven demand to enable efficient cooperation across the industrial chain through hardware products, design standards, and knowledge sharing. For instance, the proliferation of large AI clusters comprising thousands of cards has introduced stability challenges in training large models, with frequent interruptions leading to insufficient effective training time. In response, Inspur, ByteDance, and over ten other companies have jointly defined OAM monitoring management specifications, balancing the functionalities of different AI chips while enhancing data processing mechanisms. They have established multi-tier fault diagnosis protocols and standardized data transfer formats to mitigate training failures of AI chips.
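A multi-tier fault-diagnosis scheme of the kind described above can be sketched as a standardized fault record mapped to cluster-level responses. This is a hypothetical illustration in the spirit of the idea; the field names, tiers, and actions are invented for the sketch and are not taken from the actual OAM monitoring management specifications.

```python
# Hypothetical sketch of a tiered accelerator-fault record and the
# cluster response it triggers. All names and tiers are invented for
# illustration; the real specifications define their own formats.
from dataclasses import dataclass
from enum import Enum


class FaultTier(Enum):
    INFO = 0       # telemetry only, no action needed
    DEGRADED = 1   # card still usable, but drain its work
    CRITICAL = 2   # isolate the card, restore from checkpoint


@dataclass
class AcceleratorFault:
    card_id: str     # e.g. slot identifier of the OAM card
    vendor: str      # chips from different vendors share one format
    tier: FaultTier
    code: int        # vendor-reported fault code
    detail: str


def plan_action(fault: AcceleratorFault) -> str:
    """Map a fault tier to a cluster-level response, so training
    interruptions are handled uniformly across chip vendors."""
    return {
        FaultTier.INFO: "log",
        FaultTier.DEGRADED: "drain-and-reschedule",
        FaultTier.CRITICAL: "isolate-and-restore-checkpoint",
    }[fault.tier]


fault = AcceleratorFault("oam-03", "vendor-x", FaultTier.CRITICAL,
                         0x51, "uncorrectable memory errors")
print(plan_action(fault))  # isolate-and-restore-checkpoint
```

The design point the sketch captures is the one the paragraph makes: a shared record format and tier semantics let heterogeneous AI chips feed one fault-handling pipeline, which is what keeps effective training time high on thousand-card clusters.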

Moreover, the OAM specification continues to evolve, with future OAM 2.0-based AI acceleration cards anticipated to support interconnections among 1,024 cards, aiming to break through existing bottlenecks in large model networking.

While the computing industry has garnered significant attention in recent years due to AI computational power, traditional general-purpose computing seems to have been somewhat sidelined.

However, with AI large models penetrating various industries, the convergence of AI with devices like PCs, smartphones, and edge servers creates a new paradigm of computation. Hence, traditional computing needs to rise to the occasion, adopting capabilities that facilitate AI integration effectively.

As Zhao put it unabashedly, "In the future, not only AI chips but all computation must incorporate AI capabilities." General-purpose computing chips remain critical in the computing industry, exhibiting a flourishing development landscape. Multiple architectures, such as x86, ARM, and RISC-V, are rapidly evolving, highlighting a trend towards diversified computational power. However, the lack of unified CPU protocol standards raises substantial challenges for hardware development, firmware adaptation, and component testing, particularly as system power, bus rates, and current densities continue to escalate.

In light of the rich and rapidly changing application scenarios, Zhao underscored the pressing need for a unified computational foundation to address efficiency, compatibility, and iterative upgrades within CPU computing.

Thus, the announcement of the Open Computing Module (OCM) specifications at the Open Computing Summit attracted considerable industry attention. The OCM standard aims to decouple previously tightly coupled server architectures, treating the CPU and memory as the minimal computational unit and achieving compatibility via standardized high-speed interconnects, management protocols, and power interfaces.
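The decoupling idea behind OCM can be illustrated in software terms: once the CPU-plus-memory compute module sits behind a standardized interface, the same platform can host modules built around different CPU architectures. The class and method names below are invented for this illustration; the real OCM specification defines hardware interconnect, management, and power interfaces, not software classes.

```python
# Hypothetical illustration of interface-based decoupling: a platform
# depends only on a standardized module interface, so compute modules
# built on different CPU architectures are interchangeable.
# All names here are invented for the sketch.
from typing import Protocol


class ComputeModule(Protocol):
    """Standardized contract a compute module must satisfy."""
    arch: str

    def power_on(self) -> None: ...
    def health(self) -> dict: ...


class X86Module:
    arch = "x86"

    def power_on(self) -> None:
        print("x86 compute module powered on")

    def health(self) -> dict:
        return {"arch": self.arch, "ok": True}


class ArmModule:
    arch = "arm"

    def power_on(self) -> None:
        print("ARM compute module powered on")

    def health(self) -> dict:
        return {"arch": self.arch, "ok": True}


def bring_up(module: ComputeModule) -> dict:
    """The platform touches only the standardized interface, so
    swapping CPU architectures needs no platform redesign."""
    module.power_on()
    return module.health()


for m in (X86Module(), ArmModule()):
    print(bring_up(m))
```

The payoff mirrors the article's claim: vendors iterate modules independently behind the stable interface, and users swap in new CPUs without replacing the platform.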

Undoubtedly, the OCM standard carries monumental significance for open computing and the computing industry at large. With OCM, computational platform vendors can accelerate product iteration while enhancing R&D efficiency. Additionally, OCM-compliant computational platforms will allow users to swap CPUs according to different application needs, facilitating rapid access to cutting-edge computational technologies.

However, the push for OCM standardization may also usher in challenges of product homogeneity.

Inspur Information acknowledges that while standardization often leads to homogeneity, the movement towards standardized and open computing products reflects an inevitable industry trend. This shift not only enables rapid iteration and application of new technologies but also fosters a closer connection between vendors and users, facilitating the industrialization of innovative technologies.

The journey of AI transforming the world has only just begun, and for computational infrastructure, the emergence and implementation of standards like OAM and OCM mark merely the start of an evolving paradigm of computation. With escalating demands for computational power expected to persist, infrastructure must evolve extensively across management, operation, cooling, and heat dissipation to further catalyze AI innovation.

For instance, the growth of heterogeneous and diversified computing will inevitably pose challenges in managing numerous firmware platforms and adaptations.

To advance, Inspur released its InBry open management platform, based on OpenBMC, last year. The platform addresses the challenges of unifying multiple management standards and adapting various firmware branch versions, establishing a cohesive management protocol that enables asynchronous, customized iterative upgrades to propel AI advancements.

Moreover, the surging power consumption of AI chips raises its own concerns: as AI clusters comprising tens of thousands of cards proliferate, energy consumption in data centers has become a notable issue. The entire industrial chain needs to collaborate efficiently, pushing for the industrialization of liquid cooling technologies so they can be integrated into every data center. To address this, Inspur has partnered with industry stakeholders to set forth four liquid cooling-related standards aimed at transitioning GPUs, CPUs, and other computational components to liquid cooling, establishing modular standardized interfaces and liquid-cooled cabinets to tackle the energy consumption challenges arising from large-scale AI clusters.

In summary, open computing's significance for the future of the computing industry is profound.