Introducing Fujitsu KG Enhanced RAG (4 sessions) #1 Fujitsu KG Enhanced RAG for RCA (Root Cause Analysis)

Hello. I'm Kikuzuki from the Artificial Intelligence Laboratory.

To promote the use of generative AI in enterprises, Fujitsu has developed a generative AI framework for enterprises that can flexibly respond to diverse and changing corporate needs, handle the vast amount of data held by a company, and easily comply with laws and regulations. The framework was successfully launched in July 2024 as part of the AI service lineup of Fujitsu Kozuchi (R&D).

This article introduces Fujitsu Knowledge Graph Enhanced RAG for RCA (Root Cause Analysis), one of the technologies that make up this framework. ( *1 )

Some of the challenges that enterprise customers face when leveraging specialized generative AI models include:

Difficulty handling large amounts of data required by the enterprise
Generative AI cannot meet cost, response speed, and various other requirements
Requirement to comply with corporate rules and regulations

To address these challenges, the generative AI framework for enterprises consists of the following technologies:

Fujitsu Knowledge Graph enhanced RAG
Amalgamation Technology
Generative AI Audit Technology

In this series, we introduce one of the "Fujitsu Knowledge Graph enhanced RAG" technologies each week. We hope this helps you solve your problems. At the end of the article, we also explain how to try out the technology.

Fujitsu Knowledge Graph Enhanced RAG Technology Overcomes the Weakness of Generative AI that Cannot Accurately Reference Large-Scale Data

Existing RAG techniques, which let generative AI refer to related material such as internal documents, struggle to reference large-scale data accurately. To solve this problem, we have developed Fujitsu Knowledge Graph enhanced RAG (hereinafter, KG enhanced RAG). This technology extends existing RAG technology by automatically creating a knowledge graph that structures the huge amount of data owned by companies, such as corporate regulations, laws, manuals, and videos. It expands the amount of data an LLM can refer to from hundreds of thousands or millions of tokens to more than 10 million tokens. In this way, relationship-based knowledge from the knowledge graph can be fed accurately to the generative AI, which can then reason logically and show the rationale for its output.

This technology consists of four technologies, depending on the target data and the application scene.

(1) Root Cause Analysis (This Article)

This technology creates a report on the occurrence of a failure based on system logs and failure case data, and suggests countermeasures based on similar failure cases.

(2) Question & Answer (Now Showing)

This technology makes it possible to conduct advanced Q&A based on a comprehensive view of a large amount of document data such as product manuals.

(3) Software Engineering (Now Showing)

This technology not only understands source code, but also generates high-level functional design documents and summaries, and enables modernization.

(4) Vision Analytics (Now Showing)

This technology can detect specific events and dangerous actions from video data, and even propose countermeasures.

In this article, I will introduce (1) Root Cause Analysis (hereinafter, RCA) in detail.

What is the KG Enhanced RAG for RCA?

What are the Benefits of RCA?

This technology enables RCA. First, we explain in what situations RCA can be used and what benefits it brings.

Familiar failures include, for example, problems with smartphones and PCs such as "I turned on Wi-Fi, but I can't connect", or problems with consumer electronics such as "I turned on the washer and dryer, but it's not dry at all". Most of the time, the problem can be solved by searching for the event or error code on the internet or by reading the manual, trying the solution found, or replacing the product because it is already faulty.

However, failures in business use cases are not so easy to deal with. For example, in large-scale network operation and maintenance, cloud system operation, and plant maintenance, the complexity of the system and equipment configuration makes it difficult to detect whether a fault has occurred and, if so, where. Even if, for example, transmission/reception errors or huge latencies occur in the network, there are too many possible related events to narrow down, such as hardware degradation, misconfiguration, or temporary traffic increases. At the same time, in large-scale systems a single failure can cause billions of yen in damages, and the impact on business can be enormous. Therefore, how to prevent failures or speed up RCA has become a critical issue. Currently, specialists spend many man-hours analyzing and investigating the wide range of possible root causes, and there is a growing demand for automation and assistive technologies for RCA, with the aim of reducing the workload on specialists and achieving fault-free operation.

How does KG Enhanced RAG for RCA change the World?

With KG enhanced RAG for RCA, the descriptions in a large number of documents, such as troubleshooting guides and system specifications, can be analyzed automatically, and the system configuration and the complex causal relationships that lead to failures can be extracted automatically. As a result, a graph structure called the "causal knowledge graph" can be constructed automatically, and combining this causal knowledge graph with an LLM (e.g. GPT-4, Takane *2 ) enables rapid analysis of the causes of failures. For example, simply by asking in natural language, "What is the cause of the communication disconnection failure?", users can not only obtain highly accurate answers but also receive an isolation procedure, for example, "If you check the log and find XX, it is a configuration error, and if YY is also occurring, the firmware version is considered unsuitable". This makes it possible to obtain investigation results immediately, instead of having experts read, analyze, and investigate a large number of documents. It also provides a more comprehensive, rework-free recovery procedure, which can significantly shorten the time required to recover from a fault.
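To make the idea of querying a causal knowledge graph more concrete, the following is a minimal sketch in Python. It is not Fujitsu's implementation: the edge list, the event names, and the function names `build_cause_index` and `find_root_causes` are all hypothetical, and the LLM step is omitted. It only illustrates how walking upstream from an observed symptom can yield every candidate root cause together with the chain of intermediate events.

```python
from collections import deque

# Hypothetical causal knowledge graph: each edge points from a cause to its effect.
# Event names are illustrative only and do not come from a real system.
CAUSAL_EDGES = [
    ("configuration error", "link flapping"),
    ("unsuitable firmware version", "link flapping"),
    ("link flapping", "communication disconnection"),
    ("hardware degradation", "transmission/reception errors"),
    ("transmission/reception errors", "communication disconnection"),
]


def build_cause_index(edges):
    """Map each effect to the list of events that can directly cause it."""
    index = {}
    for cause, effect in edges:
        index.setdefault(effect, []).append(cause)
    return index


def find_root_causes(edges, symptom):
    """Walk upstream from a symptom and return every causal chain that ends
    in a root cause (an event with no recorded cause of its own)."""
    index = build_cause_index(edges)
    chains = []
    queue = deque([[symptom]])
    while queue:
        path = queue.popleft()
        causes = index.get(path[-1], [])
        if not causes:
            # Reached an event with no known cause; keep it unless it is
            # the symptom itself (i.e. the symptom is not in the graph).
            if len(path) > 1:
                chains.append(path[::-1])  # order as root cause -> symptom
            continue
        for cause in causes:
            if cause not in path:  # guard against cycles
                queue.append(path + [cause])
    return chains


if __name__ == "__main__":
    for chain in find_root_causes(CAUSAL_EDGES, "communication disconnection"):
        print(" -> ".join(chain))
```

In a sketch like this, each returned chain could be handed to an LLM as context so that it can phrase the answer and the isolation procedure in natural language.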

What are the challenges of existing technologies?

Due to the great demand for automation and assistive technologies for RCA, various technologies have been studied and developed. Typical approaches are rule-based methods ( *3 ) and RAG-based methods ( *4 ).

The rule-based methods are algorithms or AI models that classify faults according to rules and categories predefined by experts, so the types of faults they can handle are limited to those that were predefined. In particular, systems have become increasingly complex in recent years, and extending this method to keep up has become increasingly impractical.

The RAG-based method automates RCA by applying RAG technology to a large number of troubleshooting guides. Unlike the rule-based method, the quality of the answers improves automatically as failure cases accumulate, so even complex systems can be handled to some extent. However, even though the various events are linked by complex causal relationships, as shown in the "KG(s) showing causal relationships of incidents" in the diagram above, the RAG-based method can only analyze each case as an isolated fragment. As a result, it can only provide root causes that are exactly the same as those that occurred in the past.

As described above, existing approaches were unable to provide an exhaustive RCA that takes complex causal relationships into account.

How have we realized KG enhanced RAG for RCA?

To solve the problems of existing technologies, we have proposed the causal knowledge graph, which can be used to analyze complex causal relationships and provide comprehensive, high-quality RCA results as well as isolation procedures. The most important feature of KG enhanced RAG for RCA is the automatic construction of this causal knowledge graph.

Traditionally, knowledge graphs were created manually over an enormous amount of time, but it is not realistic to keep creating them by hand for the daily increasing number of failure cases. With the advent of LLMs, OSS tools that automate the creation of knowledge graphs have appeared ( *5 ), but they can only exhaustively extract nouns and their relationships from a document. Knowledge graphs built this way can answer quiz-like questions such as "What is Primergy's maximum HDD capacity?". However, they cannot be applied to RCA, because they extract nouns that are unrelated to the causal relationships of the fault, or because the causal chain is broken when the same event is paraphrased in different ways. Therefore, it was necessary to develop technology for the automatic construction of a causal knowledge graph.
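The paraphrasing problem can be illustrated with a small Python sketch. Everything here is hypothetical: the two triples, the alias table `ALIASES`, and the helper names are made up for illustration, and a hand-written alias table is not how KG enhanced RAG for RCA resolves paraphrases. The point is only that when the same event is written two different ways, a cause and its eventual effect end up in disconnected parts of the graph.

```python
# Hypothetical triples extracted from documents that paraphrase the same event.
TRIPLES = [
    ("firmware bug", "causes", "port goes down"),
    ("port is down", "causes", "communication disconnection"),
]

# Hypothetical alias table mapping paraphrases to one canonical event name.
ALIASES = {
    "port goes down": "port down",
    "port is down": "port down",
}


def reachable_effects(triples, start):
    """Return every event reachable from `start` along cause -> effect edges."""
    effects = set()
    stack = [start]
    while stack:
        node = stack.pop()
        for cause, _relation, effect in triples:
            if cause == node and effect not in effects:
                effects.add(effect)
                stack.append(effect)
    return effects


def normalize(triples, aliases):
    """Rewrite both ends of every triple to its canonical event name."""
    return [
        (aliases.get(cause, cause), relation, aliases.get(effect, effect))
        for cause, relation, effect in triples
    ]


if __name__ == "__main__":
    # With paraphrased event names the chain stops after one hop.
    print(reachable_effects(TRIPLES, "firmware bug"))
    # Once both paraphrases share one name, the full chain is connected.
    print(reachable_effects(normalize(TRIPLES, ALIASES), "firmware bug"))
```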

We have therefore established the world's first technology to appropriately guide the focus of an LLM so that it extracts causal relationships between events. We focused on the fact that the events in a fault's causal relationship follow a certain pattern, such as "there is a cause of the fault, there is a fault, there are some intermediate events in between, and ...". We have predefined this pattern of fault causal relationships as a graph structure called the "knowledge graph schema" and have developed a framework for analyzing input documents according to the schema. The framework consists of flow transformation technology, which automatically transforms the knowledge graph schema into a processing flow, and prompt transformation technology, which constructs concrete LLM instruction prompts. This enables the automatic construction of causal knowledge graphs.
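As a rough illustration of how a schema could drive both a processing flow and a set of prompts, here is a minimal Python sketch. It is an assumption-laden toy, not the actual flow transformation or prompt transformation technology: the schema contents, the node and relation names, the breadth-first ordering, the prompt wording, and the functions `schema_to_flow` and `flow_to_prompts` are all hypothetical.

```python
# Hypothetical knowledge graph schema: (source type, relation, target type).
SCHEMA_EDGES = [
    ("Cause", "leads_to", "IntermediateEvent"),
    ("IntermediateEvent", "leads_to", "Fault"),
    ("Fault", "resolved_by", "Countermeasure"),
]


def schema_to_flow(schema_edges, anchor):
    """Flow transformation (sketch): order extraction steps breadth-first
    from the anchor type, so each step can build on earlier results."""
    steps = [{"extract": anchor, "known": None, "relation": None}]
    seen = {anchor}
    frontier = [anchor]
    while frontier:
        next_frontier = []
        for known in frontier:
            for src, rel, dst in schema_edges:
                if src == known and dst not in seen:
                    new_type = dst
                elif dst == known and src not in seen:
                    new_type = src
                else:
                    continue
                seen.add(new_type)
                steps.append({
                    "extract": new_type,
                    "known": known,
                    "relation": f"{src} {rel} {dst}",
                })
                next_frontier.append(new_type)
        frontier = next_frontier
    return steps


def flow_to_prompts(steps, document):
    """Prompt transformation (sketch): turn each step into an LLM instruction."""
    prompts = []
    for step in steps:
        if step["known"] is None:
            instruction = f"List every {step['extract']} described in the document."
        else:
            instruction = (
                f"For each {step['known']} already found, list every "
                f"{step['extract']} that satisfies the relation "
                f"'{step['relation']}'."
            )
        prompts.append(f"{instruction}\n\nDocument:\n{document}")
    return prompts


if __name__ == "__main__":
    flow = schema_to_flow(SCHEMA_EDGES, anchor="Fault")
    for prompt in flow_to_prompts(flow, "<troubleshooting guide text>"):
        print(prompt.splitlines()[0])
```

Starting from the fault and working outward is just one plausible ordering; the key idea the sketch tries to convey is that the schema, rather than a hand-written script, determines which questions the LLM is asked and in what order.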

How to Try out KG Enhanced RAG for RCA

This technology is positioned as one of the "AI Core Engines" in Fujitsu Kozuchi, a platform that allows users to quickly try out advanced AI technologies researched and developed by Fujitsu. It has been released as a web application under the name "Fujitsu Knowledge Graph enhanced RAG for Root Cause Analysis".

The diagram below shows an example screen of this web application. After registering documents such as troubleshooting guides and system specifications, simply enter a description of the failure event in natural language, as shown in the "Query input", and the results of the RCA are returned immediately, as shown in the "Response". If you are having trouble with RCA, please use this service!

*1:RAG technology - Retrieval Augmented Generation. A technology that combines and extends the capabilities of generative AI with external data sources.

*2:Takane - A large language model for enterprises offering the highest Japanese language proficiency in the world, jointly developed by Fujitsu and Cohere. Press release available here

*3:rule-based methods - e.g. JP 2011-171981, etc.

*4:RAG-based methods - e.g. Y. Chen et al., "Automatic Root Cause Analysis via Large Language Models for Cloud Incidents," arXiv:2305.15778, 2023, etc.

*5:Automatic creation of knowledge graphs - e.g. LangChain's LLMGraphTransformer, etc.