Introducing Fujitsu Auto Data Wrangling - fltech - Fujitsu Research Technical Blog

Hello, I'm Lei Liu from our Artificial Intelligence Laboratory. Today, I would like to introduce an exciting new AI core engine, Fujitsu Auto Data Wrangling, which is now available on Fujitsu Kozuchi.

This article is part of a series introducing the AI core engines of Fujitsu Kozuchi. At the end of the article, you can find a list of all the previous articles!

Fujitsu Auto Data Wrangling is a technology that automatically transforms tabular data into a format suitable for AI learning, leveraging generative AI, machine learning algorithms, and a variety of automation techniques. It is one of the AI core engines of Fujitsu Kozuchi, which enables fast testing of cutting-edge AI technologies developed by Fujitsu. The main motivation behind Fujitsu Auto Data Wrangling is to reduce the manual workload of data scientists, who typically spend about 80% of their time on data preparation before they can apply AI techniques.

AI can perform various types of prediction, such as classification and regression. For instance, in the manufacturing sector, AI can predict how to repair a product based on factors such as the type of product defect, model names, and product repair data. However, before AI models can make such predictions effectively, they must first learn from the available data, and in many cases they struggle to learn directly from tabular data. This is often due to data cleanliness issues, such as inconsistent value formats and unstructured fields like 'notes'. Even if an AI model manages to learn from the data in its current form, its accuracy may be insufficient, necessitating further data enrichment.

Consequently, until now, humans have had to spend time on preparation tasks before AI can learn from tabular data. These tasks include data cleaning (such as unifying value formats and converting data into a form compatible with AI learning) and data enrichment (such as transforming the data to improve model accuracy).

Benefits of Fujitsu Auto Data Wrangling and how to use it

The principal value of Fujitsu Auto Data Wrangling lies in streamlining the data preparation needed before AI training, thereby reducing both the time and the human resources required. Furthermore, it has the potential to significantly enhance AI accuracy by automatically enriching the dataset with new features.

First, let me use an example to elaborate on its effectiveness in reducing time and human effort in data preparation. Consider a tabular dataset that contains diverse types of information such as text, dates, numbers, categories, and URLs. Fujitsu Auto Data Wrangling automatically predicts the types of these columns (feature type inference) and subsequently transforms them into a format compatible with AI processing.
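
To make the idea of feature type inference concrete, here is a minimal rule-based sketch in Python. It is only an illustration under my own assumptions: the type names, the date formats, and the `category_ratio` threshold are hypothetical, and the actual engine infers types from a much broader range of feature types.

```python
import re
from datetime import datetime

def _is_number(value):
    try:
        float(value)
        return True
    except ValueError:
        return False

def _is_date(value):
    # Assumed formats for this sketch only.
    for fmt in ("%Y-%m-%d", "%Y/%m/%d"):
        try:
            datetime.strptime(value, fmt)
            return True
        except ValueError:
            pass
    return False

def _is_url(value):
    return re.match(r"^https?://\S+$", value) is not None

def infer_feature_type(values, category_ratio=0.5):
    """Guess a column's type from its non-empty string cells."""
    cells = [v.strip() for v in values if v and v.strip()]
    if not cells:
        return "empty"
    checks = (("number", _is_number), ("date", _is_date), ("url", _is_url))
    for name, check in checks:
        if all(check(cell) for cell in cells):
            return name
    # Few distinct values relative to the row count suggests a category.
    if len(set(cells)) <= category_ratio * len(cells):
        return "category"
    return "text"
```

Each column of a table would be passed through `infer_feature_type` separately, and the resulting label would decide which transformation is applied next.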

As shown in the following tabular data, the 'Price' column may contain a mix of characters and numbers representing both units and numerical values. Fujitsu Auto Data Wrangling automatically removes the common unit part, retaining only the numerical values as input data. In the 'Release' column, which is assumed to contain dates, it parses the information into distinct columns for year, month, and day. Furthermore, in cases like the 'Genre' column, where data is presented as a comma-separated list, it splits the items into separate columns, such as 'Games' and 'Life'.
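
The three transformations described above can be sketched in plain Python as follows. This is a simplified illustration, not the engine's implementation: the function names, the `%Y-%m-%d` date format, and the sample values in the usage note are all assumptions.

```python
import os
import re
from datetime import datetime

def strip_common_unit(values):
    """Drop the non-numeric prefix/suffix shared by all values, then parse floats."""
    prefix = os.path.commonprefix(values)
    suffix = os.path.commonprefix([v[::-1] for v in values])[::-1]
    # Keep only the non-numeric part so shared digits are not treated as a unit.
    prefix = re.match(r"[^\d.\-]*", prefix).group()
    suffix = re.search(r"[^\d.]*$", suffix).group()
    return [float(v[len(prefix):len(v) - len(suffix)]) for v in values]

def split_date(value, fmt="%Y-%m-%d"):
    """Split a date string into year, month, and day parts."""
    parsed = datetime.strptime(value, fmt)
    return {"year": parsed.year, "month": parsed.month, "day": parsed.day}

def expand_list_column(values, sep=","):
    """Turn a separated list column into one 0/1 column per distinct item."""
    rows = [[item.strip() for item in v.split(sep) if item.strip()] for v in values]
    labels = sorted({item for row in rows for item in row})
    return [{label: int(label in row) for label in labels} for row in rows]
```

For example, `strip_common_unit(["4.99 USD", "0.99 USD"])` keeps only the numeric part of a hypothetical 'Price' column, and `expand_list_column(["Games, Life", "Life"])` produces one 'Games' and one 'Life' indicator per row.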

Such preprocessing takes considerable time when carried out by hand. Fujitsu Auto Data Wrangling performs these tasks automatically, reducing manual effort by 90% or more.

Moreover, Fujitsu Auto Data Wrangling can automatically enrich the dataset. Take, for instance, the 'Phenomena' field in the product repair data provided below, which contains a wide variety of free text. Machine learning algorithms struggle to handle such unstructured data effectively. However, Fujitsu Auto Data Wrangling can automatically extract crucial keywords from these texts and add them as new data, thereby enriching the dataset. Such data enrichment plays a pivotal role in improving prediction accuracy.

In the product repair data, pertinent keywords such as 'Power' and 'Image' are extracted from the unstructured 'Phenomena' field and appended to the table as separate columns. When this enriched dataset was used to train Fujitsu AutoML *1, a core engine available on Fujitsu Kozuchi, to predict repair methods, accuracy improved by over 15% compared with training on the original dataset.
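
As a rough illustration of this kind of keyword enrichment, the sketch below uses simple document-frequency counting as a stand-in for the engine's actual extraction method, which is not described here. The stopword list, the thresholds, and the `Phenomena_<keyword>` column naming are assumptions made for the example.

```python
import re
from collections import Counter

# Tiny illustrative stopword list; a real one would be much larger.
STOPWORDS = {"a", "after", "does", "is", "no", "not", "off", "on", "the"}

def tokenize(text):
    """Return the set of lowercase non-stopword tokens in a text."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS}

def extract_keywords(texts, top_k=2, min_rows=2):
    """Pick the words that occur in the most rows (ties broken alphabetically)."""
    doc_freq = Counter()
    for text in texts:
        doc_freq.update(tokenize(text))
    ranked = sorted(doc_freq.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, n in ranked[:top_k] if n >= min_rows]

def add_keyword_columns(rows, field, keywords):
    """Append one 0/1 column per keyword to each row (rows are dicts)."""
    enriched = []
    for row in rows:
        tokens = tokenize(row[field])
        new_row = dict(row)
        for kw in keywords:
            new_row[f"{field}_{kw}"] = int(kw in tokens)
        enriched.append(new_row)
    return enriched
```

Given hypothetical rows such as "Power does not turn on" and "No image on screen", the recurring words become new indicator columns that a tabular model can consume directly.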

*1 Fujitsu AutoML: Fujitsu's unique AutoML technology that automatically generates AI models from tabular data

Fujitsu Auto Data Wrangling works as a data preparation tool, specializing in data cleaning and enrichment. When integrated with core engines and components accessible on Fujitsu Kozuchi, such as Fujitsu AutoML and the Defect factor analysis component *2, it streamlines data preparation processes and improves AI accuracy. Moreover, it can also be applied to other existing machine learning tools that require cleaned and enriched tabular data.

*2 Defect factor analysis component: Improves product quality by analyzing infrequent defect data to determine the conditions of the manufacturing process under which defects occur and mitigating those factors.

Features of Fujitsu Auto Data Wrangling technology

Fujitsu Auto Data Wrangling offers cutting-edge data cleaning and enrichment capabilities by using Large Language Models (LLMs), which distinguishes it from existing tools. The technology has the following key features.

Data cleaning

Fujitsu Auto Data Wrangling conducts precise 'Feature Type Inference' by validating the type of each column against a broad range of feature types. Using the inference results, it automatically selects the optimal feature type, detects possible cleanliness issues in the dataset, and then cleans the data to remove inconsistencies and errors so that it is compatible with training AI models.
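
The detect-then-clean flow can be pictured with a small sketch for a categorical column. The issue names and the list of missing-value tokens below are hypothetical choices for this example; the checks the engine actually performs are not listed in this article.

```python
# Assumed spellings of "missing"; real data may need a different list.
MISSING_TOKENS = {"", "n/a", "na", "null", "none", "-", "?"}

def detect_issues(values):
    """Report simple cleanliness issues found in a column of strings."""
    issues = set()
    stripped = [v.strip() for v in values]
    if any(v != s for v, s in zip(values, stripped)):
        issues.add("surrounding_whitespace")
    if any(s.lower() in MISSING_TOKENS for s in stripped):
        issues.add("missing_value_token")
    present = [s for s in stripped if s.lower() not in MISSING_TOKENS]
    # Values that differ only by letter case indicate inconsistent formatting.
    if len({s.lower() for s in present}) < len(set(present)):
        issues.add("inconsistent_case")
    return issues

def clean_category_column(values):
    """Trim whitespace, unify case, and map missing tokens to None."""
    cleaned = []
    for v in values:
        s = v.strip()
        cleaned.append(None if s.lower() in MISSING_TOKENS else s.lower())
    return cleaned
```

Separating detection from cleaning also makes it possible to show the user which issues were found before any value is changed.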

Data enrichment

By utilizing the optimal feature types inferred for each column, various data enrichment functions are applied to enrich the dataset, thereby potentially boosting prediction accuracy. Fujitsu Auto Data Wrangling provides a range of processors designed for enriching columns, encompassing features such as sentences, units, IDs, ranges, lists, datetimes, URLs, embedded numbers in strings, and more.
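
To give a feel for what such per-type processors might look like, here is a minimal sketch for three of the listed types: ranges, URLs, and numbers embedded in strings. The function names and the derived feature names are assumptions for illustration only; `urlsplit` is the standard-library URL parser.

```python
import re
from urllib.parse import urlsplit

_NUM = r"-?\d+(?:\.\d+)?"

def enrich_range(value):
    """Split a range such as '10-20' or '10 to 20' into numeric features."""
    match = re.match(rf"^\s*({_NUM})\s*(?:-|~|to)\s*({_NUM})\s*$", value)
    if match is None:
        return None
    low, high = float(match.group(1)), float(match.group(2))
    return {"min": low, "max": high, "width": high - low}

def enrich_url(value):
    """Derive simple structural features from a URL."""
    parts = urlsplit(value)
    segments = [p for p in parts.path.split("/") if p]
    return {
        "scheme": parts.scheme,
        "host": parts.hostname,
        "path_depth": len(segments),
        "has_query": bool(parts.query),
    }

def extract_numbers(value):
    """Pull unsigned numbers embedded in a string, e.g. 'Model X200 rev 3'."""
    return [float(n) for n in re.findall(r"\d+(?:\.\d+)?", value)]
```

Which processor runs on a column would follow from the feature type inferred for it, so a column typed as a range would go through `enrich_range` and so on.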

Achieving automation and scalability

While other data wrangling tools often require users to configure or set up procedures manually, Fujitsu Auto Data Wrangling prioritizes automation, minimizing human effort throughout the process. Users can complete an entire data wrangling workflow with just a few clicks. Fujitsu Auto Data Wrangling also scales well because it employs multiple lightweight open-source LLMs, chosen according to the specific data wrangling process. For instance, it can process data comprising 50,000 rows in approximately 10 minutes.

Advanced functions for improved usability, explainability, and performance

By leveraging a commercial LLM (GPT-4) and other automation techniques, the latest version (v2) of Fujitsu Auto Data Wrangling introduces several advanced functions. These include prediction engineering, which automatically predicts the machine learning task type and target columns; data enrichment tailored to formatted ID columns (e.g., container IDs and ISBNs); and explainability features that help users understand what changes were made to the dataset and why. We also offer code generation for data wrangling processes, which lets data scientists customize the generated code for further applications. In addition, we have developed a table-merge function that joins multiple input tables to a target table by automatically extracting the most appropriate relations (join keys) from each table. Prototypes of these advanced functions are available on Kozuchi for business users.
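
One simple way to think about join-key extraction for the table-merge function is value overlap between columns. The sketch below is a heuristic of my own for illustration, not the method used by the engine: it scores each column pair by how many target values appear in the other table (containment) and by how unique the candidate key is there. Tables are represented as dicts mapping column names to value lists, and all names are hypothetical.

```python
def infer_join_key(target, other):
    """Return (score, target_column, other_column) for the best-matching pair."""
    best = (0.0, None, None)
    for t_col, t_vals in target.items():
        t_set = set(t_vals)
        for o_col, o_vals in other.items():
            o_set = set(o_vals)
            if not t_set or not o_set:
                continue
            # Share of distinct target values that can be looked up in the other table.
            containment = len(t_set & o_set) / len(t_set)
            # A good key is (nearly) unique in the table being joined in.
            uniqueness = len(o_set) / len(o_vals)
            score = containment * uniqueness
            if score > best[0]:
                best = (score, t_col, o_col)
    return best
```

With a hypothetical orders table containing `product_id` and a products table containing `id`, the pair whose values overlap completely and whose key is unique on the product side receives the highest score.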

Interested in testing Fujitsu Kozuchi?

If you're a data scientist, data analyst, or machine learning engineer who devotes substantial time to preprocessing tabular data for downstream applications, Fujitsu Auto Data Wrangling can offer significant time savings. Simply upload your tabular data and obtain cleaned and enriched datasets with just a few clicks. You can easily customize the data cleaning and data enrichment options to suit your preferences, and you can also download the source code for use in your downstream tasks.

You can try Fujitsu Auto Data Wrangling for free on the Fujitsu Research Portal.