
fltech - Fujitsu Research Technology Blog


Introducing Fujitsu AutoML for Vision

Hello, I'm Hiroaki Yamane from the Artificial Intelligence Laboratory. Today, I would like to introduce an exciting new AI core engine – Fujitsu AutoML for Vision – which is now available from Fujitsu Kozuchi.

This article is part of a series introducing the AI core engines of Fujitsu Kozuchi. At the end of the article, you can find a list of all the previous blogs!

Developing an AI capable of recognizing objects within images demands extensive preparation of training data, substantial computational resources, and specialized expertise in AI. To address this challenge, we've introduced Fujitsu AutoML for Vision.

Fujitsu AutoML for Vision simplifies visual AI by letting users state in natural language what they would like to recognize in an image. For instance, users can ask it to detect specific objects and count them. It can recognize items within an image, recognize locations such as cities or forests, and identify properties of a scene such as the weather. All of this can be achieved without extensive datasets and with minimal computational resources.

Fujitsu AutoML for Vision achieves accuracy levels comparable to GPT-4V in visual tasks, at a fraction of the cost. It serves as one of the key AI core engines within Fujitsu Kozuchi, facilitating rapid testing of leading-edge AI technologies developed by Fujitsu.

Benefits of Fujitsu AutoML for Vision and how to use it

The idea behind Fujitsu AutoML for Vision is to automatically engineer AI solutions for visual problems, without needing a human expert to do so. To help you get to know Fujitsu AutoML for Vision, I will explain it step by step.

First, the user states what they want the visual AI to do. Specifically, they enter the "question for the AI" and the "options for the answer" in plain natural language. For example, the question could be "How many whales are there?" and the answer options could be [None, 1, 2, Many]. These are submitted together with the image.
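The three inputs described above can be pictured as a small record. The following is a minimal sketch only: the class name VisionQuery, its field names, and the file name are hypothetical and are not part of the Fujitsu Kozuchi API.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class VisionQuery:
    """One request: an image plus a question and its candidate answers.

    All names here are hypothetical; they only illustrate the three inputs.
    """

    image_path: str
    question: str
    answer_options: List[str]

    def __post_init__(self) -> None:
        # Reject requests that cannot be answered meaningfully.
        if not self.question.strip():
            raise ValueError("question must not be empty")
        if len(self.answer_options) < 2:
            raise ValueError("at least two answer options are needed")
        if len(set(self.answer_options)) != len(self.answer_options):
            raise ValueError("answer options must be unique")


# The whale-counting example from the text.
query = VisionQuery(
    image_path="coast_camera_frame.jpg",  # placeholder file name
    question="How many whales are there?",
    answer_options=["None", "1", "2", "Many"],
)
```

Keeping the answer options as an explicit list means the answer is always one of the values the user supplied, which makes the result easy to use in downstream logic.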

To answer the question about the image, Fujitsu AutoML for Vision selects the most suitable visual AI model from a range of pre-trained foundation models. Central to this system is a specially trained Large Language Model (LLM) known as the LLM Planner, which selects the visual AI model best suited to the question asked about the image. Numerous pre-trained AI models exist worldwide, but because each visual model has different capabilities, it is often unclear which one is most appropriate for a given question.

Fujitsu AutoML for Vision efficiently chooses the best-suited visual AI model and employs it to generate responses to questions. This enables users, even those lacking expertise in AI or familiarity with individual AI models' features, to effectively utilize numerous AI models by simply inputting questions and answer options.
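To make the selection step concrete, here is a deliberately simplified sketch. The real Planner is a trained LLM; the keyword lookup below only stands in for its role. The task categories, keyword lists, and model names are all assumptions chosen for illustration, not the actual models or logic used by Fujitsu AutoML for Vision.

```python
from typing import Dict, List, Tuple

# Hypothetical task categories, loosely based on the question types in the article.
TASK_KEYWORDS: Dict[str, List[str]] = {
    "counting": ["how many", "number of"],
    "scene": ["where", "location", "place"],
    "attribute": ["weather", "season", "time of day"],
}

# Hypothetical registry: which kind of pre-trained model handles which task.
MODEL_REGISTRY: Dict[str, str] = {
    "counting": "object-detector",
    "scene": "scene-classifier",
    "attribute": "image-text-matcher",
    "recognition": "image-text-matcher",
}


def plan_task(question: str) -> str:
    """Stand-in for the LLM Planner: map a question to a task category."""
    lowered = question.lower()
    for task, keywords in TASK_KEYWORDS.items():
        if any(keyword in lowered for keyword in keywords):
            return task
    return "recognition"  # fallback for generic "what is in the image" questions


def select_model(question: str) -> Tuple[str, str]:
    """Return the planned task and the name of the model chosen for it."""
    task = plan_task(question)
    return task, MODEL_REGISTRY[task]


def count_to_option(count: int, options: List[str]) -> str:
    """Map a raw object count onto answer options such as [None, 1, 2, Many]."""
    if str(count) in options:
        return str(count)
    if count == 0:
        return "None" if "None" in options else options[0]
    return "Many" if "Many" in options else options[-1]


print(select_model("How many whales are there?"))
print(count_to_option(5, ["None", "1", "2", "Many"]))
```

The point of the sketch is the division of labor: the planner decides what kind of problem the question poses, and a separate pre-trained model does the visual work. The user only ever supplies the question and the answer options.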

In this way, users can obtain answers to their questions without any training data. The accuracy in visual tasks rivals that of GPT-4V, at a fraction of the cost.

Furthermore, Fujitsu AutoML for Vision can use training data to improve recognition accuracy even more. Existing pre-trained vision models can recognize a wide variety of objects. However, there are cases where you want to recognize a specific object, for example a certain type of whale, with higher accuracy. In such cases, it is effective to retrain (fine-tune) the AI model with images of that specific object. Because pre-trained models have already learned from large amounts of training data, you only need new data tailored to the specific objective, in this case images of a specific type of whale. As a result, users can obtain a more accurate image recognition AI by preparing only a minimal amount of training data.
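The following toy example illustrates why a small dataset can be enough. It assumes one simple form of fine-tuning, often called linear probing: the pre-trained model is kept frozen as a feature extractor and only a small classification head is trained. The article does not say which fine-tuning method Fujitsu AutoML for Vision uses, and the "backbone" and "images" here are placeholders made of plain numbers.

```python
import math
from typing import List, Sequence, Tuple


def frozen_backbone(image: Sequence[float]) -> List[float]:
    """Placeholder for a pre-trained model whose weights are never updated.

    An "image" is just a list of numbers and the features are two fixed
    summary statistics; a real backbone would be a neural network.
    """
    mean = sum(image) / len(image)
    spread = max(image) - min(image)
    return [mean, spread]


def train_head(
    data: List[Tuple[Sequence[float], int]], epochs: int = 200, lr: float = 0.5
) -> List[float]:
    """Fit a logistic-regression head [w_mean, w_spread, bias] on frozen features."""
    weights = [0.0, 0.0, 0.0]
    for _ in range(epochs):
        for image, label in data:
            features = frozen_backbone(image) + [1.0]  # constant input for the bias
            logit = sum(w * x for w, x in zip(weights, features))
            prob = 1.0 / (1.0 + math.exp(-logit))
            # Gradient step on the log loss; only the head's weights change.
            for i, x in enumerate(features):
                weights[i] -= lr * (prob - label) * x
    return weights


def predict(weights: Sequence[float], image: Sequence[float]) -> int:
    """Return 1 for the target class and 0 otherwise."""
    features = frozen_backbone(image) + [1.0]
    logit = sum(w * x for w, x in zip(weights, features))
    return 1 if logit > 0 else 0


# Four labeled examples stand in for the "minimal amount of training data".
training_data = [
    ([0.9, 0.8, 1.0], 1),
    ([0.1, 0.2, 0.0], 0),
    ([0.7, 0.9, 0.8], 1),
    ([0.3, 0.1, 0.2], 0),
]
head = train_head(training_data)
print(predict(head, [0.95, 0.85, 0.9]), predict(head, [0.05, 0.15, 0.1]))
```

Only three numbers are learned here, which is why a handful of examples suffices. The general knowledge lives in the frozen part, and the new data only has to teach the task-specific decision.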

*As of May 2024, the part marked with ★ in the figure is not available in the free demo.

Features of Fujitsu AutoML for Vision technology

The LLM Planner approach achieves the highest identification accuracy among the competing methods, while also being both computationally efficient and data efficient.

Interested in testing Fujitsu Kozuchi?

Because this engine simplifies the use of vision AI at a low cost, with minimal data, computing, and expertise requirements, we anticipate that it will facilitate the development of new applications. Technology often has the potential to enhance society but remains underutilized because of high costs. By reducing the cost of cutting-edge technology, Fujitsu AutoML for Vision opens up numerous opportunities to create positive impact.

I'd like to introduce a practical example of how Fujitsu AutoML for Vision is making a difference. Hachijojima Island in Tokyo is renowned for whale watching from the shore. However, tourists often do not know when and where whales will appear. To address this, we are developing a system that uses Fujitsu AutoML for Vision to detect whales in footage from cameras installed along the coast. Tourists will then receive real-time information about the location and time of whale sightings, enhancing their overall experience.
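As a rough illustration of the last step in such a system, the sketch below turns per-frame answers into sighting messages. It assumes each camera frame has already been answered with one of the options [None, 1, 2, Many]. The FrameResult type, its fields, the camera labels, and the message format are all hypothetical and do not describe the system under development.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, List


@dataclass(frozen=True)
class FrameResult:
    camera_location: str  # hypothetical label for where the camera is installed
    captured_at: datetime  # when the frame was recorded
    answer: str  # one of the answer options, e.g. "None", "1", "2", "Many"


def sightings(results: Iterable[FrameResult]) -> List[str]:
    """Turn per-frame answers into human-readable sighting messages."""
    messages = []
    for result in results:
        if result.answer == "None":
            continue  # no whale in this frame, nothing to report
        messages.append(
            f"{result.captured_at:%H:%M} {result.camera_location}: "
            f"whales seen ({result.answer})"
        )
    return messages


frames = [
    FrameResult("north shore", datetime(2024, 5, 1, 9, 30), "None"),
    FrameResult("south shore", datetime(2024, 5, 1, 9, 31), "2"),
]
for message in sightings(frames):
    print(message)
```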

For a demonstration or to test our Fujitsu AutoML for Vision, please contact us here:

In addition to Fujitsu AutoML for Vision, we also introduce other AI core engines of Fujitsu Kozuchi on our TECH BLOG.

・Synthetic Image Generation

・Fujitsu Neuro-Symbolic Explainer

・Multi-Camera Tracking

・Camera Angle Change Detection

・Adversarial Example Attack Detector

・Fujitsu LLM Bias Diagnosis

・Fujitsu Auto Data Wrangling