Artificial intelligence

Naver’s HyperCLOVA X Vision proves a picture is worth a thousand words

The company has also unveiled Speech X, an upgraded voice synthesis technology that can express human emotion

By Seung-Woo Lee (leeswoo@hankyung.com), Aug 22, 2024 (GMT+09:00)

Naver unveils its hyperscale AI platform HyperCLOVA X on Aug. 24, 2023

A picture is worth a thousand words, as the adage goes, stressing the power of vision over text.

People also say eyes are windows to the soul, emphasizing the importance of humans’ ability to take in nuanced visual information.

Naver Corp., a leading South Korean tech giant, said on Thursday it has trained the brains of its latest artificial intelligence platform, HyperCLOVA X, to understand images as well as text.

On Aug. 27, Naver plans to unveil HyperCLOVA X Vision (HCX Vision), another upgraded version of HyperCLOVA X, after training it with large amounts of text and image data to process visual information, including documents.

“We are adding image capabilities to HyperCLOVA X without compromising on its text capabilities,” the company said in a statement.

Naver said HCX Vision has migrated from a large language model (LLM) to a large vision-language model (LVLM).

Trained on wide-ranging visual and language data, HCX Vision supports text and image modalities and performs tasks in various scenarios, such as recognizing documents and understanding text within images, it said.

SCORED HIGHER THAN GPT-4o

Naver said it used over 30 benchmarks to track the performance of HCX Vision relative to OpenAI’s commercial AI models GPT-4V and GPT-4o.

One benchmark Naver used to measure and showcase its model’s Korean capabilities was the Korean General Educational Development (K-GED) tests, the equivalency exams for primary and secondary education diplomas.

The benchmark consisted of 1,480 four-option multiple-choice questions. When testing with image inputs, HCX Vision correctly answered 83.8% of the questions, surpassing the K-GED test’s 60% pass threshold and the 77.8% scored by GPT-4o, according to Naver.
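The arithmetic behind these figures can be checked with a short sketch. The Python below is illustrative only and does not come from Naver: it back-computes approximate question counts from the reported percentages, so the counts are estimates, and all function and variable names are hypothetical. The 25% chance baseline is simply what random guessing would yield on four-option questions.

```python
# Illustrative check of the reported K-GED benchmark figures.
# The percentages and question count come from the article; the
# per-model question counts are back-computed estimates, not reported data.

TOTAL_QUESTIONS = 1480        # four-option multiple-choice questions
PASS_THRESHOLD_PCT = 60.0     # K-GED pass threshold
CHANCE_BASELINE_PCT = 100.0 / 4  # expected score from random guessing

REPORTED_SCORES_PCT = {"HCX Vision": 83.8, "GPT-4o": 77.8}


def implied_correct(score_pct: float, total: int = TOTAL_QUESTIONS) -> int:
    """Estimate how many questions a percentage score corresponds to."""
    return round(score_pct / 100.0 * total)


def passes(score_pct: float, threshold_pct: float = PASS_THRESHOLD_PCT) -> bool:
    """Return True if the score meets or exceeds the pass threshold."""
    return score_pct >= threshold_pct


for model, pct in REPORTED_SCORES_PCT.items():
    print(
        f"{model}: {pct}% is about {implied_correct(pct)} of {TOTAL_QUESTIONS} "
        f"questions, pass={passes(pct)}, "
        f"{pct - CHANCE_BASELINE_PCT:.1f} points above chance"
    )

gap = implied_correct(REPORTED_SCORES_PCT["HCX Vision"]) - implied_correct(
    REPORTED_SCORES_PCT["GPT-4o"]
)
print(f"Estimated gap: about {gap} questions")
```

Under these assumptions, the reported scores correspond to roughly 1,240 and 1,151 correct answers, a gap of about 89 questions out of 1,480.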

Under the image captioning category, it said HCX Vision can accurately identify and describe small details in an image without using a separate object detection model.

HCX Vision can name historical figures, landmarks, products and food with just image inputs. It can also reason and predict possible next steps based on images.

UNDERSTANDING CHARTS, TABLES AND GRAPHS

Naver said the AI model also understands charts, tables and data in an Excel file.

“If the data is a screenshot of an image, getting responses for your prompts is more complicated because the model must first recognize text and understand how the numbers are related,” it said.

HCX Vision supports documents in Korean, English, Japanese and Chinese, it said.

Naver said HCX Vision has been trained on large amounts of image and text pairs and can even understand humor and memes.

Other capabilities include understanding equations; code generation using shapes, charts or graphs; solving math problems that include shapes; and creative writing such as poems.

“Right now, HyperCLOVA X Vision can understand one image at a time. But soon, with context length support in the millions, we expect HCX Vision to understand hours-long movies and video streams,” Naver said.

SPEECH X

On Thursday, Naver also unveiled Speech X, a voice synthesis technology based on its HyperCLOVA X.

Naver said Speech X is more advanced than existing voice recognition and synthesis technology, with improved language structure and pronunciation accuracy. It can also express emotions like a human, the company said.

The company has already proven its technological competitiveness with various voice AI services such as AI voice recording service CLOVA Note, AI phone service CLOVA Care Call and AI voice synthesis service CLOVA Dubbing.

“HCX, which started as a large-scale language model, is evolving into a massive visual language model with added image understanding capabilities, and further into a voice multimodal language model,” said Sung Nako, head of Hyperscale AI Technology at Naver Cloud Corp., the AI affiliate of Naver Corp.

“We will expand our HCX ecosystem by applying HCX’s advanced capabilities to various Naver services, including CLOVA X.”

Write to Seung-Woo Lee at leeswoo@hankyung.com

In-Soo Nam edited this article.
