Artificial intelligence
Naver’s HyperCLOVA X Vision proves a picture is worth a thousand words
The company has also unveiled Speech X, an upgraded voice synthesis technology that can express human emotion
Aug 22, 2024 (GMT+09:00)
A picture is worth a thousand words, as the adage goes, stressing the power of vision over text.
People also say eyes are windows to the soul, emphasizing the importance of humans’ ability to take in nuanced visual information.
Naver Corp., a leading South Korean tech giant, said on Thursday it has trained the brains of its latest artificial intelligence platform, HyperCLOVA X, to understand images as well as text.
On Aug. 27, Naver plans to unveil HyperCLOVA X Vision (HCX Vision), another upgraded version of HyperCLOVA X, after training it with large amounts of text and image data to process visual information, including documents.
“We are adding image capabilities to HyperCLOVA X without compromising on its text capabilities,” the company said in a statement.

Naver said HCX Vision has migrated from a large language model (LLM) to a large vision-language model (LVLM).
Trained on wide-ranging visual and language data, HCX Vision supports text and image modalities and performs tasks in various scenarios, such as recognizing documents and understanding text within images, it said.
SCORED HIGHER THAN GPT-4o
Naver said it used over 30 benchmarks to track the performance of HCX Vision relative to OpenAI’s commercial AI models GPT-4V and GPT-4o.
One benchmark Naver used to measure and showcase its model’s Korean capabilities was the set of Korean General Educational Development (K-GED) tests, the equivalency exams for primary and secondary education diplomas.

The benchmark consisted of 1,480 four-option multiple-choice questions. When testing with image inputs, HCX Vision correctly answered 83.8% of the questions, surpassing the K-GED test’s 60% pass threshold and the 77.8% scored by GPT-4o, according to Naver.
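To put those percentages in context, the short Python sketch below works through the arithmetic using only the figures Naver reported. The counts of correct answers are estimates back-calculated from the rounded percentages, not numbers published by Naver, and the function and variable names are illustrative only. The 25% chance baseline follows from the four-option format.

```python
# Figures reported by Naver in the article.
TOTAL_QUESTIONS = 1480          # four-option multiple-choice questions
OPTIONS_PER_QUESTION = 4
PASS_THRESHOLD = 0.60           # K-GED pass mark
REPORTED_ACCURACY = {"HCX Vision": 0.838, "GPT-4o": 0.778}


def estimated_correct(accuracy: float, total: int = TOTAL_QUESTIONS) -> int:
    """Estimate the number of correct answers implied by a rounded accuracy.

    This is an approximation: the reported percentages are rounded, so the
    true count may differ by a question or so.
    """
    return round(accuracy * total)


def summarize() -> dict:
    """Compare each model with the pass mark and the random-guess baseline."""
    chance = 1 / OPTIONS_PER_QUESTION  # 25% expected from random guessing
    summary = {}
    for model, accuracy in REPORTED_ACCURACY.items():
        summary[model] = {
            "estimated_correct": estimated_correct(accuracy),
            "passes": accuracy >= PASS_THRESHOLD,
            "points_above_chance": round((accuracy - chance) * 100, 1),
        }
    return summary


def gap_in_points(model_a: str, model_b: str) -> float:
    """Return the accuracy gap between two models in percentage points."""
    gap = REPORTED_ACCURACY[model_a] - REPORTED_ACCURACY[model_b]
    return round(gap * 100, 1)


if __name__ == "__main__":
    for model, row in summarize().items():
        print(model, row)
    print("Gap (percentage points):", gap_in_points("HCX Vision", "GPT-4o"))
```

Under these assumptions, HCX Vision's reported score corresponds to roughly 1,240 correct answers and GPT-4o's to roughly 1,151, a gap of 6 percentage points.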
Under the image captioning category, it said HCX Vision can accurately identify and describe small details in an image without using a separate object detection model.
HCX Vision can name historical figures, landmarks, products and food with just image inputs. It can also reason and predict possible next steps based on images.

UNDERSTANDING CHARTS, TABLES AND GRAPHS
Naver said the AI model also understands charts, tables and data in an Excel file.
“If the data is a screenshot of an image, getting responses for your prompts is more complicated because the model must first recognize text and understand how the numbers are related,” it said.
HCX Vision supports documents in Korean, English, Japanese and Chinese, it said.
Naver said HCX Vision has been trained on a large number of image-text pairs and can even understand humor and memes.

Other capabilities include understanding equations; code generation using shapes, charts or graphs; solving math problems that include shapes; and creative writing such as poems.
“Right now, HyperCLOVA X Vision can understand one image at a time. But soon, with context length support in the millions, we expect HCX Vision to understand hours-long movies and video streams,” Naver said.
SPEECH X
On Thursday, Naver also unveiled Speech X, a voice synthesis technology based on its HyperCLOVA X.

Naver said Speech X is more advanced than existing voice recognition and synthesis technology, boasting improved language structure and pronunciation accuracy. It can also express emotions like a human, Naver said.
The company has already proven its technological competitiveness with various voice AI services such as the AI voice recording service CLOVA Note, the AI phone service CLOVA Care Call and the AI voice synthesis service CLOVA Dubbing.
“HCX, which started as a large-scale language model, is evolving into a massive visual language model with added image understanding capabilities, and further into a voice multimodal language model,” said Sung Nako, head of Hyperscale AI Technology at Naver Cloud Corp., the AI affiliate of Naver Corp.
“We will expand our HCX ecosystem by applying HCX’s advanced capabilities to various Naver services, including CLOVA X.”
Write to Seung-Woo Lee at leeswoo@hankyung.com
In-Soo Nam edited this article.