Alibaba has earned the top spot on the latest global VQA (Visual Question Answering) Leaderboard, with a score that beats human performance in the same context. This is the first time a machine has outperformed people in interpreting images to answer text questions: Alibaba’s algorithm achieved an accuracy of 81.26 percent in answering image-related queries, compared to 80.83 percent for humans (on the test-standard portion of the benchmark).
The VQA Challenge, which has been held yearly since 2015 by CVPR, the world’s premier computer vision conference, attracts international participants such as Facebook, Microsoft, and Stanford University. The evaluation asks participants to produce an accurate natural language answer given an image and a related natural language question. This year’s challenge included more than 250,000 images and 1.1 million questions.
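For readers curious how an accuracy figure of this kind can be computed, the following is a minimal sketch of the consensus-based scoring commonly described for the VQA benchmark, in which a predicted answer earns full credit when at least three human annotators gave the same answer. It is an illustration only: the function names are hypothetical, the answer normalisation is reduced to trimming and lowercasing, and the article does not state exactly how the 81.26 percent figure was calculated.

```python
from collections import Counter


def vqa_accuracy(predicted, human_answers):
    """Score one predicted answer against the human answers for a question.

    Assumption: credit is min(matching human answers / 3, 1), the formula
    commonly cited for the VQA benchmark. Normalisation here is simplified
    to trimming whitespace and lowercasing.
    """
    def normalize(text):
        return text.strip().lower()

    counts = Counter(normalize(answer) for answer in human_answers)
    matches = counts[normalize(predicted)]
    return min(matches / 3.0, 1.0)


def mean_accuracy(predictions, references):
    """Average the per-question scores and express the result as a percentage."""
    scores = [vqa_accuracy(p, refs) for p, refs in zip(predictions, references)]
    return 100.0 * sum(scores) / len(scores)
```

Under this scheme, an answer that matches only one of the human annotators receives partial credit (one third), which is why a system's overall score can fall between whole-question right and wrong counts.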
This breakthrough in answering image-related queries was made possible by a revolutionary algorithm design from Alibaba DAMO Academy, part of the Alibaba Group’s worldwide research and development programme. The Alibaba team made significant progress not only in analysing the images and understanding the intent of the questions, but also in answering them with proper reasoning, expressed in a human-like conversational style.
The VQA technology is already in use throughout Alibaba’s network. It’s been utilised in Alibaba’s intelligent chatbot Alime Shop Assistant, which tens of thousands of merchants on Alibaba’s retail platforms use.
“We are proud that we have achieved another significant milestone in machine intelligence, which underscores our continuous efforts in driving the research and development in related AI fields,” said Si Luo, Head of Natural Language Processing (NLP) at Alibaba DAMO Academy. “This is not implying humans will be replaced by robots one day. Rather, we are confident that smarter machines can be used to assist our daily work and life, and hence, people can focus on the creative tasks that they are best at.”
VQA can be used in a wide range of areas, Si Luo added. For example, it can be used when searching for products on e-commerce sites, to support the analysis of medical images for initial disease diagnosis, and in smart driving, where an in-car AI assistant can offer basic analysis of photos captured by the in-car camera.
This is not the first time Alibaba’s machine-learning models have outperformed competitors. Alibaba’s model previously topped the GLUE benchmark rankings, which are considered the most significant baseline test for NLP models in the industry. It surpassed human baselines by a substantial margin, marking a watershed moment in the development of strong natural language understanding systems.
In 2019, Alibaba’s model outperformed humans on the Microsoft Machine Reading Comprehension (MS MARCO) dataset, one of the most difficult reading comprehension tests in the artificial intelligence industry. According to Microsoft’s benchmark, the model beat the human score of 0.539 on the MS MARCO question-answering task. In 2018, Alibaba also outperformed the human benchmark on the Stanford Question Answering Dataset (SQuAD), one of the most widely used machine reading comprehension challenges in the world.