Alibaba has taken first place on the global VQA (Visual Question Answering) Leaderboard, outperforming human performance on the same task. This is the first time a machine has surpassed humans at understanding images to answer text questions. To learn more about the technology underlying this feat, TECHx secured an exclusive interview with Si Luo, Head of Natural Language Processing (NLP) at Alibaba DAMO Academy. Take a peek at the conversation.
TECHx: Let’s begin with the recent accomplishment that has put another feather in Alibaba’s cap. Alibaba took first place on the worldwide Visual Question Answering Leaderboard, with a machine surpassing humans at understanding pictures to answer text questions. How far are we from robots taking over jobs, and what new opportunities will this shift in the landscape bring?
Si Luo: We are proud to have achieved another significant milestone in machine intelligence, which underscores our continuous efforts to drive research and development in related AI fields. This does not imply that humans will be replaced by robots one day. Rather, we are confident that smarter machines can assist us in our daily work and life, so that people can focus on the creative tasks they are best at.
TECHx: Could you kindly educate our audience on the technology behind Alibaba DAMO Academy’s groundbreaking algorithm design, which enabled the team to reach this milestone?
Si Luo: Alibaba’s algorithm design achieves several breakthroughs:
First, Alibaba has established a robust foundation model base called AliceMind (Alibaba’s Collection of Encoder-decoders from Machine Intelligence of DAMO), which includes a set of pre-trained language models with state-of-the-art performance to encode/model language knowledge.
(More information about AliceMind can be found at https://nlp.aliyun.com/portal#/alice and https://github.com/alibaba/AliceMind.)
Secondly, the VQA model extends from language to multimodal scenarios by leveraging AliceMind to better model language and image information in the same high-dimensional semantic space. We have developed innovative technology for cross-modal semantic alignment and fusion, which is the key component of the VQA system.
Thirdly, the VQA system uses a knowledge-guided Mixture-of-Experts (MoE) to handle different kinds of tasks, with dedicated experts such as an OCR text expert, a clock reading expert, and a counting expert, which further strengthens the algorithm design.
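For readers unfamiliar with the Mixture-of-Experts idea mentioned above, here is a minimal illustrative sketch in Python. It is not Alibaba's implementation: the expert names mirror those in the interview, but the extra "general" expert, the keyword hints, the gating rule, and all function names are assumptions made purely for illustration. A gate assigns a weight to each expert based on the question, and the final answer distribution is the weighted sum of the experts' outputs.

```python
import math

# Expert names follow the interview; "general" is a hypothetical fallback.
EXPERTS = ["ocr_text", "clock_reading", "counting", "general"]

# Hypothetical keyword hints standing in for the "knowledge" that guides
# routing. A real system would presumably learn or encode this differently.
HINTS = {
    "ocr_text": ("written", "sign say", "word"),
    "clock_reading": ("what time", "clock"),
    "counting": ("how many", "number of"),
}


def softmax(logits):
    """Turn raw scores into weights that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def gate(question):
    """Weight each expert by how many of its hint phrases the question contains."""
    q = question.lower()
    logits = [2.0 * sum(kw in q for kw in HINTS.get(name, ())) for name in EXPERTS]
    return softmax(logits)


def mix(question, expert_outputs):
    """Combine per-expert answer distributions using the gate weights."""
    weights = gate(question)
    size = len(expert_outputs[EXPERTS[0]])
    mixed = [0.0] * size
    for weight, name in zip(weights, EXPERTS):
        for i, prob in enumerate(expert_outputs[name]):
            mixed[i] += weight * prob
    return weights, mixed


# Toy example: three candidate answers, made-up expert distributions.
candidates = ["two", "three", "noon"]
outputs = {
    "ocr_text": [0.3, 0.3, 0.4],
    "clock_reading": [0.1, 0.1, 0.8],
    "counting": [0.2, 0.7, 0.1],
    "general": [0.4, 0.4, 0.2],
}
weights, mixed = mix("How many dogs are in the picture?", outputs)
best_answer = candidates[mixed.index(max(mixed))]
```

In this toy setup the counting expert receives the largest weight for a "how many" question, so its preferred answer dominates the mixture. The numbers are invented and carry no relation to the actual leaderboard system.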
TECHx: What role do AI, machine learning, and automation play in e-commerce? In the adoption race, which industries are on top? What are the ones that aren’t up to par?
Si Luo: Alibaba’s VQA technology has already been widely applied across Alibaba’s ecosystem. For example, it is used in Alibaba’s intelligent chatbot Alime Shop Assistant, which serves tens of thousands of merchants on Alibaba’s retail platforms.
Today, AI and machine learning technologies are already broadly used across industries, from e-commerce and logistics (including last-mile delivery) to smart speakers and smart manufacturing.
One good example is Alibaba’s Global Shopping Festival, where cutting-edge AI and machine learning technologies were used to support one of the world’s biggest online shopping events. During the 2020 11.11 Global Shopping Festival, AliExpress, Alibaba’s global retail marketplace, unveiled the world’s first real-time livestreaming translation feature on an e-commerce platform powered by DAMO’s innovative speech and language processing technologies, supporting simultaneous translation from Chinese to English, Russian, Spanish and French.