Artificial intelligence testing: case study on unmanned vehicle testing
Editor's note from Leiphone (Lei Feng Network) New Intelligent Driving: Li Li, associate professor at the Institute of Systems Engineering, Department of Automation, Tsinghua University, recently published, as first author together with Lin Yilun, Zheng Nanning, Wang Feiyue, Liu Yuehu, Cao Dongpu, Wang Kunfeng, and Huang Wuling, an English-language paper titled "Artificial intelligence test: a case study of intelligent vehicles". The paper focuses on how to design and carry out intelligence tests in AI application fields. It argues that intelligence testing closely resembles machine learning, like two sides of the same coin, and that "lifelong testing" will be a long-running effort. The paper closes by proposing a parallel testing method that combines virtual and real-world tests.
The following is the Chinese version of the paper, "Artificial intelligence testing and unmanned vehicle testing", published with the authorization of Associate Professor Li Li. A download link for the English version of the paper is attached at the end of the article.
1. Overview
This article describes how to test intelligence in AI application fields. It presents a testing framework based on scenarios and tasks, explains how to design simulation-based tests and their test indicators, and illustrates these ideas with a typical AI application, the intelligent vehicle.
2. Driverless vehicles and artificial intelligence
Artificial intelligence (AI) usually refers to human-like intelligence exhibited by machines. AI has already changed our lives considerably, from autonomous vehicles to floor-sweeping robots, and we firmly believe that over the next 20 years it will bring further change in health, education, entertainment, safety and other fields. While we enjoy the conveniences of AI, questions arise: how can we ensure that AI machines operate correctly, as their human designers intended? Will driverless vehicles lose control and cause accidents in extreme environments? Will kitchen robots set the house on fire? We therefore urgently need to test and measure the reliability of artificial intelligence.
To answer these questions, we first need to consider how artificial intelligence is defined. Wikipedia defines it as the intelligence exhibited by machines. We extend this definition: AI is the intelligence that machines exhibit in a given task, where that intelligence is similar to, equal to, or even beyond that of human beings in the same task. Minsky (1968) gave a similar definition: "[AI] is the science of making machines capable of performing tasks that would require intelligence if done by [humans]". Minsky's definition emphasizes the intelligence needed to complete a task (cause-oriented), whereas ours emphasizes the intelligence shown in completing it (result-oriented).
It must also be noted that the tasks selected for testing intelligence are specific, and different tasks test different aspects of intelligence. For example, an illiterate person may become a good driver, while a scholar may be unable to drive.
The Turing test is the earliest intelligence test known to us. It reflects Turing's insight into artificial intelligence: its core idea is to require a computer, without direct physical contact, to answer human questions while passing itself off as a human as convincingly as possible. However, the Turing test cannot be applied to the intelligence testing of unmanned vehicles.
Intelligence testing now has more and more application fields, so what method should we use to test intelligence, and what are the advantages of our task-based approach? Next, we list the difficulties of intelligence testing, explain how the testing methods we propose address them, and discuss how to better design task-based test cases.
3. Design and test of driverless intelligence
3.1 The dilemma of intelligence testing
3.1.1 Definition/description of tasks
The first dilemma is how to better define tasks in intelligence testing.
The biggest weakness of the Turing test is its task description. Today's intelligence tests for driverless vehicles differ greatly from early Turing-style tests such as the Chinese room. First, the early Turing test did not clearly specify the test task or what counts as a correct answer, so machines trying to pass it often gave ambiguous replies to avoid answering directly; today's intelligence tests for driverless vehicles define their tasks clearly. Second, in the early Turing test a human judged the results; to check whether a driverless vehicle's recognition algorithm passes every possible scenario, we must use machines to help judge tens of thousands of test tasks.
In short, we need to establish a series of quantifiable test tasks; this is the foundation of intelligence testing.
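As a rough illustration of what a quantifiable, machine-judged test task could look like, here is a minimal Python sketch. It is not taken from the paper: the class `DrivingTask`, the function `judge`, the task names, the log keys and the 0.5 m threshold are all hypothetical, chosen only to show a clearly specified task with an automatic pass criterion.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# A run log is a flat mapping of measurements; the keys used below are hypothetical.
RunLog = Dict[str, float]


@dataclass(frozen=True)
class DrivingTask:
    """A quantifiable test task: a name plus a machine-checkable pass criterion."""
    name: str
    passed: Callable[[RunLog], bool]


def judge(tasks: List[DrivingTask], logs: Dict[str, RunLog]) -> Dict[str, bool]:
    """Judge every task automatically; a task without a log counts as failed."""
    return {t.name: (t.name in logs and t.passed(logs[t.name])) for t in tasks}


# Two illustrative tasks with explicit (assumed) pass criteria.
TASKS = [
    DrivingTask("stop_at_red_light", lambda log: log["speed_at_stop_line_mps"] == 0.0),
    DrivingTask("keep_lane", lambda log: log["max_lateral_offset_m"] <= 0.5),
]

logs = {
    "stop_at_red_light": {"speed_at_stop_line_mps": 0.0},
    "keep_lane": {"max_lateral_offset_m": 0.8},
}
results = judge(TASKS, logs)
print(results)
```

Because each criterion is an ordinary function of the recorded log, the same judging loop can be applied to tens of thousands of tasks without a human in the loop.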
3.1.2. Task verification
The second dilemma is how to ensure that the intelligent machine under test behaves consistently in every scene it encounters. This requires the test tasks to enumerate, or at least adequately cover, those scenes.
Generally speaking, we can regard a task as the input to an intelligent-machine test: if the task is completed the output is "yes", otherwise "no". For relatively simple intelligence tests, we can exhaust the possible traffic scenarios by enumerating all task combinations; if the vehicle passes every one of these scenes, it can be considered intelligent enough. However, because the task space is continuous in space and time, full enumeration is impossible, so we must rely on sampling-based virtual testing. How to sample reasonably, improving test coverage while reducing the complexity of scene generation, has become a key testing technology. By recording the trajectories of the test vehicle and of other vehicles, we can quantitatively characterize the vehicle's level of intelligence (driving performance).
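The sampling idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the paper's method: the two scenario parameters, their ranges, uniform random sampling, the grid-based coverage measure and the stand-in `toy_vehicle` are all hypothetical.

```python
import random
from typing import Callable, Dict, Tuple

# Hypothetical continuous scenario parameters and their ranges.
PARAM_RANGES: Dict[str, Tuple[float, float]] = {
    "lead_vehicle_speed_mps": (0.0, 30.0),
    "initial_gap_m": (5.0, 100.0),
}


def sample_scenario(rng: random.Random) -> Dict[str, float]:
    """Draw one scenario uniformly at random from the continuous task space."""
    return {name: rng.uniform(lo, hi) for name, (lo, hi) in PARAM_RANGES.items()}


def cell_of(scenario: Dict[str, float], bins: int) -> Tuple[int, ...]:
    """Map a scenario to a grid cell so that coverage can be counted."""
    cell = []
    for name, (lo, hi) in PARAM_RANGES.items():
        idx = int((scenario[name] - lo) / (hi - lo) * bins)
        cell.append(min(idx, bins - 1))  # keep the upper bound inside the last bin
    return tuple(cell)


def run_campaign(run_task: Callable[[Dict[str, float]], bool],
                 n_samples: int, bins: int = 5, seed: int = 0) -> Tuple[float, float]:
    """Return (pass rate, fraction of grid cells visited) over sampled scenarios."""
    rng = random.Random(seed)
    visited = set()
    passes = 0
    for _ in range(n_samples):
        scenario = sample_scenario(rng)
        visited.add(cell_of(scenario, bins))
        passes += bool(run_task(scenario))  # task output: "yes" (True) or "no" (False)
    total_cells = bins ** len(PARAM_RANGES)
    return passes / n_samples, len(visited) / total_cells


def toy_vehicle(scenario: Dict[str, float]) -> bool:
    """Stand-in for a simulated run: fails only when the initial gap is very short."""
    return scenario["initial_gap_m"] >= 10.0


pass_rate, coverage = run_campaign(toy_vehicle, n_samples=200)
print(f"pass rate = {pass_rate:.2f}, grid coverage = {coverage:.2f}")
```

Uniform sampling is the simplest choice; a smarter sampler would concentrate on cells that are unvisited or where failures cluster, which is where the trade-off between coverage and scene-generation cost appears.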
3.1.3. Design of simulation test
To cover the task space within limited time and budget, researchers now mostly use simulation testing to make up for the shortcomings of field testing [4]. This raises several derivative problems:
1) how to ensure the authenticity of virtual object behavior in virtual testing
2) how to ensure the richness of virtual objects in virtual testing
3) how to ensure the coverage of scenarios and tasks in virtual testing
4) how to realize the correctness of machine judgment in virtual testing
For example, some driverless vehicle researchers study how to extract the 3D attributes of objects from 2D image data collected in the real world, re-render them in a 3D engine and generate new 2D virtual test data. Others study how to generate new 2D virtual test data directly from measured 2D image data using generative adversarial networks.
The setting of test standards is also a hot topic among researchers. For typical multi-objective problems such as driving, it remains very difficult to compare the strengths and weaknesses of different algorithms and to design test standards that meet the requirements of different users.
3.1.4. Setting of test indicators
There are several ways to set test indicators. The first is to require the intelligent machine to behave like a human. With this method, we first determine how people behave when completing the task, and then judge the machine by how far its performance on the same task deviates from human performance.
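One simple way to quantify "difference from human performance" is the mean distance between the machine's trajectory and a human reference trajectory. The sketch below assumes both trajectories are 2D positions sampled at the same time instants; the function names and the 0.5 m tolerance are hypothetical, and the source does not prescribe this particular measure.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]  # (x, y) position in metres


def mean_deviation(machine: List[Point], human: List[Point]) -> float:
    """Mean Euclidean distance between two trajectories sampled at the same instants."""
    if not machine or len(machine) != len(human):
        raise ValueError("trajectories must be non-empty and of equal length")
    return sum(math.dist(m, h) for m, h in zip(machine, human)) / len(machine)


def is_human_like(machine: List[Point], human: List[Point],
                  tolerance_m: float = 0.5) -> bool:
    """Assumed criterion: human-like if the mean deviation is within the tolerance."""
    return mean_deviation(machine, human) <= tolerance_m


human_path = [(0.0, 0.0), (1.0, 0.0)]
machine_path = [(0.0, 0.0), (1.0, 1.0)]
print(mean_deviation(machine_path, human_path))
```

In practice the human reference would itself be a distribution over many drivers rather than a single path, so a single-trajectory comparison like this is only a starting point.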
The second way to set test indicators is to require the intelligent machine to achieve the best possible performance. For example, when designing an AI for Go, we require it to always win rather than to play like a human player. This method suits cases where the goal is relatively simple. In intelligent vehicle testing the goal is often complex and cannot be reduced to a single objective such as winning a game of Go; driving safety, speed, fuel efficiency and other factors must all be considered, and targeting different factors leads to completely different designs. For example, in the 2016-2017 China Intelligent Vehicle Future Challenge, the time a vehicle took to pass 10 specified scenario tasks was one of the evaluation indicators, and points were deducted for collisions, crossing lane lines, or running red lights. When people's subjective feelings are included among the evaluation factors, setting test indicators becomes even harder, because everyone perceives the same thing differently.
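A time-plus-penalty indicator of this kind can be sketched as follows. The actual scoring rules of the challenge are not given in the source, so the penalty values, the 600-second time limit and the 100-point scale are invented for illustration, as are the names `RunRecord` and `score`.

```python
from dataclasses import dataclass, field
from typing import Dict

# Hypothetical deductions per violation; the real competition values are not given.
PENALTIES: Dict[str, float] = {
    "collision": 20.0,
    "line_crossing": 5.0,
    "red_light": 10.0,
}


@dataclass
class RunRecord:
    elapsed_s: float                                            # time to finish all tasks
    violations: Dict[str, int] = field(default_factory=dict)    # violation name -> count


def score(run: RunRecord, time_limit_s: float = 600.0,
          max_time_points: float = 100.0) -> float:
    """Time-based points (faster is better) minus deductions for each violation."""
    time_points = max(0.0, max_time_points * (1.0 - run.elapsed_s / time_limit_s))
    deduction = sum(PENALTIES[name] * count for name, count in run.violations.items())
    return time_points - deduction


clean_run = RunRecord(elapsed_s=300.0)
risky_run = RunRecord(elapsed_s=300.0, violations={"collision": 1})
print(score(clean_run), score(risky_run))
```

Changing the weights changes which vehicle wins, which is exactly the point made above: targeting different factors leads to different designs.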
3.2. Smart car intelligence test
Here we take the intelligence testing of smart cars as an example to illustrate our point of view:
3.2.1 The setting of test tasks in intelligence testing
Traditional intelligence testing of driverless vehicles falls into two main schools: scenario testing and function testing.
1) Scenario testing
A scenario usually refers to a system defined within a specific time and space. A traffic scenario, for example, generally refers to a traffic system composed of many traffic participants and a specific road environment. If the test vehicle can drive through this traffic system autonomously, it is said to have passed the driving test for that scenario. For example, the DARPA Grand Challenge 2005 selected 212 kilometers of desert roads as its test scenario (the desert was also the test scenario in 2004, but no vehicle finished that year, whereas 2005 was a great success). The DARPA Urban Challenge 2007 selected 96 km of urban roads as its test scenario.
2) Function testing
Function testing focuses on the realization of one or more individual functions of a driverless vehicle. Following the functional classification of human intelligence, driving intelligence can be divided into three broad categories: information perception, analysis and decision-making, and action execution. Path planning, for example, is a single intelligence within analysis and decision-making. This classification emphasizes the methods and technologies common to realizing each single intelligence. However, it is not sufficient for measuring the intelligence level of a driverless vehicle, because it is not tied to specific traffic scenes or driverless test tasks. The implicit assumption of function testing is that if the vehicle passes one or more tests of a given function, it will also perform well whenever that function is needed in the future. This assumption seems logical but turns out to be too optimistic. Current function testing also has other problems:
Most tests cover a single function and few are comprehensive, so the synergy between multiple functions cannot be tested.
There is no complete, fair and open benchmark set.
We believe that the intelligence of driverless vehicles can be defined by a generalized semantic network.