Technology and Application of AI Knowledge Graphs
1. Introduction
With the development of the mobile Internet, the Internet of Things has become practical, and the data generated by this interconnection is growing explosively. These data are effective raw material for analyzing relationships. Where earlier intelligent analysis focused on each individual in isolation, in the mobile-Internet era the relationships between individuals inevitably become an important part of deeper analysis. Whenever a task calls for relational analysis, the knowledge graph is likely to be useful.
2. What is a knowledge graph?
The knowledge graph is a concept proposed by Google in 2012. From an academic point of view, we can give it a definition: "a knowledge graph is essentially a knowledge base in the form of a semantic network." But this is somewhat abstract, so from a different angle, from the perspective of practical application, a knowledge graph can simply be understood as a multi-relational graph.
What is a multi-relational graph? Anyone who has studied data structures knows what a graph is. A graph is composed of nodes (vertices) and edges, but an ordinary graph usually contains only one type of node and one type of edge. A multi-relational graph, by contrast, contains multiple types of nodes and multiple types of edges. For example, the lower-left figure shows a classic graph structure, while the right figure shows a multi-relational graph, because it contains several types of nodes and edges, marked by different colors.
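To make this concrete, a multi-relational graph can be sketched as nodes and edges that each carry a type label. The entity names below are invented for illustration:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Node:
    node_id: str
    node_type: str   # e.g. "Person", "Company"

@dataclass(frozen=True)
class Edge:
    source: str
    target: str
    edge_type: str   # e.g. "friend_of", "works_at"

# An ordinary graph has one node type and one edge type;
# a multi-relational graph mixes several of each.
nodes = [
    Node("zhang_san", "Person"),
    Node("li_si", "Person"),
    Node("acme", "Company"),
]
edges = [
    Edge("zhang_san", "li_si", "friend_of"),
    Edge("zhang_san", "acme", "works_at"),
]

node_types = {n.node_type for n in nodes}
edge_types = {e.edge_type for e in edges}
print(node_types, edge_types)  # more than one type of each
```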
In a knowledge graph, we usually use "entity" to refer to a node in the graph and "relation" to refer to an edge. Entities refer to things in the real world, such as people, place names, concepts, drugs, and companies. Relations express connections between entities: a person "lives in" Beijing, Zhang San and Li Si are "friends", logistic regression is a "prerequisite" of deep learning, and so on.
Many scenarios in the real world are well suited to being expressed as a knowledge graph. For example, a social network graph can contain both "person" and "company" entities. The relationship between two people can be "friends" or "colleagues"; the relationship between a person and a company may be "currently employed" or "formerly employed". Similarly, a risk-control graph can contain "phone" and "company" entities: the relationship between two phones can be a "call" relationship, and each company can have a fixed phone number.
3. Representation of the knowledge graph
Once a knowledge graph has been constructed, it can be regarded as a knowledge base, which is why it can be used to answer search-style questions. For example, enter "Who is the wife of Bill Gates?" into the Google search engine and we directly get the answer "Melinda Gates". This is because, at the system level, entities for "Bill Gates" and "Melinda Gates" have been created along with the relationship between them. When we search, the final answer can be obtained directly through keyword extraction ("Bill Gates", "wife") and matching against the knowledge base. This differs from a traditional search engine, which returns web pages rather than the final answer, leaving an extra layer of information for the user to select and filter.
In the real world, entities and relations have their own attributes. For example, a person can have a "name" and an "age". When a knowledge graph carries attributes, we can represent it as a property graph. The following figure shows a simple property graph: Li Ming and Li Fei have a father-son relationship, and Li Ming has a telephone number starting with 138. The number was opened in 2018, and "2018" can be stored as an attribute of that relationship. Similarly, Li Ming himself carries attribute values such as an age of 25 and the position of general manager.
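The Li Ming example can be sketched as a property graph, in which both nodes and edges carry attribute dictionaries. This is a minimal in-memory illustration, not tied to any particular graph database:

```python
# Nodes and edges both carry free-form attribute dictionaries.
nodes = {
    "li_ming": {"label": "Person", "age": 25, "position": "general manager"},
    "li_fei":  {"label": "Person"},
    "138xxxx": {"label": "Phone"},
}
edges = [
    # (source, relation, target, edge attributes)
    ("li_ming", "father_of", "li_fei", {}),
    ("li_ming", "has_phone", "138xxxx", {"opened": 2018}),
]

# The year the number was opened lives on the edge, not on either node.
opened = next(attrs["opened"] for s, r, t, attrs in edges if r == "has_phone")
print(opened)
```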
This property-graph representation is very close to real-life scenarios and can describe the logic of a business well. Besides the property graph, a knowledge graph can also be represented in RDF, which consists of a large number of triples. RDF is designed above all for easy publishing and sharing of data, but it does not directly support attributes on entities or relations; if attributes are needed, the design must be extended. At present, RDF is mainly used in academic settings, while industry still tends to prefer graph databases (i.e., property graphs). Interested readers can consult the RDF literature; we do not elaborate further here.
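By contrast, plain RDF expresses everything as subject-predicate-object triples; attributes have no separate status, so the opening year of the phone number must itself be turned into triples about the relationship (reification). The sketch below uses bare tuples rather than a real RDF library, purely to show the shape of the data:

```python
triples = [
    ("li_ming", "father_of", "li_fei"),
    ("li_ming", "has_phone", "138xxxx"),
    ("li_ming", "age", "25"),
    # There is no direct way to attach "opened in 2018" to the
    # has_phone edge; RDF must first reify the statement:
    ("stmt1", "subject", "li_ming"),
    ("stmt1", "predicate", "has_phone"),
    ("stmt1", "object", "138xxxx"),
    ("stmt1", "opened", "2018"),
]

# Query: all facts whose subject is li_ming.
facts = [(p, o) for s, p, o in triples if s == "li_ming"]
print(facts)
```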
4. Knowledge extraction
Constructing the knowledge graph is the foundation for all subsequent applications, and the prerequisite for construction is extracting data from different sources. For vertical-domain knowledge graphs, the data mainly comes from two channels: one is the business's own data, usually held in the company's database tables and stored in a structured way; the other is public data crawled from the Internet, which usually takes the form of web pages and is therefore unstructured.
The former generally needs only simple preprocessing before it can serve as input to downstream AI systems, but the latter requires technologies such as natural language processing to extract structured information. In the search example above, the relationship between Bill Gates and Melinda Gates can be extracted from unstructured sources such as Wikipedia.
The difficulty of information extraction lies in handling unstructured data. The following figure gives an example: on the left is an unstructured English text, and on the right are the entities and relations extracted from it. Constructing such graphs involves the following natural language processing topics:
a. Named Entity Recognition
b. Relation Extraction
c. Entity Resolution
d. Coreference Resolution
Below is a brief description of the problem each technique addresses; how each is implemented in detail is not covered here one by one. Interested readers can consult the related literature or study my courses.
First, named entity recognition extracts entities from the text and classifies/tags each one. For example, from the text above we can extract the entity "NYC" and mark its type as "Location"; we can also extract "Virgil's BBQ" and mark its type as "Restaurant". This process is called named entity recognition. It is a relatively mature technology, and ready-made tools exist for it. Second, relation extraction techniques pull the relationships between entities out of the text: for example, the relation between the entity "hotel" and "Hilton property" is "in", the relation between "hotel" and "Times Square" is "near", and so on.
Beyond named entity recognition and relation extraction, there are two harder problems. One is entity resolution: some entities are written differently but actually refer to the same thing. For example, "NYC" and "New York" are different strings on the surface, but both refer to the city of New York and need to be merged. Merging entities not only reduces the number of entities but also reduces the sparsity of the graph. The other problem is coreference resolution: determining which entity a word such as "it", "he", or "she" in the text refers to. For example, in this text the two marked occurrences of "it" both point to the "hotel" entity.
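A minimal flavor of entity resolution is alias merging: mapping surface strings to one canonical entity before the graph is built. Real systems rely on string similarity and context models; the alias table here is a hand-made assumption for illustration only:

```python
# Hypothetical alias table; a production system would learn or curate this.
ALIASES = {
    "nyc": "New York",
    "new york": "New York",
    "new york city": "New York",
    "big apple": "New York",
}

def resolve(mention: str) -> str:
    """Map a surface mention to its canonical entity (or leave it unchanged)."""
    return ALIASES.get(mention.strip().lower(), mention)

mentions = ["NYC", "New York City", "Boston"]
print([resolve(m) for m in mentions])
```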
Entity resolution and coreference resolution are both harder problems than the first two.
5. Storage of the knowledge graph
There are two main ways to store a knowledge graph: RDF-based storage and graph-database-based storage. Their differences are shown in the figure below. An important design criterion of RDF is easy publishing and sharing of data, while a graph database focuses on efficient graph queries and traversal. Also, RDF stores data as triples and does not include attribute information, whereas graph databases generally use the property graph as their basic representation, so entities and relations can carry attributes, making it easier to express real business scenarios.
According to recent statistics (first half of 2018), graph databases are still the fastest-growing category of storage system, while growth of relational databases has remained basically flat. The commonly used graph database systems and their latest rankings are also listed. Among them, Neo4j remains the most widely used graph database: it has an active community and high query efficiency, but its drawback is that the free edition does not support distributed deployment. By contrast, OrientDB and JanusGraph (formerly Titan) do support distribution, but these systems are relatively new and their communities are not as active as Neo4j's, which means some thorny problems will inevitably arise during use. If you choose an RDF storage system, Jena may be a good choice.
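To make the idea of graph querying concrete, the sketch below answers a one-hop question ("who are Zhang San's friends?") over an in-memory edge list; in Neo4j the equivalent would be a Cypher pattern such as `MATCH (p:Person {name: 'Zhang San'})-[:FRIEND]->(q) RETURN q.name`. The data is invented for illustration:

```python
# Edge list: (source, relation type, target). Invented sample data.
edges = [
    ("Zhang San", "FRIEND", "Li Si"),
    ("Zhang San", "FRIEND", "Wang Wu"),
    ("Li Si", "WORKS_AT", "Acme"),
]

def neighbors(graph, source, relation):
    """One-hop traversal: all targets reachable from `source` via `relation`."""
    return [t for s, r, t in graph if s == source and r == relation]

print(neighbors(edges, "Zhang San", "FRIEND"))
```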
6. Building a financial knowledge graph
Next, we look at a concrete case and explain step by step how to build a knowledge graph system for financial risk control. First, it should be noted that many people assume building a knowledge graph system is mostly about algorithms and development. In fact, the most important part is understanding the business and designing the knowledge graph itself. This is similar to a business system, where the design of the database tables is critical, and that design in turn depends on a deep understanding of the business and on anticipating how business scenarios may change in future. Of course, we do not discuss the importance of the data itself here.
Building a complete knowledge graph involves the following steps: 1. defining the concrete business problem; 2. data collection and preprocessing; 3. designing the knowledge graph; 4. loading the data into the knowledge graph; 5. developing the upper-layer applications and evaluating the system. Below we follow this process and explain what needs to be done, and considered, at each step.
6.1 Defining the Business Problem
In P2P online lending, the most central issue is risk control, i.e., how to assess the risk of a borrower. In the online environment, fraud risk is especially serious, and much of that risk hides in complex networks of relationships. The knowledge graph is designed precisely for this kind of problem, so it is likely to bring real value to the anti-fraud task.
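One concrete way relationship data surfaces fraud risk: applicants who share contact details are often linked. The sketch below groups loan applications by phone number and flags any number used by more than one applicant (field names and data are invented for illustration):

```python
from collections import defaultdict

# Invented sample applications.
applications = [
    {"applicant": "A", "phone": "131-0001"},
    {"applicant": "B", "phone": "131-0001"},  # same phone as A: suspicious
    {"applicant": "C", "phone": "131-0002"},
]

# Group applicants by the phone number they supplied.
by_phone = defaultdict(set)
for app in applications:
    by_phone[app["phone"]].add(app["applicant"])

# A phone shared by several applicants is a simple fraud-ring signal.
shared = {phone: users for phone, users in by_phone.items() if len(users) > 1}
print(shared)
```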
Before moving on, one thing to understand is that not every project ultimately needs a graph system. In many practical scenarios, even when there is some demand for relationship analysis, a traditional database can complete the analysis. Therefore, to avoid adopting a knowledge graph for its own sake, and to make a better technology choice, a few reference points are given below.
6.2 Data Collection & Preprocessing
The next step is to confirm the data sources and perform the necessary preprocessing. When considering data sources, we need to think about the following: 1. What data do we already have? 2. What data could we obtain, even if we do not hold it yet? 3. Which data can help reduce risk? 4. Which data should go into the knowledge graph? The point is that not all anti-fraud-related data has to be entered into the knowledge graph; some guidelines for this decision are introduced in more detail in later sections.
For anti-fraud, several data sources come readily to mind, including basic user information, behavioral data, telecom operator data, public information on the Internet, and so on. Once we have a list of data sources, the next step is to see which data needs further processing; for example, unstructured data will, to a greater or lesser extent, require natural language processing. The basic information a user fills in is mostly stored in business tables, and apart from a few fields requiring extra processing, many fields can be used directly for modeling or added to the knowledge graph. For behavioral data, some simple processing is needed to extract useful signals, such as "how long a user stays on a page". For web data published on the Internet, information extraction techniques are required.
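For the behavioral data just mentioned, a feature like "how long a user stays on a page" can be derived from raw page-view timestamps. The event shape below is an assumption for illustration:

```python
# Raw events: (user, page, timestamp in seconds). Invented sample data,
# assumed to be sorted by time within each user.
events = [
    ("u1", "loan_form", 100.0),
    ("u1", "confirm", 130.0),
    ("u1", "done", 131.0),
]

def dwell_times(evts):
    """Seconds spent on each page = gap until that user's next event."""
    out = {}
    for (u1, page, t1), (u2, _, t2) in zip(evts, evts[1:]):
        if u1 == u2:
            out[(u1, page)] = t2 - t1
    return out

print(dwell_times(events))
```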
For example, for basic user information, we will probably need the following operations. On the one hand, information such as name, age, and education can be extracted directly from the structured database and used as-is. On the other hand, the company name a user fills in may need further processing: some users write "Beijing Greedy Technology Co., Ltd." while others write "Beijing Wangjing Greedy Technology Co., Ltd.", yet both refer to the same company, so the company names need to be aligned. For the technical details, refer to the entity alignment (entity resolution) techniques mentioned above.
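A simple first pass at aligning such company names is to normalize away district or branch qualifiers before comparing; the qualifier list below is a hand-made assumption, and real entity alignment would add string-similarity measures and external business registries:

```python
import re

# Hypothetical district/branch tokens seen inside registered company names.
NOISE_TOKENS = ["Wangjing", "Haidian", "Chaoyang"]

def normalize_company(name: str) -> str:
    """Strip assumed noise tokens and collapse whitespace."""
    for token in NOISE_TOKENS:
        name = name.replace(token, "")
    return re.sub(r"\s+", " ", name).strip()

a = normalize_company("Beijing Greedy Technology Co., Ltd.")
b = normalize_company("Beijing Wangjing Greedy Technology Co., Ltd.")
print(a == b)  # both normalize to the same canonical name
```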
6.3 Knowledge Graph Design
Graph design is an art. It requires not only a deep understanding of the business but also some anticipation of how the business may change, so as to design a system that fits the current situation well and performs efficiently. In knowledge graph design we inevitably face these common questions: 1. Which entities, relations, and attributes are needed? 2. Which attributes could instead be entities, and which entities could be attributes? 3. What information does not need to go into the knowledge graph at all?
Based on these common questions, we have distilled a series of design principles from past design experience. Like the normal forms of traditional database design, these principles guide practitioners toward a more reasonable knowledge graph design while ensuring system efficiency.
Next, a few simple examples illustrate some of the principles. The first is the business principle, which means: "Everything starts from the business logic, and by looking at the design of the graph it should be easy to infer the business logic behind it."