Abstract
In landslide research, databases (here synonymous with inventories) are of particular importance because they record and document the information needed for statistical and process-oriented analyses. Such databases range from analogue document repositories to complex software applications, and the latter are preferred because they are technically suited to efficient data processing. Among software applications, so-called ‟relational database systems” (RDBS) have distinguished themselves in recent years. However, landslide data are currently collected and analysed largely independently of such a database system, although an RDBS would provide a central location for data processing. The operation of a database thus takes place on two separate levels: on the one hand, data acquisition and analysis by the operators and users of the database and, on the other hand, central data storage and distribution by the database system. Operators and users are thus confronted with the problems of their operational level without receiving support from an RDBS. A particular challenge is that landslides are widely distributed in time and space and result from complex processes. Comprehensive data collection consequently involves a large amount of work, which in turn affects analyses that depend on abundant, up-to-date data. Operators and users of a landslide database therefore regularly restrict their problem definitions in order to limit the required scope of data and hence the effort of data collection.
The overall objective of the present work is to counteract the presented problems by minimizing the effort for operation and use of a landslide database, so that self-imposed restrictions become less important. For this purpose, an ‟Integrated Landslide Inventory System” (IRIS) is developed, which integrates the level of data collection and analysis into an RDBS by means of automation. The users of this system are thus relieved to the extent that they only have to monitor automated processes.
In the context of this objective, the technical basis for IRIS was created within the framework of the publication ‟A Landslide Inventory System as a Base for Automated Process and Risk Analyses”. For this purpose, the requirements for such a technical basis were first worked out. It was necessary to find software that implements the common data processing methods of an RDBS, can additionally process spatial data, and allows changes to the program logic so that automated collection and analysis methods can be integrated. In addition, database operators still had to be able to enter data, digital as well as analogue, from decentralised surveys (e.g. fieldwork, manual internet research) into the system. The software therefore had to support the digitisation of analogue data, which could then be made available for automated data processing. The software ‟PostgreSQL” was chosen because it fulfils these requirements as an RDBS and can be extended with common GIS functionality by means of the extension ‟PostGIS”. PostgreSQL/PostGIS can therefore store and process not only the primary landslide data but also supporting data such as digital maps and digital terrain models. A further advantage with respect to the requirements is that the software is available as open source and may be modified as desired. Under these conditions, the software was extended with an interface for entering self-collected data and with an automated analysis for risk assessment. Following this, a case study in the Franconian Alb was used to automatically generate a map showing the risk to infrastructure objects from active landslides in the vicinity.
In this respect, it was first necessary to digitise analogue landslide data from previous work via the input interface and to feed digital infrastructure maps and digital terrain models into the system, with all data remaining stored in the system. As soon as supplementary or more up-to-date data from the various types of surveys are available to the system, the analysis can therefore also be updated ‟at the touch of a button” without any further effort on the part of the operator.
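The idea behind such a proximity-based risk query can be illustrated with a minimal sketch. It does not reproduce the analysis implemented in IRIS, which runs inside PostgreSQL/PostGIS and is not specified in detail here. The sketch assumes point locations in projected coordinates (metres) and a fixed search radius; the class names, the function `objects_at_risk`, and all sample values are hypothetical.

```python
from dataclasses import dataclass
from math import hypot


@dataclass(frozen=True)
class Landslide:
    name: str
    x: float  # easting in metres (projected coordinates are an assumption)
    y: float  # northing in metres
    active: bool


@dataclass(frozen=True)
class InfrastructureObject:
    name: str
    x: float
    y: float


def objects_at_risk(landslides, objects, radius_m):
    """Map each infrastructure object to the active landslides within radius_m."""
    at_risk = {}
    for obj in objects:
        nearby = [
            ls.name
            for ls in landslides
            if ls.active and hypot(ls.x - obj.x, ls.y - obj.y) <= radius_m
        ]
        if nearby:
            at_risk[obj.name] = nearby
    return at_risk


# Hypothetical sample data, not taken from the case study.
landslides = [
    Landslide("slide_A", 0.0, 0.0, active=True),
    Landslide("slide_B", 1000.0, 0.0, active=False),
]
objects = [
    InfrastructureObject("road_1", 30.0, 40.0),    # 50 m from slide_A
    InfrastructureObject("rail_1", 1000.0, 10.0),  # close to inactive slide_B only
]

result = objects_at_risk(landslides, objects, radius_m=100.0)
print(result)
```

Because the inputs stay stored in one place, rerunning the same function after new landslides or infrastructure objects have been added yields an updated result without further manual steps, which is the property the text describes for the database-side analysis.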
In accordance with the overarching goal, and after the technical basis including the automated analysis had been established, the next step was to support the database operator in data acquisition. This was initially done in the publication ‟Automated Digital Data Acquisition for Landslide Inventories” by developing a process chain for the automated acquisition of digital texts and their accompanying images. These texts and images originate, for example, from scientific papers, police reports, expert opinions, or newspaper articles. Through a further modification of PostgreSQL/PostGIS, the process chain was integrated into IRIS in order to supply the system centrally and continuously with the most up-to-date data possible. The chain consists of four links, which run recurrently at certain time intervals to collect landslide-relevant texts from the internet and make them available to the database operator. Its main task is to sort out the large quantities of irrelevant texts that accumulate and to identify duplicate texts, so that the data are limited to relevant information. The process chain is structured as follows. First, each text newly registered on the internet by the search engine operator ‟Google” is checked for predefined keywords (e.g., landslide, mudflow, rockfall) and their inflections. The presence of at least one keyword is a necessary condition for landslide-related content, so only such texts are passed on to the next link, which checks whether the keywords found occur in grammatically complete sentences. This ensures that the landslide-related content is a self-contained information unit. In addition, existing images are extracted as further information units.
Using machine learning methods, all information units found are then classified in the next link of the process chain as relevant or irrelevant with respect to landslides. Irrelevant would be, for example, a text about a political ‟landslide victory” or a picture of a windscreen destroyed by a rockfall. The final link then decides whether a text previously classified as relevant is a duplicate of an already recorded text from another source. A text is considered a duplicate if its content similarity to a recorded text exceeds a certain threshold. Because a duplicate may nevertheless contain additional information, it is not discarded but only hidden from the operator. As a result, the amount of data is further reduced, while duplicates can still be viewed at any time. In total, over the test period of 87 weeks, 4381 documents were analyzed using the implemented process chain and 90 % of them were sorted out as irrelevant, with the result that 385 text sources (excl. duplicates) on landslide events could be made directly available to the operator of IRIS.
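The keyword, sentence, and duplicate links of the process chain can be sketched as follows. This is a simplified illustration under stated assumptions, not the implementation used in IRIS: inflections are approximated by allowing word suffixes, a "complete sentence" is approximated by terminal punctuation, the similarity metric is assumed to be Jaccard similarity on word sets, and the threshold of 0.8 is a placeholder. The machine learning classification link is omitted, and all function names are hypothetical.

```python
import re

# Keywords taken from the text; matching inflections via a suffix is an assumption.
KEYWORDS = ("landslide", "mudflow", "rockfall")
KEYWORD_RE = re.compile(r"\b(?:" + "|".join(KEYWORDS) + r")\w*", re.IGNORECASE)


def keyword_sentences(text):
    """Keyword and sentence links: keep sentences that contain a keyword.

    Only fragments that end with '.', '!' or '?' count as sentences here,
    which is a crude stand-in for a grammatical completeness check.
    """
    sentences = re.findall(r"[^.!?]+[.!?]", text)
    return [s.strip() for s in sentences if KEYWORD_RE.search(s)]


def jaccard(a, b):
    """Word-set Jaccard similarity (assumed metric, not named in the source)."""
    tokens_a = set(re.findall(r"\w+", a.lower()))
    tokens_b = set(re.findall(r"\w+", b.lower()))
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def flag_duplicates(texts, threshold=0.8):
    """Duplicate link: hide a text if it is too similar to an earlier visible one.

    Duplicates are kept in the result and only marked as hidden.
    """
    visible = []
    records = []
    for text in texts:
        is_duplicate = any(jaccard(text, seen) > threshold for seen in visible)
        records.append({"text": text, "hidden": is_duplicate})
        if not is_duplicate:
            visible.append(text)
    return records


sample = "Heavy rain fell. Two landslides closed the pass. Headline landslide"
print(keyword_sentences(sample))

records = flag_duplicates([
    "A landslide blocked the road near the village.",
    "A landslide blocked the road near the village!",
    "Rockfalls damaged a railway line.",
])
print([r["hidden"] for r in records])
```

Note that a sentence such as "A landslide victory for the party." passes the keyword and sentence checks in this sketch, which is why a separate relevance classification is needed before the duplicate check.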
With regard to the two-pronged use of IRIS (decentralized/manual and centralized/automated, see above), a quantitative evaluation of the usefulness of various textual source types (e.g. newspaper article, police report, scientific publication, technical report) was carried out in the publication ‟Quantitative Assessment of Information Quality in Textual Sources for Landslide Inventories”, particularly to optimize manual data acquisition. Manually sifting through possible sources involves considerable effort, which can be reduced by preselecting source types according to their usefulness. In particular, the question arises whether a certain type of source is useful for landslide inventories if the information it contains does not come from landslide experts but, for example, from journalists, police officers, or foresters. To answer this question, a measure of ‟usefulness” was defined, which corresponds to the probability of finding specified landslide information, weighted according to its degree of detail. A source type that frequently provides a high level of detail is accordingly more useful than one that contains the same type of information but more often at a lower level of detail. Since usefulness here corresponds to a mathematical probability, the well-known rules of combinatorics also apply. Usefulness can therefore be specified not only for a single source type but also for any combination of source types. As an example, a data set from a German landslide inventory was investigated, which contains not only selected information on individual landslide processes but also the original source type. Specifically, the recorded source types were analyzed with respect to the location, date, and process type of a landslide at various degrees of detail. It was found that the three most useful source types, when combined, yield a probability of more than 86 % of finding the required information.
The three source types, in descending order of individual usefulness, are: newspaper articles, expert opinions, and administrative documents. It was further shown that including additional source types would increase this probability only logarithmically, so that, with regard to an efficient use of available resources, they can be dispensed with for the time being.
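How usefulness values of several source types might be combined can be illustrated with a short sketch. The exact combination rule is not given here, so the sketch assumes that the source types are statistically independent and combines their probabilities as 1 − ∏(1 − p). The source type names and values below are invented placeholders and are not results of the study.

```python
from math import prod


def combined_usefulness(probabilities):
    """Probability that at least one source type yields the information.

    Assumes independence between source types (an assumption of this sketch).
    """
    return 1.0 - prod(1.0 - p for p in probabilities)


def cumulative_usefulness(usefulness_by_type):
    """Add source types in descending order of individual usefulness."""
    ranked = sorted(usefulness_by_type.items(), key=lambda kv: kv[1], reverse=True)
    steps = []
    for i in range(1, len(ranked) + 1):
        names = [name for name, _ in ranked[:i]]
        value = combined_usefulness(p for _, p in ranked[:i])
        steps.append((names, value))
    return steps


# Placeholder values for illustration only.
example = {"type_1": 0.5, "type_2": 0.25, "type_3": 0.25}

steps = cumulative_usefulness(example)
for names, value in steps:
    print(names, round(value, 5))
```

Under this assumption each additional source type contributes less than the previous one, which is consistent with the diminishing gain described in the text.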
Together, the three works listed above form the technological and conceptual foundation of IRIS. This foundation makes it possible to link the previously separate level of operating and using a database with the level of data processing, as the automation of data acquisition and the risk analysis of collected data have been integrated into a relational database system. In addition, knowledge of the usefulness of different source types enables efficient control and focus in manual as well as digital data acquisition. Consequently, IRIS is a quasi-closed, extensible, and self-sufficient system controlled by the operator that allows large and continuously accumulating landslide data to be managed. Future work on data acquisition could extend the system with the automated extraction of information contained in the retrieved text sources and/or the integration of automated landform recognition using remote sensing methods. With respect to data analysis, the future integration of other established analysis methods would increase hazard detection and assessment capabilities. Taken together, a fully automated, ‟living” landslide inventory appears possible, which can continuously provide up-to-date and comprehensive information and forecasts on landslide events on the regional to global scale.