2018. Which parts of the Web should be archived for future generations? The Deutsche Nationalbibliothek is currently exploring this question and surveying Internet users. In this interview, deputy director Ute Schwens discusses the current state of web archiving and the effects of the new copyright law.
2012. Metadata Statistics for a Large Web Corpus
ABSTRACT
We provide an analysis of the adoption of metadata standards on the Web based on a large crawl of the Web. In particular, we look at what forms of syntax and vocabularies publishers are using to mark up data inside HTML pages. We also describe the process that we have followed and the difficulties involved in web data extraction.
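The abstract does not go into implementation detail, but the basic syntax detection behind such adoption statistics can be sketched in a few lines of Python. This is an illustrative approximation, not the authors' extraction pipeline; the example URL and the specific attribute checks are assumptions.

    # Sketch: detect which embedded-metadata syntaxes a single HTML page uses.
    # Illustrative only -- not the extraction pipeline used in the paper.
    import requests
    from bs4 import BeautifulSoup

    def detect_markup(url):
        html = requests.get(url, timeout=10).text
        soup = BeautifulSoup(html, "html.parser")
        return {
            # Microdata marks items with itemscope/itemtype attributes.
            "microdata": soup.find(attrs={"itemscope": True}) is not None,
            # RDFa uses typeof/property attributes on ordinary elements.
            "rdfa": soup.find(attrs={"typeof": True}) is not None,
            # JSON-LD is embedded in dedicated script tags.
            "json-ld": soup.find("script", type="application/ld+json") is not None,
            # Microformats such as hCard rely on conventional class names.
            "microformats": soup.find(class_="vcard") is not None,
        }

    if __name__ == "__main__":
        print(detect_markup("https://example.org/"))  # placeholder URL

Aggregating such per-page flags over an entire crawl yields the kind of adoption counts the abstract refers to.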
This is the public wiki for the Heritrix archival crawler project. Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project. Heritrix (sometimes spelled heretrix, or misspelled or mis-said as heratrix/heritix/heretix/heratix) is an archaic word for heiress (woman who inherits).
HTTrack is a free (GPL, libre/free software) and easy-to-use offline browser utility. It allows you to download a World Wide Web site from the Internet to a local directory, recursively building all directories and getting HTML, images, and other files from the server to your computer. HTTrack preserves the original site's relative link structure. Simply open a page of the 'mirrored' website in your browser, and you can browse the site from link to link, as if you were viewing it online. HTTrack can also update an existing mirrored site and resume interrupted downloads. HTTrack is fully configurable and has an integrated help system. WinHTTrack is the Windows 2000/XP/Vista/Seven/8 release of HTTrack, and WebHTTrack the Linux/Unix/BSD release.
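For scripted mirror jobs, HTTrack can also be driven from its command line; the snippet below wraps one such call in Python. The -O option (output path) and the "+pattern" include filter are part of HTTrack's documented command line, but treat the exact invocation as an assumption and consult httrack --help on your system.

    # Sketch: call the HTTrack command-line tool from Python to mirror a site.
    import subprocess

    def mirror_site(url, out_dir, site_filter):
        subprocess.run(
            ["httrack", url,
             "-O", out_dir,      # directory the local mirror is written to
             site_filter,        # e.g. "+*.example.org/*" keeps the crawl on-site
             "-v"],              # verbose progress output
            check=True,
        )

    mirror_site("https://example.org/", "mirror/example.org", "+*.example.org/*")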
SocSciBot works by (a) crawling one or more web sites and then (b) analysing them to produce standard statistics about the interlinking between the sites, together with network diagrams of that interlinking. It can also run a limited linguistic analysis of the text in the collection of web sites.
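SocSciBot's own analysis code is not shown here, but the kind of interlinking statistics it reports can be approximated with a small graph sketch; the (source, target) link pairs below are an assumed input format, not SocSciBot's actual data model.

    # Sketch: basic interlinking statistics over crawled sites, in the spirit
    # of SocSciBot's link analysis (not its actual implementation).
    import networkx as nx

    def interlink_stats(link_pairs):
        g = nx.DiGraph()
        g.add_edges_from(link_pairs)     # one directed edge per site-to-site link
        return {
            "sites": g.number_of_nodes(),
            "interlinks": g.number_of_edges(),
            "density": nx.density(g),
            # In-degree as a rough indicator of which site attracts most links.
            "most_linked": max(g.in_degree, key=lambda kv: kv[1])[0],
        }

    pairs = [("siteA", "siteB"), ("siteA", "siteC"), ("siteB", "siteC")]
    print(interlink_stats(pairs))

The same graph object can be handed to a layout routine to draw network diagrams of the interlinking.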
Getleft is a web site downloader that downloads complete web sites according to the settings provided by the user. It automatically changes all absolute links to relative ones, so you can browse the downloaded pages on your local computer without connecting to the internet. Getleft supports several filters, allowing you to limit the download to certain files, as well as resuming interrupted downloads, following external links, site maps, and more. Getleft supports proxy connections and can be scheduled to update downloaded pages automatically.
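The absolute-to-relative link rewriting that Getleft (and HTTrack) perform can be illustrated with a short sketch; the URL-to-file mapping below is a hypothetical stand-in for whatever bookkeeping the downloader actually keeps.

    # Sketch: rewrite an absolute link to a path relative to the current page,
    # as an offline mirror needs. The mapping is hypothetical, not Getleft's.
    import os

    def to_relative(link, page_url, url_to_path):
        if link not in url_to_path:
            return link                  # external link: leave untouched
        page_dir = os.path.dirname(url_to_path[page_url])
        return os.path.relpath(url_to_path[link], start=page_dir)

    mapping = {
        "https://example.org/index.html": "mirror/index.html",
        "https://example.org/docs/a.html": "mirror/docs/a.html",
    }
    print(to_relative("https://example.org/docs/a.html",
                      "https://example.org/index.html", mapping))  # docs/a.html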
R. Yu, U. Gadiraju, B. Fetahu, and S. Dietze. Proceedings of the 2015 Conference on Web Information Systems Engineering (WISE), pages 554--569. Springer, November 2015.
G. Gossen, E. Demidova, and T. Risse. Advances in Information Retrieval: 37th European Conference on IR Research, ECIR 2015, Vienna, Austria, March 29 - April 2, 2015, pages 797--800. Springer International Publishing, 2015.