2018. Which parts of the Web should be archived for future generations? The Deutsche Nationalbibliothek (German National Library) is currently exploring this question and surveying Internet users. In this interview, deputy director Ute Schwens discusses the state of web archiving and the effects of the new copyright law.
2012. Metadata Statistics for a Large Web Corpus
ABSTRACT
We provide an analysis of the adoption of metadata standards on the Web, based on a large crawl of the Web. In particular, we look at which forms of syntax and which vocabularies publishers are using to mark up data inside HTML pages. We also describe the process we followed and the difficulties involved in web data extraction.
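Detecting which metadata syntax a page uses typically comes down to looking for syntax-specific attributes on HTML tags. The following is a minimal sketch of that idea using only Python's standard library; the attribute heuristics and class names are illustrative, not the paper's actual extraction pipeline.

```python
# Sketch: tally metadata-markup syntaxes (microdata, RDFa, microformats)
# inside an HTML page. Heuristics here are simplified for illustration.
from html.parser import HTMLParser


class MarkupCounter(HTMLParser):
    """Count start tags that carry attributes typical of each syntax."""

    def __init__(self):
        super().__init__()
        self.counts = {"microdata": 0, "rdfa": 0, "microformats": 0}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # Microdata marks items with itemscope/itemtype attributes.
        if "itemscope" in attrs or "itemtype" in attrs:
            self.counts["microdata"] += 1
        # RDFa uses property/typeof/vocab attributes.
        if "property" in attrs or "typeof" in attrs or "vocab" in attrs:
            self.counts["rdfa"] += 1
        # Microformats reuse the class attribute with well-known names
        # (a small illustrative subset is checked here).
        classes = (attrs.get("class") or "").split()
        if any(c in ("vcard", "hcard", "h-card", "hentry", "h-entry")
               for c in classes):
            self.counts["microformats"] += 1


def count_markup(html: str) -> dict:
    parser = MarkupCounter()
    parser.feed(html)
    return parser.counts
```

A real study would aggregate these per-page counts over millions of crawled pages and group them by vocabulary (e.g. schema.org types), but the per-page detection step looks roughly like this.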
This is the public wiki for the Heritrix archival crawler project. Heritrix is the Internet Archive’s open-source, extensible, web-scale, archival-quality web crawler project. Heritrix (sometimes spelled heretrix, or misspelled or mis-said as heratrix/heritix/heretix/heratix) is an archaic word for heiress (a woman who inherits).
HTTrack is a free (GPL, libre/free software) and easy-to-use offline browser utility. It allows you to download a World Wide Web site from the Internet to a local directory, building recursively all directories, getting HTML, images, and other files from the server to your computer. HTTrack arranges the original site's relative link-structure. Simply open a page of the 'mirrored' website in your browser, and you can browse the site from link to link, as if you were viewing it online. HTTrack can also update an existing mirrored site, and resume interrupted downloads. HTTrack is fully configurable, and has an integrated help system. WinHTTrack is the Windows 2000/XP/Vista/Seven/8 release of HTTrack, and WebHTTrack the Linux/Unix/BSD release.
SocSciBot works by (a) crawling one or more web sites and then (b) analysing them to produce standard statistics about the interlinking between the sites and network diagrams of the interlinking. It can also run a limited linguistic analysis of the text in the collection of web sites.
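The core statistic in this kind of analysis is a matrix of link counts between sites. A minimal sketch of that computation is below; the function names and the choice to treat the hostname as the "site" are assumptions for illustration, not SocSciBot's own implementation.

```python
# Sketch: count inter-site links from (source_page, target_page) pairs,
# the kind of raw data a crawl of several sites would produce.
from collections import Counter
from urllib.parse import urlparse


def site_of(url: str) -> str:
    """Reduce a URL to its host, treated here as the 'site'."""
    return urlparse(url).netloc.lower()


def interlink_matrix(links):
    """Count links from one site to another, ignoring internal links."""
    matrix = Counter()
    for src, dst in links:
        a, b = site_of(src), site_of(dst)
        if a != b:  # only count links that cross site boundaries
            matrix[(a, b)] += 1
    return matrix
```

The resulting (source site, target site) counts are exactly what feeds an interlinking network diagram: each site becomes a node and each nonzero count a weighted edge.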
Getleft is a web site downloader that downloads complete web sites according to the settings provided by the user. It automatically changes all absolute links to relative ones, so you can browse the downloaded pages (web sites) on your local computer without needing to connect to the Internet. Getleft supports several filters that let you limit the download to certain files, as well as resuming, following of external links, site maps, and more. Getleft supports proxy connections and can be scheduled to update downloaded pages automatically.
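The absolute-to-relative link rewriting that Getleft (and HTTrack) perform can be sketched in a few lines. This is an illustrative simplification, assuming same-host links and path-only rewriting, not the tools' actual logic.

```python
# Sketch: rewrite an absolute link so it resolves relative to the page
# that contains it, as a mirroring tool would do before saving the page.
import posixpath
from urllib.parse import urlparse


def to_relative(page_url: str, link_url: str) -> str:
    """Rewrite link_url relative to page_url's directory, if same host."""
    page, link = urlparse(page_url), urlparse(link_url)
    if page.netloc != link.netloc:
        return link_url  # external link: leave it untouched
    page_dir = posixpath.dirname(page.path)
    return posixpath.relpath(link.path, start=page_dir)
```

For example, a link to `/img/logo.png` inside `/docs/index.html` becomes `../img/logo.png`, so the mirrored page works when opened from the local disk.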
C. Schmitz, S. Staab, R. Studer, G. Stumme, and J. Tane. Proc. of the E-Learning 2002 World Conference on E-Learning in Corporate, Government, Healthcare and Higher Education (E-Learning 2002), AACE, pp. 909–915, Norfolk, 2002. Awarded paper.
D. Gruhl, D. Meredith, J. Pieper, A. Cozzi, and S. Dill. WWW '06: Proceedings of the 15th International Conference on World Wide Web, pp. 183–192, New York, NY, USA, ACM, 2006.
G. Manku, A. Jain, and A. Sarma. WWW '07: Proceedings of the 16th International Conference on World Wide Web, pp. 141–150, New York, NY, USA, ACM, 2007.
Z. Bar-Yossef, I. Keidar, and U. Schonfeld. WWW '07: Proceedings of the 16th International Conference on World Wide Web, pp. 111–120, New York, NY, USA, ACM, 2007.
A. Broder, M. Najork, and J. Wiener. WWW '03: Proceedings of the 12th International Conference on World Wide Web, pp. 679–689, New York, NY, USA, ACM, 2003.
M. Ehrig, J. Hartmann, and C. Schmitz. Workshop "Semantische Technologien für Informationsportale" (GI-Jahrestagung 2004), Gesellschaft für Informatik, September 2004.