group :: kdeseminarso12

bookmarks

BibSonomy :: scraping service
http://scraper.bibsonomy.org/
13 years ago by @dbenz
tags: scrapingservice, scrapers, scraper, bibsonomy
boilerpipe - Project Hosting on Google Code
The boilerpipe library provides algorithms to detect and remove the surplus "clutter" (boilerplate, templates) around the main textual content of a web page. The library already provides specific strategies for common tasks (for example, news article extraction) and can easily be extended for individual problem settings. Extraction is very fast (milliseconds), needs only the input document (no global or site-level information is required), and is usually quite accurate. Boilerpipe is a Java library written by Christian Kohlschütter and released under the Apache License 2.0. Its algorithms are based on, and extend, some concepts of the paper "Boilerplate Detection using Shallow Text Features" by Christian Kohlschütter et al., presented at WSDM 2010, the Third ACM International Conference on Web Search and Data Mining, New York City, NY, USA.
14 years ago by @macek
tags: Scraper, Development
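To make the idea of boilerplate detection from shallow text features concrete, here is a minimal Python sketch. It is not boilerpipe's code or API: the block-splitting tags, the two features (word count and link density), and the thresholds `min_words` and `max_link_density` are all illustrative assumptions, and the real library uses its own feature set and trained rules.

```python
from html.parser import HTMLParser

# Tags that start or end a text block (an illustrative choice, not boilerpipe's rule set).
BLOCK_TAGS = {"p", "div", "li", "td", "tr", "table", "ul", "ol", "h1", "h2", "h3", "body"}


class BlockCollector(HTMLParser):
    """Splits HTML into text blocks and records word count and link density."""

    def __init__(self):
        super().__init__()
        self.blocks = []       # finished blocks: (text, word_count, link_density)
        self._words = []       # words of the block currently being built
        self._link_words = 0   # how many of those words sit inside <a> tags
        self._in_link = 0      # nesting depth of <a>
        self._skip = 0         # nesting depth of <script>/<style>

    def _flush(self):
        # Close the current block, if it contains any words.
        if self._words:
            n = len(self._words)
            self.blocks.append((" ".join(self._words), n, self._link_words / n))
        self._words, self._link_words = [], 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
        elif tag == "a":
            self._in_link += 1
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self._skip = max(0, self._skip - 1)
        elif tag == "a":
            self._in_link = max(0, self._in_link - 1)
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        if self._skip:
            return
        words = data.split()
        self._words.extend(words)
        if self._in_link:
            self._link_words += len(words)

    def close(self):
        super().close()
        self._flush()


def extract_main_text(html, min_words=10, max_link_density=0.3):
    """Keeps blocks that are long enough and not dominated by link text."""
    collector = BlockCollector()
    collector.feed(html)
    collector.close()
    return "\n".join(
        text
        for text, n_words, density in collector.blocks
        if n_words >= min_words and density <= max_link_density
    )
```

Calling `extract_main_text(html)` on a page with a navigation bar, one long paragraph, and a short footer would keep only the paragraph under these thresholds. Like the approach described above, the sketch needs only the input document and no site-level information.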
Web-Harvest Project Home Page
Web-Harvest is an open-source web data extraction tool written in Java. It offers a way to collect desired web pages and extract useful data from them. To do that, it leverages well-established techniques and technologies for text/XML manipulation such as XSLT, XQuery and regular expressions. Web-Harvest mainly focuses on HTML/XML-based web sites, which still make up the vast majority of web content. It can also be supplemented with custom Java libraries to extend its extraction capabilities.
14 years ago by @macek
tags: Java, Scraper, Development
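The combination of path queries and regular expressions described for Web-Harvest can be sketched in a few lines of Python. This is not Web-Harvest's configuration format or API. It uses the limited XPath subset of Python's standard `xml.etree.ElementTree` in place of XQuery/XSLT, and the element names, class attributes, and the `extract_items` function are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Regular expression that pulls the first decimal number out of a price string.
PRICE_PATTERN = re.compile(r"(\d+(?:\.\d+)?)")


def extract_items(xhtml):
    """Selects item rows with a path query, then cleans fields with a regex.

    Assumes the input is well-formed XML/XHTML; real-world HTML would need
    to be tidied into XML first.
    """
    root = ET.fromstring(xhtml)
    items = []
    # Path query: every <li class="item"> anywhere below the root.
    for row in root.findall(".//li[@class='item']"):
        name = row.findtext("span[@class='name']", default="").strip()
        raw_price = row.findtext("span[@class='price']", default="")
        match = PRICE_PATTERN.search(raw_price)
        items.append({
            "name": name,
            "price": float(match.group(1)) if match else None,
        })
    return items
```

The path query decides which nodes are relevant, and the regular expression normalizes the text inside them. Rows without a parsable price get `None` rather than being dropped.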


publications

A comparison of layout based bibliographic metadata extraction techniques
M. Granitzer, M. Hristakeva, R. Knight, K. Jack, and R. Kern. Proceedings of the 2nd International Conference on Web Intelligence, Mining and Semantics, pages 19:1-19:8. New York, NY, USA, ACM, (2012)
13 years ago by @dbenz
tags: comparison, ie, extraction, scraper

KDE Seminar SoSe 2012
