
Bag-of-concepts: Comprehending document representation through clustering words in distributed representation.

H. Kim, H. Kim, and S. Cho. Neurocomputing, (2017)
DOI: 10.1016/j.neucom.2017.05.046

Abstract

Two document representation methods are mainly used in solving text mining problems. Known for its intuitive and simple interpretability, the bag-of-words method represents a document vector by its word frequencies. However, this method suffers from the curse of dimensionality, and fails to preserve accurate proximity information when the number of unique words increases. Furthermore, this method assumes every word to be independent, disregarding the impact of semantically similar words on preserving document proximity. On the other hand, doc2vec, a basic neural network model, creates low dimensional vectors that successfully preserve the proximity information. However, it loses the interpretability as meanings behind each feature are indescribable. This paper proposes the bag-of-concepts method as an alternative document representation method that overcomes the weaknesses of these two methods. This proposed method creates concepts through clustering word vectors generated from word2vec, and uses the frequencies of these concept clusters to represent document vectors. Through these data-driven concepts, the proposed method incorporates the impact of semantically similar words on preserving document proximity effectively. With appropriate weighting scheme such as concept frequency-inverse document frequency, the proposed method provides better document representation than previously suggested methods, and also offers intuitive interpretability behind the generated document vectors. Based on the proposed method, subsequently constructed text mining models, such as decision tree, can also provide interpretable and intuitive reasons on why certain collections of documents are different from others.
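The pipeline the abstract describes (cluster word vectors into concepts, count concept occurrences per document, then reweight) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration: the toy 2-D vectors stand in for word2vec output, plain k-means stands in for whatever clustering the paper uses, and the weighting is written by analogy with TF-IDF as cf * log(N / df), which may differ from the paper's exact definition. All function and variable names are hypothetical.

```python
import math

# Toy 2-D "word vectors" (assumption: real ones would come from a trained word2vec model).
WORD_VECTORS = {
    "cat": (0.9, 0.1), "stock": (0.1, 0.9),
    "dog": (0.8, 0.2), "bond": (0.2, 0.8),
    "kitten": (0.95, 0.05), "market": (0.05, 0.95),
}

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(vectors, k, iterations=20):
    """Plain k-means; the first k vectors serve as initial centroids (deterministic)."""
    centroids = [vectors[i] for i in range(k)]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assign each vector to its nearest centroid.
        labels = [min(range(k), key=lambda c: squared_distance(v, centroids[c]))
                  for v in vectors]
        # Move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return labels

def concept_frequencies(doc, word_to_concept, k):
    """Count how often each concept cluster occurs in a tokenized document."""
    counts = [0] * k
    for word in doc:
        concept = word_to_concept.get(word)  # out-of-vocabulary words are skipped
        if concept is not None:
            counts[concept] += 1
    return counts

def cf_idf(cf_matrix):
    """Assumed weighting, by analogy with TF-IDF: cf * log(N / df)."""
    n_docs, k = len(cf_matrix), len(cf_matrix[0])
    df = [sum(1 for row in cf_matrix if row[c] > 0) for c in range(k)]
    idf = [math.log(n_docs / df[c]) if df[c] else 0.0 for c in range(k)]
    return [[row[c] * idf[c] for c in range(k)] for row in cf_matrix]

K = 2
words = list(WORD_VECTORS)
labels = kmeans([WORD_VECTORS[w] for w in words], K)
word_to_concept = dict(zip(words, labels))

# Each concept is interpretable through the words it contains.
concept_members = {c: [w for w in words if word_to_concept[w] == c] for c in range(K)}

DOCS = [
    ["cat", "dog", "kitten"],
    ["stock", "bond", "market", "stock"],
    ["cat", "stock"],
]
cf_matrix = [concept_frequencies(doc, word_to_concept, K) for doc in DOCS]
weighted = cf_idf(cf_matrix)
```

Because every dimension of the resulting document vector corresponds to a cluster of words listed in `concept_members`, a downstream model such as a decision tree can be read in terms of those word groups, which is the interpretability benefit the abstract points to.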

Links and resources

BibTeX key: journals/ijon/KimKC17
entry type: article
year: 2017
journal: Neurocomputing
pages: 336-352
volume: 266
ee: https://doi.org/10.1016/j.neucom.2017.05.046
DOI: 10.1016/j.neucom.2017.05.046
url: https://www.sciencedirect.com/science/article/abs/pii/S0925231217308962


Cite this publication

%0 Journal Article
%1 journals/ijon/KimKC17
%A Kim, Han Kyul
%A Kim, Hyunjoong
%A Cho, Sungzoon
%D 2017
%J Neurocomputing
%K bag-of-concepts classification clustering document-embeddings topic-modeling word-vectors
%P 336-352
%R 10.1016/j.neucom.2017.05.046
%T Bag-of-concepts: Comprehending document representation through clustering words in distributed representation.
%U https://www.sciencedirect.com/science/article/abs/pii/S0925231217308962
%V 266
%X Two document representation methods are mainly used in solving text mining problems. Known for its intuitive and simple interpretability, the bag-of-words method represents a document vector by its word frequencies. However, this method suffers from the curse of dimensionality, and fails to preserve accurate proximity information when the number of unique words increases. Furthermore, this method assumes every word to be independent, disregarding the impact of semantically similar words on preserving document proximity. On the other hand, doc2vec, a basic neural network model, creates low dimensional vectors that successfully preserve the proximity information. However, it loses the interpretability as meanings behind each feature are indescribable. This paper proposes the bag-of-concepts method as an alternative document representation method that overcomes the weaknesses of these two methods. This proposed method creates concepts through clustering word vectors generated from word2vec, and uses the frequencies of these concept clusters to represent document vectors. Through these data-driven concepts, the proposed method incorporates the impact of semantically similar words on preserving document proximity effectively. With appropriate weighting scheme such as concept frequency-inverse document frequency, the proposed method provides better document representation than previously suggested methods, and also offers intuitive interpretability behind the generated document vectors. Based on the proposed method, subsequently constructed text mining models, such as decision tree, can also provide interpretable and intuitive reasons on why certain collections of documents are different from others.

@article{journals/ijon/KimKC17,
  abstract = {Two document representation methods are mainly used in solving text mining problems. Known for its intuitive and simple interpretability, the bag-of-words method represents a document vector by its word frequencies. However, this method suffers from the curse of dimensionality, and fails to preserve accurate proximity information when the number of unique words increases. Furthermore, this method assumes every word to be independent, disregarding the impact of semantically similar words on preserving document proximity. On the other hand, doc2vec, a basic neural network model, creates low dimensional vectors that successfully preserve the proximity information. However, it loses the interpretability as meanings behind each feature are indescribable. This paper proposes the bag-of-concepts method as an alternative document representation method that overcomes the weaknesses of these two methods. This proposed method creates concepts through clustering word vectors generated from word2vec, and uses the frequencies of these concept clusters to represent document vectors. Through these data-driven concepts, the proposed method incorporates the impact of semantically similar words on preserving document proximity effectively. With appropriate weighting scheme such as concept frequency-inverse document frequency, the proposed method provides better document representation than previously suggested methods, and also offers intuitive interpretability behind the generated document vectors. Based on the proposed method, subsequently constructed text mining models, such as decision tree, can also provide interpretable and intuitive reasons on why certain collections of documents are different from others.},
  added-at = {2020-05-05T16:50:00.000+0200},
  author = {Kim, Han Kyul and Kim, Hyunjoong and Cho, Sungzoon},
  biburl = {https://www.bibsonomy.org/bibtex/286614b9f26d45e8346bec0024e9ed893/ghagerer},
  doi = {10.1016/j.neucom.2017.05.046},
  ee = {https://doi.org/10.1016/j.neucom.2017.05.046},
  interhash = {eb1c57ef73eb720c41186dbdcd799e87},
  intrahash = {86614b9f26d45e8346bec0024e9ed893},
  journal = {Neurocomputing},
  keywords = {bag-of-concepts classification clustering document-embeddings topic-modeling word-vectors},
  pages = {336-352},
  timestamp = {2020-06-24T15:06:50.000+0200},
  title = {Bag-of-concepts: Comprehending document representation through clustering words in distributed representation.},
  url = {https://www.sciencedirect.com/science/article/abs/pii/S0925231217308962},
  volume = 266,
  year = 2017
}
