In mathematics, the Wasserstein or Kantorovich–Rubinstein metric or distance is a distance function defined between probability distributions on a given metric space M.
Intuitively, if each distribution is viewed as a unit amount of "dirt" piled on M, the metric is the minimum "cost" of turning one pile into the other, which is assumed to be the amount of dirt that needs to be moved times the mean distance it has to be moved. Because of this analogy, the metric is known in computer science as the earth mover's distance.
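For one-dimensional distributions given by samples, SciPy ships an implementation; a minimal sketch (the Gaussian samples are just a toy illustration):

```python
# Minimal sketch of the 1-D Wasserstein (earth mover's) distance with SciPy.
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)
u = rng.normal(loc=0.0, scale=1.0, size=1000)  # samples from N(0, 1)
v = rng.normal(loc=2.0, scale=1.0, size=1000)  # samples from N(2, 1)

# For two Gaussians with equal variance, W1 equals the distance between
# their means, so this should come out close to 2.0.
print(wasserstein_distance(u, v))
```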
We introduce Vicuna-13B, an open-source chatbot trained by fine-tuning LLaMA on user-shared conversations collected from ShareGPT. Preliminary evaluation using GPT-4 as a judge shows Vicuna-13B achieves more than 90%* of the quality of OpenAI's ChatGPT and Google Bard while outperforming other models such as LLaMA and Stanford Alpaca in more than 90%* of cases. The cost of training Vicuna-13B is around $300. The code and weights, along with an online demo, are publicly available for non-commercial use.
Kedro versioned datasets can be mixed with incremental and partitioned datasets. (Unsure what Kedro is? Check out this post.) This was a question presented to…
In previous articles, we explored how Snowpark Container Services can open doors to a complete data stack running solely on Snowflake (here) and showcased all essential tools Snowflake provides to achieve this (here). Now, it’s time to dive into the practical side of things. This article will guide you through a step-by-step implementation of running dbt in Snowpark Container Services, covering everything from setup and containerisation all the way to scheduling and monitoring. If you’re trying to create a simple containerised dbt setup, this guide will help you put all theory into action!
TLDR — Extractive question answering is an important task for providing a good user experience in many applications. The popular Retriever-Reader framework for QA using BERT can be difficult to scale…
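As a rough sketch of the Reader side, an off-the-shelf extractive-QA model can be run through the Hugging Face transformers pipeline (the checkpoint name here is one publicly available choice, not necessarily the one used in this post):

```python
# Reader step of a Retriever-Reader QA setup, sketched with the Hugging Face
# transformers pipeline. The checkpoint is a publicly available extractive-QA
# model, used here purely for illustration.
from transformers import pipeline

reader = pipeline("question-answering", model="deepset/roberta-base-squad2")

context = ("BERT was released by Google in 2018 and quickly became a "
           "standard baseline for extractive question answering.")
result = reader(question="When was BERT released?", context=context)

# The reader returns the answer span, its score, and character offsets.
print(result["answer"], result["score"], result["start"], result["end"])
```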
In natural language understanding (NLU) tasks, there is a hierarchy of lenses through which we can extract meaning — from words to sentences to paragraphs to documents. At the document level, one of the most useful ways to understand text is by analyzing its topics. The process of learning, recognizing, and extracting these topics across a collection of documents is called topic modeling.
In this post, we will explore topic modeling through 4 of the most popular techniques today: LSA, pLSA, LDA, and the newer, deep learning-based lda2vec.
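To give a taste of what these techniques look like in practice, here is a minimal sketch of one of the four (LDA) using scikit-learn; the corpus and hyperparameters are toy placeholders:

```python
# Toy LDA topic model with scikit-learn: fit two topics on four short
# documents and print each topic's top words.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "the cat sat on the mat with another cat",
    "dogs and cats make friendly pets",
    "stock markets fell sharply on friday",
    "investors fear more market volatility",
]

vec = CountVectorizer(stop_words="english")
counts = vec.fit_transform(docs)          # LDA expects raw term counts

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)    # per-document topic mixtures

vocab = vec.get_feature_names_out()
for k, weights in enumerate(lda.components_):
    top = [vocab[i] for i in weights.argsort()[-3:][::-1]]
    print(f"topic {k}: {top}")
```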
Facebook Research open-sourced a great project recently – fastText, a fast (no surprise) and effective method to learn word representations and perform text classification. I was curious about comparing these embeddings to other commonly used embeddings, so word2vec seemed like the obvious choice, especially considering that fastText embeddings are an extension of word2vec.
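Gensim implements both models behind a near-identical API, which makes the comparison easy to sketch (the corpus below is a toy stand-in for real training data):

```python
# Side-by-side word2vec vs. fastText sketch using gensim's implementations.
from gensim.models import FastText, Word2Vec

sentences = [
    ["machine", "learning", "is", "fun"],
    ["deep", "learning", "extends", "machine", "learning"],
    ["word", "embeddings", "capture", "meaning"],
] * 100  # repeated so the toy corpus has enough occurrences to train on

w2v = Word2Vec(sentences, vector_size=50, min_count=1, epochs=5, seed=0)
ft = FastText(sentences, vector_size=50, min_count=1, epochs=5, seed=0)

# word2vec can only look up words it saw during training ...
print(w2v.wv.most_similar("learning", topn=3))

# ... while fastText composes vectors from character n-grams, so it can
# embed a word that never appeared in the corpus.
print(ft.wv["learnings"][:5])
```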
You want to discern how many clusters there are (or, if you prefer, how many Gaussian components generated the data), and you have no information about the "ground truth". A realistic case, where the data do not have the nicety of behaving as well as simulated ones.
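One standard way to make that call without ground truth is to fit Gaussian mixtures of increasing size and compare an information criterion such as BIC; a minimal sketch (the data are simulated here only so the example is self-contained, and the article's own criterion may differ):

```python
# Choose the number of Gaussian components by BIC, with no ground-truth labels.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=[0, 0], scale=0.5, size=(200, 2)),
    rng.normal(loc=[4, 4], scale=0.7, size=(200, 2)),
    rng.normal(loc=[0, 5], scale=0.6, size=(200, 2)),
])

# Fit mixtures with 1..6 components and keep the BIC of each.
bics = {k: GaussianMixture(n_components=k, random_state=0).fit(X).bic(X)
        for k in range(1, 7)}
best_k = min(bics, key=bics.get)  # lower BIC is better
print(bics)
print("chosen number of components:", best_k)  # expect 3
```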
Definition of NLP coherence scores, in particular the intrinsic UMass measure and PMI.
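As commonly written (with D(w) the number of documents containing w, and D(w_i, w_j) the number containing both), the two measures are:

```latex
% UMass coherence of a topic's top words w_1, ..., w_M (Mimno et al., 2011):
C_{\mathrm{UMass}} = \sum_{m=2}^{M} \sum_{l=1}^{m-1}
    \log \frac{D(w_m, w_l) + 1}{D(w_l)}

% Pointwise mutual information of a word pair, with probabilities
% estimated from a reference corpus (used by the PMI-based measures):
\mathrm{PMI}(w_i, w_j) = \log \frac{p(w_i, w_j)}{p(w_i)\, p(w_j)}
```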
The fact that human judgment is not correlated with perplexity (or with the likelihood of unseen documents) is the motivation for further work trying to model human judgment directly. This is in itself a hard task, as human judgment is not clearly defined; for example, two experts can disagree on the usefulness of a topic.
The methods addressing this problem can be classified into two categories: intrinsic methods, which do not use any external source or task beyond the dataset itself, and extrinsic methods, which use the discovered topics for external tasks, such as information retrieval [Wei06], or use external statistics to evaluate the topics.
Uniform Manifold Approximation and Projection (UMAP) is a dimension reduction technique that can be used for visualisation similarly to t-SNE, but also for general non-linear dimension reduction. The algorithm is founded on three assumptions about the data:
The data is uniformly distributed on a Riemannian manifold;
The Riemannian metric is locally constant (or can be approximated as such);
The manifold is locally connected.
From these assumptions it is possible to model the manifold with a fuzzy topological structure. The embedding is found by searching for a low dimensional projection of the data that has the closest possible equivalent fuzzy topological structure.
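In practice, the reference implementation (the umap-learn package) exposes this through a scikit-learn-style API; a minimal sketch, with the digits dataset as a stand-in:

```python
# 2-D UMAP embedding of a small benchmark dataset using umap-learn.
import umap
from sklearn.datasets import load_digits

X, y = load_digits(return_X_y=True)

# n_neighbors balances local vs. global structure; min_dist controls how
# tightly points are allowed to pack in the low-dimensional embedding.
reducer = umap.UMAP(n_neighbors=15, min_dist=0.1, n_components=2,
                    random_state=42)
embedding = reducer.fit_transform(X)  # shape (n_samples, 2)
print(embedding.shape)
```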
MACE (Multi-Annotator Competence Estimation) is an implementation of an item-response model that lets you evaluate redundant annotations of categorical data. It provides competence estimates for the individual annotators and the most likely answer to each item.
If we have 10 annotators answer a question, and five answer with 'yes' and five with 'no' (a surprisingly frequent event), we would normally have to flip a coin to decide what the right answer is. If we knew, however, that one of the people who answered 'yes' is an expert on the question, while one of the others just always selects 'no', we would take this information into account and weight their answers accordingly. MACE does exactly that. It tries to find out which annotators are more trustworthy and upweights their answers. All you need to provide is a CSV file with one item per line.
In tests, MACE's trust estimates correlated highly with the annotators' true competence, and it achieved accuracies of over 0.9 on several test sets. MACE can take annotated items into account, if they are available. This helps to guide the training and improves accuracy.
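This is not MACE's actual Bayesian model, but the weighting intuition can be sketched in a few lines: alternately estimate the answers and each annotator's competence, and let trusted annotators count for more (binary labels and a fully annotated toy matrix are assumed here):

```python
# Toy illustration of competence-weighted aggregation, NOT MACE itself.
import numpy as np

# rows = items, columns = annotators; binary labels, fully annotated
votes = np.array([
    [1, 1, 0, 0],
    [0, 0, 0, 0],
    [1, 0, 1, 0],
    [1, 1, 1, 0],
    [0, 1, 0, 0],
])

# Start from the majority vote, then iterate.
answers = (votes.mean(axis=1) >= 0.5).astype(int)
for _ in range(10):
    # competence = how often each annotator agrees with the current answers
    competence = np.clip((votes == answers[:, None]).mean(axis=0), 0.01, 0.99)
    # weight each vote by the log-odds of its annotator's competence
    w = np.log(competence / (1 - competence))
    # weighted vote: map labels {0,1} to {-1,+1} and take the sign
    answers = (((2 * votes - 1) * w).sum(axis=1) > 0).astype(int)

print("competence:", np.round(competence, 2))  # the always-'0' annotator sinks
print("answers:   ", answers)
```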
The main aim of SenticNet is to make the conceptual and affective information conveyed by natural language (meant for human consumption) more easily accessible to machines.
H. Zhuang, T. Hanratty, and J. Han. Proceedings of the 2019 SIAM International Conference on Data Mining, Society for Industrial and Applied Mathematics, (May 2019)
Z. Yang, D. Yang, C. Dyer, X. He, A. Smola, and E. Hovy. Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1480–1489. (2016)
M. Windl, V. Winterhalter, A. Schmidt, and S. Mayer. Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems, New York, NY, USA, Association for Computing Machinery, (2023)