pgloader keeps a separate file of rejected data, but continues trying to copy the good data into your database.
pgloader also implements data reformatting. A typical example is the transformation of the MySQL datestamps 0000-00-00 and 0000-00-00 00:00:00 into the PostgreSQL NULL value.
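To make the reformatting rule above concrete, here is a minimal Java sketch of the same idea: a MySQL "zero" date becomes null (which would be stored as NULL), and anything else is parsed normally. The class and method names are hypothetical and this is not pgloader's own code; pgloader applies such rules itself while loading.

```java
import java.time.LocalDate;

/** Illustrative only: mimics the zero-date rule described above, not pgloader's implementation. */
final class ZeroDateToNull {

    /** Returns null for MySQL "zero" dates, otherwise the parsed calendar date. */
    static LocalDate convert(String mysqlDate) {
        if (mysqlDate == null || mysqlDate.startsWith("0000-00-00")) {
            return null; // would end up as NULL in PostgreSQL
        }
        // Keep only the date part so "2024-05-17 10:30:00" parses as well.
        String datePart = mysqlDate.length() > 10 ? mysqlDate.substring(0, 10) : mysqlDate;
        return LocalDate.parse(datePart);
    }

    public static void main(String[] args) {
        System.out.println(convert("0000-00-00"));           // prints null
        System.out.println(convert("0000-00-00 00:00:00"));  // prints null
        System.out.println(convert("2024-05-17 10:30:00"));  // prints 2024-05-17
    }
}
```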
A very common workflow is to index some data based on its embeddings and then, given a new query embedding, retrieve the most similar examples with k-Nearest Neighbor search. For example, you can imagine embedding a large collection of papers by their abstracts and then, given a new paper of interest, retrieving the papers most similar to it.
TLDR: in my experience it ~always works better to use an SVM instead of kNN, if you can afford the slight computational hit.
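For reference, the kNN baseline described above can be sketched in a few lines of Java. This covers only the nearest-neighbor retrieval step, not the SVM alternative. The class name, the use of cosine similarity, and the toy 2-dimensional vectors are assumptions for illustration; the sketch also assumes no embedding is the zero vector.

```java
import java.util.Arrays;

/** Brute-force kNN over embeddings using cosine similarity (illustrative sketch). */
final class KnnSearch {

    /** Cosine similarity of two equal-length, non-zero vectors. */
    static double cosine(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    /** Indices of the k stored embeddings most similar to the query, best match first. */
    static int[] topK(double[][] embeddings, double[] query, int k) {
        int n = embeddings.length;
        double[] scores = new double[n];
        Integer[] order = new Integer[n];
        for (int i = 0; i < n; i++) {
            scores[i] = cosine(embeddings[i], query);
            order[i] = i;
        }
        // Sort indices by descending similarity to the query.
        Arrays.sort(order, (x, y) -> Double.compare(scores[y], scores[x]));
        return Arrays.stream(order).limit(k).mapToInt(Integer::intValue).toArray();
    }

    public static void main(String[] args) {
        double[][] index = { {1, 0}, {0, 1}, {0.9, 0.1}, {-1, 0} };
        double[] query = {1, 0};
        System.out.println(Arrays.toString(topK(index, query, 2))); // prints [0, 2]
    }
}
```

A brute-force scan like this is linear in the number of stored embeddings for every query, which is fine for small collections.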
["slug" being an entity attribute]
Spring Data offers an existsBy query method, which we can define in the PostRepository, as follows:
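import org.springframework.data.jpa.repository.JpaRepository;
import org.springframework.stereotype.Repository;

@Repository
public interface PostRepository
        extends JpaRepository<Post, Long> {

    // Derived query method: Spring Data builds the existence check from the method name.
    boolean existsBySlug(String slug);
}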
[another] option to emulate existence is using a CASE WHEN EXISTS native SQL query:
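import org.springframework.data.jpa.repository.JpaRepository;
import org.springframework.data.jpa.repository.Query;
import org.springframework.data.repository.query.Param;
import org.springframework.stereotype.Repository;

@Repository
public interface PostRepository
        extends JpaRepository<Post, Long> {

    // Native SQL: the CASE expression yields 'true' or 'false' depending on whether a matching row exists.
    @Query(value = """
        SELECT
            CASE WHEN EXISTS (
                SELECT 1
                FROM post
                WHERE slug = :slug
            )
            THEN 'true'
            ELSE 'false'
            END
        """,
        nativeQuery = true
    )
    boolean existsBySlugWithCase(@Param("slug") String slug);
}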
@Repository
public interface PostRepository extends BaseJpaRepository<Post, Long> {

    @Query("""
        select p
        from Post p
        where date(p.createdOn) >= :sinceDate
        """
    )
    @QueryHints(
        @QueryHint(name = AvailableHints.HINT_FETCH_SIZE, value = "25")
    )
    Stream<Post> streamByCreatedOnSince(@Param("sinceDate") LocalDate sinceDate);
}
The FETCH_SIZE JPA query hint is necessary for PostgreSQL and MySQL to instruct the JDBC Driver to prefetch at most 25 records. Otherwise, the PostgreSQL and MySQL JDBC Drivers would prefetch all the query results prior to traversing the underlying ResultSet.
When Hibernate loads an object into a Session, it creates a state snapshot of the current database state of the object, so that it can perform dirty checking against the snapshot.
As a read-only object will never be modified, this snapshot is not needed and memory can be saved.
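The idea can be illustrated with a toy model in plain Java. This is not Hibernate's implementation: ToySession, its Post class, and the method names are all hypothetical, and the sketch only shows why skipping the snapshot for read-only objects saves memory and why such objects are never reported as dirty.

```java
import java.util.Arrays;
import java.util.IdentityHashMap;
import java.util.Map;

/** Toy model of snapshot-based dirty checking; NOT Hibernate's actual code. */
final class ToySession {

    static final class Post {
        String title;
        String slug;

        Post(String title, String slug) {
            this.title = title;
            this.slug = slug;
        }
    }

    // One copy of the loaded state per tracked (non-read-only) object.
    private final Map<Post, Object[]> snapshots = new IdentityHashMap<>();

    /** Registers a loaded object; read-only objects get no snapshot. */
    void track(Post post, boolean readOnly) {
        if (!readOnly) {
            snapshots.put(post, new Object[] {post.title, post.slug});
        }
    }

    /** An object is dirty when its current state differs from its snapshot. */
    boolean isDirty(Post post) {
        Object[] snapshot = snapshots.get(post);
        return snapshot != null
                && !Arrays.equals(snapshot, new Object[] {post.title, post.slug});
    }

    int snapshotCount() {
        return snapshots.size();
    }

    public static void main(String[] args) {
        ToySession session = new ToySession();
        Post tracked = new Post("First", "first");
        Post readOnly = new Post("Second", "second");
        session.track(tracked, false);
        session.track(readOnly, true);
        tracked.title = "First (edited)";
        System.out.println(session.isDirty(tracked));  // prints true
        System.out.println(session.isDirty(readOnly)); // prints false
        System.out.println(session.snapshotCount());   // prints 1
    }
}
```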
I think the ~/.mozilla/firefox/XXX.default-YYY/storage/default/https+++ZZZ.com/cache/https+++domain.com/ style dirs are the storage for what are called "service workers": persistent code tied to each website that can send notifications even if no related tab is open.
Suppose you have a favorite website that sells something, and you register with them that you're interested in a particular kind of product. A service worker for that site would live in the "ZZZ" folder named after that site, and the code in there would run even if you don't have a tab open for that site, so that you can get a notification. In other cases it's some other code that the web designers don't want to have to reload each time you visit; caching it in your storage folder saves time and network traffic.
You can see all your service workers from the Firefox menu: Help -> More troubleshooting information -> about:serviceworkers (or load about:serviceworkers directly).
If you plan to store UUID values in a Primary Key column, then you are better off using a TSID (time-sorted unique identifier).
One such implementation is offered by the Hypersistence TSID OSS library, which provides a 64-bit TSID that’s made of two parts:
- a 42-bit time component
- a 22-bit random component
The random component has two parts:
- a node identifier (0 to 20 bits)
- a counter (2 to 22 bits)
The node identifier can be provided by the tsid.node system property when bootstrapping the application:
-Dtsid.node="12"
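The bit layout described above can be sketched with plain bit arithmetic. This is an illustration of the 42 + 22 split only, not the Hypersistence TSID library's code: the class and method names are hypothetical, and the 10-bit node width used in the example, as well as placing the node bits above the counter bits, are assumptions.

```java
/** Illustrative 64-bit layout: 42 time bits, then 22 "random" bits split into node + counter. */
final class TsidLayout {

    static final int TIME_BITS = 42;
    static final int RANDOM_BITS = 22;

    /** Packs the parts into one long; nodeBits must be between 0 and 20. */
    static long compose(long millisSinceEpoch, int nodeBits, long node, long counter) {
        int counterBits = RANDOM_BITS - nodeBits;
        long timePart = (millisSinceEpoch & ((1L << TIME_BITS) - 1)) << RANDOM_BITS;
        long nodePart = (node & ((1L << nodeBits) - 1)) << counterBits;
        long counterPart = counter & ((1L << counterBits) - 1);
        return timePart | nodePart | counterPart;
    }

    static long time(long tsid) {
        return tsid >>> RANDOM_BITS;
    }

    static long node(long tsid, int nodeBits) {
        return (tsid >>> (RANDOM_BITS - nodeBits)) & ((1L << nodeBits) - 1);
    }

    static long counter(long tsid, int nodeBits) {
        return tsid & ((1L << (RANDOM_BITS - nodeBits)) - 1);
    }

    public static void main(String[] args) {
        int nodeBits = 10; // assumed width for this example
        long id = compose(1000L, nodeBits, 12L, 5L);
        System.out.println(time(id));              // prints 1000
        System.out.println(node(id, nodeBits));    // prints 12
        System.out.println(counter(id, nodeBits)); // prints 5
    }
}
```

Because the time component occupies the most significant bits, identifiers created at a later millisecond compare greater than earlier ones regardless of node or counter, which is what makes the values time-sorted.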