intfloat/multilingual-e5-large

How did you create (title, paragraph) pairs from c4?

#45

by itayair - opened Aug 7, 2024

In addition, did you filter the C4 data?

Owner Aug 9, 2024

To create contrastive pairs, please refer to the discussion at https://huggingface.co/intfloat/multilingual-e5-large/discussions/37#664b1fe87a1ed3e001471b2f

And yes, we filter the mC4 data using the consistency-based filtering approach described in "Text Embeddings by Weakly-Supervised Contrastive Pre-training".
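For readers who have not seen the paper, here is a rough sketch of what consistency-based filtering can look like. As far as I understand it, a model trained on the noisy pairs ranks each pair's passage against a pool of random passages, and the pair is kept only if its own passage lands in the top k. The function name, the toy vectors, and the plain dot-product scoring below are my own assumptions for illustration, not the authors' code.

```python
from typing import List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Plain dot product, used here as a stand-in similarity score."""
    return sum(x * y for x, y in zip(a, b))


def consistency_filter(
    query_vecs: List[Vector],
    passage_vecs: List[Vector],
    pool_vecs: List[Vector],
    k: int = 2,
) -> List[int]:
    """Return indices of (query, passage) pairs that survive filtering.

    A pair is kept when fewer than k passages from the random pool score
    strictly higher than the pair's own passage, i.e. its rank is within
    the top k. All embeddings are assumed to come from a model trained on
    the noisy pairs (hypothetical here).
    """
    kept = []
    for i, (q, p) in enumerate(zip(query_vecs, passage_vecs)):
        own_score = dot(q, p)
        # Count pool passages that beat the paired passage for this query.
        better = sum(1 for other in pool_vecs if dot(q, other) > own_score)
        if better < k:
            kept.append(i)
    return kept


# Toy 2-d embeddings, purely illustrative.
queries = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
passages = [[0.9, 0.1], [0.8, 0.2], [0.6, 0.4]]
pool = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0], [0.7, 0.3]]

kept_indices = consistency_filter(queries, passages, pool, k=2)
```

With these toy vectors only the first pair stays within the top 2, so the other two would be dropped as likely noise. At real scale the pool is very large and the scoring would be batched on GPU rather than looped in Python.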

One more thing: what do you take as the page_content? (The web pages might be much longer than 512 tokens.)

Owner Aug 9, 2024

In that case, texts will be truncated to fit the model's maximum supported length.
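To make the truncation concrete, here is a minimal sketch under a few assumptions: a whitespace split stands in for the real subword tokenizer, the limit is 512 positions, and two of those positions are reserved for special tokens. The helper name `truncate_tokens` is hypothetical. With the Hugging Face tokenizer, the same effect is usually obtained by passing `max_length=512, truncation=True` when encoding.

```python
from typing import List, Sequence


def truncate_tokens(
    tokens: Sequence[str],
    max_length: int = 512,
    num_special_tokens: int = 2,
) -> List[str]:
    """Keep only the leading tokens that fit within the model's limit.

    num_special_tokens is an assumption: it reserves room for the
    begin/end markers a real tokenizer would add around the text.
    """
    budget = max_length - num_special_tokens
    return list(tokens[:budget])


# A fake "web page" of 1000 words; whitespace splitting is only a stand-in
# for the model's subword tokenizer.
page_content = " ".join(f"word{i}" for i in range(1000))
tokens = page_content.split()

kept = truncate_tokens(tokens)
dropped = len(tokens) - len(kept)
```

Everything after the budget is simply discarded, so for a long page only the beginning contributes to the embedding.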
