Sam Horradarn
Recent Activity
sirahd/testtesttest
Building for an Open Future - our new partnership with Google Cloud
On the Shifting Global Compute Landscape
Welcome EmbeddingGemma, Google's new efficient embedding model
Building Tensors from Scratch in Rust (Part 1.3): Data Operations
Parquet Content-Defined Chunking
Migrating the Hub from Git LFS to Xet
Xet is now the default storage option for new users and organizations
Welcome Llama 4 Maverick & Scout on Hugging Face
Xet is on the Hub
How can we find the chunk content using the chunk hash?
Chunks are produced by content-defined chunking (CDC), and each chunk's hash is computed from its content, so two chunks with the same content always share the same hash. This removes the need to store a separate chunk hash -> chunk content mapping: if two chunks share the same hash, we know they have identical content.
The CAS system only stores "block_hash -> block_content". Where is the mapping from chunk to block stored?
This is explained in the "key chunks" section of the blog post above. Essentially, we only store a tiny subset of the chunk -> block mappings by leveraging spatial locality in the file. Trying to store every chunk -> block mapping gets impractical very quickly.
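As a rough sketch of how a sparse index like this can work, the Python below indexes only "key" chunks and relies on a key-chunk hit to pull in its neighbours from the same block. The selection rule (`hash % KEY_MODULUS == 0`), the modulus value, and all function names are hypothetical and only meant to illustrate the idea; see the blog post for how key chunks are actually chosen.

```python
import hashlib

KEY_MODULUS = 4  # hypothetical; a real system would index far fewer chunks


def chunk_hash(chunk: bytes) -> int:
    # Assumption: an integer hash derived from SHA-256, for illustration only.
    return int.from_bytes(hashlib.sha256(chunk).digest()[:8], "big")


def is_key_chunk(h: int) -> bool:
    # Assumed selection rule: a simple modulo condition on the chunk hash.
    return h % KEY_MODULUS == 0


def build_sparse_index(blocks: dict[str, list[bytes]]) -> dict[int, str]:
    """Index only key chunks: key chunk hash -> id of a block that holds it."""
    index: dict[int, str] = {}
    for block_id, chunks in blocks.items():
        for c in chunks:
            h = chunk_hash(c)
            if is_key_chunk(h):
                index.setdefault(h, block_id)
    return index


def find_known_chunks(new_chunks: list[bytes],
                      index: dict[int, str],
                      blocks: dict[str, list[bytes]]) -> set[int]:
    """Return hashes of already-stored chunks discovered through key-chunk hits."""
    known: set[int] = set()
    for c in new_chunks:
        h = chunk_hash(c)
        if is_key_chunk(h) and h in index:
            # Spatial locality: chunks near a key chunk tend to live in the
            # same block, so one hit reveals all of that block's chunks.
            known.update(chunk_hash(x) for x in blocks[index[h]])
    return known
```

The index holds only a fraction of the chunk hashes, yet a single key-chunk match is enough to deduplicate a whole run of neighbouring chunks.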
What do the shards store? Is it "file_name, shard_id, chunk_hash, block_hash"?
You can think of a shard as storing a mapping from each file (identified by its file hash) to the list of chunks that make up that file.
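A simplified way to picture this in Python is shown below. The `Shard` class, the `add_file` and `reconstruct` helpers, the in-memory `chunk_store` dictionary (standing in for the CAS, which really stores blocks), and deriving the file hash from the chunk hashes are all assumptions made for the sketch, not the actual shard format.

```python
import hashlib
from dataclasses import dataclass, field


def sha(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


@dataclass
class Shard:
    # file hash -> ordered list of chunk hashes that make up the file
    files: dict[str, list[str]] = field(default_factory=dict)


def add_file(shard: Shard, chunk_store: dict[str, bytes], chunks: list[bytes]) -> str:
    """Record a file in the shard and store any chunks not seen before."""
    chunk_hashes = [sha(c) for c in chunks]
    for ch, c in zip(chunk_hashes, chunks):
        chunk_store.setdefault(ch, c)  # identical chunks are stored once
    # Assumption: the file hash is derived from the ordered chunk hashes.
    file_hash = sha("".join(chunk_hashes).encode())
    shard.files[file_hash] = chunk_hashes
    return file_hash


def reconstruct(shard: Shard, chunk_store: dict[str, bytes], file_hash: str) -> bytes:
    """Rebuild a file by concatenating its chunks in the recorded order."""
    return b"".join(chunk_store[ch] for ch in shard.files[file_hash])
```

Two files that share chunks each get their own entry in the shard, but the shared chunk content is kept only once in the store.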
I hope this helps explain our underlying tech better!