sentence-transformers
`[refactor]` model loading - no more unnecessary file downloads
#2345
Merged


tomaarsen · 1 year ago · πŸ‘ 1

Hello!

Pull Request overview

  • Refactor the model loading;
    • No longer download the full model repository.
    • Update cache format to git style via hf_hub_download.
    • No longer use deprecated cached_download.
    • Soft deprecation of use_auth_token in favor of token as required by recent transformers/huggingface_hub versions.
  • Add a test to ensure that the correct/appropriate files are downloaded (sketched right after this list).
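
As a rough illustration, such a test could look like the sketch below (the model id is a placeholder for the test model mentioned under Details, and the cache inspection is simplified):

import os

from sentence_transformers import SentenceTransformer

def test_only_safetensors_is_downloaded(tmp_path):
    # "some-org/model-with-both-weight-formats" is a placeholder for a model
    # that ships both pytorch_model.bin and model.safetensors.
    SentenceTransformer("some-org/model-with-both-weight-formats", cache_folder=str(tmp_path))
    # Collect the names of all files (and snapshot symlinks) in the cache.
    downloaded = {f for _, _, files in os.walk(tmp_path) for f in files}
    assert "model.safetensors" in downloaded
    assert "pytorch_model.bin" not in downloaded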

Details

In short, model downloading has moved from greedy full-repository downloading to lazy per-module downloading, where no files are downloaded at all for Transformer modules.

Original model loading steps

  1. Greedily download the full model repository to the cache folder.
  2. Check if modules.json exists.
  3. If so, load all modules individually using the local files downloaded in step 1.
  4. If not, load a Transformer module using the local files downloaded in step 1, plus Pooling.
  5. Done.

New model loading steps

  1. Check if modules.json exists locally or on the Hub.
  2. If so,
    a. Download the ST configuration files ('config_sentence_transformers.json', 'README.md', 'modules.json') if they're remote.
    b. For each module: if it is not a Transformer module, download (if necessary) the directory with the configuration/weights for that module; if it is a Transformer module, do not download anything and load the model via model_name_or_path instead.
  3. If not, load a Transformer module using model_name_or_path, plus Pooling.
  4. Done.

With this changed setup, we defer downloading any transformers data to transformers itself. For a test model that I uploaded with both pytorch_model.bin and model.safetensors, only the safetensors file is downloaded; this is verified in the attached test case.
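
As an illustration, the core of the lazy flow looks roughly like this (a minimal sketch, not the PR's exact code; load_modules_config is a hypothetical helper name):

import json
import os

from huggingface_hub import hf_hub_download

def load_modules_config(model_name_or_path, token=None):
    # Local paths are read directly; for Hub repositories, only modules.json is
    # fetched into the git-style cache instead of the entire repository.
    if os.path.isdir(model_name_or_path):
        modules_json_path = os.path.join(model_name_or_path, "modules.json")
    else:
        modules_json_path = hf_hub_download(model_name_or_path, "modules.json", token=token)
    with open(modules_json_path, encoding="utf-8") as fIn:
        return json.load(fIn)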

Additional changes

As required by huggingface_hub, we now use token instead of use_auth_token. If use_auth_token is still provided, then token = use_auth_token is set and a warning is emitted, i.e. a soft deprecation.
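
In code, the pattern is roughly the following (a sketch; resolve_token is a hypothetical helper name, not the actual implementation):

import warnings

def resolve_token(token=None, use_auth_token=None):
    # Soft deprecation: keep accepting the old argument, map it onto the new
    # one, and warn about the upcoming removal.
    if use_auth_token is not None:
        warnings.warn(
            "The `use_auth_token` argument is deprecated and will be removed; "
            "please use `token` instead.",
            FutureWarning,
        )
        if token is None:
            token = use_auth_token
    return token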

  • Tom Aarsen
tomaarsen Refactor model loading: no full repo download
4bf9e994
tomaarsen Add simple test regarding efficient loading
31646a9e
tomaarsen Replace use_auth_token with token in docstring
a1a1cd75
tomaarsen changed the title from "Refactor model loading - no more unnecessary file downloads" to "`[refactor]` model loading - no more unnecessary file downloads" · 1 year ago
bwanglzu commented on 2023-11-14
sentence_transformers/SentenceTransformer.py
  :param modules: This parameter can be used to create custom SentenceTransformer models from scratch.
  :param device: Device (like 'cuda' / 'cpu') that should be used for computation. If None, checks if a GPU can be used.
  :param cache_folder: Path to store models. Can also be set by the SENTENCE_TRANSFORMERS_HOME environment variable.
- :param use_auth_token: HuggingFace authentication token to download private models.
+ :param token: HuggingFace authentication token to download private models.
bwanglzu · 1 year ago

It seems use_auth_token is still in the constructor, so there's no need to delete the docstring.

tomaarsen · 1 year ago · πŸ‘ 1

I've deleted the docstring because use_auth_token will be softly deprecated. I'd rather not have deprecated arguments in the docstring.

bwanglzu commented on 2023-11-14
sentence_transformers/SentenceTransformer.py
  device: Optional[str] = None,
  cache_folder: Optional[str] = None,
- use_auth_token: Union[bool, str, None] = None
+ token: Optional[Union[bool, str]] = None,
+ use_auth_token: Optional[Union[bool, str]] = None,
bwanglzu · 1 year ago (edited)

Maybe Optional is not needed? A default of False would be better; also, I'm not quite sure why str is needed.

tomaarsen · 1 year ago

Optional is needed whenever None is a valid value to pass, which it is here. I've kept str because token/use_auth_token is passed directly to transformers and huggingface_hub, which use Optional[Union[str, bool]] as the type:

https://github.com/huggingface/transformers/blob/f1185a4a73a03d238afce1b40456588d22520dd2/src/transformers/modeling_utils.py#L2303

Note also that you can pass any of:

  • str: the token to use as HTTP bearer authorization for remote files.
  • True: use the token generated when running huggingface-cli login (stored in ~/.huggingface).
  • False: do not use any token.
  • None: same as True (so not the same as False).
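
For example (the model names below are placeholders):

from sentence_transformers import SentenceTransformer

SentenceTransformer("some-org/private-model", token="hf_...")  # explicit token string
SentenceTransformer("some-org/private-model", token=True)      # token from `huggingface-cli login`
SentenceTransformer("some-org/public-model", token=False)      # send no token at all
SentenceTransformer("some-org/public-model")                   # token=None, behaves like True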
bwanglzu · 1 year ago · ❀ 1

I see, thanks for the clarification!

bwanglzu commented on 2023-11-14
sentence_transformers/util.py
+ def is_sentence_transformer_model(model_name_or_path: str, token: Optional[Union[bool, str]] = None) -> bool:
+     if os.path.exists(model_name_or_path):
+         return os.path.exists(os.path.join(model_name_or_path, "modules.json"))
bwanglzu · 1 year ago

Not sure if modules.json is guaranteed to be there; maybe sentence_bert_config.json is better? https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2/blob/main/sentence_bert_config.json

tomaarsen · 1 year ago

I think modules.json must always be there, while sentence_bert_config.json is only usually there. For example, this code makes me believe that there may be models that don't use sentence_bert_config.json:

# Old classes used other config names than 'sentence_bert_config.json'
for config_name in ['sentence_bert_config.json', 'sentence_roberta_config.json', 'sentence_distilbert_config.json',
                    'sentence_camembert_config.json', 'sentence_albert_config.json',
                    'sentence_xlm-roberta_config.json', 'sentence_xlnet_config.json']:
    sbert_config_path = os.path.join(input_path, config_name)
    if os.path.exists(sbert_config_path):
        break

And I can find some older models that don't even use transformers but do still have modules.json: https://huggingface.co/sentence-transformers/average_word_embeddings_levy_dependency/tree/main

I appreciate you looking into this though!
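
For reference, a minimal sketch of the check being discussed (illustrative only; this version assumes plain hf_hub_download error handling, whereas the PR later switched to a load_file_path helper):

import os

from huggingface_hub import hf_hub_download
from huggingface_hub.utils import EntryNotFoundError

def is_sentence_transformer_model(model_name_or_path, token=None):
    # Local directories: check for modules.json on disk.
    if os.path.exists(model_name_or_path):
        return os.path.exists(os.path.join(model_name_or_path, "modules.json"))
    # Hub repositories: try to fetch only modules.json, nothing else.
    try:
        hf_hub_download(model_name_or_path, "modules.json", token=token)
        return True
    except EntryNotFoundError:
        return False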

bwanglzu · 1 year ago

I see, that makes sense. To be frank, the way it currently handles configs is not so beautiful lol, maybe a better design is needed later :)

tomaarsen · 1 year ago (edited) · πŸ‘ 1

I agree completely! I've already started brainstorming some options, though I think my main idea would be infeasible due to some breaking changes here and there. It involves making SentenceTransformer a subclass of PreTrainedModel rather than of nn.Sequential. Then it could use more of the functionality from transformers, e.g. load_in_8bit, PEFT, etc.

The modules.json and sentence_bert_config.json would be removed in favor of placing that information inside config_sentence_transformers.json, and the other folders (e.g. ..._Pooling, ..._Dense or ..._Normalize) could be removed as well. Their configuration would live inside that single config file, config_sentence_transformers.json, and the weights (e.g. for Dense) would be stored natively by transformers via save_pretrained, because SentenceTransformer would then be a special subclass of PreTrainedModel.

My primary concern is models that don't use transformers, but those are few and far between.

I'd love your thoughts on this!

bwanglzu · 1 year ago

I'm not as familiar as you with the transformers source code; let me read a bit of the PreTrainedModel class and get back to you with some proper thoughts.

bwanglzu commented on 2023-11-14

Left some very minor comments. Do you think it makes sense, at some point, to refactor the tests to pytest? I personally find it much more effective than unittest.

tomaarsen · 1 year ago · πŸš€ 1

I also prefer pytest. I would indeed like to fully refactor the tests and heavily improve them; the current coverage is quite low for my taste. Thanks for the review, by the way!

  • Tom Aarsen
Sirri69 · 1 year ago

Somebody, for the love of god, please merge this and update PyPI.

tomaarsen Prevent crash if internet is down
e1ca4083
tomaarsen Merge branch 'master' into feat/efficient_loading
7618a4f2
Sirri69 · 1 year ago

THANK YOU

tomaarsen · 1 year ago · πŸ˜„ 1

@Sirri69 I'm on it πŸ˜‰ Give it a few days.

I made updates to introduce better support when the Internet is unavailable. We can now run the following script under various settings:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
embeddings = model.encode("This is a test sentence", normalize_embeddings=True)
print(embeddings.shape)

These are now the outputs under the various settings:

  • Cache + Internet: (384,)
  • Cache + No Internet: (384,)
  • No Cache + Internet: only the required files are downloaded, then (384,):

modules.json: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 349/349 [00:00<?, ?B/s]
config_sentence_transformers.json: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 116/116 [00:00<?, ?B/s]
README.md: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 10.6k/10.6k [00:00<?, ?B/s]
sentence_bert_config.json: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 53.0/53.0 [00:00<?, ?B/s]
config.json: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 612/612 [00:00<?, ?B/s]
pytorch_model.bin: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 90.9M/90.9M [00:06<00:00, 14.9MB/s]
tokenizer_config.json: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 350/350 [00:00<?, ?B/s]
vocab.txt: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 232k/232k [00:00<00:00, 1.36MB/s]
tokenizer.json: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 466k/466k [00:00<00:00, 4.97MB/s]
special_tokens_map.json: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 112/112 [00:00<00:00, 90.1kB/s]
1_Pooling/config.json: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 190/190 [00:00<?, ?B/s]
(384,)

  • No Cache + No Internet: OSError: We couldn't connect to 'https://huggingface.co' to load this file, couldn't find it in the cached files and it looks like sentence-transformers/all-MiniLM-L6-v2 is not the path to a directory containing a file named config.json. Checkout your internet connection or see how to run the library in offline mode at 'https://huggingface.co/docs/transformers/installation#offline-mode'.

This is exactly what I would hope to get.
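
(Side note, an assumption about standard Hugging Face tooling rather than something this PR adds: the "No Cache/Cache + No Internet" cases can be simulated by setting the offline environment variables before the imports:)

import os

# Simulate being offline; these must be set before huggingface_hub/transformers are imported.
os.environ["HF_HUB_OFFLINE"] = "1"
os.environ["TRANSFORMERS_OFFLINE"] = "1"

from sentence_transformers import SentenceTransformer

# Loads from the cache if previously downloaded, raises an OSError otherwise.
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")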

cc: @nreimers as we discussed this.

  • Tom Aarsen
tomaarsen Use load_file_path in "is_sbert_model"
f26ba94d
tomaarsen Merge branch 'master' of https://github.com/UKPLab/sentence-transform…
a00482f2
tomaarsen Merge branch 'master' into feat/efficient_loading
033bf6d5
tomaarsen Merge branch 'master' into feat/efficient_loading
255e828d
tomaarsen merged 331549c0 into master 1 year ago
tomaarsen deleted the feat/efficient_loading branch 1 year ago
peiyangL · 289 days ago · πŸ‘ 1

@tomaarsen

Hi, I appreciate this update to support model loading without an internet connection.

However, I find that loading the model is very slow without an internet connection. My testing code is as follows:

import time
start = time.time()
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("nomic-ai/nomic-embed-text-v1", trust_remote_code=True, device='cpu')
emb = model.encode(["hello world"])
print(emb.shape)
print('time:', time.time()-start)

The output is as follows:

# without internet
<All keys matched successfully>
(1, 768)
time: 376.90756702423096

# with internet
<All keys matched successfully>
(1, 768)
time: 15.75501823425293

Additionally, I found that adding the local_files_only=True parameter speeds up model loading without an internet connection, but it is still quite slow.

import time
start = time.time()
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("nomic-ai/nomic-embed-text-v1", trust_remote_code=True, device='cpu', local_files_only=True)
emb = model.encode(["hello world"])
print(emb.shape)
print('time:', time.time()-start)

# output:
# <All keys matched successfully>
# (1, 768)
# time: 145.69492316246033
