Model dimension mismatch: 1024 vs. 768
I encountered a compatibility issue when attempting to load the checkpoint LLM2CLIP-EVA02-B-16.pt into the 'EVA02-CLIP-B-16' architecture built by create_model_and_transforms() with force_custom_clip=True. The error indicates a mismatch between the model's expected state_dict and the provided checkpoint.
Error Details:
```
Unexpected key(s) in state_dict: "visual.blocks.12.norm1.weight", "visual.blocks.12.norm1.bias", ..., "visual.blocks.23.mlp.w3.weight", "visual.blocks.23.mlp.w3.bias".
size mismatch for visual.cls_token: copying a param with shape torch.Size([1, 1, 1024]) from checkpoint, the shape in current model is torch.Size([1, 1, 768]).
```
The checkpoint's vision tower appears to be both wider and deeper than the model being created: its embedding width is 1024 and its state_dict contains keys for visual.blocks.12 through visual.blocks.23 (i.e. at least 24 transformer blocks), while the 'EVA02-CLIP-B-16' config builds a 768-wide tower with only 12 blocks.
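For what it's worth, a quick inspection of the checkpoint alone confirms the width and depth the errors point to. This is a minimal sketch, assuming the .pt file holds a flat state_dict (possibly nested under a "state_dict" key):

```python
import torch

# Load the checkpoint on CPU and unwrap a possible "state_dict" nesting.
ckpt = torch.load('LLM2CLIP-EVA02-B-16.pt', map_location='cpu')
sd = ckpt.get('state_dict', ckpt) if isinstance(ckpt, dict) else ckpt

# Width of the vision tower: the error above reports 1024 here,
# while the EVA02-CLIP-B-16 config builds a 768-wide tower.
print('cls_token shape:', sd['visual.cls_token'].shape)

# Depth of the vision tower: keys up to visual.blocks.23 imply 24 blocks,
# whereas the B/16 config only creates 12.
depth = 1 + max(int(k.split('.')[2]) for k in sd if k.startswith('visual.blocks.'))
print('number of visual blocks:', depth)
```

Per the errors above, this should report a 1024-wide cls_token and 24 visual blocks, i.e. a larger tower than the EVA02-CLIP-B-16 config creates.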
Steps to Reproduce:
Load the model and checkpoint as follows:
```python
import torch
from eva_clip import create_model_and_transforms  # EVA-CLIP-style helper used here

model, _, preprocess_val = create_model_and_transforms('EVA02-CLIP-B-16', force_custom_clip=True)
ckpt = torch.load('LLM2CLIP-EVA02-B-16.pt', map_location='cpu')
model.load_state_dict(ckpt)  # raises the unexpected-key / size-mismatch errors above
```
I guess you may have used the wrong repository. Please go to this repository and call create_model_and_transforms from there.
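In the meantime, a pre-flight check along these lines (a hypothetical helper, not something either repository ships) prints the key and shape differences in one place, whichever create_model_and_transforms ends up being used:

```python
import torch

# Hypothetical helper: compare a checkpoint against a freshly built model
# before calling load_state_dict, so a wrong config fails with a readable
# report instead of a wall of errors.
def report_mismatches(model: torch.nn.Module, state_dict: dict) -> None:
    model_sd = model.state_dict()
    for name, tensor in state_dict.items():
        if name not in model_sd:
            print('unexpected key:', name)
        elif model_sd[name].shape != tensor.shape:
            print(f'size mismatch for {name}: checkpoint {tuple(tensor.shape)} '
                  f'vs model {tuple(model_sd[name].shape)}')
    for name in model_sd.keys() - state_dict.keys():
        print('missing key:', name)

# Example usage once a model has been built:
# report_mismatches(model, torch.load('LLM2CLIP-EVA02-B-16.pt', map_location='cpu'))
```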