RTX 3090 24GB working with extra env var

#4
by cktlco - opened

Thanks for the great work!

FYI, I was able to get the README demo script to run on Linux with an RTX 3090 24GB only after setting the following env var to avoid a CUDA OOM:

 PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True python readme_demo.py
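If prefixing the command is inconvenient (for example in a notebook), the same variable can be set from Python. This is only a sketch: the helper name `configure_cuda_allocator` is made up for illustration, and it assumes the variable is set before `torch` is imported, which is the safe ordering.

```python
import os


def configure_cuda_allocator(setting="expandable_segments:True"):
    """Set PYTORCH_CUDA_ALLOC_CONF unless the caller already exported a value."""
    # setdefault keeps any value already present in the environment.
    os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", setting)
    return os.environ["PYTORCH_CUDA_ALLOC_CONF"]


active_setting = configure_cuda_allocator()
print(f"PYTORCH_CUDA_ALLOC_CONF={active_setting}")

# Import torch (and load the model) only after this point, e.g.:
# import torch
```

Using `setdefault` means a value passed on the command line still wins over the default in the script.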

It's running with this change, but it's extremely slow for me on an L4.

Agreed, it's too slow to be usable for anything other than ad hoc tests.

As an alternative for enthusiasts, here is a 4-bit quantized version of molmo-7B which fits in ~12GB VRAM and is much more responsive:
https://huggingface.co/cyan2k/molmo-7B-O-bnb-4bit

Nice! Yeah, the transformers integration is unfortunately extremely slow because it uses a for loop over the experts. We need to integrate it into vLLM/SGLang/llama.cpp like the other OLMoE models.
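To illustrate what looping over the experts means, here is a toy sketch in plain Python. It is not the code in `modeling_molmoe.py`; the function name, the top-1 routing rule, and the four toy experts are all assumptions made for the example.

```python
def moe_forward_looped(tokens, experts, router):
    """Toy top-1 mixture-of-experts layer that visits experts one at a time."""
    # The router picks one expert index per token.
    assignments = [router(tok) for tok in tokens]
    outputs = [None] * len(tokens)
    # One sequential pass per expert: each pass only handles the tokens
    # routed to that expert, so the work is split into many small steps.
    for expert_id, expert in enumerate(experts):
        for pos, tok in enumerate(tokens):
            if assignments[pos] == expert_id:
                outputs[pos] = expert(tok)
    return outputs


# Four hypothetical experts; expert k multiplies its input by (k + 1).
experts = [lambda x, k=k: x * (k + 1) for k in range(4)]
router = lambda x: x % len(experts)

result = moe_forward_looped([0, 1, 2, 3, 5], experts, router)
print(result)
```

Dedicated inference engines typically replace this kind of per-expert Python loop with batched or fused kernels, which is why integrating the model there should help.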

What hardware and settings are needed to run this model properly? I was able to run it on CPU only (i9 processor, 64 GB RAM), but inference takes 165 sec per image. And I get all kinds of errors if I use a CUDA GPU. These are my settings:
CUDA version: 12.4
CUDA device count: 1
Current CUDA device: 0
Current CUDA device name: NVIDIA GeForce RTX 4090 Laptop GPU

The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization.
The tokenizer class you load from this checkpoint is 'GPTNeoXTokenizer'.
The class this function is called from is 'Qwen2TokenizerFast'.
The model weights are not tied. Please use the tie_weights method before using the infer_auto_device function.
Some parameters are on the meta device device because they were offloaded to the cpu.
Running on local URL: http://127.

To create a public link, set share=True in launch().
C:\Users\15023.venv\Lib\site-packages\transformers\generation\utils.py:1885: UserWarning: You are calling .generate() with the input_ids being on a device type different than your model's device. input_ids is on cpu, whereas the model is on cuda. You may experience unexpected behaviors or slower generation. Please make sure that you have put input_ids to the correct device by calling for example input_ids = input_ids.to('cuda') before running .generate().
warnings.warn(
C:\Users\15023.cache\huggingface\modules\transformers_modules\allenai\MolmoE-1B-0924\d33e4c2b8f093f5262875cad2c77fbf52e0c86ed\modeling_molmoe.py:1052: UserWarning: 1Torch was not compiled with flash attention. (Triggered internally at C:\actions-runner_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\native\transformers\cuda\sdp_utils.cpp:566.)
attn_output = F.scaled_dot_product_attention(

/////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////

File "c:\Users\15023\Documents\Models\molmo_test.py", line 71, in describe_image_async
print(f"Input device: {inputs['pixel_values'].device}")
~~~~~~^^^^^^^^^^^^^^^^
KeyError: 'pixel_values'

I used torch.float16 (half precision) when loading the model to reduce memory usage.
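Two of the messages above point at the same spot in the script: the warning says `input_ids` is on the CPU while the model is on CUDA, and the `KeyError` shows that the processor output has no `pixel_values` key. A minimal sketch of a key-agnostic way to handle both follows. The helper names `to_model_device` and `report_devices` are hypothetical, and `inputs` and `model` in the usage comments are assumed to be the objects from your own script.

```python
def to_model_device(inputs, device):
    """Move every tensor-like value (anything with a .to method) to `device`."""
    return {
        name: value.to(device) if hasattr(value, "to") else value
        for name, value in inputs.items()
    }


def report_devices(inputs):
    """Map each key to its device as a string, without assuming any key exists."""
    return {
        name: str(getattr(value, "device", "n/a"))
        for name, value in inputs.items()
    }


# Usage in the script, instead of indexing inputs['pixel_values'] directly:
# inputs = to_model_device(inputs, model.device)
# print(report_devices(inputs))  # shows which keys the processor really returned
```

Printing the whole mapping also tells you which keys are actually present, so the debug line no longer depends on a key name that this processor may not produce.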
