Hugging Face Spaces with Jupyter notebooks and a GPU

Also known as the poor man's Google Colab alternative

This article could also be titled: how to have a good time exploring LLMs, without going broke!

If you have ever scratched the surface of "modern" AI (mostly the generative stuff), you have surely noticed that, sooner or later, either a gigantic CPU or a decent GPU is needed - even just for model inference, so you don't die of boredom waiting for your results.

There are many (too many!!) vendors out there, and they all want your money, badly! Among this bunch of thieves (ahem, providers), some are more affordable than others: for example, it is really difficult to find a better deal than Google Colab, whose Pro plan (at 12 EUR/month) gives you a nice budget of GPU and CPU minutes - you choose how to use them!

Google Colab is great, but even the Pro version has some serious limitations. For example, I find it annoying that:

  • you cannot run background processes unless the web UI is fully loaded in your browser and actively used
  • the instance gets shut down after a period of user inactivity
  • you cannot easily SSH into the system (there are workarounds, but... seriously, are we here to experiment with networking?)
  • GPU usage is costly (from a few cents to a few euros per hour, which can be a lot for what you get!)
  • storage is ephemeral (although you can save notebooks and some files on Google Drive)
  • you cannot version your notebooks - unless you resort to extra workarounds, like mounting Google Drive, using git, etc.

Nevertheless, Colab is extremely useful and powerful, and I highly recommend it - if you need specific hardware and have clear objectives!

Why Hugging Face instead of Google Colab?

https://huggingface.co/ is a grown-up startup which "in August 2023 announced that it raised $235 million in a Series D funding, at a $4.5 billion valuation" - and these folks are definitely investing back, giving their users good services at a reasonable price.
This article focuses on their Pro account, which is sold at 9 USD (not EUR!!) per month:

If you look closely at the picture, you will notice a nice and mysterious sentence:

Dev mode for spaces

Dev mode for Spaces is a beta feature which allows users to SSH into a running container and troubleshoot the deployed code in place.
In a few words:

We can run a container with a Python interpreter inside an HF server, SSH into it, connect a Visual Studio Code instance, and start running custom code.

The ZeroGPU project

Recently HF introduced the possibility to run Gradio applications (i.e., Python code) on a shared GPU.
This is a very interesting and promising feature: when experimenting with LLMs we do not need a GPU reserved 100% of the time, only when we actually execute something on it!
This is actually one of Colab's downsides: once you turn on an instance with, say, an A100 GPU, you pay from the first second until the moment you turn it off.

Let me try to anticipate things and say it plainly: shared GPU environments are the future.

Reserving a GPU can sometimes be a waste of money - especially in interactive environments, which push you toward a "trial and error" approach.

Dev mode and ZeroGPU, a step-by-step example

We need a new space:

The configuration is trivial; after assigning a name:

  1. Gradio is the only type of Space which allows ZeroGPU (for now!)
  2. The blank template is enough
  3. ZeroGPU must be selected
  4. Dev mode can also be turned on later, although there is no reason not to enable it right away!
  5. This is not a public Gradio application, therefore the Space should be private.

The HF Pro account has a number of limitations; for example, you can have at most 10 Spaces running on ZeroGPU - but we just need one for our Colab alternative!

When the space is created, the browser will show the default Gradio example application:

The Dev mode is the most important part, together with the fact that HF is telling you that ZeroGPU hardware is enabled! You can find more information about Dev mode here: https://huggingface.co/spaces/dev-mode-explorers/README

Open the container in Visual Studio Code

The beauty of VSCode is that you can run it in your browser!
The UI/UX experience you can achieve is very similar to Google Colab (in my opinion it is even much better!).

Clicking the run button will open a new instance of VSCode (in another tab of your browser).
The Space filesystem only contains the demo Gradio stuff: an app.py Python file, a README.md, and a few other git files which you'd better not touch!

(my instance is not vanilla, that's why you also see other files!)

Install the Python and Jupyter extensions in VSCode

This article assumes that you already know how to run notebooks in VSCode; if not, search Google for how to do it - it is pretty easy!

The container filesystem is ephemeral

The Space container is ephemeral. When it is created, HF simply pulls your repository HEAD - but any change you make from VSCode is not automatically persisted unless you commit it back.

From VSCode you can use Git commands in the integrated console, or you can push using the integrated Git features in the UI.

If you wish to persist changes made while Dev Mode is enabled, you need to use git from inside the Space container (using VS Code or SSH). For example:

# Add changes and commit them
git add .
git commit -m "Persist changes from Dev Mode"

# Push the commit to persist them in the repo
git push

We need a virtual-env

The user logged into the Space is clearly not a super-user; moreover, messing with the container's Python distribution is not very convenient.
My suggestion is to start your journey by creating a Python virtual environment.
This can easily be achieved by opening the console:

And typing:

python -m venv .venv

At this point Visual Studio Code will also detect that you are working in a virtual environment, and it will switch the Python interpreter to it (after you confirm the operation in a popup).

The virtual env can also be activated in the console:

source .venv/bin/activate

Once all the steps are done correctly, the visual feedback should be similar to the below:

From the virtual env we can pip install as many packages as we want.
A good way to keep these operations simple and grouped is to create a new init.sh file in your Space root:

#!/bin/bash
python -m venv .venv
.venv/bin/pip install ipykernel ipywidgets

(remember to chmod +x init.sh)

Every time the container is restarted, this init file can be launched from the VSCode console.
You can, for example, list all the needed Python packages in a requirements.txt file and ask pip to install them - but I actually prefer to have !pip in my notebook!
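For instance, once a notebook is up and running (next section), its first cell could handle the installs (a sketch; the package list is just an example of what you might need):

# First cell of the notebook: install packages into the active venv
# (the package names below are examples, adjust them to your experiments)
!pip install transformers accelerate pillow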

Create a new notebook

Using the VSCode command palette (Ctrl + Shift + P) we can create a new Jupyter notebook:

Each notebook must have a kernel selected:

This is the final result:

VSCode will detect that your venv does not have a Jupyter kernel installed, and it will install one for you!!

Test out the whole system:
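A trivial cell is enough for a sanity check (the exact .venv path depends on your Space):

import sys

# The interpreter should point into the virtual environment created earlier
print(sys.executable)
print("hello from the Space!")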

Use ZeroGPU from within a notebook

This procedure is not (yet) promoted by HF, although it works very well.
Long story short: ZeroGPU seems to work outside Gradio too. At launch, the system gives you a warning which you can simply ignore.

To use the GPU from within the container, create a function in a notebook cell and decorate it with @spaces.GPU . Then you can simply call this function from any cell!

For example:

import spaces  # ZeroGPU helper package (pip install spaces)

# Note: `processor`, `model` and `device` are assumed to be defined in a
# previous cell (e.g. a transformers AutoProcessor plus a vision-language model)

@spaces.GPU  # Use the free GPU provided by Hugging Face Spaces
def predict(image, text):
    # Prepare the input messages
    messages = [
        {"role": "user", "content": [
            {"type": "image"},  # Specify that an image is provided
            {"type": "text", "text": text}  # Add the user-provided text input
        ]}
    ]

    # Create the input text using the processor's chat template
    input_text = processor.apply_chat_template(messages, add_generation_prompt=True)

    # Process the inputs and move them to the appropriate device
    inputs = processor(image, input_text, return_tensors="pt").to(device)

    # Generate a response from the model
    outputs = model.generate(**inputs, max_new_tokens=100)

    # Decode the output to return the final response
    response = processor.decode(outputs[0], skip_special_tokens=True)
    return response

Then call the function:
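A minimal usage sketch (the image file and the prompt are just placeholders):

from PIL import Image

# Any local test image will do; calling the function requests a GPU slot
image = Image.open("example.jpg")
print(predict(image, "Describe this image"))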

What’s happening under the hood

As you can see, there is a wait time; this can be optimized further, for example by declaring that your function will only run for a few seconds.
GPU assignment happens through a priority queue, and the system is built to prevent abuse.

You should read all the information here: https://huggingface.co/spaces/zero-gpu-explorers/README
In a few words: "If you expect your GPU function to take more than 60s then you need to specify a duration param in the decorator like:"

@spaces.GPU(duration=120)
def generate(prompt):
    return pipe(prompt).images

ZeroGPU is not suitable for every sort of task

Given the limitations imposed by HF, not every sort of activity can be carried out (in my opinion this is the biggest limitation). According to the documentation, the ZeroGPU environment only supports the versions below (a quick check cell follows the list):

  • Gradio: 4+
  • PyTorch: 2.0.1, 2.1.2, 2.2.2 and 2.4.0 (2.3.x is not supported due to a PyTorch bug)
  • Python: 3.10.13
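A quick way to verify that your environment matches (a sketch; run it in a notebook cell, assuming torch is installed in the venv):

import sys
import torch

print(sys.version)        # expect 3.10.x
print(torch.__version__)  # expect one of the supported versions above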

The Dev container specifications (and limitations)

According to my initial tests, the container runs on servers with the specs below (the snippet after the list shows one way to double-check them):

  • 96 cores (presumably 2x Intel(R) Xeon(R) Platinum 8275CL CPU @ 3.00GHz, which have 48 cores each)
  • 128 GB of RAM
  • 10 TB NVMe SSD
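These numbers can be inspected from a notebook cell, using only the standard library:

import os
import shutil

print(os.cpu_count())           # logical cores visible to the container

# Total RAM as reported by the kernel
with open("/proc/meminfo") as f:
    print(f.readline().strip()) # MemTotal: ... kB

# Size of the root filesystem
total, used, free = shutil.disk_usage("/")
print(f"disk: {total / 1024**4:.1f} TiB")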

Since this is a shared environment, HF imposes some quotas (resource limits) on the container.
Some of them can be queried from the /sys/fs/cgroup filesystem, for example:

  • cat /sys/fs/cgroup/memory/memory.limit_in_bytes -> 68 GB (it seems we can consume up to this amount of memory before getting killed by the OOM killer)
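The same check can be done from Python (cgroup v1 paths, as observed on my instance; a cgroup v2 container would expose different files):

# Memory quota imposed on the container (cgroup v1 layout)
with open("/sys/fs/cgroup/memory/memory.limit_in_bytes") as f:
    limit = int(f.read())

# Current memory usage, to see how close we are to the OOM killer
with open("/sys/fs/cgroup/memory/memory.usage_in_bytes") as f:
    usage = int(f.read())

print(f"memory: {usage / 1024**3:.1f} / {limit / 1024**3:.1f} GiB")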

What happens when the memory limit is hit? The container gets killed and is restarted.
This implies that one must be very cautious when dealing with large models loaded in memory!

Are there uptime limitations?

In theory, an HF Space can be put to sleep automatically after a period of inactivity.
According to my tests, so far this has not happened to me!
My Space, for example, has been running for hundreds of hours - without a single problem!

Of course, using Jupyter notebooks inside a Space Dev container is not officially supported, so your experience might deviate from mine!

Some people have complained about being dropped into a server with no space left on the device; this can happen - the system is shared, and if someone abuses it, the imposed quotas might not be enough!

How to further persist your data?

So far my needs have been quite simple, but a more efficient way to persist your data (e.g. big files) might be found.
Worst case scenario, packages can be reinstalled at every restart - since the .venv is ephemeral and not persisted in the repo.
HF also allows you to store big files in the Git repository, but I'm not sure this would be ethical!
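As for big files, one option worth exploring (a sketch; the repo and file names are hypothetical) is uploading large artifacts to a separate dataset repo with the huggingface_hub client:

from huggingface_hub import HfApi

api = HfApi()  # authentication via huggingface-cli login or an HF token

# Hypothetical example: push a large artifact to a dedicated dataset repo
api.upload_file(
    path_or_fileobj="big_artifact.bin",
    path_in_repo="big_artifact.bin",
    repo_id="your-username/your-datasets",
    repo_type="dataset",
)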

Enjoy your new HF experience! The Pro account also comes with other important features, like the possibility to make up to 20K API calls to many pre-trained models.
All of the above, plus the other provided features, can give you hours of fun and a beautiful learning experience!