Deploy a local LLM
Run models locally using Ollama, Xinference, or other frameworks.
RAGFlow supports deploying models locally using Ollama, Xinference, IPEX-LLM, or jina. If you have locally deployed models to leverage or wish to enable GPU or CUDA for inference acceleration, you can bind Ollama or Xinference into RAGFlow and use either of them as a local "server" for interacting with your local models.
RAGFlow seamlessly integrates with Ollama and Xinference, without the need for further environment configurations. You can use them to deploy two types of local models in RAGFlow: chat models and embedding models.
This user guide does not intend to cover much of the installation or configuration details of Ollama or Xinference; its focus is on configurations inside RAGFlow. For the most current information, you may need to check out the official site of Ollama or Xinference.
Deploy a local model using Ollama
Ollama enables you to run open-source large language models on your own machine. It bundles model weights, configurations, and data into a single package, defined by a Modelfile, and optimizes setup and configurations, including GPU usage.
- For information about downloading Ollama, see here.
- For information about configuring Ollama server, see here.
- For a complete list of supported models and variants, see the Ollama model library.
To deploy a local model, e.g., Llama3, using Ollama:
1. Check firewall settings
Ensure that your host machine's firewall allows inbound connections on port 11434. For example:
sudo ufw allow 11434/tcp
2. Ensure Ollama is accessible
Restart your system, then use curl or your web browser to check whether your Ollama service is accessible at http://localhost:11434. A running service responds with:
Ollama is running
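The same check can be scripted. The following is a minimal sketch using only the Python standard library; the function names are illustrative and not part of Ollama or RAGFlow, and it assumes the service answers on its root URL with the banner shown above.

```python
import urllib.error
import urllib.request


def is_ollama_banner(body: str) -> bool:
    """Return True if the response body contains the Ollama banner text."""
    return "Ollama is running" in body


def ollama_is_running(base_url: str = "http://localhost:11434",
                      timeout: float = 5.0) -> bool:
    """Return True if base_url answers with the Ollama banner."""
    try:
        with urllib.request.urlopen(base_url, timeout=timeout) as resp:
            body = resp.read().decode("utf-8", errors="replace")
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, and similar network errors.
        return False
    return is_ollama_banner(body)


if __name__ == "__main__":
    print("reachable" if ollama_is_running() else "not reachable")
```

If the function returns False, recheck the firewall rule from step 1 and confirm that the Ollama service has started.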
3. Run your local model
ollama run llama3
If your Ollama is installed through Docker, run the following instead:
docker exec -it ollama ollama run llama3
4. Add Ollama
In RAGFlow, click your logo in the top right of the page > Model Providers, and add Ollama to RAGFlow.
5. Complete basic Ollama settings
In the popup window, complete basic settings for Ollama:
- Because llama3 is a chat model, choose chat as the model type.
- Ensure that the model name you enter here precisely matches the name of the local model you are running with Ollama.
- Ensure that the base URL you enter is accessible to RAGFlow.
- OPTIONAL: Switch on the toggle under Does it support Vision? if your model includes an image-to-text model.
- If your Ollama and RAGFlow run on the same machine, use http://localhost:11434 as base URL.
- If your Ollama and RAGFlow run on the same machine and Ollama is in Docker, use http://host.docker.internal:11434 as base URL.
- If your Ollama runs on a different machine from RAGFlow, use http://<IP_OF_OLLAMA_MACHINE>:11434 as base URL.
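The three base URL cases can be expressed as a small helper. This is an illustrative sketch only: the function and parameter names are hypothetical, and it simply encodes the rules listed above.

```python
from typing import Optional

OLLAMA_PORT = 11434  # Default Ollama port used throughout this guide.


def ollama_base_url(same_machine: bool,
                    ollama_in_docker: bool = False,
                    ollama_ip: Optional[str] = None) -> str:
    """Pick the base URL to enter in RAGFlow for a given deployment layout."""
    if same_machine:
        # Same host: use the Docker host alias when Ollama runs in Docker.
        host = "host.docker.internal" if ollama_in_docker else "localhost"
    else:
        if not ollama_ip:
            raise ValueError("ollama_ip is required when Ollama runs on a different machine")
        host = ollama_ip
    return f"http://{host}:{OLLAMA_PORT}"


if __name__ == "__main__":
    print(ollama_base_url(same_machine=True))
    print(ollama_base_url(same_machine=True, ollama_in_docker=True))
    print(ollama_base_url(same_machine=False, ollama_ip="192.0.2.10"))  # Example IP.
```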
If your Ollama runs on a different machine, you may also need to set the OLLAMA_HOST environment variable to 0.0.0.0 in ollama.service (note that this is NOT the base URL):
Environment="OLLAMA_HOST=0.0.0.0"
See this guide for more information.
Improper base URL settings will trigger the following error:
Max retries exceeded with url: /api/chat (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0xffff98b81ff0>: Failed to establish a new connection: [Errno 111] Connection refused'))
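To catch this error before saving the settings, you can probe the base URL from the machine or container that runs RAGFlow and compare the model name against what Ollama reports. The sketch below assumes Ollama's /api/tags endpoint, which lists local models under a "models" key; the helper names are illustrative. Treating a bare name as its :latest tag follows Ollama's default tagging and is an assumption here, not a RAGFlow rule.

```python
import json
import urllib.request


def parse_model_names(tags_payload: dict) -> list:
    """Extract model names from an /api/tags response body."""
    return [m["name"] for m in tags_payload.get("models", [])]


def list_ollama_models(base_url: str, timeout: float = 5.0) -> list:
    """Fetch the names of the models available on the Ollama server."""
    url = f"{base_url.rstrip('/')}/api/tags"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return parse_model_names(json.load(resp))


def model_is_served(wanted: str, names: list) -> bool:
    """Check a model name, treating 'llama3' as 'llama3:latest' (assumption)."""
    candidates = {wanted} if ":" in wanted else {wanted, f"{wanted}:latest"}
    return any(name in candidates for name in names)


if __name__ == "__main__":
    base = "http://localhost:11434"
    try:
        names = list_ollama_models(base)
        print("llama3 available:", model_is_served("llama3", names))
    except (OSError, ValueError) as exc:
        # A refused connection here corresponds to the error shown above.
        print(f"Cannot use {base}: {exc}")
```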
6. Update System Model Settings
Click on your logo > Model Providers > System Model Settings to update your model.
You should now be able to find llama3 in the dropdown list under Chat model.
If your local model is an embedding model, you should find your local model under Embedding model.
7. Update Chat Configuration
Update your chat model accordingly in Chat Configuration.
If your local model is an embedding model, update it on the configuration page of your knowledge base.
Deploy a local model using Xinference
Xorbits Inference (Xinference) is a framework for serving AI models, including chat, embedding, and rerank models, on your own hardware.
- For information about installing Xinference, see here.
- For a complete list of supported models, see the Builtin Models.
To deploy a local model, e.g., Mistral, using Xinference:
1. Check firewall settings
Ensure that your host machine's firewall allows inbound connections on port 9997.
2. Start an Xinference instance
$ xinference-local --host 0.0.0.0 --port 9997
3. Launch your local model
Launch your local model (Mistral), ensuring that you replace ${quantization} with your chosen quantization method:
$ xinference launch -u mistral --model-name mistral-v0.1 --size-in-billions 7 --model-format pytorch --quantization ${quantization}
4. Add Xinference
In RAGFlow, click your logo in the top right of the page > Model Providers, and add Xinference to RAGFlow.
5. Complete basic Xinference settings
Enter an accessible base URL, such as http://<your-xinference-endpoint-domain>:9997/v1.
For a rerank model, use http://<your-xinference-endpoint-domain>:9997/v1/rerank as the base URL.
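As a rough illustration, the two base URL forms can be built and checked as follows. The helper names are hypothetical. The sketch also assumes that the Xinference server lists running models at /v1/models in an OpenAI-style response with a "data" array, so verify this against the Xinference documentation for your version.

```python
import json
import urllib.request


def xinference_base_url(host: str, port: int = 9997, rerank: bool = False) -> str:
    """Build the base URL to enter in RAGFlow for Xinference."""
    url = f"http://{host}:{port}/v1"
    # Rerank models use the /v1/rerank path, as described above.
    return f"{url}/rerank" if rerank else url


def parse_model_ids(models_payload: dict) -> list:
    """Extract model IDs from an OpenAI-style model list (assumed format)."""
    return [m["id"] for m in models_payload.get("data", [])]


def list_running_models(host: str, port: int = 9997, timeout: float = 5.0) -> list:
    """Query the server for the models it is currently serving."""
    url = f"{xinference_base_url(host, port)}/models"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return parse_model_ids(json.load(resp))


if __name__ == "__main__":
    print(xinference_base_url("localhost"))
    print(xinference_base_url("localhost", rerank=True))
```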
6. Update System Model Settings
Click on your logo > Model Providers > System Model Settings to update your model.
You should now be able to find mistral in the dropdown list under Chat model.
If your local model is an embedding model, you should find your local model under Embedding model.
7. Update Chat Configuration
Update your chat model accordingly in Chat Configuration.
If your local model is an embedding model, update it on the configuration page of your knowledge base.
Deploy a local model using IPEX-LLM
IPEX-LLM is a PyTorch library for running LLMs on local Intel CPUs or GPUs (including iGPU or discrete GPUs like Arc, Flex, and Max) with low latency. It supports Ollama on Linux and Windows systems.
To deploy a local model, e.g., Qwen2, using IPEX-LLM-accelerated Ollama:
1. Check firewall settings
Ensure that your host machine's firewall allows inbound connections on port 11434. For example:
sudo ufw allow 11434/tcp
2. Launch Ollama service using IPEX-LLM
2.1 Install IPEX-LLM for Ollama
IPEX-LLM supports Ollama on Linux and Windows systems.
For detailed information about installing IPEX-LLM for Ollama, see the Run llama.cpp with IPEX-LLM on Intel GPU Guide.
After the installation, you should have created a Conda environment, e.g., llm-cpp, for running Ollama commands with IPEX-LLM.
2.2 Initialize Ollama
- Activate the llm-cpp Conda environment and initialize Ollama:
  Linux:
  conda activate llm-cpp
  init-ollama
  Windows (run these commands with administrator privileges in Miniforge Prompt):
  conda activate llm-cpp
  init-ollama.bat
- If the installed ipex-llm[cpp] requires an upgrade to the Ollama binary files, remove the old binary files and reinitialize Ollama using init-ollama (Linux) or init-ollama.bat (Windows).
A symbolic link to Ollama appears in your current directory, and you can use this executable file following standard Ollama commands.
2.3 Launch Ollama service
- Set the environment variable OLLAMA_NUM_GPU to 999 to ensure that all layers of your model run on the Intel GPU; otherwise, some layers may default to CPU.
- For optimal performance on Intel Arc™ A-Series Graphics with Linux OS (Kernel 6.2), set the following environment variable before launching the Ollama service:
  export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
- Launch the Ollama service:
  Linux:
  export OLLAMA_NUM_GPU=999
  export no_proxy=localhost,127.0.0.1
  export ZES_ENABLE_SYSMAN=1
  source /opt/intel/oneapi/setvars.sh
  export SYCL_CACHE_PERSISTENT=1
  ./ollama serve
  Windows (run the following commands in Miniforge Prompt):
  set OLLAMA_NUM_GPU=999
  set no_proxy=localhost,127.0.0.1
  set ZES_ENABLE_SYSMAN=1
  set SYCL_CACHE_PERSISTENT=1
  ollama serve
To enable the Ollama service to accept connections from all IP addresses, use OLLAMA_HOST=0.0.0.0 ./ollama serve rather than simply ./ollama serve.
The console displays messages similar to the following:
3. Pull and Run Ollama model
3.1 Pull Ollama model
With the Ollama service running, open a new terminal and run ./ollama pull <model_name> (Linux) or ollama.exe pull <model_name> (Windows) to pull the desired model, e.g., qwen2:latest.
3.2 Run Ollama model
- Linux:
  ./ollama run qwen2:latest
- Windows:
  ollama run qwen2:latest
4. Configure RAGFlow
To enable IPEX-LLM accelerated Ollama in RAGFlow, you must also complete the configurations in RAGFlow. The steps are identical to steps 4 through 7 of the Deploy a local model using Ollama section.
Deploy a local model using jina
To deploy a local model, e.g., gpt2, using jina:
1. Check firewall settings
Ensure that your host machine's firewall allows inbound connections on port 12345.
sudo ufw allow 12345/tcp
2. Install jina package
pip install jina
3. Deploy a local model
Step 1: Navigate to the rag/svr directory.
cd rag/svr
Step 2: Run jina_server.py, specifying either the model's name or its local directory:
python jina_server.py --model_name gpt2
The script only supports models downloaded from Hugging Face.
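Once the script is running, a quick TCP probe can confirm that the port opened in step 1 accepts connections. This is a minimal sketch with an illustrative helper name. It assumes the server listens on port 12345, as the firewall step implies, and it checks only connectivity, not the server's API.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timeout, or unreachable host.
        return False


if __name__ == "__main__":
    # Port 12345 is taken from the firewall step above.
    print("port open" if port_is_open("localhost", 12345) else "port closed")
```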