

vLLM is a fast and easy-to-use library for LLM inference and serving, offering:

  • State-of-the-art serving throughput
  • Efficient management of attention key and value memory with PagedAttention
  • Continuous batching of incoming requests
  • Optimized CUDA kernels

This notebook goes over how to use an LLM with LangChain and vLLM.

To use it, you need the vllm Python package installed.

%pip install --upgrade --quiet vllm

from langchain_community.llms import VLLM

llm = VLLM(
    model="mosaicml/mpt-7b",
    trust_remote_code=True,  # mandatory for hf models
)

print(llm.invoke("What is the capital of France ?"))
INFO 08-06 11:37:33] Initializing an LLM engine with config: model='mosaicml/mpt-7b', tokenizer='mosaicml/mpt-7b', tokenizer_mode=auto, trust_remote_code=True, dtype=torch.bfloat16, use_dummy_weights=False, download_dir=None, use_np_weights=False, tensor_parallel_size=1, seed=0)
INFO 08-06 11:37:41] # GPU blocks: 861, # CPU blocks: 512
Processed prompts: 100%|██████████| 1/1 [00:00<00:00, 2.00it/s]

What is the capital of France ? The capital of France is Paris.
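
The "# GPU blocks" line in the log above reports how many KV-cache blocks PagedAttention was able to allocate. As a rough sketch, the block counts can be turned into a token capacity. This assumes vLLM's default block size of 16 tokens, which the log does not print and which is configurable; the helper name is hypothetical.

```python
import re

# Assumption: vLLM's default block size of 16 tokens per KV-cache block.
BLOCK_SIZE_TOKENS = 16


def kv_cache_token_capacity(log_line: str, block_size: int = BLOCK_SIZE_TOKENS) -> dict:
    """Parse a '# GPU blocks: N, # CPU blocks: M' log line into token capacities."""
    match = re.search(r"# GPU blocks: (\d+), # CPU blocks: (\d+)", log_line)
    if match is None:
        raise ValueError("log line does not contain block counts")
    gpu_blocks, cpu_blocks = (int(group) for group in match.groups())
    return {
        "gpu_tokens": gpu_blocks * block_size,
        "cpu_tokens": cpu_blocks * block_size,
    }


capacity = kv_cache_token_capacity(
    "INFO 08-06 11:37:41] # GPU blocks: 861, # CPU blocks: 512"
)
print(capacity)
```

Under that assumption, the GPU cache holds 861 × 16 = 13,776 tokens, shared by all requests that are being batched together.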

Integrate the model in an LLMChain

from langchain.chains import LLMChain
from langchain_core.prompts import PromptTemplate

template = """Question: {question}

Answer: Let's think step by step."""
prompt = PromptTemplate.from_template(template)

llm_chain = LLMChain(prompt=prompt, llm=llm)

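question = "Who was the US president in the year the first Pokemon game was released?"

# LLMChain returns a dict; the generated text is stored under the "text" key
print(llm_chain.invoke({"question": question})["text"])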

Processed prompts: 100%|██████████| 1/1 [00:01<00:00,  1.34s/it]

1. The first Pokemon game was released in 1996.
2. The president was Bill Clinton.
3. Clinton was president from 1993 to 2001.
4. The answer is Clinton.
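
To see what the chain actually sends to the model, it can help to render the prompt by hand. The sketch below uses plain str.format rather than LangChain itself. For a simple template with one named placeholder, this should give the same string that the PromptTemplate produces.

```python
template = """Question: {question}

Answer: Let's think step by step."""

question = "Who was the US president in the year the first Pokemon game was released?"

# Fill the {question} placeholder; the result is the full prompt text
# that the model is asked to continue.
rendered_prompt = template.format(question=question)
print(rendered_prompt)
```

The model continues from "Let's think step by step.", which is why the output above is a numbered chain of reasoning rather than a one-word answer.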

Distributed Inference

vLLM supports distributed tensor-parallel inference and serving.

To run multi-GPU inference with the VLLM class, set the tensor_parallel_size argument to the number of GPUs you want to use. For example, to run inference on 4 GPUs:

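from langchain_community.llms import VLLM

llm = VLLM(
    model="mosaicml/mpt-7b",  # assumed: model reused from the first example; substitute your own
    tensor_parallel_size=4,  # number of GPUs to shard the model across
    trust_remote_code=True,  # mandatory for hf models
)

llm.invoke("What is the future of AI?")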
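
Tensor parallelism splits each layer's attention heads across GPUs, so vLLM expects the model's attention head count to be divisible by tensor_parallel_size. The sketch below lists the sizes that satisfy that constraint. The function name and the head count of 32 are hypothetical, so check the real value in your model's configuration.

```python
from typing import List


def valid_tensor_parallel_sizes(num_attention_heads: int, num_gpus: int) -> List[int]:
    """Return the GPU counts (up to num_gpus) that evenly divide the attention heads."""
    return [
        size
        for size in range(1, num_gpus + 1)
        if num_attention_heads % size == 0
    ]


# Hypothetical model with 32 attention heads on a machine with 8 GPUs.
sizes = valid_tensor_parallel_sizes(num_attention_heads=32, num_gpus=8)
print(sizes)
```

Any value from the returned list is a candidate for tensor_parallel_size. Whether the model fits in memory at that size is a separate question.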


vLLM supports AWQ quantization. To enable it, pass quantization in vllm_kwargs.

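llm_q = VLLM(
    model="TheBloke/Llama-2-7b-Chat-AWQ",  # assumed example of an AWQ-quantized checkpoint; substitute your own
    vllm_kwargs={"quantization": "awq"},
)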

OpenAI-Compatible Server

vLLM can be deployed as a server that mimics the OpenAI API protocol. This allows vLLM to be used as a drop-in replacement for applications that use the OpenAI API.

The server can be queried in the same format as the OpenAI API.

OpenAI-Compatible Completion

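from langchain_community.llms import VLLMOpenAI

llm = VLLMOpenAI(
    openai_api_key="EMPTY",  # assumed placeholder; a local vLLM server does not check the key by default
    openai_api_base="http://localhost:8000/v1",  # assumed address of a locally running vLLM server
    model_name="tiiuae/falcon-7b",  # assumed: must match the model the server is serving
    model_kwargs={"stop": ["."]},
)
print(llm.invoke("Rome is"))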
 a city that is filled with history, ancient buildings, and art around every corner
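
Because the server follows the OpenAI protocol, it can also be queried without LangChain. The sketch below builds a raw completions request using only the standard library. The server address and model name are assumptions (a server on localhost port 8000 and a placeholder model name), and the helper names are hypothetical. The request is built and printed here but not sent.

```python
import json
import urllib.request


def build_completion_request(base_url, model, prompt, stop=None, max_tokens=32):
    """Build (but do not send) a POST request for the /completions endpoint."""
    payload = {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    if stop:
        payload["stop"] = stop
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request):
    """Send the request and return the first completion's text (needs a running server)."""
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["text"]


# Assumed values: a local server address and a placeholder model name.
request = build_completion_request(
    base_url="http://localhost:8000/v1",
    model="my-served-model",
    prompt="Rome is",
    stop=["."],
)
print(request.full_url)
print(request.data.decode("utf-8"))
```

With a server running, calling send(request) would return the generated text. The stop list plays the same role as the model_kwargs entry in the VLLMOpenAI example.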
