Deploying SBERT In Production Using TorchServe

This is a guide on how to deploy a pre-trained HuggingFace sentence-transformers model in production using TorchServe, Docker, and Openshift.



In order to deploy any PyTorch model we need TorchServe. It caters to all the requirements of deploying a model in production while providing scalability and flexibility.
For deploying TensorFlow models, use TensorFlow Serving instead; check out my last article on it.

Here I am using the bert-base-nli-mean-tokens model from sentence-transformers. It maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for tasks like clustering or semantic search.

This is the general way of using it locally:
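A minimal sketch of that local usage (the embedded snippet did not carry over; sentence and model names here are just illustrative, and the model is imported lazily so the cosine helper can be reused on its own):

```python
import numpy as np

def embed(sentences, model_name="bert-base-nli-mean-tokens"):
    """Encode a list of sentences into 768-dim vectors."""
    # imported lazily: only needed when actually encoding
    from sentence_transformers import SentenceTransformer
    model = SentenceTransformer(model_name)
    return model.encode(sentences)

def cosine(a, b):
    """Cosine similarity between two vectors."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Example:
#   vecs = embed(["A man is eating food.",
#                 "A man is eating a piece of bread."])
#   print(cosine(vecs[0], vecs[1]))
```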

The idea here is to generate sentence embeddings and then calculate cosine similarity for downstream NLP tasks. In production we will create a service that will accept sentences as input and send the corresponding vectors as output.

Docker and Openshift are used to containerize and deploy our model on a cloud platform.

Let’s Begin…

The high-level steps are: create a .mar file from the model, write a Python handler file describing how to handle the model (preprocessing, post-processing, etc. as needed), then attach it to the TorchServe Docker image along with any dependencies and deploy. This is covered in detail below.

1. Save the model files in a directory as shown below.
directory structure
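Roughly, a saved sentence-transformers model directory looks something like this (a sketch only; the exact files vary by model and library version):

```
model/
├── 1_Pooling/
├── config.json
├── modules.json
├── sentence_bert_config.json
├── pytorch_model.bin
├── vocab.txt
└── ...
```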

2. Zip the contents of this directory as pytorch_model.bin

zip -r model/pytorch_model.bin .

3. Install torch-model-archiver.

pip install torch-model-archiver

A key feature of TorchServe is the ability to package all model artifacts into a single model archive file. The CLI creates a .mar file that TorchServe's server CLI uses to serve the models.

4. Create handler files.

The handler can be the name of one of TorchServe’s inbuilt handlers or the path to a .py file containing custom TorchServe inference logic. TorchServe supports the following inbuilt handlers: image_classifier, object_detector, text_classifier, image_segmenter.

Here we will be tweaking the inbuilt Python handler file for handling model inference, so that it accepts an array of sentences and returns an array of vectors.

from sentence_transformers import SentenceTransformer
import json
import zipfile
from json import JSONEncoder
import numpy as np
import os


class NumpyArrayEncoder(JSONEncoder):
    """JSON encoder that serializes numpy arrays as plain lists."""
    def default(self, obj):
        if isinstance(obj, np.ndarray):
            return obj.tolist()
        return JSONEncoder.default(self, obj)


class SentenceTransformerHandler(object):
    def __init__(self):
        super(SentenceTransformerHandler, self).__init__()
        self.initialized = False
        self.embedder = None

    def initialize(self, context):
        properties = context.system_properties
        model_dir = properties.get("model_dir")

        # unpack the model files that were zipped into pytorch_model.bin in step 2
        with zipfile.ZipFile(os.path.join(model_dir, 'pytorch_model.bin'), 'r') as zip_ref:
            zip_ref.extractall(model_dir)

        self.embedder = SentenceTransformer(model_dir)
        self.initialized = True

    def preprocess(self, data):
        inputs = data[0].get("data")
        if inputs is None:
            inputs = data[0].get("body")
        inputs = inputs.decode('utf-8')
        inputs = json.loads(inputs)
        sentences = inputs['queries']
        return sentences

    def inference(self, data):
        query_embeddings = self.embedder.encode(data)
        return query_embeddings

    def postprocess(self, data):
        return [json.dumps(data, cls=NumpyArrayEncoder)]


_service = SentenceTransformerHandler()


def handle(data, context):
    """Entry point for the SentenceTransformerHandler handler."""
    try:
        if not _service.initialized:
            _service.initialize(context)
        if data is None:
            return None
        data = _service.preprocess(data)
        data = _service.inference(data)
        data = _service.postprocess(data)
        return data
    except Exception as e:
        raise Exception("Unable to process input data. " + str(e))

5. Create .mar file by running below command on terminal.

torch-model-archiver --model-name sentence_Transformer_BERT --version 1.0 --serialized-file pytorch_model.bin --handler handler.py --extra-files <extra-files> --export-path <output-path> --runtime python3 -f

On successful execution, a sentence_Transformer_BERT.mar file is created. To understand more about the usage of the arguments, check the official repo.

6. Create a dockerfile packaging all the contents required to serve the model and start the server.

# dockerfile
# pull the latest torchserve image
FROM pytorch/torchserve
USER root
RUN chmod 777 -R .
# copy python dependencies
COPY requirements.txt .
RUN python -m pip install --upgrade pip
RUN pip install -r requirements.txt
# copy the .mar file created in the previous step
COPY sentence_Transformer_BERT.mar model-store/
# replace the existing config.properties with the custom one, if needed
# start the server with the model named SBERT
CMD ["torchserve", "--start", "--model-store", "model-store", "--models", "SBERT=sentence_Transformer_BERT.mar"]

For any Python dependency, add the library to the requirements.txt file:
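The embedded file did not carry over; as a sketch, the handler above needs at minimum the sentence-transformers package (which pulls in torch, transformers, and numpy), so requirements.txt would look roughly like:

```
# requirements.txt -- minimal sketch; pin versions as appropriate
sentence-transformers
numpy
```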


For any configuration changes, use the config.properties file and refer to the official repo for usage.
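A minimal config.properties sketch using TorchServe's standard keys for binding the three APIs (the addresses here are illustrative defaults):

```
# config.properties (illustrative values)
inference_address=http://0.0.0.0:8080
management_address=http://0.0.0.0:8081
metrics_address=http://0.0.0.0:8082
```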


The files mentioned above are created based on the model's requirements and vary accordingly.

7. Build the docker image using below command.

docker build -t ptserve-sbert:v1 -f dockerfile .

8. Run the container locally.

docker run --rm -it -p 3000:8080 ptserve-sbert:v1

Along with the inference port, we can also expose and port-forward the management (8081) and metrics (8082) ports locally. This application is now ready to handle requests on the local port (3000) forwarded from the container's inference port (8080).


The Inference API is available at localhost:3000; to check the server status, hit the health-check endpoint:

curl http://localhost:3000/ping

To check the list of available endpoints, try this:
curl -X OPTIONS http://localhost:3000

For model inference, the client-side script would look like this:

API request
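A sketch of such a client using only the standard library plus numpy (the endpoint name SBERT and port 3000 come from the docker run and CMD above; the "queries" payload key is what the handler's preprocess expects):

```python
import json
import urllib.request
import numpy as np

SERVICE_URL = "http://localhost:3000/predictions/SBERT"

def get_embeddings(sentences, url=SERVICE_URL):
    """POST a list of sentences to the service; returns a list of 768-dim vectors."""
    payload = json.dumps({"queries": sentences}).encode("utf-8")
    req = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))

def cosine_similarity(a, b):
    """Cosine similarity between two vectors."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Example (with the container running):
#   vecs = get_embeddings(["A man is eating food.",
#                          "A man is eating a piece of bread."])
#   print(cosine_similarity(vecs[0], vecs[1]))
```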

The similarity score obtained through this service running locally on Docker is exactly the same as the one computed locally, and the image can now be deployed to any cloud platform like Openshift, Azure, AWS, GCP, etc.

9. Push the docker image to private/public docker registry for Openshift to access.

docker push ptserve-sbert:v1

10. Using the Openshift CLI, create a pod and expose a route to this service.

oc new-app ptserve-sbert:v1 --name ptserve-sbert
oc expose svc/ptserve-sbert

This automatically creates a URL to access the model running in the cloud, and the endpoints can be tested in the same fashion as above.

And we are done! :)

Refer to the official documentation of TorchServe for other functionality, like batch inference, deploying multiple versions of the same model, and other advanced features, to deploy any PyTorch-based model. Comment below for any doubts, and give a clap if this article was useful.

Lastly, this article would not have been possible without these references; check them out.
