Fine Tuning an IBM Granite LLM with OpenShift AI & IBM watsonx
Using OpenShift AI to fine tune IBM's Granite 3.0 with watsonx.data and watsonx.governance
Overview
This post covers how IBM watsonx, Red Hat OpenShift AI and IBM Granite can be used to create an LLM fine-tuning pipeline with enterprise governance. The aim is to create a workflow which allows easy creation of fine-tuned LLMs based on data from disparate sources, along with a way to monitor their creation and lifecycle.
This is based on a recent project I worked on alongside James Hope and Joseph Vullo in IBM’s Client Engineering.
Architecture
| Component | What do we use it for? | Deployment Method |
|---|---|---|
| OpenShift | Red Hat’s Kubernetes platform. This runs OpenShift AI, and can also run watsonx.data & watsonx.governance | IBM Cloud ROKS (SaaS) |
| OpenShift AI | Red Hat’s platform for AI development on OpenShift. It contains features such as Jupyter notebooks, pipelines with Kubeflow and model serving | OpenShift on IBM Cloud |
| watsonx.data | An AI data platform which allows you to access data from multiple locations and sources in a consistent manner | IBM Cloud SaaS |
| Kubeflow | A tool for running machine learning pipelines | OpenShift AI |
| watsonx.governance | A platform for managing, monitoring and governing AI models | IBM Cloud SaaS |
| vLLM | A service to host LLMs | OpenShift AI |
| IBM Granite | An LLM from IBM suitable for fine-tuning | vLLM |
Components
OpenShift & OpenShift AI
OpenShift AI was deployed on an OpenShift cluster on IBM Cloud, allowing us to take advantage of the cluster's scalability and add GPU nodes when needed. OpenShift AI itself was installed through its Operator, following the documented installation instructions.
OpenShift AI is a data science platform offering from Red Hat which allows data scientists to create, train and deploy machine learning models including LLMs. It includes features such as Jupyter notebooks, pipelines with Kubeflow and model serving through a variety of services including vLLM and OpenVINO.
watsonx.data
watsonx.data provides access to data from multiple source types, such as CSV files in S3-compatible storage, relational databases and Apache Kafka. For our use case it was only used to access a CSV file from IBM Cloud Object Storage through an SQL interface using the included Presto engine, but the same techniques can be used to access relational data across a wide range of data sources.
watsonx.governance
watsonx.governance enables tracking and governance of all of the fine-tuned models created by the pipeline. This is important so that the lineage of the data used to train each model can be traced, and so that model versions, and the use cases they're deployed for, can be tracked.
The Pipeline
The orchestration tool used for the pipeline is Kubeflow. Included as part of OpenShift AI, it's a pipeline tool with a focus on machine learning workflows. It allows us to break the process down into modular, reusable steps and easily start runs of the pipeline, either manually or on a schedule. A schedule-based workflow could make sense for an LLM application, to make sure your model is tuned on the latest data you'd like for your application. A diagram of our pipeline is shown below:
Above: The LLM fine-tuning pipeline as seen in OpenShift AI
The pipeline contains the following key stages:
- Download the training data from watsonx.data
- Split the data into training and test datasets
- Download the foundation model from HuggingFace
- Fine tune the foundation model using the datasets
- Deploy the model onto vLLM inside OpenShift AI
- Test that the model has correctly deployed on the cluster
- Report the latest version of the model to watsonx.governance
The icons between the named stages represent artifacts that have been created by the pipeline, which are stored in connected S3-compatible storage.
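As a rough illustration of how these stages fit together, here is a minimal Kubeflow Pipelines (KFP v2) sketch; the component bodies are stubs and all names are placeholders rather than the project's actual code:

```python
# Minimal KFP v2 sketch of the pipeline structure; component bodies are stubs
# and all names (components, pipeline, output file) are illustrative placeholders.
from kfp import dsl, compiler


@dsl.component(base_image="python:3.11")
def download_training_data(dataset: dsl.Output[dsl.Dataset]):
    # In the real pipeline this queries watsonx.data via Presto (see later section).
    ...


@dsl.component(base_image="python:3.11")
def split_dataset(dataset: dsl.Input[dsl.Dataset],
                  train: dsl.Output[dsl.Dataset],
                  test: dsl.Output[dsl.Dataset]):
    ...


@dsl.component(base_image="python:3.11")
def download_base_model(base_model: dsl.Output[dsl.Model]):
    # Downloads the Granite foundation model from HuggingFace.
    ...


@dsl.component(base_image="python:3.11")
def fine_tune_model(base_model: dsl.Input[dsl.Model],
                    train: dsl.Input[dsl.Dataset],
                    tuned_model: dsl.Output[dsl.Model]):
    ...


@dsl.component(base_image="python:3.11")
def deploy_model(tuned_model: dsl.Input[dsl.Model]):
    # Patches the InferenceService to point at the new model version.
    ...


@dsl.component(base_image="python:3.11")
def test_deployment():
    ...


@dsl.component(base_image="python:3.11")
def report_to_governance():
    ...


@dsl.pipeline(name="granite-fine-tuning")
def granite_fine_tuning_pipeline():
    data = download_training_data()
    splits = split_dataset(dataset=data.outputs["dataset"])
    base = download_base_model()
    tuned = fine_tune_model(base_model=base.outputs["base_model"],
                            train=splits.outputs["train"])
    deployed = deploy_model(tuned_model=tuned.outputs["tuned_model"])
    tested = test_deployment().after(deployed)
    report_to_governance().after(tested)


if __name__ == "__main__":
    compiler.Compiler().compile(granite_fine_tuning_pipeline, "pipeline.yaml")
```

Compiling the pipeline produces a YAML definition which can then be imported into the pipeline server in OpenShift AI and run manually or on a schedule.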
Loading the data from watsonx.data
watsonx.data allows access to structured data from different sources. To load our Q&A sample dataset in this pipeline, the PrestoDB Python library was used to read the dataset into a dataframe and split it. The connection details are loaded from an OpenShift secret as environment variables when the container for the relevant pipeline step is started. While we do split the data in an 80/20 fashion, this isn't necessarily best practice and was just for the purposes of demonstration.
Code to Load Q&A from watsonx.data
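A minimal sketch of that step, assuming the presto-python-client (prestodb) package; the environment variable names and the catalog, schema and table names are placeholders, supplied in practice by the OpenShift secret:

```python
# Sketch of loading the Q&A dataset from watsonx.data via Presto.
# All connection details and the catalog/schema/table names are placeholders,
# supplied in practice by an OpenShift secret mounted as environment variables.
import os

import pandas as pd
import prestodb
from sklearn.model_selection import train_test_split

conn = prestodb.dbapi.connect(
    host=os.environ["PRESTO_HOST"],
    port=int(os.environ.get("PRESTO_PORT", "443")),
    user=os.environ["PRESTO_USER"],
    catalog=os.environ.get("PRESTO_CATALOG", "cos_bucket"),  # assumed catalog name
    schema=os.environ.get("PRESTO_SCHEMA", "qa_data"),       # assumed schema name
    http_scheme="https",
    auth=prestodb.auth.BasicAuthentication(
        os.environ["PRESTO_USER"], os.environ["PRESTO_PASSWORD"]
    ),
)

# Pull the Q&A records into a dataframe through the SQL interface.
df = pd.read_sql("SELECT question, answer FROM qa_pairs", conn)

# Simple 80/20 train/test split, as described above.
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)

train_df.to_csv("train.csv", index=False)
test_df.to_csv("test.csv", index=False)
```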
Fine Tuning the LLM
To tune the LLM, HuggingFace's SFT (supervised fine-tuning) trainer is used with a generated dataset of typical Q&A questions for a fictional online store called JJP Clothing. This produces a new version of the Granite 3.0 model which has been trained on the specific dataset and should be better at answering questions in that domain. The resulting model can then be hosted in vLLM to answer queries.
Code to fine tune the LLM
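A minimal sketch of the fine-tuning step using TRL's SFTTrainer; the Granite model ID, prompt template and hyperparameters here are assumptions rather than the exact values used in the project:

```python
# Sketch of supervised fine-tuning with HuggingFace TRL's SFTTrainer.
# The Granite model ID, prompt format and hyperparameters are illustrative assumptions.
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

BASE_MODEL = "ibm-granite/granite-3.0-2b-instruct"  # assumed Granite 3.0 variant

# Load the train/test CSVs produced by the data-loading step.
dataset = load_dataset("csv", data_files={"train": "train.csv", "test": "test.csv"})


def format_example(example):
    # Turn each Q&A pair into a single training text; this template is an assumption.
    example["text"] = f"Question: {example['question']}\nAnswer: {example['answer']}"
    return example


dataset = dataset.map(format_example)

training_args = SFTConfig(
    output_dir="granite-jjp-qa",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    learning_rate=2e-5,
    dataset_text_field="text",
)

trainer = SFTTrainer(
    model=BASE_MODEL,  # SFTTrainer loads the model and tokenizer from the ID
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    args=training_args,
)

trainer.train()

# Save the fine-tuned model so it can be uploaded to S3 and served by vLLM.
trainer.save_model("granite-jjp-qa")
```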
Deploying the Fine Tuned Model
With the fine-tuned model built, the next step is to deploy it onto vLLM. Initially we set up the model serving manually, then have the pipeline patch the InferenceService custom resource in OpenShift. In the OpenShift AI interface, create a model deployment, making sure to select a data connection which points to the same S3-compatible storage that your pipeline server is using. The path can be incorrect for now, as the pipeline will patch it at runtime.
The Model Serving view in OpenShift AI
Below is an example of the InferenceService resource which is created when a model is deployed using the UI. This will create an instance of vLLM running on the cluster, loading a model from the selected S3-compatible storage.
Example InferenceService YAML
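A rough example of what that resource looks like; the name, namespace, runtime and storage details below are placeholders rather than the exact values from the project:

```yaml
# Rough shape of the InferenceService created by the OpenShift AI UI.
# Names, namespace, runtime and storage details are illustrative placeholders.
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: granite-jjp-qa
  namespace: llm-pipeline
spec:
  predictor:
    model:
      modelFormat:
        name: vLLM
      runtime: granite-jjp-qa          # ServingRuntime created alongside the deployment
      storage:
        key: aws-connection-pipeline   # data connection (S3 secret) used by the pipeline server
        path: models/granite-jjp-qa    # path within the bucket; patched by the pipeline at runtime
      resources:
        limits:
          nvidia.com/gpu: "1"
        requests:
          nvidia.com/gpu: "1"
```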
With this InferenceService created, the pipeline can patch the resource to point to the new version of the model in the S3-compatible storage and use that at runtime.
Code to Update the Deployed Model
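A minimal sketch of the patch-and-wait step using the Kubernetes Python client; the namespace, resource name and model path are placeholders:

```python
# Sketch of patching the InferenceService to point at the newly trained model
# and waiting for it to become ready. Namespace, name and path are placeholders.
import time

from kubernetes import client, config

config.load_incluster_config()  # the pipeline step runs inside the cluster
api = client.CustomObjectsApi()

GROUP, VERSION, PLURAL = "serving.kserve.io", "v1beta1", "inferenceservices"
NAMESPACE, NAME = "llm-pipeline", "granite-jjp-qa"
NEW_MODEL_PATH = "models/granite-jjp-qa/v2"  # new model version in the S3 bucket

# Point the predictor at the new model version in the S3-compatible storage.
patch = {
    "spec": {
        "predictor": {
            "model": {
                "storage": {"path": NEW_MODEL_PATH},
            }
        }
    }
}
api.patch_namespaced_custom_object(GROUP, VERSION, NAMESPACE, PLURAL, NAME, patch)

# Wait for the InferenceService to report Ready, or give up after 15 minutes.
deadline = time.time() + 15 * 60
while time.time() < deadline:
    isvc = api.get_namespaced_custom_object(GROUP, VERSION, NAMESPACE, PLURAL, NAME)
    conditions = isvc.get("status", {}).get("conditions", [])
    if any(c.get("type") == "Ready" and c.get("status") == "True" for c in conditions):
        print("InferenceService is ready")
        break
    time.sleep(15)
else:
    raise TimeoutError("InferenceService did not become ready within 15 minutes")
```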
The above step will patch the InferenceService resource and wait for the new deployment to become ready, or time out after 15 minutes.
Reporting the model to watsonx.governance
To have visibility of your model versions, training data and metrics, IBM watsonx.governance can be used to keep track of them. It can register external models (those which aren't created as part of watsonx.ai or deployed in Watson Machine Learning), create factsheets to hand to others in your business as a summary, and evaluate the performance of a model at runtime.
The AI use case view allows you to keep track of all your models which are being used for one purpose, in this case the JJP Clothing Q&A.
The model factsheet inside watsonx.governance
The AI use case view inside watsonx.governance
Every time the pipeline runs successfully, a new model is created inside watsonx.governance and attached to the Q&A use case. While the data for this model isn't changing every time we run the pipeline, it's conceivable that in a real deployment it would change, so keeping track of what the model was trained on is important, especially for audit purposes. The code to create a new model in watsonx.governance is included below, using the watsonx.governance Python SDK.
Code to store model data in watsonx.governance
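A rough sketch of that step with the AI Factsheets client (the ibm-aigov-facts-client package); the identifiers and arguments shown here are assumptions, so refer to the SDK documentation for the exact API:

```python
# Rough sketch of recording a new external model version in watsonx.governance
# using the AI Factsheets client (ibm-aigov-facts-client). The identifiers and
# exact arguments are assumptions -- consult the SDK docs for the authoritative API.
import os

from ibm_aigov_facts_client import AIGovFactsClient

facts_client = AIGovFactsClient(
    api_key=os.environ["IBM_CLOUD_API_KEY"],
    experiment_name="granite-jjp-qa",  # assumed experiment name
    external_model=True,               # the model lives outside watsonx.ai / WML
)

# Register the fine-tuned model as an external model asset; each successful
# pipeline run records a new model which is then attached to the Q&A use case.
facts_client.external_model_facts.save_external_model_asset(
    model_identifier="granite-jjp-qa",  # assumed model identifier
    name="Granite 3.0 JJP Clothing Q&A",
    description="Granite 3.0 fine-tuned on the JJP Clothing Q&A dataset",
)
```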
Outcome
We now have a pipeline which loads data from our cross-platform datastore in watsonx.data, trains a model on an OpenShift cluster using HuggingFace's libraries inside OpenShift AI, and keeps track of the models and their history in watsonx.governance. While in this case we trained the model on OpenShift AI, Kubeflow is very flexible, and as it runs Python, the models could be trained by other means, such as through a cloud provider's Python APIs, to avoid the need for a GPU on the cluster.