LLMOps and Model Tracking with MLFlow with Open AI

By: Ron L'Esteve   |   Updated: 2023-11-16


As the tech industry voraciously develops and fine-tunes Large Language Models (LLMs), there comes the need to operationalize the management, maintenance, monitoring, optimization, comparison, and deployment of these LLMs as they work their way into production environments. With LLMOps, Gen AI enthusiasts will have the necessary infrastructure and tools to build and deploy LLMs easily, and the risks, challenges, and time to market for Gen AI LLMs can be reduced.

MLFlow is an open-source platform for managing artifacts and workflows within the ML and AI lifecycle. It brings rich integrations with ML Libraries. MLFlow provides tools for tracking LLMOps experiments, packaging code, and deploying models to production. It brings a central model registry for simplifying the management and sharing of model versions. How can developers get started using LLMOps and Model Tracking with MLFlow?


Since MLFlow is an open-source platform, it provides the flexibility to be platform agnostic when selecting a technology to use for your LLMOps needs. As a mature Unified Analytics Platform, Databricks brings deep AI and ML capabilities within its platform; as such, we will use Databricks for this hands-on demonstration. This tip intends to demonstrate the following:

  • Create ML Runtime Compute.
  • Import MLFlow and Open AI LLMs with APIs.
  • Create a question-answering model using prompt engineering with OpenAI and log the model to MLFlow.
  • Define system prompts and sample questions.
  • Build and evaluate the model using multiple models (GPT-3.5 turbo and GPT-4) and multiple prompts.
  • Score the best model on a new question and display results.

Create ML Runtime Compute

As a prerequisite step, you must create ML Compute to run your code. At the time of this article, 13.3 LTS ML is the latest runtime version; it includes popular ML libraries like TensorFlow, PyTorch, and XGBoost. It also comes with AutoML, Feature Stores, GPU clusters, and more.
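
Once the cluster is running, a quick check from a notebook cell can confirm that the libraries this tip relies on are importable. The following is a minimal sketch and not part of the original walkthrough; the helper name `missing_packages` is hypothetical, and the package list is an assumption based on the imports used later in this tip.

```python
import importlib.util


def missing_packages(names):
    """Return the top-level packages from `names` that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Packages imported later in this tip (assumed list; adjust as needed)
required = ["mlflow", "openai", "pandas"]
print("Missing packages:", missing_packages(required))
```

An empty list means all three packages are available on the cluster.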

Figure: Create ML Compute

Create and Log Model

After creating the ML Compute resources, navigate to a new Databricks notebook and use the following Python code to build and evaluate a question-answering model. It uses the Open AI API and MLFlow. It starts by importing the necessary libraries for interacting with the Open AI API, pandas for data manipulation, and MLFlow for experiment tracking and model management.

The code defines a function called build_and_evaluate_model that takes a system prompt, a model name, a task type, and example questions as parameters. Within this function, it starts an MLFlow run and logs the system prompt and model name as parameters. It then creates a question-answering model using prompt engineering with OpenAI and logs it to MLFlow Tracking. The model is evaluated on the example questions, with the evaluation results stored in the generated variable. Finally, the MLFlow run ends. This code is helpful for building and evaluating question-answering models on top of OpenAI's GPT-3.5 and GPT-4 models and for tracking the experiments with MLFlow.

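import openai
import pandas as pd
import mlflow

# Placeholder: replace with your own key (ideally read from a secret store, not hardcoded)
openai.api_key = "API KEY"

def build_and_evaluate_model(system_prompt, model_name, task_type, example_questions):
    with mlflow.start_run(run_name="model_" + model_name):
        mlflow.log_param("system_prompt", system_prompt)
        mlflow.log_param("model_name", model_name)
        # Create a question answering model using prompt engineering with OpenAI. Log the model
        # to MLflow Tracking
        # NOTE: the keyword arguments below were missing from the original listing and are
        # reconstructed as assumptions (task_type is assumed to be passed as `task`)
        logged_model = mlflow.openai.log_model(
            model=model_name,
            task=task_type,
            artifact_path="model",
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": "{question}"},
            ],
        )
        # Evaluate the model on some example questions
        questions = pd.DataFrame({"question": example_questions})
        # NOTE: arguments reconstructed; "question-answering" is an assumed model_type
        generated = mlflow.evaluate(
            model=logged_model.model_uri,
            model_type="question-answering",
            data=questions,
        )
    # The run ends automatically when the `with` block exits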
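
The `{question}` placeholder in the user message is what lets one logged model answer many questions: the idea is that each row's `question` value is substituted into the template before the request is sent. The sketch below illustrates that substitution in plain Python, without calling MLFlow or OpenAI. `render_messages` is a hypothetical helper written for illustration and is not the library's internal implementation.

```python
def render_messages(system_prompt, question):
    """Fill the {question} placeholder in a chat-style message template."""
    template = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": "{question}"},
    ]
    rendered = []
    for message in template:
        content = message["content"]
        # Only the user message carries the placeholder in this sketch
        if message["role"] == "user":
            content = content.format(question=question)
        rendered.append({"role": message["role"], "content": content})
    return rendered


messages = render_messages(
    "Your job is to answer questions about Azure & Databricks.",
    "Does Azure OpenAI support VNETs and Private Endpoints?",
)
for message in messages:
    print(message["role"], "->", message["content"])
```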

Define System Prompts and Sample Questions

This next block of Python code uses the build_and_evaluate_model function created in the previous step to build and evaluate the 'GPT-4' and 'GPT-3.5-turbo' models for answering questions about Azure and Databricks with a focus on Data and AI topics. The function is called four times, once for each combination of the two models and the two system prompts. Remember that a system prompt guides the model's behavior, while the example questions are used to evaluate the model's performance.

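# Define the System Prompts and Sample Questions
system_prompt_1 = "Your job is to answer questions about Azure & Databricks with a focus on Data and AI topics."
system_prompt_2 = "Your job is to answer questions about Azure & Databricks with a focus on Data and AI topics. When you are asked a question about Azure or Databricks, respond to it. Make sure to include code examples. If the question is not related to Azure or Databricks, refuse to answer and say that the question is unrelated."
example_questions = [
    "Does Azure OpenAI use my company data to train any of the models?",
    "Does Azure OpenAI support VNETs and Private Endpoints?",
    "How do Data Lakehouse systems compare in performance and cost to data warehouses?",
    "How can I customize my published web app in Azure OpenAI?",
]
# NOTE: the four calls below were missing from the original listing and are reconstructed.
# Assumption: task_type is the OpenAI chat completion task (openai.ChatCompletion in openai<1.0)
task_type = openai.ChatCompletion
# Build and evaluate model using GPT-4 & System Prompt 1
build_and_evaluate_model(system_prompt_1, "gpt-4", task_type, example_questions)
# Build and evaluate model using GPT-3.5 turbo & System Prompt 1
build_and_evaluate_model(system_prompt_1, "gpt-3.5-turbo", task_type, example_questions)
# Build and evaluate model using GPT-4 & System Prompt 2
build_and_evaluate_model(system_prompt_2, "gpt-4", task_type, example_questions)
# Build and evaluate model using GPT-3.5 turbo & System Prompt 2
build_and_evaluate_model(system_prompt_2, "gpt-3.5-turbo", task_type, example_questions)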
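
With more models or prompts, writing one call per combination gets repetitive. A minimal sketch of enumerating the combinations with `itertools.product` follows. The `run_grid` helper, the prompt labels, and the stand-in `fake_build` callable are hypothetical and are used here so the example runs without an API key; in a notebook you would pass build_and_evaluate_model instead.

```python
from itertools import product


def run_grid(build_fn, system_prompts, model_names, task_type, questions):
    """Call build_fn once per (system prompt, model) pair and collect results."""
    results = {}
    for (prompt_label, prompt), model_name in product(system_prompts.items(), model_names):
        results[(model_name, prompt_label)] = build_fn(prompt, model_name, task_type, questions)
    return results


def fake_build(system_prompt, model_name, task_type, example_questions):
    # Stand-in for build_and_evaluate_model: reports what it would have run
    return f"{model_name} on {len(example_questions)} question(s)"


# Shortened prompts with hypothetical labels, for illustration only
prompts = {
    "prompt_1": "Answer questions about Azure & Databricks.",
    "prompt_2": "Answer only Azure or Databricks questions, with code examples.",
}
grid = run_grid(fake_build, prompts, ["gpt-4", "gpt-3.5-turbo"], None, ["What is a Lakehouse?"])
for key, value in grid.items():
    print(key, "->", value)
```

Keying the results by model name and prompt label also records which prompt produced which output.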

Once the code finishes running on your ML Compute, notice that there will be four new runs logged to an experiment in MLFlow. Click '4 runs' to open the experiments and to view more details with MLFlow.

Figure: Click '4 runs' to open the runs

Notice the four runs that were captured and logged.

Figure: View of models logged in MLFlow

You can drill into each run to view additional details around parameters, metrics, artifacts, code snippets, and more. Furthermore, you can register the model to version control it and deploy it as a REST endpoint for real-time serving.

Figure: Generation of model artifacts

You also have access to the Evaluation tab, which lets you compare the output for your questions across the various models. This view does not make it obvious which system prompt was used with each model, so you could refine the build_and_evaluate_model function created earlier to surface that if needed. You could also use LangChain prompt templates to further customize the format of your prompts and sample questions. The Evaluation view is useful for comparing outputs across several model and prompt combinations.

Figure: Evaluate the models
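
One way to make the prompt visible in the run list is to fold a short prompt label into the run name. The sketch below is one possible refinement, under the assumption that each system prompt is given a label such as `prompt_1`; `make_run_name` is a hypothetical helper, and its result would be passed as `run_name` to `mlflow.start_run`.

```python
import re


def make_run_name(model_name, prompt_label):
    """Build a run name that identifies both the model and the system prompt."""
    raw = f"model_{model_name}_{prompt_label}"
    # Replace anything other than letters, digits, dot, underscore, or hyphen
    return re.sub(r"[^A-Za-z0-9._-]+", "_", raw)


print(make_run_name("gpt-4", "prompt_1"))
print(make_run_name("gpt-3.5-turbo", "prompt 2"))
```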

You could also score the best model on a new question. The Python code provided uses the MLFlow library to load a previously logged model and make a prediction. It sets a question, loads the model from the last active run in MLFlow, and uses this model to predict an answer to the question.

# Score the best model on a new question
new_question = "What distinguishes Azure Databricks from Azure Synapse Analytics and Microsoft Fabric?"
best_model = mlflow.pyfunc.load_model(f"runs:/{mlflow.last_active_run().info.run_id}/model")
response = best_model.predict(new_question)
display(f"response: {response}")

Here's the resulting response:

Figure: Step to score the best model on a new question.
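
Note that `mlflow.last_active_run()` returns the most recent run, which is not necessarily the best one. A minimal sketch of choosing a run explicitly and building its model URI follows. The run summaries, the `score` field, and the `best_model_uri` helper are all hypothetical placeholders; in practice the values would come from whatever comparison you make in the Evaluation view.

```python
def best_model_uri(run_summaries, metric="score"):
    """Return the runs:/ URI for the run with the highest value of `metric`."""
    if not run_summaries:
        raise ValueError("no runs to choose from")
    best = max(run_summaries, key=lambda run: run[metric])
    return f"runs:/{best['run_id']}/model"


# Hypothetical run IDs and scores, for illustration only
runs = [
    {"run_id": "aaa111", "model_name": "gpt-4", "score": 0.9},
    {"run_id": "bbb222", "model_name": "gpt-3.5-turbo", "score": 0.7},
]
print(best_model_uri(runs))
```

The resulting URI can be passed to `mlflow.pyfunc.load_model` in place of the URI built from the last active run.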


This article demonstrates how easy it is to get started with LLMOps with MLFlow for Model Tracking. It describes how to begin creating your custom models from base models available through the Open AI API. By logging, tracking, and evaluating your model in MLFlow, you can reap the benefits of Model Registries in Workspace or Unity Catalog modes, Feature Stores, and real-time model serving capabilities. While the new Evaluation (Preview) feature within MLFlow is useful, I'd be interested to see its model metrics become more detailed to compare the details in tabular format to determine the best model to move to production. Furthermore, I'd be interested in seeing better integrations in MLFlow for visually viewing LLM vulnerability detections. Overall, this is a great way to begin a journey with LLMOps using Databricks and MLFlow.


About the author
Ron L'Esteve is a trusted information technology thought leader and professional author residing in Illinois. He brings over 20 years of IT experience and is well-known for his impactful books and article publications on Data & AI Architecture, Engineering, and Cloud Leadership. Ron completed his Master's in Business Administration and Finance from Loyola University in Chicago. Ron brings deep tec

