Even though REMi v0 was the first of its kind, other models of similar size with RAG evaluation capabilities have since been released. One of them is IBM Granite-Guardian 3.1, which can evaluate multiple risks; three of these risks cover answer relevance, context relevance and groundedness in a RAG context.
In our internal testing, the RAG evaluation provided by the Granite-Guardian models does not perform as well as REMi, with scores that depart from human judgment more often than those of our latest REMi model. Let’s illustrate the differences between REMi-v0, Llama-REMi-v1 and Granite-Guardian 3.1 at its largest size (8B) through two example RAG interactions.
The code used to run the evaluation for the Granite-Guardian 3.1 8B model is taken from the official IBM examples here. Since their scores are close to one when the risk is present (i.e. when the answer is not relevant or not grounded), we have inverted them (close to zero when not relevant or not grounded) to make the comparison easier.
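Concretely, this post-processing amounts to a one-line inversion. The snippet below is only a sketch of that step, not IBM’s example code; `granite_risk_probability` is a hypothetical stand-in for the risk probability the official example produces for a given risk.

```python
# Minimal sketch of the score inversion applied to Granite-Guardian outputs.
# `granite_risk_probability` is a hypothetical stand-in for the probability
# returned by IBM's official example code (close to 1 when the risk is present).

def to_comparable_score(granite_risk_probability: float) -> float:
    """Invert the risk probability so that, as with REMi, a score close to 1
    means relevant/grounded and a score close to 0 means the opposite."""
    return 1.0 - granite_risk_probability

# Example: a 0.9 probability that the "context relevance" risk is present
# becomes a ~0.1 relevance score on the inverted scale.
print(to_comparable_score(0.9))  # ~0.1
```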
These interactions have been created just for this article and are not contained in any of our training datasets. We present human evaluation scores as ranges to accommodate the inherent variability in human assessments and ensure flexibility in interpretation.
Example RAG Interaction 1
| Question | “What equipment is essential for indoor bouldering?” |
| Answer | “The essential equipment for bouldering includes climbing shoes, a chalk bag, and, if bouldering outdoors, a bouldering pad. Additional useful items include a piece of carpet or matting to keep shoes clean, a tarp to protect the pad in wet conditions, and a guidebook for navigation.” |
| Contexts | |

Example RAG Interaction 2
| Question | “What are the specs of the OP-1 display?” |
| Answer | “The specs of the OP-1 display are: – AMOLED display running at 60 fps – 320 x 160 pixel resolution – Color Depth: 16.7 M – Contrast: 10000:1 (good for outdoor use) – Viewing Angle: 170° – Life Time: 30,000 h” |
| Contexts | |

In this second evaluation, Llama-REMi-v1 once again proves to be the model most closely aligned with human judgment. It demonstrates a clear advantage across the board, outperforming both REMi-v0 and Granite-Guardian 3.1 8B. The latter exhibits notable weaknesses, particularly when evaluating multilingual contexts, and it fails to assign a high relevance score to a perfect answer presented in list form.
Comparison
When comparing the RAG evaluation of Llama-REMi-v1 with Granite-Guardian 3.1, key differences emerge in the data strategies and the interpretability of results, showcasing the unique strengths of Llama-REMi-v1:
Production-Aligned Data vs. Standard Data:
– Granite-Guardian’s RAG training dataset relies exclusively on synthetic data generated from curated datasets such as HotPotQA, SQuADv2, MNLI, and SNLI. While these datasets are ideal for controlled experimentation, they generally do not accurately reflect the messy, diverse and unstructured nature of real-world data.
– Llama-REMi-v1’s training dataset, although it includes synthetic data, contains context pieces that have been extracted and chunked directly from real documents, including markdown-structured text. This ensures the model has seen data that mirrors the complexity of production environments.
Multilingual Capability:
– With its English-only training data, IBM Granite-Guardian 3.1 is limited in its ability to serve multilingual or international use cases effectively.
– Llama-REMi-v1’s multilingual training data equips it for a wider range of applications, catering to global audiences and supporting diverse linguistic environments.
Interpretability of Metrics:
– IBM Granite-Guardian 3.1 consolidates its evaluation into a binary result, with the option to obtain a score for each metric derived from the decoding probabilities of the risk or safe token (see the sketch after this list), which may lack granularity.
– Llama-REMi-v1 takes a client-centric approach by not only giving discrete scores for each of the metrics, but also providing a detailed reasoning for the answer relevance score. This makes results more interpretable and actionable, empowering clients to better understand the areas for improvement in their RAG pipelines.
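To make that scoring mechanism more concrete, here is a minimal sketch of how a continuous score can be read off two label tokens by normalizing over just those tokens. The logits and token names below are illustrative assumptions, not Granite-Guardian’s actual vocabulary; the real setup is the one defined in IBM’s example code.

```python
import torch

def label_probability(risk_logit: torch.Tensor, safe_logit: torch.Tensor) -> float:
    """Probability that the risk is present, obtained by a softmax restricted
    to the two label tokens ('risk' vs. 'safe')."""
    probs = torch.softmax(torch.stack([risk_logit, safe_logit]), dim=0)
    return probs[0].item()

# Example with made-up logits: the model leans towards the 'risk' token.
risk_prob = label_probability(torch.tensor(2.3), torch.tensor(-0.8))
print(round(risk_prob, 3))  # ~0.957
```

Collapsing the judgment into this single probability is convenient, but it is also where the granularity limitation mentioned above comes from: there is no per-aspect breakdown or accompanying reasoning.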
By embracing real-world-aligned data, multilingual support, and a more interpretable evaluation, Llama-REMi-v1 by Nuclia stands out as a practical and client-focused solution. These strengths make it more adaptable and robust compared to the controlled but limited approach of IBM’s Granite-Guardian 3.1.