

Databricks in the Netherlands: GenAI Quality and Observability

Jordan Wright
Published: 14th November 2025
Last updated: 14th November 2025

As the Databricks AI World Tour arrived in Amsterdam, SGI hosted an exclusive roundtable to help Databricks users and customers tackle one of the biggest challenges in AI today: GenAI Quality and Observability.

We were joined by Angelo Sgambati (Databricks Champion, Certified Instructor and Resident Solution Architect with Databricks, as well as a Principal Consultant at RevoData), who shared practical strategies for building reliable, high-performing GenAI systems.

Why GenAI Quality Matters

GenAI models can deliver incredible results, but they can also return incorrect information, leading to unhappy or misinformed users. Why does this happen?

  • Unpredictable inputs and non-deterministic outputs make quality assessment complex.
  • Model selection, tooling, and system architecture directly impact cost, latency, and reliability.
  • Agentic systems often perform better but introduce complexity.


Questions Every Brickster Should Ask

When developing LLM-based solutions, it’s important to ask yourself:

  • Is the system reliable and behaving as expected?
  • Are users satisfied with the outcomes?
  • Is the solution cost-effective?
  • Are there biases or ethical concerns?


Testing GenAI-like Software

Prompt engineering and vibe checks are not enough. Unlike traditional software, GenAI requires continuous evaluation:

  • Run structured evaluations (unit tests, QA, E2E testing), as sketched below
  • Implement production telemetry
  • Monitor and refine metrics to align with human judgment
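
To make this concrete, here is a minimal Python sketch of a structured, unit-test-style evaluation over a handful of question checks. The evaluation set, the keyword-based scoring rule and the generate_answer stub are illustrative assumptions rather than the exact harness discussed at the roundtable; in practice the scoring is usually richer (similarity or LLM-judged metrics) and the results are logged as telemetry.

from dataclasses import dataclass

@dataclass
class EvalCase:
    question: str
    expected_keywords: list[str]  # crude stand-in for a reference answer

# A tiny evaluation set; in practice this would be a curated set of ~100 examples.
EVAL_SET = [
    EvalCase("What is Delta Lake?", ["ACID", "storage"]),
    EvalCase("What does Unity Catalog provide?", ["governance"]),
]

def generate_answer(question: str) -> str:
    # Stub standing in for the real GenAI system under test
    # (e.g. a RAG chain or agent behind a serving endpoint).
    return "Delta Lake adds ACID transactions to cloud object storage."

def passes(case: EvalCase, answer: str) -> bool:
    # Pass only if every expected keyword appears in the answer.
    return all(k.lower() in answer.lower() for k in case.expected_keywords)

def run_eval() -> float:
    results = [passes(case, generate_answer(case.question)) for case in EVAL_SET]
    pass_rate = sum(results) / len(results)
    print(f"pass rate: {pass_rate:.0%}")
    return pass_rate

if __name__ == "__main__":
    run_eval()

Re-running a harness like this on every prompt or model change turns a vibe check into a number you can track over time, and the same pass/fail records feed the error analysis described later.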


LLM-as-a-Judge & Human-in-the-Loop

Can LLMs evaluate themselves? Yes. With LLM-as-a-Judge, an LLM applies grading criteria automatically to assess responses.

But metrics alone aren’t perfect. Bias and lack of context mean human oversight is essential to ensure accuracy and trust.
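
As a rough Python sketch of how an LLM-as-a-Judge check with a human escalation path might look: the call_llm helper, the rubric and the review threshold below are hypothetical placeholders rather than a specific Databricks API; the point is that low or unparsable scores are routed to a person instead of being trusted blindly.

JUDGE_PROMPT = """You are grading an assistant's answer.
Question: {question}
Answer: {answer}
Rate the answer's correctness from 1 (wrong) to 5 (fully correct).
Reply with a single integer."""

def call_llm(prompt: str) -> str:
    # Hypothetical placeholder for whatever model endpoint the application already uses.
    return "4"

def judge(question: str, answer: str, review_threshold: int = 3) -> dict:
    raw = call_llm(JUDGE_PROMPT.format(question=question, answer=answer))
    try:
        score = int(raw.strip())
    except ValueError:
        score = None  # the judge did not follow the output format
    # Metrics alone aren't trusted: low or unparsable scores go to a human reviewer.
    return {"score": score, "needs_human_review": score is None or score <= review_threshold}

print(judge("What is Delta Lake?", "Delta Lake is a storage format with ACID transactions."))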


Driving Development Through Evaluation

Start small:

  • Use 100-example evaluations to track progress
  • Monitor workflow changes and refine metrics that fail to reflect real-world quality
  • Apply error analysis to identify and prioritise fixes (see the sketch below)
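
To illustrate the error-analysis step, here is a small Python sketch that groups failed evaluation cases by a labelled error type and surfaces the most common ones first. The records and category names are made up for the example; in practice they would come from the ~100-example evaluation run and reviewer annotations.

from collections import Counter

# Each record: did the case pass, and (for failures) a labelled error type.
results = [
    {"passed": True,  "error_type": None},
    {"passed": False, "error_type": "hallucinated fact"},
    {"passed": False, "error_type": "missing retrieval context"},
    {"passed": False, "error_type": "hallucinated fact"},
    {"passed": True,  "error_type": None},
]

failures = [r["error_type"] for r in results if not r["passed"]]

print(f"pass rate: {sum(r['passed'] for r in results) / len(results):.0%}")
# Most frequent error types first: these are the fixes to prioritise.
for error_type, count in Counter(failures).most_common():
    print(f"{error_type}: {count} failure(s)")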


Your Next Step

Building GenAI solutions that users trust requires quality, observability, and the right talent.

If you’re looking to hire Databricks specialists in the Netherlands or want to join our next event, connect with Jordan Wright or Ross Paterson.