Our tech story is all about generative AI: we used AI to make AI-generated responses better for humans, and then used AI again to judge those responses.

We used generative AI to produce our results and then to judge them. How did we know whether our AI judge was accurate? We compared the judge's scores against human ratings, and only used the judge once the two showed a strong correlation.
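
How strong is "strong"? As a minimal sketch (the 1-to-5 scale, significance cutoff, and threshold here are illustrative stand-ins, not the values from our study), the check can be a rank correlation between judge scores and human ratings:

```python
from scipy.stats import spearmanr

def judge_is_trustworthy(judge_scores, human_scores, min_rho=0.7):
    """Accept the LLM judge only if its scores track human ratings.

    judge_scores / human_scores: parallel lists of ratings for the same
    responses (e.g., empathy on a 1-5 scale). min_rho is illustrative.
    """
    rho, p_value = spearmanr(judge_scores, human_scores)
    return (rho >= min_rho and p_value < 0.05), rho

# Example: five responses scored by both the AI judge and human raters.
ok, rho = judge_is_trustworthy([4, 2, 5, 3, 1], [5, 2, 4, 3, 1])
print(f"Spearman rho = {rho:.2f}, use judge: {ok}")
```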

The Magic of Retrieval-Augmented Generation (RAG)

Even the best and most expensive large language models (LLMs) hallucinate. This is a big problem in a field such as medicine, where even a small inaccuracy in a response can have serious consequences.

Enter RAG: a system that makes response generation more accurate by gathering relevant facts and feeding them to the LLM alongside the question.
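
In code, the core idea is only a few lines. A minimal sketch (the retriever here is a toy stand-in, not our system, and the facts are invented for illustration):

```python
from openai import OpenAI

client = OpenAI()

# Toy stand-in for a real retriever; a production system would search a
# vector store (e.g., FAISS) for the passages most similar to the query.
FACTS = [
    "Metformin is a first-line medication for type 2 diabetes.",
    "Common side effects of metformin include nausea and diarrhea.",
]

def retrieve_facts(query: str) -> list[str]:
    words = query.lower().split()
    return [f for f in FACTS if any(w in f.lower() for w in words)]

def rag_answer(query: str) -> str:
    facts = "\n".join(f"- {f}" for f in retrieve_facts(query))
    prompt = (
        "Answer using ONLY the facts below. If they are insufficient, say so.\n\n"
        f"Facts:\n{facts}\n\nQuestion: {query}"
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

print(rag_answer("What is metformin used for?"))
```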

Sounds good, doesn’t it?

But of course a RAG pipeline needs to be tuned well. If the retriever fetches too many facts, they can confuse the LLM and cause it to hallucinate all over again.

In our work, we built an Advanced RAG system that rewrote the patient query, performed multi-step retrieval, and used the retrieved material to assemble the context for the LLM.
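
A sketch of that flow, assuming a generic `search(query, k)` helper over the document store (the helper names and prompts here are mine, not the production system's):

```python
from openai import OpenAI

client = OpenAI()

def llm(prompt: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

def advanced_rag(patient_query: str, search) -> str:
    # `search` is assumed: search(query, k) -> list[str] of passages.
    # Step 1: rewrite the raw patient query into a cleaner search query.
    rewritten = llm(
        "Rewrite this patient question as a concise medical search query:\n"
        + patient_query
    )
    # Step 2: multi-step retrieval -- a first pass on the rewritten query,
    # then a second pass on a follow-up query derived from the first results.
    first_pass = search(rewritten, k=3)
    follow_up = llm(
        "Given these passages, write ONE follow-up search query that would "
        "fill the biggest remaining gap:\n" + "\n".join(first_pass)
    )
    second_pass = search(follow_up, k=2)
    # Step 3: assemble the retrieved facts into the context for generation.
    context = "\n".join(f"- {p}" for p in first_pass + second_pass)
    return llm(
        f"Context:\n{context}\n\n"
        f"Using only this context, answer the patient's question:\n{patient_query}"
    )
```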

“Give me a response only if it is empathetic”

We also optimized the generated response from our LLM: we took its first response, scored it for empathy using an LLM as a judge, and, if the response was found lacking in empathy, asked the LLM to regenerate it.
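
A minimal sketch of that generate-judge-regenerate loop follows; the 1-to-5 scale, threshold of 4, and retry cap are illustrative stand-ins, not the exact values we used:

```python
from openai import OpenAI

client = OpenAI()

def llm(prompt: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

def empathy_score(response: str) -> int:
    # LLM-as-a-judge: rate empathy on a 1-5 scale (rubric is illustrative).
    verdict = llm(
        "Rate the empathy of this reply to a patient on a scale of 1 to 5. "
        "Answer with a single digit only.\n\nReply:\n" + response
    )
    return int(verdict.strip()[0])

def empathetic_answer(question: str, min_score: int = 4, max_tries: int = 3) -> str:
    response = llm(question)
    for _ in range(max_tries):
        if empathy_score(response) >= min_score:
            break  # the judge is satisfied with this response
        response = llm(
            "Rewrite the reply below to be warmer and more empathetic, "
            "without changing the medical content.\n\nReply:\n" + response
        )
    return response
```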

In our work, responses optimized for empathy in this manner consistently outscored the non-optimized responses in human usability ratings.

Our Architecture

[Figure: MedIntellAgent architecture (medintellagent_architecture_pic.png)]

As shown in the diagram above, here is a summary of our Core Engine (Baseline Model) architecture:

1. Data Foundation (Offline)

  • Vector Store: A FAISS vector store is seeded with 80 pairs of questions and canonical SQL queries, covering 8 specific intents (sketched in the code after this list).

  • Embeddings: The system uses a large embedding model that produces 3072-dimensional vectors.

  • Database: Patient records are migrated to a PostgreSQL database.

2. Core Processing Pipeline

  • Context & Prompt: The system uses few-shot prompting, guiding the model with curated examples retrieved via FAISS.

  • Text-to-SQL LLM: The prompt is sent to gpt-4o-mini, which functions as the Text-to-SQL engine.

3. Post-LLM Pipeline (LangChain)

  • Results Processing: The SQL results are converted into an English summary using gpt-4o-mini.

  • Innovation (Agentic Optimization): The final step is an agentic optimization of the response, implemented with LangGraph and an Evaluation-Optimization Engine.
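
To make steps 1 and 2 concrete, here is a hedged sketch of seeding the FAISS store and assembling the few-shot Text-to-SQL prompt. The embedding model name is my assumption (text-embedding-3-large happens to produce 3072-dimensional vectors), and the two example pairs are invented; the real store holds 80 pairs across 8 intents.

```python
import faiss
import numpy as np
from openai import OpenAI

client = OpenAI()
DIM = 3072  # dimensionality of the embeddings described above

def embed(texts: list[str]) -> np.ndarray:
    # Assumed model: text-embedding-3-large produces 3072-dim vectors.
    out = client.embeddings.create(model="text-embedding-3-large", input=texts)
    return np.array([d.embedding for d in out.data], dtype="float32")

# Offline: seed the vector store with (question, canonical SQL) pairs.
PAIRS = [
    ("How many patients were admitted in the last 30 days?",
     "SELECT COUNT(*) FROM admissions "
     "WHERE admit_date >= CURRENT_DATE - INTERVAL '30 days';"),
    ("List all prescriptions for patient 42.",
     "SELECT * FROM prescriptions WHERE patient_id = 42;"),
]
index = faiss.IndexFlatL2(DIM)
index.add(embed([q for q, _ in PAIRS]))

# Online: retrieve the nearest examples and assemble a few-shot prompt.
def text_to_sql(user_question: str, k: int = 2) -> str:
    _, ids = index.search(embed([user_question]), k)
    shots = "\n\n".join(f"Q: {PAIRS[i][0]}\nSQL: {PAIRS[i][1]}" for i in ids[0])
    prompt = f"{shots}\n\nQ: {user_question}\nSQL:"
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content
```

Step 3 then executes the returned SQL against PostgreSQL and summarizes the rows with one more gpt-4o-mini call.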
