The benchmark results are interesting, but seeing a HalluHard score of 30.2% even with web search enabled really highlights the limits of current RAG implementations. It shows that simply retrieving more data is not a silver bullet for grounding or for reducing abstention failures.