Seeing Grok-3 hit 94% on CJR is definitely impressive, but I find these benchmark numbers rarely map to real-world edge cases in production. We are still seeing models hallucinate on niche domain queries despite high scores (see the CJR citation study findings).