I'm building a B2B analytics tool that tracks how brands are portrayed across AI platforms such as ChatGPT and Gemini. For this project it's essential to capture responses directly from the ChatGPT website (chat.openai.com) rather than through the OpenAI API: API outputs can differ from what users actually see on the site because of system prompts, retrieval behavior, and formatting differences. Since accuracy relative to the real user experience is the whole point, I need a system that can handle roughly 30,000 queries per day.
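For scale context, 30,000 queries per day is a modest request rate but a lot of open browser sessions once per-response latency is factored in. A back-of-envelope sketch (the 60-second average response time is an assumption for illustration, not a measured figure):

```python
# Back-of-envelope sizing for 30,000 queries/day against a chat UI.
QUERIES_PER_DAY = 30_000
SECONDS_PER_DAY = 24 * 60 * 60
AVG_RESPONSE_SECONDS = 60  # assumed: time for a response to finish streaming

rate_qps = QUERIES_PER_DAY / SECONDS_PER_DAY           # sustained queries/sec
concurrent_sessions = rate_qps * AVG_RESPONSE_SECONDS  # Little's law: L = lambda * W

print(f"{rate_qps:.2f} queries/sec sustained")
print(f"~{concurrent_sessions:.0f} browser sessions open at any moment")
```

So even spread evenly over 24 hours, you'd need on the order of 20 concurrent authenticated browser sessions running around the clock, before accounting for retries, bans, or daytime-only scheduling.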
I'm exploring headless browsers (Playwright, Puppeteer, Selenium), but building and maintaining a scraper at this scale looks daunting: Cloudflare protections, bot detection, and frequent changes to the site's UI all complicate matters. I'm also considering managed third-party services that handle the browser automation, proxy rotation, and session management for you. I'd appreciate practical advice on how to approach this, or pointers to reputable services that could help.
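For concreteness, here's a minimal Playwright sketch of one per-session query loop, with a pure round-robin proxy rotator split out so it works with any driver. Everything here is illustrative: the `PROXIES` list, the CSS selectors, and the `ask_chatgpt` helper are assumptions, not verified against the current chat.openai.com DOM, which changes often.

```python
from itertools import cycle

# Hypothetical proxy pool; a managed service would normally supply this.
PROXIES = ["http://proxy1:8080", "http://proxy2:8080", "http://proxy3:8080"]

def proxy_rotator(proxies):
    """Round-robin over a proxy pool; pure logic, driver-agnostic."""
    return cycle(proxies)

def ask_chatgpt(prompt: str, proxy: str) -> str:
    """Illustrative Playwright flow. The selectors are guesses and WILL need
    updating against the live UI; requires `pip install playwright` plus an
    authenticated storage state ("auth.json") saved from a manual login."""
    from playwright.sync_api import sync_playwright  # deferred: optional dep
    with sync_playwright() as p:
        browser = p.chromium.launch(proxy={"server": proxy})
        context = browser.new_context(storage_state="auth.json")  # reuse login
        page = context.new_page()
        page.goto("https://chat.openai.com/")
        page.fill("#prompt-textarea", prompt)        # assumed selector
        page.keyboard.press("Enter")
        # Assumed signal that streaming finished: the stop button disappears.
        page.wait_for_selector('[data-testid="stop-button"]',
                               state="detached", timeout=120_000)
        answer = page.locator(".markdown").last.inner_text()  # assumed selector
        browser.close()
        return answer

rotation = proxy_rotator(PROXIES)
# Each query takes the next proxy in the pool, wrapping around.
first_three = [next(rotation) for _ in range(3)]
```

The rotator is deliberately separate from the browser code so the scheduling logic can be unit-tested without launching Chromium.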
2 Answers
Keep in mind that public APIs exist precisely so you don't need scrapers like this. If you're seeing different outputs from the API than from the website, that may simply reflect how responses are generated: LLM output is non-deterministic, so identical results aren't guaranteed anywhere. Sticking with the API is usually the safer route.
Honestly, scraping is really your only solid option here. You'd have to get creative to bypass the bot protections and keep adapting to UI changes. Just know you'd be in tricky territory: automated access to the site almost certainly violates OpenAI's terms of use, so expect account bans on top of the engineering headaches.

Yeah, but the differences stem from the website's own layers (system prompts, retrieval, formatting), not just the model's response variability. That's the real issue here.