I've been looking into how the OpenAI Operator, which connects with a web browser, can display a browser inside a web app. I'm curious about the technology behind this because it seems difficult to have a browser that interacts with any website due to Same Origin Policy (SOP) restrictions. My initial thought was to run a containerized browser on OpenAI's servers and stream it to the user's browser to get around SOP. Are there any other approaches? What's the best technology available for achieving this?
7 Answers
Yeah, you definitely need to stream the browser from a different server. That's pretty much the only sound way to do it.
Using a containerized headless browser combined with a WebRTC stream is an excellent approach! The backend drives the real browser (DOM, navigation, input), while the client just plays the stream and sends mouse and keyboard events back.
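To make the "sending inputs" half concrete, here is a minimal sketch of how the server side might decode input events from the client. The JSON message shape, the `Viewport` class, and the function names are all hypothetical, and the WebRTC transport itself is not shown. The idea is that the client sends coordinates normalized to the video element (0.0 to 1.0), and the server maps them onto the remote browser's viewport so the two sides don't need to agree on a resolution.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Viewport:
    """Size of the remote browser's viewport in pixels (hypothetical helper)."""
    width: int
    height: int


def to_viewport_pixels(norm_x: float, norm_y: float, viewport: Viewport) -> tuple:
    """Map normalized (0.0-1.0) coordinates onto viewport pixel coordinates."""
    # Clamp so a slightly out-of-range value from the client stays in bounds.
    x = min(max(norm_x, 0.0), 1.0)
    y = min(max(norm_y, 0.0), 1.0)
    px = min(int(x * viewport.width), viewport.width - 1)
    py = min(int(y * viewport.height), viewport.height - 1)
    return px, py


def decode_input_event(message: str, viewport: Viewport) -> dict:
    """Turn one JSON message from the client into an event for the automation layer.

    The message format here is an assumption made for illustration only.
    """
    event = json.loads(message)
    if event["type"] == "click":
        x, y = to_viewport_pixels(event["x"], event["y"], viewport)
        return {"type": "click", "x": x, "y": y}
    if event["type"] == "key":
        return {"type": "key", "key": str(event["key"])}
    raise ValueError(f"unsupported event type: {event['type']!r}")
```

The returned dict would then be handed to whatever controls the browser on the server (for example a Puppeteer or Selenium session), which performs the actual click or key press.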
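If you want to get a bit wild, you could potentially compile a browser to WebAssembly. But its network requests would still have to go through the host page's own networking (fetch, XHR), so the same-origin policy and CORS would still apply. Realistically, you're back to streaming video from a server.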
I suspect many agents use tools like Selenium or Puppeteer. They can translate the page's HTML into manageable text or even take screenshots, which could be useful to avoid some direct interactions.
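As a rough illustration of "translating the page's HTML into manageable text", here is a sketch using only Python's standard library. It is not how any particular agent does it; the class and function names are made up for this example. It assumes you already have the HTML as a string, for instance from Selenium's `driver.page_source` or Puppeteer's `page.content()`.

```python
from html.parser import HTMLParser


class VisibleTextExtractor(HTMLParser):
    """Collect text nodes, skipping elements whose content is never displayed."""

    SKIPPED = {"script", "style", "noscript", "template"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside a skipped element
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            # Collapse runs of whitespace so the output stays compact.
            text = " ".join(data.split())
            if text:
                self._chunks.append(text)

    def text(self) -> str:
        return "\n".join(self._chunks)


def html_to_text(html: str) -> str:
    """Return the text content of an HTML document, one text node per line."""
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

This only strips markup. It knows nothing about CSS visibility or layout, which is one reason screenshots can be a useful complement.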
A solid approach would be running your browser in a VM or container and streaming it using VNC. It's effective but can be resource-heavy.
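Most likely, you’d be running the browser on a server. SOP only stops scripts in one page from reading content from another origin. A server-side browser loads each site as an ordinary top-level page, so the restriction never comes into play, and the user's browser only ever sees a video stream from your own origin.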
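For reference, two URLs count as the same origin only if scheme, host, and port all match. A small sketch of that comparison (the function names are just for illustration, and only the default ports for http and https are handled):

```python
from typing import Optional, Tuple
from urllib.parse import urlsplit

# Default ports, so that https://example.com and https://example.com:443 match.
DEFAULT_PORTS = {"http": 80, "https": 443}


def origin(url: str) -> Tuple[str, str, Optional[int]]:
    """Return the (scheme, host, port) tuple that defines a URL's origin."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = parts.hostname or ""
    port = parts.port if parts.port is not None else DEFAULT_PORTS.get(scheme)
    return (scheme, host, port)


def same_origin(a: str, b: str) -> bool:
    """True if both URLs share scheme, host, and port."""
    return origin(a) == origin(b)
```

A page served from your web app and an arbitrary third-party site will practically never pass this check, which is why the embedded-in-page approaches run into trouble while the streamed browser does not.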
I think BrowserStack (or similar services) offers a good API for these kinds of browser interactions. It could give you flexibility without running into SOP issues.