This scraper lets you scrape game and trophy data from the official PlayStation website. It uses PhantomJS to scrape the data for a given user profile. The data can then be parsed by a programming language and inserted into a database.
This is still an early version of the code. I finally have it producing output, so I’m going to post it now and update it with more code the next time I am able to work on it.
There is no official PSN API, so we will have to take advantage of the data available on the public trophies DB on the PlayStation website. There are websites that have trophy list information for PlayStation trophies, so I wanted to find a way to do the same. This website uses a ton of JavaScript and AJAX, meaning we can’t easily scrape the data using PHP or any other language on its own. To make life even harder, they seem to have a lot of safeguards in place to prevent people from getting access to this trophy data easily. I’m not sure why they have to be so secretive with this stuff!
Anyway, I was able to use PhantomJS to get past the difficult stuff put in place to block users from doing this, and I successfully dumped the data from a user’s trophy list into a text file. The hardest part is done now. Once we have access to the list it’s just a matter of gathering the data you want. It gets returned in HTML format, which makes it nice and easy to parse.
If anyone has suggestions or improvements, please post them. As a group we may get a fully functioning scraper working!
This is my first time using PhantomJS so I’m still getting the hang of it. For anyone who doesn’t know what it is, PhantomJS is a headless browser: it lets a script interact with a web page the way a user would. You provide a URL and give commands, and PhantomJS can then click buttons and fill in fields on the page just like a user would. We can return the content of the current page at any time, which allows us to pull the trophy data, or anything else for that matter.
For this I will use Hakoom as the username since he has the most trophies of any user on PSN. If you visit the website yourself you will see that the page only loads a certain number of trophies, and there is an AJAX button at the bottom of the page to load more content. Handling that button is the next thing I will add to this document so that we can scrape a huge number of trophies at once. For now the code is able to get the first page of trophies, which I think is a good start (considering it took me ages to get working 😛 ).
To run this script you will need to have PhantomJS installed. This is a command line tool, but it’s not as difficult to use as it might seem. If you save the code below to a file called psn.js you can run it using the following command.
phantomjs psn.js
Waiting for trophy list to load...
'waitFor()' finished in 1270ms.
The window will then dump a huge load of HTML that contains the trophy data. The important part looks like this.
<h2 class="clearfix title">The Swapper</h2></div><ul class="trophies clearfix"><li class="bronze">0</li><li class="silver">0</li><li class="gold">0</li><li class="platinum">0</li></ul>
The best way to handle this data is to use a programming language to parse the HTML and pull out the data we need. There are many languages you can use to do this. One of the simplest ways, in my opinion, is to use PHP. The shell_exec() function will run the above command and return all of its output as a string, which you can then parse. (exec() on its own only returns the last line of the output.) You will need to update the path to psn.js if the PHP file is not in the same folder as psn.js. So the call might look like this.
$trophyOutput = shell_exec("phantomjs /var/www/psnscrapper/psn.js");
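If you would rather stay in JavaScript for the parsing step, here is a rough sketch of how the titles and trophy counts could be pulled out of the dumped HTML. The function name parseTrophies is made up for this example, and the regular expression assumes the markup looks exactly like the sample fragment shown earlier (same class names, same order of the li elements). If the real page differs, the pattern will need adjusting.

```javascript
// Hypothetical helper (not part of psn.js): extracts game titles and trophy
// counts from the HTML that psn.js prints. Assumes the markup matches the
// sample fragment: an <h2 class="clearfix title"> followed by four <li>
// elements in the order bronze, silver, gold, platinum.
function parseTrophies(html) {
  var pattern = new RegExp(
    '<h2 class="clearfix title">([^<]*)</h2>[\\s\\S]*?' +
    '<li class="bronze">(\\d+)</li>\\s*' +
    '<li class="silver">(\\d+)</li>\\s*' +
    '<li class="gold">(\\d+)</li>\\s*' +
    '<li class="platinum">(\\d+)</li>',
    'g'
  );
  var games = [];
  var match;
  // exec() with the "g" flag walks through every game block in turn.
  while ((match = pattern.exec(html)) !== null) {
    games.push({
      title: match[1].trim(),
      bronze: parseInt(match[2], 10),
      silver: parseInt(match[3], 10),
      gold: parseInt(match[4], 10),
      platinum: parseInt(match[5], 10)
    });
  }
  return games;
}

// The sample fragment from this post.
var sampleHtml = '<h2 class="clearfix title">The Swapper</h2></div>' +
  '<ul class="trophies clearfix"><li class="bronze">0</li>' +
  '<li class="silver">0</li><li class="gold">0</li>' +
  '<li class="platinum">0</li></ul>';

console.log(JSON.stringify(parseTrophies(sampleHtml)));
```

A regular expression is fine for a quick experiment like this, but it is brittle. A proper HTML parser would probably cope better if the page layout changes.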
File Contents : psn.js
var page = require('webpage').create();

//open the url of the playstation trophy site.
page.open('http://my.playstation.com/logged-in/trophies/public-trophies/', function(status) {
    page.evaluate(function() {
        document.getElementById("trophiesId").value = "hakoom";
        //checkPTrophies(); btn click calls this function
        $('#btn_publictrophy').click().delay( 6000 );
    });

    //generally this completes in about 300-500ms.
    console.log("\nWaiting for trophy list to load...");
    waitFor(function(){
        return page.evaluate(function(){
            //this div contains all of the trophy content. Once this is present then we know that the page has successfully loaded and we are now able to pull the trophy data.
            //This is the most difficult part of using this tool. If you try calling values that arent loaded yet it can mess things up.
            var e = document.querySelector("#trophyTrophyList .trophy-image");
            return e;
        });
    }, function(){
        setTimeout(function(){
            var trophiesDiv = page.evaluate(function(){
                //dump all of the trophy list innerHTML data.
                return document.getElementById("trophyTrophyList").innerHTML;
            });
            console.log(trophiesDiv);
            phantom.exit();
        }, 1000); // wait a little longer
    }, 20000);
});

//thanks to Artjom B for helping with this part.
function waitFor(testFx, onReady, timeOutMillis) {
    var maxtimeOutMillis = timeOutMillis ? timeOutMillis : 3000, //< Default Max Timout is 3s
        start = new Date().getTime(),
        condition = false,
        interval = setInterval(function() {
            if ( (new Date().getTime() - start < maxtimeOutMillis) && !condition ) {
                // If not time-out yet and condition not yet fulfilled
                condition = (typeof(testFx) === "string" ? eval(testFx) : testFx()); //< defensive code
            } else {
                if(!condition) {
                    // If condition still not fulfilled (timeout but condition is 'false')
                    console.log("'waitFor()' timeout");
                    phantom.exit(1);
                } else {
                    // Condition fulfilled (timeout and/or condition is 'true')
                    console.log("'waitFor()' finished in " + (new Date().getTime() - start) + "ms.");
                    typeof(onReady) === "string" ? eval(onReady) : onReady(); //< Do what it's supposed to do once the condition is fulfilled
                    clearInterval(interval); //< Stop this interval
                }
            }
        }, 250); //< repeat check every 250ms
}