Now you can find PageRank values for research papers. Which paper is the most influential in the following data set?
A citation graph for SIGMOD conferences [Download the graph]. It contains 6709 edges (citations) among SIGMOD papers. Each line represents a citation from the first node to the second.
The metadata for SIGMOD papers can be downloaded from here. The first column is the paper ID; the second and third are titles. Note that the paper IDs in this file are in hexadecimal format. You can transform hexadecimal to decimal using int(aHexNumber, 16) in Python.
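As a small illustration of the conversion above, the sketch below wraps int(aHexNumber, 16) in a helper. The function name hex_id_to_decimal and the sample IDs are hypothetical and are not taken from the metadata file.

```python
def hex_id_to_decimal(hex_id: str) -> int:
    """Convert a hexadecimal paper ID (as text) to a decimal integer.

    Hypothetical helper: the name and the sample IDs below are illustrative
    and do not come from the actual metadata file.
    """
    # strip() guards against stray whitespace or a trailing newline
    return int(hex_id.strip(), 16)


if __name__ == "__main__":
    print(hex_id_to_decimal("7D4"))  # prints 2004
    print(hex_id_to_decimal("ff"))   # prints 255
```

Converting the IDs this way lets you match metadata rows against the node IDs in the citation graph, assuming the graph file stores them in decimal.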
Here is starter Python code to process the graph. Note that it uses NetworkX to find the largest component and calculate the PageRank values. The program returns the top five papers.
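The starter code itself is not reproduced here. The following is only a minimal sketch of the same idea, under several assumptions: each edge line holds two whitespace-separated decimal node IDs, "component" means weakly connected component, and the damping factor is 0.85. The function name top_papers is hypothetical.

```python
import networkx as nx


def top_papers(edge_lines, k=5, alpha=0.85):
    """Return the k highest-PageRank nodes as (node, score) pairs.

    Assumptions (not confirmed by the assignment): each line holds two
    whitespace-separated decimal IDs, and the component of interest is
    the largest weakly connected component.
    """
    graph = nx.DiGraph()
    for line in edge_lines:
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        # Edge direction: the citing paper points to the cited paper.
        graph.add_edge(int(parts[0]), int(parts[1]))

    if graph.number_of_nodes() == 0:
        return []

    # Keep only the largest weakly connected component.
    largest = max(nx.weakly_connected_components(graph), key=len)
    core = graph.subgraph(largest).copy()

    scores = nx.pagerank(core, alpha=alpha)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:k]


if __name__ == "__main__":
    # Toy edges for illustration only; not from the SIGMOD graph.
    sample = ["1 3", "2 3", "4 3", "10 11"]
    for node, score in top_papers(sample):
        print(node, round(score, 4))
```

To run it on the real data, pass an open file object for the downloaded graph instead of the toy list, then look up the returned IDs in the metadata file.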
Code and data for XML: Here is sample Java code (DBLP.java) that parses DBLP data using a SAX parser and feeds it into a Lucene index. DBLP is a collection of CS papers (DBLP site). Some data in XML.
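For readers who prefer Python, the sketch below shows the SAX parsing step only, using the standard library's xml.sax instead of the Java parser in DBLP.java. It does not cover Lucene indexing. The handler name TitleHandler, the function extract_titles, and the sample record are hypothetical. The sketch assumes that titles appear in title elements.

```python
import xml.sax


class TitleHandler(xml.sax.ContentHandler):
    """Collect the text of every <title> element (hypothetical handler)."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._in_title = False
        self._buffer = []

    def startElement(self, name, attrs):
        if name == "title":
            self._in_title = True
            self._buffer = []

    def characters(self, content):
        # SAX may deliver one text node in several chunks, so accumulate.
        if self._in_title:
            self._buffer.append(content)

    def endElement(self, name):
        if name == "title":
            self.titles.append("".join(self._buffer).strip())
            self._in_title = False


def extract_titles(xml_text: str):
    """Parse an XML string and return the list of titles found."""
    handler = TitleHandler()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.titles


if __name__ == "__main__":
    # Made-up record for illustration; not real DBLP data.
    sample = (
        "<dblp><inproceedings><author>A. Author</author>"
        "<title>Example Paper A</title></inproceedings></dblp>"
    )
    print(extract_titles(sample))
```

Because SAX streams the document event by event, this approach avoids loading the whole file into memory, which matters for a collection as large as DBLP.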