Presentations

Course Materials

  1. Course outline (pdf) . Course introduction Slides (pdf)
  2. Boolean retrieval model Slides (pdf)
  3. Text transformation slides , IIR chapter 2
  4. Indexing and search documents Lucene (Slides)
  5. Error tolerance search slides, IIR chapter 3
  6. Text processing basics (slides) .
  7. Text statistics. [slides] Data [vldb.txt]
  8. Vector space model. [slides]
  9. Word embedding [slides]
  10. Language modelling Slides. Chapter 12 [IIR], Chapter 3 [SLP]
  11. Classification Naive Bayes slides , [vldb_train]. [icse_train]. [vldb_test]. [icse_test].
  12. Sentence Embedding [Slides]
  13. Vector search Slides
  14. Feature selection. Feature selection slides (pdf).
  15. Clustering [K-means Slides] [HAC Slides] [IIR chapter 16] IIR chapter 17
  16. Link analysis. Slides [PDF] . IIR chapter 21
      Now you can find PageRank values for research papers. Which paper is the most influential in the following data set?
    • A citation graph for SIGMOD conferences [Download the graph]. It contains 6709 edges (citations) among SIGMOD papers. Each line represents a citation from the first node to the second.
    • The meta data for SIGMOD papers can be downloaded from here. The first column is the paper ID; the second and third are titles. Note that paper IDs in this file are in hexadecimal format. You can convert hexadecimal to decimal using int(aHexNumber, 16) in Python.
    • Here is starter Python code to process the graph. Note that it uses NetworkX to find the large component and calculate the PageRank values. The program returns the top five papers.
    • Larger data sets covering papers from three conferences: graph for SIGMOD/VLDB/ICSE, meta data for VLDB and ICSE. Here is the Python code to extract subgraphs: subsets.py.
    • Here is a good tutorial on writing PageRank for larger graphs in Java.
    • This is the entire citation graph from Microsoft Academic data citations.txt.gz (4.1GB). It contains 528M edges. The data format is
      	PaperID \t PaperID
      
    • All the titles in Microsoft Academic data id_title.txt.gz (5.1GB). The data format is
      	PaperID \t title 
      
      In some cases, downloading from a browser does not work, but wget works fine:
      	wget http://jlu.myweb.cs.uwindsor.ca/8380/id_title.txt.gz
      
    • AMiner data has 25M citations along with titles and abstracts for more than 3M papers. The most recent one is dated Oct 2017. It is in JSON format.
    • A link to some other data sources.
  17. LSI (Latent Semantic Indexing) and SVD (Singular Value Decomposition) [slides] SVD example Chapter 18 [IIR] IIR web page
  18. Graph embedding [nbviewer] Reading: DeepWalk paper , node2vec.
  19. Near duplicate slides [PDF] .
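
For the link analysis exercise (item 16), the following is a minimal sketch of the pipeline described there: read the citation edges, convert the hexadecimal paper IDs from the meta data with int(x, 16), compute PageRank, and list the top five papers. It is not the course's starter code. That code uses NetworkX and restricts the graph to the large component; this sketch swaps in a plain power iteration so that it has no dependencies, and it skips the component step. The function names (read_edges, read_titles, pagerank, top_papers) are hypothetical, and the file layouts are assumptions: whitespace-separated decimal IDs in the graph file, tab-separated columns in the meta data.

```python
def read_edges(lines):
    """Parse one citation per line: 'source target'.
    Assumes whitespace-separated decimal IDs (not confirmed by the course page)."""
    edges = []
    for line in lines:
        parts = line.split()
        if len(parts) >= 2:
            edges.append((int(parts[0]), int(parts[1])))
    return edges


def read_titles(lines):
    """Map decimal paper ID -> title.
    The first column is a hexadecimal ID; tab separation is an assumption."""
    titles = {}
    for line in lines:
        cols = line.rstrip("\n").split("\t")
        if len(cols) >= 2:
            titles[int(cols[0], 16)] = cols[1]
    return titles


def pagerank(edges, damping=0.85, max_iter=100, tol=1e-10):
    """Power-iteration PageRank on a directed graph given as (source, target) pairs."""
    out_links = {}
    for src, dst in edges:
        out_links.setdefault(src, set()).add(dst)
        out_links.setdefault(dst, set())  # make sure every node has an entry
    if not out_links:
        return {}
    n = len(out_links)
    rank = {v: 1.0 / n for v in out_links}
    for _ in range(max_iter):
        # Rank held by nodes without outgoing citations is spread uniformly.
        dangling = sum(rank[v] for v, targets in out_links.items() if not targets)
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {v: base for v in out_links}
        for src, targets in out_links.items():
            if targets:
                share = damping * rank[src] / len(targets)
                for dst in targets:
                    new_rank[dst] += share
        delta = sum(abs(new_rank[v] - rank[v]) for v in out_links)
        rank = new_rank
        if delta < tol:
            break
    return rank


def top_papers(rank, titles, k=5):
    """Return (paper ID, title, score) for the k highest-ranked papers."""
    best = sorted(rank, key=rank.get, reverse=True)[:k]
    return [(pid, titles.get(pid, "(title unknown)"), rank[pid]) for pid in best]


if __name__ == "__main__":
    # Toy data for illustration only; not taken from the SIGMOD files.
    toy_graph = ["26 27", "28 27", "29 27", "27 26"]
    toy_meta = ["1a\tPaper A", "1b\tPaper B", "1c\tPaper C", "1d\tPaper D"]
    scores = pagerank(read_edges(toy_graph))
    for pid, title, score in top_papers(scores, read_titles(toy_meta)):
        print(pid, title, round(score, 4))
```

Because read_edges and read_titles accept any iterable of lines, an open file handle can be passed directly. For the compressed files, gzip.open(path, "rt") yields text lines without unpacking the archive first. The full Microsoft Academic graph (528M edges) is unlikely to fit in memory with this dictionary-based representation, so treat the sketch as suitable for the SIGMOD-sized graph only.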

Resources

Text Book

Other reference books: