1. Course outline (pdf). Course introduction slides (pdf)
  2. Boolean retrieval model Slides (pdf)
  3. Text processing basics (slides).
  4. Indexing and search documents Lucene (Slides)
  5. Text transformation slides , IIR chapter 2
  6. Error tolerance search slides, IIR chapter 3
  7. Vector space model. [slides]
  8. Link analysis. slides (pptx file). IIR chapter 21
      Now you can find PageRank values for research papers. Which paper is the most influential in the following data set?
    • A citation graph for SIGMOD conferences [Download the graph]. It contains 6709 edges (citations) among SIGMOD papers. Each line represents a citation from the first node to the second.
    • The metadata for SIGMOD papers can be downloaded from here. The first column is the paper ID; the second and the third are titles. Note that the paper IDs in this file are in hexadecimal. You can convert hexadecimal to decimal using int(aHexNumber, 16) in Python.
    • Here is sample Python code to process the graph. It uses NetworkX to find the large component and calculate the PageRank values. The program returns the top five papers.
    • Larger data sets that include papers from three conferences: graph for SIGMOD/VLDB/ICSE, metadata for VLDB and ICSE. Here is the Python code to extract subgraphs: subsets.py. Here is a good tutorial on writing PageRank for larger graphs in Java.
    • Now you can run node2vec on these networks to obtain shorter vector representations for papers.
    • This is the entire citation graph from Microsoft Academic data citations.txt.gz (4.1GB). It contains 528M edges. The data format is
      	PaperID \t PaperID
      
    • All the titles in Microsoft Academic data id_title.txt.gz (5.1GB). The data format is
      	PaperID \t title 
      
      In some cases, downloading from a browser does not work, but wget works fine:
      	wget http://jlu.myweb.cs.uwindsor.ca/538/id_title.txt.gz
      
    • AMiner data has 25M citations along with titles and abstracts for more than 3M papers. The most recent one is dated Oct 2017. It is in JSON format.
    • A link to some other data sources.
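
      The PageRank exercise above can be sketched in plain Python. This is not the course's sample program (which uses NetworkX), only a minimal power-iteration sketch under several assumptions: edge lines are whitespace-separated "source target" pairs with decimal IDs, the metadata file is tab-separated with a hexadecimal ID in the first column, and the step that restricts the graph to its large component is skipped. The function names and the damping factor of 0.85 are illustrative choices, not taken from the course material.

```python
from collections import defaultdict


def parse_edges(lines):
    """Parse 'source target' lines into (source, target) pairs.
    Assumes whitespace-separated columns; other lines are skipped."""
    edges = []
    for line in lines:
        parts = line.split()
        if len(parts) >= 2:
            edges.append((parts[0], parts[1]))
    return edges


def pagerank(edges, damping=0.85, iterations=50):
    """Power-iteration PageRank over a directed edge list."""
    out_links = defaultdict(set)
    nodes = set()
    for src, dst in edges:
        out_links[src].add(dst)
        nodes.update((src, dst))
    if not nodes:
        return {}
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by papers that cite nothing is spread over all nodes.
        dangling = sum(rank[v] for v in nodes if not out_links.get(v))
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {node: base for node in nodes}
        for src, targets in out_links.items():
            share = damping * rank[src] / len(targets)
            for dst in targets:
                new_rank[dst] += share
        rank = new_rank
    return rank


def load_titles(lines):
    """Map decimal paper IDs (as strings) to titles.
    Assumes tab-separated columns with a hexadecimal ID first."""
    titles = {}
    for line in lines:
        cols = line.rstrip("\n").split("\t")
        if len(cols) >= 2:
            titles[str(int(cols[0], 16))] = cols[1]
    return titles


def top_papers(edges, titles, k=5):
    """Return (paper ID, title, PageRank) for the k highest-ranked papers."""
    rank = pagerank(edges)
    best = sorted(rank, key=rank.get, reverse=True)[:k]
    return [(p, titles.get(p, "(unknown title)"), rank[p]) for p in best]
```

      To try it on the downloaded files, pass open file objects to parse_edges and load_titles, then call top_papers(edges, titles). For the larger graphs, a library such as NetworkX or a disk-based approach is likely a better fit than this in-memory sketch.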
  9. Text statistics. slides (pdf file). Language models slides (pptx file). chapter 12
  10. LSI (Latent Semantic Indexing) and SVD (Singular Value Decomposition) [slides] chapter 18 IIR web page
  11. Word embedding
  12. Near duplicate slides (pptx).
  13. Crawling the web. Crawling lecture slides. IIR chapter 20. Tutorial on crawling tools. chapter 6
  14. Classification. Naive Bayes slides (pdf), feature selection slides (pdf).
  15. Evaluation. slides. chapter 8
  16. Clustering [K-means Slides] [HAC Slides] [IIR chapter 16] ; IIR chapter 17
  17. Week 12: Project presentation

Project reports

You are required to submit a project report for each stage. The final project report should not exceed four pages and should use the ACM sig-alternate.cls formatting style. You can find the sample document and the tex and cls files here. In the report, describe the details of your project, including your data, experiments, results, and analysis. Include the link to your web site, which hosts a demo of your search engine and contains more details of your project. You can consider reporting on the following aspects. Note that you are not REQUIRED to cover all of these subtopics; you can select a few and report on the ones you have done.

Resources

Text Book

Other reference books:

Tools