Comp-8380: Information Retrieval Systems

Instructor: Professor Jianguo Lu. email: jlu at u windsor. Office: LT 5111. Office hours: Thursday 9:00-11:00.
Dillon Hall 359, Monday 10:00-12:50
Comp8380 2023

Remember to refresh your browser to get the most up-to-date page.
Here is the report template for your final project. [PDF] [.tex file] [IEEEtran.cls]

Course Materials

Course outline (pdf). Course introduction slides (pdf)
Boolean retrieval model Slides (pdf)
- Readings: Chapter 1
Text transformation slides, IIR chapter 2
Indexing and search documents Lucene (Slides)
- Readings: Lucene in Action
- Starter code: IndexAllFilesInDirectory.java, SearchIndexedDocs.java
- Data: 10,000 CiteSeer papers in gz format (about 1GB). Titles of DBLP data (272 MB, 5,459,997 paper titles). Use wget or "Save as" to download the data.
- Code and data for XML: DBLP.java is sample Java code that parses DBLP data with a SAX parser and feeds it into a Lucene index. DBLP is a collection of CS papers (DBLP site). Some data in XML.
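The SAX approach mentioned for DBLP can be sketched in Python as well. This is a minimal illustration, not a port of DBLP.java: it only collects the text of `<title>` elements and does not build a Lucene index. The sample XML is made up and only imitates the general shape of DBLP records; the real dump also relies on a DTD for character entities, which this sketch does not handle.

```python
import io
import xml.sax


class TitleHandler(xml.sax.ContentHandler):
    """Collect the text of every <title> element in a DBLP-style XML stream."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._in_title = False
        self._buffer = []

    def startElement(self, name, attrs):
        if name == "title":
            self._in_title = True
            self._buffer = []

    def characters(self, content):
        # SAX may deliver one text node in several chunks, so accumulate them.
        if self._in_title:
            self._buffer.append(content)

    def endElement(self, name):
        if name == "title":
            self.titles.append("".join(self._buffer).strip())
            self._in_title = False


def extract_titles(xml_bytes):
    """Parse XML bytes and return the list of title strings in document order."""
    handler = TitleHandler()
    xml.sax.parse(io.BytesIO(xml_bytes), handler)
    return handler.titles


# Made-up sample in the general shape of DBLP records (not real data).
SAMPLE = b"""<dblp>
  <article key="journals/example/A1">
    <author>First Author</author>
    <title>An Example Title.</title>
    <year>2001</year>
  </article>
  <inproceedings key="conf/example/B2">
    <author>Second Author</author>
    <title>Another <i>Example</i> Title.</title>
    <year>2002</year>
  </inproceedings>
</dblp>"""

if __name__ == "__main__":
    for title in extract_titles(SAMPLE):
        print(title)
```

Because SAX streams the file instead of loading it into memory, the same handler shape should scale to a large dump; each collected title would then be handed to the indexer rather than kept in a list.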
Error tolerance search slides, IIR chapter 3
Text processing basics (slides).
- Readings: Unix for Poets
- Jupyter Notebook Tutorial [nbviewer]
Text statistics. [slides] Data [vldb.txt]
Vector space model. [slides]
Word embedding [slides]
- Data to train embeddings: Titles of DBLP data (272 MB, 5,459,997 paper titles)
- Visualize embedding How to use t-SNE effectively
- Word embedding evaluation. WS353-Sim.txt
- Word co-occurrence. Embedding obtained from co-occurrence matrix. Reading: O Levy et al. Neural Word Embedding as Implicit Matrix Factorization
- Readings:
Crawling tool: tutorial on crawling tools.
Language modelling slides. Chapter 12 [IIR], Chapter 3 [SLP]
- Good-Turing smoothing. Reading: Good-Turing Smoothing Without Tears
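As a rough illustration of the basic Good-Turing idea, the sketch below computes adjusted counts c* = (c + 1) N_{c+1} / N_c, where N_c is the number of distinct items seen exactly c times, and reserves N_1 / N of the probability mass for unseen items. It is deliberately simplified: it does not smooth the N_c values as the "Without Tears" reading does, and it falls back to the raw count when N_{c+1} is zero. The fish counts are toy data.

```python
from collections import Counter


def good_turing_counts(counts):
    """Return ({c: c*}, p_unseen) for a dict of item -> raw count.

    c* = (c + 1) * N_{c+1} / N_c, where N_c is the number of distinct
    items seen exactly c times. When N_{c+1} is zero the raw count is
    kept, which is a simplification of this sketch.
    """
    freq_of_freq = Counter(counts.values())    # N_c for each observed c
    total = sum(counts.values())               # N, total observations
    adjusted = {}
    for c, n_c in freq_of_freq.items():
        n_next = freq_of_freq.get(c + 1, 0)
        adjusted[c] = (c + 1) * n_next / n_c if n_next else float(c)
    p_unseen = freq_of_freq.get(1, 0) / total  # mass reserved for unseen items
    return adjusted, p_unseen


# Toy data: species caught on a fishing trip.
FISH = {"carp": 10, "perch": 3, "whitefish": 2, "trout": 1, "salmon": 1, "eel": 1}

if __name__ == "__main__":
    adjusted, p_unseen = good_turing_counts(FISH)
    print(adjusted)
    print(p_unseen)
```

With these counts, three species were seen once and one was seen twice, so a singleton's count is discounted from 1 to 2/3 and 3/18 of the mass goes to species not yet seen.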
Pre-trained language models
- Use BERT to predict the next word
- Use BERT to classify text documents based on their embeddings
- Fine tuning (the shorter version)
- Fine tuning with training loops
- Fine tuning llama
- BERT 101: A Visual Notebook to Using BERT for the First Time.ipynb
Classification Naive Bayes slides , [vldb_train]. [icse_train]. [vldb_test]. [icse_test].
- Readings:
  - chapter 13 [IIR],
  - Tackling the Poor Assumptions of Naive Bayes Text Classifiers. Jason D. M. Rennie
Sentence Embedding [Slides]
- Our paper out of last semester's course project:
  - Ali Forooghii, Shaghayegh Sadeghi, Jianguo Lu, Whitening Not Recommended for Classification Tasks in LLMs [PDF]. ACL Workshop on Representation Learning for NLP, 2024.
  - Latex source code arxiv.zip . You can use this as a template for your report. You can reuse the bib file. Note that the references are clickable.
  - GitHub code
- Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks (SBERT) (Starter code is in BERT embedding colab)
- SimCSE [PDF], starter code to use SimCSE (how it is fine-tuned), Simplified SimCSE
- Improving Text Embeddings with Large Language Models
- Whitening
- Angle optimized
  1. Angle-optimized text embedding (SOTA now). This work (and the one below) includes a comparison with embeddings improved from the LLaMA model.
  2. [Colab]
  3. Scaling Sentence Embeddings with Large Language Model
- Benchmarks
  - Sentence Embedding Evaluation Toolkit SentEval
  - Embedding benchmarks: MTEB (Massive Text Embedding Benchmark) and leaderboard (we should reuse the code here; it is straightforward to use).
  - Sentence Transformers: collection of transformers (we can obtain embeddings using one simple call on the model name).
- Other papers
  - PromptBert
  - Beyond Words: A Comprehensive Survey of Sentence Representations
Vector search Slides
Feature selection. Slides (pdf).
- Code: Mutual Information
- Top bigrams obtained from the code run on SE and DB papers.
- Data: VLDB, SIGMOD, ICSE.
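For orientation, mutual information between a term and a class can be computed from a 2x2 table of document counts, following the formulation in IIR chapter 13. The sketch below is not the linked course code; the function name and argument order are choices made here, and the example counts are the export/poultry figures used as an illustration in IIR.

```python
import math


def mutual_information(n11, n10, n01, n00):
    """Mutual information (in bits) between term occurrence and class membership.

    n11: docs that contain the term and are in the class
    n10: docs that contain the term and are not in the class
    n01: docs that lack the term and are in the class
    n00: docs that lack the term and are not in the class
    """
    n = n11 + n10 + n01 + n00
    # Each cell: (joint count, term marginal, class marginal).
    cells = [
        (n11, n11 + n10, n11 + n01),
        (n10, n11 + n10, n10 + n00),
        (n01, n01 + n00, n11 + n01),
        (n00, n01 + n00, n10 + n00),
    ]
    total = 0.0
    for joint, term_marginal, class_marginal in cells:
        if joint > 0:  # a zero cell contributes nothing (0 * log 0 is taken as 0)
            total += (joint / n) * math.log2(n * joint / (term_marginal * class_marginal))
    return total


if __name__ == "__main__":
    # Counts for the term "export" and the class "poultry" from the IIR example.
    print(mutual_information(49, 27652, 141, 774106))
```

For the IIR counts the result is roughly 0.00011 bits. Ranking all terms (or bigrams) by this score and keeping the top k is the usual way to turn it into a feature selector.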
Clustering [K-means Slides] [IIR chapter 16]
- A Survey of text clustering algorithms, Charu C. Aggarwal, Cheng-xiang Zhai
Link analysis. Slides [PDF] . IIR chapter 21
- A citation graph for SIGMOD conferences [Download the graph]. It contains 6709 edges (citations) among SIGMOD papers. Each line represents a citation from the first node to the second.
- The metadata for SIGMOD papers can be downloaded from here. The first column is the paper ID; the second and third are titles. Note that in this file paper IDs are in hexadecimal. You can transform hexadecimal to decimal using int(aHexNumber,16) in Python.
- Here is starter Python code to process the graph. It uses NetworkX to find the largest component and calculate the PageRank values. The program returns the top five papers.
- Larger data sets covering papers in three conferences: graph for SIGMOD/VLDB/ICSE, metadata for VLDB and ICSE. Here is the Python code to extract subgraphs: subsets.py.
- Here is a good tutorial on writing PageRank for larger graphs in Java.
- This is the entire citation graph from Microsoft Academic data citations.txt.gz (4.1GB). It contains 528M edges. The data format is
```
	PaperID \t PaperID
```
- All the titles in Microsoft Academic data id_title.txt.gz (5.1GB). The data format is
```
	PaperID \t title 
```
  In some cases, downloading from a browser does not work, but wget works fine:
```
	wget http://jlu.myweb.cs.uwindsor.ca/8380/id_title.txt.gz
```
- AMiner data has 25M citations along with titles and abstracts for more than 3M papers. The most recent one is dated Oct 2017. It is in JSON format.
- A link to some other data sources.
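To make the citation-graph workflow concrete, here is a dependency-free sketch that reads an edge list, converts hexadecimal metadata IDs with int(x, 16), and ranks papers by PageRank. It is not the starter code: it uses plain power iteration instead of NetworkX and skips the largest-component step. It also assumes that edge-list IDs are decimal and whitespace-separated and that metadata columns are tab-separated, which should be checked against the actual files. The sample data is made up.

```python
def parse_edges(lines):
    """Parse 'citing cited' lines into (int, int) pairs (decimal IDs assumed)."""
    edges = []
    for line in lines:
        parts = line.split()
        if len(parts) >= 2:
            edges.append((int(parts[0]), int(parts[1])))
    return edges


def load_titles(lines):
    """Map decimal paper ID -> title from 'hexID<TAB>title<TAB>...' lines."""
    titles = {}
    for line in lines:
        cols = line.rstrip("\n").split("\t")
        if len(cols) >= 2:
            titles[int(cols[0], 16)] = cols[1]  # hex string -> decimal int
    return titles


def pagerank(edges, damping=0.85, iterations=50):
    """Power-iteration PageRank; rank held by dangling nodes is spread uniformly."""
    nodes = sorted({n for edge in edges for n in edge})
    out_links = {n: [] for n in nodes}
    for src, dst in edges:
        out_links[src].append(dst)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        dangling = sum(rank[n] for n in nodes if not out_links[n])
        base = (1.0 - damping + damping * dangling) / len(nodes)
        new_rank = {n: base for n in nodes}
        for src in nodes:
            if out_links[src]:
                share = damping * rank[src] / len(out_links[src])
                for dst in out_links[src]:
                    new_rank[dst] += share
        rank = new_rank
    return rank


def top_k(rank, k=5):
    """Return the k (node, score) pairs with the highest scores."""
    return sorted(rank.items(), key=lambda item: item[1], reverse=True)[:k]


# Made-up sample: papers 10, 11 and 12 all cite paper 31 (hex 1F).
SAMPLE_EDGES = ["10 31", "11 31", "12 31", "10 11"]
SAMPLE_META = ["1F\tCited Paper\tcited paper", "A\tPaper A\tpaper a",
               "B\tPaper B\tpaper b", "C\tPaper C\tpaper c"]

if __name__ == "__main__":
    titles = load_titles(SAMPLE_META)
    for node, score in top_k(pagerank(parse_edges(SAMPLE_EDGES))):
        print(node, titles.get(node, "?"), round(score, 4))
```

For the 528M-edge Microsoft Academic graph, in-memory dictionaries like these are unlikely to be enough; the lines would need to be streamed from the gz file and the ranks held in compact arrays.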
LSI (Latent Semantic Indexing) and SVD (Singular Value Decomposition) [slides] SVD example Chapter 18 [IIR] IIR web page
Graph embedding [nbviewer] Reading: DeepWalk paper, node2vec.
Near duplicate slides [PDF].

Resources

Latex example. You can find the sample document and the tex and cls files here.
Reuters corpus.
Here is a zip file of another Reuters data set (around 800,000 files, 1GB). One sample file is 100000newsML.xml. Note that it is an XML file, and you need to extract and index the relevant elements.
TREC data collections
Common Crawl data. It contains a very large collection of crawled web pages.

Text Book

[IIR] Introduction to Information Retrieval, by C. Manning, P. Raghavan, and H. Schütze. Cambridge University Press, 2008. book website

Other reference books:

[SLP] Speech and Language Processing (3rd ed. draft) Dan Jurafsky and James H. Martin, 2023
[SE] Search Engines: Information Retrieval in Practice, by Bruce Croft, Donald Metzler and Trevor Strohman.
[MIR] Modern Information Retrieval, by R. Baeza-Yates and B. Ribeiro-Neto.
[LA] Lucene in Action, Michael McCandless, Erik Hatcher, and Otis Gospodnetić. 2010.
[SA] Solr in Action, Trey Grainger and Timothy Potter. sample chapter one, sample chapter three.
[MMD] Anand Rajaraman and Jeff Ullman, Mining of Massive Datasets, 2014.

Comp-8380: Information Retrieval Systems (2024)
