Instructor: Professor Jianguo Lu.
Email: jlu at u windsor. Office: LT 5111. Office hours: Tuesday and Thursday 10:00-11:00.
- Lecture time and place: Monday and Wednesday 5:30-6:50. MH 105.
- Course outline (pdf).
Course introduction Slides (pdf)
- Boolean retrieval model Slides (pdf)
- Text processing basics (slides).
- Indexing and search documents Lucene (Slides)
- Text transformation slides,
IIR chapter 2
- Error tolerance search slides,
IIR chapter 3
- Vector space model. [slides]
- Link analysis. slides (pptx file).
IIR chapter 21
- Text statistics. slides (pdf file). Language models slides (pptx file).
- LSI(Latent Semantic Indexing) and SVD(Singular Value Decomposition) [slides]
IIR web page
- Word embedding
- Near duplicate slides (pptx).
- Crawling the web.
crawling lecture slides.
IIR chapter 20
tutorial on crawling tools.
Naive Bayes slides (pdf), feature selection slides (pdf).
- Evaluation. slides.
[IIR chapter 16];
IIR chapter 17
- Week 12: Project presentation
You are required to submit a project report for each stage. The final project report
should not exceed four pages and should use the ACM sig-alternate.cls formatting style. You can find the sample document and the tex and cls files here.
In the report, describe the details of your project, including your data, experiments, results, and analysis. Include the link to your web site, which serves as a demo of your search engine and contains more details of your project. Consider reporting on the following aspects. Note that you are not REQUIRED to cover all of these subtopics; you can select and focus on a few of them, and report on the topics that you have done.
- Indexing and searching: How do you index your data? Did you change any default settings? Why did you apply those changes? Did they improve the search experience?
- Data: Which data do you use? For Citeseer data, do you use the metadata to improve your results? Do you use the citation network? Do you use other data?
- Relevance ranking: Which relevance function do you use (tf-idf or its variants)? Have you changed any relevance functions? Which relevance function works better?
- PageRank: How do you implement the PageRank algorithm? Any scalability issues? Any interesting results?
- Classification: Naive Bayes? How do you normalize the text? Have you selected features, using mutual information or chi-square? What is the impact of the number of features on F1? Report precision, recall, and F1, and plot F1 as a function of the number of features.
- Clustering: Do you cluster the search results? Which algorithm(s) do you use? How do you evaluate the clusters, if you do?
- SVD: How do you run SVD? How well does it scale?
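For the classification item above, precision, recall, and F1 for one class can be computed along the following lines. This is only a minimal sketch: the function name `precision_recall_f1` and the example labels are hypothetical, and it assumes you have parallel lists of gold and predicted labels for your test documents.

```python
def precision_recall_f1(gold, predicted, positive):
    """Return (precision, recall, F1) for the class `positive`.

    `gold` and `predicted` are equal-length lists of labels
    (hypothetical inputs; adapt to your own classifier output).
    """
    pairs = list(zip(gold, predicted))
    # True positives: predicted positive and actually positive.
    tp = sum(1 for g, p in pairs if p == positive and g == positive)
    # False positives: predicted positive but actually another class.
    fp = sum(1 for g, p in pairs if p == positive and g != positive)
    # False negatives: actually positive but predicted another class.
    fn = sum(1 for g, p in pairs if p != positive and g == positive)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# Made-up labels for illustration only.
gold = ["db", "db", "ir", "ir", "db"]
predicted = ["db", "ir", "ir", "db", "db"]
p, r, f1 = precision_recall_f1(gold, predicted, positive="db")
```

To plot F1 as a function of the number of features, you could repeat this computation for each feature-set size and collect the resulting F1 values.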
- [IIR] Introduction to Information Retrieval, by C. Manning, P. Raghavan, and H. Schütze. Cambridge University Press, 2008. book website
Other reference books:
- [SE] Search Engines: Information Retrieval in Practice, by Bruce Croft, Donald Metzler and Trevor Strohman.
- [MIR] Modern Information Retrieval, by R. Baeza-Yates and B. Ribeiro-Neto.
- [LA] Lucene in Action, by Michael McCandless, Erik Hatcher, and Otis Gospodnetić. 2010.
- [SA] Solr in Action, by Trey Grainger and Timothy Potter. sample chapter one, sample chapter three.
- [MMD] Mining of Massive Datasets, by Anand Rajaraman and Jeff Ullman. 2013.