Browsed by
Month: December 2021

Data Structures and Information Retrieval in Python

Data Structures and Information Retrieval in Python

I am happy to announce the first release of a new book, Data Structures and Information Retrieval in Python, which is an introduction to data structures organized around a motivating example: building a search engine.

The elements of the search engine are the Crawler, which downloads web pages and follows links to other pages, the Indexer, which builds a map from each word to the pages where it appears, and the Retriever, which looks up search terms and finds pages that maximize relevance and quality.

The index is stored in Redis, which is a NoSQL database that provides data structures like sets, lists, and hashmaps. The book presents each data structure first in Python, then in Redis, which should help readers see which features are essential and which are implementation details.

The book is loosely based on Think Data Structures, which uses Java. When I started, I thought it would be a translation, but the changes were so pervasive, I ended up writing it from scratch.

As I did with Think Bayes, I wrote this book entirely in Jupyter notebooks, and used JupyterBook to translate them to HTML. The notebooks run on Colab, which is a service provided by Google that runs notebooks in a browser. So you can read the book, run the code, and work on exercises without installing anything.

Why?

I have mixed feelings about data structures. In the computer science curriculum, it is often used as a weed-out class, and in the technical interview process, it is used as a gatekeeper. In both cases, it plays a disproportionate role in determining who studies computer science and who gets jobs in the field.

And it tends to be a bad class. The material is presented theoretically, without context or application, organized in a way that seems designed to maximize tedium, and often taught badly. The goals of the course are ambiguous — is it about theoretical foundations or programming proficiency? — and the result is often a hybrid that does nothing well.

Also, it has not kept up with changes in the software landscape. When people were learning to program in C, it might have been appropriate to focus on the implementation of data structures. But for people learning Python, it’s probably better to learn how to use them first, and see how they are implemented when (and if) necessary.

Despite my misgivings about the material, in 2016 I found myself designing a data structures curriculum for the Flatiron School, on behalf of a large software company. In 2017 I updated the curriculum and published Think Data Structures. But the software curriculum at Olin uses more Python than Java, so I have not used the Java version in a class.

Since I was scheduled to teach Data Structures this fall, I decided to write a Python version. I started last August with an outline and about 10 notebooks in various stages of completion. I developed the rest of the material during the semester, staying about a week ahead of the students.

Now there are 29 notebooks including 22 that present the material and 7 quizzes that include practice questions. I hope people will find it useful!