WebCrawlerX - A Web Crawler, Indexer & Search Engine

Web App • Completed

Project Details

Overview

WebCrawlerX is a custom-built search engine that discovers, indexes, and searches web pages using hand-written data structures. It avoids external search libraries and implements the core mechanics of indexing and retrieval from the ground up.

The Problem

The goal was to build a working web discovery and indexing system that traverses linked pages efficiently and returns ranked search results, without relying on production-grade search platforms such as Elasticsearch or Solr.

Approach

—BFS Traversal: Implemented a Breadth-First Search crawler with a configurable depth limit, so that page discovery stays bounded.
—Trie Data Structure: Built a custom Trie (prefix tree) for prefix-based keyword matching and autocomplete.
—Inverted Indexing: Created a hand-built inverted index that maps each keyword directly to the URLs containing it, so a lookup does not need to rescan page text.
—Ranking: Developed a frequency-based ranking algorithm that uses Merge Sort to order search results by relevance.
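The depth-limited BFS traversal from the first point can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function name `bfs_crawl`, the `get_links` callable, and the toy link graph are all assumptions made for the example, and no real HTTP requests are issued.

```python
from collections import deque


def bfs_crawl(start_url, get_links, max_depth=2):
    """Visit pages breadth-first, following links at most max_depth hops.

    get_links is a hypothetical callable (url -> iterable of outgoing URLs)
    standing in for the fetch-and-parse step.
    Returns the visit order and an adjacency list of the links seen.
    """
    visited = {start_url}             # URLs already queued, to avoid revisits
    order = []                        # pages in the order they were processed
    graph = {}                        # adjacency list: url -> outgoing links
    queue = deque([(start_url, 0)])   # FIFO queue of (url, depth)

    while queue:
        url, depth = queue.popleft()
        order.append(url)
        links = list(get_links(url))
        graph[url] = links
        if depth == max_depth:
            continue                  # depth limit reached: do not follow links
        for link in links:
            if link not in visited:
                visited.add(link)
                queue.append((link, depth + 1))
    return order, graph


# Toy link graph used in place of the real web (illustrative data only).
toy_web = {
    "a": ["b", "c"],
    "b": ["d", "a"],
    "c": ["d"],
    "d": ["e"],
    "e": [],
}

order, graph = bfs_crawl("a", lambda url: toy_web.get(url, []), max_depth=2)
print(order)  # ['a', 'b', 'c', 'd']
```

In a real crawler, `get_links` would presumably wrap an HTTP fetch with Requests and link extraction with BeautifulSoup; marking URLs as visited when they are queued, rather than when they are processed, keeps the same page from being queued twice.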

Technical Stack

Layer	Technology
Backend	Django 5.x, Python 3.12
Scraping	BeautifulSoup4, Requests
Database	SQLite (Django ORM)
Structures	Custom Trie, Inverted Index, Adjacency List
Frontend	HTML5, CSS3, Vanilla JavaScript
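As an illustration of the custom Trie listed under Structures, here is a minimal sketch of a prefix tree with autocomplete. The class and method names (`Trie`, `insert`, `contains`, `autocomplete`) and the `limit` parameter are assumptions made for this example and may differ from the project's implementation.

```python
class TrieNode:
    def __init__(self):
        self.children = {}    # character -> TrieNode
        self.is_word = False  # True if a stored word ends at this node


class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, word):
        """Add a word, creating nodes along its path as needed."""
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True

    def _find(self, prefix):
        """Return the node reached by following prefix, or None."""
        node = self.root
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return None
        return node

    def contains(self, word):
        """Exact-match lookup."""
        node = self._find(word)
        return node is not None and node.is_word

    def autocomplete(self, prefix, limit=10):
        """Return up to `limit` stored words starting with prefix, alphabetically."""
        start = self._find(prefix)
        results = []
        if start is None:
            return results
        stack = [(start, prefix)]  # iterative depth-first walk below the prefix
        while stack and len(results) < limit:
            node, word = stack.pop()
            if node.is_word:
                results.append(word)
            # Push children in reverse order so they are popped alphabetically.
            for ch in sorted(node.children, reverse=True):
                stack.append((node.children[ch], word + ch))
        return results


trie = Trie()
for keyword in ["crawl", "crawler", "cram", "index"]:
    trie.insert(keyword)

print(trie.autocomplete("cra"))  # ['cram', 'crawl', 'crawler']
```

Reaching the prefix node takes time proportional to the prefix length, regardless of how many keywords are stored; the cost of `autocomplete` beyond that depends on how much of the subtree is walked before `limit` is reached.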

What I Learned

—Data Structures: Custom Trie implementation, Graph (adjacency list), and Queue for BFS processing
—Algorithms: Breadth-First Search (BFS), Merge Sort, and manual Inverted Index creation
—Web Scraping: Deep parsing of HTML with BeautifulSoup and robust HTTP request handling
—Full-Stack: Bridging complex algorithmic logic with a Django-based web interface
—Database Modeling: Efficient relational mapping for crawled pages and their indexed metadata
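To make the inverted index and Merge Sort ranking concrete, here is a small in-memory sketch. It is an assumption-laden illustration rather than the project's code: the function names, the regex tokenizer, the example URLs, and the choice to score a page by summing term frequencies across query words are all hypothetical.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages):
    """Build an inverted index: word -> {url: occurrences of word on that page}."""
    index = defaultdict(dict)
    for url, text in pages.items():
        for word in tokenize(text):
            index[word][url] = index[word].get(url, 0) + 1
    return index


def merge_sort(items, key):
    """Return items sorted in descending order of key(item); ties keep input order."""
    if len(items) <= 1:
        return list(items)
    mid = len(items) // 2
    left = merge_sort(items[:mid], key)
    right = merge_sort(items[mid:], key)
    merged, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if key(left[i]) >= key(right[j]):  # >= keeps the sort stable
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged


def search(index, query):
    """Score each URL by total frequency of the query words, highest first."""
    scores = {}
    for word in tokenize(query):
        for url, count in index.get(word, {}).items():
            scores[url] = scores.get(url, 0) + count
    return merge_sort(list(scores.items()), key=lambda pair: pair[1])


# Illustrative pages only; a real run would index crawled page text.
pages = {
    "https://example.com/a": "Python crawler crawler",
    "https://example.com/b": "crawler index",
    "https://example.com/c": "django",
}
index = build_index(pages)
print(search(index, "crawler"))
# [('https://example.com/a', 2), ('https://example.com/b', 1)]
```

Merge Sort runs in O(n log n) time and is stable, so pages with equal scores keep the order in which they were first matched.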
