# MyAnimeList Web Scraper
A personal, unguided web scraping project to extract anime and manga data.
Python · Pandas · Beautiful Soup · Requests · Web Scraping
## Project Purpose
This project demonstrates how to extract, structure, and analyze web data at scale. By scraping MyAnimeList (a leading anime and manga database), it transforms unstructured website content into actionable datasets, showcasing skills in automation, problem-solving, and data pipeline creation.
## Why MyAnimeList?
- Rich Data Source: MyAnimeList is the largest anime database globally, with:
  - 20,000+ anime and manga entries
  - User ratings, rankings, genres, and metadata
  - Seasonal popularity trends
- Real-World Relevance: This data powers:
  - Market analysis for streaming platforms (e.g., Crunchyroll, Netflix)
  - Recommendation engine training
  - Content licensing decisions
## Core Skills Highlighted
### Technical
- Web scraping (Beautiful Soup, Requests)
- Data pipeline design (raw → CSV transformation)
- Modern Python practices (Pathlib, type hints)
- Developer workflow optimization (UV, Ruff)
### Strategic
- Balancing speed vs. website respect (ethics-first scraping)
- Prioritizing maintainability over quick fixes
- Translating business needs into technical specs
## Business Value
Scraped data enables:
| Use Case | Impact Example |
|---|---|
| Competitor Analysis | Identify trending genres for content acquisition |
| Audience Insights | Analyze score vs. popularity correlations |
| Catalog Enrichment | Auto-generate metadata for media libraries |
| Price Benchmarking | Compare premium vs. free-tier anime performance |
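As an illustration of the Audience Insights and Competitor Analysis rows above, a few lines of Pandas can probe the scraped CSVs. This is a hypothetical sketch: the file path and column names (`score`, `members`, `genres`) are assumptions about the exported schema, not guaranteed by the project.

```python
import pandas as pd

# Path and column names are assumed; adjust to match the scraper's CSV schema.
df = pd.read_csv("data/scraped/anime.csv")

# Score vs. popularity correlation (the "Audience Insights" use case).
print(df[["score", "members"]].corr())

# Mean score per genre, assuming a comma-separated "genres" column
# (a rough proxy for the "Competitor Analysis" use case).
per_genre = (
    df.assign(genre=df["genres"].str.split(", "))
    .explode("genre")
    .groupby("genre")["score"]
    .mean()
    .sort_values(ascending=False)
)
print(per_genre.head(10))
```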
## Key Challenges & Solutions
| Challenge | Solution | Skill Demonstrated |
|---|---|---|
| Anti-Scraping Measures | Randomized delays + browser-like headers | Reverse-engineering web protections |
| Data Consistency | Robust HTML parsing with fallback logic | Handling unstructured data |
| Scalability | Modular code design (scraper/storage split) | System architecture planning |
| Maintenance | Type hints + modern Python practices | Future-proofing codebases |
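For the Data Consistency row, "fallback logic" can be as simple as trying progressively looser selectors before giving up. The sketch below is illustrative only; the selectors are guesses, not the project's actual ones.

```python
from bs4 import BeautifulSoup


def parse_title(html: str) -> str | None:
    """Try several selectors in order and return the first match found."""
    soup = BeautifulSoup(html, "html.parser")
    for selector in ("h1.title-name", "h1 span[itemprop=name]", "h1"):
        node = soup.select_one(selector)
        if node and node.get_text(strip=True):
            return node.get_text(strip=True)
    return None  # let the caller decide how to handle a missing title
```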
## How It Works
### Core Tools & Libraries
#### Development Essentials
- UV: Blazing-fast Python package installer and resolver (alternative to `pip`/`pip-tools`)
- Ruff: Rust-powered linter for immediate feedback and code consistency
- Pathlib: Modern object-oriented file system paths (replaces legacy `os.path`)
- Rich: Beautiful terminal formatting for progress tracking and debugging (see the sketch below)
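A minimal example of the kind of progress tracking Rich enables; the URL list and `time.sleep` call are placeholders for the real scraping loop, not the project's actual code.

```python
import time

from rich.progress import track

urls = [f"https://myanimelist.net/anime/{i}" for i in range(1, 6)]  # placeholder IDs

for url in track(urls, description="Scraping entries..."):
    time.sleep(0.2)  # stands in for the real request + parse step
```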
#### Web Scraping
- Beautiful Soup 4: HTML/XML parsing library for extracting structured data
- Requests: HTTP client with session persistence and header customization
- Pandas: Data manipulation library for CSV export functionality
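A compressed sketch of how these three libraries fit together in a fetch → parse → export pass. The URL, selector, and output path are illustrative assumptions, not the project's exact implementation.

```python
from pathlib import Path

import pandas as pd
import requests
from bs4 import BeautifulSoup

URL = "https://myanimelist.net/topanime.php"  # example page

response = requests.get(URL, headers={"User-Agent": "Mozilla/5.0"}, timeout=30)
response.raise_for_status()

soup = BeautifulSoup(response.text, "html.parser")
titles = [
    {"title": node.get_text(strip=True)}
    for node in soup.select("h3.anime_ranking_h3")  # selector is a guess
]

out_dir = Path("data/scraped")
out_dir.mkdir(parents=True, exist_ok=True)
pd.DataFrame(titles).to_csv(out_dir / "top_anime.csv", index=False)
```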
#### Modern Python Features
- Type hints (`list[dict]`, `str | Path`)
- Structural pattern matching (planned for future enhancements)
- F-strings and context managers
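As a small illustration of these features in one place, here is a hypothetical helper (not part of the project's actual API) that uses `str | Path`, `list[dict]`, a context manager, and an f-string:

```python
from pathlib import Path


def load_snapshots(raw_dir: str | Path) -> list[dict]:
    """Read saved HTML snapshots from a directory into memory."""
    snapshots: list[dict] = []
    for path in Path(raw_dir).glob("*.html"):
        with path.open(encoding="utf-8") as f:  # context manager
            snapshots.append({"name": path.stem, "html": f.read()})
    print(f"Loaded {len(snapshots)} snapshots")  # f-string
    return snapshots
```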
### Key Technical Choices
#### Developer Experience
- UV Package Manager: replaced traditional `pip` with Astral's UV for 10-100x faster dependency resolution:

  ```bash
  uv pip install -r requirements.txt
  ```

- Ruff Linter: configured with strict rules for:
  - Auto-fixable violations (PEP 8)
  - Potential bug detection
  - Type-checking awareness
#### File Handling
```python
from pathlib import Path  # Modern path handling

self.raw_path = Path(base_path) / "data" / "raw"  # Type-safe concatenation
```
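Fleshed out slightly, the same Pathlib approach could back a small storage helper. This is a hypothetical sketch, not the project's actual `storage.py`:

```python
from pathlib import Path


class Storage:
    def __init__(self, base_path: str | Path) -> None:
        self.raw_path = Path(base_path) / "data" / "raw"
        self.raw_path.mkdir(parents=True, exist_ok=True)  # create the tree if missing

    def save_snapshot(self, name: str, html: str) -> Path:
        """Write one raw HTML page to data/raw/<name>.html."""
        target = self.raw_path / f"{name}.html"
        target.write_text(html, encoding="utf-8")
        return target
```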
#### Anti-Detection Measures
```python
# Browser-like headers with randomized delays
self.session.headers.update({...})
delay = random.uniform(self.min_delay, self.max_delay)
```
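A standalone sketch of the same idea using the Requests API; the header values and delay bounds are illustrative, not the project's exact configuration.

```python
import random
import time

import requests

session = requests.Session()
session.headers.update({
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",  # example value only
    "Accept-Language": "en-US,en;q=0.9",
})


def polite_get(url: str, min_delay: float = 1.0, max_delay: float = 3.0) -> requests.Response:
    """Fetch a page after a randomized pause so requests don't hammer the site."""
    time.sleep(random.uniform(min_delay, max_delay))
    response = session.get(url, timeout=30)
    response.raise_for_status()
    return response
```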
### Project Structure (Simplified)
```text
.
├── data/
│   ├── raw/              # HTML snapshots
│   └── scraped/          # Processed CSV files
├── src/
│   ├── scrap.py          # Main workflow
│   ├── scraper.py        # Core scraping logic
│   ├── models.py         # Data classes for scraped items
│   └── storage.py        # File I/O operations
├── pyproject.toml        # Modern config format
└── requirements.txt      # UV-compatible dependencies
```
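To illustrate the scraper/storage split noted in the challenges table, a hypothetical `models.py` dataclass plus a CSV-backed storage class might look like this (names and fields are assumptions, not the actual modules):

```python
from dataclasses import asdict, dataclass
from pathlib import Path

import pandas as pd


@dataclass
class AnimeEntry:
    title: str
    score: float | None = None
    rank: int | None = None


class CsvStorage:
    def __init__(self, scraped_dir: str | Path = "data/scraped") -> None:
        self.scraped_dir = Path(scraped_dir)
        self.scraped_dir.mkdir(parents=True, exist_ok=True)

    def save(self, entries: list[AnimeEntry], filename: str) -> Path:
        """Flatten dataclass instances into a DataFrame and export as CSV."""
        target = self.scraped_dir / filename
        pd.DataFrame([asdict(e) for e in entries]).to_csv(target, index=False)
        return target
```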
## Future Technical Enhancements
- Implement Playwright for JavaScript-rendered content
- Add Pydantic models for data validation
- Introduce an async HTTP client (HTTPX)
- Pre-commit hooks with Ruff and Mypy
- Docker containerization for reproducibility
## Getting Started

```bash
git clone https://github.com/kkyouma/myanimelist_scraper.git
cd myanimelist_scraper
```