NLP Portfolio: Web Mining and Applied NLP

Author: Alissa Beaderstadt
Course: Web Mining and Applied NLP
Date: 04/20/2026


Overview

This portfolio brings together the NLP work I’ve built throughout this course, focusing on turning messy text into something structured and usable.

Across these projects, I worked with web pages, APIs, and domain-specific datasets, building EVTL (Extract, Validate, Transform, Load) pipelines to extract, clean, and analyze text data. I also incorporated more advanced NLP features such as keyword extraction and domain-specific scoring, especially for biomedical and toxicology-related text.


Key Projects


1. NLP Techniques Implemented

Across these projects, I used a mix of NLP techniques to process and analyze text data:

  • Tokenization
    • Converted raw text into structured tokens across web, API, and corpus-based datasets.
    • This was one of the first steps in every project and helped make the text easier to work with downstream.
    • Evidence:

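A minimal sketch of the kind of tokenization used across these projects (the function name and the hyphen-splitting rule are illustrative, not the exact project code):

```python
import re

def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into word tokens.

    Hyphenated terms are split apart first, mirroring the
    hyphen-splitting step mentioned under text cleaning.
    """
    text = text.lower().replace("-", " ")
    # \w+ keeps alphanumeric runs and drops punctuation
    return re.findall(r"\w+", text)

tokens = tokenize("Toxicity prediction uses machine-learning models.")
```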
  • Frequency Analysis
    • Computed word and phrase frequencies to identify dominant patterns and themes in text.
    • I used this as a quick way to check whether my cleaning steps were actually improving the data.
    • Evidence:

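A simple frequency count like this (sketched with the standard library's `Counter`; the sample tokens are invented) is enough to spot whether noise terms still dominate after cleaning:

```python
from collections import Counter

def top_terms(tokens: list[str], n: int = 5) -> list[tuple[str, int]]:
    """Return the n most common tokens with their counts."""
    return Counter(tokens).most_common(n)

sample = ["dose", "exposure", "dose", "limit", "dose", "exposure"]
ranked = top_terms(sample, 2)  # dominant terms first
```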
  • Text Cleaning and Normalization
    • Applied lowercasing, punctuation removal, stopword filtering, and custom cleaning (e.g., hyphen splitting, HTML noise removal, API artifact cleanup).
    • Cleaning made a huge difference when working with HTML, where navigation text can dominate results if not removed.
    • Evidence:
      • Preprocessing pipelines and transformation stages
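A condensed sketch of this cleaning step, assuming small illustrative stopword and navigation-noise lists (the real lists were larger and domain-tuned):

```python
import re

STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is"}
# terms that leak in from page chrome rather than article content
NAV_NOISE = {"main", "menu", "sidebar", "skip", "content"}

def clean_tokens(raw_text: str) -> list[str]:
    """Lowercase, split hyphens, strip punctuation, and drop
    stopwords plus navigation-related noise terms."""
    text = raw_text.lower().replace("-", " ")
    tokens = re.findall(r"[a-z]+", text)
    return [t for t in tokens if t not in STOPWORDS and t not in NAV_NOISE]

cleaned = clean_tokens("Main menu: The dose-response curve")
```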
  • Feature Engineering
    • Designed and implemented derived features including:
      • token length and average word length
      • word count and content length categories
      • metadata flags (e.g., has_author)
      • domain-specific scores (e.g., bio_tox_relevance_score)
    • I especially focused on creating features that made the data more interpretable, not just more complex.
    • Evidence:

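The derived features above can be sketched as a single enrichment function; the record shape, the 100-word cutoff, and the function name are illustrative assumptions:

```python
def derive_features(record: dict) -> dict:
    """Attach derived features (counts, flags, categories) to one record."""
    text = record.get("text", "")
    words = text.split()
    word_count = len(words)
    avg_word_len = sum(len(w) for w in words) / word_count if word_count else 0.0
    return {
        **record,
        "word_count": word_count,
        "avg_word_length": round(avg_word_len, 2),
        "has_author": record.get("author") is not None,  # metadata flag
        "length_category": "short" if word_count < 100 else "long",
    }

article = {"text": "Hepatotoxicity screening results", "author": None}
features = derive_features(article)
```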
  • Co-occurrence and Context Analysis
    • Analyzed relationships between words using context windows and bigram modeling.
    • Increasing the context window helped reveal more meaningful relationships between terms.
    • Evidence:

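A minimal sketch of the sliding-window co-occurrence idea (window size and token list are invented for illustration; `window=1` reduces to adjacent bigram pairs):

```python
from collections import Counter

def cooccurrence(tokens: list[str], window: int = 2) -> Counter:
    """Count unordered term pairs appearing within `window` tokens
    of each other."""
    pairs = Counter()
    for i, tok in enumerate(tokens):
        for other in tokens[i + 1 : i + 1 + window]:
            # sort the pair so (a, b) and (b, a) count together
            pairs[tuple(sorted((tok, other)))] += 1
    return pairs

toks = ["toxicity", "prediction", "model", "toxicity", "model"]
counts = cooccurrence(toks, window=2)
```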
  • Keyword Extraction
    • Extracted meaningful terms from structured text after stopword filtering.
    • Removing stopwords here made the output much more representative of the actual content.
    • Evidence:
      • top_keywords output fields
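A sketch of how a `top_keywords` field could be produced by frequency ranking after stopword filtering (the stopword list and sample abstract are invented):

```python
from collections import Counter

STOPWORDS = {"the", "of", "and", "in", "for", "a", "to", "with", "is", "are"}

def top_keywords(text: str, n: int = 3) -> list[str]:
    """Return the n most frequent non-stopword terms in a document."""
    tokens = [t for t in text.lower().split()
              if t.isalpha() and t not in STOPWORDS]
    return [term for term, _ in Counter(tokens).most_common(n)]

abstract = ("deep learning models for toxicity prediction and toxicity "
            "screening in drug discovery")
keywords = top_keywords(abstract)
```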
  • Web Scraping and API Processing
    • Extracted text from HTML pages using BeautifulSoup and processed structured JSON from APIs.
    • This required handling very different data formats, and reinforced why the pipeline design was so important.
    • Evidence:

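For the API side, the nested-JSON handling can be sketched with the standard library alone; the payload below imitates a NewsAPI-style response but is invented, and `flatten` is an illustrative helper, not the project's actual function:

```python
import json

raw = json.loads("""
{"status": "ok",
 "articles": [
   {"source": {"name": "Example News"}, "author": null,
    "title": "AI speeds toxicology screening"},
   {"source": {"name": "Example Wire"}, "author": "J. Doe",
    "title": "New exposure limits proposed"}
 ]}
""")

def flatten(article: dict) -> dict:
    """Pull nested fields up into one flat record, tolerating
    missing or null keys instead of raising."""
    return {
        "source": article.get("source", {}).get("name"),
        "author": article.get("author"),
        "title": article.get("title", ""),
    }

records = [flatten(a) for a in raw.get("articles", [])]
```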
  • Advanced NLP with spaCy
    • Used spaCy for tokenization, stopword removal, and linguistic preprocessing.
    • This made the pipeline more efficient compared to manual preprocessing.
    • Evidence:

2. Systems and Data Sources

I worked with multiple types of data sources:

  • Web Pages (HTML)
    • Wikipedia article (Shih Tzu breed)
    • arXiv abstracts (biomedical AI and toxicity prediction)

  • APIs (JSON)
    • NewsAPI (live news article data with nested JSON structure)

  • Text Datasets
    • Custom toxicology lab notes: text_data_beaderstadt.txt
    • Structured corpus + NIOSH dataset: corpus analysis notebook

Data Challenges:

  • Noisy and unstructured text
    • The raw HTML included navigation elements (e.g., "main", "menu", "sidebar") that introduced noise.
    • I handled this by removing non-content sections and filtering navigation-related terms during preprocessing.

  • General text cleaning and validation
    • Required punctuation removal, casing normalization, and stopword filtering.
    • I also validated inputs by checking for empty records, duplicates, and formatting issues before processing.

  • Handling different data structures
    • JSON data included nested fields that required careful parsing.
    • Some fields were inconsistent or missing (e.g., author), which required additional validation and feature engineering.

  • Domain-specific adjustments
    • Custom toxicology text required tailored stopword filtering.
    • I also created labeled categories (Exposure, Guidance, Definition, Regulatory) to support comparison.
    • Biomedical relevance required custom keyword weighting to better capture meaningful signals.

3. Pipeline Structure (EVTL)

My projects followed a structured EVTL pipeline, with an added Analyze stage between Transform and Load:

  • Extract
    • Retrieved data from HTML pages, APIs, and local text files.
    • Evidence:

  • Validate
    • Performed checks for structure, missing values, duplicates, and formatting issues.
    • For example, I compared raw vs. cleaned word counts to confirm preprocessing was working as expected.
    • Evidence:
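The raw-versus-cleaned word-count check can be sketched as a small report function; the function and field names are illustrative:

```python
def validate_cleaning(raw_text: str, cleaned_tokens: list[str]) -> dict:
    """Compare raw vs. cleaned word counts as a sanity check:
    cleaning should shrink the token count but never empty it."""
    raw_count = len(raw_text.split())
    cleaned_count = len(cleaned_tokens)
    return {
        "raw_count": raw_count,
        "cleaned_count": cleaned_count,
        "reduction": raw_count - cleaned_count,
        "ok": 0 < cleaned_count <= raw_count,
    }

report = validate_cleaning("The dose makes the poison",
                           ["dose", "makes", "poison"])
```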

  • Transform
    • Applied NLP preprocessing and feature engineering, including tokenization, cleaning, and derived feature creation.
    • This ended up being one of the most useful parts of the pipeline, especially once I started applying it to new datasets and refining both the cleaning steps and feature logic.
    • Evidence:

  • Analyze
    • Computed frequency distributions, co-occurrence patterns, and domain-specific signals.
    • Evidence:

  • Load
    • Saved outputs to CSV files and visualizations for downstream analysis.
    • Evidence:
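The stage structure above can be sketched as one function per stage, which is what makes the pipeline reusable: swapping the extract step (URL, API, local file) reuses everything downstream. The stage bodies here are placeholders, not the actual project code:

```python
def extract(source: str) -> str:
    # placeholder: the real pipeline fetches HTML, API JSON, or local files
    return source

def validate(text: str) -> str:
    if not text.strip():
        raise ValueError("empty input")
    return text

def transform(text: str) -> list[str]:
    # tokenize and keep only alphabetic tokens
    return [t for t in text.lower().split() if t.isalpha()]

def analyze(tokens: list[str]) -> dict:
    return {"word_count": len(tokens)}

def load(result: dict) -> str:
    # placeholder: the real pipeline writes CSVs and charts
    return ",".join(f"{k}={v}" for k, v in result.items())

output = load(analyze(transform(validate(extract("Toxicity data pipeline")))))
```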

4. Signals and Analysis Methods

I analyzed several key signals to understand frequency, structure, and meaning in the text.

Frequency and Distribution

  • Word Frequency: Identified dominant terms across datasets and compared token distributions using Polars.
  • Word Count: Measured article length to understand content depth.
  • Token Length Distribution: Used histograms to explore overall text structure.

Structure and Context

  • Bigrams: Captured common word pairings (e.g., “supervised learning”) to understand phrase-level patterns.
  • Co-occurrence: Used sliding context windows to analyze relationships between terms.
  • Category Comparison: Compared token distributions across labeled categories using bar charts.

Complexity and Metadata

  • Text Complexity: Estimated technical difficulty using average word length.
  • Metadata Completeness: Evaluated missing data using a has_author flag.
  • Content Categorization: Grouped articles by length to support comparison.

Domain-Specific Signals

  • Keyword Extraction: Identified key terms from abstracts after stopword filtering.
  • Formality (Sentiment Proxy): Estimated writing style using pronoun-based scoring.
  • Domain Scoring: Built a biomedical/toxicology relevance score using weighted keywords.
  • Domain Comparison: Compared machine learning vs. biomedical terminology to assess focus.
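The weighted-keyword domain score can be sketched as a dictionary lookup summed over tokens; the term list and weights below are invented stand-ins for the larger, curated list behind the real bio_tox_relevance_score:

```python
# illustrative weights only; the actual score used a curated term list
BIO_TOX_WEIGHTS = {
    "toxicity": 3, "hepatotoxicity": 3, "exposure": 2,
    "biomedical": 2, "dose": 1, "assay": 1,
}

def bio_tox_relevance_score(tokens: list[str]) -> int:
    """Sum the weight of every token found in the domain term list;
    unknown tokens contribute nothing."""
    return sum(BIO_TOX_WEIGHTS.get(t, 0) for t in tokens)

score = bio_tox_relevance_score(["toxicity", "prediction", "exposure", "dose"])
```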

Visual Outputs

  • Top 15 word frequency bar chart
  • Word cloud of key terms
  • Token length histogram

Evidence


5. Insights

My analysis revealed several meaningful insights:

Top Tokens Example

This visualization shows how quickly dominant terms emerge after cleaning, which helped confirm that the preprocessing steps were working as intended.

Data Quality and Cleaning

  • Removing noise (especially HTML navigation text and API artifacts) made a huge difference because those terms were dominating frequency results before cleaning.
  • Validation steps helped ensure the dataset was consistent and reliable before analysis.

Text Patterns and Structure

  • Frequent terms aligned with expected domain vocabulary (e.g., lab terms, breed traits).
  • Token distributions and category comparisons showed clear structural differences in the data.

Context and Meaning

  • Increasing the context window improved co-occurrence insights.
  • Bigram analysis helped capture more meaningful phrase-level patterns.

Feature Engineering and Signals

  • Word count and average word length were useful for estimating text complexity.
  • Domain-specific scoring (biomedical/toxicology) provided a more meaningful way to interpret relevance.
  • Evidence: Processed output with bio_tox_relevance_score

Pipeline and Reusability

  • Structuring the workflow as an EVTL pipeline made the process more modular and reusable.
  • In most cases, I was able to reuse the pipeline on new data by just changing the input source (e.g., URL).

6. Representative Work

This project demonstrates my ability to build a full EVTL pipeline using live API data. I extended the base pipeline by adding validation for real-world JSON structures and engineering new features to improve data quality and analytical usefulness.

### Project 5: Web Document NLP Pipeline (Biomedical AI)

GitHub repository: View Project 5

This project demonstrates my ability to build a reusable EVTL pipeline for extracting and analyzing structured web content. I extended the pipeline with domain-specific scoring and keyword extraction to evaluate biomedical relevance in academic abstracts.

### Project 6: End-to-End NLP Pipeline with spaCy

GitHub repository: View Project 6

This project demonstrates my ability to build an end-to-end NLP pipeline that integrates web scraping, text preprocessing, and linguistic analysis. I extended the pipeline with spaCy-based processing and additional analytical features to extract deeper insights from technical text.


7. Skills

Through these projects, I developed the following skills:

  • Building end-to-end NLP pipelines using EVTL architecture
  • Extracting and processing text from HTML (BeautifulSoup) and JSON APIs
  • Cleaning and normalizing messy, real-world text data
  • Engineering custom NLP features (word count, keyword extraction, domain-specific scoring)
  • Performing frequency, bigram, and co-occurrence analysis
  • Using spaCy for efficient text processing and linguistic analysis
  • Working with structured and unstructured data across multiple formats
  • Communicating analytical results using Markdown, charts, and visualizations

Final Notes

This portfolio reflects my growth from basic text preprocessing to building more structured, reusable NLP pipelines. Moving forward, I would like to expand this work by incorporating machine learning models for tasks like classification and prediction, especially in areas like biomedical text analysis.