NLP Portfolio: Web Mining and Applied NLP¶
Author: Alissa Beaderstadt
Course: Web Mining and Applied NLP
Date: 04/20/2026
Overview¶
This portfolio brings together the NLP work I’ve built throughout this course, focusing on turning messy text into something structured and usable.
Across these projects, I worked with web pages, APIs, and domain-specific datasets, building EVTL (Extract, Validate, Transform, Load) pipelines to extract, clean, and analyze text data. I also began incorporating more advanced NLP features, such as keyword extraction and domain-specific scoring, especially for biomedical and toxicology-related text.
Key Projects¶
1. NLP Techniques Implemented¶
Across these projects, I used a mix of NLP techniques to process and analyze text data:
- Tokenization
  - Converted raw text into structured tokens across web, API, and corpus-based datasets.
  - This was one of the first steps in every project and helped make the text easier to work with downstream.
  - Evidence:
    - Web scraping notebook: web_words_beaderstadt.ipynb
    - Text preprocessing notebook: text_preprocessing_beaderstadt.ipynb
    - Corpus exploration notebook: nlp_corpus_explore_beaderstadt.ipynb
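A minimal sketch of the kind of tokenization step used here, assuming a simple regex-based splitter (the notebooks' exact code may differ):

```python
import re

def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into word tokens, dropping punctuation."""
    return re.findall(r"[a-z']+", text.lower())

tokens = tokenize("Tokenization turns raw text into structured tokens.")
# tokens -> ['tokenization', 'turns', 'raw', 'text', 'into', 'structured', 'tokens']
```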
- Frequency Analysis
  - Computed word and phrase frequencies to identify dominant patterns and themes in text.
  - I used this as a quick way to check whether my cleaning steps were actually improving the data.
  - Evidence:
    - Frequency visualization (Project 6): top tokens chart
    - Bigram analysis output: top bigrams chart
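The frequency counts behind the top-token charts can be sketched with the standard library (illustrative tokens, not the project's data):

```python
from collections import Counter

# Toy token list standing in for a cleaned document
tokens = ["nlp", "pipeline", "nlp", "text", "pipeline", "nlp"]

freq = Counter(tokens)          # token -> count
top = freq.most_common(2)       # highest-frequency tokens first
# top -> [('nlp', 3), ('pipeline', 2)]
```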
- Text Cleaning and Normalization
  - Applied lowercasing, punctuation removal, stopword filtering, and custom cleaning (e.g., hyphen splitting, HTML noise removal, API artifact cleanup).
  - Cleaning made a huge difference when working with HTML, where navigation text can dominate results if not removed.
  - Evidence:
    - Preprocessing pipelines and transformation stages
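A condensed sketch of this cleaning stage; the stopword and navigation-noise sets here are small illustrative stand-ins for the project's real lists:

```python
import re

STOPWORDS = {"the", "a", "an", "and", "of", "to", "in"}
NAV_NOISE = {"main", "menu", "sidebar"}  # navigation terms seen in raw HTML

def clean(text: str) -> list[str]:
    """Lowercase, split hyphens, strip punctuation, and filter noise terms."""
    text = text.lower().replace("-", " ")     # hyphen splitting
    tokens = re.findall(r"[a-z]+", text)      # drop punctuation and digits
    return [t for t in tokens if t not in STOPWORDS | NAV_NOISE]

clean("Menu: the well-known toxicology of solvents")
# -> ['well', 'known', 'toxicology', 'solvents']
```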
- Feature Engineering
  - Designed and implemented derived features including:
    - token length and average word length
    - word count and content length categories
    - metadata flags (e.g., has_author)
    - domain-specific scores (e.g., bio_tox_relevance_score)
  - I especially focused on creating features that made the data more interpretable, not just more complex.
  - Evidence:
    - Transformation pipeline (web NLP): stage03_transform_beaderstadt.py
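The derived features above can be sketched as a small record-level function; the field names (`text`, `author`) and the 100-token length cutoff are illustrative assumptions, not the pipeline's exact values:

```python
def add_features(record: dict) -> dict:
    """Attach derived NLP features to a text record (hypothetical field names)."""
    tokens = record["text"].split()
    record["word_count"] = len(tokens)
    record["avg_word_len"] = round(sum(len(t) for t in tokens) / max(len(tokens), 1), 2)
    record["has_author"] = bool(record.get("author"))          # metadata flag
    record["length_category"] = "short" if len(tokens) < 100 else "long"
    return record

row = add_features({"text": "toxicology notes on solvent exposure", "author": None})
# row["word_count"] -> 5, row["has_author"] -> False, row["length_category"] -> 'short'
```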
- Co-occurrence and Context Analysis
  - Analyzed relationships between words using context windows and bigram modeling.
  - Increasing the context window helped reveal more meaningful relationships between terms.
  - Evidence:
    - Co-occurrence analysis and bigram outputs
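A minimal sliding-window co-occurrence counter in the spirit of this analysis (the real notebooks may differ in window handling and normalization):

```python
from collections import Counter

def cooccurrence(tokens: list[str], window: int = 2) -> Counter:
    """Count unordered word pairs that appear within a sliding context window."""
    pairs = Counter()
    for i, tok in enumerate(tokens):
        for other in tokens[i + 1 : i + 1 + window]:
            pairs[tuple(sorted((tok, other)))] += 1
    return pairs

toks = ["dose", "response", "curve", "dose", "response"]
cooccurrence(toks, window=2)[("dose", "response")]
# -> 3
```

Widening `window` admits more distant pairs, which matches the observation above that a larger context window surfaced more meaningful term relationships.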
- Keyword Extraction
  - Extracted meaningful terms from structured text after stopword filtering.
  - Removing stopwords here made the output much more representative of the actual content.
  - Evidence:
    - top_keywords output fields
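A simplified version of how a `top_keywords` field can be produced (the pipeline's real stopword list and ranking logic may differ):

```python
from collections import Counter

STOPWORDS = {"the", "of", "and", "in", "for", "a", "is"}

def top_keywords(text: str, k: int = 3) -> list[str]:
    """Return the k most frequent non-stopword tokens."""
    tokens = [t for t in text.lower().split() if t.isalpha() and t not in STOPWORDS]
    return [word for word, _ in Counter(tokens).most_common(k)]

top_keywords("toxicity prediction in the liver and toxicity screening")
# -> ['toxicity', 'prediction', 'liver']
```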
- Web Scraping and API Processing
  - Extracted text from HTML pages using BeautifulSoup and processed structured JSON from APIs.
  - This required handling very different data formats and reinforced why the pipeline design was so important.
  - Evidence:
    - Web pipeline: pipeline_web_html.py
    - API pipeline: pipeline_api_json.py
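A self-contained sketch of the HTML side of this step, assuming BeautifulSoup is installed; the inline HTML is a toy stand-in for a fetched page:

```python
from bs4 import BeautifulSoup

html = """
<html><body>
  <nav>Main menu sidebar</nav>
  <p>Shih Tzus are a small companion breed.</p>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")
for nav in soup.find_all("nav"):     # remove navigation noise before extraction
    nav.decompose()
text = soup.get_text(" ", strip=True)
# text -> 'Shih Tzus are a small companion breed.'
```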
- Advanced NLP with spaCy
  - Used spaCy for tokenization, stopword removal, and linguistic preprocessing.
  - This made the pipeline more efficient compared to manual preprocessing.
  - Evidence:
    - Project 6 pipeline (see Representative Work)
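A minimal example of spaCy-based stopword filtering, assuming only a blank English pipeline (no pretrained model download needed for tokenization and stopword flags):

```python
import spacy

# A blank English pipeline provides tokenization plus lexical attributes
# like is_stop and is_punct from the language defaults.
nlp = spacy.blank("en")

doc = nlp("The pipeline removes stopwords efficiently.")
content = [t.text for t in doc if not t.is_stop and not t.is_punct]
# content -> ['pipeline', 'removes', 'stopwords', 'efficiently']
```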
2. Systems and Data Sources¶
I worked with multiple types of data sources:
- Web Pages (HTML)
  - Wikipedia article (Shih Tzu breed)
  - arXiv abstracts (biomedical AI and toxicity prediction)
- APIs (JSON)
  - NewsAPI (live news article data with nested JSON structure)
- Text Datasets
  - Custom toxicology lab notes: text_data_beaderstadt.txt
  - Structured corpus + NIOSH dataset: corpus analysis notebook
Data Challenges:
- Noisy and unstructured text
  - The raw HTML included navigation elements (e.g., "main", "menu", "sidebar") that introduced noise.
  - I handled this by removing non-content sections and filtering navigation-related terms during preprocessing.
- General text cleaning and validation
  - Required punctuation removal, casing normalization, and stopword filtering.
  - I also validated inputs by checking for empty records, duplicates, and formatting issues before processing.
- Handling different data structures
  - JSON data included nested fields that required careful parsing.
  - Some fields were inconsistent or missing (e.g., author), which required additional validation and feature engineering.
- Domain-specific adjustments
  - Custom toxicology text required tailored stopword filtering.
  - I also created labeled categories (Exposure, Guidance, Definition, Regulatory) to support comparison.
  - Biomedical relevance required custom keyword weighting to better capture meaningful signals.
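The weighted-keyword idea behind bio_tox_relevance_score can be sketched as follows; the keyword set and weights here are hypothetical stand-ins for the ones tuned to the actual corpus:

```python
# Hypothetical weights; the real score used keywords chosen for the corpus.
BIO_TOX_WEIGHTS = {"toxicity": 3, "exposure": 2, "dose": 2, "biomedical": 1}

def bio_tox_relevance_score(tokens: list[str]) -> int:
    """Sum keyword weights; a higher score means stronger biomedical/toxicology relevance."""
    return sum(BIO_TOX_WEIGHTS.get(t, 0) for t in tokens)

bio_tox_relevance_score(["acute", "toxicity", "after", "exposure"])
# -> 5
```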
3. Pipeline Structure (EVTL)¶
My projects followed a structured EVTL pipeline (Extract, Validate, Transform, Load), with an analysis step between transforming and loading:
- Extract
  - Retrieved data from HTML pages, APIs, and local text files.
  - Evidence:
    - Extraction steps across notebooks
    - Extract stage script: stage01_extract.py
- Validate
  - Performed checks for structure, missing values, duplicates, and formatting issues.
  - For example, I compared raw vs. cleaned word counts to confirm preprocessing was working as expected.
  - Evidence:
    - Validation steps across notebooks
    - Validation script: stage02_validate_beaderstadt.py
- Transform
  - Applied NLP preprocessing and feature engineering, including tokenization, cleaning, and derived feature creation.
  - This ended up being one of the most useful parts of the pipeline, especially once I started applying it to new datasets and refining both the cleaning steps and feature logic.
  - Evidence:
    - Corpus notebook: nlp_corpus_explore_beaderstadt.ipynb
    - Transformation pipeline: stage03_transform_beaderstadt.py
- Analyze
  - Computed frequency distributions, co-occurrence patterns, and domain-specific signals.
  - Evidence:
    - Processed dataset: beaderstadt_processed.csv
- Load
  - Saved outputs to CSV files and visualizations for downstream analysis.
  - Evidence:
    - Processed outputs: processed data folder
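The stages above can be sketched as one composable skeleton. The function bodies here are stand-ins (the real stages live in separate scripts such as stage01_extract.py), but the flow is the same:

```python
def extract(source: str) -> list[dict]:
    """Stand-in for pulling raw records from HTML, an API, or a local file."""
    return [{"text": "dose response data"}, {"text": "   "}]

def validate(records: list[dict]) -> list[dict]:
    """Drop empty or whitespace-only records before transforming."""
    return [r for r in records if r.get("text", "").strip()]

def transform(records: list[dict]) -> list[dict]:
    """Tokenize each record and derive simple features."""
    for r in records:
        r["tokens"] = r["text"].lower().split()
        r["word_count"] = len(r["tokens"])
    return records

def load(records: list[dict]) -> list[dict]:
    """Stand-in for writing processed rows out to CSV."""
    return records

rows = load(transform(validate(extract("https://example.com"))))
# rows -> [{'text': 'dose response data', 'tokens': ['dose', 'response', 'data'], 'word_count': 3}]
```

Keeping each stage as its own function is what made it possible to reuse the pipeline on new data by changing only the input source.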
4. Signals and Analysis Methods¶
I analyzed several key signals to understand frequency, structure, and meaning in the text.
Frequency and Distribution¶
- Word Frequency
  - Identified dominant terms across datasets and compared token distributions using Polars.
- Word Count
  - Measured article length to understand content depth.
- Token Length Distribution
  - Used histograms to explore overall text structure.
Structure and Context¶
- Bigrams
  - Captured common word pairings (e.g., “supervised learning”) to understand phrase-level patterns.
- Co-occurrence
  - Used sliding context windows to analyze relationships between terms.
- Category Comparison
  - Compared token distributions across labeled categories using bar charts.
Complexity and Metadata¶
- Text Complexity
  - Estimated technical difficulty using average word length.
- Metadata Completeness
  - Evaluated missing data using a has_author flag.
- Content Categorization
  - Grouped articles by length to support comparison.
Domain-Specific Signals¶
- Keyword Extraction
  - Identified key terms from abstracts after stopword filtering.
- Formality (Sentiment Proxy)
  - Estimated writing style using pronoun-based scoring.
- Domain Scoring
  - Built a biomedical/toxicology relevance score using weighted keywords.
- Domain Comparison
  - Compared machine learning vs. biomedical terminology to assess focus.
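The pronoun-based formality proxy can be sketched like this; the pronoun set and the 0-to-1 scaling are illustrative assumptions rather than the project's exact formula:

```python
# Hypothetical pronoun list for the informality signal.
INFORMAL_PRONOUNS = {"i", "we", "you", "me", "my", "our"}

def formality_score(tokens: list[str]) -> float:
    """Return 1.0 for text with no personal pronouns; lower means more informal."""
    if not tokens:
        return 1.0
    informal = sum(t in INFORMAL_PRONOUNS for t in tokens)
    return round(1 - informal / len(tokens), 2)

formality_score(["we", "measured", "the", "exposure"])
# -> 0.75
```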
Visual Outputs¶
- Top 15 word frequency bar chart
- Word cloud of key terms
- Token length histogram
Evidence¶
- Projects 1-3 Notebooks
  - Web scraping + frequency analysis: web_words_beaderstadt.ipynb
  - Text preprocessing: text_preprocessing_beaderstadt.ipynb
  - Corpus exploration + co-occurrence: nlp_corpus_explore_beaderstadt.ipynb
- Processed Output (Project 4)
  - Cleaned dataset: beaderstadt_processed.csv
- Derived Features (Projects 5-6)
  - top_keywords, formality_score, bio_tox_relevance_score
  - Token frequency charts, bigrams, and additional spaCy-based features
5. Insights¶
My analysis revealed several meaningful insights:

The top word frequency chart shows how quickly dominant terms emerge after cleaning, which helped confirm that the preprocessing steps were working as intended.
Data Quality and Cleaning¶
- Removing noise (especially HTML navigation text and API artifacts) made a huge difference because those terms were dominating frequency results before cleaning.
- Validation steps helped ensure the dataset was consistent and reliable before analysis.
Text Patterns and Structure¶
- Frequent terms aligned with expected domain vocabulary (e.g., lab terms, breed traits).
- Token distributions and category comparisons showed clear structural differences in the data.
Context and Meaning¶
- Increasing the context window improved co-occurrence insights.
- Bigram analysis helped capture more meaningful phrase-level patterns.
Feature Engineering and Signals¶
- Word count and average word length were useful for estimating text complexity.
- Domain-specific scoring (biomedical/toxicology) provided a more meaningful way to interpret relevance.
- Evidence: Processed output with bio_tox_relevance_score
Pipeline and Reusability¶
- Structuring the workflow as an EVTL pipeline made the process more modular and reusable.
- In most cases, I was able to reuse the pipeline on new data by just changing the input source (e.g., URL).
6. Representative Work¶
Project 4: API-Based NLP Pipeline (News Data)¶
GitHub repository: View Project 4
This project demonstrates my ability to build a full EVTL pipeline using live API data. I extended the base pipeline by adding validation for real-world JSON structures and engineering new features to improve data quality and analytical usefulness.
Project 5: Web Document NLP Pipeline (Biomedical AI)¶
GitHub repository: View Project 5
This project demonstrates my ability to build a reusable EVTL pipeline for extracting and analyzing structured web content. I extended the pipeline with domain-specific scoring and keyword extraction to evaluate biomedical relevance in academic abstracts.
Project 6: End-to-End NLP Pipeline with spaCy¶
GitHub repository: View Project 6
This project demonstrates my ability to build an end-to-end NLP pipeline that integrates web scraping, text preprocessing, and linguistic analysis. I extended the pipeline with spaCy-based processing and additional analytical features to extract deeper insights from technical text.
7. Skills¶
Through these projects, I developed the following skills:
- Building end-to-end NLP pipelines using EVTL architecture
- Extracting and processing text from HTML (BeautifulSoup) and JSON APIs
- Cleaning and normalizing messy, real-world text data
- Engineering custom NLP features (word count, keyword extraction, domain-specific scoring)
- Performing frequency, bigram, and co-occurrence analysis
- Using spaCy for efficient text processing and linguistic analysis
- Working with structured and unstructured data across multiple formats
- Communicating analytical results using Markdown, charts, and visualizations
Final Notes¶
This portfolio reflects my growth from basic text preprocessing to building more structured, reusable NLP pipelines. Moving forward, I would like to expand this work by incorporating machine learning models for tasks like classification and prediction, especially in areas like biomedical text analysis.