NLP Portfolio: Web Mining and Applied NLP¶
Author: Alissa Beaderstadt
Course: Web Mining and Applied NLP
Date: 04/20/2026
Overview¶
This portfolio brings together the NLP work I’ve built throughout this course, focusing on turning messy text into something structured and usable.
Across these projects, I worked with web pages, APIs, and domain-specific datasets, building EVTL (Extract, Validate, Transform, Load) pipelines to extract, clean, and analyze text data. I also began incorporating more advanced NLP features, such as keyword extraction and domain-specific scoring, especially for biomedical and toxicology-related text.
Key Projects¶
1. NLP Techniques Implemented¶
Across these projects, I used a mix of NLP techniques to process and analyze text data:
- Tokenization
  - Converted raw text into structured tokens across web, API, and corpus-based datasets.
  - This was one of the first steps in every project and helped make the text easier to work with downstream.
  - Evidence:
    - Web scraping notebook: web_words_beaderstadt.ipynb
    - Text preprocessing notebook: text_preprocessing_beaderstadt.ipynb
    - Corpus exploration notebook: nlp_corpus_explore_beaderstadt.ipynb
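A minimal sketch of the kind of tokenization step used here, assuming a simple regex-based splitter (the notebooks' exact code may differ):

```python
import re

def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into word tokens, dropping punctuation."""
    return re.findall(r"[a-z']+", text.lower())

tokens = tokenize("Tokenization turns raw text into structured tokens.")
# tokens -> ['tokenization', 'turns', 'raw', 'text', 'into', 'structured', 'tokens']
```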
- Frequency Analysis
  - Computed word and phrase frequencies to identify dominant patterns and themes in text.
  - I used this as a quick way to check whether my cleaning steps were actually improving the data.
  - Evidence:
    - Frequency visualization (Project 6): top tokens chart
    - Bigram analysis output: top bigrams chart
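The frequency counts behind the top-token charts can be sketched with the standard library (illustrative tokens, not the project's data):

```python
from collections import Counter

# Toy token list standing in for a cleaned document
tokens = ["nlp", "pipeline", "nlp", "text", "pipeline", "nlp"]

freq = Counter(tokens)          # token -> count
top = freq.most_common(2)       # highest-frequency tokens first
# top -> [('nlp', 3), ('pipeline', 2)]
```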
- Text Cleaning and Normalization
  - Applied lowercasing, punctuation removal, stopword filtering, and custom cleaning (e.g., hyphen splitting, HTML noise removal, API artifact cleanup).
  - Cleaning made a huge difference when working with HTML, where navigation text can dominate results if not removed.
  - Evidence:
    - Preprocessing pipelines and transformation stages
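A condensed sketch of this cleaning stage; the stopword and navigation-noise sets here are small illustrative stand-ins for the project's real lists:

```python
import re

STOPWORDS = {"the", "a", "an", "and", "of", "to", "in"}
NAV_NOISE = {"main", "menu", "sidebar"}  # navigation terms seen in raw HTML

def clean(text: str) -> list[str]:
    """Lowercase, split hyphens, strip punctuation, and filter noise terms."""
    text = text.lower().replace("-", " ")     # hyphen splitting
    tokens = re.findall(r"[a-z]+", text)      # drop punctuation and digits
    return [t for t in tokens if t not in STOPWORDS | NAV_NOISE]

clean("Menu: the well-known toxicology of solvents")
# -> ['well', 'known', 'toxicology', 'solvents']
```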
- Feature Engineering
  - Designed and implemented derived features including:
    - token length and average word length
    - word count and content length categories
    - metadata flags (e.g., has_author)
    - domain-specific scores (e.g., bio_tox_relevance_score)
  - I especially focused on creating features that made the data more interpretable, not just more complex.
  - Evidence:
    - Transformation pipeline (web NLP): stage03_transform_beaderstadt.py
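The derived features above can be sketched as a small record-level function; the field names (`text`, `author`) and the 100-token length cutoff are illustrative assumptions, not the pipeline's exact values:

```python
def add_features(record: dict) -> dict:
    """Attach derived NLP features to a text record (hypothetical field names)."""
    tokens = record["text"].split()
    record["word_count"] = len(tokens)
    record["avg_word_len"] = round(sum(len(t) for t in tokens) / max(len(tokens), 1), 2)
    record["has_author"] = bool(record.get("author"))          # metadata flag
    record["length_category"] = "short" if len(tokens) < 100 else "long"
    return record

row = add_features({"text": "toxicology notes on solvent exposure", "author": None})
# row["word_count"] -> 5, row["has_author"] -> False, row["length_category"] -> 'short'
```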
- Co-occurrence and Context Analysis
  - Analyzed relationships between words using context windows and bigram modeling.
  - Increasing the context window helped reveal more meaningful relationships between terms.
  - Evidence:
    - Co-occurrence analysis and bigram outputs
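A minimal sliding-window co-occurrence counter in the spirit of this analysis (the real notebooks may differ in window handling and normalization):

```python
from collections import Counter

def cooccurrence(tokens: list[str], window: int = 2) -> Counter:
    """Count unordered word pairs that appear within a sliding context window."""
    pairs = Counter()
    for i, tok in enumerate(tokens):
        for other in tokens[i + 1 : i + 1 + window]:
            pairs[tuple(sorted((tok, other)))] += 1
    return pairs

toks = ["dose", "response", "curve", "dose", "response"]
cooccurrence(toks, window=2)[("dose", "response")]
# -> 3
```

Widening `window` admits more distant pairs, which matches the observation above that a larger context window surfaced more meaningful term relationships.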
- Keyword Extraction
  - Extracted meaningful terms from structured text after stopword filtering.
  - Removing stopwords here made the output much more representative of the actual content.
  - Evidence:
    - top_keywords output fields
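A simplified version of how a `top_keywords` field can be produced (the pipeline's real stopword list and ranking logic may differ):

```python
from collections import Counter

STOPWORDS = {"the", "of", "and", "in", "for", "a", "is"}

def top_keywords(text: str, k: int = 3) -> list[str]:
    """Return the k most frequent non-stopword tokens."""
    tokens = [t for t in text.lower().split() if t.isalpha() and t not in STOPWORDS]
    return [word for word, _ in Counter(tokens).most_common(k)]

top_keywords("toxicity prediction in the liver and toxicity screening")
# -> ['toxicity', 'prediction', 'liver']
```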
- Web Scraping and API Processing
  - Extracted text from HTML pages using BeautifulSoup and processed structured JSON from APIs.
  - This required handling very different data formats and reinforced why the pipeline design was so important.
  - Evidence:
    - Web pipeline: pipeline_web_html.py
    - API pipeline: pipeline_api_json.py
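A self-contained sketch of the HTML side of this step, assuming BeautifulSoup is installed; the inline HTML is a toy stand-in for a fetched page:

```python
from bs4 import BeautifulSoup

html = """
<html><body>
  <nav>Main menu sidebar</nav>
  <p>Shih Tzus are a small companion breed.</p>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")
for nav in soup.find_all("nav"):     # remove navigation noise before extraction
    nav.decompose()
text = soup.get_text(" ", strip=True)
# text -> 'Shih Tzus are a small companion breed.'
```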
- Advanced NLP with spaCy
  - Used spaCy for tokenization, stopword removal, and linguistic preprocessing.
  - This made the pipeline more efficient compared to manual preprocessing.
  - Evidence:
    - Project 6 pipeline (see Representative Work)
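A minimal example of spaCy-based stopword filtering, assuming only a blank English pipeline (no pretrained model download needed for tokenization and stopword flags):

```python
import spacy

# A blank English pipeline provides tokenization plus lexical attributes
# like is_stop and is_punct from the language defaults.
nlp = spacy.blank("en")

doc = nlp("The pipeline removes stopwords efficiently.")
content = [t.text for t in doc if not t.is_stop and not t.is_punct]
# content -> ['pipeline', 'removes', 'stopwords', 'efficiently']
```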
2. Systems and Data Sources¶
I worked with multiple types of data sources:
- Web Pages (HTML)
  - Wikipedia article (Shih Tzu breed)
  - arXiv abstracts (biomedical AI and toxicity prediction)
- APIs (JSON)
  - NewsAPI (live news article data with nested JSON structure)
- Text Datasets
  - Custom toxicology lab notes: text_data_beaderstadt.txt
  - Structured corpus + NIOSH dataset: corpus analysis notebook
Data Challenges:
- Noisy and unstructured text
  - The raw HTML included navigation elements (e.g., "main", "menu", "sidebar") that introduced noise.
  - I handled this by removing non-content sections and filtering navigation-related terms during preprocessing.
- General text cleaning and validation
  - Required punctuation removal, casing normalization, and stopword filtering.
  - I also validated inputs by checking for empty records, duplicates, and formatting issues before processing.
- Handling different data structures
  - JSON data included nested fields that required careful parsing.
  - Some fields were inconsistent or missing (e.g., author), which required additional validation and feature engineering.
- Domain-specific adjustments
  - Custom toxicology text required tailored stopword filtering.
  - I also created labeled categories (Exposure, Guidance, Definition, Regulatory) to support comparison.
  - Biomedical relevance required custom keyword weighting to better capture meaningful signals.
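The weighted-keyword idea behind bio_tox_relevance_score can be sketched as follows; the keyword set and weights here are hypothetical stand-ins for the ones tuned to the actual corpus:

```python
# Hypothetical weights; the real score used keywords chosen for the corpus.
BIO_TOX_WEIGHTS = {"toxicity": 3, "exposure": 2, "dose": 2, "biomedical": 1}

def bio_tox_relevance_score(tokens: list[str]) -> int:
    """Sum keyword weights; a higher score means stronger biomedical/toxicology relevance."""
    return sum(BIO_TOX_WEIGHTS.get(t, 0) for t in tokens)

bio_tox_relevance_score(["acute", "toxicity", "after", "exposure"])
# -> 5
```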
3. Pipeline Structure (EVTL)¶
My projects followed a structured EVTL pipeline (Extract, Validate, Transform, Load), with an analysis step between transforming and loading:
- Extract
  - Retrieved data from HTML pages, APIs, and local text files.
  - Evidence:
    - Extraction steps across notebooks
    - Extract stage script: stage01_extract.py
- Validate
  - Performed checks for structure, missing values, duplicates, and formatting issues.
  - For example, I compared raw vs. cleaned word counts to confirm preprocessing was working as expected.
  - Evidence:
    - Validation steps across notebooks
    - Validation script: stage02_validate_beaderstadt.py
- Transform
  - Applied NLP preprocessing and feature engineering, including tokenization, cleaning, and derived feature creation.
  - This ended up being one of the most useful parts of the pipeline, especially once I started applying it to new datasets and refining both the cleaning steps and feature logic.
  - Evidence:
    - Corpus notebook: nlp_corpus_explore_beaderstadt.ipynb
    - Transformation pipeline: stage03_transform_beaderstadt.py
- Analyze
  - Computed frequency distributions, co-occurrence patterns, and domain-specific signals.
  - Evidence:
    - Processed dataset: beaderstadt_processed.csv
- Load
  - Saved outputs to CSV files and visualizations for downstream analysis.
  - Evidence:
    - Processed outputs: processed data folder
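The stages above can be sketched as one composable skeleton. The function bodies here are stand-ins (the real stages live in separate scripts such as stage01_extract.py), but the flow is the same:

```python
def extract(source: str) -> list[dict]:
    """Stand-in for pulling raw records from HTML, an API, or a local file."""
    return [{"text": "dose response data"}, {"text": "   "}]

def validate(records: list[dict]) -> list[dict]:
    """Drop empty or whitespace-only records before transforming."""
    return [r for r in records if r.get("text", "").strip()]

def transform(records: list[dict]) -> list[dict]:
    """Tokenize each record and derive simple features."""
    for r in records:
        r["tokens"] = r["text"].lower().split()
        r["word_count"] = len(r["tokens"])
    return records

def load(records: list[dict]) -> list[dict]:
    """Stand-in for writing processed rows out to CSV."""
    return records

rows = load(transform(validate(extract("https://example.com"))))
# rows -> [{'text': 'dose response data', 'tokens': ['dose', 'response', 'data'], 'word_count': 3}]
```

Keeping each stage as its own function is what made it possible to reuse the pipeline on new data by changing only the input source.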
4. Signals and Analysis Methods¶
I analyzed several key signals to understand frequency, structure, and meaning in the text.
Frequency and Distribution¶
- Word Frequency
  - Identified dominant terms across datasets and compared token distributions using Polars.
- Word Count
  - Measured article length to understand content depth.
- Token Length Distribution
  - Used histograms to explore overall text structure.
Structure and Context¶
- Bigrams
  - Captured common word pairings (e.g., “supervised learning”) to understand phrase-level patterns.
- Co-occurrence
  - Used sliding context windows to analyze relationships between terms.
- Category Comparison
  - Compared token distributions across labeled categories using bar charts.
Complexity and Metadata¶
- Text Complexity
  - Estimated technical difficulty using average word length.
- Metadata Completeness
  - Evaluated missing data using a has_author flag.
- Content Categorization
  - Grouped articles by length to support comparison.
Domain-Specific Signals¶
- Keyword Extraction
  - Identified key terms from abstracts after stopword filtering.
- Formality (Sentiment Proxy)
  - Estimated writing style using pronoun-based scoring.
- Domain Scoring
  - Built a biomedical/toxicology relevance score using weighted keywords.
- Domain Comparison
  - Compared machine learning vs. biomedical terminology to assess focus.
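The pronoun-based formality proxy can be sketched like this; the pronoun set and the 0-to-1 scaling are illustrative assumptions rather than the project's exact formula:

```python
# Hypothetical pronoun list for the informality signal.
INFORMAL_PRONOUNS = {"i", "we", "you", "me", "my", "our"}

def formality_score(tokens: list[str]) -> float:
    """Return 1.0 for text with no personal pronouns; lower means more informal."""
    if not tokens:
        return 1.0
    informal = sum(t in INFORMAL_PRONOUNS for t in tokens)
    return round(1 - informal / len(tokens), 2)

formality_score(["we", "measured", "the", "exposure"])
# -> 0.75
```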
Visual Outputs¶
- Top 15 word frequency bar chart
- Word cloud of key terms
- Token length histogram
Evidence¶
- Projects 1-3 Notebooks
  - Web scraping + frequency analysis: web_words_beaderstadt.ipynb
  - Text preprocessing: text_preprocessing_beaderstadt.ipynb
  - Corpus exploration + co-occurrence: nlp_corpus_explore_beaderstadt.ipynb
- Processed Output (Project 4)
  - Cleaned dataset: beaderstadt_processed.csv
- Derived Features (Projects 5-6)
  - top_keywords, formality_score, bio_tox_relevance_score
  - Token frequency charts, bigrams, and additional spaCy-based features
5. Insights¶
My analysis revealed several meaningful insights:

The top word frequency chart shows how quickly dominant terms emerge after cleaning, which helped confirm that the preprocessing steps were working as intended.
Data Quality and Cleaning¶
- Removing noise (especially HTML navigation text and API artifacts) made a huge difference because those terms were dominating frequency results before cleaning.
- Validation steps helped ensure the dataset was consistent and reliable before analysis.
Text Patterns and Structure¶
- Frequent terms aligned with expected domain vocabulary (e.g., lab terms, breed traits).
- Token distributions and category comparisons showed clear structural differences in the data.
Context and Meaning¶
- Increasing the context window improved co-occurrence insights.
- Bigram analysis helped capture more meaningful phrase-level patterns.
Feature Engineering and Signals¶
- Word count and average word length were useful for estimating text complexity.
- Domain-specific scoring (biomedical/toxicology) provided a more meaningful way to interpret relevance.
- Evidence: Processed output with bio_tox_relevance_score
Pipeline and Reusability¶
- Structuring the workflow as an EVTL pipeline made the process more modular and reusable.
- In most cases, I was able to reuse the pipeline on new data by just changing the input source (e.g., URL).
6. Representative Work¶
Project 4: API-Based NLP Pipeline (News Data)¶
GitHub repository: View Project 4
This project demonstrates my ability to build a full EVTL pipeline using live API data. I extended the base pipeline by adding validation for real-world JSON structures and engineering new features to improve data quality and analytical usefulness.
Project 5: Web Document NLP Pipeline (Biomedical AI)¶
GitHub repository: View Project 5
This project demonstrates my ability to build a reusable EVTL pipeline for extracting and analyzing structured web content. I extended the pipeline with domain-specific scoring and keyword extraction to evaluate biomedical relevance in academic abstracts.
Project 6: End-to-End NLP Pipeline with spaCy¶
GitHub repository: View Project 6
This project demonstrates my ability to build an end-to-end NLP pipeline that integrates web scraping, text preprocessing, and linguistic analysis. I extended the pipeline with spaCy-based processing and additional analytical features to extract deeper insights from technical text.
7. Skills¶
Through these projects, I developed the following skills:
- Building end-to-end NLP pipelines using EVTL architecture
- Extracting and processing text from HTML (BeautifulSoup) and JSON APIs
- Cleaning and normalizing messy, real-world text data
- Engineering custom NLP features (word count, keyword extraction, domain-specific scoring)
- Performing frequency, bigram, and co-occurrence analysis
- Using spaCy for efficient text processing and linguistic analysis
- Working with structured and unstructured data across multiple formats
- Communicating analytical results using Markdown, charts, and visualizations
Final Notes¶
This portfolio reflects my growth from basic text preprocessing to building more structured, reusable NLP pipelines. Moving forward, I would like to expand this work by incorporating machine learning models for tasks like classification and prediction, especially in areas like biomedical text analysis.