Date of Award

6-8-2025

Document Type

Thesis

Publisher

Santa Clara : Santa Clara University, 2025

Department

Computer Science and Engineering

First Advisor

Sean Choi

Abstract

This project investigates the development of NetGen, a transformer-based machine learning model that generates realistic and usable synthetic network traffic from natural language inputs. Network research and testing in the cybersecurity field depend on publicly available packet capture (PCAP) data, but many of these datasets are either outdated or fail to represent modern traffic patterns. We attempt to address this issue by preprocessing available network data into structured, tokenized sequences suitable for training a transformer model. The transformer architecture allows us to translate user-provided descriptions, such as “simulate a TCP handshake from IP A to IP B”, into structurally accurate network packet streams. We evaluated the effectiveness of NetGen using custom metrics that include Header Completeness (HC), Sequence Consistency Index (SCI), and Field Validity Rate (FVR). Our best 90M-parameter checkpoint achieves an HC of 0.65, an SCI of 0.43, and an FVR of 0.93, demonstrating that NetGen has the capacity to reproduce long-range protocol structures and supporting the viability of transformer-based packet generators. We aim to improve accessibility to realistic network traffic while significantly reducing the technical complexity of generating it for academic researchers, cybersecurity professionals, and educators.
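As a rough illustration of what turning packet data into structured, tokenized sequences could look like, the sketch below flattens the three packets of a TCP handshake into one token list. The field order, the `name=value` token format, the `<pkt>` delimiters, and the function names are all hypothetical choices made for this example; the abstract does not specify NetGen's actual vocabulary or preprocessing pipeline.

```python
from typing import Dict, List

# Hypothetical field order and token format (assumptions for illustration only;
# the abstract does not describe NetGen's real tokenization scheme).
FIELD_ORDER = ["proto", "src", "dst", "sport", "dport", "flags"]


def packet_to_tokens(packet: Dict[str, object]) -> List[str]:
    """Flatten one packet's header fields into a fixed-order token list."""
    tokens = ["<pkt>"]
    for name in FIELD_ORDER:
        if name in packet:  # skip fields the packet does not carry
            tokens.append(f"{name}={packet[name]}")
    tokens.append("</pkt>")
    return tokens


def flow_to_tokens(packets: List[Dict[str, object]]) -> List[str]:
    """Concatenate per-packet tokens, preserving packet order within the flow."""
    tokens: List[str] = []
    for packet in packets:
        tokens.extend(packet_to_tokens(packet))
    return tokens


# A TCP three-way handshake (SYN, SYN-ACK, ACK) between two example hosts.
handshake = [
    {"proto": "TCP", "src": "10.0.0.1", "dst": "10.0.0.2",
     "sport": 40000, "dport": 80, "flags": "S"},
    {"proto": "TCP", "src": "10.0.0.2", "dst": "10.0.0.1",
     "sport": 80, "dport": 40000, "flags": "SA"},
    {"proto": "TCP", "src": "10.0.0.1", "dst": "10.0.0.2",
     "sport": 40000, "dport": 80, "flags": "A"},
]

sequence = flow_to_tokens(handshake)
```

A sequence in this form keeps every header field at a predictable position within each packet, which is one plausible way to make field-level checks on generated output straightforward.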
