Profile
Education
Columbia University, School of Engineering and Applied Science
New York City, NY
M.S. in Data Science
Sep 2023 – Dec 2025 (Expected)
Core Courses: Machine Learning, Natural Language Processing, Algorithms for Data Science, Reinforcement Learning, Unsupervised Learning, High Performance Machine Learning, Data Science, Computer Systems for Data Science, Probability, Statistical Inference, Modern Analysis
Tongji University
Shanghai, CN
B.S. in Bioinformatics, Minor in Software Engineering
Sep 2019 – Jun 2023
Core Courses: Data Structures (C++), Machine Learning Theory, Software Engineering, Foundation of Database, Micro-service and Web Service, Calculus, Linear Algebra, Discrete Math, Numerical Methods and Algorithms
Skills
- Programming: Python, R, SQL, Java, C#, shell, HTML/CSS/JavaScript
- Data Science: sklearn, PyTorch, TensorFlow, numpy, pandas, scipy, PySpark, HuggingFace, Transformers, PEFT, tidyverse, Shiny, Power BI
- Concepts: Machine Learning, Deep Learning, Natural Language Processing, Computer Vision, Object-Oriented Programming, Data Structures, RESTful API, RDBMS, NoSQL, Agile Development, Cloud Computing, EC2
Experiences
Data Science Institute, Columbia University
New York City, NY
Research Scholar
Jan 2025 – Present
- Processed a real-world phytoplankton image dataset, using OpenCV for segmentation to extract 1 million cell items across 200+ stations
- Applied unsupervised clustering algorithms including K-Means, Spectral Clustering, and DBSCAN to group diverse phytoplankton species into clusters based on both physical attributes and ResNet50-generated image embeddings
DitecT Laboratory, Columbia University
New York City, NY
Graduate Researcher
Sep 2024 – Dec 2024
- Fine-tuned a diffusion-based video generation model on 1.5k traffic collision scenario videos preprocessed with OpenCV and captioned by LLaVA
- Employed a 2-phase training procedure to enhance domain relatedness and temporal consistency separately, using the LoRA technique on HPC
- Evaluated the collision text-to-video generation model, achieving 0.8 on the Contrastive Language–Image Pretraining (CLIP) metric
AlQuraishi Laboratory, Columbia University
New York City, NY
Graduate Researcher
Apr 2024 – Aug 2024
- Collected and cleaned 600k+ peptide datasets and 35 protein datasets with Python to ensure high-quality data for model training and benchmarking
- Trained transformer-based language models with masked sequence modeling on a Slurm-managed HPC cluster to generate protein representations
- Ran a benchmark pipeline with 5 models, including Neural Network, Query Attention, and Contrastive Learning, using PyTorch Lightning
Radical AI Inc.
New York City, NY
AI Engineer Intern
May 2024 – Aug 2024
- Engineered a chat-based course assistant built on the Google Gemini model, featuring quiz generation and personalized learning instruction
- Established a robust FastAPI backend to process diverse files (YouTube videos, Microsoft documents, etc.) with LangChain and ChromaDB
- Ensured reliability through unit testing with Pytest and integration testing within Docker environments
Shanghai Foxhub Network Technology Company
Shanghai, CN
Data Engineer Intern
Aug 2022 – Oct 2022
- Formulated relational MySQL database architecture (ER diagrams) and managed unstructured data sources (OSS) on Alibaba Cloud
- Crafted shell scripts for database access permissions and backup operations, ensuring stability in production and development environments
Projects
Automated Knowledge Graph Creation for GraphRAG
Jan 2025 – Present
- Engineered an LLM-based pipeline with 3 teammates to construct a knowledge graph from an Amazon product dataset and store entities in Neo4j
- Designed Cypher queries and applied the Leiden community detection algorithm to distill context for RAG workflows built with LangChain
- Evaluated the GraphRAG pipeline against key LLM task metrics using the DeepEval library, attaining 0.58 faithfulness and 0.91 answer relevance and demonstrating improved contextual accuracy and reliability in generated responses
Exploration of Semantic Latent Spaces in Diffusion Models
Sep 2024 – Dec 2024
- Conducted a literature review on the latent space (h-space) of the DDIM model and the properties that allow semantic manipulation
- Applied 5 linear and non-linear dimensionality reduction algorithms (PCA, ICA, MDS, Random Projection, t-SNE) to interpret and analyze latent representations within diffusion models, improving model interpretability by extracting 6 main semantic dimensions
Custom LLM Chatbots with Character-Specific Tone
Aug 2024 – Present
- Embedded 100k+ review texts using BAAI text embedding models to capture nuanced customer sentiment and contextual details
- Used review embeddings alongside product information to train 3 models (Linear Regression, Random Forest, XGBoost) to predict ratings
- Built recommendation systems with collaborative filtering methods including explicit, implicit, and hybrid matrix factorization, achieving 0.83 recall@5 and 0.78 precision@5
Custom LLM Chatbots with Character-Specific Tone
Aug 2024 – Sep 2024
- Scraped 100+ collections of chat datasets from public wiki websites using a web scraper built with BeautifulSoup and Selenium in Python
- Fine-tuned 3 state-of-the-art LLMs, including LLaMA, with the LoRA technique via the HuggingFace PEFT library to tailor the tone of each chatbot
- Built a RESTful API with FastAPI as the backend and a multi-page Streamlit app as the frontend for interactive use of the customized models
- Organized 4 modules of exploratory data analysis (EDA) with tidyverse to uncover patterns in a 10-year billionaire assets dataset
- Built a Shiny app with 3 panels for interactive data exploration, featuring dynamic visualizations from longitudinal and geographic perspectives
- Created a 9-entry Bootstrap-based website on GitHub Pages, showcasing comprehensive findings and insights about billionaires worldwide
Course Management System
Nov 2022 – Jan 2023
- Led a 4-member group to build a microservice system with Java and React, following an agile development process covering requirement specification, system design, implementation, and testing, and delivering a web service application with 4 main functionalities
- Built a hybrid database structure with MySQL for relational data and MongoDB for archival data, maintained separately in 2 Docker containers
- Implemented and tested 34 RESTful APIs with the Spring Boot framework and built an interactive website with React, Node.js, Axios, Bootstrap, and Webpack
Neurodegenerative Diseases Onset Prediction
Jun 2022 – Jul 2022
- Completed data collection and feature engineering on open-source patient data about the onset of Alzheimer’s disease and Parkinson’s disease
- Built SVM and decision tree predictive models with sklearn achieving 80%+ accuracy, and provided a Flask website for interactive use
PlantDB Desktop App
May 2022 – Jun 2022
- Delivered a desktop app with 13 interactive interfaces and 3 roles for plant information retrieval and note-taking, built with C# and VS.NET
- Designed and deployed a relational database on SQL Server, with a schema covering user access, note storage, and plant search
Neurodegenerative Diseases Onset Prediction
Jun 2022 – Jul 2022
- Collected and processed 36 solid waste datasets from Zhejiang Province to establish a robust foundation for model training and analysis
- Achieved about a 21% increase in metrics such as the Pearson coefficient on the solid waste composition prediction task, using a PyTorch neural network with L2 regularization, the Adam optimizer, dropout, and batch normalization
- Visualized data features and model evaluation results with matplotlib and TensorFlow to guide the garbage processing schedule