Profile
Education
Columbia University, School of Engineering and Applied Science
New York City, NY
M.S. in Data Science
Sep 2023 – Dec 2025 (Expected)
Core Courses: Machine Learning, Natural Language Processing, Algorithm Analysis, Reinforcement Learning, Unsupervised Learning, High Performance Machine Learning, Data Science, Computer Systems for Data Science, Probability, Statistical Inference, Scaling LLM Systems, Modern Mathematical Analysis
Tongji University
Shanghai, CN
B.S. in Bioinformatics and Computer Software Engineering
Sep 2019 – Jun 2023
Core Courses: Data Structures (C++), Machine Learning Theory, Software Engineering, Foundation of Database, Micro-service and Web Service, Calculus, Linear Algebra, Discrete Math, Numerical Methods and Algorithms
Skills
- Machine Learning: sklearn, Regression, Bagging, Boosting, Supervised Learning, Feature Engineering, Deep Neural Network (DNN)
- Deep Learning: PyTorch, TensorFlow, Hugging Face, accelerate, Megatron, distributed training, Ray, Lightning, DeepSpeed, Optuna, PEFT, HPC
- Natural Language Processing: transformers, RNN, LSTM, BERT, GPT, T5, LLaMA, LangChain, LlamaIndex, RAG, Agent, nltk, spaCy
- Computer Vision: diffusion, torchvision, diffusers, OpenCV, Pillow, CNN, ResNet, YOLO, UNet, DDPM/DDIM, ControlNet, ViT, CLIP, VLM
- Others: numpy, pandas, PySpark, Hadoop, Docker, Kubernetes, RabbitMQ, Kafka, Flink, Neo4j, AWS, GCP, Azure, distributed systems
Experiences
Kaliber AI
Santa Clara, CA
Machine Learning Researcher Intern
Jun 2025 – Present
- Achieved 99% accuracy with a fine-tuned transformer-based speech recognition model and a 1.3x inference speedup using TensorRT and Triton Inference Server
- Fine-tuned ViT/ResNet-based models on image recognition tasks with 90%+ accuracy while applying SAM2 to enhance tracking functionality
- Integrated 3D Object Detector and tool functionalities to develop AI-agentic systems enabling robotic decision-making and task execution
L’Oreal
New York City, NY
Machine Learning Engineer Intern
Jan 2025 – May 2025
- Led a 4-person team to engineer a knowledge graph RAG pipeline with 3 LLM-based methods combining OpenAI and Neo4j for product recommendation
- Designed NoSQL queries with KNN and graph community detection algorithms to enhance RAG workflows for customized customer support
- Attained 0.91 answer relevance and 0.58 faithfulness on a question-answering task when benchmarking the GraphRAG pipeline using LlamaIndex
Department of Computer Science, UAlbany
New York City, NY
Research Engineer
Apr 2025 – Present
- Improved Exact Match by 8% and F1 by 9% on multi-hop QA tasks by fine-tuning small-scale LLMs with a margin-aware preference learning method
- Customized 3 preference learning trainer variants based on DPO, ORPO, and CPO in trl (reinforcement learning library) for reasoning tasks
Data Science Institute, Columbia University
New York City, NY
Research Scholar
Jan 2025 – Present
- Processed a real-world phytoplankton image dataset, using OpenCV for segmentation to obtain 1 million cell items across 200+ stations
- Applied unsupervised clustering algorithms including K-Means, Spectral Clustering, and DBSCAN to group diverse phytoplankton species into clusters based on both physical attributes and ResNet50-generated image embeddings
DitecT Laboratory, Columbia University
New York City, NY
Graduate Researcher in CV
Sep 2024 – Dec 2024
- Fine-tuned a diffusion-based video generation model on 1.5k traffic collision scenario videos preprocessed with OpenCV and captioned by LLaVA
- Employed a 2-phase training procedure to enhance domain relevance and temporal consistency separately using LoRA on HPC
- Evaluated the collision text-to-video generation model, achieving a Contrastive Language–Image Pretraining (CLIP) score of 0.8
AlQuraishi Laboratory, Columbia University
New York City, NY
Graduate Researcher in NLP
Apr 2024 – Aug 2024
- Collected and cleaned 600k+ peptide datasets and 35 protein datasets with Python to ensure high-quality data for model training and benchmarking
- Trained transformer-based language models with masked sequence modeling on Slurm-supported HPC to generate protein representations
- Built a benchmark pipeline with 5 models, including Neural Network, Query Attention, and Contrastive Learning, using PyTorch Lightning
Radical AI Inc.
New York City, NY
AI Engineer Intern
May 2024 – Aug 2024
- Engineered a chat-based course assistant leveraging the Google Gemini model, featuring quiz generation and personalized learning instruction
- Established a robust FastAPI backend to process diverse files (YouTube videos, Microsoft documents, etc.) with LangChain and ChromaDB
- Ensured high performance through meticulous unit testing with Pytest and comprehensive integration testing within Docker environments
Shanghai Foxhub Network Technology Company
Shanghai, CN
Data Engineer Intern
Aug 2022 – Oct 2022
- Formulated relational MySQL database architecture (ER diagrams) and managed unstructured data sources (OSS) on Alibaba Cloud
- Crafted shell scripts for database access permissions and backup operations, ensuring stability in production and development environments
Projects
- Incorporated 3 reinforcement learning algorithms, including GRPO, with causal language models and LoRA (PEFT) for reinforcement fine-tuning
- Improved exact-match accuracy of a lightweight Qwen2.5 model by 1.7% on math reasoning tasks such as GSM8K using trl
- Improved small-model performance by 9% by distilling knowledge from BERT/Qwen2.5 for classification, language modeling, and summarization
- Increased running speed by 19% using FlashAttention, mixed precision, and PyTorch Dynamo for training and vLLM (PagedAttention) for inference
Controlling Generative Diffusion Models with Unsupervised Machine Learning Algorithms
Sep 2024 – Dec 2024
- Conducted a literature review to explore the latent space (h-space) of the DDIM model and its properties for semantic manipulation
- Applied 5 linear and non-linear dimensionality reduction algorithms (PCA, ICA, MDS, Random Projection, t-SNE) to interpret and analyze latent representations within diffusion models, enhancing model interpretability and feature insights by extracting 6 main semantic dimensions
Custom LLM Chatbots with Character-Specific Tone
Aug 2024 – Oct 2024
- Embedded 100k+ review texts using advanced text embedding models (BAAI) to capture nuanced customer sentiment and contextual details
- Used review embeddings alongside product information to train 3 models (Linear Regression, Random Forest, XGBoost) to predict ratings
- Incorporated collaborative filtering methods including explicit/implicit/hybrid matrix factorization to build recommendation systems, achieving 0.83 recall@5 and 0.78 precision@5
- Arranged 4 modules to perform exploratory data analysis (EDA) with tidyverse to uncover patterns in a 10-year billionaire assets dataset
- Built a Shiny app with 3 panels for interactive data exploration featuring dynamic visualizations from longitudinal and geographic perspectives
- Created a 9-entry Bootstrap-based website on GitHub Pages, showcasing comprehensive findings and insights about billionaires worldwide
Course Management System
Nov 2022 – Jan 2023
- Led a 4-member group to build a microservice system with Java and React, following an agile development process covering requirement specification, system design, implementation, and testing, and delivering a web service application with 4 main functionalities
- Built a hybrid database structure with MySQL for relational data and MongoDB for archival data, maintained separately in 2 Docker containers
- Implemented and tested 34 RESTful APIs with the Spring Boot framework and an interactive website with React, Node.js, Axios, Bootstrap, and Webpack
Neurodegenerative Diseases Onset Prediction
Jun 2022 – Jul 2022
- Completed data collection and feature engineering on open-source patient data about the onset of Alzheimer’s disease and Parkinson’s disease
- Built SVM and decision tree predictive models via sklearn achieving 80%+ accuracy and provided a Flask website for interactive use
PlantDB Desktop App
May 2022 – Jun 2022
- Delivered a desktop app with 13 interactive interfaces and 3 roles for plant information retrieval and note-taking based on C# and VS.NET
- Devised and deployed a relational database on the SQL Server platform to set up the schema for user accessibility, note storage, and plant searching
Neurodegenerative Diseases Onset Prediction
Jun 2022 – Jul 2022
- Collected and processed 36 solid waste datasets from Zhejiang Province to establish a robust foundation for model training and analysis
- Achieved about a 21% increase in metrics such as the Pearson coefficient on a solid waste composition prediction task using a neural network with L2 regularization, the Adam optimizer, dropout, and batch normalization in PyTorch
- Visualized data features and model evaluation results via matplotlib and the TensorFlow library to guide garbage processing scheduling