kYLe

I code for a living, which I enjoy

LLM

Disagg PD in vLLM and LMCache

May 02 2025

Tested out disaggregated prefill/decode (Disagg PD) in vLLM, plus some notes on LMCache, an open-source Knowledge Delivery Network (KDN) and the Redis for LLMs.

Sliding Window Attention

April 27 2025

I was debugging a sliding window attention bug and it was fixed by this PR. I helped with the review and got it merged.

Eagle 1/2/3 + HASS

April 26 2025

Speculative Decoding w Eagles 0 Medusa review Zhihu explains Medusa building the tree attention has $\sum_{i=1}^{N}\prod_{j=1}^{i} C_j$ branches ($N$ head and $...

LLM Scores Pass@k to Perplexity

April 24 2025

Some details about the scores used to measure LLMs

4Bit Quantization GPTQ and GGUF and 1Bit LLM

April 21 2025

Maarten gives another great visual guide on quantization. It’s pretty basic but has a couple of interesting points

Dynamo KVindexer

April 20 2025

Source code check for KVindexer. Some help from Zhihu’s Dynamo code analysis.

Dynamo Disagg Skeleton

April 17 2025

Created a PR to add a disagg skeleton example for Dynamo. This actually completes the Backend/Worker guide

Dynamo Hello World

April 10 2025

Created a PR to add a multi-node hello world example for Dynamo.

SGLang - Nemotron

April 04 2025

Migrated Llama3.3-Nemotron-Super-49B support from vLLM to SGLang and submitted a PR for it

RNN(LSTM) vs Mamba

March 30 2025

Hung-yi Lee’s AI course in 2025. This one about Mamba really clarifies the relationship between RNN and Mamba, and I finally understand the intuition behind LST...

vLLM Fuyu

March 27 2025

Worked on a bugfix PR for a discrepancy between get_multimodal_embedding and PlaceholderRange.

Deepseek R1 - GRPO

March 23 2025

EZ encoder’s new video on DeepSeekMath

Context Extension by YaRN

March 22 2025

LLM context length can be extended in the post-training process. The methods are all RoPE-based algorithms, like YaRN (Yet Another RoPE extensioN)

SmoothQuant and AWQ

March 18 2025

Go through LLM quantization techniques, mainly from Han’s group at MIT

Deepseek R1 - Training

March 11 2025

Continue with Deepseek R1 from EZ Encoder. Link

Deepseek R1 - CoT

March 01 2025

Continue with Deepseek R1 from EZ Encoder. Link

Deepseek R1 - GPT history

February 28 2025

Continue with Deepseek R1 from EZ Encoder Link

SGLang

February 23 2025

How does SGLang (Structured Generation Language for LLMs) achieve such great performance, and how does it differ from vLLM?

vLLM update - Paligemma

February 19 2025

Notes on updating the MM Processor for the Paligemma model. Initial PR with the PromptReplacement class. It worked, except that the language feature was not working. After debuggin...

Structured Output

February 15 2025

How can LLM follow the format defined in structured output?

Deepseek R1 - RL review

February 14 2025

Taking notes from EZ Encoder Academy’s video series about R1.

Disaggregated Serving

February 12 2025

Disaggregated Serving is about separating the prefill (generate the first token) and decoding (generate token-by-token autoregressively) phases of an LLM, and this blo...

Deepseek V3 - MTP

February 10 2025

Great explanation of the MTP used in Deepseek V3. Video source is part 4 of this series. 1 Overview Multi-Token-Prediction of Deepseek is applying Eagle’s ca...

Tensor Parallelism and Pipeline Parallelism

February 08 2025

There are multiple parallelism strategies, summarized from this video

Deepseek V3 - MoE

February 05 2025

The Deepseek MoE was introduced in this paper

Deepseek V3 - MLA

February 04 2025

Let’s summarize what I learned about Deepseek V3 in recent weeks

Diffusion Quantization

January 23 2025

I didn’t start the first blog in 2025 until late January. I wasn’t quite myself for the first couple of weeks of the new year. Still recovering from the UK trip ...

MultiModal Input in vLLM

December 16 2024

I started my first vLLM contribution by adding the audio input API following OpenAI’s schema.

vLLM

November 25 2024

I have been thinking about starting on a new open source project besides LangChain and LlamaIndex. vLLM seems a good choice and hopefully there will be more vLLM bl...

Quantization

November 03 2024

Quantization with TRT-LLM can be achieved by building a customized engine. You can get INT8 on A100 and FP8 on H100. This step is replacing convert_checkpoint.py ...

TensorRT-LLM

October 30 2024

To understand NIM, you cannot avoid a deep understanding of TRT-LLM, Triton and even vLLM. Those will be the focus for the near future.

Lookahead

October 24 2024

The last blog about Medusa ran longer than I expected, so I will write separate blogs about lookahead and EAGLE1/2

Boltzmann Machine

October 20 2024

I started learning ML with Andrew Ng’s course, and at the same time, I also took Neural Networks from Hinton. The second one is actually very hard for me and ...

Hopfield Network

October 11 2024

Hopfield and Hinton won the Nobel Prize in Physics this year. Big surprise! I found this video, which explains Hopfield’s work. It gets me to think how NN is in...

Medusa and EAGLE

October 08 2024

Both are speculative decoding technologies used to accelerate decoding. There are lookahead, and ReDrafter as well.

Knowledge Distillation

September 15 2024

Distillation was introduced by Hinton and Dean in 2015, another masterpiece from Google.

LLM Router

August 30 2024

LLM Router was introduced by LMSYS and Anyscale. It was the first open-sourced LLM router and introduced 4 different routing policies.

MoCo and Contrastive Learning

June 15 2024

Basic ideas in Contrastive Learning and Kaiming’s improvement in MoCo.

Feature Pyramid Network and RetinaNet

May 28 2024

PhD Vlog talked about some object detection (OD) networks, and this is the development timeline of OD

GenAI by Hung-yi Lee 2024-03

May 21 2024

It’s been a while since I followed Dr Lee’s 2024 lectures, but I watched his video about GPT-4o yesterday and would like to continue this series 0 RLHF First let m...

Andrej Karpathy-Tokenizer

May 18 2024

The tokenizer sounds trivial but plays such an important role in LLMs. It simply explains why LLMs are not good at arithmetic.

Mamba

May 12 2024

I finally reached the last part of this Mamba intro after going through HiPPO and this video from Umar Jamil.

S4 Structured State Spaces for Sequence Modeling

May 10 2024

Part 2 of Study notes from the video presented by Albert Gu

Llama2 tricks (2)

May 09 2024

This is a summary of the explanation of KV cache and RoPE in this video. I really like how Bai explained RoPE. 1 KV Cache First review the meaning of Q/K/V

Speculative Decoding

May 08 2024

A discussion around tokenizers in Slack led to the following comment: Would be the natural progression after you arrive at the fact taking three steps to get ...

Kolmogorov-Arnold Network

May 05 2024

It’s all over the internet about how KAN will revolutionize ML by replacing MLP. Here is my first read on KAN

Andrej Karpathy-WaveNet

April 25 2024

WaveNet, published by Google in 2016, is a waveform-generating DNN with dilated causal convolutions.

Andrej Karpathy-Backprop

April 21 2024

Andrej explains his own blog in this lecture.

Andrej Karpathy-BatchNorm

April 19 2024

This part goes deep into some training tricks. Very insightful!

Andrej Karpathy-MLP

April 18 2024

It’s based on Bengio’s paper on MLP, A Neural Probabilistic Language Model in 2003.

Andrej Karpathy-MakeMore

April 17 2024

2.5 hr video of MakeMore.

Andrej Karpathy-GPT continues

April 12 2024

Andrej took a sleep break and here is the second part of his intro.

Andrej Karpathy-GPT from scratch

April 06 2024

If there is one video you should watch about GPT, this is it. Karpathy’s code-level deep dive into GPT is a blessing to all GenAI engineers.

Legendre polynomials and HiPPO

March 31 2024

Study notes from the video presented by Albert Gu on S4, Structured State Space Sequence model

State Space Machine

March 30 2024

Structured State Space for Sequence Modeling (S4) paper by Albert Gu, 2021

GenAI by Hung-yi Lee 2024-02

March 24 2024

Continue with Part 1, The 5th way of prompt engineering. 5 Model Cooperation Model Cooperation could be due to cost. This is similar to MoE but no LLM arch...

MoE and Decoder-Only Transformer code

March 22 2024

Summary from this MoE link and this Decoder-only transformer link

LLM Pre-Training and Inference

March 20 2024

This is from Cameron Wolfe’s website and discusses LLM pretraining and inference in detail with code. Very educational and I will write multiple study notes...

GenAI by Hung-yi Lee 2024-01

March 10 2024

It’s time to learn ML/GenAI with Dr Lee AGAIN in 2024.

Prompts

March 07 2024

Prompt methods Notes for this prompt engineering website 1 Automatic Reasoning and Tool-use (ART) Key idea: Add code executing results in the chaining prompt ...

Prompt Engineering for Anthropic

March 03 2024

Various prompt engineering tricks for Anthropic Claude

LLM overview in 2024

February 10 2024

Why SFT is before RLHF RLHF needs basic capability from the model. 2 steps in RLHF If it only learns from HF, then you know the answer to a...

Llama2 tricks

January 26 2024

A good YouTube video explained several tricks applied in Llama2. Here are the study notes. 1. Layer normalization Batch norm: normalized by columns (same feat...

Tokenizers in LLM

November 25 2023

Tokenizer is a basic concept in NLP, and basically it generates tokens from a sentence. A token is a bit “less” than a word, so the common ratio between toke...

Pinecone Canopy, tokenizer, poetry…

November 20 2023

Pinecone released Canopy, which is a framework for RAG. It originally has OpenAI as the LLM and embedding model provider and wants to cooperate with Anyscale for o...

Poe and Modal

November 15 2023

Poe, a chatbot hosting service backed by Quora, is getting popular. I tried to add AE as one of the chatbots, with the Zephyr 7B model. (It seems that first model...

OpenAI API v1 change

November 10 2023

OpenAI’s Dev Day was quite exciting, right? But do you know they quietly released API v1, which has breaking changes in it? This would keep me busy for next coup...

OpenAI Dev Day

November 07 2023

Assistant is the killer API, for quite some startups, and even for big and popular projects like Pinecone and Open Interpreter.

Continuous Batch

November 01 2023

Read the blog about continuous batching, from Cade and Shen.

Flash Attention

October 30 2023

It’s time to dig into some LLM optimization algorithms. My first googled question was “Flash Attention vs Paged Attention”, which are two popular optimization...

RAG Fusion

October 25 2023

This is a typical example of how we can enrich RAG with more advanced methods, and it does NOT require more complicated algorithms or theories. RAG Fusion is ver...

AgentTuning and AgentInstruct

October 22 2023

Came across another agent training paper from Tsinghua, AgentTuning.

FireAct and LLM Datasets

October 15 2023

There are multiple LLM-related datasets, and I didn’t really pay attention to them until I started working on the FireAct demo.

Anyscale Endpoint Integration - LangChain

October 09 2023

First things first: it took me a couple of commits to pass the formatting check in this PR with ruff and black, so I’d better write them down first. Ruff ...

LLM functions

September 29 2023

Open Interpreter was one of the fastest-growing repos on GitHub, and it got 26k stars in a month. I played with it last weekend and had quite some fun with O...

From LLM to Agents

September 19 2023

Yao Shunyu, the original author of the ReAct paper, talks about LLM and Agents.

Building Slack Bot for LLM on AWS - Part 2

June 07 2023

“How to create Slack bot which interact with AWS Lambda” This is the prompt I entered for GPT4, and here are the answers from it

Building Slack Bot for LLM on AWS - Part 1

May 28 2023

Kind of a hackathon/weekend project. Built a Slack bot running its backend on Anyscale. Actually, my very first engineering project at Amazon was using slack bot...

ML

Kaiming’s ML overview at MIT

March 13 2024

Kaiming He joined MIT as an associate professor in Feb 2024 and delivered “Deep Learning Bootcamp” as his first public talk as an MIT professor. It’s very pleasant...

Pytorch - 3 Distributed Computing

January 30 2024

Pytorch gets its own distributed implementation, either through MPI backend, point to point, inspiration for torch.distributed Meta’s own [GLOO](https://gi...

Pytorch - 2 Model Parallel

January 29 2024

Continue Pytorch notes with Single-Machine Model Parallel. Model Parallel When the model is too large to fit into a single GPU, model parallel is necessary. ...

PyTorch - 1 Data Parallel

January 28 2024

My ML journey started with building NN layer by layer with Tensorflow in 2016. Keras was invented but I still wanted to know more details of Tensorflow but s...

AlphaGeometry

January 19 2024

I have been sick since the trip to China and haven’t fully recovered even till today. I totally underestimated the damage of 雾霾（smog, a word combining smoke ...

Ray Data

December 15 2023

This is not quite an ML topic; it is more about using the Ray Data library to run batch processing. Yes, the core function of Ray Data is batch processing, and her...

ML101 - Self Attention

August 20 2023

One more good resource for this introduction is here

ML101 -3

August 12 2023

Start with Regression vs Classification. and introduce softmax It seems there are long stories behind softmax, rather than normalization (Answer: Use Sigmoi...

ML101 -2

August 02 2023

This blog is mainly about optimizers. It’s good to review them all. Overall problem to be solved, different parameters need different learning rate 1 AdaGra...

ML101 -1

July 28 2023

I was recommending some online materials for ML101 to a friend, and videos from Hung-yi Lee are always my first choices. I selected ML courses from his 2021 ...

Python

Recursive

December 02 2024

A piece of code that can print itself s = 's = %r\nprint(s%%s)' print(s%s)
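To make the quine above easier to check, here is a minimal sketch that stores the two lines as source text, runs them, and compares the printed output with the source. The names `QUINE_SOURCE` and `run_and_capture` are made up for this illustration and are not from the original post.

```python
import contextlib
import io

# The two-line quine from the excerpt, kept as source text.
# The doubled backslash keeps a literal "\n" inside the quine's own string.
QUINE_SOURCE = "s = 's = %r\\nprint(s%%s)'\nprint(s%s)\n"


def run_and_capture(source: str) -> str:
    """Execute Python source text and return everything it prints."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        exec(source, {})
    return buffer.getvalue()


output = run_and_capture(QUINE_SOURCE)

# A quine prints exactly its own source.
is_quine = output == QUINE_SOURCE
print(is_quine)
```

The trick is that `%r` inserts the `repr` of `s` (quotes and escapes included), while `%%` collapses to a single `%`, so the formatted string reproduces both lines of the program.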

Concurrency Execution

October 21 2024

I was testing sending concurrent requests to an LLM server and would like to record two ways of running concurrent processes, and would have a deeper dive on as...

Httpx

July 18 2024

HTTPX is another HTTP client similar to Requests. It’s used for the http_client option of OpenAI’s OpenAI constructor.

Pydantic Validators

June 28 2024

Read a good introduction to validators in Pydantic here. Even though Pydantic is going to deprecate the validator and root_validator decorators in v3, it’s stil...

Python tips

February 08 2024

Random Python tips I collected recently.

Pytest

November 28 2023

Unit testing is something I ignored for a long time. I know of its existence but have barely initiated one. If it’s already in the system, I don’t mind adding one, like for ...

argparse and Namespace

May 23 2023

Learned something really trivial today, but kind of interesting and useful, which is related to the argparse library. We all use argparse for argument parsing. The s...

Create python packages Part 1

May 15 2023

This blog records my tests on how to create a Python package 1. Simplest demo of creating a python package Creating a python package is not hard, but there ...

Math

Elo ranking and Bradley-Terry

July 29 2024

1. try/catch/else/finally Encountered an interesting piece of code for try/catch ```python def no_env_var(var: str): try: # If you have this var, remove it i...

Spline 2

May 15 2024

The second part of this video talks about splines again, and I think I found some clue to 1/3 of vel in the previous spline

Spline

May 06 2024

I reviewed some basics about Spline in the previous blog about KAN, and happened to find this tutorial. The most amazing part of this video is to show you wh...

Legendre polynomials and HiPPO

March 31 2024

Study notes from the video presented by Albert Gu on S4, Structured State Space Sequence model

Theorema Egregium

March 16 2024

Gauss’s Theorema Egregium, which is Latin for “Remarkable Theorem”, is a major result of differential geometry. I found a good introduction here and video by...

Entropy and Perplexity

March 12 2024

Understanding Shannon’s entropy is crucial to understanding concepts like cross entropy and KL divergence. Perplexity, though, is a concept that comes from NLP. Here i...
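As a quick illustration of how these quantities relate, here is a minimal sketch using the standard textbook definitions in base 2. The function names are chosen for this example, and it assumes both inputs are valid probability distributions with `q[i] > 0` wherever `p[i] > 0`.

```python
import math


def entropy(p):
    """Shannon entropy H(p) in bits."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)


def cross_entropy(p, q):
    """Cross entropy H(p, q) in bits; assumes q[i] > 0 wherever p[i] > 0."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)


def kl_divergence(p, q):
    """KL(p || q) = H(p, q) - H(p), always >= 0."""
    return cross_entropy(p, q) - entropy(p)


def perplexity(p, q):
    """Perplexity of model q on data distribution p: 2 ** H(p, q)."""
    return 2 ** cross_entropy(p, q)


uniform = [0.25, 0.25, 0.25, 0.25]
skewed = [0.7, 0.1, 0.1, 0.1]

print(entropy(uniform))             # 2 bits for 4 equally likely outcomes
print(perplexity(uniform, uniform)) # as uncertain as a fair 4-way choice
print(kl_divergence(skewed, uniform))
```

One way to read perplexity: a value of 4 means the model is as uncertain as if it were choosing uniformly among 4 tokens at each step.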

Poisson and Exponential Distribution

November 02 2023

马同学(Student Horse) is a great source of math concept clarifications, both in linear algebra and statistics. I came across this explanation for both Poisson a...

RL

RL in 2025

March 14 2025

Happy $\pi$ Day! It’s time to review RL in 2025. This zhihu gives me a much clearer review of value based and policy based methods. I guess the yearly review o...

RL 2024-3

March 29 2024

Focus on PPO in this post

RL 2024-2

March 28 2024

Focus on Policy Gradient in this post from Cameron Wolfe. With the implementation from Spinning Up, it’s essentially a feedforward network (MLP with 3 layer...

Reinforcement Learning - Q Learning

June 12 2023

Normally we learn RL starting from Q learning, which seems the easiest to understand. But this lecture goes PG and PPO first, then Q learning. Interesting, reminds me of line...

Reinforcement Learning - PPO

June 11 2023

PPO, Proximal Policy Optimization. One of the most powerful RL algorithms, and the default RL training algorithm by OpenAI.

Reinforcement Learning - Policy Gradient

June 09 2023

Learning RL is actually my very first ML project since I joined AWS. DeepRacer was released at re:Invent 2018, and our prototype team got hands on it at early 201...

K8S

K8S behind DGXCloud and NVCF

May 09 2025

Recently all work seems K8S related and practices around k8s helped me onboard DGXCloud and NVCF Helm deployment really fast. 0 Web Server It’s totally irrel...

K9S and Kubeadm

April 30 2025

After playing with K8S for a couple of weeks, I started to deploy K8S and debug network issues

K8S job for NIM

March 06 2025

To create a NIM by k8s job, I worked it out step by step, translating container operations into k8s scripts.

K8S again

March 02 2025

A k8s intro in Chinese from this video. It explains k8s concepts in a much clearer way and I finally feel that I understand the architecture of k8s

Helm and Operators

October 04 2024

Helm is to K8s what apt is to Ubuntu: a package management system. It defines pod yaml, deployment yaml, ServiceAccount, Secrets, etc

Kubernetes 101

August 19 2024

I finally started learning K8S, and followed this official doc to get my pods listed, deployment done, and service created.

Cloud

Andrej Karpathy-MicroGrad

April 15 2024

2.5 hr video of micrograd. I wish I could’ve watched this video 5 yrs earlier! It clears up so many questions about loss.backward()!

SQL 101

February 17 2024

The first new thing I learnt in 2024 is SQL. I gave it a try back in 2022 while job hunting, and was asked to write SQL queries in a Databricks SA interview. I never writ...

Distributed System comparison

February 05 2024

Again, I summarized the comparison between distributed systems here. A couple of interesting points about DASK

CORS setup on AWS APIGateway

September 15 2023

It has also been two months since my last update. There is no excuse, but I do want to claim that I was away from home from Aug 1st to 22nd, and got totall...

Distributing ML workload by Ray on Amazon SageMaker

May 24 2023

Ray is an open-source framework that provides a simple and flexible way to build and scale distributed applications. It is designed to enable efficient and h...

GPU

CUDA

May 21 2025

1 Concepts: thread; thread block, which consists of warps and is executed on an SM (Streaming Multiprocessor); warp, a block of 32 threads. A warp is executed physically ...

GPU and related techs 101

May 02 2024

Last time I touched CUDA and GPU was in 2018, when I was preparing for job hunting at CGG. It’s time to review some basics about GPU now

Nvidia GenAI Stack

February 25 2024

Nvidia NeMo GenAI framework on DGX Cloud/Kubernetes Clusters AutoConfigurator SFT and PEFT

Nvidia GPUs

December 10 2023

Nvidia’s server GPUs have been evolving. I was quite familiar with GPU types back in CGG but after 4 or 5 years, I am totally out of sync

Git

Avoid secret leak in GIT

October 16 2024

Frankly speaking security is the area I care the least. Not interested in any security related topics except for RSA, which is only because the algorithm beh...

Git Merges

September 13 2024

Merge vs Rebase Great explanation from this tutorial and this video

Git Undo

September 10 2024

I will start with simple cheat sheet before diving into reset/revert/checkout

Git

February 21 2024

Migrate my notes on Git from Google Keep to here

CV

SAM and BLIP

June 16 2024

Segment Anything Model and Bootstrapping Language-Image Pre-training

CLIP

June 12 2024

Dr Vlog gave a talk on CLIP to Math PhDs and summarized in a 50mins video.

YOLO v4-v9

June 08 2024

Continue to finish YOLO v4-v9 in this video. Just curious what they did to ship these many new versions of YOLO

SSD and YOLO

June 01 2024

There is really no need to know the details of implementation from YOLO V1 to V9. But considering it’s the one model that helped me quite a lot on AWS projects, I...

AWS

Building Slack Bot for LLM on AWS - Part 2

June 07 2023

“How to create Slack bot which interact with AWS Lambda” This is the prompt I entered for GPT4, and here are the answers from it

Building Slack Bot for LLM on AWS - Part 1

May 28 2023

Kind of a hackathon/weekend project. Built a Slack bot running its backend on Anyscale. Actually, my very first engineering project at Amazon was using slack bot...

Distributing ML workload by Ray on Amazon SageMaker

May 24 2023

Ray is an open-source framework that provides a simple and flexible way to build and scale distributed applications. It is designed to enable efficient and h...

GenAI

Building Slack Bot for LLM on AWS - Part 2

June 07 2023

“How to create Slack bot which interact with AWS Lambda” This is the prompt I entered for GPT4, and here are the answers from it

Building Slack Bot for LLM on AWS - Part 1

May 28 2023

Kind of a hackathon/weekend project. Built a Slack bot running its backend on Anyscale. Actually, my very first engineering project at Amazon was using slack bot...

Frontend

Create Workshop by Hugo Part 2

July 09 2023

This part will focus on how to host a static webpage created by Hugo online, mainly leveraging cloud services like AWS S3.

Create Workshop by Hugo Part 1

July 03 2023

Hugo is a great tool for workshop instructions. Easy to create, with good-looking templates and a professional look. I am so regretful that my early workshops in AWS w...

Misc

Slurm and Enroot

May 19 2025

Finally touching on the Slurm system. I first heard about it during my CGG time, and we had some brief discussions about using it for cluster jobs. But our own implementation ...

Workshops and Buildouts

January 24 2024

One interesting part of my job is to create workshops and buildouts (one type of workshop focusing on AWS building). Here are some screenshots of my previous...

math

Bradley Terry and Elo Score

July 31 2024

Worked on the SW (Similarity Weighted) routing policy on LLM Router, and learned about Bradley Terry and Elo score. Very interesting topics
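For reference, the two models are closely linked. A minimal sketch, assuming the standard textbook formulas rather than anything specific to LLM Router; the function names and the `k=32` update factor are illustrative choices.

```python
def bradley_terry_win_prob(strength_a: float, strength_b: float) -> float:
    """Bradley-Terry: P(A beats B) = p_A / (p_A + p_B)."""
    return strength_a / (strength_a + strength_b)


def elo_expected_score(rating_a: float, rating_b: float) -> float:
    """Elo expected score of A against B on the usual 400-point scale."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def elo_to_strength(rating: float) -> float:
    """Map an Elo rating to a Bradley-Terry strength: p = 10 ** (R / 400)."""
    return 10 ** (rating / 400)


def elo_update(rating_a, rating_b, score_a, k=32.0):
    """One Elo update; score_a is 1 for a win, 0.5 for a draw, 0 for a loss.

    k=32 is a common but arbitrary choice (an assumption of this sketch).
    """
    delta = k * (score_a - elo_expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta


# With p = 10 ** (R / 400), the Elo expected score is exactly the
# Bradley-Terry win probability.
print(elo_expected_score(1600, 1200))
print(bradley_terry_win_prob(elo_to_strength(1600), elo_to_strength(1200)))
print(elo_update(1500, 1500, 1.0))
```

The main practical difference is in fitting: Elo updates ratings online after each comparison, while Bradley-Terry strengths are usually fit to the whole set of comparisons at once.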

Bezier Curves

May 14 2024

This is one of the most amazing videos for math concept introduction, together with a great primer

container

Container system

June 18 2024

Summary from this post

Something about Docker

June 03 2024

Almost every 6 months, I would look up the differences between Docker ENTRYPOINT and CMD. Now it comes to my blog

Network

NVLink, InfiniBand and SpectrumX

May 13 2025

Summary of a Zhihu post, with some pictures from here.

Proxy and Reverse Proxy

August 10 2024

Every time I see the word “Proxy” I feel somewhat uneasy, not to mention how I feel when I see “Reverse Proxy”. Now I have looked it up in this intro

Ray/Anyscale

Distributing ML workload by Ray on Amazon SageMaker

May 24 2023

Ray is an open-source framework that provides a simple and flexible way to build and scale distributed applications. It is designed to enable efficient and h...

SageMaker

Distributing ML workload by Ray on Amazon SageMaker

May 24 2023

Ray is an open-source framework that provides a simple and flexible way to build and scale distributed applications. It is designed to enable efficient and h...

JSON

JSON Schema

December 08 2023

I was playing with function calling with AE, yes, now AE is the first host to enable function call/JSON mode on open source models, Mistral and Mixtral seri...

Fun

Tetris

December 22 2023

I was creating a Tetris game for the first time. Even though it was a popular programming task in college, I never actually tried to implement i...

R1

RL 2024-2

March 26 2024

Review some basic concepts in RL from this blog

DL

Object Detection Summary

April 03 2024

I feel I was reading a lot of LLM related topics recently but getting far away from CV. I happened to read this post from v_JULY_v and it’s a good review for...

Diffusion

Diffusion Models

May 22 2024

The original motivation of this tech blog was to understand diffusion models. It’s such a beautiful algorithm that I spent lots of time reading from Lilian...

MultiModals

Image Text Fusion

May 31 2024

Jump into multimodal models before June. This video talks about 6 different ways to fuse text and image together.

Torch

Torchtune

August 22 2024

A customer request: show an OSS solution for LoRA fine-tuning. Open-sourced NeMo is the backup plan and PyTorch PEFT is preferred
