kYLe

I code for a living, which I enjoy

Posts by Year

2025 35
2024 79
2023 34

2025

CUDA

May 21 2025

1 Concepts: thread; thread block, which consists of warps and is executed on an SM (Streaming Multiprocessor); warp, a group of 32 threads. A warp is executed physically ...
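A minimal sketch of the warp arithmetic implied above, assuming the usual warp size of 32 threads. The helper names `warps_per_block` and `idle_lanes` are hypothetical, chosen only for illustration.

```python
import math

WARP_SIZE = 32  # threads per warp, as stated in the excerpt


def warps_per_block(threads_per_block: int) -> int:
    """Number of warps needed to cover a thread block (rounded up)."""
    return math.ceil(threads_per_block / WARP_SIZE)


def idle_lanes(threads_per_block: int) -> int:
    """Lanes in the last warp that carry no thread of the block."""
    return warps_per_block(threads_per_block) * WARP_SIZE - threads_per_block


if __name__ == "__main__":
    for n in (100, 256):
        print(n, warps_per_block(n), idle_lanes(n))
```

A block size that is a multiple of 32 leaves no idle lanes, which is one reason such block sizes are commonly chosen.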

Slurm and Enroot

May 19 2025

Finally touching on the Slurm system. I first heard about it during my CGG time, and we had some brief discussions about using it for cluster jobs. But our own implementation ...

NVLink, InfiniBand and SpectrumX

May 13 2025

Summary of a Zhihu post, with some pictures from here.

K8S behind DGXCloud and NVCF

May 09 2025

Recently all my work seems K8S related, and practice with k8s helped me onboard DGXCloud and NVCF Helm deployment really fast. 0 Web Server It’s totally irrel...

Disagg PD in vLLM and LMCache

May 02 2025

Tested out Disagg PD in vLLM and something about LMCache, an open-source Knowledge Delivery Network (KDN), and the Redis for LLMs.

K9S and Kubeadm

April 30 2025

After playing with K8S for a couple of weeks, I started deploying K8S and debugging network issues.

Sliding Window Attention

April 27 2025

I was debugging a sliding window attention bug and it was fixed by this PR. I helped with the review and got it merged.

Eagle 1/2/3 + HASS

April 26 2025

Speculative Decoding with Eagles 0 Medusa review Zhihu explains that Medusa’s tree attention has $\sum_{i=1}^N\prod_{j=1}^i C_j$ branches ($N$ head and $...
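A small sketch of the branch count, reading it as a sum over tree depths of the product of per-head candidate counts. Since the excerpt is truncated, treating each count as "candidates kept for one head" is an assumption, and `medusa_tree_branches` is a hypothetical helper name.

```python
from math import prod
from typing import Sequence


def medusa_tree_branches(candidates_per_head: Sequence[int]) -> int:
    """Sum over depths i = 1..N of the product of the first i candidate counts."""
    n_heads = len(candidates_per_head)
    return sum(prod(candidates_per_head[:i]) for i in range(1, n_heads + 1))


if __name__ == "__main__":
    # Assumed example: 3 heads keeping 3, 2 and 2 candidates respectively.
    print(medusa_tree_branches([3, 2, 2]))
```

The count grows multiplicatively with depth, which is why the number of candidates per head is usually kept small.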

LLM Scores Pass@k to Perplexity

April 24 2025

Some details about LLM measuring scores

4Bit Quantization GPTQ and GGUF and 1Bit LLM

April 21 2025

Maarten gives another great visual guide on quantization. It’s pretty basic but has a couple of interesting points

Dynamo KVindexer

April 20 2025

Source code check for KVindexer. Some help from Zhihu’s Dynamo code analysis.

Dynamo Disagg Skeleton

April 17 2025

Created a PR to add disagg skeleton example for Dynamo. This actually completes the Backend/Worker guide

Dynamo Hello World

April 10 2025

Created a PR to add multi-node hello world example for Dynamo.

SGLang - Nemotron

April 04 2025

Migrated Llama3.3-Nemotron-Super-49B support from vLLM to SGLang and submitted a PR for it

RNN(LSTM) vs Mamba

March 30 2025

Hongyee’s AI course in 2025. This one about Mamba really clarifies the relationship between RNN and Mamba, and I finally understand the intuition behind LST...

vLLM Fuyu

March 27 2025

Worked on a bugfix PR for a discrepancy between get_multimodal_embedding and PlaceholderRange.

Deepseek R1 - GRPO

March 23 2025

EZ encoder’s new video on DeepSeekMath

Context Extension by YaRN

March 22 2025

LLM context length can be extended in the post-training process. The methods are all RoPE-based algorithms, like YaRN (Yet Another RoPE extensioN)

SmoothQuant and AWQ

March 18 2025

Go through LLM quantization technologies, mainly from Han’s group at MIT

RL in 2025

March 14 2025

Happy $\pi$ Day! It’s time to review RL in 2025. This zhihu gives me a much clearer review of value based and policy based methods. I guess the yearly review o...

Deepseek R1 - Training

March 11 2025

Continue with Deepseek R1 from EZ Encoder. Link

K8S job for NIM

March 06 2025

To create a NIM with a k8s job, I worked through it step by step, translating container operations into k8s scripts.

K8S again

March 02 2025

A k8s intro in Chinese from this video. It explains k8s concepts in a much clearer way and I finally feel that I understand the architecture of k8s

Deepseek R1 - CoT

March 01 2025

Continue with Deepseek R1 from EZ Encoder. Link

Deepseek R1 - GPT history

February 28 2025

Continue with Deepseek R1 from EZ Encoder. Link

SGLang

February 23 2025

How does SGLang (Structured Generation Language for LLMs) achieve such great performance, and how does it differ from vLLM?

vLLM update - Paligemma

February 19 2025

Notes on updating the MM Processor for the Paligemma model. Initial PR w PromptReplacement class. It worked, except that the language feature was not working. After debuggin...

Structured Output

February 15 2025

How can LLM follow the format defined in structured output?

Deepseek R1 - RL review

February 14 2025

Taking notes from EZ Encoder Academy’s video series about R1.

Disaggregated Serving

February 12 2025

Disaggregated Serving is about separating the prefill (generate the first token) and decoding (generate token-by-token autoregressively) phases of LLM, and this blo...

Deepseek V3 - MTP

February 10 2025

Great explanation of the MTP used in Deepseek V3. Video source is part 4 of this series. 1 Overview Multi-Token-Prediction of Deepseek is applying Eagle’s ca...

Tensor Parallelism and Pipeline Parallelism

February 08 2025

Notes on multiple parallelism strategies from a video

Deepseek V3 - MoE

February 05 2025

The Deepseek MoE was introduced in this paper

Deepseek V3 - MLA

February 04 2025

Let’s summarize the learning of deepseek V3 from recent weeks

Diffusion Quantization

January 23 2025

I didn’t start the first blog in 2025 until late January. I wasn’t quite myself for the first couple of weeks of the new year. Still recovering from the UK trip ...

2024

MultiModal Input in vLLM

December 16 2024

I started my first vLLM contribution by adding the audio input API following OpenAI’s schema.

Recursive

December 02 2024

A piece of code that can print itself s = 's = %r\nprint(s%%s)' print(s%s)
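A minimal sketch to check the claim, assuming the snippet above is meant as two separate statements (the assignment, then the `print`). `QUINE_SOURCE` and `run_and_capture` are hypothetical names used only for this check.

```python
import contextlib
import io

# The two-line program from the post, written out as source text.
QUINE_SOURCE = "s = 's = %r\\nprint(s%%s)'\nprint(s%s)\n"


def run_and_capture(code: str) -> str:
    """Execute code in a fresh namespace and return what it printed."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        exec(code, {})
    return buffer.getvalue()


# A quine prints exactly its own source text.
is_quine = run_and_capture(QUINE_SOURCE) == QUINE_SOURCE

if __name__ == "__main__":
    print(is_quine)
```

The trick is that `%r` inserts the `repr` of `s` back into itself, while `%%` survives formatting as a literal `%`.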

vLLM

November 25 2024

I have been thinking about starting on a new open source project besides LangChain and LlamaIndex. vLLM seems a good choice and hopefully there will be more vLLM bl...

Quantization

November 03 2024

Quantization with TRT-LLM can be achieved with a customized engine build. You can get INT8 on A100 and FP8 on H100. This step replaces convert_checkpoint.py ...

TensorRT-LLM

October 30 2024

To understand NIM, you cannot avoid a deep understanding of TRT-LLM, Triton and even vLLM. Those will be the focus for the near future.

Lookahead

October 24 2024

The last blog about Medusa was getting longer than I expected, so I will write separate blogs about lookahead and EAGLE1/2

Concurrency Execution

October 21 2024

I was testing sending concurrent requests to an LLM server and would like to record two ways of running concurrent processes, and would have a deeper dive on as...

Boltzmann Machine

October 20 2024

I started learning ML with Andrew Ng’s course, and at the same time, I also took Neural Networks from Hinton. The second one is actually very hard for me and ...

Avoid secret leak in GIT

October 16 2024

Frankly speaking, security is the area I care about the least. Not interested in any security related topics except for RSA, which is only because the algorithm beh...

Hopfield Network

October 11 2024

Hopfield and Hinton won the Nobel Prize in Physics this year. Big surprise! I found this video, which explains Hopfield’s work. It gets me to think how NN is in...

Medusa and EAGLE

October 08 2024

Both are speculative decoding technologies used to accelerate decoding. There are lookahead, and ReDrafter as well.

Helm and Operators

October 04 2024

Helm is to K8s what apt is to Ubuntu: a package management system. It defines pod yaml, deployment yaml, ServiceAccount, Secrets, etc

Knowledge Distillation

September 15 2024

Distillation was introduced by Hinton, Vinyals and Dean in 2015, another masterpiece from Google.

Git Merges

September 13 2024

Merge vs Rebase Great explanation from this tutorial and this video

Git Undo

September 10 2024

I will start with simple cheat sheet before diving into reset/revert/checkout

LLM Router

August 30 2024

LLM Router was introduced by LMSYS and Anyscale. It was the first open-source LLM routing framework and introduced 4 different routing policies.

Torchtune

August 22 2024

A customer request: show an OSS solution for LoRA finetuning. Open-source NeMo is the backup plan and Pytorch PEFT is preferred

Kubernetes 101

August 19 2024

I finally started learning K8S, and followed this official doc to get my pods listed, deployment done, and service created.

Proxy and Reverse Proxy

August 10 2024

Every time I see the word “Proxy” I feel somewhat uneasy, not to mention how I feel when I see “Reverse Proxy”. Now I have looked it up in this intro

Bradley Terry and Elo Score

July 31 2024

Worked on the SW (Similarity Weighted) routing policy in LLM Router, and learned about Bradley-Terry and Elo score. Very interesting topics

Elo ranking and Bradley-Terry

July 29 2024

1. try/catch/else/finally Encountered an interesting piece of code for try/catch ```python def no_env_var(var: str): try: # If you have this var, remove it i...

Httpx

July 18 2024

HTTPX is another HTTP client similar to Requests. It’s used for the http_client option of OpenAI’s OpenAI constructor.

Pydantic Validators

June 28 2024

Read a good introduction to validators in Pydantic here. Even though Pydantic is going to deprecate the validator and root_validator decorators in v3, it’s stil...

Container system

June 18 2024

Summary from this post

SAM and BLIP

June 16 2024

Segment Anything Model and Bootstrapping Language-Image Pre-training

MoCo and Contrastive Learning

June 15 2024

Basic ideas in Contrastive Learning and Kaiming’s improvement in MoCo.

CLIP

June 12 2024

Dr Vlog gave a talk on CLIP to Math PhDs and summarized it in a 50-minute video.

YOLO v4-v9

June 08 2024

Continue to finish YOLO v4-v9 in this video. Just curious what they did to ship this many new versions of YOLO

Something about Docker

June 03 2024

Almost every 6 months, I look up the differences between Docker ENTRYPOINT and CMD. Now it comes to my blog

SSD and YOLO

June 01 2024

There is really no need to know the details of implementation from YOLO V1 to V9. But considering it’s the one model that helped me quite a lot on AWS projects, I...

Image Text Fusion

May 31 2024

Jumping into multimodal models before June. This video talks about 6 different ways to fuse text and image together.

Feature Pyramid Network and RetinaNet

May 28 2024

PhD Vlog talked about some OD networks, and this is the development line of ODs

Diffusion Models

May 22 2024

The original motivation of this tech blog was to understand diffusion models. It’s such a beautiful algorithm that I spent lots of time reading from Lilian...

GenAI by Hung-yi Lee 2024-03

May 21 2024

It’s been a while since I followed Dr Lee’s 2024 lectures, but I watched his video about GPT-4o yesterday and would like to continue this series 0 RLHF First let m...

Andrej Karpathy-Tokenizer

May 18 2024

The tokenizer sounds trivial but plays such an important role in LLM. It actually simply explains why LLM is not good at arithmetic.

Spline 2

May 15 2024

The second part of this video talks about splines again, and I think I found some clue to the 1/3 of vel in the previous spline post

Bezier Curves

May 14 2024

This is one of the most amazing videos for introducing a math concept, together with a great primer

Mamba

May 12 2024

I finally reached the last part of this Mamba intro after going through HiPPO and this video from Umar Jamil.

S4 Structured State Spaces for Sequence Modeling

May 10 2024

Part 2 of Study notes from the video presented by Albert Gu

Llama2 tricks (2)

May 09 2024

This is a summary of the explanation of KV cache and RoPE in this video. I really like how Bai explained RoPE. 1 KV Cache First review the meaning of Q/K/V

Speculative Decoding

May 08 2024

A discussion around tokenizers in Slack led to the following comment: Would be the natural progression after you arrive at the fact taking three steps to get ...

Spline

May 06 2024

I reviewed some basics about Spline in the previous blog about KAN, and happened to find this tutorial. The most amazing part of this video is to show you wh...

Kolmogorov-Arnold Network

May 05 2024

It’s all over the internet about how KAN will revolutionize ML by replacing MLP. Here is my first read about KAN

GPU and related techs 101

May 02 2024

Last time I touched CUDA and GPU was in 2018, when I was preparing for job hunting at CGG. It’s time to review some basics about GPU now

Andrej Karpathy-WaveNet

April 25 2024

WaveNet, published by Google in 2016, is a waveform-generating DNN with dilated causal convolutions.

Andrej Karpathy-Backprop

April 21 2024

Andrej explains his own blog in this lecture.

Andrej Karpathy-BatchNorm

April 19 2024

This part goes deep into some training tricks. Very insightful!

Andrej Karpathy-MLP

April 18 2024

It’s based on Bengio’s paper on MLP, A Neural Probabilistic Language Model in 2003.

Andrej Karpathy-MakeMore

April 17 2024

2.5 hr video of MakeMore.

Andrej Karpathy-MicroGrad

April 15 2024

2.5 hr video of micrograd. I wish I could’ve watched this video 5 yrs earlier! It clears up so many questions about loss.backward()!

Andrej Karpathy-GPT continues

April 12 2024

Andrej took a sleep break and here is the second part of his instruction.

Andrej Karpathy-GPT from scratch

April 06 2024

If there is one video you should watch about GPT, this is it. Karpathy’s deep dive into a code-level explanation of GPT is a blessing to all GenAI engineers.

Object Detection Summary

April 03 2024

I feel I was reading a lot of LLM related topics recently but getting far away from CV. I happened to read this post from v_JULY_v and it’s a good review for...

Legendre polynomials and HiPPO

March 31 2024

Study notes from the video presented by Albert Gu on S4, Structured State Space Sequence model

State Space Machine

March 30 2024

Structured State Space for Sequence Modeling (S4) paper by Albert Gu, 2021

RL 2024-3

March 29 2024

Focus on PPO in this post

RL 2024-2

March 28 2024

Focus on Policy Gradient in this post from Cameron Wolfe, and with implementation from Spinning Up, it’s essentially a feedforward network (MLP with 3 layer...

RL 2024-2

March 26 2024

Review some basic concepts in RL from this blog

GenAI by Hung-yi Lee 2024-02

March 24 2024

Continue with Part 1, The 5th way of prompt engineering. 5 Model Cooperation Model Cooperation could be due to cost. This is similar to MoE but no LLM arch...

MoE and Decoder-Only Transformer code

March 22 2024

Summary from this MoE link and this Decoder-only transformer link

LLM Pre-Training and Inference

March 20 2024

This is from Cameron Wolfe’s website and discusses LLM pretraining and inference in detail with code. Very educational and I will write multiple study notes...

Theorema Egregium

March 16 2024

Gauss’s Theorema Egregium, which is Latin for “Remarkable Theorem”, is a major result of differential geometry. I found a good introduction here and video by...

Kaiming’s ML overview at MIT

March 13 2024

Kaiming He joined MIT as an associate professor in Feb 2024 and delivered “Deep Learning Bootcamp” as his first public talk as an MIT professor. It’s very pleasant...

Entropy and Perplexity

March 12 2024

Understanding Shannon’s entropy is crucial to understanding concepts like cross entropy and KL divergence. Perplexity, however, is a concept that comes from NLP. Here i...
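A minimal sketch of how these quantities relate, assuming discrete distributions given as lists of probabilities and base-2 logarithms. The function names are hypothetical, and perplexity is taken here as 2 raised to the cross entropy.

```python
import math
from typing import Sequence


def entropy(p: Sequence[float]) -> float:
    """Shannon entropy H(p) in bits."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)


def cross_entropy(p: Sequence[float], q: Sequence[float]) -> float:
    """Cross entropy H(p, q) in bits; q must be nonzero wherever p is."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(p || q) = H(p, q) - H(p)."""
    return cross_entropy(p, q) - entropy(p)


def perplexity(p: Sequence[float], q: Sequence[float]) -> float:
    """Perplexity of model q against true distribution p: 2 ** H(p, q)."""
    return 2 ** cross_entropy(p, q)


if __name__ == "__main__":
    true_dist = [0.5, 0.5]
    model_dist = [0.25, 0.75]
    print(entropy(true_dist), kl_divergence(true_dist, model_dist))
    print(perplexity(true_dist, model_dist))
```

When the model matches the true distribution, the KL term vanishes and perplexity reduces to 2 raised to the entropy, which for a uniform distribution over k outcomes is k.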

GenAI by Hung-yi Lee 2024-01

March 10 2024

It’s time to learn ML/GenAI with Dr Lee AGAIN in 2024.

Prompts

March 07 2024

Prompt methods Notes for this prompt engineering website 1 Automatic Reasoning and Tool-use (ART) Key idea: Add code executing results in the chaining prompt ...

Prompt Engineering for Anthropic

March 03 2024

Various prompt engineering tricks for Anthropic Claude

Nvidia GenAI Stack

February 25 2024

Nvidia NeMo GenAI framework on DGX Cloud/Kubernetes Clusters AutoConfigurator SFT and PEFT

Git

February 21 2024

Migrate my notes on Git from Google Keep to here

SQL 101

February 17 2024

The first new thing I learnt in 2024 is SQL. I gave it a try back in 2022 while job hunting, and was asked to write SQL queries in a Databricks SA interview. I never writ...

LLM overview in 2024

February 10 2024

Why SFT is before RLHF RLHF needs basic capability from the model. 2 steps in RLHF If it only learns from HF, then you know the answer to a...

Python tips

February 08 2024

Random Python tips I collected recently.

Distributed System comparison

February 05 2024

Again, I summarized the comparison between distributed systems here. A couple of interesting points about Dask

Pytorch - 3 Distributed Computing

January 30 2024

Pytorch gets its own distributed implementation, either through MPI backend, point to point, inspiration for torch.distributed Meta’s own [GLOO](https://gi...

Pytorch - 2 Model Parallel

January 29 2024

Continue Pytorch notes with Single-Machine Model Parallel. Model Parallel When the model is too large to fit into a single GPU, model parallel is necessary. ...

PyTorch - 1 Data Parallel

January 28 2024

My ML journey started with building NN layer by layer with Tensorflow in 2016. Keras was invented but I still wanted to know more details of Tensorflow but s...

Llama2 tricks

January 26 2024

A good YouTube video explains several tricks applied in Llama2. Here are the study notes. 1. Layer normalization Batch norm: normalized by columns (same feat...

Workshops and Buildouts

January 24 2024

One interesting part of my job is to create workshops and buildouts (one type of workshop focusing on AWS building). Here are some screenshots of my previous...

AlphaGeometry

January 19 2024

I have been sick since the trip to China and haven’t fully recovered even till today. I totally underestimated the damage of 雾霾（smog, a word combining smoke ...
