A Picture is Worth 170 Tokens: How Does GPT-4o Encode Images?
Here’s a fact: GPT-4o charges 170 tokens to process each 512x512 tile used in high-res mode. At ~0.75 tokens/word, this suggests a picture is worth about 227 words—only a factor of four off from the traditional saying.
(There’s also an 85 tokens charge for a low-res ‘master thumbnail’ of each picture and higher resolution images are broken into many such 512x512 tiles, but let’s just focus on a single high-res tile.
Let's Play Jeopardy! with LLMs
How good are LLMs at trivia? I used the Jeopardy! dataset from Kaggle to benchmark ChatGPT and the new Llama 3 models. Here are the results:
There you go. You’ve already gotten 90% of what you’re going to get out of this article. Some guy on the internet ran a half-baked benchmark on a handful of LLM models, and the results were largely in line with popular benchmarks and received wisdom on fine-tuning and RAG.
My Dinner with ChatGPT
It's hard to talk about ChatGPT without cherry-picking. It's too easy to try a dozen different prompts, refresh each a handful of times, and report the most interesting or impressive thing from those sixty trials. While this problem plagues a lot of the public discourse around generative models, cherry-picking is particularly problematic for ChatGPT because it's actively using the chat history as context.
A History of Encabulation
To celebrate the 100th anniversary of the birth of encabulation - dated from Dr. Wolfgang Albrecht Klossner’s first successful run in that historic barn on the outskirts of Eisenhüttenstadt - this article* collects in one place a number of resources that provide, if not a comprehensive history, at least a catalogue of the major milestones and concepts.
ML From Scratch, Part 6: Principal Component Analysis
In the previous article in this series we distinguished between two kinds of unsupervised learning (cluster analysis and dimensionality reduction) and discussed the former in some detail. In this installment we turn our attention to the later.
In dimensionality reduction we seek a function $f : \mathbb{R}^n \mapsto \mathbb{R}^m$ where $n$ is the dimension of the original data $\mathbf{X}$ and $m$ is less than or equal to $n$.
ML From Scratch, Part 5: Gaussian Mixture Models
Consider the following motivating dataset:
It is apparent that these data have some kind of structure; which is to say, they certainly are not drawn from a uniform or other simple distribution. In particular, there is at least one cluster of data in the lower right which is clearly separate from the rest.
Adaptive Basis Functions
Today, let me be vague. No statistics, no algorithms, no proofs. Instead, we’re going to go through a series of examples and eyeball a suggestive series of charts, which will imply a certain conclusion, without actually proving anything; but which will, I hope, provide useful intuition.
The premise is this:
ML From Scratch, Part 4: Decision Trees
So far in this series we’ve followed one particular thread: linear regression -> logistic regression -> neural network. This is a very natural progression of ideas, but it really represents only one possible approach. Today we’ll switch gears and look at a model with completely different pedigree: the decision tree, sometimes also referred to as Classification and Regression Trees, or simply CART models.
ML From Scratch, Part 3: Backpropagation
In today’s installment of Machine Learning From Scratch we’ll build on the logistic regression from last time to create a classifier which is able to automatically represent non-linear relationships and interactions between features: the neural network. In particular I want to focus on one central algorithm which allows us to apply gradient descent to deep neural networks: the backpropagation algorithm.
ML From Scratch, Part 2: Logistic Regression
In this second installment of the machine learning from scratch we switch the point of view from regression to classification: instead of estimating a number, we will be trying to guess which of 2 possible classes a given input belongs to. A modern example is looking at a photo and deciding if its a cat or a dog.
ML From Scratch, Part 1: Linear Regression
To kick off this series, will start with something simple yet foundational:
linear regression via ordinary least squares. While not particularly
exciting, linear regression finds widespread use both as a standalone
learning algorithm and as a building block in more advanced learning
algorithms.
ML From Scratch, Part 0: Introduction
Motivation As an apprentice, every new magician must prove to his own satisfaction, at least once, that there is truly great power in magic. —The Flying Sorcerers, by David Gerrold and Larry Niven
How do you know if you really understand something? You could just rely on the subjective experience of feeling like you understand.
Visualizing Multiclass Classification Results
Introduction Visualizing the results of a binary classifier is already a challenge, but having more than two classes aggravates the matter considerably.
Let’s say we have $k$ classes. Then for each observation, there is one correct prediction and $k-1$ possible incorrect prediction. Instead of a $2 \times 2$ confusion matrix, we have a $k^2$ possibilities.