Commit 03eb4fc

committed
update tutorial
1 parent 43d4154 commit 03eb4fc

2 files changed

Lines changed: 644 additions & 0 deletions

Lines changed: 304 additions & 0 deletions
@@ -0,0 +1,304 @@
1+
{
2+
"cells": [
3+
{
4+
"cell_type": "markdown",
5+
"metadata": {},
6+
"source": [
7+
"# Simple RAG From Scratch"
8+
]
9+
},
10+
{
11+
"cell_type": "markdown",
12+
"metadata": {},
13+
"source": [
14+
"In this tutorial, we will use BGE, Faiss, and OpenAI's GPT-4o-mini to build a simple RAG system from scratch."
15+
]
16+
},
17+
{
18+
"cell_type": "markdown",
19+
"metadata": {},
20+
"source": [
21+
"## 1. Data"
22+
]
23+
},
24+
{
25+
"cell_type": "markdown",
26+
"metadata": {},
27+
"source": [
28+
"Suppose I live in Manhattan, New York, and I want an AI bot to suggest where to go for dinner. Letting it recommend an arbitrary restaurant is unreliable, so let's give it a list of our favorite restaurants to choose from."
29+
]
30+
},
31+
{
32+
"cell_type": "code",
33+
"execution_count": 11,
34+
"metadata": {},
35+
"outputs": [],
36+
"source": [
37+
"corpus = [\n",
38+
" \"Cheli: A downtown Chinese restaurant presents a distinctive dining experience with authentic and sophisticated flavors of Shanghai cuisine. Avg cost: $40-50\",\n",
39+
" \"Masa: Midtown Japanese restaurant with exquisite sushi and omakase experiences crafted by renowned chef Masayoshi Takayama. The restaurant offers a luxurious dining atmosphere with a focus on the freshest ingredients and exceptional culinary artistry. Avg cost: $500-600\",\n",
40+
" \"Per Se: A midtown restaurant features daily nine-course tasting menu and a nine-course vegetable tasting menu using classic French technique and the finest quality ingredients available. Avg cost: $300-400\",\n",
41+
" \"Ortomare: A casual, earthy Italian restaurant located uptown, offering wood-fired pizza, delicious pasta, wine & spirits & outdoor seating. Avg cost: $30-50\",\n",
42+
" \"Banh: Relaxed, narrow restaurant in uptown, offering Vietnamese cuisine & sandwiches, famous for its pho and Vietnam sandwich. Avg cost: $20-30\",\n",
43+
" \"Living Thai: An uptown typical Thai cuisine with different kinds of curry, Tom Yum, fried rice, Thai ice tea, etc. Avg cost: $20-30\",\n",
44+
" \"Chick-fil-A: A fast-food restaurant with great chicken sandwiches, fried chicken, fries, and salad, which can be found everywhere in New York. Avg cost: $10-20\",\n",
45+
" \"Joe's Pizza: The most famous New York pizza place, located in midtown, serving different flavors including classic pepperoni, cheese, spinach, and also innovative pizza. Avg cost: $15-25\",\n",
46+
" \"Red Lobster: In midtown, Red Lobster is a lively chain restaurant serving American seafood standards amid New England-themed decor, with fairly priced lobster, shrimp, and crab. Avg cost: $30-50\",\n",
47+
" \"Bourbon Steak: It accomplishes all the traditions expected from a steakhouse, offering the finest cuts of premium beef and seafood complemented by a wine and spirits program. Avg cost: $100-150\",\n",
48+
" \"Da Long Yi: Locates in downtown, Da Long Yi is a Chinese Szechuan spicy hotpot restaurant that serves good quality meats. Avg cost: $30-50\",\n",
49+
" \"Mitr Thai: An exquisite midtown Thai restaurant with traditional dishes as well as creative dishes, with a wonderful bar serving cocktails. Avg cost: $40-60\",\n",
50+
" \"Yichiran Ramen: Famous Japenese ramen restaurant in both midtown and downtown, serving ramen that can be designed by customers themselves. Avg cost: $20-40\",\n",
51+
" \"BCD Tofu House: Located in midtown, it's famous for its comforting and flavorful soondubu jjigae (soft tofu stew) and a variety of authentic Korean dishes. Avg cost: $30-50\",\n",
52+
"]\n",
53+
"\n",
54+
"user_input = \"I want some Chinese food\""
55+
]
56+
},
57+
{
58+
"cell_type": "markdown",
59+
"metadata": {},
60+
"source": [
61+
"## 2. Indexing"
62+
]
63+
},
64+
{
65+
"cell_type": "markdown",
66+
"metadata": {},
67+
"source": [
68+
"Now we need a fast yet sufficiently accurate way to retrieve the documents in the corpus that are most closely related to our question. A vector index is a good choice here.\n",
69+
"\n",
70+
"The first step is to embed each document into a vector. We use bge-base-en-v1.5 as our embedding model."
71+
]
72+
},
73+
{
74+
"cell_type": "code",
75+
"execution_count": 12,
76+
"metadata": {},
77+
"outputs": [],
78+
"source": [
79+
"from FlagEmbedding import FlagModel\n",
80+
"\n",
81+
"model = FlagModel('BAAI/bge-base-en-v1.5',\n",
82+
" query_instruction_for_retrieval=\"Represent this sentence for searching relevant passages:\",\n",
83+
" use_fp16=True)\n",
84+
"\n",
85+
"embeddings = model.encode(corpus, convert_to_numpy=True)"
86+
]
87+
},
88+
{
89+
"cell_type": "code",
90+
"execution_count": 13,
91+
"metadata": {},
92+
"outputs": [
93+
{
94+
"data": {
95+
"text/plain": [
96+
"(14, 768)"
97+
]
98+
},
99+
"execution_count": 13,
100+
"metadata": {},
101+
"output_type": "execute_result"
102+
}
103+
],
104+
"source": [
105+
"embeddings.shape"
106+
]
107+
},
108+
{
109+
"cell_type": "markdown",
110+
"metadata": {},
111+
"source": [
112+
"Then, let's create a Faiss index and add all the vectors to it.\n",
113+
"\n",
114+
"If you want to know more about Faiss, see the [Faiss and indexing](https://github.com/FlagOpen/FlagEmbedding/tree/master/Tutorials/3_Indexing) tutorial."
115+
]
116+
},
117+
{
118+
"cell_type": "code",
119+
"execution_count": 14,
120+
"metadata": {},
121+
"outputs": [],
122+
"source": [
123+
"import faiss\n",
124+
"import numpy as np\n",
125+
"\n",
126+
"index = faiss.IndexFlatIP(embeddings.shape[1])\n",
127+
"\n",
128+
"index.add(embeddings)"
129+
]
130+
},
131+
{
132+
"cell_type": "code",
133+
"execution_count": 15,
134+
"metadata": {},
135+
"outputs": [
136+
{
137+
"data": {
138+
"text/plain": [
139+
"14"
140+
]
141+
},
142+
"execution_count": 15,
143+
"metadata": {},
144+
"output_type": "execute_result"
145+
}
146+
],
147+
"source": [
148+
"index.ntotal"
149+
]
150+
},
151+
{
152+
"cell_type": "markdown",
153+
"metadata": {},
154+
"source": [
155+
"## 3. Retrieve and Generate"
156+
]
157+
},
158+
{
159+
"cell_type": "markdown",
160+
"metadata": {},
161+
"source": [
162+
"Now we come to the most exciting part. Let's first embed our query and retrieve the 3 most relevant documents from the index:"
163+
]
164+
},
165+
{
166+
"cell_type": "code",
167+
"execution_count": 16,
168+
"metadata": {},
169+
"outputs": [
170+
{
171+
"data": {
172+
"text/plain": [
173+
"array([['Cheli: A downtown Chinese restaurant presents a distinctive dining experience with authentic and sophisticated flavors of Shanghai cuisine. Avg cost: $40-50',\n",
174+
" 'Da Long Yi: Locates in downtown, Da Long Yi is a Chinese Szechuan spicy hotpot restaurant that serves good quality meats. Avg cost: $30-50',\n",
175+
" 'Yichiran Ramen: Famous Japenese ramen restaurant in both midtown and downtown, serving ramen that can be designed by customers themselves. Avg cost: $20-40']],\n",
176+
" dtype='<U270')"
177+
]
178+
},
179+
"execution_count": 16,
180+
"metadata": {},
181+
"output_type": "execute_result"
182+
}
183+
],
184+
"source": [
185+
"q_embedding = model.encode_queries([user_input], convert_to_numpy=True)\n",
186+
"\n",
187+
"D, I = index.search(q_embedding, 3)\n",
188+
"res = np.array(corpus)[I]\n",
189+
"\n",
190+
"res"
191+
]
192+
},
193+
{
194+
"cell_type": "markdown",
195+
"metadata": {},
196+
"source": [
197+
"Then set up the prompt for the chatbot:"
198+
]
199+
},
200+
{
201+
"cell_type": "code",
202+
"execution_count": 17,
203+
"metadata": {},
204+
"outputs": [],
205+
"source": [
206+
"prompt=\"\"\"\n",
207+
"You are a bot that makes recommendations for restaurants. \n",
208+
"Please be brief, answer in short sentences without extra information.\n",
209+
"\n",
210+
"This is the list of restaurants:\n",
211+
"{recommended_activities}\n",
212+
"\n",
213+
"The user's preference is: {user_input}\n",
214+
"Provide the user with 2 recommended restaurants based on the user's preference.\n",
215+
"\"\"\""
216+
]
217+
},
218+
{
219+
"cell_type": "markdown",
220+
"metadata": {},
221+
"source": [
222+
"Fill in your OpenAI API key below:"
223+
]
224+
},
225+
{
226+
"cell_type": "code",
227+
"execution_count": 18,
228+
"metadata": {},
229+
"outputs": [],
230+
"source": [
231+
"import os\n",
232+
"\n",
233+
"os.environ[\"OPENAI_API_KEY\"] = \"YOUR_API_KEY\""
234+
]
235+
},
236+
{
237+
"cell_type": "markdown",
238+
"metadata": {},
239+
"source": [
240+
"Finally, let's see what answer the chatbot gives us!"
241+
]
242+
},
243+
{
244+
"cell_type": "code",
245+
"execution_count": 19,
246+
"metadata": {},
247+
"outputs": [],
248+
"source": [
249+
"from openai import OpenAI\n",
250+
"client = OpenAI()\n",
251+
"\n",
252+
"response = client.chat.completions.create(\n",
253+
" model=\"gpt-4o-mini\",\n",
254+
" messages=[\n",
255+
" {\"role\": \"system\", \"content\": \"You are a helpful assistant.\"},\n",
256+
" {\n",
257+
" \"role\": \"user\",\n",
258+
" \"content\": prompt.format(user_input=user_input, recommended_activities=res)\n",
259+
" }\n",
260+
" ]\n",
261+
").choices[0].message"
262+
]
263+
},
264+
{
265+
"cell_type": "code",
266+
"execution_count": 20,
267+
"metadata": {},
268+
"outputs": [
269+
{
270+
"name": "stdout",
271+
"output_type": "stream",
272+
"text": [
273+
"1. Cheli - Authentic Shanghai cuisine with sophisticated flavors. \n",
274+
"2. Da Long Yi - Szechuan spicy hotpot with good quality meats.\n"
275+
]
276+
}
277+
],
278+
"source": [
279+
"print(response.content)"
280+
]
281+
}
282+
],
283+
"metadata": {
284+
"kernelspec": {
285+
"display_name": "base",
286+
"language": "python",
287+
"name": "python3"
288+
},
289+
"language_info": {
290+
"codemirror_mode": {
291+
"name": "ipython",
292+
"version": 3
293+
},
294+
"file_extension": ".py",
295+
"mimetype": "text/x-python",
296+
"name": "python",
297+
"nbconvert_exporter": "python",
298+
"pygments_lexer": "ipython3",
299+
"version": "3.12.4"
300+
}
301+
},
302+
"nbformat": 4,
303+
"nbformat_minor": 2
304+
}
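
The retrieval step in the notebook relies on `faiss.IndexFlatIP`, which performs an exact (brute-force) inner-product search. As a rough illustration of what such a search computes, here is a minimal pure-Python sketch. The toy 3-dimensional vectors and the helper names `normalize`, `inner_product`, and `top_k` are made up for this example and are not part of Faiss or FlagEmbedding; the real embeddings in the notebook have 768 dimensions.

```python
import math


def normalize(vec):
    """Scale a vector to unit length, so inner product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def inner_product(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def top_k(query, vectors, k):
    """Score every stored vector against the query (exhaustive search) and
    return (scores, indices) of the k highest inner products, best first."""
    scored = [(inner_product(query, v), i) for i, v in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    best = scored[:k]
    return [s for s, _ in best], [i for _, i in best]


# Hypothetical toy "embeddings", for illustration only.
toy_vectors = [normalize(v) for v in [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]]
toy_query = normalize([1.0, 0.0, 0.0])

scores, indices = top_k(toy_query, toy_vectors, k=2)
print(indices)
print(scores)
```

The two returned lists play the same role as `D` and `I` in the notebook's `index.search` call: similarity scores and the positions of the matching documents. A flat index scores every stored vector, which is perfectly adequate for a corpus of 14 documents.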

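
The retrieved documents eventually have to become plain text inside the prompt. One possible way to do that, sketched below, is to join them into a bulleted list before calling `str.format`. The helper `build_prompt`, the constant `PROMPT_TEMPLATE`, and the placeholder name `restaurants` are hypothetical names chosen for this sketch; the template wording loosely follows the notebook's prompt, and the two example documents are copied from the notebook's corpus.

```python
PROMPT_TEMPLATE = """
You are a bot that makes recommendations for restaurants.
Please be brief, answer in short sentences without extra information.

This is the list of restaurants:
{restaurants}

The user's preference is: {user_input}
Provide the user with 2 recommended restaurants based on the user's preference.
"""


def build_prompt(user_input, retrieved_docs):
    """Render the retrieved documents as a bulleted list inside the prompt."""
    restaurants = "\n".join(f"- {doc}" for doc in retrieved_docs)
    return PROMPT_TEMPLATE.format(restaurants=restaurants, user_input=user_input)


# Two documents copied from the notebook's corpus.
retrieved_docs = [
    "Cheli: A downtown Chinese restaurant presents a distinctive dining experience with authentic and sophisticated flavors of Shanghai cuisine. Avg cost: $40-50",
    "Da Long Yi: Locates in downtown, Da Long Yi is a Chinese Szechuan spicy hotpot restaurant that serves good quality meats. Avg cost: $30-50",
]

prompt_text = build_prompt("I want some Chinese food", retrieved_docs)
print(prompt_text)
```

Joining the documents first keeps array brackets, quotes, and dtype annotations out of the text the model sees, which a raw array representation would otherwise include.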