
Quick Notes on Working with Oracle Vector Database

wptr33 2024-12-26 17:07 58 views

1. Basic Demo:

|   c(2,6) . . . . b(5,6)
|   .
|   .
|   a(2,2)
|_________________________

|b-a| = sqrt( (5-2)^2 + (6-2)^2 ) = 5

SELECT VECTOR_DISTANCE( vector('[2,2]'), vector('[5,6]'), EUCLIDEAN ) as distance;

How about COSINE?
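
To answer that question outside the database, here is a minimal Python sketch that computes both metrics for the demo points a(2,2) and b(5,6). It assumes cosine distance is defined as 1 minus cosine similarity; the helper names are illustrative and not part of any Oracle API.

```python
import math

def euclidean_distance(u, v):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))

def cosine_distance(u, v):
    """1 - cosine similarity; depends only on the angle between u and v."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return 1.0 - dot / (norm_u * norm_v)

a, b = [2, 2], [5, 6]
print("EUCLIDEAN:", euclidean_distance(a, b))
print("COSINE   :", cosine_distance(a, b))
```

Seen from the origin, the two points lie in almost the same direction, so the cosine distance comes out very small (roughly 0.004) even though the Euclidean distance is 5. Cosine ignores vector length, which is why it is the usual choice for text embeddings.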

CREATE TABLE IF NOT EXISTS embedding_store_hysun (
collection_name VARCHAR2(200) NOT NULL,
embedding VECTOR(*, FLOAT32) NOT NULL,
doc CLOB NOT NULL,
src VARCHAR2(500)
);

############################ In database embedding ############################

#EXEC DBMS_VECTOR.DROP_ONNX_MODEL(model_name => 'doc_model', force => true);
#SQL> grant DB_DEVELOPER_ROLE to vector;
SQL> grant create mining model to pocuser;
Grant succeeded.
SQL> create or replace directory HYSUN_DUMP as '/u01/ords_sw/hysun_dump';
Directory HYSUN_DUMP created.
SQL> grant read on directory HYSUN_DUMP to pocuser;
Grant succeeded.

EXECUTE DBMS_VECTOR.LOAD_ONNX_MODEL('HYSUN_DUMP','bge-base-zh-v1.5.onnx','hysun_bge_zh_model',JSON('{"function" : "embedding", "embeddingOutput" : "embedding"}'));

SELECT MODEL_NAME, MINING_FUNCTION, ALGORITHM, ALGORITHM_TYPE, MODEL_SIZE
FROM USER_MINING_MODELS;

SQL> INSERT INTO embedding_store_hysun select 'DB_EMBED_TEST0', VECTOR_EMBEDDING(hysun_bge_zh_model USING 'Minimum Age to Get a Licence The minimum age to get a licence. minimum age' as input), 'Minimum Age to Get a Licence The minimum age to get a licence. minimum age', '/home/hysunhe/projects/oracle_vectordb/source_data/cdc_poc/QA_1.txt' from dual;
1 row inserted.

SQL> INSERT INTO embedding_store_hysun select 'DB_EMBED_TEST0', VECTOR_EMBEDDING(hysun_bge_zh_model USING 'Minimum Requirements for Enrolment The list of requirements/ enrolment prerequisites that needs to be met before enrolment. class 3/3a, Class 3A, class 2B, class 2, minimum requirements, enrolment' as input), 'Minimum Requirements for Enrolment The list of requirements/ enrolment prerequisites that needs to be met before enrolment. class 3/3a, Class 3A, class 2B, class 2, minimum requirements, enrolment', '/home/hysunhe/projects/oracle_vectordb/source_data/cdc_poc/QA_2.txt' from dual;
1 row inserted.

SQL> SELECT VECTOR_EMBEDDING(hysun_bge_zh_model USING 'minimum age to get a license' as input) AS embedding;

SELECT
collection_name,
embedding,
doc,
src,
VECTOR_DISTANCE(embedding, VECTOR_EMBEDDING(hysun_bge_zh_model USING 'minimum age to get a license' as input), COSINE) as distance
FROM embedding_store_hysun
WHERE
collection_name = 'DB_EMBED_TEST0'
ORDER BY distance
FETCH FIRST 3 ROWS ONLY;

######################## In database embedding end ########################

### Index:

show parameter vector_memory_size;
ALTER SYSTEM SET vector_memory_size=1G SCOPE=BOTH; -- the parameter takes a size, not ON; 1G is only an example value
SELECT value FROM V$PARAMETER WHERE name='sga_target'; -- (max vector_memory_size = 70% SGA)
SELECT CON_ID, sum(alloc_bytes) / 1024 / 1024 FROM V$VECTOR_MEMORY_POOL GROUP BY CON_ID;
SELECT CON_ID, sum(USED_BYTES) / 1024 / 1024 FROM V$VECTOR_MEMORY_POOL GROUP BY CON_ID;

############################################################

In-Memory Neighbor Graph Vector Index (HNSW)

############################################################

create table galaxies (id number, name varchar2(50), doc varchar2(500), embedding vector);
insert into galaxies values (1, 'M31', 'Messier 31 is a barred spiral galaxy in the Andromeda constellation which has a lot of barred spiral galaxies.', '[0,2,2,0,0]');
insert into galaxies values (2, 'M33', 'Messier 33 is a spiral galaxy in the Triangulum constellation.', '[0,0,1,0,0]');
insert into galaxies values (3, 'M58', 'Messier 58 is an intermediate barred spiral galaxy in the Virgo constellation.', '[1,1,1,0,0]');
insert into galaxies values (4, 'M63', 'Messier 63 is a spiral galaxy in the Canes Venatici constellation.', '[0,0,1,0,0]');
insert into galaxies values (5, 'M77', 'Messier 77 is a barred spiral galaxy in the Cetus constellation.', '[0,1,1,0,0]');
insert into galaxies values (6, 'M91', 'Messier 91 is a barred spiral galaxy in the Coma Berenices constellation.', '[0,1,1,0,0]');
insert into galaxies values (7, 'M49', 'Messier 49 is a giant elliptical galaxy in the Virgo constellation.', '[0,0,0,1,1]');
insert into galaxies values (8, 'M60', 'Messier 60 is an elliptical galaxy in the Virgo constellation.', '[0,0,0,0,1]');
insert into galaxies values (9, 'NGC1073', 'NGC 1073 is a barred spiral galaxy in Cetus constellation.', '[0,1,1,0,0]');
SELECT name
FROM galaxies
ORDER BY VECTOR_DISTANCE( embedding, to_vector('[0,1,1,0,0]'), COSINE )
FETCH FIRST 3 ROWS ONLY;
SELECT name,
ROUND( VECTOR_DISTANCE( embedding, to_vector('[0,1,1,0,0]'), COSINE ), 2) as distance
FROM galaxies
ORDER BY distance
FETCH APPROXIMATE FIRST 4 ROWS ONLY;
-- WITH TARGET ACCURACY 90
EXPLAIN PLAN FOR
SELECT name,
VECTOR_DISTANCE( embedding, to_vector('[0,1,1,0,0]'), COSINE ) as distance
FROM galaxies
ORDER BY distance
FETCH APPROXIMATE FIRST 4 ROWS ONLY;
select plan_table_output from table(dbms_xplan.display('plan_table',null,'all'));
-- Option 1: HNSW index with default parameters
CREATE VECTOR INDEX galaxies_hnsw_idx ON galaxies (embedding) ORGANIZATION
INMEMORY NEIGHBOR GRAPH
DISTANCE COSINE
WITH TARGET ACCURACY 95;
-- Option 2: explicit HNSW parameters (same index name, so drop the index from option 1 first)
CREATE VECTOR INDEX galaxies_hnsw_idx ON galaxies (embedding) ORGANIZATION
INMEMORY NEIGHBOR GRAPH
DISTANCE COSINE
WITH TARGET ACCURACY 90 PARAMETERS (type HNSW, neighbors 40, efconstruction 500);
SELECT name,
ROUND(VECTOR_DISTANCE( embedding, to_vector('[0,1,1,0,0]'), COSINE ), 3) distance
FROM galaxies
WHERE name <> 'NGC1073'
ORDER BY distance
FETCH APPROXIMATE FIRST 4 ROWS ONLY WITH TARGET ACCURACY 90;
drop INDEX galaxies_hnsw_idx;
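
To see what the exact (non-approximate) search over the `galaxies` rows should return, the ranking can be reproduced by brute force. The following is a Python sketch only, using the nine vectors inserted above and assuming COSINE distance means 1 minus cosine similarity; `exact_top_k` is a hypothetical helper name.

```python
import math

# The nine embeddings inserted into the galaxies table above.
GALAXIES = {
    "M31": [0, 2, 2, 0, 0],
    "M33": [0, 0, 1, 0, 0],
    "M58": [1, 1, 1, 0, 0],
    "M63": [0, 0, 1, 0, 0],
    "M77": [0, 1, 1, 0, 0],
    "M91": [0, 1, 1, 0, 0],
    "M49": [0, 0, 0, 1, 1],
    "M60": [0, 0, 0, 0, 1],
    "NGC1073": [0, 1, 1, 0, 0],
}

def cosine_distance(u, v):
    """1 - cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return 1.0 - dot / (norm_u * norm_v)

def exact_top_k(query, k):
    """Score every row against the query and keep the k closest (brute force)."""
    scored = [(name, round(cosine_distance(vec, query), 3))
              for name, vec in GALAXIES.items()]
    return sorted(scored, key=lambda item: item[1])[:k]

for name, dist in exact_top_k([0, 1, 1, 0, 0], 5):
    print(name, dist)
```

Four rows tie at distance 0: M77, M91 and NGC1073 equal the query vector, and M31 points in the same direction with twice the length. Which three of them a `FETCH FIRST 3 ROWS ONLY` returns therefore depends on tie-breaking, so differing results between runs or between exact and approximate searches are not necessarily an accuracy problem.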

##############################################################

Neighbor Partition Vector Index (IVF)

##############################################################

-- Option 1: IVF index with default parameters
CREATE VECTOR INDEX galaxies_ivf_idx ON galaxies (embedding) ORGANIZATION
NEIGHBOR PARTITIONS
DISTANCE COSINE
WITH TARGET ACCURACY 95;
-- Option 2: explicit IVF parameters (same index name, so drop the index from option 1 first)
CREATE VECTOR INDEX galaxies_ivf_idx ON galaxies (embedding) ORGANIZATION
NEIGHBOR PARTITIONS
DISTANCE COSINE
WITH TARGET ACCURACY 90 PARAMETERS (type IVF, neighbor partitions 100);
-- The APPROX and APPROXIMATE keywords are optional. If they are omitted while connected to an
-- Autonomous Database Serverless (ADB-S) instance, an approximate search using a vector index
-- is attempted if one exists.
-- Accuracy report
SET SERVEROUTPUT ON
declare
report varchar2(128);
begin
report := dbms_vector.index_accuracy_query(
OWNER_NAME => 'POCUSER',
INDEX_NAME => 'GALAXIES_IVF_IDX',
qv => to_vector('[0,1,1,0,0]'),
top_K => 10,
target_accuracy => 95 );
dbms_output.put_line(report);
end;
/
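
As a rough intuition for what such an accuracy report measures, the sketch below computes recall: the share of the exact top-K results that an approximate search also returned. This is an assumption about the general idea, not a description of how `dbms_vector.index_accuracy_query` is implemented, and both result lists are hypothetical.

```python
def recall_at_k(approx_ids, exact_ids):
    """Fraction of the exact top-k results that the approximate search also found."""
    exact = set(exact_ids)
    if not exact:
        raise ValueError("exact_ids must not be empty")
    return len(exact & set(approx_ids)) / len(exact)

# Hypothetical result lists, for illustration only.
exact_top4 = ["M31", "M77", "M91", "NGC1073"]
approx_top4 = ["M77", "M91", "NGC1073", "M58"]

print(f"recall: {recall_at_k(approx_top4, exact_top4):.0%}")
```

With three of the four exact neighbours found, the recall here is 75%, which would fall short of a 95% target.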

-- Index detail:

grant read on VECSYS.VECTOR$INDEX to pocuser;
SELECT JSON_SERIALIZE(IDX_PARAMS RETURNING VARCHAR2 PRETTY)
FROM VECSYS.VECTOR$INDEX WHERE IDX_NAME = 'GALAXIES_IVF_IDX';
CREATE PUBLIC DATABASE LINK LinkToLA1 CONNECT TO vectordemo IDENTIFIED BY "welcome1" USING '146.235.233.91:1521/pdb1.sub08030309530.justinvnc1.oraclevcn.com';
select OWNER, DB_LINK, USERNAME, VALID, HOST from all_db_links;
alter session set global_names=false;
select 1 from dual@LINKTOLA1;

#### Memo

grant create any directory to pocuser;
create directory RAG_DOC_DIR as '/u01/hysun/rag_docs';
create table RAG_FILES (
file_name varchar2(500),
file_content BLOB
);
create table RAG_INDB_PIPELINE (
id number,
name varchar2(50),
doc varchar2(500),
embedding VECTOR
);
Declare
mFile VARCHAR2(500) := 'Oracle向量数据库_lab.pdf';
mBLOB BLOB := Empty_Blob();
mBinFile BFILE := BFILENAME('RAG_DOC_DIR', mFile);
Begin
DBMS_LOB.OPEN(mBinFile, DBMS_LOB.LOB_READONLY); -- Open BFILE
DBMS_LOB.CreateTemporary(mBLOB, TRUE, DBMS_LOB.Session); -- BLOB locator initialization
DBMS_LOB.OPEN(mBLOB, DBMS_LOB.LOB_READWRITE); -- Open BLOB locator for writing
DBMS_LOB.LoadFromFile(mBLOB, mBinFile, DBMS_LOB.getLength(mBinFile)); -- Reading BFILE into BLOB
DBMS_LOB.CLOSE(mBLOB); -- Close BLOB locator
DBMS_LOB.CLOSE(mBinFile); -- Close BFILE

INSERT INTO RAG_FILES(file_name, file_content) values (mFile, mBLOB);
commit;
End;
/
insert into RAG_FILES(file_name, file_content) values('oracle-vector-lab', to_blob(bfilename('RAG_DOC_DIR', 'Oracle向量数据库_lab.pdf')));
commit;
select DBMS_LOB.getLength(FILE_CONTENT) from RAG_FILES;
drop table rag_doc_chunks purge;
create table rag_doc_chunks (doc_id varchar2(500), chunk_id number, chunk_data varchar2(4000), chunk_embedding vector);
-- utl_to_text: PDF -> TEXT
-- utl_to_chunks: TEXT -> CHUNKS
-- utl_to_embeddings: CHUNKS -> VECTORS
insert into rag_doc_chunks
select
dt.file_name doc_id,
et.embed_id chunk_id,
et.embed_data chunk_data,
to_vector(et.embed_vector) chunk_embedding
from
rag_files dt,
dbms_vector_chain.utl_to_embeddings(
dbms_vector_chain.utl_to_chunks(
dbms_vector_chain.utl_to_text(dt.file_content),
json('{"normalize":"all"}')
),
json('{"provider":"database", "model":"mydoc_model"}')
) t,
JSON_TABLE(
t.column_value,
'$[*]' COLUMNS (
embed_id NUMBER PATH '$.embed_id',
embed_data VARCHAR2(4000) PATH '$.embed_data',
embed_vector CLOB PATH '$.embed_vector'
)
) et;
commit;
insert into rag_doc_chunks
select
dt.file_name doc_id,
et.embed_id chunk_id,
et.embed_data chunk_data,
to_vector(et.embed_vector) chunk_embedding
from
rag_files dt,
dbms_vector_chain.utl_to_embeddings(
dbms_vector_chain.utl_to_chunks(
dbms_vector_chain.utl_to_text(dt.file_content),
JSON('{ "by":"words",
"max":"240",
"overlap":"15",
"split":"recursively",
"language":"SIMPLIFIED CHINESE",
"normalize":"all" }')
),
json('{"provider":"database", "model":"mydoc_model"}')
) t,
JSON_TABLE(
t.column_value,
'$[*]' COLUMNS (
embed_id NUMBER PATH '$.embed_id',
embed_data VARCHAR2(4000) PATH '$.embed_data',
embed_vector CLOB PATH '$.embed_vector'
)
) et;
commit;
select
dbms_vector_chain.utl_to_chunks(dbms_vector_chain.utl_to_text(FILE_CONTENT), -- convert the PDF BLOB to text first
JSON('{ "by":"words",
"max":"240",
"overlap":"15",
"split":"recursively",
"language":"SIMPLIFIED CHINESE",
"normalize":"all" }'))
from RAG_FILES;
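
The `by`, `max` and `overlap` settings can be pictured with a simplified word-window chunker. This Python sketch is not Oracle's algorithm: it splits on whitespace only and ignores the `split`, `language` and `normalize` options, so it illustrates what `max` and `overlap` mean rather than reproducing `utl_to_chunks` output. It would not segment Chinese text, which has no whitespace between words.

```python
def chunk_by_words(text, max_words=240, overlap=15):
    """Split text into windows of at most max_words words; consecutive
    windows share `overlap` words so context is not cut off at a boundary."""
    if max_words <= 0 or not 0 <= overlap < max_words:
        raise ValueError("need max_words > 0 and 0 <= overlap < max_words")
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # the last window already reached the end of the text
    return chunks

sample = " ".join(f"w{i}" for i in range(10))
for chunk in chunk_by_words(sample, max_words=4, overlap=1):
    print(chunk)
```

Each chunk starts with the last word of the previous one, which is the effect the `overlap` setting is after.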
SELECT
dbms_vector.utl_to_embedding(
'This is a test',
json('{
"provider": "OCIGenAI",
"credential_name": "OCI_GENAI_CRED_FOR_APEX",
"url": "https://inference.generativeai.us-chicago-1.oci.oraclecloud.com/20231130/actions/embedText",
"model": "cohere.embed-multilingual-v3.0"
}')
) embedding
FROM dual;
SELECT
dbms_vector.utl_to_embedding(
'This is a test',
json('{
"provider": "database",
"model": "doc_model"
}')
) embedding
FROM dual;
create or replace directory MODELS_DIR as '/u01/hysun/models';
EXEC DBMS_VECTOR.DROP_ONNX_MODEL(model_name => 'mydoc_model', force => true);
-- BEGIN
-- DBMS_VECTOR.LOAD_ONNX_MODEL(
-- directory => 'MODELS_DIR',
-- file_name => 'bge-base-zh-v1.5.onnx',
-- model_name => 'mydoc_model',
-- metadata => JSON('{"function" : "embedding", "embeddingOutput" : "embedding", "input":{"input": ["DATA"]}}')
-- );
-- END;
-- /
BEGIN
DBMS_VECTOR.LOAD_ONNX_MODEL(
directory => 'MODELS_DIR',
file_name => 'bge-base-zh-v1.5.onnx',
model_name => 'mydoc_model'
);
END;
/
SELECT vector_embedding(mydoc_model using 'hello' as data);
select
chunk_data,
VECTOR_DISTANCE(chunk_embedding, VECTOR_EMBEDDING(mydoc_model USING '本次实验的先决条件' as data), COSINE) as distance -- query text: "prerequisites for this lab"
from rag_doc_chunks
order by distance
FETCH APPROX FIRST 1 ROWS ONLY;
-- grant CREATE CREDENTIAL
BEGIN
DBMS_VECTOR_CHAIN.CREATE_CREDENTIAL (
CREDENTIAL_NAME => 'LAB_OPENAI_CRED',
PARAMS => json('{ "access_token": "EMPTY" }')
);
END;
/
select dbms_vector_chain.utl_to_generate_text(
'Oracle 向量数据库是什么', -- prompt: "What is Oracle vector database?"
json('{
"provider": "openai",
"credential_name": "LAB_OPENAI_CRED",
"url": "http://146.235.226.110:8098/v1/chat/completions",
"model": "Qwen2-7B-Instruct"
}') ) from dual;
-- utl_to_generate_text returns a CLOB, so call it in the select list rather than as a row source in FROM
select
dt.chunk_data,
dbms_vector_chain.utl_to_generate_text(
dt.chunk_data,
json('{
"provider": "openai",
"credential_name": "LAB_OPENAI_CRED",
"url": "http://146.235.226.110:8098/v1/chat/completions",
"model": "Qwen2-7B-Instruct"
}')
) as rag
from (
select
chunk_data
from rag_doc_chunks
order by VECTOR_DISTANCE(chunk_embedding, VECTOR_EMBEDDING(mydoc_model USING '本次实验的先决条件' as data), COSINE)
FETCH APPROX FIRST 3 ROWS ONLY
) dt;
declare
l_question varchar2(500) := '本次实验的先决条件'; -- "prerequisites for this lab"
l_input CLOB;
l_clob CLOB;
j apex_json.t_values;
l_context CLOB;
l_rag_result CLOB;
begin
-- Step 1: retrieve the chunks most similar to the question from the vector store
for rec in (
select
chunk_data
from rag_doc_chunks
order by VECTOR_DISTANCE(chunk_embedding, VECTOR_EMBEDDING(mydoc_model USING l_question as data), COSINE)
FETCH APPROX FIRST 3 ROWS ONLY
) loop
l_context := l_context || rec.chunk_data || chr(10);
end loop;

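-- Step 2: prompt engineering: combine the retrieved context and the user question into the LLM input
-- (Prompt gist: "You are an honest, professional database Q&A assistant. Answer only from the provided context and do not make up answers. Context: ... Please answer the user's question in English: ...")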
l_input := '你是一个诚实且专业的数据库知识问答助手,请仅仅根据提供的上下文信息内容,回答用户的问题,且不要试图编造答案。\n 以下是上下文信息:' || replace(l_context, chr(10), '\n') || '\n请用英文回答用户问题:' || l_question;


-- Step 3: call the large language model to generate the RAG result
for rec in (select dbms_vector_chain.utl_to_generate_text(
l_input,
json('{
"provider": "openai",
"credential_name": "LAB_OPENAI_CRED",
"url": "http://146.235.226.110:8098/v1/chat/completions",
"model": "Qwen2-7B-Instruct"
}')
) as rag from dual) loop
dbms_output.put_line('*** RAG Result: ' || rec.rag);
end loop;
-- apex_json.parse(j, l_clob);
-- l_rag_result := apex_json.get_varchar2(p_path => 'choices[%d].message.content', p0 => 1, p_values => j);

-- dbms_output.put_line('*** RAG Result: ' || l_rag_result);
end;
/
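
The three steps in the block above (retrieve, build the prompt, generate) can be sketched outside the database as follows. This is a toy Python illustration under stated assumptions: the two-dimensional vectors and chunk texts are made up, `retrieve` and `build_prompt` are hypothetical helper names, and the actual LLM call (step 3) is left out so that the sketch runs without a network.

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return 1.0 - dot / (norm_u * norm_v)

def retrieve(query_vec, store, k=3):
    """Step 1: return the text of the k chunks closest to the query vector."""
    ranked = sorted(store, key=lambda row: cosine_distance(row[1], query_vec))
    return [text for text, _ in ranked[:k]]

def build_prompt(question, chunks):
    """Step 2: put the retrieved context and the question into one prompt."""
    context = "\n".join(chunks)
    return ("Answer only from the context below; do not make up an answer.\n"
            f"Context:\n{context}\n"
            f"Question: {question}")

# Toy store of (chunk_text, embedding) pairs; all values are made up.
store = [
    ("chunk A: about prerequisites", [1.0, 0.0]),
    ("chunk B: unrelated topic", [0.0, 1.0]),
    ("chunk C: more on prerequisites", [0.9, 0.1]),
]

chunks = retrieve([1.0, 0.05], store, k=2)
print(build_prompt("What are the prerequisites?", chunks))
# Step 3 would send this prompt to the LLM endpoint; it is omitted here.
```

Keeping retrieval and prompt assembly as separate functions mirrors the PL/SQL block, where the loop fills `l_context` first and `l_input` is built afterwards.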


srvctl stop instance -d ai23 -i ai232 -force
srvctl status database -d ai23
srvctl start instance -d ai23 -i ai232
