Hi Babak,
this is an overview of the projects I would like to do as a continuation of our recent cooperation. It is meant as a basis for discussing the scope and budget; hopefully we can find a way to cooperate further.
There is a separate page describing the reasons why we want exactly what is on this page.
Goal
Our goal is to implement a simple search engine for medical texts. The input is a medical text. We want to embed it into a vector (using the OpenAI API) and then search for corresponding articles that contain similar information.
In general we need to solve the following tasks:
1) Get the article data from the web - the first 10k you already did; for the next 200+k we have not yet agreed on a budget. In the future there may be more articles/webpages to process.
2) Split files - we want to embed parts of the scraped articles into vectors using the AI Embedding API. To do that, we first need to split each file into parts, and there is a question of how best to do it. Our current opinion is to separate each article into several parts, embed each part into a vector, and also embed the whole article as one vector; we will then experiment with the results of these embeddings. The parts should be separated at the captions inside the article. Each part should have some maximum size; if it is bigger, it should be split further. We would only split at paragraph boundaries, never inside a paragraph. If a short remainder is left over, it should be appended to the previous part, so that we don't create short fragments of text.
The deliverable of this project would be a PHP script for splitting the parsed scraped files from (1). The input will be a directory of the parsed HTML files and the output a directory of same-named split files, with the parts separated by two empty lines or something similar.
Some adjustment of the input files may be needed, e.g. removing certain parts of the parsed files; this would be finalized during the process but should be accounted for in the budget. It should be something simple.
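To make the splitting rules concrete, here is a rough PHP sketch of the core logic. The assumptions are mine: captions are <h2> tags, paragraphs are <p> tags, and the size limits are placeholder values; all of this must match the real output of the parsing in (1).

```php
<?php
// Rough sketch of the splitting rules described above.

const MAX_PART_SIZE = 4000;  // assumed maximum size of one part, in bytes
const MIN_PART_SIZE = 500;   // remainders shorter than this are merged back

// Split an article into parts at the captions; oversized parts are split
// further, but only at paragraph boundaries.
function split_article(string $html): array {
    $parts = [];
    // lookahead split keeps each caption at the start of its part
    foreach (preg_split('/(?=<h2)/i', $html, -1, PREG_SPLIT_NO_EMPTY) as $section) {
        foreach (split_by_size(trim($section)) as $part) {
            $parts[] = $part;
        }
    }
    return $parts;
}

function split_by_size(string $section): array {
    if (strlen($section) <= MAX_PART_SIZE) {
        return [$section];
    }
    $parts = [];
    $current = '';
    foreach (preg_split('/(?=<p)/i', $section, -1, PREG_SPLIT_NO_EMPTY) as $paragraph) {
        // close the current part before it would exceed the limit;
        // a single oversized paragraph is never split internally
        if ($current !== '' && strlen($current) + strlen($paragraph) > MAX_PART_SIZE) {
            $parts[] = $current;
            $current = '';
        }
        $current .= $paragraph;
    }
    if ($current !== '') {
        $parts[] = $current;
    }
    // a short trailing remainder is appended to the previous part
    $n = count($parts);
    if ($n > 1 && strlen($parts[$n - 1]) < MIN_PART_SIZE) {
        $parts[$n - 2] .= array_pop($parts);
    }
    return $parts;
}
```

The surrounding script would then loop over the input directory and write each file's parts, joined by two empty lines, into a same-named output file.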
3) Embed each article using the OpenAI Embedding API. This means writing a PHP script that calls the OpenAI embedding API to embed the split files from the previous point into vectors. The input would be the result of point (2), i.e. a directory of split HTML content. The output should be a directory of same-named files, each containing a list of vectors - the first vector being the vector of the whole file and the subsequent vectors being the vectors of the individual parts.
The embedding should ideally be done with a PHP script run from the command line using a local PHP installation, not on the web. We prefer this language because we may later want to embed the script into our pipeline, which is written in PHP.
The deliverable of this part would be the script(s), not the data itself, because we will want to repeat the same process with other scraped data.
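As an illustration, a minimal sketch of the embedding call using only a plain HTTP request via PHP streams (no extra libraries, anticipating point 10). The model name is an assumption; the endpoint and payload shape follow OpenAI's public embeddings API.

```php
<?php
// Sketch: embed a batch of texts with one HTTP POST to the OpenAI API.

function build_embedding_payload(array $texts): string {
    return json_encode([
        'model' => 'text-embedding-3-small',  // assumed model choice
        'input' => $texts,                    // several parts per request
    ]);
}

function embed_texts(array $texts, string $apiKey): array {
    $context = stream_context_create(['http' => [
        'method'        => 'POST',
        'header'        => "Content-Type: application/json\r\n" .
                           "Authorization: Bearer $apiKey\r\n",
        'content'       => build_embedding_payload($texts),
        'timeout'       => 60,
        'ignore_errors' => true,  // read the body even on an HTTP error status
    ]]);
    $response = file_get_contents('https://api.openai.com/v1/embeddings', false, $context);
    if ($response === false) {
        throw new RuntimeException('Embedding request failed');
    }
    $decoded = json_decode($response, true);
    if (!isset($decoded['data'])) {
        throw new RuntimeException('Unexpected response: ' . $response);
    }
    // one vector (array of floats) per input text, in the same order
    return array_map(fn($item) => $item['embedding'], $decoded['data']);
}
```

The driver script would read each split file, explode it on the two-empty-line separator, prepend the whole text as the first item, call embed_texts(), and write the vectors (e.g. one JSON array per line) into a same-named output file.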
4) The vectors should be uploaded to a vector database. We would like to use Pinecone over its API. We have not yet set up any infrastructure in that DB. The task here is to write a PHP script that takes the directory of embedding files from point (3) and uploads them into Pinecone (creating a new namespace in a given index; the script would receive the API keys and similar data as input). You would test the script with your DB and your OpenAI API and then hand it over along with documentation of what we need to do to prepare the whole scenario. Alternatively, you could use our DB and our keys, and we would do the preparation beforehand based on your guidance. There would be two different sets of vectors for this task: each article has one summary vector and then several sub-vectors for the parts inside the article. These would go into two different Pinecone namespaces, one serving searches over the main vectors and the other searches over the sub-vectors.
Each vector would have an ID (or whatever Pinecone calls it) derived from the filename of the embedding file.
So for example: there would be an embedding file 0000123.html containing 1 main vector and 4 sub-vectors. The script would create 2 namespaces in Pinecone, one containing the main vector → “0000123” and the other containing 4 vectors, all pointing at “0000123”. If a search matches any of these vectors, we receive the ID “0000123” and know which article was found by that search.
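A sketch of how the upload could look over Pinecone's REST API with plain HTTP. One detail worth noting: vector IDs must be unique within a Pinecone namespace, so the four sub-vectors cannot all literally carry the ID "0000123"; in this sketch they get a numeric suffix (a hypothetical scheme, the article ID could equally go into vector metadata), and the article ID is recovered by stripping the suffix from a match. The index host and namespace names are inputs.

```php
<?php
// Sketch: upsert one article's vectors into a given Pinecone namespace.

function build_upsert_payload(string $articleId, array $vectors, string $namespace): string {
    $items = [];
    foreach ($vectors as $i => $values) {
        $items[] = [
            // a single vector keeps the bare article ID (main-vector namespace);
            // multiple vectors get a numeric suffix (sub-vector namespace)
            'id'     => count($vectors) === 1 ? $articleId : "$articleId-$i",
            'values' => $values,  // the raw list of floats from the embedding file
        ];
    }
    return json_encode(['vectors' => $items, 'namespace' => $namespace]);
}

function pinecone_upsert(string $indexHost, string $apiKey, string $payload): array {
    $context = stream_context_create(['http' => [
        'method'  => 'POST',
        'header'  => "Content-Type: application/json\r\nApi-Key: $apiKey\r\n",
        'content' => $payload,
        'timeout' => 60,
    ]]);
    $response = file_get_contents("https://$indexHost/vectors/upsert", false, $context);
    if ($response === false) {
        throw new RuntimeException('Pinecone upsert failed');
    }
    return json_decode($response, true);
}
```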
5) Write a PHP script that takes a text as input, embeds it into a vector using the OpenAI API, and finds a given number of best matches in a given Pinecone namespace using the Pinecone API.
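A corresponding sketch for the query side, again over plain HTTP. It assumes the input text has already been embedded into $vector by the embedding step; the index host, namespace, and number of matches come in as parameters, and the /query endpoint and Api-Key header follow Pinecone's REST API.

```php
<?php
// Sketch: find the best matches for a query vector in one namespace.

function build_query_payload(array $vector, int $topK, string $namespace): string {
    return json_encode([
        'vector'    => $vector,
        'topK'      => $topK,
        'namespace' => $namespace,
    ]);
}

function pinecone_query(string $indexHost, string $apiKey, array $vector,
                        int $topK, string $namespace): array {
    $context = stream_context_create(['http' => [
        'method'  => 'POST',
        'header'  => "Content-Type: application/json\r\nApi-Key: $apiKey\r\n",
        'content' => build_query_payload($vector, $topK, $namespace),
        'timeout' => 60,
    ]]);
    $response = file_get_contents("https://$indexHost/query", false, $context);
    if ($response === false) {
        throw new RuntimeException('Pinecone query failed');
    }
    // each match carries the vector ID (derived from the article filename) and a score
    return json_decode($response, true)['matches'] ?? [];
}
```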
6) Scrape the bigger data set (doctors answering questions at https://www.ulekare.cz/poradna-lekare). These are 230k webpages. They need to be parsed a little differently than in point 1: no paragraphs and captions, just 3 parts - 1 question and 1-2 answers - written out as 3 HTML paragraphs separated by captions, so that the embedding script is guaranteed to embed the parts separately.
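To illustrate the intended output shape, here is a hypothetical helper; the caption texts and tag names are placeholder assumptions that should mirror whatever the parser from point 1 emits, so that the splitting and embedding scripts treat the question and each answer as separate parts.

```php
<?php
// Hypothetical formatter for one scraped Q&A page: one question plus
// 1-2 answers, each introduced by a caption so they split cleanly.
function format_qa_page(string $question, array $answers): string {
    $out = "<h2>Question</h2>\n<p>" . $question . "</p>\n";
    foreach ($answers as $i => $answer) {
        $out .= "<h2>Answer " . ($i + 1) . "</h2>\n<p>" . $answer . "</p>\n";
    }
    return $out;
}
```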
7) For embedding this data we would like to use the same script as in point 3. That script expects the first vector to be created from the whole text and the rest from the sub-parts. For this data set we need to omit the first vector and only create the sub-vectors. This way the uploading script (4) will upload the first sub-vector (i.e. the question) separately from the answer(s). The script in 3 must therefore be updated to make the creation of the first whole-article vector optional.
8) All of this will incur costs for the underlying services - the embedding API and vector DB usage. The idea is that the vendor of the scripts bears these costs for the testing needed to produce the scripts, and we bear the costs of production usage. In both cases we need to know in advance approximately what these costs will be, because the project specs will need to include a maximum of covered costs.
9) This description outlines the program for several PPH projects, but it is NOT yet a full project specification. The projects have some unclear points that need to be discussed. The definitive form of the PPH projects, including the negotiated price, must still be derived from this description. The vendor is expected to cooperate with us on delivering this before starting the work in PPH.
UPDATE 10) We need the PHP scripts to use only plain HTTP requests, not any more complex libraries. For the reasons, see the reasons page.
Last) If you have any thoughts about the solution, or suggestions for how it could be improved or changed for good reason, we can discuss them.
Projects
To recap the projects and set their expected budgets:
1) Already done.
2) Splitting scripts. Expected budget 100 euro.
3) Embedding scripts. Should be straightforward, expected budget 80 euro (plus the API service costs).
4) Script for Pinecone upload. Similar stuff but more complex, 150 euro (plus the API service costs).
5) Finding matches from Pinecone - this should be easy, 70 euro.
6) Scrape the bigger data. This is 220k pages that need to be scraped similarly to what you did in (1). You offered a price of 1600+ euro; that's way too much for us. We could offer 800 euro for this task. Could you find a way to do it for that money?
7) This is really a part of 3; I only explained it separately for clarity.
8 and 9) These are points that must be determined before we start on any of the projects.
I would like to discuss this plan with you. If you find a way to do some parts of it within the suggested budgets, we could define them as PPH projects and award them to you directly.
I would prefer you, as you delivered a high-quality result, but the budget is very important for us as well, so I may possibly need to try getting it done by other people. Expecting your comments.
