IngestDocument
This plugin is currently in beta. While it is considered safe for use, please be aware that its API could change in ways that are not compatible with earlier versions in future releases, or it might become unsupported.
Ingest documents into an embedding store.
Only text documents (TXT, HTML, Markdown) are supported for now.
type: "io.kestra.plugin.langchain4j.rag.IngestDocument"
Ingest documents into a KV embedding store.
WARNING: the KV embedding store is for quick prototyping only, as it stores the embedding vectors in a K/V store and loads them all into memory.
id: document-ingestion
namespace: company.team

tasks:
  - id: ingest
    type: io.kestra.plugin.langchain4j.rag.IngestDocument
    provider:
      type: io.kestra.plugin.langchain4j.provider.GoogleGemini
      modelName: gemini-embedding-exp-03-07
      apiKey: "{{ secret('GEMINI_API_KEY') }}"
    embeddings:
      type: io.kestra.plugin.langchain4j.embeddings.KestraKVStore
    drop: true
    fromExternalURLs:
      - https://raw.githubusercontent.com/kestra-io/docs/refs/heads/main/content/blogs/release-0-22.md
Embedding Store Provider
Language Model Provider
This provider must be configured with an embedding model.
The document splitter
false
Whether to drop the store before ingestion. Useful for testing purposes.
A list of document URLs from external sources
A list of internal storage URIs representing documents
Pebble expression referencing an Internal Storage URI, e.g. {{ outputs.mytask.uri }}.
A path inside the task working directory that contains documents to ingest
Each document inside the directory will be ingested into the embedding store.
Additional metadata that will be added to all ingested documents
Additional outputs from the embedding store.
The number of ingested documents
The input token count
The output token count
The total token count
Endpoint URL
Project location
Model name
Project ID
API endpoint
The Azure OpenAI endpoint in the format: https://{resource}.openai.azure.com/
Model name
API Key
Client ID
Client secret
API version
Tenant ID
API Key
Model name
https://api.deepseek.com/v1
API base URL
1
List of HTTP Elasticsearch servers.
Each must be a URI with scheme and port, such as https://elasticsearch.com:9200.
Basic auth configuration.
List of HTTP headers to be sent on every request.
Each must be a string with the key and value separated by a colon, e.g. Authorization: Token XYZ.
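As an illustration of the "key: value" header format described above, here is a minimal Python sketch that splits each entry on its first colon. The function name `parse_headers` is hypothetical, and the plugin's actual parsing rules (for example, how it treats whitespace or extra colons) are an assumption here, not something this page specifies.

```python
def parse_headers(entries):
    """Turn ["Name: value", ...] strings into a dict (hypothetical helper)."""
    headers = {}
    for entry in entries:
        # Split on the first colon only, so values may themselves contain colons.
        name, sep, value = entry.partition(":")
        if not sep or not name.strip():
            raise ValueError(f"expected 'Name: value', got {entry!r}")
        headers[name.strip()] = value.strip()
    return headers


parsed = parse_headers(["Authorization: Token XYZ", "X-Trace: a:b"])
```

Splitting on the first colon only is a design choice for this sketch: it keeps values such as tokens or URLs that contain colons intact.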
Sets the path prefix for every request used by the HTTP client.
For example, if this is set to /my/path, then any client request will become /my/path/ + endpoint.
In essence, every request's endpoint is prefixed by this pathPrefix.
The path prefix is useful when Elasticsearch is behind a proxy that provides a base path or a proxy that requires all paths to start with '/'; it is not intended for other purposes and should not be supplied in other scenarios.
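The prefixing rule can be sketched in a few lines of Python. This is only an illustration of the "/my/path/ + endpoint" behaviour described above; the helper name `apply_path_prefix` is hypothetical, and the slash normalization shown is an assumption rather than the HTTP client's documented behaviour.

```python
def apply_path_prefix(path_prefix, endpoint):
    """Prefix an endpoint with a base path (hypothetical helper)."""
    # Normalize so exactly one slash separates the prefix and the endpoint.
    cleaned = path_prefix.strip("/")
    prefix = "/" + cleaned if cleaned else ""
    return prefix + "/" + endpoint.lstrip("/")


with_prefix = apply_path_prefix("/my/path", "_search")
without_prefix = apply_path_prefix("", "/_search")
```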
Whether the REST client should return any response containing at least one warning header as a failure.
Trust all SSL CA certificates.
Use this if the server is using a self-signed SSL certificate.
API Key
Model name
API Key
Model name
API base URL
Model endpoint
Model name
Basic auth password.
Basic auth username.
{{flow.id}}-embedding-store
The name of the K/V entry to use
API Key
Model name
AWS Access Key ID
Model name
AWS Secret Access Key
Default: COHERE
Possible values: COHERE, TITAN
Amazon Bedrock Embedding Model Type
The content of the document
The metadata of the document
The database name
The database server host
The database password
The database server port
The table to store embeddings in
The database user
false
Whether to use an IVFFlat index
An IVFFlat index divides vectors into lists, and then searches a subset of those lists closest to the query vector. It has faster build times and uses less memory than HNSW but has lower query performance (in terms of speed-recall tradeoff).
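To make the "lists" idea concrete, here is a toy Python sketch of an inverted-file lookup: vectors are assigned to the list of their nearest centroid, and a query scans only the closest lists. It is a simplification under stated assumptions: the centroids are supplied by hand (a real index would normally derive them by clustering), and the names `build_lists`, `search`, and `probes` are illustrative, not the database's API.

```python
import math


def dist(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def build_lists(vectors, centroids):
    """Assign each vector index to the list of its nearest centroid."""
    lists = [[] for _ in centroids]
    for i, v in enumerate(vectors):
        nearest = min(range(len(centroids)), key=lambda c: dist(v, centroids[c]))
        lists[nearest].append(i)
    return lists


def search(query, vectors, centroids, lists, probes=1, k=1):
    """Scan only the `probes` lists whose centroids are closest to the query."""
    order = sorted(range(len(centroids)), key=lambda c: dist(query, centroids[c]))
    candidates = [i for c in order[:probes] for i in lists[c]]
    return sorted(candidates, key=lambda i: dist(query, vectors[i]))[:k]


vectors = [(0.0, 0.1), (0.2, 0.0), (5.0, 5.1), (5.2, 4.9)]
centroids = [(0.0, 0.0), (5.0, 5.0)]  # hand-picked for the example
lists = build_lists(vectors, centroids)
nearest = search((4.9, 5.0), vectors, centroids, lists, probes=1, k=1)
```

Scanning fewer lists is what makes the lookup cheap, and it is also why recall can suffer: a true nearest neighbour that sits in a list that was not probed is never considered.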
API Key
Model name
API base URL
The maximum size of the overlap, defined in characters. Only full sentences are considered for the overlap.
The maximum size of the segment, defined in characters.
Default: RECURSIVE
Possible values: RECURSIVE, PARAGRAPH, LINE, SENTENCE, WORD
The type of the DocumentSplitter
We recommend using a RECURSIVE DocumentSplitter for generic text. It tries to split the document into paragraphs first and fits as many paragraphs into a single TextSegment as possible. If some paragraphs are too long, they are recursively split into lines, then sentences, then words, and then characters until they fit into a segment.
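The recursive strategy described above can be sketched in Python as follows. This is a simplified illustration, not the library's implementation: it ignores the overlap setting, rejoins packed pieces with a single space, and uses assumed regular expressions for paragraph, line, sentence, and word boundaries. The names `split_recursive` and `PATTERNS` are hypothetical.

```python
import re

# Boundaries from coarsest to finest: paragraphs, lines, sentences, words.
PATTERNS = [r"\n\s*\n", r"\n", r"(?<=[.!?])\s+", r"\s+"]


def split_recursive(text, max_chars, level=0):
    """Pack as many pieces as fit per segment; split oversized pieces further."""
    text = text.strip()
    if len(text) <= max_chars:
        return [text] if text else []
    if level == len(PATTERNS):
        # Last resort: cut by characters.
        return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]
    parts = [p.strip() for p in re.split(PATTERNS[level], text) if p.strip()]
    segments, current = [], ""
    for part in parts:
        candidate = f"{current} {part}" if current else part
        if len(candidate) <= max_chars:
            current = candidate
            continue
        if current:
            segments.append(current)
            current = ""
        if len(part) <= max_chars:
            current = part
        else:
            # The piece alone is too long: retry with the next finer boundary.
            segments.extend(split_recursive(part, max_chars, level + 1))
    if current:
        segments.append(current)
    return segments


text = (
    "First paragraph is short.\n\n"
    "Second paragraph has two sentences. Here is the second one."
)
segments = split_recursive(text, max_chars=40)
```

With a 40-character limit, the first paragraph fits as one segment, while the second is too long as a paragraph and as a line, so it falls through to sentence-level splitting.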
The name of the index to store embeddings