HTMLSectionSplitter
- class langchain_text_splitters.html.HTMLSectionSplitter(headers_to_split_on: List[Tuple[str, str]], xslt_path: str | None = None, **kwargs: Any)
Splits HTML files based on specified tags and font sizes. Requires the lxml package.
Create a new HTMLSectionSplitter.
- Parameters:
headers_to_split_on (List[Tuple[str, str]]) – list of tuples of headers we want to track, mapped to (arbitrary) keys for metadata. Allowed header values: h1, h2, h3, h4, h5, h6, e.g. [("h1", "Header 1"), ("h2", "Header 2")].
xslt_path (str | None) – path to the XSLT file for document transformation. Uses a default if not passed. Needed for HTML content that uses different formats and layouts.
kwargs (Any)
Methods
__init__(headers_to_split_on[, xslt_path]) – Create a new HTMLSectionSplitter.
convert_possible_tags_to_header(html_content)
create_documents(texts[, metadatas]) – Create documents from a list of texts.
split_documents(documents) – Split documents.
split_html_by_headers(html_doc)
split_text(text) – Split HTML text string.
split_text_from_file(file) – Split HTML file.
- __init__(headers_to_split_on: List[Tuple[str, str]], xslt_path: str | None = None, **kwargs: Any) -> None
Create a new HTMLSectionSplitter.
- Parameters:
headers_to_split_on (List[Tuple[str, str]]) – list of tuples of headers we want to track, mapped to (arbitrary) keys for metadata. Allowed header values: h1, h2, h3, h4, h5, h6, e.g. [("h1", "Header 1"), ("h2", "Header 2")].
xslt_path (str | None) – path to the XSLT file for document transformation. Uses a default if not passed. Needed for HTML content that uses different formats and layouts.
kwargs (Any)
- Return type:
None
- convert_possible_tags_to_header(html_content: str) -> str
- Parameters:
html_content (str)
- Return type:
str
- create_documents(texts: List[str], metadatas: List[dict] | None = None) -> List[Document]
Create documents from a list of texts.
- Parameters:
texts (List[str])
metadatas (List[dict] | None)
- Return type:
List[Document]
- split_html_by_headers(html_doc: str) -> List[Dict[str, str | None]]
- Parameters:
html_doc (str)
- Return type:
List[Dict[str, str | None]]
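The return type suggests one dictionary per section. The sketch below is a rough, standard-library-only illustration of header-based splitting, not the library's implementation. The names HeaderSectionParser and split_by_headers are hypothetical, and the dictionary keys header, content, and tag_name are assumptions chosen for this sketch.

```python
from html.parser import HTMLParser
from typing import Dict, List, Optional


class HeaderSectionParser(HTMLParser):
    """Hypothetical helper: start a new section at each tracked header tag."""

    def __init__(self, header_tags: List[str]) -> None:
        super().__init__()
        self.header_tags = set(header_tags)
        self.sections: List[Dict[str, Optional[str]]] = []
        self._in_header = False
        self._current: Optional[Dict[str, Optional[str]]] = None

    def handle_starttag(self, tag, attrs):
        if tag in self.header_tags:
            # A tracked header opens a new section (keys are assumed names).
            self._current = {"header": "", "content": "", "tag_name": tag}
            self.sections.append(self._current)
            self._in_header = True

    def handle_endtag(self, tag):
        if tag in self.header_tags:
            self._in_header = False

    def handle_data(self, data):
        text = data.strip()
        if not text or self._current is None:
            return  # skip whitespace and any text before the first header
        key = "header" if self._in_header else "content"
        self._current[key] = f"{self._current[key]} {text}".strip()


def split_by_headers(
    html_doc: str, header_tags: List[str]
) -> List[Dict[str, Optional[str]]]:
    """Return one dict per header-delimited section of html_doc."""
    parser = HeaderSectionParser(header_tags)
    parser.feed(html_doc)
    parser.close()
    return parser.sections


if __name__ == "__main__":
    html = "<h1>Foo</h1><p>Intro</p><h2>Bar</h2><p>Body <b>text</b></p>"
    for section in split_by_headers(html, ["h1", "h2"]):
        print(section)
```

Text that appears before the first tracked header is dropped in this sketch; a real splitter may choose to keep it under a placeholder title instead.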
Examples using HTMLSectionSplitter
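A minimal usage sketch, assuming the langchain-text-splitters package is installed together with lxml (and, typically, beautifulsoup4). The sample HTML, the constant names, and the check_headers helper are hypothetical and only serve the illustration. The import is deferred so the snippet still loads where the optional dependencies are missing.

```python
from typing import List, Tuple

# Tags the reference lists as allowed header values.
ALLOWED_HEADER_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

# (tag, metadata key) pairs; the key names are arbitrary labels.
HEADERS_TO_SPLIT_ON: List[Tuple[str, str]] = [("h1", "Header 1"), ("h2", "Header 2")]

SAMPLE_HTML = """<!DOCTYPE html>
<html>
  <body>
    <h1>Foo</h1>
    <p>Some intro text about Foo.</p>
    <h2>Bar</h2>
    <p>Some text about Bar.</p>
  </body>
</html>"""


def check_headers(headers: List[Tuple[str, str]]) -> None:
    """Hypothetical pre-check (not part of the library): reject unknown tags."""
    unknown = [tag for tag, _ in headers if tag not in ALLOWED_HEADER_TAGS]
    if unknown:
        raise ValueError(f"Unsupported header tags: {unknown}")


def split_sample(html: str = SAMPLE_HTML):
    """Split the HTML string and return the resulting list of Document objects."""
    # Imported lazily: needs langchain-text-splitters and its optional dependencies.
    from langchain_text_splitters import HTMLSectionSplitter

    check_headers(HEADERS_TO_SPLIT_ON)
    splitter = HTMLSectionSplitter(headers_to_split_on=HEADERS_TO_SPLIT_ON)
    return splitter.split_text(html)


if __name__ == "__main__":
    try:
        for doc in split_sample():
            # Each Document carries the section text and the header metadata.
            print(doc.metadata, repr(doc.page_content))
    except ImportError as exc:
        print(f"Optional dependency missing: {exc}")
```

The exact page_content and metadata values depend on the installed version and on the XSLT in use, so the sketch prints them rather than asserting a fixed result.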