Monday, April 29, 2024
rated 0 times [  1] [ 0]  / answers: 1 / hits: 2404  / 1 Year ago, tue, april 18, 2023, 4:22:13

I'm trying to split a 13GB XML file into small (~50MB) XML files with the XSLT style sheet below.



But xsltproc gets killed once I see it using over 1.7GB of memory (that's all the memory on the system).



Is there any way to deal with huge XML files with xsltproc? Can I change my style sheet? Or should I use a different processor? Or am I just S.O.L.?



<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0"
    xmlns:exsl="http://exslt.org/common"
    extension-element-prefixes="exsl"
    xmlns:fn="http://www.w3.org/2005/xpath-functions">
  <xsl:output method="xml" indent="yes"/>
  <xsl:strip-space elements="*"/>
  <xsl:param name="block-size" select="75000"/>

  <xsl:template match="/">
    <xsl:copy>
      <xsl:apply-templates select="mysqldump/database/table_data/row[position() mod $block-size = 1]" />
    </xsl:copy>
  </xsl:template>

  <xsl:template match="row">
    <exsl:document href="chunk-{position()}.xml">
      <add>
        <xsl:for-each select=". | following-sibling::row[position() &lt; $block-size]" >
          <doc>
            <xsl:for-each select="field">
              <field>
                <xsl:attribute name="name"><xsl:value-of select="./@name"/></xsl:attribute>
                <xsl:value-of select="."/>
              </field>
              <xsl:text>&#xa;</xsl:text>
            </xsl:for-each>
          </doc>
        </xsl:for-each>
      </add>
    </exsl:document>
  </xsl:template>
</xsl:stylesheet>


 Answers

This simple utility requires Python and the python-lxml module (with libxml2 installed on the system). It stream-parses the source, transforms each matching element through XSLT, and writes the result to the output file right away, with no buffering.



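#!/usr/bin/env python3

from lxml import etree

# Replace the FILL_* placeholders with your own paths and element name.
xslt_doc = etree.parse('FILL_XSLT_DOC')
transform = etree.XSLT(xslt_doc)

with open('FILL_RESULT_XML', 'wb') as results:
    # iterparse yields (event, element) pairs as each element's end tag is read
    for _event, elem in etree.iterparse('FILL_SOURCE_XML'):
        # endswith() also matches when the tag carries a namespace prefix
        if elem.tag.endswith('FILL_SEARCHED_ELEMENT_NAME'):
            new_elem = transform(elem)
            results.write(etree.tostring(new_elem, xml_declaration=False, encoding='utf8'))
            results.write(b'\n')
            # Free the processed element and its earlier siblings so the
            # in-memory tree does not keep growing with the input
            elem.clear()
            while elem.getprevious() is not None:
                del elem.getparent()[0]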





Be aware that if your XSLT contains <xsl:strip-space elements="*"/>, you may run into this bug, reported in 2010: https://bugs.launchpad.net/lxml/+bug/583249
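
The script above writes everything into a single result file, while the question asks for chunks of a fixed number of rows. A minimal sketch of that variant follows. It is not the original answer's code: it uses the standard library's xml.etree.ElementTree instead of lxml, builds the <add>/<doc>/<field> output directly rather than going through XSLT, and the function name split_rows and its parameters are made up for illustration. It assumes the mysqldump-style layout from the question, with unprefixed <row> and <field> elements.

```python
import os
import xml.etree.ElementTree as ET


def split_rows(source, out_dir, block_size=75000):
    """Stream <row> elements from `source` into out_dir/chunk-N.xml files.

    `source` is a path or a binary file object. Returns the list of chunk
    paths written. The name and signature are illustrative only.
    """
    paths = []
    out = None
    rows_in_chunk = 0
    open_elems = []  # elements whose end tag has not been reported yet

    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            open_elems.append(elem)
            continue
        open_elems.pop()
        if elem.tag != "row":
            continue

        if out is None:
            # Start a new chunk file, numbered from 1
            path = os.path.join(out_dir, "chunk-%d.xml" % (len(paths) + 1))
            paths.append(path)
            out = open(path, "wb")
            out.write(b"<add>\n")

        # Same shape as the stylesheet's output: one <doc> per row,
        # one <field name="..."> per column
        doc = ET.Element("doc")
        for field in elem.findall("field"):
            new_field = ET.SubElement(doc, "field", name=field.get("name", ""))
            new_field.text = "".join(field.itertext())
        out.write(ET.tostring(doc, encoding="utf-8"))
        out.write(b"\n")
        rows_in_chunk += 1

        # Detach the finished row from its parent so it can be garbage collected
        if open_elems:
            open_elems[-1].remove(elem)

        if rows_in_chunk == block_size:
            out.write(b"</add>\n")
            out.close()
            out, rows_in_chunk = None, 0

    if out is not None:
        # Close the last, partially filled chunk
        out.write(b"</add>\n")
        out.close()
    return paths
```

Calling split_rows('FILL_SOURCE_XML', 'FILL_OUTPUT_DIR') would then produce chunk-1.xml, chunk-2.xml, and so on in that directory. Because each row is detached from the tree once it has been written, memory use should depend on the size of a row rather than on the size of the whole file.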


[#34850] Wednesday, April 19, 2023
eatack