DOM in Python
Python's minidom implements part of the W3C definition of the DOM programming interface and handles most of the practical tasks we need to solve, even though it is sometimes a little cumbersome.
Example: Olympic data
We look at the results file from the Olympics example; see the modules Olympiade and Noen datasett. The results are collected in an XML file: all_results.xml
We will do two exercises on this file:
- Produce an HTML file. In principle this is the same transformation as the one done with XSLT in the module XML2HTML.
- Search for a given participant in all events in both Olympic Games.
Exercise 1
We start from the following Python program:
"""
Simple demo of DOM:
produce rudimentary HTML from an XML file with IOC results.
B. Stenseth 2009
Use:
    doit(infile,outfile)
See the default files below.
"""
import xml.dom.minidom

#-----------------------
# file io
def getTextFile(filename):
    try:
        file=open(filename,'r')
        intext=file.read()
        file.close()
        return intext
    except:
        print 'Error reading file ',filename
        return None

def storeTextFile(filename,txt):
    try:
        outfile=open(filename,'w')
        outfile.write(txt)
        outfile.close()
    except:
        print 'Error writing file ',filename

eol='\n'

#---------------------------
# collect all text in a node
def getText(nodelist):
    rc = ''
    for node in nodelist:
        if node.nodeType == node.TEXT_NODE:
            t=node.data.encode('ISO-8859-1')
            rc += t
    return rc

# page template, %s is replaced by the generated body
HTMLFile="""<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<title>Olympiade</title>
</head>
<body>
%s
</body>
</html>
"""

def handleIOC(doc):
    # one block for each olympic game
    S=''
    games=doc.getElementsByTagName("OlympicGame")
    for game in games:
        S+=handleGame(game)
        S+=eol
    return S

def handleGame(game):
    # heading with the place, then all events
    S= '<h2>%s</h2>\n' %game.getAttribute('place').encode('ISO-8859-1')
    events=game.getElementsByTagName("event")
    for event in events:
        S+=handleEvent(event)
        S+=eol
    return S

def handleEvent(event):
    # heading with the distance, then all athletes
    S= '<h3>%s</h3>\n' %event.getAttribute('dist').encode('ISO-8859-1')
    participants=event.getElementsByTagName("athlet")
    for athlet in participants:
        S+=handleAthlet(athlet)
        S+=eol
    return S

def handleAthlet(athlet):
    # one paragraph with name and result
    name=athlet.getElementsByTagName("name")[0]
    S= "<p>Name:%s<br/>" %getText(name.childNodes)
    result=athlet.getElementsByTagName("result")[0]
    S+= "Result:%s</p>" %getText(result.childNodes)
    return S

def doit(infile,outfile):
    document=getTextFile(infile)
    if document is not None:
        dom = xml.dom.minidom.parseString(document)
        T=handleIOC(dom)
        storeTextFile(outfile,HTMLFile%T)
        # clean up (only when a tree was actually built)
        dom.unlink()
    else:
        print "sorry, something went wrong"

# basic testing
# default files for demo purposes, change them
if __name__=="__main__":
    doit('c:\\web\\dw\\pydom\\all_results.xml',
         'c:\\web\\dw\\pydom\\py_results1.html')
The program performs a simple transformation of an XML structure into a rudimentary HTML string. Compare this code with the corresponding XSLT transformation described in the Olympics example.
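The same traversal can be sketched compactly in Python 3 (the listing above is Python 2). This is only an illustrative sketch, not the author's program: the sample data, its root element IOC, the invented names and results, and the helpers text_of and to_html are assumptions. Only the element and attribute names OlympicGame, event, athlet, name, result, place and dist are taken from the listing.

```python
import xml.dom.minidom
from xml.sax.saxutils import escape

# Invented sample data; the root element name IOC is an assumption.
SAMPLE_XML = """<IOC>
  <OlympicGame place="Atlanta">
    <event dist="100m">
      <athlet><name>Runner A</name><result>10.00</result></athlet>
      <athlet><name>Runner B</name><result>10.10</result></athlet>
    </event>
  </OlympicGame>
</IOC>"""


def text_of(element):
    """Concatenate the text nodes directly below an element."""
    return "".join(n.data for n in element.childNodes
                   if n.nodeType == n.TEXT_NODE)


def to_html(xml_text):
    """Walk game -> event -> athlet and emit one HTML fragment per level."""
    dom = xml.dom.minidom.parseString(xml_text)
    parts = []
    for game in dom.getElementsByTagName("OlympicGame"):
        parts.append("<h2>%s</h2>" % escape(game.getAttribute("place")))
        for event in game.getElementsByTagName("event"):
            parts.append("<h3>%s</h3>" % escape(event.getAttribute("dist")))
            for athlet in event.getElementsByTagName("athlet"):
                name = text_of(athlet.getElementsByTagName("name")[0])
                result = text_of(athlet.getElementsByTagName("result")[0])
                parts.append("<p>Name: %s<br/>Result: %s</p>"
                             % (escape(name), escape(result)))
    dom.unlink()
    return "\n".join(parts)


html = to_html(SAMPLE_XML)
print(html)
```

In Python 3 the strings from minidom are already text, so the encode calls of the Python 2 listing are not needed here; escape is used so that characters such as & in the data do not break the generated HTML.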
Exercise 2
We write a program that works on our Olympic data and tries to answer the question: "In which events has nn participated in these Olympic Games?" This means that we have to move both down and up in the tree. First we locate all occurrences of the runner in question, and then we move up the tree to find the event and the Olympic Games.
"""
Simple demo of DOM.
find: report which events an athlete has participated in
B. Stenseth 2009
Use:
    find(athlete,file)
See the default parameters below.
"""
import xml.dom.minidom

#-------------------------------------------------------------
# file io
def getTextFile(filename):
    try:
        file=open(filename,'r')
        intext=file.read()
        file.close()
        return intext
    except:
        print 'Error reading file ',filename
        return None

# collect all text in a node
def getText(nodelist):
    rc = ''
    for node in nodelist:
        if node.nodeType == node.TEXT_NODE:
            t=node.data.encode('ISO-8859-1')
            rc += t
    return rc

def searchIOC(doc,theName):
    # go down: find all name elements
    athletnamelist=doc.getElementsByTagName("name")
    for athletname in athletnamelist:
        txtname=getText(athletname.childNodes)
        if txtname==theName:
            # go up: name -> athlet -> event -> OlympicGame
            event=athletname.parentNode.parentNode
            game=event.parentNode
            print game.getAttribute('place').encode('ISO-8859-1')
            print ' - '+event.getAttribute('dist').encode('ISO-8859-1')

def find(runner,afile):
    document=getTextFile(afile)
    if document is not None:
        dom = xml.dom.minidom.parseString(document)
        searchIOC(dom,runner)
        # clean up
        dom.unlink()
    else:
        print "something went wrong"

# basic testing
# default parameters for demo purposes
if __name__=="__main__":
    find('Frank Fredericks','c:\\web\\dw\\pydom\\all_results.xml')
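The down-then-up navigation can be tried without the results file. The following Python 3 sketch is only an illustration under stated assumptions: the sample XML, its root element IOC, the runner "Runner B", the results, and the helper names text_of and find_events are invented. The parentNode chain mirrors the one used in searchIOC.

```python
import xml.dom.minidom

# Invented sample data; only the tag and attribute names follow the listing.
SAMPLE_XML = """<IOC>
  <OlympicGame place="Atlanta">
    <event dist="100m">
      <athlet><name>Frank Fredericks</name><result>0.00</result></athlet>
      <athlet><name>Runner B</name><result>0.00</result></athlet>
    </event>
    <event dist="200m">
      <athlet><name>Frank Fredericks</name><result>0.00</result></athlet>
    </event>
  </OlympicGame>
  <OlympicGame place="Sydney">
    <event dist="100m">
      <athlet><name>Runner B</name><result>0.00</result></athlet>
    </event>
  </OlympicGame>
</IOC>"""


def text_of(element):
    """Concatenate the text nodes directly below an element."""
    return "".join(n.data for n in element.childNodes
                   if n.nodeType == n.TEXT_NODE)


def find_events(xml_text, wanted):
    """Return (place, dist) pairs for every event the athlete appears in."""
    dom = xml.dom.minidom.parseString(xml_text)
    hits = []
    # Down: every name element in the document, in document order.
    for name in dom.getElementsByTagName("name"):
        if text_of(name).strip() == wanted:
            # Up: name -> athlet -> event -> OlympicGame
            event = name.parentNode.parentNode
            game = event.parentNode
            hits.append((game.getAttribute("place"),
                         event.getAttribute("dist")))
    dom.unlink()
    return hits


hits = find_events(SAMPLE_XML, "Frank Fredericks")
for place, dist in hits:
    print(place, "-", dist)
```

Whitespace between the elements becomes text nodes that are siblings of the elements, so it does not disturb the parentNode chain. Returning the pairs instead of printing them directly makes the search easy to reuse.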
Example: Book data
The data source is a text file with book descriptions, one book per line. The book data is described in the module Noen datasett. The book list as text: bokliste.txt. Empty lines and lines beginning with // are to be ignored.
We will do two exercises on this data:
- Build an XML file from the text file (the CSV file)
- Change the structure of the file we build in exercise 1.
Exercise 1
We write a Python program that reads a text file with book descriptions and produces an XML file.
"""
Demo of MINIDOM.
Building a DOM tree based on a text file, writing the result as XML.
Building each node and inserting it into the tree.
Data is described on
http://www.ia.hiof.no/~borres/ml/pydom/p-pydom.html
Usage: doit(textfilename,xmlfilename)
B. Stenseth 2009
"""
import StringIO,xml.dom.minidom,codecs

#-----------------------
# file io
def getTextFile(filename):
    try:
        file=open(filename,'r')
        intext=file.read()
        file.close()
        return intext
    except:
        print 'Error reading file ',filename
        return None

def storeTextFile(filename,txt):
    try:
        outfile=open(filename,'w')
        outfile.write(txt)
        outfile.close()
    except:
        print 'Error writing file ',filename

#------------------------
# the job
def doit(infile,outfile):
    txt=getTextFile(infile)
    if txt is None:
        return
    # prepare this string for unicode in a domtree
    txt=txt.decode('ISO-8859-1')
    lines=txt.split('\n')
    # set up basic document
    doc=xml.dom.minidom.Document()
    root_elt=doc.createElement('booklist')
    doc.appendChild(root_elt)
    # walk the linelist
    for line in lines:
        line=line.strip()
        # skip the blanks and the comments
        if len(line) <3:
            continue
        if line[0:2]=="//":
            continue
        # we will use it
        # title,author,publisher,year,isbn,pages,course,category,comment
        pieces=line.split(',')
        if len(pieces)!=9:
            # bad line
            print "ignore: " + line
            continue
        # make book
        book_elt_node=doc.createElement('book')
        book_elt_node.setAttribute('isbn',pieces[4])
        book_elt_node.setAttribute('pages',pieces[5])
        root_elt.appendChild(book_elt_node)
        new_elt_node=doc.createElement('title')
        new_elt_node.appendChild(doc.createTextNode(pieces[0]))
        book_elt_node.appendChild(new_elt_node)
        new_elt_node=doc.createElement('course')
        new_elt_node.appendChild(doc.createTextNode(pieces[6]))
        book_elt_node.appendChild(new_elt_node)
        new_elt_node=doc.createElement('category')
        new_elt_node.appendChild(doc.createTextNode(pieces[7]))
        book_elt_node.appendChild(new_elt_node)
        new_elt_node=doc.createElement('author')
        new_elt_node.appendChild(doc.createTextNode(pieces[1]))
        book_elt_node.appendChild(new_elt_node)
        new_elt_node=doc.createElement('publisher')
        new_elt_node.appendChild(doc.createTextNode(pieces[2]))
        book_elt_node.appendChild(new_elt_node)
        new_elt_node=doc.createElement('year')
        new_elt_node.appendChild(doc.createTextNode(pieces[3]))
        book_elt_node.appendChild(new_elt_node)
        new_elt_node=doc.createElement('comment')
        new_elt_node.appendChild(doc.createTextNode(pieces[8]))
        book_elt_node.appendChild(new_elt_node)
    # raw print while testing
    # print doc.toxml().encode('ISO-8859-1')
    # get it on file
    # need the domtree, doc, as an ISO-8859-1 encoded string
    s=StringIO.StringIO()
    doc.writexml(codecs.getwriter('ISO-8859-1')(s))
    # some dirty formatting, take care
    s=s.getvalue().replace('>','>\n')
    s=s.replace('<book','\n\n<book')
    # fix prolog
    prolog="""<?xml version="1.0" encoding="ISO-8859-1" ?>
<!DOCTYPE booklist SYSTEM "bokdok.dtd">"""
    s=s.replace('<?xml version="1.0" ?>',prolog)
    # while testing
    #print s
    storeTextFile(outfile,s)
    doc.unlink()

# basic testing
if __name__=="__main__":
    doit('c:\\web\\dw\\pydom\\bokliste.txt',
         'c:\\web\\dw\\pydom\\bokliste2.xml')
This is done by building a DOM tree and inserting nodes generated from the text. In principle, this Python code does the same as the code described in the module HTML og XML, which presents a program that does the same job as plain text processing, without using DOM.
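The node-building pattern (createElement, createTextNode, appendChild) can be shown on a few lines of inline text. The Python 3 sketch below is an assumption-laden illustration, not the program above: the sample line, the names FIELDS, CHILD_ORDER and build_booklist are invented, and it lets minidom's toprettyxml do the formatting and the encoding declaration instead of the string replacements. It does not add the DOCTYPE line.

```python
import xml.dom.minidom

# Field order as given in the comment of the listing above.
FIELDS = ["title", "author", "publisher", "year", "isbn",
          "pages", "course", "category", "comment"]
# Child elements in the order the listing creates them.
CHILD_ORDER = ["title", "course", "category", "author",
               "publisher", "year", "comment"]


def build_booklist(text):
    """Build a booklist DOM tree from comma-separated lines."""
    doc = xml.dom.minidom.Document()
    root = doc.createElement("booklist")
    doc.appendChild(root)
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and comment lines.
        if not line or line.startswith("//"):
            continue
        pieces = line.split(",")
        if len(pieces) != len(FIELDS):
            continue  # malformed line, ignored in this sketch
        record = dict(zip(FIELDS, pieces))
        book = doc.createElement("book")
        book.setAttribute("isbn", record["isbn"])
        book.setAttribute("pages", record["pages"])
        root.appendChild(book)
        for field in CHILD_ORDER:
            child = doc.createElement(field)
            child.appendChild(doc.createTextNode(record[field]))
            book.appendChild(child)
    return doc


# Invented sample input: one comment, one good line, one blank, one bad line.
SAMPLE = """// title,author,publisher,year,isbn,pages,course,category,comment
A Title,An Author,A Publisher,2001,0000000000,100,a-course,a-category,a comment

broken,line
"""

doc = build_booklist(SAMPLE)
xml_bytes = doc.toprettyxml(indent="  ", encoding="ISO-8859-1")
print(xml_bytes.decode("ISO-8859-1"))
```

Note that toprettyxml adds whitespace inside elements that contain child elements, so a program reading the file back should be prepared for whitespace-only text nodes.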
Exercise 2
We write a program that takes an XML file as built in exercise 1 and changes its structure: one element is turned into an attribute, and one attribute is turned into an element.
"""
Demo of MINIDOM.
Changing the structure of an XML file
Data is described on
http://www.ia.hiof.no/~borres/ml/python/p-python.html
change it to make:
    all titles an attribute instead of an element
    all pages an element instead of an attribute
B. Stenseth 2002
Use: doit(infile,outfile)
"""
import StringIO,codecs,xml.dom.minidom

#-----------------------
# file io
def getTextFile(filename):
    try:
        file=open(filename,'r')
        intext=file.read()
        file.close()
        return intext
    except:
        print 'Error reading file ',filename
        return None

def storeTextFile(filename,txt):
    try:
        outfile=open(filename,'w')
        outfile.write(txt)
        outfile.close()
    except:
        print 'Error writing file ',filename

# collect all text in a node
def getText(nodelist):
    rc = ''
    for node in nodelist:
        if node.nodeType == node.TEXT_NODE:
            t=node.data.encode('ISO-8859-1')
            rc += t
    return rc

# collect all text in a node, without surrounding whitespace
def getStrippedText(nodelist):
    rc = ''
    for node in nodelist:
        if node.nodeType == node.TEXT_NODE:
            rc += node.data.strip().encode('ISO-8859-1')
    return rc

def doit(infile,outfile):
    txt=getTextFile(infile)
    if txt is None:
        return
    # the parser reads the encoding from the prolog,
    # so the string is not decoded here
    doc = xml.dom.minidom.parseString(txt)
    books=doc.getElementsByTagName('book')
    for book in books:
        # pick up the title-element
        title_elt=book.getElementsByTagName('title')[0]
        title_str=getStrippedText(title_elt.childNodes)
        # make the title an attribute
        book.setAttribute('title',title_str.decode('ISO-8859-1'))
        # remove the title element
        book.removeChild(title_elt)
        # pick up the pages-attribute
        page_str=book.getAttribute('pages')
        # make the element
        page_elt=doc.createElement('pages')
        # make the text child node
        page_elt.appendChild(doc.createTextNode(page_str))
        book.appendChild(page_elt)
        # remove pages-attribute
        book.removeAttribute('pages')
    # get it on file
    # need the domtree, doc, as an ISO-8859-1 encoded string
    s=StringIO.StringIO()
    doc.writexml(codecs.getwriter('ISO-8859-1')(s))
    s=s.getvalue()
    # fix prolog
    prolog='<?xml version="1.0" encoding="ISO-8859-1" ?>'
    s=s.replace('<?xml version="1.0" ?>',prolog)
    s=s.replace('bokdok.dtd','bokdok2.dtd')
    # while testing
    # print s
    storeTextFile(outfile,s)
    doc.unlink()

# basic testing
if __name__=="__main__":
    doit('c:\\web\\dw\\pydom\\bokliste2.xml',
         'c:\\web\\dw\\pydom\\bokliste3.xml')
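The two structural changes can be checked on a small inline document. This Python 3 sketch is an illustration only: the sample XML and the function name restructure are assumptions, while the DOM calls (setAttribute, removeChild, createElement, appendChild, removeAttribute) are the same ones the listing above uses.

```python
import xml.dom.minidom

# Invented sample; the structure follows the file built in exercise 1.
SAMPLE_XML = """<booklist>
<book isbn="0000000000" pages="100"><title>
A Title
</title><author>An Author</author></book>
</booklist>"""


def restructure(xml_text):
    """Turn each title element into an attribute and pages into an element."""
    doc = xml.dom.minidom.parseString(xml_text)
    for book in doc.getElementsByTagName("book"):
        # Element -> attribute: read the text, set it, drop the element.
        title_elt = book.getElementsByTagName("title")[0]
        title = "".join(n.data for n in title_elt.childNodes
                        if n.nodeType == n.TEXT_NODE).strip()
        book.setAttribute("title", title)
        book.removeChild(title_elt)
        # Attribute -> element: wrap the value in a text node, drop the attribute.
        pages_elt = doc.createElement("pages")
        pages_elt.appendChild(doc.createTextNode(book.getAttribute("pages")))
        book.appendChild(pages_elt)
        book.removeAttribute("pages")
    return doc


doc = restructure(SAMPLE_XML)
book = doc.getElementsByTagName("book")[0]
print(book.toxml())
```

The title text is stripped because the formatting in exercise 1 puts a line break after every tag, so the text node of a title starts and ends with whitespace.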
Example: Skating data
The topic is speed skating, with separate text files holding the results from the 500 m, 1500 m, 5000 m and 10000 m. These text files are very simple and contain one name and one result per line. The files are named s500.txt, s1500.txt, s5000.txt and s10000.txt. We write a program that does the following:
- Reads the four files and builds a DOM tree for each of them
- Merges the four trees into one
- Calculates the total points for each skater
- Sorts all the skater nodes by the calculated points
- Produces an HTML file in which the skaters are listed sorted by points
This is hardly an optimal way to solve the problem, but it can serve as a DOM exercise. The five steps are marked in the Python code.
"""
Demo of MINIDOM.
NOTE that this may not be the smartest or fastest way to
solve this problem. It is written to demonstrate minidom.

Function makeCompleteXML(catalog)
    Builds an XML tree based on four text files:
    results from 500, 1500, 5000, 10000 m speed skating,
    each with lines of the form (not sorted):
        name,result
    Filenames are s500.txt, s1500.txt, s5000.txt, s10000.txt
    Returns a tree with the following structure:
    <?xml version="1.0" encoding="ISO-8859-1" ?>
    <skatingevent>
    <skater name="olsen">
    <res500>40.00</res500>
    <res1500>1.50.00</res1500>
    <res5000>6.40.00</res5000>
    <res10000>13.40.00</res10000>
    <points>157.667</points>
    </skater>
    ...
    </skatingevent>

Function doit(catalog)
    calls storeXMLFile and produces a sorted HTML file:
    skaters.html

The job is done in 5 commented steps:
    1. Read the 4 text files and establish a DOM tree for each
    2. Join the 4 trees into one tree
    3. Calculate aggregated points for each skater
    4. Sort the skaters on points
    5. Make an HTML file of the sorted skaters

Usage: doit(catalog)
B. Stenseth 2009
"""
import StringIO,xml.dom.minidom,codecs

def getText(nodelist):
    # collect all text in a node
    rc = ''
    for node in nodelist:
        if node.nodeType == node.TEXT_NODE:
            t=node.data.encode('ISO-8859-1')
            rc += t
    return rc

def makeTree(catalog,distanse):
    # read a file from the catalog and establish tree
    try:
        filename=catalog+'\\s'+distanse+'.txt'
        # sample: c:\myskatingfiles\s500.txt
        file=open(filename,'r')
        intxt=file.read()
        file.close()
        intxt=intxt.decode('ISO-8859-1')
        doc=xml.dom.minidom.Document()
        root_elt=doc.createElement('skatingevent')
        doc.appendChild(root_elt)
        lines=intxt.split('\n')
        for line in lines:
            # a valid line has the form: name,result
            pieces=line.split(',')
            if len(pieces)==2:
                skater_elt=doc.createElement('skater')
                skater_elt.setAttribute('name',pieces[0])
                result_elt=doc.createElement('res'+distanse)
                result_elt.appendChild(doc.createTextNode(pieces[1]))
                skater_elt.appendChild(result_elt)
                root_elt.appendChild(skater_elt)
        return doc
    except:
        print 'Error building: '+distanse
        return ''

def storeXMLFile(filename,doc):
    # storing an xmlfile from a tree
    s=StringIO.StringIO()
    doc.writexml(codecs.getwriter('ISO-8859-1')(s))
    t=s.getvalue()
    # some dirty formatting, take care
    t=t.replace('<skater','\n<skater')
    t=t.replace('<res','\n<res')
    # fix prolog
    prolog='<?xml version="1.0" encoding="ISO-8859-1" ?>'
    t=t.replace('<?xml version="1.0" ?>',prolog)
    # print while storing if you want to test
    # print t
    try:
        outf=open(filename,'w')
        outf.write(t)
        outf.close()
    except:
        print 'Error in writing tree at:'+ filename

def makeCompleteXML(catalog='c:\\articles\\ml\\dom'):
    # produce the complete tree with results from all distances
    # strategy is to make a tree for each distance and then join them
    #--------------------------------------
    # STEP 1 make 4 DOM-trees
    t500=makeTree(catalog,'500')
    t1500=makeTree(catalog,'1500')
    t5000=makeTree(catalog,'5000')
    t10000=makeTree(catalog,'10000')
    #--------------------------------------
    # store them and print them while testing
    #storeXMLFile(catalog+'\\xml500.xml',t500)
    #storeXMLFile(catalog+'\\xml1500.xml',t1500)
    #storeXMLFile(catalog+'\\xml5000.xml',t5000)
    #storeXMLFile(catalog+'\\xml10000.xml',t10000)
    #--------------------------------------
    # STEP 2 join trees to one tree
    # use t500 as master and assemble results from the three others
    d500=t500.getElementsByTagName('skater')
    d1500=t1500.getElementsByTagName('skater')
    d5000=t5000.getElementsByTagName('skater')
    d10000=t10000.getElementsByTagName('skater')
    for p500 in d500:
        name500=p500.getAttribute('name')
        # appendChild moves the result node into the master tree
        for p1500 in d1500:
            if name500==p1500.getAttribute('name'):
                p500.appendChild(p1500.getElementsByTagName('res1500')[0])
                break
        for p5000 in d5000:
            if name500==p5000.getAttribute('name'):
                p500.appendChild(p5000.getElementsByTagName('res5000')[0])
                break
        for p10000 in d10000:
            if name500==p10000.getAttribute('name'):
                p500.appendChild(p10000.getElementsByTagName('res10000')[0])
                break
    #--------------------------------------
    # now we have all results in t500
    # write it if you want
    # storeXMLFile(catalog+'\\sall.xml',t500)
    #--------------------------------------
    # STEP 3 calculate aggregated points for each skater
    # we want to calculate points for each skater and add
    # an element points to each skater
    skaters=t500.getElementsByTagName('skater')
    for skater in skaters:
        # sum the times in 1/100 seconds, each scaled to a 500 m time
        s=getText(skater.getElementsByTagName('res500')[0].childNodes)
        hsecs=makeSeconds(s)
        s=getText(skater.getElementsByTagName('res1500')[0].childNodes)
        hsecs+=makeSeconds(s)/3.0
        s=getText(skater.getElementsByTagName('res5000')[0].childNodes)
        hsecs+=makeSeconds(s)/10.0
        s=getText(skater.getElementsByTagName('res10000')[0].childNodes)
        hsecs+=makeSeconds(s)/20.0
        # points are seconds with three decimals
        points='%.3f' %(hsecs/100.0)
        point_elt=t500.createElement('points')
        skater.appendChild(point_elt)
        point_elt.appendChild(t500.createTextNode(points))
    #--------------------------------------
    # and you may save it again while testing
    # storeXMLFile(catalog+'\\sallpoints.xml',t500)
    # clean up
    t1500.unlink()
    t5000.unlink()
    t10000.unlink()
    return t500

def makeSeconds(s):
    # calculate 1/100 seconds from s
    # s in form mm.ss.hh ( minutes, seconds, 1/100 seconds)
    # or ss.hh ( seconds, 1/100 seconds)
    # print s
    parts=s.split('.')
    hsecs=0
    if len(parts)==3:
        hsecs=6000*int(parts[0])+100*int(parts[1])+int(parts[2])
    elif len(parts)==2:
        hsecs=100*int(parts[0])+int(parts[1])
    else:
        print 'error in timeformat: ' + s
        hsecs=9999999
    return hsecs

def compareSkaters(s1,s2):
    # used while sorting
    # compare the points as floats so that the decimals count
    s1pnt=s1.getElementsByTagName('points')[0]
    s2pnt=s2.getElementsByTagName('points')[0]
    v1=float(getText(s1pnt.childNodes))
    v2=float(getText(s2pnt.childNodes))
    return cmp(v1,v2)

def doit(catalog):
    # make the complete job from 4 text-files to html-file
    # first we build the complete xml-tree
    # including all results and calculated points
    #--------------------------------------
    # STEPS 1,2,3 as commented on top of script
    doc=makeCompleteXML(catalog)
    #--------------------------------------
    # and we may save it just to test
    storeXMLFile(catalog+'\\sallpoints.xml',doc)
    #--------------------------------------
    # STEP 4 sort skaters according to calculated points
    skaters=doc.getElementsByTagName('skater')
    skaters.sort(compareSkaters)
    #--------------------------------------
    # STEP 5 produce a HTML-page
    # now we want to produce some html-output with results
    T="""<html>
<head> <title>resultater</title> </head>
<body>
<h1 style="font-size:14px">Resultater</h1>
"""
    # run through sorted list of skaters
    T+='<table cellpadding="2">\n'
    for skater in skaters:
        T+='<tr><td style="font-size:12px">'
        T+=skater.getAttribute('name').encode('ISO-8859-1')
        T+='</td><td style="font-size:12px">'
        T+=getText(skater.getElementsByTagName('points')[0].childNodes)
        T+='</td></tr>\n'
    T+='</table>\n</body>\n</html>\n'
    filename=catalog+'\\skaters.html'
    outf=open(filename,'w')
    outf.write(T)
    outf.close()
    #--------------------------------------
    doc.unlink()

# basic testing
if __name__=="__main__":
    doit('c:\\web\\dw\\pydom')
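The points calculation and the sorting step can be tried on an already merged tree. The Python 3 sketch below is an illustration under assumptions: the sample skaters and their times are invented, and the names DIVISORS, to_hundredths and skater_points are not from the listing. It uses the same divisors (1, 3, 10, 20) and sorts with a key function, since the cmp-style comparison function passed to sort in the Python 2 listing is not available in Python 3.

```python
import xml.dom.minidom

# Invented sample: a merged tree as it looks after step 2.
SAMPLE_XML = (
    '<skatingevent>'
    '<skater name="hansen"><res500>39.50</res500><res1500>1.52.00</res1500>'
    '<res5000>6.45.00</res5000><res10000>13.50.00</res10000></skater>'
    '<skater name="olsen"><res500>40.00</res500><res1500>1.50.00</res1500>'
    '<res5000>6.40.00</res5000><res10000>13.40.00</res10000></skater>'
    '</skatingevent>'
)

# Each distance is scaled to a 500 m equivalent.
DIVISORS = {"res500": 1.0, "res1500": 3.0, "res5000": 10.0, "res10000": 20.0}


def to_hundredths(time_text):
    """Convert 'ss.hh' or 'mm.ss.hh' to hundredths of a second."""
    parts = [int(p) for p in time_text.split(".")]
    if len(parts) == 3:
        minutes, seconds, hundredths = parts
    elif len(parts) == 2:
        minutes = 0
        seconds, hundredths = parts
    else:
        raise ValueError("unexpected time format: %r" % time_text)
    return 6000 * minutes + 100 * seconds + hundredths


def skater_points(skater):
    """Sum the scaled times of one skater element, in seconds."""
    total = 0.0
    for tag, divisor in DIVISORS.items():
        text = skater.getElementsByTagName(tag)[0].firstChild.data
        total += to_hundredths(text) / divisor
    return total / 100.0


doc = xml.dom.minidom.parseString(SAMPLE_XML)
# Lowest points first, as in the result list.
ranked = sorted(doc.getElementsByTagName("skater"), key=skater_points)
names = [s.getAttribute("name") for s in ranked]
for s in ranked:
    print(s.getAttribute("name"), "%.3f" % skater_points(s))
```

Unlike makeSeconds, this sketch raises an exception on a malformed time instead of substituting a very large value; which behaviour is preferable depends on how the input files are produced.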