DOM in Python
Python's minidom implements part of the W3C definition of the DOM programming interface. It handles most of the practical tasks we need to solve, although it can be somewhat cumbersome at times.
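As a small illustration of both the interface and its verbosity, the sketch below parses an inline string and reads an attribute and some text. The sample document is made up for this illustration, and the code uses Python 3 syntax, unlike the Python 2 listings in this module.

```python
import xml.dom.minidom

# Made-up sample document, only for illustration.
doc = xml.dom.minidom.parseString(
    "<greeting lang='en'>Hello <b>DOM</b> world</greeting>")

root = doc.documentElement
print(root.tagName)               # name of the root element
print(root.getAttribute("lang"))  # value of an attribute

# minidom has no shortcut for "the text of this element":
# we walk the child nodes and keep the text nodes ourselves.
direct_text = "".join(
    node.data for node in root.childNodes
    if node.nodeType == node.TEXT_NODE)
print(repr(direct_text))

bold_count = len(doc.getElementsByTagName("b"))
```

Only the text nodes directly below the root are collected here; the text inside the b element belongs to a deeper level of the tree.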
Example: Olympic data
We use the results file from the Olympics example; see the modules Olympiade and Noen datasett. The results in question are organized in an XML file: all_results.xml
We will do two exercises on this file:
- Produce an HTML file. In principle this is the same transformation as the one done with XSLT in the module XML2HTML.
- Search for a particular participant across all events in both Olympics.
Exercise 1
We start from the following Python program:
import xml.dom.minidom
"""
Simple demo of dom.
Produce rudimentary html from xml-file with IOC-results
B. Stenseth 2009
Use: doit(infile,outfile)
See default files below
"""
#-----------------------
# file io
def getTextFile(filename):
    try:
        file=open(filename,'r')
        intext=file.read()
        file.close()
        return intext
    except:
        print 'Error reading file ',filename
        return None

def storeTextFile(filename,txt):
    try:
        outfile=open(filename,'w')
        outfile.write(txt)
        outfile.close()
    except:
        print 'Error writing file ',filename

eol='\n'

#---------------------------
# collect all text in a node
def getText(nodelist):
    rc = ''
    for node in nodelist:
        if node.nodeType == node.TEXT_NODE:
            t=node.data.encode('ISO-8859-1')
            rc += t
    return rc

# page template, %s is replaced by the generated body
HTMLFile="""<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<META http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<title>Olympiade</title>
</head>
<body>
%s
</body>
</html>
"""

# one level of the tree per function: games, events, athletes
def handleIOC(doc):
    S=''
    games=doc.getElementsByTagName("OlympicGame")
    for game in games:
        S+=handleGame(game)
        S+=eol
    return S

def handleGame(game):
    S= '<h2>%s</h2>\n' %game.getAttribute('place').encode('ISO-8859-1')
    events=game.getElementsByTagName("event")
    for event in events:
        S+=handleEvent(event)
        S+=eol
    return S

def handleEvent(event):
    S= '<h3>%s</h3>\n' %event.getAttribute('dist').encode('ISO-8859-1')
    participants=event.getElementsByTagName("athlet")
    for athlet in participants:
        S+=handleAthlet(athlet)
        S+=eol
    return S

def handleAthlet(athlet):
    name=athlet.getElementsByTagName("name")[0]
    S= "<p>Name:%s<br/>" %getText(name.childNodes)
    result=athlet.getElementsByTagName("result")[0]
    S+= "Result:%s</p>" %getText(result.childNodes)
    return S

# default file for demopurposes, change it
def doit(infile,outfile):
    document=getTextFile(infile)
    if(document!=None):
        dom = xml.dom.minidom.parseString(document)
        T=handleIOC(dom)
        storeTextFile(outfile,HTMLFile%T)
        # clean up, only when a tree was actually built
        dom.unlink()
    else:
        print "sorry, something went wrong"

# basic testing
if __name__=="__main__":
    doit('c:\\web\\dw\\pydom\\all_results.xml',
         'c:\\web\\dw\\pydom\\py_results1.html')
The program performs a simple transformation of an XML structure into a rudimentary HTML string. Compare this code with the corresponding XSLT transformation described in the Olympics example.
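The same top-down walk can be tried without any files. The following is a minimal sketch in Python 3 syntax (the listing above is Python 2). The element and attribute names are taken from the listing, while the root element name IOC, the sample data and the helper names text_of and to_html are assumptions made for this illustration. No HTML escaping is done.

```python
import xml.dom.minidom

# Made-up sample data; "IOC" as root element name is an assumption.
SAMPLE = """<IOC>
  <OlympicGame place="Testville">
    <event dist="100m">
      <athlet><name>Runner A</name><result>10.00</result></athlet>
      <athlet><name>Runner B</name><result>10.50</result></athlet>
    </event>
  </OlympicGame>
</IOC>"""

def text_of(element):
    """Concatenate the text nodes directly below element."""
    return "".join(node.data for node in element.childNodes
                   if node.nodeType == node.TEXT_NODE)

def to_html(doc):
    """Walk games, events and athletes and collect HTML fragments."""
    parts = []
    for game in doc.getElementsByTagName("OlympicGame"):
        parts.append("<h2>%s</h2>" % game.getAttribute("place"))
        for event in game.getElementsByTagName("event"):
            parts.append("<h3>%s</h3>" % event.getAttribute("dist"))
            for athlet in event.getElementsByTagName("athlet"):
                name = text_of(athlet.getElementsByTagName("name")[0])
                result = text_of(athlet.getElementsByTagName("result")[0])
                parts.append("<p>Name: %s<br/>Result: %s</p>" % (name, result))
    return "\n".join(parts)

doc = xml.dom.minidom.parseString(SAMPLE)
html = to_html(doc)
doc.unlink()
print(html)
```

In Python 3 the strings from minidom are already text, so the encode('ISO-8859-1') calls of the listing are not needed until the result is written to a file.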
Exercise 2
We write a program that takes our Olympic data and tries to answer the question: "In which events has nn participated in the Olympics in question?" This means that we have to move both down and up in the tree. First we locate all occurrences of the runner in question, then we move up the tree to find the event and the Olympic Games.
import xml.dom.minidom
"""
Simple demo of dom.
find: report which events an athlet has participated in
B. Stenseth 2009
Use: find(athlet,file)
See default parameters below
"""
#-------------------------------------------------------------
# file io
def getTextFile(filename):
    try:
        file=open(filename,'r')
        intext=file.read()
        file.close()
        return intext
    except:
        print 'Error reading file ',filename
        return None

# collect all text in a node
def getText(nodelist):
    rc = ''
    for node in nodelist:
        if node.nodeType == node.TEXT_NODE:
            t=node.data.encode('ISO-8859-1')
            rc += t
    return rc

def searchIOC(doc,theName):
    athletnamelist=doc.getElementsByTagName("name")
    for athletname in athletnamelist:
        txtname=getText(athletname.childNodes)
        if txtname==theName:
            # go up: name -> athlet -> event -> OlympicGame
            event=athletname.parentNode.parentNode
            game=event.parentNode
            print game.getAttribute('place').encode('ISO-8859-1')
            print ' - '+event.getAttribute('dist').encode('ISO-8859-1')

# default parameters for demopurposes
def find(runner,afile):
    document=getTextFile(afile)
    if(document!=None):
        dom = xml.dom.minidom.parseString(document)
        searchIOC(dom,runner)
    else:
        print "something went wrong"

# basic testing
if __name__=="__main__":
    find('Frank Fredericks','c:\\web\\dw\\pydom\\all_results.xml')
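The down-then-up movement can be isolated in a few lines. This is a sketch in Python 3 syntax under the same assumptions about the file structure as the listing above; the sample data, the root element name IOC and the function name find_events are made up for the illustration. Unlike the listing, it strips whitespace around the name before comparing.

```python
import xml.dom.minidom

# Made-up sample data following the element names used in the listing.
SAMPLE = """<IOC>
  <OlympicGame place="Testville">
    <event dist="100m">
      <athlet><name>Runner A</name><result>10.00</result></athlet>
    </event>
    <event dist="200m">
      <athlet><name>Runner A</name><result>20.00</result></athlet>
      <athlet><name>Runner B</name><result>20.50</result></athlet>
    </event>
  </OlympicGame>
</IOC>"""

def find_events(doc, wanted):
    """Return (place, dist) pairs for every event the athlete appears in."""
    hits = []
    # Down: every name element anywhere in the document.
    for name_elt in doc.getElementsByTagName("name"):
        text = "".join(node.data for node in name_elt.childNodes
                       if node.nodeType == node.TEXT_NODE)
        if text.strip() == wanted:
            # Up: name -> athlet -> event -> OlympicGame.
            event = name_elt.parentNode.parentNode
            game = event.parentNode
            hits.append((game.getAttribute("place"),
                         event.getAttribute("dist")))
    return hits

doc = xml.dom.minidom.parseString(SAMPLE)
hits = find_events(doc, "Runner A")
doc.unlink()
for place, dist in hits:
    print(place, "-", dist)
```

Whitespace between the elements does not disturb parentNode, since the parent of an element is always the enclosing element, never a text node.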
Example: Book data
The data source is a text file with book descriptions, one book per line. The book data is described in the module Noen datasett. Book list as text: bokliste.xml. Empty lines and lines that begin with // are to be ignored.
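The rule about empty lines and // lines can be sketched as a small generator. The function name usable_lines and the sample text are assumptions made for this illustration, written in Python 3 syntax.

```python
def usable_lines(text):
    """Yield stripped lines, skipping blank lines and // comment lines."""
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("//"):
            continue
        yield line

# Made-up sample input, only for illustration.
SAMPLE = """// title,author,publisher,year,isbn,pages,course,category,comment

A Title,An Author,A Publisher,2000,000-0,123,course1,cat1,no comment
   // an indented comment line
Another Title,Another Author,A Publisher,2001,000-1,456,course2,cat2,none
"""

kept = list(usable_lines(SAMPLE))
print(len(kept))
```

Stripping before testing means that an indented comment line is also skipped.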
We will do two exercises on this data:
- Build an XML file from the text file (the CSV file).
- Change the structure of the file we build in exercise 1.
Exercise 1
We write a Python program that reads a text file with book descriptions and produces an XML file.
import StringIO,xml.dom.minidom,codecs
"""
Demo of MINIDOM.
Building a DOM-tree based on a text-file, writing result as XML
Building each node and inserting it into the tree
Data is described on
http://www.ia.hiof.no/~borres/ml/pydom/p-pydom.html
Usage: doit(textfilename,xmlfilename)
B. Stenseth 2009
"""
#-----------------------
# file io
def getTextFile(filename):
    try:
        file=open(filename,'r')
        intext=file.read()
        file.close()
        return intext
    except:
        print 'Error reading file ',filename
        return None

def storeTextFile(filename,txt):
    try:
        outfile=open(filename,'w')
        outfile.write(txt)
        outfile.close()
    except:
        print 'Error writing file ',filename

#------------------------
# the job
def doit(infile,outfile):
    txt=getTextFile(infile)
    if(txt==None):
        return
    # prepare this string for unicode in a domtree
    txt=txt.decode('ISO-8859-1')
    lines=txt.split('\n')
    # set up basic document
    doc=xml.dom.minidom.Document()
    root_elt=doc.createElement('booklist')
    doc.appendChild(root_elt)
    # walk the linelist
    linecount=0
    for line in lines:
        line=line.strip()
        # skip the blanks and the comments
        if len(line) <3:
            continue
        if line[0:2]=="//":
            continue
        # we will use it
        # title,author,publisher,year,isbn,pages,course,category,comment
        pieces=line.split(',')
        if len(pieces)!=9:
            # bad line
            print "ignore: " + line
            continue
        # make book
        book_elt_node=doc.createElement('book')
        book_elt_node.setAttribute('isbn',pieces[4])
        book_elt_node.setAttribute('pages',pieces[5])
        root_elt.appendChild(book_elt_node)

        new_elt_node=doc.createElement('title')
        new_elt_node.appendChild(doc.createTextNode(pieces[0]))
        book_elt_node.appendChild(new_elt_node)

        new_elt_node=doc.createElement('course')
        new_elt_node.appendChild(doc.createTextNode(pieces[6]))
        book_elt_node.appendChild(new_elt_node)

        new_elt_node=doc.createElement('category')
        new_elt_node.appendChild(doc.createTextNode(pieces[7]))
        book_elt_node.appendChild(new_elt_node)

        new_elt_node=doc.createElement('author')
        new_elt_node.appendChild(doc.createTextNode(pieces[1]))
        book_elt_node.appendChild(new_elt_node)

        new_elt_node=doc.createElement('publisher')
        new_elt_node.appendChild(doc.createTextNode(pieces[2]))
        book_elt_node.appendChild(new_elt_node)

        new_elt_node=doc.createElement('year')
        new_elt_node.appendChild(doc.createTextNode(pieces[3]))
        book_elt_node.appendChild(new_elt_node)

        new_elt_node=doc.createElement('comment')
        new_elt_node.appendChild(doc.createTextNode(pieces[8]))
        book_elt_node.appendChild(new_elt_node)

    # raw print while testing
    # print doc.toxml().encode('ISO-8859-1')

    # get it on file
    # need the domtree, doc, as a ISO-8859-1 encoded string
    s=StringIO.StringIO()
    doc.writexml(codecs.getwriter('ISO-8859-1')(s))
    # some dirty formatting, take care
    s=s.getvalue().replace('>','>\n')
    s=s.replace('<book','\n\n<book')
    # fix prolog
    prolog="""<?xml version="1.0" encoding="ISO-8859-1" ?>
<!DOCTYPE booklist SYSTEM "bokdok.dtd">"""
    s=s.replace('<?xml version="1.0" ?>',prolog)
    # while testing
    #print s
    storeTextFile(outfile,s)
    doc.unlink()

# basic testing
if __name__=="__main__":
    doit('c:\\web\\dw\\pydom\\bokliste.txt',
         'c:\\web\\dw\\pydom\\bokliste2.xml')
This is done by building a DOM tree and inserting nodes generated from the text. This Python code does in principle the same as the code described in the module HTML og XML. That module describes a program that does the same job as pure text processing, without using the DOM.
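The node-by-node construction can be condensed by looping over the field names. This is a sketch in Python 3 syntax, not the program above: the names FIELDS, ATTRIBUTES and build_booklist and the sample line are assumptions for the illustration, and the child elements come out in field order rather than in the order used by the listing.

```python
import xml.dom.minidom

# Field order as given in the comment in the listing above.
FIELDS = ["title", "author", "publisher", "year", "isbn",
          "pages", "course", "category", "comment"]
# These two fields become attributes on book, the rest child elements.
ATTRIBUTES = ("isbn", "pages")

def build_booklist(lines):
    """Build a DOM tree with one book element per well-formed line."""
    doc = xml.dom.minidom.Document()
    root = doc.createElement("booklist")
    doc.appendChild(root)
    for line in lines:
        pieces = [piece.strip() for piece in line.split(",")]
        if len(pieces) != len(FIELDS):
            continue  # bad line, ignore it
        book = doc.createElement("book")
        root.appendChild(book)
        for field, value in zip(FIELDS, pieces):
            if field in ATTRIBUTES:
                book.setAttribute(field, value)
            else:
                child = doc.createElement(field)
                child.appendChild(doc.createTextNode(value))
                book.appendChild(child)
    return doc

# Made-up sample lines, only for illustration.
doc = build_booklist([
    "A Title,An Author,A Publisher,2000,000-0,123,course1,cat1,no comment",
    "this line has too few fields",
])
xml_text = doc.toxml()
doc.unlink()
print(xml_text)
```

As in the listing, a plain split on commas only works as long as no field contains a comma itself.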
Exercise 2
We write a program that takes an XML file as built in exercise 1 and changes its structure: one element is turned into an attribute and one attribute is turned into an element.
import StringIO,codecs,xml.dom.minidom
"""
Demo of MINIDOM.
Changing the structure of a XML-file
Data is described on
http://www.ia.hiof.no/~borres/ml/python/p-python.html
change it to make:
 all titles an attribute instead of an element
 all pages an element instead of an attribute
B. Stenseth 2002
Use: doit(infile,outfile)
"""
#-----------------------
# file io
def getTextFile(filename):
    try:
        file=open(filename,'r')
        intext=file.read()
        file.close()
        return intext
    except:
        print 'Error reading file ',filename
        return None

def storeTextFile(filename,txt):
    try:
        outfile=open(filename,'w')
        outfile.write(txt)
        outfile.close()
    except:
        print 'Error writing file ',filename

# collect all text in a node
def getText(nodelist):
    rc = ''
    for node in nodelist:
        if node.nodeType == node.TEXT_NODE:
            t=node.data.encode('ISO-8859-1')
            rc += t
    return rc

# collect all text in a node, whitespace around each piece removed
def getStrippedText(nodelist):
    rc = ''
    for node in nodelist:
        if node.nodeType == node.TEXT_NODE:
            t=node.data.encode('ISO-8859-1')
            rc += t.strip()
    return rc

def doit(infile,outfile):
    txt=getTextFile(infile)
    if(txt==None):
        return
    # the parser reads the encoding from the prolog, so no decode here
    # txt=txt.decode('ISO-8859-1')
    doc = xml.dom.minidom.parseString(txt)
    books=doc.getElementsByTagName('book')
    for book in books:
        # pick up the title-element
        title_elt=book.getElementsByTagName('title')[0]
        title_str=getStrippedText(title_elt.childNodes)
        # make the title an attribute
        book.setAttribute('title',title_str.decode('ISO-8859-1'))
        # remove the title element
        book.removeChild(title_elt)
        # pick up the pages-attribute
        page_str=book.getAttribute('pages')
        # make the element
        page_elt=doc.createElement('pages')
        # make the text child node
        page_elt.appendChild(doc.createTextNode(page_str))
        book.appendChild(page_elt)
        # remove pages-attribute
        book.removeAttribute('pages')

    # get it on file
    # need the domtree, doc, as a ISO-8859-1 encoded string
    s=StringIO.StringIO()
    doc.writexml(codecs.getwriter('ISO-8859-1')(s))
    s=s.getvalue()
    # fix prolog
    prolog='<?xml version="1.0" encoding="ISO-8859-1" ?>'
    s=s.replace('<?xml version="1.0" ?>',prolog)
    s=s.replace('bokdok.dtd','bokdok2.dtd')
    # while testing
    # print s
    storeTextFile(outfile,s)
    doc.unlink()

# basic testing
if __name__=="__main__":
    doit('c:\\web\\dw\\pydom\\bokliste2.xml',
         'c:\\web\\dw\\pydom\\bokliste3.xml')
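The two structural changes can be tried on a single inline book. This is a sketch in Python 3 syntax; the sample document is made up, and only the element and attribute names follow the listing above.

```python
import xml.dom.minidom

# Made-up sample with the structure produced in exercise 1.
SAMPLE = ('<booklist><book isbn="000-0" pages="123">'
          '<title> A Title </title><author>An Author</author>'
          '</book></booklist>')

doc = xml.dom.minidom.parseString(SAMPLE)
for book in doc.getElementsByTagName("book"):
    # Element -> attribute: read the title text, set it as an
    # attribute, then detach the title element from the book.
    title_elt = book.getElementsByTagName("title")[0]
    title = "".join(node.data for node in title_elt.childNodes
                    if node.nodeType == node.TEXT_NODE).strip()
    book.setAttribute("title", title)
    book.removeChild(title_elt)
    title_elt.unlink()

    # Attribute -> element: create a pages element with a text
    # child, append it, then drop the pages attribute.
    pages_elt = doc.createElement("pages")
    pages_elt.appendChild(doc.createTextNode(book.getAttribute("pages")))
    book.appendChild(pages_elt)
    book.removeAttribute("pages")

result = doc.documentElement.toxml()
doc.unlink()
print(result)
```

Note that removeChild only works on a direct child. Here title is a direct child of book, so the call is safe.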
Example: Skating data
The topic is speed skating, with separate text files holding results from the 500 m, 1500 m, 5000 m and 10000 m. These text files are very simple and contain one name and one result per line. The files are named s500.txt, s1500.txt, s5000.txt and s10000.txt respectively. We write a program that does the following:
- Reads the four files and builds a DOM tree for each of them
- Merges the four trees into one
- Calculates the total score for each skater
- Sorts all skater nodes by the calculated score
- Produces an HTML file in which the skaters are listed in order of score
This is hardly an optimal way to solve the problem, but it can serve as a DOM exercise. The five steps are marked in the Python code.
import StringIO,xml.dom.minidom,codecs
"""
Demo of MINIDOM.
NOTE that this may not be the smartest or fastest way
to solve this problem. It is written to demonstrate minidom

function makeCompleteXML(catalog)
Building a XML-file based on four text-files:
Results from 500, 1500, 5000, 10000 m speedskating
each with lines of the form (not sorted): name,result
Filenames are s500.txt, s1500.txt, s5000.txt, s10000.txt
Returns a tree with following structure:
<?xml version="1.0" encoding="ISO-8859-1" ?>
<skatingevent>
  <skater name="olsen">
    <res500>40.00</res500>
    <res1500>1.50.00</res1500>
    <res5000>6.40.00</res5000>
    <res10000>13.40.00</res10000>
    <points>87559</points>
  </skater>
  ...
</skatingevent>

Function doit(catalog)
calls storeXMLFile and produce a sorted html-file: skaters.html

Job is done in 5 commented steps:
 1 Read the 4 txtfiles and establish a DOM-tree for each
 2 Joins the 4 trees to one tree
 3 Calculates aggregated points for each skater
 4 Sort skaters on points
 5 Make an HTML-file of sorted skaters

Usage: doit(catalog)
B. Stenseth 2009
"""
def getText(nodelist):
    # collect all text in a node
    rc = ''
    for node in nodelist:
        if node.nodeType == node.TEXT_NODE:
            t=node.data.encode('ISO-8859-1')
            rc += t
    return rc

def makeTree(catalog,distanse):
    # read a file from the catalog and establish tree
    try:
        filename=catalog+'\\s'+distanse+'.txt'
        # sample: c:\myskatingfiles\s500.txt
        file=open(filename,'r')
        intxt=file.read()
        file.close()
        intxt=intxt.decode('ISO-8859-1')
        doc=xml.dom.minidom.Document()
        root_elt=doc.createElement('skatingevent')
        doc.appendChild(root_elt)
        lines=intxt.split('\n')
        for line in lines:
            pieces=line.split(',')
            if len(pieces)==2:
                skater_elt=doc.createElement('skater')
                skater_elt.setAttribute('name',pieces[0])
                result_elt=doc.createElement('res'+distanse)
                result_elt.appendChild(doc.createTextNode(pieces[1]))
                skater_elt.appendChild(result_elt)
                root_elt.appendChild(skater_elt)
        return doc
    except:
        print 'Error building: '+distanse
        return ''

def storeXMLFile(filename,doc):
    # storing an xmlfile from a tree
    s=StringIO.StringIO()
    doc.writexml(codecs.getwriter('ISO-8859-1')(s))
    t=s.getvalue()
    # some dirty formatting, take care
    t=t.replace('<skater','\n<skater')
    t=t.replace('<res','\n<res')
    # fix prolog
    prolog='<?xml version="1.0" encoding="ISO-8859-1" ?>'
    t=t.replace('<?xml version="1.0" ?>',prolog)
    # print while storing if you want to test
    # print t
    try:
        outf=open(filename,'w')
        outf.write(t)
        outf.close()
    except:
        print 'Error in writing tree at:'+ filename

def makeCompleteXML(catalog='c:\\articles\\ml\\dom'):
    # produce the complete tree with results from all distances
    # strategy is to make a tree for each distance and then join them
    #--------------------------------------
    # STEP 1 make 4 DOM-trees
    t500=makeTree(catalog,'500')
    t1500=makeTree(catalog,'1500')
    t5000=makeTree(catalog,'5000')
    t10000=makeTree(catalog,'10000')
    #--------------------------------------
    # store them and print them while testing
    #storeXMLFile(catalog+'\\xml500.xml',t500)
    #storeXMLFile(catalog+'\\xml1500.xml',t1500)
    #storeXMLFile(catalog+'\\xml5000.xml',t5000)
    #storeXMLFile(catalog+'\\xml10000.xml',t10000)
    #--------------------------------------
    # STEP 2 join trees to one tree
    # use t500 as master and assemble results from the three others
    d500=t500.getElementsByTagName('skater')
    d1500=t1500.getElementsByTagName('skater')
    d5000=t5000.getElementsByTagName('skater')
    d10000=t10000.getElementsByTagName('skater')
    for p500 in d500:
        name500=p500.getAttribute('name')
        for p1500 in d1500:
            if name500==p1500.getAttribute('name'):
                p500.appendChild(p1500.getElementsByTagName('res1500')[0])
                break
        for p5000 in d5000:
            if name500==p5000.getAttribute('name'):
                p500.appendChild(p5000.getElementsByTagName('res5000')[0])
                break
        for p10000 in d10000:
            if name500==p10000.getAttribute('name'):
                p500.appendChild(p10000.getElementsByTagName('res10000')[0])
                break
    #--------------------------------------
    # now we have all results in t500
    # write it if you want
    # storeXMLFile(catalog+'\\sall.xml',t500)
    #--------------------------------------
    # STEP 3 calculate aggregated points for each skater
    # we want to calculate points for each skater and add
    # an element points to each skater
    skaters=t500.getElementsByTagName('skater')
    for skater in skaters:
        # calculate timepoints in 1/100 seconds
        s=getText(skater.getElementsByTagName('res500')[0].childNodes)
        hsecs=makeSeconds(s)
        s=getText(skater.getElementsByTagName('res1500')[0].childNodes)
        hsecs+=makeSeconds(s)/3.0
        s=getText(skater.getElementsByTagName('res5000')[0].childNodes)
        hsecs+=makeSeconds(s)/10.0
        s=getText(skater.getElementsByTagName('res10000')[0].childNodes)
        hsecs+=makeSeconds(s)/20.0
        points='%.3f' %(hsecs/100.0)
        point_elt=t500.createElement('points')
        skater.appendChild(point_elt)
        point_elt.appendChild(t500.createTextNode(points))
    #--------------------------------------
    # and you may save it again while testing
    # storeXMLFile(catalog+'\\sallpoints.xml',t500)
    # clean up
    t1500.unlink()
    t5000.unlink()
    t10000.unlink()
    return t500

def makeSeconds(s):
    # calculate 1/100 seconds from s
    # s in form mm.ss.hh ( minutes, seconds, 1/100 seconds)
    # print s
    parts=s.split('.')
    hsecs=0
    if len(parts)==3:
        hsecs=6000*int(parts[0])+100*int(parts[1])+int(parts[2])
    elif len(parts)==2:
        hsecs=100*int(parts[0])+int(parts[1])
    else:
        print 'error in timeformat: ' + s
        hsecs=9999999
    return hsecs

def compareSkaters(s1,s2):
    # used while sorting
    # compare the full point values, not the truncated integers
    s1pnt=s1.getElementsByTagName('points')[0]
    s2pnt=s2.getElementsByTagName('points')[0]
    v1=float(getText(s1pnt.childNodes))
    v2=float(getText(s2pnt.childNodes))
    return cmp(v1,v2)

def doit(catalog):
    # make the complete job from 4 text-files to html-file
    # first we build the complete xml-tree
    # including all results and calculated points
    #--------------------------------------
    # STEPS 1,2,3 as commented on top of script
    doc=makeCompleteXML(catalog)
    #--------------------------------------
    # and we may save it just to test
    storeXMLFile(catalog+'\\sallpoints.xml',doc)
    #--------------------------------------
    # STEP 4 sort skaters according to calculated points
    skaters=doc.getElementsByTagName('skater')
    skaters.sort(compareSkaters)
    #--------------------------------------
    # STEP 5 produce a HTML-page
    # now we want to produce some html-output with results
    T="""<html>
<head>
<title>resultater</title>
</head>
<body>
<h1 style="font-size:14px">Resultater</h1>
"""
    # run through sorted list of skaters
    T+='<table cellpadding="2">\n'
    for skater in skaters:
        T+='<tr><td style="font-size:12px">'
        T+=skater.getAttribute('name').encode('ISO-8859-1')
        T+='</td><td style="font-size:12px">'
        T+=getText(skater.getElementsByTagName('points')[0].childNodes)
        T+='</td></tr>\n'
    T+='</table>\n</body>\n</html>\n'
    filename=catalog+'\\skaters.html'
    outf=open(filename,'w')
    outf.write(T)
    outf.close()
    #--------------------------------------
    doc.unlink()

# basic testing
if __name__=="__main__":
    doit('c:\\web\\dw\\pydom')
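Steps 3 and 4 can be checked in isolation. The sketch below, in Python 3 syntax, recomputes the score for the sample times given for "olsen" in the docstring above and sorts skater nodes with a key function instead of a compare function. The function names to_hundredths, total_points and points_of, and the two-skater document, are assumptions made for this illustration.

```python
import xml.dom.minidom

def to_hundredths(s):
    """Convert 'mm.ss.hh' or 'ss.hh' to hundredths of a second."""
    parts = [int(piece) for piece in s.strip().split(".")]
    if len(parts) == 3:
        return 6000 * parts[0] + 100 * parts[1] + parts[2]
    if len(parts) == 2:
        return 100 * parts[0] + parts[1]
    raise ValueError("unexpected time format: %r" % s)

def total_points(r500, r1500, r5000, r10000):
    """Each time is scaled down to its 500 m equivalent, then summed."""
    hsecs = (to_hundredths(r500)
             + to_hundredths(r1500) / 3.0
             + to_hundredths(r5000) / 10.0
             + to_hundredths(r10000) / 20.0)
    return hsecs / 100.0

# The sample times from the docstring in the listing above.
points = total_points("40.00", "1.50.00", "6.40.00", "13.40.00")
print("%.3f" % points)

# Made-up tree with precomputed points, only to show the sorting.
doc = xml.dom.minidom.parseString(
    '<skatingevent>'
    '<skater name="B"><points>160.500</points></skater>'
    '<skater name="A"><points>157.667</points></skater>'
    '</skatingevent>')

def points_of(skater):
    """Read the points element of a skater node as a float."""
    node = skater.getElementsByTagName("points")[0]
    return float(node.firstChild.data)

# Lowest score first, as in speed skating all-round results.
ranked = sorted(doc.getElementsByTagName("skater"), key=points_of)
names = [skater.getAttribute("name") for skater in ranked]
doc.unlink()
print(names)
```

Sorting the list of nodes does not reorder the nodes in the tree itself; as in the listing, only the list used for the HTML output is sorted.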