XML parsing in Python

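This article focuses on parsing a given XML file and extracting useful data from it in a structured way.

XML: XML stands for eXtensible Markup Language. It was designed to store and transport data, and to be readable by both humans and machines. Its design goals therefore emphasize simplicity, generality and usability across the Internet. The XML file parsed in this tutorial is actually an RSS feed.

RSS: RSS (Rich Site Summary, often called Really Simple Syndication) uses a family of standard web feed formats to publish frequently updated information such as blog entries, news headlines, audio and video. RSS is plain text in XML format.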

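The RSS format itself is relatively easy to read, both for automated processes and for humans.
The RSS processed in this tutorial is the top news stories feed of a popular news website, available at http://www.hindustantimes.com/rss/topnews/rssfeed.xml. Our goal is to process this RSS feed (that is, this XML file) and save it in another format for future use.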
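To make the shape of an RSS document concrete, here is a minimal sketch that parses a small feed held in a string. The feed content (titles, example.com links) is hand-written and hypothetical, not taken from the real feed; only the rss > channel > item nesting matters.

```python
import xml.etree.ElementTree as ET

# Hypothetical, hand-written RSS 2.0 document; not taken from any real feed.
SAMPLE_RSS = """
<rss version="2.0">
  <channel>
    <title>Example News</title>
    <link>http://example.com/</link>
    <description>Top stories</description>
    <item>
      <title>First story</title>
      <link>http://example.com/1</link>
      <guid>http://example.com/1</guid>
    </item>
    <item>
      <title>Second story</title>
      <link>http://example.com/2</link>
      <guid>http://example.com/2</guid>
    </item>
  </channel>
</rss>
"""

# fromstring returns the root element, here <rss>
root = ET.fromstring(SAMPLE_RSS)

# the single <channel> child holds the feed metadata and the items
channel = root.find("channel")
feed_title = channel.findtext("title")

# every <item> is one entry of the feed
item_titles = [item.findtext("title") for item in channel.findall("item")]

print(root.tag, root.attrib["version"])
print(feed_title, item_titles)
```

Each item element carries one story, which is why the rest of the tutorial turns every item into one dictionary.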

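Python modules used: This article focuses on Python's built-in xml module for parsing XML, and mainly on that module's ElementTree XML API. The code also uses the built-in csv module and the third-party requests library.

Implementation: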

# Python code to illustrate parsing of XML files
# importing the required modules
import csv
import requests
import xml.etree.ElementTree as ET


def loadRSS():
    # url of rss feed
    url = 'http://www.hindustantimes.com/rss/topnews/rssfeed.xml'

    # creating HTTP response object from given url
    resp = requests.get(url)

    # saving the xml file
    with open('topnewsfeed.xml', 'wb') as f:
        f.write(resp.content)


def parseXML(xmlfile):
    # create element tree object
    tree = ET.parse(xmlfile)

    # get root element
    root = tree.getroot()

    # create empty list for news items
    newsitems = []

    # iterate news items
    for item in root.findall('./channel/item'):
        # empty news dictionary
        news = {}

        # iterate child elements of item
        for child in item:
            # special checking for the namespaced media:content element;
            # assumes the feed uses the standard Media RSS namespace URI
            if child.tag == '{http://search.yahoo.com/mrss/}content':
                news['media'] = child.attrib['url']
            else:
                # child.text is already a str in Python 3
                news[child.tag] = child.text

        # append news dictionary to news items list
        newsitems.append(news)

    # return news items list
    return newsitems


def savetoCSV(newsitems, filename):
    # specifying the fields for csv file
    fields = ['guid', 'title', 'pubDate', 'description', 'link', 'media']

    # writing to csv file
    with open(filename, 'w', newline='', encoding='utf-8') as csvfile:
        # creating a csv dict writer object
        writer = csv.DictWriter(csvfile, fieldnames=fields)

        # writing headers (field names)
        writer.writeheader()

        # writing data rows
        writer.writerows(newsitems)


def main():
    # load rss from web to update existing xml file
    loadRSS()

    # parse xml file
    newsitems = parseXML('topnewsfeed.xml')

    # store news items in a csv file
    savetoCSV(newsitems, 'topnews.csv')


if __name__ == '__main__':
    # calling main function
    main()

The code above will:

Load the RSS feed from the specified URL and save it as an XML file.
Parse the XML file and store the news as a list of dictionaries, where each dictionary is one news item.
Save the news items to a CSV file.

Let us go through the code piece by piece.

Loading and saving the RSS feed:

def loadRSS():
    # url of rss feed
    url = 'http://www.hindustantimes.com/rss/topnews/rssfeed.xml'

    # creating HTTP response object from given url
    resp = requests.get(url)

    # saving the xml file
    with open('topnewsfeed.xml', 'wb') as f:
        f.write(resp.content)

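Here we first create an HTTP response object by sending a GET request to the URL of the RSS feed. The content of the response holds the XML data, which we save as topnewsfeed.xml in the local directory.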
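The download step can also be written with only the standard library. The following is a minimal sketch, not the article's method: it swaps requests for urllib.request, and the function name download_feed and its timeout default are hypothetical choices made for illustration.

```python
import shutil
import urllib.request


def download_feed(url, dest_path, timeout=10):
    """Copy the resource at `url` into the file `dest_path` and return the path.

    Hypothetical helper: the name and the 10-second default are assumptions.
    """
    # urlopen raises urllib.error.HTTPError for HTTP error responses,
    # so a failed request never produces a saved file with an error page in it
    with urllib.request.urlopen(url, timeout=timeout) as resp, \
            open(dest_path, "wb") as out:
        # stream the body to disk instead of holding it all in memory
        shutil.copyfileobj(resp, out)
    return dest_path


# Usage, assuming the feed URL from the article is reachable:
# download_feed('http://www.hindustantimes.com/rss/topnews/rssfeed.xml',
#               'topnewsfeed.xml')
```

When staying with requests, note that requests.get does not raise on an HTTP error status by itself; calling resp.raise_for_status() before writing the file gives similar protection.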

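Parsing XML: The parseXML() function parses the XML file. XML is an inherently hierarchical data format, and the most natural way to represent it is as a tree. We use the xml.etree.ElementTree module (ET for short). It has two classes for this purpose: ElementTree represents the whole XML document as a tree, and Element represents a single node in that tree. Interactions with the whole document (reading from and writing to files) are usually done at the ElementTree level, while interactions with a single XML element and its sub-elements are done at the Element level.

Now let us go through the parseXML() function: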

tree = ET.parse(xmlfile)

This creates an ElementTree object by parsing the file passed in as xmlfile.

root = tree.getroot()

getroot() returns the root of tree as an Element object.
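The two steps above can be tried without a file on disk. This sketch assumes a tiny hand-written document and uses an in-memory file object in place of topnewsfeed.xml.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical document standing in for the saved feed file.
xml_bytes = b"<rss version='2.0'><channel><title>Example</title></channel></rss>"

# ET.parse accepts a filename or a file object and returns an ElementTree
tree = ET.parse(io.BytesIO(xml_bytes))

# getroot returns the top-level Element, here <rss>
root = tree.getroot()

print(type(tree).__name__, type(root).__name__)
print(root.tag, root.attrib)

# iterating over an Element yields its direct children
child_tags = [child.tag for child in root]
print(child_tags)

# ET.fromstring skips the ElementTree wrapper and returns the root Element directly
same_root = ET.fromstring(xml_bytes)
```

ET.parse is the natural choice when the XML already lives in a file, as in this tutorial; ET.fromstring is handy when the XML is already in memory.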

for item in root.findall('./channel/item'):

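Looking at the structure of the XML file, the only elements we are interested in are the item elements. ./channel/item is XPath syntax (XPath is a language for addressing parts of an XML document). Here it selects every item element that is a child of a channel element that is itself a child of the root (denoted by '.'). The supported subset of XPath is described in the xml.etree.ElementTree documentation.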
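A few more path expressions from the subset that ElementTree supports are sketched below on a small, hypothetical document, to show how ./channel/item compares with its neighbours.

```python
import xml.etree.ElementTree as ET

# Hypothetical feed fragment used only to exercise the path expressions.
root = ET.fromstring(
    "<rss><channel>"
    "<title>Example</title>"
    "<item><title>A</title><link>http://example.com/a</link></item>"
    "<item><title>B</title></item>"
    "</channel></rss>"
)

# item children of channel children of the root (the expression used in the article)
direct = root.findall("./channel/item")

# item elements at any depth below the root
anywhere = root.findall(".//item")

# the title of every item, but not the channel's own title
titles = [t.text for t in root.findall("./channel/item/title")]

# only the items that have a <link> child
with_link = root.findall("./channel/item[link]")

# findtext returns the text of the first matching element
first_title = root.findtext("./channel/item/title")

print(len(direct), len(anywhere), titles, len(with_link), first_title)
```

findall always returns a list, so a path that matches nothing yields an empty list rather than an error.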

for item in root.findall('./channel/item'):
    # empty news dictionary
    news = {}

    # iterate child elements of item
    for child in item:
        # special checking for the namespaced media:content element;
        # assumes the feed uses the standard Media RSS namespace URI
        if child.tag == '{http://search.yahoo.com/mrss/}content':
            news['media'] = child.attrib['url']
        else:
            # child.text is already a str in Python 3
            news[child.tag] = child.text

    # append news dictionary to news items list
    newsitems.append(news)

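We are iterating over item elements, each of which contains one news story, so we create an empty news dictionary to hold all the data available for that item. To iterate over the child elements of an element, we simply iterate over the element itself: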

for child in item:

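Namespaced tags have to be handled separately, because the parser expands the prefix to the full namespace URI in braces. Assuming the feed uses the standard Media RSS namespace, the check is:

if child.tag == '{http://search.yahoo.com/mrss/}content':
    news['media'] = child.attrib['url']

child.attrib is a dictionary of all the attributes of an element. Here we are interested in the url attribute of the media:content tag.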
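An alternative to comparing against the expanded tag is to pass a prefix-to-URI mapping to find. The sketch below assumes a hypothetical item and the namespace URI defined by Media RSS; a real feed's xmlns:media declaration should be checked before relying on it.

```python
import xml.etree.ElementTree as ET

# Hypothetical item; the namespace URI is the one defined by Media RSS.
doc = (
    '<rss xmlns:media="http://search.yahoo.com/mrss/"><channel><item>'
    '<title>Story</title>'
    '<media:content url="http://example.com/pic.jpg"/>'
    '</item></channel></rss>'
)
root = ET.fromstring(doc)
item = root.find("./channel/item")

# after parsing, the prefix is replaced by the URI in braces
tags = [child.tag for child in item]
print(tags)

# a mapping lets us keep using the readable prefix in search paths
NS = {"media": "http://search.yahoo.com/mrss/"}
content = item.find("media:content", NS)

# find returns None when the element is missing, so guard before reading attributes
media_url = content.get("url") if content is not None else None
print(media_url)
```

The prefix in the mapping is local to the search call; it does not have to match the prefix used inside the document, only the URI does.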

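For all other children, we simply do:

news[child.tag] = child.text

child.tag contains the name of the child element, and child.text holds the text inside it. In Python 3 this text is already a str, so it can be stored as is.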
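One detail worth guarding against is that child.text is None for an empty element. The following sketch, with a hypothetical helper named item_to_dict and a made-up item, shows one way to normalize the values while building the dictionary.

```python
import xml.etree.ElementTree as ET

# Hypothetical item with padded text and an empty <description/> element.
item = ET.fromstring(
    "<item><title>  Story  </title><description/>"
    "<link>http://example.com/s</link></item>"
)


def item_to_dict(item):
    """Map each child tag to its stripped text, using '' for empty elements."""
    # child.text is None when the element has no text, hence the `or ""`
    return {child.tag: (child.text or "").strip() for child in item}


news = item_to_dict(item)
print(news)
```

This version ignores the namespaced media:content case, which still needs the separate check shown earlier.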

A sample item element, once converted to a dictionary, looks like this:

{'description': 'Ignis has a tough competition already from Hyun....',
 'guid': 'http://www.hindustantimes.com/autos/maruti-ignis-launch....',
 'link': 'http://www.hindustantimes.com/autos/maruti-ignis-launch....',
 'media': 'http://www.hindustantimes.com/rf/image_size_630x354/HT/...',
 'pubDate': 'Thu, 12 Jan 2017 12:33:04 GMT',
 'title': 'Maruti Ignis launches on Jan 13: Five cars that threa.....'}
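The pubDate value is kept as a plain string. If a real timestamp is needed later, the standard library can parse it; this is an optional sketch that assumes the usual RSS date layout with a comma after the weekday.

```python
from email.utils import parsedate_to_datetime

# pubDate string in the layout RSS uses (weekday, day month year time zone)
news = {"pubDate": "Thu, 12 Jan 2017 12:33:04 GMT"}

# returns a timezone-aware datetime because the string names a zone
published = parsedate_to_datetime(news["pubDate"])

print(published.isoformat())
```

Keeping the original string in the dictionary and converting only when needed avoids changing what ends up in the CSV file.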

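This dictionary is then appended to the newsitems list, and finally the list is returned.

Saving data to a CSV file: The savetoCSV() function writes the list of news items to a CSV file so that it can be easily used or modified later.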
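The behaviour of csv.DictWriter can be checked in memory before writing real files. The sketch below uses made-up news items and two optional DictWriter arguments, extrasaction and restval, which the article's code does not use; they are shown here as one possible way to cope with items whose keys do not exactly match the field list.

```python
import csv
import io

fields = ["guid", "title", "pubDate", "description", "link", "media"]

# Hypothetical item: some fields are missing and one key is not in `fields`.
newsitems = [
    {"guid": "1", "title": "Story, with comma", "link": "http://example.com/1",
     "category": "extra key not in fields"},
]

# an in-memory text buffer stands in for the output file
buf = io.StringIO()

# extrasaction="ignore" drops unknown keys instead of raising ValueError;
# restval="" fills the fields an item does not have
writer = csv.DictWriter(buf, fieldnames=fields, extrasaction="ignore", restval="")
writer.writeheader()
writer.writerows(newsitems)

csv_text = buf.getvalue()
print(csv_text)
```

When writing to a real file in Python 3, opening it with newline='' (and an explicit encoding such as utf-8) is what the csv module documentation recommends, so that line endings are not translated twice.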

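The formatted data is a table with one row per news item and the columns guid, title, pubDate, description, link and media.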

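As you can see, the hierarchical XML data has been converted to a simple CSV file in which all news stories are stored as a table. This also makes it easier to extend the database. The JSON-like data can also be used directly in applications. This is a practical alternative for extracting data from websites that do not provide a public API but do provide RSS feeds.

What next?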

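You can look at more RSS feeds of the news website used in the example above, and try to build an extended version of the example by parsing those feeds as well.
Are you a cricket fan? Then a live-score RSS feed may interest you: you can parse its XML to gather information about live cricket matches and use it to build a desktop notifier.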
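As a starting point for such extensions, the list of dictionaries produced by the parser can be handed to an application as JSON instead of CSV. This is a minimal sketch with a made-up news item, not part of the original code.

```python
import json

# Hypothetical parser output: one dictionary per news item.
newsitems = [{"title": "Story", "link": "http://example.com/1", "media": None}]

# ensure_ascii=False keeps non-ASCII headlines readable; indent=2 pretty-prints
json_text = json.dumps(newsitems, ensure_ascii=False, indent=2)
print(json_text)

# the text round-trips back to the same list of dictionaries
restored = json.loads(json_text)
```

Unlike the CSV output, JSON keeps missing values as null and does not require a fixed list of fields.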
