In Python, BeautifulSoup is a library for parsing HTML and XML documents. It lets you pull elements out of a document by tag name, and also by id or class. Each match is returned as an object on which several operations are available.
* To parse a document, pass it in either as an open file or as a string.
#Code
from bs4 import BeautifulSoup
with open("index.html") as fp:
    soup = BeautifulSoup(fp, 'html.parser')
soup = BeautifulSoup("<html>data</html>", 'html.parser')
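As a minimal sketch of the two input forms, the snippet below parses the same markup once from a string and once from a file-like object. Here io.StringIO stands in for a real file, and the sample markup is made up for illustration.

```python
import io
from bs4 import BeautifulSoup

markup = "<html><body><p>data</p></body></html>"  # illustrative markup

# Parse directly from a string, naming the parser explicitly.
soup_from_string = BeautifulSoup(markup, "html.parser")

# Parse from a file-like object; StringIO stands in for open("index.html").
soup_from_file = BeautifulSoup(io.StringIO(markup), "html.parser")

print(soup_from_string.p)  # <p>data</p>
print(soup_from_file.p)    # <p>data</p>
```

Naming the parser explicitly keeps the result the same on machines where other parsers happen to be installed.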
Operations:
* To find a tag in an HTML or XML document, use the find method:
#Code
from bs4 import BeautifulSoup
soup = BeautifulSoup(your_html_code, 'html.parser')
print(soup.find('tag_to_find'))
#Example
html_cont = """<div>
<p>HTML FILE</p>
<span>Image</span>
<p>END</p>
</div>"""
soup = BeautifulSoup(html_cont, 'html.parser')
print(soup.find('p'))
#Output
<p>HTML FILE</p>
* The find method returns only the first matching tag in the HTML.
* To get every matching tag, call the find_all method.
* It returns the matches as a list.
#Example
html_cont = """<div>
<p>HTML FILE</p>
<span>Image</span>
<p>END</p>
</div>"""
soup = BeautifulSoup(html_cont, 'html.parser')
print(soup.find_all('p'))
#Output
[<p>HTML FILE</p>, <p>END</p>]
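The difference between the two calls shows most clearly when nothing matches. The sketch below, using made-up markup, compares what find and find_all return for a tag that exists and for one that does not.

```python
from bs4 import BeautifulSoup

html_cont = """<div>
<p>HTML FILE</p>
<p>END</p>
</div>"""
soup = BeautifulSoup(html_cont, "html.parser")

first_p = soup.find("p")            # first match only
all_p = soup.find_all("p")          # every match, in document order
missing = soup.find("table")        # the markup has no <table>
no_tables = soup.find_all("table")

print(first_p)      # <p>HTML FILE</p>
print(len(all_p))   # 2
print(missing)      # None
print(no_tables)    # []
```

Because find returns None when there is no match, it is worth checking the result before reading attributes from it.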
* A tag may have any number of attributes. They can be accessed by treating the tag as a dictionary.
#Code
from bs4 import BeautifulSoup
soup = BeautifulSoup(your_html_code, 'html.parser')
tag = soup.find('tag_to_find')
print(tag['id'])
#Example
html_cont = """<div>
<p id="boldest">HTML FILE</p>
<span>Image</span>
<p>END</p>
</div>"""
soup = BeautifulSoup(html_cont, 'html.parser')
tag = soup.find('p')
print(tag['id'])
#Output
boldest
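Dictionary-style access raises KeyError when the attribute is absent. A small sketch of the safer get lookup follows; the markup and the "no-id" fallback value are illustrative only.

```python
from bs4 import BeautifulSoup

html_cont = '<div><p id="boldest">HTML FILE</p><p>END</p></div>'
soup = BeautifulSoup(html_cont, "html.parser")
first_p, second_p = soup.find_all("p")

print(first_p["id"])                 # boldest
print(second_p.get("id"))            # None, the attribute is absent
print(second_p.get("id", "no-id"))   # no-id, fallback value

try:
    second_p["id"]                   # dictionary-style access on a missing key
except KeyError:
    print("second <p> has no id attribute")
```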
* To get all the attributes of a tag at once, use attrs:
#Code
from bs4 import BeautifulSoup
soup = BeautifulSoup(your_html_code, 'html.parser')
tag = soup.find('tag_to_find')
print(tag.attrs)
#Example
html_cont = """<div>
<p id="boldest" class="bold-class">HTML FILE</p>
<span>Image</span>
<p>END</p>
</div>"""
soup = BeautifulSoup(html_cont, 'html.parser')
tag = soup.find('p')
print(tag.attrs)
#Output
{'id': 'boldest', 'class': ['bold-class']}
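The class attribute may hold several values, so Beautiful Soup returns it as a list while single-valued attributes such as id come back as plain strings. A brief sketch, with a second class name ("wide") added purely for illustration:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<p id="boldest" class="bold-class wide">HTML FILE</p>', "html.parser"
)
tag = soup.find("p")

print(tag.attrs["id"])        # boldest
print(tag["class"])           # ['bold-class', 'wide']
print(tag.has_attr("class"))  # True
print(tag.has_attr("style"))  # False
```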
* To add a new attribute to a tag, assign to it as you would to a dictionary key:
#Code
from bs4 import BeautifulSoup
soup = BeautifulSoup(your_html_code, 'html.parser')
tag = soup.find('tag_to_find')
tag['your_attribute'] = 'attribute_value'
#Example
html_cont = """<div>
<p id="boldest" class="bold-class">HTML FILE</p>
<span>Image</span>
<p>END</p>
</div>"""
soup = BeautifulSoup(html_cont, 'html.parser')
tag = soup.find('p')
tag['new_attribute'] = '1'
print(tag)
#Output
<p class="bold-class" id="boldest" new_attribute="1">HTML FILE</p>
* To remove an attribute from a tag, use del on it:
#Code
from bs4 import BeautifulSoup
soup = BeautifulSoup(your_html_code, 'html.parser')
tag = soup.find('tag_to_find')
del tag['your_attribute']
#Example
html_cont = """<div>
<p id="boldest" class="bold-class">HTML FILE</p>
<span>Image</span>
<p>END</p>
</div>"""
soup = BeautifulSoup(html_cont, 'html.parser')
tag = soup.find('p')
del tag['id']
print(tag)
#Output
<p class="bold-class">HTML FILE</p>
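Since attrs is an ordinary dictionary, the usual dict methods also apply. The sketch below overwrites one attribute, adds another, and removes attributes with pop so that a missing name does not raise an error. The attribute names data-state and title are hypothetical.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<p id="boldest" class="bold-class">HTML FILE</p>', "html.parser"
)
tag = soup.find("p")

tag["id"] = "lightest"                   # overwrite an existing attribute
tag["data-state"] = "draft"              # add a new one (hypothetical name)
removed = tag.attrs.pop("class", None)   # remove and keep the old value
missing = tag.attrs.pop("title", None)   # never set, so nothing is removed

print(tag["id"])          # lightest
print(removed)            # ['bold-class']
print(missing)            # None
print(sorted(tag.attrs))  # ['data-state', 'id']
```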
* Tags can be found not only by tag name but also by id and class.
#Code
from bs4 import BeautifulSoup
soup = BeautifulSoup(your_html_code, 'html.parser')
print(soup.find(id='id_to_find', class_='class_to_find'))
#Example
html_cont = """<div>
<p class="p_class">HTML FILE</p>
<span id="img">Image</span>
<p>END</p>
</div>"""
soup = BeautifulSoup(html_cont, 'html.parser')
print(soup.find(id='img'))
print(soup.find(class_='p_class'))
#Output
<span id="img">Image</span>
<p class="p_class">HTML FILE</p>
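The same filters can be combined with a tag name or passed as an attrs dictionary. A short sketch, with a second class name ("other") on the last paragraph added for illustration:

```python
from bs4 import BeautifulSoup

html_cont = """<div>
<p class="p_class">HTML FILE</p>
<span id="img">Image</span>
<p class="p_class other">END</p>
</div>"""
soup = BeautifulSoup(html_cont, "html.parser")

by_id = soup.find(id="img")
by_attrs = soup.find(attrs={"id": "img"})        # same lookup via a dict
styled_p = soup.find_all("p", class_="p_class")  # tag name plus class

print(by_id is by_attrs)                   # True
print([p.get_text() for p in styled_p])    # ['HTML FILE', 'END']
```

The keyword is spelled class_ because class is a reserved word in Python. A class filter matches any tag that carries that class, even alongside others.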
* Sometimes we need all the text in a document. BeautifulSoup makes this easy with the get_text method, as shown below.
#Code
from bs4 import BeautifulSoup
soup = BeautifulSoup(your_html_code, 'html.parser')
print(soup.get_text())
#Example
html_cont = """<div>
<p>HTML FILE</p>
<p></p>
<span>Image</span>
<p>END</p>
</div>"""
soup = BeautifulSoup(html_cont, 'html.parser')
print(soup.get_text())
#Output (blank lines omitted)
HTML FILE
Image
END
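By default get_text keeps the newlines that sit between tags. A minimal sketch of trimming that whitespace with the separator and strip arguments, or with stripped_strings, on similar made-up markup:

```python
from bs4 import BeautifulSoup

html_cont = """<div>
<p>HTML FILE</p>
<p></p>
<span>Image</span>
<p>END</p>
</div>"""
soup = BeautifulSoup(html_cont, "html.parser")

# Strip each piece of text and join the non-empty pieces with a space.
one_line = soup.get_text(separator=" ", strip=True)
# The same pieces as a list.
pieces = list(soup.stripped_strings)

print(one_line)  # HTML FILE Image END
print(pieces)    # ['HTML FILE', 'Image', 'END']
```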
* To change the name of a tag in the HTML, assign to its name attribute:
#Code
from bs4 import BeautifulSoup
soup = BeautifulSoup(your_html_code, 'html.parser')
tag = soup.p
tag.name = 'h1'
#Example
html_cont = """<div>
<p>HTML FILE</p>
<p></p>
<span>Image</span>
<p>END</p>
</div>"""
soup = BeautifulSoup(html_cont, 'html.parser')
tag = soup.p
tag.name = 'h1'
print(tag)
#Output
<h1>HTML FILE</h1>
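Renaming a tag changes the parsed document itself, not a copy. The sketch below, on shortened illustrative markup, prints the whole soup after the rename to show this.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<div><p>HTML FILE</p><p>END</p></div>", "html.parser")

tag = soup.p       # shortcut for the first <p> in the document
tag.name = "h1"    # rename it in place

print(soup)  # <div><h1>HTML FILE</h1><p>END</p></div>
```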
* The text inside a tag can be read with its string attribute:
#Code
from bs4 import BeautifulSoup
soup = BeautifulSoup(your_html_code, 'html.parser')
print(soup.find('p').string)
#Example
html_cont = """<div>
<p>HTML FILE</p>
<p></p>
<span>Image</span>
<p>END</p>
</div>"""
soup = BeautifulSoup(html_cont, 'html.parser')
print(soup.find('p').string)
#Output
HTML FILE
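The string attribute only has a value when the tag holds a single piece of text. As a sketch on made-up markup, the second paragraph below contains text plus a nested tag, so string is None and get_text is the fallback.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<div><p>HTML FILE</p><p>END <b>now</b></p></div>", "html.parser"
)
first_p, second_p = soup.find_all("p")

print(first_p.string)       # HTML FILE
print(second_p.string)      # None, the tag has more than one child
print(second_p.get_text())  # END now
```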
* A tag's string cannot be edited in place, but it can be replaced with another string by calling replace_with on it.
#Code
from bs4 import BeautifulSoup
soup = BeautifulSoup(your_html_code, 'html.parser')
soup.find('p').string.replace_with('new_string')
#Example
html_cont = """<div>
<p>HTML FILE</p>
<p></p>
<span>Image</span>
<p>END</p>
</div>"""
soup = BeautifulSoup(html_cont, 'html.parser')
soup.find('p').string.replace_with('edited file')
print(soup.find('p'))
#Output
<p>edited file</p>
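It matters which object replace_with is called on. The sketch below, on shortened illustrative markup, replaces a tag's string, assigns to string directly, and then replaces an entire tag with plain text.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<div><p>HTML FILE</p><p>END</p></div>", "html.parser")
first_p, second_p = soup.find_all("p")

# Called on the string: only the text changes, the <p> stays.
first_p.string.replace_with("edited file")
after_string_swap = str(soup)

# Assigning to .string has the same effect.
first_p.string = "edited again"

# Called on the tag: the whole <p> element is swapped for plain text.
second_p.replace_with("plain text")

print(after_string_swap)  # <div><p>edited file</p><p>END</p></div>
print(soup)               # <div><p>edited again</p>plain text</div>
```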