HTML cheat sheet for web scraping

This cheat sheet is for anyone who is not yet comfortable with HTML and CSS. Remember that you can always use your browser to search the page for elements.

Tags

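These are the tags you are most likely to come across. Some of them, such as <center>, <font>, and <frame>, are obsolete but still appear on older pages: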

<!--...--> <!DOCTYPE>  <a> <abbr> <acronym> <address> <applet> <area> <article> <aside> <audio> <b> <base> <basefont> <bdi> <bdo> <big> <blockquote> <body> <br> <button> <canvas> <caption> <center> <cite> <code> <col> <colgroup> <data> <datalist> <dd> <del> <details> <dfn> <dialog> <dir> <div> <dl> <dt> <em> <embed> <fieldset> <figcaption> <figure> <font> <footer> <form> <frame> <frameset> <h1> to <h6> <head> <header> <hr> <html> <i> <iframe> <img> <input> <ins> <kbd> <label> <legend> <li> <link> <main> <map> <mark> <meta> <meter> <nav> <noframes> <noscript> <object> <ol> <optgroup> <option> <output> <p> <param> <picture> <pre> <progress> <q> <rp> <rt> <ruby> <s> <samp> <script> <section> <select> <small> <source> <span> <strike> <strong> <style> <sub> <summary> <sup> <svg> <table> <tbody> <td> <template> <textarea> <tfoot> <th> <thead> <time> <title> <tr> <track> <tt> <u> <ul> <var> <video> <wbr>

Tags are the building blocks of elements.

Elements

An element is formed by an opening tag and a closing tag (a tag is a name enclosed in angle brackets), together with the content between them. For example:

<div>something inside here...</div>

Together, the two tags and the text between them form an HTML element. Elements are what HTML is made up of. Note that not all elements require a closing tag (</…>); <br> and <img>, for example, have none.
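To see how a parser splits an element into its opening tag, its content, and its closing tag, here is a minimal sketch using Python's built-in html.parser module. The TagLogger class and the trailing <br> in the sample input are illustrative assumptions, not part of any particular scraping library:

```python
from html.parser import HTMLParser


class TagLogger(HTMLParser):
    """Record each part of an element in the order the parser meets it."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        # Skip whitespace-only text between tags.
        if data.strip():
            self.events.append(("text", data.strip()))


logger = TagLogger()
# <br> is added here to show an element that has no closing tag.
logger.feed("<div>something inside here...</div><br>")
logger.close()
print(logger.events)
```

The <div> produces a start event, a text event, and an end event, while the <br> produces only a start event because it has no closing tag.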

Attributes

An element can carry attributes, which are written inside its opening tag. The most common attribute is probably href on the <a> tag. For example:

<a href="https://duckduckgo.com">DuckDuckGo</a>

Another example is the class attribute, which is used for CSS styling. For example, <div class="loremipsum"></div>.
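When scraping, reading an attribute usually means pulling its value out of the opening tag. The following is a minimal sketch with Python's built-in html.parser module; the LinkCollector class is a hypothetical helper written for this example:

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs taken from the opening tag.
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.links.append(href)


collector = LinkCollector()
collector.feed(
    '<a href="https://duckduckgo.com">DuckDuckGo</a>'
    '<div class="loremipsum"></div>'
)
collector.close()
print(collector.links)
```

Only the <a> tag contributes a link; the <div> is ignored because it has no href.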

Here are some common attributes:

accept accept-charset accesskey action align allow alt async autocapitalize autocomplete autofocus autoplay background bgcolor border buffered capture challenge charset checked cite class code codebase color cols colspan content contenteditable contextmenu controls coords crossorigin csp data data-* datetime decoding default defer dir dirname disabled download draggable enctype enterkeyhint for form formaction formenctype formmethod formnovalidate formtarget headers height hidden high href hreflang http-equiv icon id importance integrity intrinsicsize inputmode ismap itemprop keytype kind label lang language loading list loop low manifest max maxlength minlength media method min multiple muted name novalidate open optimum pattern ping placeholder poster preload radiogroup readonly referrerpolicy rel required reversed rows rowspan sandbox scope scoped selected shape size sizes slot span spellcheck src srcdoc srclang srcset start step style summary tabindex target title translate type usemap value width wrap

Classes and IDs

Classes and IDs are really just attributes.

Although there are no formal rules about how to use them, classes are usually used for styling while IDs are usually used by scripts. For example:

Class:

<h1 class="CustomStyling">My Header</h1>

ID:

<h1 id="myHeader">My Header</h1>

An element can have multiple classes, separated by spaces, but it can have only one ID, and that ID should be unique within the page.
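Classes and IDs are the most common hooks for locating elements when scraping. Here is a minimal sketch, again using Python's built-in html.parser module. The AttributeFinder class, the find function, and the extra big-title class in the sample are assumptions made for illustration:

```python
from html.parser import HTMLParser


class AttributeFinder(HTMLParser):
    """Collect elements that carry a given class or ID."""

    def __init__(self, class_name=None, element_id=None):
        super().__init__()
        self.class_name = class_name
        self.element_id = element_id
        self.found = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        # A class attribute may hold several space-separated class names.
        classes = (attributes.get("class") or "").split()
        if self.class_name is not None and self.class_name in classes:
            self.found.append((tag, classes))
        # An element has at most one ID, so compare the whole value.
        if self.element_id is not None and attributes.get("id") == self.element_id:
            self.found.append((tag, attributes["id"]))


def find(html, **criteria):
    """Return the elements in html that match the given class or ID."""
    finder = AttributeFinder(**criteria)
    finder.feed(html)
    finder.close()
    return finder.found


page = """
<h1 class="CustomStyling big-title">My Header</h1>
<h1 id="myHeader">My Header</h1>
"""
by_class = find(page, class_name="CustomStyling")
by_id = find(page, element_id="myHeader")
print(by_class)
print(by_id)
```

Splitting the class value on whitespace matters: comparing the whole string against "CustomStyling" would miss the first header, because its class attribute lists two classes.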
