How to extract img src, title and alt from HTML using PHP?

Question

I would like to create a page where all images which reside on my website are listed with title and alternative representation.

I already wrote a little program to find and load all HTML files, but now I am stuck at how to extract src, title and alt from this HTML:

<img src="/image/fluffybunny.jpg" title="Harvey the bunny" alt="a cute little fluffy bunny" />

I guess this should be done with some regex, but since the order of the attributes may vary, and I need all of them, I don't really know how to parse this in an elegant way (I could do it the hard char-by-char way, but that's painful).

User · Answer

Just to give a small example of using PHP's XML functionality for the task:

$doc=new DOMDocument();
$doc->loadHTML("<html><body>Test<br><img src=\"myimage.jpg\" title=\"title\" alt=\"alt\"></body></html>");
$xml=simplexml_import_dom($doc); // just to make xpath more simple
$images=$xml->xpath('//img');
foreach ($images as $img) {
    echo $img['src'] . ' ' . $img['alt'] . ' ' . $img['title'];
}

I used the DOMDocument::loadHTML() method because it can cope with HTML syntax and does not force the input document to be XHTML. Strictly speaking, the conversion to a SimpleXMLElement is not necessary; it just makes running XPath queries and handling their results simpler.
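Since the SimpleXML conversion is optional, the same query can be run with DOMXPath directly. The following is a minimal sketch, not the answerer's code: extractImages() is a hypothetical helper name, and the libxml error handling is an assumption added so that malformed markup does not print warnings.

```php
<?php
// Sketch: query <img> elements with DOMXPath instead of converting to SimpleXML.
// extractImages() is a hypothetical helper name introduced for illustration.
function extractImages(string $html): array
{
    // Collect parser warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $images = [];
    foreach ($xpath->query('//img') as $img) {
        // getAttribute() returns an empty string when the attribute is absent.
        $images[] = [
            'src'   => $img->getAttribute('src'),
            'title' => $img->getAttribute('title'),
            'alt'   => $img->getAttribute('alt'),
        ];
    }
    return $images;
}

$images = extractImages('<html><body>Test<br><img src="myimage.jpg" title="title" alt="alt"></body></html>');
print_r($images);
```

Returning a plain array of attribute maps keeps the listing page independent of the DOM objects, which may be convenient when results from many HTML files are merged.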

User · Answer

$url = "http://example.com";
$html = file_get_contents($url);
$doc = new DOMDocument();
$doc->loadHTML($html);
$tags = $doc->getElementsByTagName('img');
foreach ($tags as $tag) {
    echo $tag->getAttribute('src');
}

User · Answer

I have read the many comments on this page that complain that using a dom parser is unnecessary overhead. Well, it may be more expensive than a mere regex call, but the OP has stated that there is no control over the order of the attributes in the img tags. This fact leads to unnecessary regex pattern convolution. Beyond that, using a dom parser provides the additional benefits of readability, maintainability, and dom-awareness (regex is not dom-aware).

I love regex and I answer lots of regex questions, but when dealing with valid HTML there is seldom a good reason to regex over a parser.

In the demonstration below, see how easily and cleanly DOMDocument handles img tag attributes in any order with a mixture of quoting (and no quoting at all). Also notice that tags without a targeted attribute are not disruptive at all -- an empty string is provided as a value.

Code:

$test = <<<HTML
<img src="/image/fluffybunny.jpg" title="Harvey the bunny" alt="a cute little fluffy bunny" />
<img src='/image/pricklycactus.jpg' title='Roger the cactus' alt='a big green prickly cactus' />
<p>This is irrelevant text.</p>
<img alt="an annoying white cockatoo" title="Polly the cockatoo" src="/image/noisycockatoo.jpg">
<img title=something src=somethingelse>
HTML;
libxml_use_internal_errors(true);  // silences/forgives complaints from the parser (remove to see what is generated)
$dom = new DOMDocument();
$dom->loadHTML($test);
foreach ($dom->getElementsByTagName('img') as $i => $img) {
    echo "IMG#{$i}:\n";
    echo "\tsrc = " , $img->getAttribute('src') , "\n";
    echo "\ttitle = " , $img->getAttribute('title') , "\n";
    echo "\talt = " , $img->getAttribute('alt') , "\n";
    echo "---\n";
}

Output:

IMG#0:
    src = /image/fluffybunny.jpg
    title = Harvey the bunny
    alt = a cute little fluffy bunny
---
IMG#1:
    src = /image/pricklycactus.jpg
    title = Roger the cactus
    alt = a big green prickly cactus
---
IMG#2:
    src = /image/noisycockatoo.jpg
    title = Polly the cockatoo
    alt = an annoying white cockatoo
---
IMG#3:
    src = somethingelse
    title = something
    alt = 
---

Using this technique in professional code will leave you with a clean script, fewer hiccups to contend with, and fewer colleagues that wish you worked somewhere else.

User · Answer

I used preg_match to do it.

In my case, I had a string containing exactly one <img> tag (and no other markup) that I got from Wordpress, and I was trying to get the src attribute so I could run it through timthumb.

// get the featured image
$image = get_the_post_thumbnail($photos[$i]->ID);
// get the src for that image
$pattern = '/src="([^"]*)"/';
preg_match($pattern, $image, $matches);
$src = $matches[1];
unset($matches);

To grab the title or the alt instead, you could simply use $pattern = '/title="([^"]*)"/'; for the title or $pattern = '/alt="([^"]*)"/'; for the alt. Sadly, my regex isn't good enough to grab all three (alt/title/src) with one pass though.
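For the single-tag case described above, one preg_match_all call with an alternation can collect all three attributes in one pass. This is only a sketch under assumptions: the sample $image string is made up, attribute values are assumed to be double-quoted, and a DOM parser remains the more robust choice for arbitrary markup.

```php
<?php
// Sketch: collect src, title and alt from a single <img> tag in one pass.
// Assumes double-quoted attribute values; $image is a made-up sample string.
$image = '<img width="150" alt="a cute little fluffy bunny" src="/image/fluffybunny.jpg" title="Harvey the bunny">';

// \s requires whitespace before the name, so "data-src" is not mistaken for "src".
preg_match_all('/\s(src|title|alt)="([^"]*)"/i', $image, $matches);

// $matches[1] holds the attribute names, $matches[2] the corresponding values.
$attributes = array_combine(array_map('strtolower', $matches[1]), $matches[2]);

// Missing attributes fall back to an empty string.
$src   = $attributes['src']   ?? '';
$title = $attributes['title'] ?? '';
$alt   = $attributes['alt']   ?? '';

echo $src, "\n", $title, "\n", $alt, "\n";
```

Keying the result by attribute name means the order in which the attributes appear in the tag no longer matters.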

User · Answer

The script must be edited like this:

foreach ($result[0] as $img_tag)

because preg_match_all returns an array of arrays.

User · Answer

EDIT: now that I know better

Using regexp to solve this kind of problem is a bad idea and will likely lead to unmaintainable and unreliable code. Better use an HTML parser.

Solution with regexp

In that case it's better to split the process into two parts: (1) get all the img tags, (2) extract their metadata.

I will assume your doc is not strict XHTML, so you can't use an XML parser. E.g. with this web page's source code:

/* preg_match_all matches the regexp against the whole $html string and outputs everything as
an array in $result. The "i" option is used to make it case insensitive */
preg_match_all('/<img[^>]+>/i', $html, $result);
print_r($result);

Array
(
    [0] => Array
        (
            [0] => <img src="/Content/Img/stackoverflow-logo-250.png" width="250" height="70" alt="logo link to homepage" />
            [1] => <img class="vote-up" src="/content/img/vote-arrow-up.png" alt="vote up" title="This was helpful (click again to undo)" />
            [2] => <img class="vote-down" src="/content/img/vote-arrow-down.png" alt="vote down" title="This was not helpful (click again to undo)" />
            [3] => <img src="http://www.gravatar.com/avatar/df299babc56f0a79678e567e87a09c31?s=32&amp;d=identicon&amp;r=PG" height=32 width=32 alt="gravatar image" />
            [4] => <img class="vote-up" src="/content/img/vote-arrow-up.png" alt="vote up" title="This was helpful (click again to undo)" />
        )
)

Then we get all the img tag attributes with a loop:

$img = array();
foreach ($result as $img_tag)
{
    preg_match_all('/(alt|title|src)=("[^"]*")/i', $img_tag, $img[$img_tag]);
}
print_r($img);

Array
(
    [<img src="/Content/Img/stackoverflow-logo-250.png" width="250" height="70" alt="logo link to homepage" />] => Array
        (
            [0] => Array
                (
                    [0] => src="/Content/Img/stackoverflow-logo-250.png"
                    [1] => alt="logo link to homepage"
                )
            [1] => Array
                (
                    [0] => src
                    [1] => alt
                )
            [2] => Array
                (
                    [0] => "/Content/Img/stackoverflow-logo-250.png"
                    [1] => "logo link to homepage"
                )
        )
    [<img class="vote-up" src="/content/img/vote-arrow-up.png" alt="vote up" title="This was helpful (click again to undo)" />] => Array
        (
            [0] => Array
                (
                    [0] => src="/content/img/vote-arrow-up.png"
                    [1] => alt="vote up"
                    [2] => title="This was helpful (click again to undo)"
                )
            [1] => Array
                (
                    [0] => src
                    [1] => alt
                    [2] => title
                )
            [2] => Array
                (
                    [0] => "/content/img/vote-arrow-up.png"
                    [1] => "vote up"
                    [2] => "This was helpful (click again to undo)"
                )
        )
    [<img class="vote-down" src="/content/img/vote-arrow-down.png" alt="vote down" title="This was not helpful (click again to undo)" />] => Array
        (
            [0] => Array
                (
                    [0] => src="/content/img/vote-arrow-down.png"
                    [1] => alt="vote down"
                    [2] => title="This was not helpful (click again to undo)"
                )
            [1] => Array
                (
                    [0] => src
                    [1] => alt
                    [2] => title
                )
            [2] => Array
                (
                    [0] => "/content/img/vote-arrow-down.png"
                    [1] => "vote down"
                    [2] => "This was not helpful (click again to undo)"
                )
        )
    [<img src="http://www.gravatar.com/avatar/df299babc56f0a79678e567e87a09c31?s=32&amp;d=identicon&amp;r=PG" height=32 width=32 alt="gravatar image" />] => Array
        (
            [0] => Array
                (
                    [0] => src="http://www.gravatar.com/avatar/df299babc56f0a79678e567e87a09c31?s=32&amp;d=identicon&amp;r=PG"
                    [1] => alt="gravatar image"
                )
            [1] => Array
                (
                    [0] => src
                    [1] => alt
                )
            [2] => Array
                (
                    [0] => "http://www.gravatar.com/avatar/df299babc56f0a79678e567e87a09c31?s=32&amp;d=identicon&amp;r=PG"
                    [1] => "gravatar image"
                )
        )
)

Regexps are CPU intensive, so you may want to cache this page. If you have no cache system, you can build your own by using ob_start and loading / saving from a text file.

How does this stuff work?

First, we use preg_match_all, a function that gets every string matching the pattern and outputs it in its third parameter.

The regexps:

<img[^>]+>

We apply it to the whole HTML of the page. It can be read as: every string that starts with "<img", contains non-">" characters and ends with a ">".

(alt|title|src)=("[^"]*")

We apply it successively to each img tag. It can be read as: every string starting with "alt", "title" or "src", then a "=", then a '"', a bunch of characters that are not '"', and ending with a '"'. The parentheses isolate the sub-strings.

Finally, every time you want to deal with regexps, it is handy to have good tools to quickly test them. Check this online regexp tester.

EDIT: answer to the first comment.

It's true that I did not think about the (hopefully few) people using single quotes.

Well, if you use only ', just replace all the " by '.

If you mix both: first you should slap yourself :-), then try to use ("|') in place of " and a character class such as [^"'] in place of [^"].
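The mixed-quote case can also be handled by capturing the opening quote and requiring the same character to close the value. The sketch below is one possible variation on the two-step approach, not the answerer's code: the sample $html, the $images array and the back-reference pattern are assumptions for illustration, and unquoted values such as height=32 are deliberately not covered.

```php
<?php
// Sketch: two-step regex extraction that accepts single or double quotes.
// The sample markup and variable names below are made up for illustration.
$html = <<<'HTML'
<p>Some text</p>
<img src='/image/fluffybunny.jpg' title="Harvey's bunny" alt="a cute little fluffy bunny" />
HTML;

// Step 1: grab every <img ...> tag; the full matches are in $result[0].
preg_match_all('/<img[^>]+>/i', $html, $result);

$images = [];
foreach ($result[0] as $imgTag) {
    // Step 2: (["']) captures the opening quote and \2 demands the same
    // character as the closing quote, so the other kind may appear inside.
    preg_match_all('/\s(alt|title|src)=(["\'])(.*?)\2/is', $imgTag, $found, PREG_SET_ORDER);

    $attrs = [];
    foreach ($found as $match) {
        $attrs[strtolower($match[1])] = $match[3];
    }
    $images[] = $attrs;
}

print_r($images);
```

Iterating over $result[0] rather than $result is what makes the loop see one tag string at a time.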

User · Answer

Here is THE solution, in PHP:

Just download QueryPath, and then do as follows:

$doc = qp($myHtmlDoc);
foreach ($doc->xpath('//img') as $img) {
    $src = $img->attr('src');
    $title = $img->attr('title');
    $alt = $img->attr('alt');
}

That's it, you're done!

User · Answer

If it's XHTML (your example is), you only need SimpleXML.

<?php
$input = '<img src="/image/fluffybunny.jpg" title="Harvey the bunny" alt="a cute little fluffy bunny"/>';
$sx = simplexml_load_string($input);
var_dump($sx);
?>

Output:

object(SimpleXMLElement)#1 (1) {
  ["@attributes"]=>
  array(3) {
    ["src"]=>
    string(22) "/image/fluffybunny.jpg"
    ["title"]=>
    string(16) "Harvey the bunny"
    ["alt"]=>
    string(26) "a cute little fluffy bunny"
  }
}

User · Answer

Here's a PHP function I cobbled together from all of the above info for a similar purpose, namely adjusting image tag width and height attributes on the fly. A bit clunky, perhaps, but it seems to work dependably:

function ReSizeImagesInHTML($HTMLContent, $MaximumWidth, $MaximumHeight) {
    // find image tags
    preg_match_all('/<img[^>]+>/i', $HTMLContent, $rawimagearray, PREG_SET_ORDER);
    // put image tags in a simpler array
    $imagearray = array();
    for ($i = 0; $i < count($rawimagearray); $i++) {
        array_push($imagearray, $rawimagearray[$i][0]);
    }
    // put image attributes in another array
    $imageinfo = array();
    foreach ($imagearray as $img_tag) {
        preg_match_all('/(src|width|height)=("[^"]*")/i', $img_tag, $imageinfo[$img_tag]);
    }
    // combine everything into one array
    $AllImageInfo = array();
    foreach ($imagearray as $img_tag) {
        // attributes are assumed to appear in the order src, width, height
        $ImageSource = str_replace('"', '', $imageinfo[$img_tag][2][0]);
        $OrignialWidth = str_replace('"', '', $imageinfo[$img_tag][2][1]);
        $OrignialHeight = str_replace('"', '', $imageinfo[$img_tag][2][2]);
        $NewWidth = $OrignialWidth;
        $NewHeight = $OrignialHeight;
        $AdjustDimensions = "F";
        if ($OrignialWidth > $MaximumWidth) {
            $diff = $OrignialWidth - $MaximumWidth;
            $percnt_reduced = (($diff / $OrignialWidth) * 100);
            $NewHeight = floor($OrignialHeight - (($percnt_reduced * $OrignialHeight) / 100));
            $NewWidth = floor($OrignialWidth - $diff);
            $AdjustDimensions = "T";
        }
        if ($OrignialHeight > $MaximumHeight) {
            $diff = $OrignialHeight - $MaximumHeight;
            $percnt_reduced = (($diff / $OrignialHeight) * 100);
            $NewWidth = floor($OrignialWidth - (($percnt_reduced * $OrignialWidth) / 100));
            $NewHeight = floor($OrignialHeight - $diff);
            $AdjustDimensions = "T";
        }
        $thisImageInfo = array('OriginalImageTag' => $img_tag, 'ImageSource' => $ImageSource, 'OrignialWidth' => $OrignialWidth, 'OrignialHeight' => $OrignialHeight, 'NewWidth' => $NewWidth, 'NewHeight' => $NewHeight, 'AdjustDimensions' => $AdjustDimensions);
        array_push($AllImageInfo, $thisImageInfo);
    }
    // build array of before and after tags
    $ImageBeforeAndAfter = array();
    for ($i = 0; $i < count($AllImageInfo); $i++) {
        if ($AllImageInfo[$i]['AdjustDimensions'] == "T") {
            $NewImageTag = str_ireplace('width="' . $AllImageInfo[$i]['OrignialWidth'] . '"', 'width="' . $AllImageInfo[$i]['NewWidth'] . '"', $AllImageInfo[$i]['OriginalImageTag']);
            $NewImageTag = str_ireplace('height="' . $AllImageInfo[$i]['OrignialHeight'] . '"', 'height="' . $AllImageInfo[$i]['NewHeight'] . '"', $NewImageTag);
            $thisImageBeforeAndAfter = array('OriginalImageTag' => $AllImageInfo[$i]['OriginalImageTag'], 'NewImageTag' => $NewImageTag);
            array_push($ImageBeforeAndAfter, $thisImageBeforeAndAfter);
        }
    }
    // execute search and replace
    for ($i = 0; $i < count($ImageBeforeAndAfter); $i++) {
        $HTMLContent = str_ireplace($ImageBeforeAndAfter[$i]['OriginalImageTag'], $ImageBeforeAndAfter[$i]['NewImageTag'], $HTMLContent);
    }
    return $HTMLContent;
}
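The dimension arithmetic in a function like this can be isolated into a small helper that picks a single scale factor, so that both limits are respected at once. The following is only a sketch: fitWithin() is a hypothetical name that does not appear in the answer, and it assumes positive integer dimensions.

```php
<?php
// Sketch: scale a width/height pair to fit inside a bounding box while
// preserving the aspect ratio. fitWithin() is a hypothetical helper name.
// Assumes $width and $height are positive (no division-by-zero guard).
function fitWithin(int $width, int $height, int $maxWidth, int $maxHeight): array
{
    // The smallest ratio wins so that both limits hold; 1.0 prevents upscaling.
    $scale = min(1.0, $maxWidth / $width, $maxHeight / $height);

    return [(int) floor($width * $scale), (int) floor($height * $scale)];
}

list($newWidth, $newHeight) = fitWithin(800, 600, 400, 400);
echo $newWidth, 'x', $newHeight, "\n"; // prints "400x300"
```

Using one factor for both dimensions avoids the situation where a second, independent height check overrides the result of the width check.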

User · Answer

You may use simplehtmldom. Most of the jQuery selectors are supported in simplehtmldom. An example is given below:

// Create DOM from URL or file
$html = file_get_html('http://www.google.com/');
// Find all images
foreach ($html->find('img') as $element)
    echo $element->src . '<br>';
// Find all links
foreach ($html->find('a') as $element)
    echo $element->href . '<br>';
