Black Hole Revelations: Understanding Flash
by nik on July 2, 2008

This week Google and Yahoo announced that, more than 10 years after web users were first haunted by Flash intro splash screens, they will finally be able to index the content of SWF files in their search engines. Adobe Flash is the most prevalent web platform today, available on 98% of desktop browsers, yet content locked up in binary SWF files has been part of a big black hole in the web that search engines and other services have not been able to read and understand.

The solution offered here by Adobe to both Google and Yahoo (and probably offered to that other search-providing company) is a special ‘Flash player’ that allows the search engine to dive into existing SWF files. It might be akin to a decompiler, in that the raw objects are extracted and then the text is parsed out (decompiling Flash 9 is very possible).

What Google and Yahoo have now is simply access to the text-based content within Flash applets; it does not guarantee that the search engines will treat it equally with well-formed text-based markup. While text can be extracted, the contents still do not have the same structure and context as a text-based page: a header, metadata, inbound links, headings, other markup tags and everything else. Further, if your SWF files render text as graphics, the search engines still won’t be able to see it.

There seems to be a lot of misunderstanding about just what this means and how important it is. First of all, in the context of web applications, search engine optimization is not important when offering a private user application view. In that case, such as with an email application, there is no public search or index. The important part here is public-facing Flash applications (or websites) where the main site content is locked up in a binary container running on a proprietary runtime/virtual machine. In these cases, up until now most site owners have replicated that same content in HTML with a proper URI structure to get the most out of search engine indexes and referrals. This remains the better solution, as it gives sites and content more structure that the crawlers from Google and Yahoo readily understand and can interpret. Being able to grep out the text components of a SWF file adds little by way of structure or organization to the web.
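
To make the "grep out the text" point concrete, here is a rough Python sketch of what blind text extraction from a binary blob looks like. It is an illustration only: it is not how Adobe's special player or the search engines work, the `printable_runs` helper is a hypothetical name, and `SAMPLE_BLOB` is made-up bytes rather than a real SWF file (real SWF files are often compressed and would need decoding first).

```python
import re


def printable_runs(blob, min_len=4):
    """Return every run of printable ASCII at least min_len bytes long.

    This is a `strings`-style scan, used here only to show what a blind
    text pull from binary data gives you. It is NOT a SWF parser.
    """
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    return [m.group().decode("ascii") for m in pattern.finditer(blob)]


# Hypothetical bytes standing in for a binary container with embedded text.
SAMPLE_BLOB = b"\x00\x01Blue Widgets\x00\xff\x10Our best seller.\x00\x02ab\x03"

if __name__ == "__main__":
    for run in printable_runs(SAMPLE_BLOB):
        print(run)
```

The result is a flat list of strings. Nothing in it says which string was a heading, which was body copy, or which page state it belonged to, and that is the information a crawler gets for free from markup.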

The next issue arises when comparing different RIA technologies: the argument is often made that they are all equally poor at representing content and data in context for search engines to easily understand. This may be very true of SWF, but it is untrue of XHTML + JavaScript applications, or even applications using XUL, or XAML as part of Silverlight, as these are text-based formats with clear markup rules that signify to an interpreter the context and significance of content. For Google or Yahoo to understand another text type is an elementary step, and one they have taken numerous times (e.g. parsing RSS, Atom, Microformats, etc.), so there is no reason why the same steps can’t be taken for these engines to grok XAML or any other format.
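
As a small illustration of why text-based markup is an easier target, here is a minimal Python sketch using the standard library's `html.parser`. It is only a sketch under stated assumptions: the `ContextExtractor` class, the `extract_with_context` helper and the `SAMPLE` page are hypothetical names and data invented for this example, and a real crawler does far more than this.

```python
from html.parser import HTMLParser


class ContextExtractor(HTMLParser):
    """Collect (tag, text) pairs so each text run keeps its markup context."""

    # Hypothetical choice of tags worth tracking for this illustration.
    TRACKED = {"h1", "h2", "p", "a"}

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open tracked tags, innermost last
        self.items = []   # (tag, text) pairs in document order

    def handle_starttag(self, tag, attrs):
        if tag in self.TRACKED:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if text and self.stack:
            self.items.append((self.stack[-1], text))


def extract_with_context(markup):
    """Parse markup and return the text runs tagged with their element."""
    parser = ContextExtractor()
    parser.feed(markup)
    parser.close()
    return parser.items


# Made-up page used only for this example.
SAMPLE = (
    "<html><body><h1>Blue Widgets</h1>"
    "<p>Our best seller. <a href='/widgets/blue'>Details</a></p>"
    "</body></html>"
)

if __name__ == "__main__":
    for tag, text in extract_with_context(SAMPLE):
        print(tag, "->", text)
```

Each piece of text arrives labelled with the element that carried it, so an interpreter can tell a heading from a paragraph from a link without guessing. The same idea would apply to any text-based format with clear markup rules.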

In RIAs, logic is often encapsulated in code, either as JavaScript in AJAX applications or as bytecode in the case of Silverlight. This matters less, because a search engine does not need application logic to understand content and context, both of which can be conveyed through markup and presentation. So while the announcement from Adobe, Google and Yahoo does shine a light into a black hole, the resultant data output is nothing more than a stream of bits whose importance the search engines themselves must determine.

I strongly believe that it is almost impossible to build a true semantic web within binary file formats and proprietary virtual machines. We can hack some way towards it, but it will never be close to what plain text markup can offer.

Comments

  • As most content is loaded in from external resources, into a Flash file, and everyone uses different techniques for deep linking, I think it will be a very slow process for the benefits of this to be felt. And getting to the first page of Google for a given query with a Flash file would be a very great achievement! Perhaps there should be a competition. Anyone who gets to page 1 for “George Bush” with a Flash file!

    But it’s of course good news that innovation with RIAs, from UI interfaces to search, continues.

  • Why does everybody miss the most important point here?

    –> Google and Yahoo could have created this software themselves, the SWF specification would have allowed them to do so!

    The fact that they did NOT do that – even Google, which has unlimited resources – shows how little interest they had in Flash content.
    It’s not like Google never created parsing software for other file formats – they have parsers for PDF (open like SWF), MSOffice documents, etc.

  • This is definitely a good move for Flash and SEO (for consumer application). I would be curious to know if this works for dynamic content as well.

  • Wait a minute: Flash indexing has been there for many years, since Adobe gave out swf2txt, a simple command line tool to extract text from SWF files. What would be interesting is to know how much better the new version given out by Adobe performs, especially for retrieval purposes.

  • Well put Nic.

    Two words for web developers considering flash…”Progressive Enhancement”

  • It seems to be good news if the benefits could be realised sooner.

  • The big opportunity here is to design apps with searchable states, not to get hung up about Google and Yahoo! attempting to index UX. You still need a strategy.

  • Agree with Charles, progressive enhancement is key whether you’re using a SWF or JavaScript. Our favourite method is to write out the dynamic content into the HTML and replace that content using a SWF (which is displaying the same content, only nicer) served with swfobject. This ensures that if you’re one of the minority not using a Flash player or have JavaScript disabled, you still have access to the content you expect to find…and so do the search engines.

  • None of this matters. If you have one of the millions of (designed completely in) flash sites the only url you will see in your toolbar will be THE SAME URL for every page. Think about what this means in terms of deep linking pages on your site, ranking keyword landing pages – still impossible. Organic deep linking will not happen.

  • Ryan is correct. This is of no benefit if the content is 10 clicks deep within the Flash app and not indexable by a query string. You will still need to design for crawlability from the outset.

  • Ye good point Ryan.

  • This is all so irrelevant. No one seems to understand that these days when it comes to serious content driven RIAs, the swf files, xap or xaml contain absolutely NO content, they are merely content pulling engines. The content worth indexing comes from databases or web services most of the time. So all the content the engines are going to find is some dummy placeholders the developers left in for testing.

  • Ovi is correct, in that with RIAs deep search engine indexing isn’t even necessary in the first place. They just need to get people to the front door, and the current HTML that holds the Flash/Flex apps can do that with the proper tags. But people just love to bash anything related to Flash without being informed.
