Metadata

Posted:
in macOS edited January 2014
I've been thinking about this for a while: How do you really do metadata? The usual approach I've seen is to insist on one technology as a solution, but this is backwards: You determine what you want to do, then you figure out how to implement it. You let need drive implementation, not the other way around.



So, from the beginning? What is metadata? It's data about data. A filename is metadata. An icon is metadata. A file path is metadata. A bundle bit (which tells the OS that a folder is to be treated as a bundle) is metadata.



There's been a lot of brouhaha about filesystem metadata, but this is not adequate to the task and it has never been. The desktop database stored file metadata, but not with the files. Aliases cached the last known path of a file, but when the file was no longer there the Alias Manager went looking for it, derived the file path, and cached it in the alias.



So we already have three kinds of metadata:



1) file metadata in the file;



2) file metadata stored elsewhere;



3) file metadata that is not stored, but calculated when necessary;
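To make the three kinds concrete, here is a minimal Python sketch. It is only an illustration: `side_store` is a hypothetical in-memory stand-in for something like the Desktop Database, and the function names are made up for this example.

```python
import os

# Hypothetical side store: metadata kept about a file, but not with it.
side_store = {}

def stored_with_file(path):
    """Kind 1: metadata the filesystem keeps with the file (here, its size)."""
    return os.stat(path).st_size

def stored_elsewhere(path):
    """Kind 2: metadata kept in a separate store, keyed by path."""
    return side_store.get(path)

def calculated(path):
    """Kind 3: metadata that is not stored, but derived when asked for."""
    return os.path.realpath(path)
```

Only the first function depends on the filesystem supporting the attribute, which is the "one out of three" point.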



Filesystem-level metadata support basically handles one out of three. The Desktop Database was Finder's responsibility. In that respect, it was the forefather of the databases maintained by iTunes and iPhoto.



But since the original HFS design, two very crucial new variables have cropped up:



Files are no longer the primary data containers. Especially in OS X, copy and paste and drag and drop work to unprecedented degrees. OS X inherited a metadata-aware pasteboard (like the MacOS Clipboard) from OpenStep, and this is critical, because, again, the filesystem will never be involved here. It is more crucial than ever for metadata to be associated with the data, not just the bits written to backing store (an old name for storage like hard drives, which is suddenly relevant again now that everything is kept in RAM and computers are hardly ever shut down). After all, the data you drag and drop might never have been saved to disk!



The other variable is the multi-user operating system. In MacOS, you could unambiguously associate all metadata with a particular file because there was only one user: The person at the keyboard. In OS 9, you could use iTunes to give a song a 4 star rating, and that was that. In OS X, the question is: for whom does that song have a 4 star rating?



The solution to the first variable is to have a standard set of metadata accessible to the pasteboard, the application, and the filesystem, so that a user can set an attribute, the application can recognize it as metadata and store it in its model, the pasteboard can grab it and transfer it to another application, which can recognize it, and then it can be saved to the filesystem, which recognizes it.



But what about application-specific metadata? If we associate all file metadata with the file, then we either have a restricted set of metadata types to keep things more or less under control, restricting the options available to an application, or we have each file carrying its very own Windows Registry around. This gets especially messy with multiple users. But there is no question that some metadata is best associated with the file it describes, and this metadata should be a limited set that is of general interest - the filename, for example. There are also sticky permissions issues, since every user with metadata stored with file X would have to have a certain set of permissions on it, and since permissions don't apply to parts of files this would make it exceptionally easy for one user to modify another's metadata.



Application-specific metadata storage didn't get used much in MacOS, largely because cooperative multitasking and interapplication communication are oil and water (although, despite that, MacOS did remarkably well in this regard). In OS X, applications can and do talk to each other all the time, remarkably efficiently, because the preemptive kernel assures that applications send and receive messages to and from each other in a timely fashion - roughly as efficiently as they can send and receive messages to and from the operating system. So this is a definite option. It is no more difficult to ask iTunes what information it has about a file than it is to ask Finder. If the application has a daemon (a headless application, like a web server) running (as iTunes does, and Address Book) then the message can be received and processed by this little daemon without the whole application getting involved (you don't have to have launched Address Book to access its database). This paradigm of applications cooperating to accomplish a goal is deeply rooted in UNIX and NeXTstep both, and the immediate effect is that you have a set of tools, each designed specifically for its job, collaborating on a project - instead of, say, Microsoft Outlook.



Finally, there's user-specific metadata. But OS X is already set up to deal with this - at the application level. Files are just files, but every application (including Finder) knows to look in your home directory's library to fetch things like preferences. Why shouldn't iTunes look here to see how you rated Trois Gymnopédies instead of looking at the AAC files themselves?
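A rough sketch of that idea in Python, assuming a made-up `Library/Metadata/ratings.json` file in each user's home directory (this is not how iTunes actually stores its library; the path and function names are hypothetical):

```python
import json
from pathlib import Path

def ratings_file(home):
    # Hypothetical location inside the user's home directory.
    return Path(home) / "Library" / "Metadata" / "ratings.json"

def set_rating(home, file_path, stars):
    """Record this user's rating without touching the rated file."""
    store = ratings_file(home)
    store.parent.mkdir(parents=True, exist_ok=True)
    ratings = json.loads(store.read_text()) if store.exists() else {}
    ratings[str(file_path)] = stars
    store.write_text(json.dumps(ratings))

def get_rating(home, file_path):
    """Return this user's rating, or None if they never rated the file."""
    store = ratings_file(home)
    if not store.exists():
        return None
    return json.loads(store.read_text()).get(str(file_path))
```

Two users can rate the same shared song differently, and neither needs write permission on the song itself. The cost is that the rating is keyed by path, so it does not follow the file when it is moved or copied.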



It's not even that simple, though. For instance, here's a question I still haven't resolved: How do you set a label for only your own use, or set it for everyone to see, without introducing another layer of complication to the process of labelling? Perhaps labels are a bad example, since for legacy reasons they will always be associated with the file, not any application or user.



I'm not presenting a solution here. I just reached a point where I hope that I've got enough of a handle on the issues involved to encapsulate them and offer them for discussion. So, the questions are:



What kinds of metadata are we dealing with? Not just categories (file-specific, app-specific, user-specific) but in terms of concrete examples. We have to know what functionality we need before we can sensibly implement a solution!



What systems do we need to have in place to handle rich metadata? How extensible should they be? How much do they need to be in conversation with each other?



Have at it.

Comments

  • Reply 1 of 31
    bartobarto Posts: 2,246member
    I read about the Plan 9 project once, and what it was about. Basically, breaking down the concept of an "OS" to its simplest concepts.



    One of those is "An operating system aims to represent everything as a file."



    Looking at Unix-like systems for example, you have the /dev directory which provides uniform access to devices by representing them as files, even if they are not actual files (in the sense of existing in a backing store - as Amorph put it).



    Consider the metadata associated with files on Unix-like systems: you have permissions, date modified, size and so on. It doesn't really matter where these things are stored, as long as there is a uniform system for accessing this metadata.
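As a small illustration of that uniform access, Python's standard `os.stat` returns the same fields for any path, wherever the filesystem happens to keep them. The `describe` helper is just a name chosen for this sketch.

```python
import os
import stat
import time

def describe(path):
    """Collect the standard Unix metadata for any path through one call."""
    info = os.stat(path)
    return {
        "size": info.st_size,                         # bytes
        "permissions": stat.filemode(info.st_mode),   # e.g. "-rw-r--r--"
        "modified": time.ctime(info.st_mtime),        # human-readable date
    }
```

The caller never learns, or needs to learn, where the filesystem keeps these values.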



    Taking both those things into account, there already exists a system in place for metadata - the OS representing all objects as files and providing uniform access to file metadata.



    The pasteboard isn't a file, but it could be represented as a file and have the same associated metadata.



    Apple intends to add essentially extensible metadata to HFS+, where additional forks are present for each file. This seems messy however, and users would be able to override the metadata of other users (if the file was read/writable), or not be able to add their own metadata (if the file was read only). Perhaps an HFS-3 is required with organised metadata or a Longhorn-style database for use by all applications.



    My point is that separating stored files from other objects is ignoring the existing practice of reducing everything to files. Access to data is simplified and - if and when it is implemented as a database or part of the file system - access to uniform, extensible metadata is simplified.



    Barto
  • Reply 2 of 31
    costiquecostique Posts: 1,084member
    Barto is right. To some extent.



    There are indeed 2 kinds of information: one which is absolute or essential to a file's data (like its creation date) and one which is relative or user-related (like the file's importance rating or urgency label). For example, in a multi-user environment, I can label some file as requiring urgent attention. You may have unrestricted access to the file's data, but if you don't care about the file right now, you don't care about the label I set either. Since this kind of metadata is user-specific, it should, perhaps, be separated from the file data (possibly stored in ~/Library/). On the other hand, if I copy the file to a CD to work on it at home, I want my metadata copied with it; if you take the file home, you also want all metadata preserved because when you modify the file at home and rename it and bring it back, I want my metadata to still be in place.



    What if we end up having a huge metadata set stored with files? For example, files can be easily implemented with bundles (make every file a bundle holding an encrypted property list of metadata associated with each user, one plist per user). Such a structure would be fully compatible with crappy file systems and still preserve all metadata. The question is, how space-efficient the storage would be and how time-efficient search would be.
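A minimal sketch of the bundle idea, using Python's `plistlib`. The layout (`Data` plus `Metadata/<user>.plist`) is an assumption made up for this example, and the encryption step is left out entirely.

```python
import plistlib
from pathlib import Path

def make_bundle(bundle_dir, data):
    """Create a bundle directory holding the core data and a Metadata folder."""
    bundle = Path(bundle_dir)
    (bundle / "Metadata").mkdir(parents=True, exist_ok=True)
    (bundle / "Data").write_bytes(data)
    return bundle

def write_user_metadata(bundle, user, attrs):
    """Store one property list per user inside the bundle (unencrypted here)."""
    with open(Path(bundle) / "Metadata" / f"{user}.plist", "wb") as f:
        plistlib.dump(attrs, f)

def read_user_metadata(bundle, user):
    """Return the user's attributes, or an empty dict if they have none."""
    path = Path(bundle) / "Metadata" / f"{user}.plist"
    if not path.exists():
        return {}
    with open(path, "rb") as f:
        return plistlib.load(f)
```

Since the bundle is an ordinary directory, copying it to a CD or a foreign file system carries every user's plist along. Finding all files with a given label still means opening every bundle, which is the search cost in question.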



    One more thing. All data apart from the core data is metadata. File names, icons and user permissions are metadata too. All of these can be overridden by many users simultaneously and even every individual user can have his/her permissions defined in his/her respective plist inside the file bundle. Restrictions? What restrictions?
  • Reply 3 of 31
    bartobarto Posts: 2,246member
    Like you say, if you store information in the Library, you can't take it with you easily (witness iTunes Music Libraries). If you store it inside bundles, searching a drive of 100,000+ files becomes impractical.



    You need some kind of central service. The obvious implementations are a central database, or as part of the file system.



    Barto
  • Reply 4 of 31
    costiquecostique Posts: 1,084member
    Quote:

    Originally posted by Barto

    Like you say, if you store information in the Library, you can't take it with you easily (witness iTunes Music Libraries).



    Alas.

    Quote:

    If you store it inside bundles, searching a drive of 100,000+ files becomes impractical.



    Yes, unless there is an advanced caching system which will update its internal index on the fly (during file creation/modification/copy operations).

    Quote:

    You need some kind of central service. The obvious implementations are a central database, or as part of the file system.



    This will bring compatibility issues. Though they are, perhaps, a necessary trade-off. In case of the central database separated from the file system there will be issues with preserving metadata across volumes.



    Personally, I believe Apple will choose the new shiny file system approach since it can be implemented most efficiently and, last but not least, HFS API suck. Creating a new FS from scratch has a benefit of getting rid of ugly legacy trade-offs and hacks.
  • Reply 5 of 31
    mmmpiemmmpie Posts: 628member
    Metadata _has_ to be unified. The current system of having application-specific databases doesn't scale. How do I replace iTunes but keep the metadata search capabilities? You can publish an iTunes framework that your replacement app has to implement. But this scheme doesn't scale ( what happens when all your apps are publishing metadata? ) and adds a lot of work for developers. What if you want to use some Linux command line app to play your music? I want a system where one app can do the metadata queries, and then pass the results to another app to handle.



    Unified metadata certainly raises issues about multiple users, but I dont think it is impossible. I think the steps are as follows ( the HFS+ scheme of having multiple forks seems ... bad ).



    a) files are actually a directory as well ( at least they look that way to programs ). This is how reiserfs works, eg: you might have a song: ~/Music/Pink Floyd/Dark Side of the Moon/Speak to Me.mp3 - if a program opens that file it gets a standard mp3 file. But metadata looks like this:

    ~/Music/Pink Floyd/Dark Side of the Moon/Speak to Me.mp3/rating - if you open that file you get the contents of that file, in this case the rating.



    When you copy a file it will copy its basic data, or if the copy mechanism understands meta data, it can package it appropriately for transport.



    b) this doesnt solve the multi user issue. An obvious solution is to have:

    ~/Music/Pink Floyd/Dark Side of the Moon/Speak to Me.mp3/bob/rating - where the file system will automatically check the path for your username and anything in that path overrides the general case.



    c) all of these new files have standard permissions applied to them ( they are just files ), so users can only do what they are allowed to ( what is a good default? same as parent file? ).



    d) when you copy a file to a new machine the user may well be different, so using a username to identify metadata isnt very secure. Still, thats no different to normal files.



    e) Searching of contents needs to be handled just like it is now: each user would have an index that needs to be kept up to date with changes, and that only indexes the accessible data. This needs to be pushed by the OS, so that users who are not logged in stay up to date - it's a special OS file I guess.



    The key thing here is the file/directory duality of files. Everything in the directory is just another file ( they aren't forks, or bundles either ). This means that any program that knows how to manipulate files can manipulate metadata, without any additional programming. Reiser is designed to handle small files ( which metadata tends to be ) without additional cost.
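The lookup rule in (a) and (b) can be sketched in Python. An ordinary file system cannot make a path both a file and a directory, so this sketch assumes a sidecar directory named `<file>.meta` as a stand-in. That naming and the function names are inventions for the example, not anything ReiserFS does.

```python
from pathlib import Path

def meta_dir(file_path):
    # Stand-in for the file-as-directory view: "<file>.meta/".
    return Path(str(file_path) + ".meta")

def write_attr(file_path, name, value, user=None):
    """Write a general attribute, or a per-user override if user is given."""
    base = meta_dir(file_path)
    target = base / user / name if user else base / name
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(value)

def read_attr(file_path, name, user):
    """Prefer <file>/<user>/<name>; fall back to the general <file>/<name>."""
    base = meta_dir(file_path)
    for candidate in (base / user / name, base / name):
        if candidate.is_file():
            return candidate.read_text()
    return None
```

Each attribute is a plain file, so ordinary permissions and ordinary tools apply to it. One wrinkle the sketch ignores is a user whose name collides with an attribute name.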
  • Reply 6 of 31
    dobbydobby Posts: 797member
    As costique said there is essential metadata and user based metadata.

    The user metadata is for personalisation and is irrelevant for the OS. My car's CD player does not care whether song.mp3 has a yellow colour label, some non-standard icon I associated with it, or a 20 star rating, or that I open it on my Mac with Windows Media Player.

    The car stereo needs to know file size, permissions and thats about it.



    If I SMB-mount a Windows system, can I still copy song.mp3 with the associated metadata? I do want the colour label, star rating and my preferred media player (as an example).

    However, as Amorph mentioned, I have a different username on this system so should I see the metadata another user created?



    The old/current Mac resource fork (or metadata) shows how useless the metadata is if it is not contained in the file itself. If the metadata is in the file, then should everyone be allowed to see it? I might not want everyone to know that I rate a Bucks Fizz song with a 5 star rating.



    What about an eps file. I send a freehand eps to a prepress house who is Adobe only. They open the file in Photoshop and lose the vectorization (or whatever).



    A simple way of adding metadata without increasing filesize, while allowing it to traverse multiple OSes, file systems and protocols, is a simple extension.

    Move away from the 3 char .xxx to a longer extension of say 6 chars.



    File metadata is a complicated issue and it is even harder to get any type of standard in place, especially if it needs to be non platform specific. For a start, MS will probably try to bend it their own way out of principle to give Linux, Mac and other OSes a headache.



    Just more food for thought.



    Dobby.
  • Reply 7 of 31
    kickahakickaha Posts: 8,760member
    Quote:

    Originally posted by dobby

    Move away from the 3 char .xxx to a longer extension of say 6 chars.



    Er... this isn't Windows.



    A file extension can be up to 255 characters, IIRC.
  • Reply 8 of 31
    dobbydobby Posts: 797member
    Sorry, poorly worded. But speaking of poorly, I thought I read somewhere that Microsoft will be introducing a new filesystem (WinFS) with metadata or something along those lines when Microsoft's new OS Wrongendofthehorn comes out.



    Dobby.

  • Reply 9 of 31
    dobbydobby Posts: 797member
    I just had a quick read on WinFS and I can't find a mention of cross-platform compatibility. I saw something about synchronization between other data stores on other machines but it wasn't clear whether they were WinFS stores or, say, AFS connecting to the WinFS API.



    Does WinFS mean we have to adopt to its metadata schema?



    Dobby.
  • Reply 10 of 31
    mmmpiemmmpie Posts: 628member
    WinFS is an API; it takes the normal NTFS file system and merges it with a SQL Server database. The SQL Server handles all of the metadata. I'm not sure what sort of schema they are using, but like any system that goes beyond basic metadata, you will have a hard time using it elsewhere or moving it.



    In some ways you can see this as an embrace and extend by Microsoft. There will be essential file information that you can't exchange with users who aren't on Longhorn ( note that they are using the backwards compatibility hamstring to force their own users to upgrade too ).



    Reiser fs is the best solution I see at the moment ( and in the near future ). They have done a lot of work to make it feasible to have files that only contain a few characters of data without exorbitant cost ( eg: minimum 4k file sizes ).
  • Reply 11 of 31
    bartobarto Posts: 2,246member
    The ReiserFS solution sounds very good, almost an improved version of what Apple is trying to do. Unfortunately, it does not alleviate the problem of having to search individual files, rather than a database, for metadata.
  • Reply 12 of 31
    othelloothello Posts: 1,054member
    can i just say -- good thread people. i'm really enjoying it, and wish i was clever enough to input to it
  • Reply 13 of 31
    costiquecostique Posts: 1,084member
    Quote:

    Originally posted by Barto

    The ReiserFS solution sounds very good, almost an improved version of what Apple is trying to do. Unfortunately, it does not alleviate the problem of having to search individual files, rather than a database, for metadata.



    Yes, it seems that it takes a team of specialists to solve that sort of task. I would first try to intercept any file system changes and update the separately stored index so that actual searching would be done on the index rather than on thousands of files. A clear disadvantage is data duplication and index synchronization issues.
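Here is a toy Python version of that approach, assuming every metadata change goes through one choke point (`set_attr` / `remove`). The class and method names are hypothetical; a real system would hook file system events instead.

```python
from collections import defaultdict

class MetadataIndex:
    """Keeps an inverted index in step with every metadata change."""

    def __init__(self):
        self._attrs = {}                  # path -> {attr: value}
        self._index = defaultdict(set)    # (attr, value) -> set of paths

    def set_attr(self, path, attr, value):
        """Intercepted write: update the stored value and the index together."""
        attrs = self._attrs.setdefault(path, {})
        if attr in attrs:
            # Drop the stale index entry for the previous value.
            self._index[(attr, attrs[attr])].discard(path)
        attrs[attr] = value
        self._index[(attr, value)].add(path)

    def remove(self, path):
        """Intercepted delete: forget the file and all its index entries."""
        for attr, value in self._attrs.pop(path, {}).items():
            self._index[(attr, value)].discard(path)

    def find(self, attr, value):
        """Answer a query from the index alone, without touching any file."""
        return sorted(self._index.get((attr, value), set()))
```

The duplication is visible here: each value lives in both `_attrs` and `_index`. Any write that bypasses `set_attr` leaves the index stale, which is the synchronization problem.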

    Quote:

    Originally posted by mmmpie

    Reiser fs is the best solution I see at the moment ( and in the near future ). They have done a lot of work to make it feasible to have files that only contain a few characters of data without exorbitant cost ( eg: minimum 4k file sizes ).



    Some 5 or 6 years ago I formatted a drive with 1k blocks with FWB HDT and did not notice any speed hits (though it was a slow SCSI drive). Maybe the trick still makes sense, seeing the number of files in OS X installations?
  • Reply 14 of 31
    mmmpiemmmpie Posts: 628member
    Even when you have 1k block sizes you waste a lot of space if you are allocating files with only a couple of bytes in them. When you are talking about metadata you are talking about having multiple of these small 'files' per file ( eg: every id3 tag for an mp3 becomes a file ). To solve this most modern file systems allow more than one file to reside within a block until it outgrows it ( I'm not sure if HFS+ does this, but resource forks are a similar idea ). These shared blocks are really slow to access ( in comparison to a normal file ), so it isn't an attractive solution to handle metadata the way reiser does ( lots of small files ). The team at Reiser have spent a lot of time developing new algorithms for storing and retrieving small files, so that they don't encounter any speed hit compared to a large ( 1 block or bigger ) file. Because of this they can afford the idea of having millions upon millions of files on a drive ( rather than the 100,000's we have now ).
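The wasted space is easy to put numbers on. This Python sketch assumes a naive allocator where every file occupies whole blocks and at least one block. The figure of ten 20-byte tags is an invented example, not a measurement.

```python
import math

def allocated_bytes(file_size, block_size):
    """Space a file occupies when it must use whole blocks (at least one)."""
    blocks = max(1, math.ceil(file_size / block_size))
    return blocks * block_size

def slack(file_sizes, block_size):
    """Total bytes allocated but not used across a set of files."""
    return sum(allocated_bytes(size, block_size) - size for size in file_sizes)

# Ten 20-byte tag 'files' hanging off one mp3:
tags = [20] * 10
print(slack(tags, 4096))  # 40760 bytes wasted with 4k blocks
print(slack(tags, 1024))  # 10040 bytes wasted with 1k blocks
```

Going from 4k to 1k blocks cuts the waste by about a factor of four, but 200 bytes of real data still occupy 10 KB. That is why packing several small files into one block matters.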



    Searching the metadata is no worse a problem than searching file contents already is ( or searching in a database ). You can maintain indexes ( which works ), and you can do exhaustive searches. If you consider iTunes, it is already doing this sort of thing. The id3 tags in the mp3 file are still there; iTunes just copies them into its database. I'm pretty sure that if you edited the id3 tags behind iTunes' back it wouldn't update automatically. It's only that people do everything through iTunes. With a general purpose metadata system everything gets done through the OS, so it can maintain indexes pretty well.



    Unfortunately, I have read that Mac applications ( I'm not sure if it's just Carbon, or Cocoa as well ) are very tightly tied to their knowledge about the HFS implementation ( trying to use UFS for an OS X end user install is difficult [?] ), and so it may be very unlikely that we will see a wholly new fs like reiser step up to the plate. Much more likely is that we'll see a less elegant solution grafted into the multi fork structure that HFS+ currently provides. This could be exactly the reason why so little progress seems to have been made on this front.
  • Reply 15 of 31
    bartobarto Posts: 2,246member
    Ok, so it's decided. Apple needs to implement universal metadata using ReiserFS style bundles combined with synchronised databases for the best of both worlds, or we're all going to Cupertino with pitchforks!
  • Reply 16 of 31
    mattjohndrowmattjohndrow Posts: 1,618member
    Quote:

    Originally posted by othello

    can i just say -- good thread people. i'm really enjoying it, and wish i was clever enough to input to it



    me too...i wish i was a little bit taller, i wish i was a baller, i wish i had a rabbit in a hat with a bat and a...wait, this isn't where i wish for things, is it? dammit, everytime!!!!!!
  • Reply 17 of 31
    costiquecostique Posts: 1,084member
    Quote:

    Originally posted by othello

    can i just say -- good thread people. i'm really enjoying it, and wish i was clever enough to input to it



    Here's a good read.
  • Reply 18 of 31
    irfotonirfoton Posts: 15member
    I'm not as informed about FSs as I should be but I don't think HFS+ requires multiple forks to store metadata. Also BFS seemed to be very strong in its ability to store and access metadata. I'm not sure how ReiserFS compares to BFS. Finally, the original creator of the BFS is now an Apple employee. The Mac community has high hopes his knowledge of BFS will be infused into HFS+.



    irfoton
  • Reply 19 of 31
    mmmpiemmmpie Posts: 628member
    Quote:

    Originally posted by irfoton

    I'm not as informed about FSs as I should be but I don't think HFS+ requires multiple forks to store metadata. Also BFS seemed to be very strong in its ability to store and access metadata. I'm not sure how ReiserFS compares to BFS. Finally, the original creator of the BFS is now an Apple employee. The Mac community has high hopes his knowledge of BFS will be infused into HFS+.



    irfoton




    I read a white paper about HFS+, and was impressed, and surprised, to see that HFS+ was given multi fork support for the explicit reason of supporting metadata.



    It would certainly be possible to add metadata without using the multi fork functionality, but it would probably break backwards compatibility, which I do see as a bad thing ( if you are going to break backwards compatibility you may as well give the FS a total overhaul ).



    You could add metadata support the way MS is, by using a separate database. The downside to any of this is that it is hard to copy files and retain their metadata ( just like it is hard to copy Mac files to stupid file systems ). The challenge of retaining that information without putting the file in an archive ( so it is accessible to a foreign OS ) is mighty. To some degree reiser addresses this by making the semantics of metadata match the semantics of normal file access - metadata is a file.



    BeOS stored metadata as an integral part of the file's structure ( just like name, date etc are in other FS's ). Searching the data required an index to make it fast. BeOS' solution didn't scale well ( but was better than their original "filesystem is a database" system ).



    It has been a long time since anything has come from Apple on the FS front, so my take is that they are working on the very tough problem of integrating metadata seamlessly into a world that has been very poor in file semantic expression. Maybe we will see something in 10.4, I dont think they will just use reiser fs, but I do think they will draw on a lot of the quality work that has gone into reiser.
  • Reply 20 of 31
    bartobarto Posts: 2,246member
    Quote:

    Originally posted by mmmpie

    It has been a long time since anything has come from Apple on the FS front, so my take is that they are working on the very tough problem of integrating metadata seamlessly into a world that has been very poor in file semantic expression. Maybe we will see something in 10.4, I dont think they will just use reiser fs, but I do think they will draw on a lot of the quality work that has gone into reiser.



    Going back to the whitepaper you mention (TN1150), Apple introduced HFSX with 10.3 (the only difference being case sensitivity). HFSX is more scalable - in terms of addition of new features - than HFS+, so it's a pretty safe bet Apple will be adding major file system improvements in the next few OS releases.



    Barto