Understanding Federated identity, RA21 and other authentication methods



Debates on privacy in libraries are not new, though they have recently become more heated over two issues. One issue revolves around learning analytics, and this has been brewing for a while. The argument began in a different form with correlation studies of student success (which can be seen as doing, in an ad hoc and limited manner, what learning analytics aims to do). While there has been talk of libraries participating in learning analytics, this particular Educause talk seems to have triggered librarians.

Pushback from librarians has been strong on Twitter and elsewhere (see “Learning Analytics and the Academic Library: Professional Ethics Commitments at a Crossroads” and “Can we demonstrate academic library value without violating student privacy?”).

The other issue, which has emerged recently but is still under the radar for many librarians, involves a push to move away from IP authentication of resources under the auspices of RA21: Resource Access for the 21st Century, and this is what I want to discuss here.

RA21 – the push beyond IP Authentication

RA21: Resource Access for the 21st Century is a movement led by the International Association of Scientific, Technical, and Medical Publishers (STM) and the National Information Standards Organization (NISO).

But what is the goal and mission statement of RA21?

“Publishers, libraries, and consumers have all come to the understanding that authorizing access to content based on IP address no longer works in today’s distributed world. The RA21 project hopes to resolve some of the  fundamental issues that create barriers to moving to federated identity in place of IP address authentication by looking at some of the products and services available in the identity discovery space today, and determining best practice for future implementations going forward. [Italics added]” – RA21

So what is the issue here? Basically, while RA21 has the potential to make access more seamless, it also has the potential to erode user privacy.

Librarians are unfamiliar with non-IP methods of authentication

As I write this, many librarians are still relatively unaware of this very recent push by publishers to move away from the IP authentication systems that have been in place for accessing electronic resources since the early 2000s.

For instance, at a session on RA21 at the most recent Charleston Library Conference (which is usually attended by electronic resource librarians, who are likely to handle authentication and access issues), fewer than half the people in the room raised their hands when asked if they had heard of RA21.

The stated aim of RA21 is to provide more seamless access for users anywhere and on any device. As an added benefit, moving toward individual login systems as opposed to IP authentication allows personalized services to be offered. Of course, the flip side of providing personalized services is that publishers will potentially get their hands on rich analytics about users.

Some, including this librarian, think that the threat of Sci-Hub is a major reason for this sudden push away from IP authentication. Firstly, off-campus access to resources with IP methods is very clunky, and it seems some users prefer to use Sci-Hub to access papers instead of their institution’s subscriptions, so to compete, more seamless access to subscribed material is necessary.

Secondly, policing violations where users share their access is a lot easier if users log in with individual accounts as opposed to access being granted purely by IP range.

It is interesting to wonder which of these reasons is the strongest driver of the switch. But before even considering this, I think part of the reason why the issue is still relatively unknown is that understanding how authentication works can be quite arcane, even for fairly techy librarians.

The essay below is my attempt to explain the issues at stake to myself. Do take it with a pinch of salt, since I’m still learning.

This is also going to be a largely non-technical discussion, focusing on the essentials that might affect privacy.

The current situation – IP Authentication with proxy.

Currently, when an institution subscribes to an electronic resource, say a journal, the publisher of the platform or database is given the IP range of the institution’s campus. Whenever it detects that a user comes from that IP range, it assumes the user probably belongs to that institution and allows access to the content.
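To make the idea concrete for myself, here is a minimal Python sketch of the kind of check a publisher platform might run. The IP ranges are made-up documentation addresses and the function name is my own invention; real platforms will do this differently.

```python
import ipaddress

# Hypothetical campus ranges registered with the publisher
# (documentation-only address blocks, not any real institution's).
CAMPUS_RANGES = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/25"),
]

def is_on_campus(client_ip: str) -> bool:
    """Return True if the client's IP falls inside any registered campus range."""
    addr = ipaddress.ip_address(client_ip)
    return any(addr in network for network in CAMPUS_RANGES)

print(is_on_campus("192.0.2.17"))   # an address inside the first range
print(is_on_campus("203.0.113.5"))  # an address outside both ranges
```

Note that the check only ever sees an address, never a person, which is the root of both the privacy benefit and the off-campus problem discussed next.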

 

User with right IP (in campus) is allowed access

 

This works fine when the user is on campus, but what happens if they are off campus? They can’t be recognised by the platform or database because they don’t have the right IP address.

User with wrong IP (off campus) is not allowed access

Multiple solutions have been tried over the years to solve this. One method is to have the user connect through a VPN (Virtual Private Network) when off campus so that the user appears to have the right IP, but I believe this method is now relatively unpopular compared to providing access via a proxy.

Currently, the proxy software that has pretty much a monopoly in the library world is OCLC’s EZproxy. The proxy server sits between the user and the journal or resource and passes requests from the user to the journal. From the point of view of the journal, all it sees is the proxy server making the request, and obviously the proxy server has the appropriate IP address to get access.

User with wrong IP (off campus) passes request through the proxy server (with right IP) and gains access

Using the proxy method means that every request you make is sent via the proxy, and the response from the publisher (the content) is sent back via the proxy.

The problem with this method is that for the proxy to work, you can’t use the normal URL, like http://www.jstor.org. A special string needs to be added to the URL the user uses, so that the proxy server knows which URL to request. So, for example, to access JSTOR through my university, you need the following URL:

http://libproxy.smu.edu.sg/login?url=http://www.jstor.org

The part before the JSTOR address (http://libproxy.smu.edu.sg/login?url=) is called the proxy stem, and adding it ensures that your request to JSTOR goes through your proxy server.
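As a rough illustration, this is all that “proxy enabling” a link amounts to. The sketch below is in Python and uses the proxy stem from the example above; the function name is hypothetical, and it simply prepends the stem without any of the checks a real tool would need.

```python
# Proxy stem taken from the example URL in this post.
PROXY_STEM = "http://libproxy.smu.edu.sg/login?url="

def add_proxy_stem(target_url: str, stem: str = PROXY_STEM) -> str:
    """Prepend the proxy stem unless the link is already proxy-enabled."""
    if target_url.startswith(stem):
        return target_url
    return stem + target_url

print(add_proxy_stem("http://www.jstor.org"))
```

Anything that rewrites links for you is doing some variation of this string manipulation.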

If you start off from the library homepage or library search systems, there is no problem because all such links are “proxy enabled”.

But these days, most people do not start their searches at the library homepage. They may use Google, click on a link sent by a friend, or arrive in a multitude of other ways, and land on a publisher page whose URL does not include the proxy stem.

The savvy user gets around such issues by using some quick way to add the proxy stem (bookmarklets are popular), and over the years this blog has covered various ways to accomplish this (see here for example). The current state of the art is to use a browser extension to help find both subscribed resources and open access items (see the Lean Library browser extension, the Google Scholar Button, Kopernio, etc.), but of course getting a large percentage of your user base to install the extension is impossible.

Still, despite the inconvenience, this setup has great privacy benefits for the user. The publisher only recognises the IP address of the user and not the user individually. This also explains why the COUNTER statistics provided by publishers seem so simple. They can tell you the number of views and the number of downloads but not who was responsible for them, not even in aggregate by user group.

 

Typical COUNTER-compliant statistics from publishers only tell you requests, downloads and searches but not who did them.

To get statistics on who is using what, you typically have to parse the proxy logs yourself (or use OpenAthens; see later).

Off campus, EZproxy shields the user further: the publisher can’t even attempt to loosely identify users by IP, as all requests come from the IP address of the EZproxy server. Ever wonder why, when there is a massive downloading incident, publishers or databases shut off access for the whole community instead of just barring the individual? The reason is that it’s impossible for them to identify the individual, and they need the help of the institution to do so. (Note: it’s not a walk in the park for the library or campus IT to identify the individual either, because IP addresses don’t map to individuals.)

Campus Activated Subscriber Access (CASA) – a detour

Before we talk about RA21, let’s talk about another new, related authentication method: Campus Activated Subscriber Access (CASA).

I’m willing to wager even fewer librarians have heard of Campus Activated Subscriber Access (CASA).

It is currently in use by Heinonline, Project MUSE/Highwire and even JSTOR, in cooperation with Google.

How it works from the user’s point of view is this.

The user is on campus and is searching Google Scholar for an article. He clicks on a link to, say, Heinonline or JSTOR. The slight nuance here is that this is a special type of link in Google Scholar: it’s a “subscriber link”, and the publisher needs to have opted in to the “Subscriber Links program for Google Scholar”.

https://scholar.google.com/intl/en/scholar/publishers.html#otherpolicies

Most important to note is that this isn’t the same as the Library Links program that adds the findit@libraries link resolver link.

Here Elsevier explains what it is.

“In 2013 Google Scholar launched the Subscriber Links program. If libraries opt in, subscribed users can easily see when they have full-text access to an article in Google Scholar search results and click through directly to the content. Full-text indicators help to ensure users have uninterrupted access to the research materials they are seeking. ”

See also Annual Reviews’ comments on Subscriber Links, which say it “is intended to complement (not replace) existing link resolver systems that libraries have implemented through Google Scholar and Library Links”.

My understanding of how Google Scholar subscriber links work is that publishers like Heinonline (with permission from libraries?) give Google Scholar not just the holdings of the library but also the IP range of the university.

So when a person searches Google Scholar from the right IP address, Google Scholar knows what holdings he has access to and displays the links to those platforms (in this case Heinonline/JSTOR) in the search results instead of variants from other platforms.

CASA builds on this further. When the user clicks on this special subscriber link in Google Scholar, Google creates a cookie in the browser recording that the user accessed Heinonline from this IP range and hence is a user of that institution. So when the user is off campus and accesses the link again from Google Scholar on the same device, he is given access automatically even though he does not have the right IP address.

The clearest explanation of CASA I could find is this UKSG article.

 

https://insights.uksg.org/articles/10.1629/uksg.360/

Sounds good, right? The user can now access links from Google Scholar to journals even if he is off campus! No need to fiddle with weird proxy stuff or unreliable link resolvers. That is exactly the workflow the user wants.

The issue is this: what information does Google store in the cookie? Google now presumably knows, at a minimum, that the user is from University X, and this might be associated with your Google account, which means Google knows even more about you.

The journal publisher is presumably “told” that the user is from University X in order to grant access, but assuming Google doesn’t share anything else, such as the associated Google account, the publisher can’t tell anything more than that the user is from University X, which isn’t so bad in my opinion if you trust Google.

After all, with IP authentication, Heinonline/JSTOR will know from your IP address which institution you are from anyway. The only difference is that when you are off campus with the proxy, your true IP address is shielded, while in this case your IP won’t be shielded from Heinonline. Still, I think most people won’t be shocked or unhappy about this; it’s basically the cost of doing business when accessing websites.

So far so good, as long as you trust Google and the analysis above holds.

Let’s now go on to RA21.

Resource Access for the 21st Century (RA21)

RA21, or Resource Access for the 21st Century, is a joint initiative by NISO and STM. I won’t talk more about the background and history of the initiative because it is not the focus of this piece.

When I first heard of RA21, I had a big problem trying to understand it. Sure, I understood the final goals, but how specifically were they proposing to solve the issue?

I have never worked in an institution that used a federated identity management system such as Athens or Shibboleth for access to electronic resources, and as such I had a lot of trouble understanding what RA21 referred to as “federated identity”. Add the confusion you feel when you read about Shibboleth/Athens/SAML, and the whole thing seems impenetrable.

Understanding Federated identity and SAML

After a lot of reading and watching videos on authentication, I think the key is to understand SAML (Security Assertion Markup Language) based methods of authentication for access to electronic resources. RA21 seems to be mostly about improving those (in particular the WAYF, or “Where are you from”, process; more on that later).

SAML is behind Shibboleth and OpenAthens, which are currently employed by some institutions as a form of authentication. Athens in particular is pretty common in UK higher education, but in recent years OpenAthens implementations have started to appear worldwide, and even in Singapore some institutions are trialling it (via Ebsco).

For a non-IT-trained librarian like myself, trying to understand the implications of SAML is daunting, but here’s my best understanding of it.

As a side note, there are differences between OpenAthens and Shibboleth, though they are both based on SAML. If you are interested, watch the video below (essentially, OpenAthens claims to be easier for librarians to manage and has a built-in proxy to handle resources that don’t support SAML methods), but for the purposes of this post you don’t really need to know the difference, as both are based on SAML and provide Single Sign-On (SSO) using existing user accounts.

 

Why SAML?

SAML stands for Security Assertion Markup Language. “SAML is an XML-based, open-standard data format for exchanging authentication and authorization data between an identity provider (like the Gluu Server) and a service provider (like Dropbox, O365, etc.)”

Clear as mud, isn’t it?

Let me try. The key idea here is this. When someone tries to log in to a resource or service, the resource or service could in theory do the authentication at its end. But that means the resource or service would need to create and store usernames and passwords, and worse, the user would have to remember a username and password for each resource or service he wants to access. Is there a way to reuse an existing system that already does authentication?

So instead of doing that, the resource or service, called the “Service Provider” (SP) in SAML speak, redirects the user to an “identity provider” (IdP). The user then signs in at the identity provider, and when this is done, the identity provider redirects the user back to the original service provider and makes an assertion that the user is indeed verified.

 

Example of SAML access to Salesforce.  [Source]

In the diagram, the user accesses Salesforce (an external CRM system) using company logins, achieving Single Sign-On (SSO).

In the higher education context, your campus IT might actually have this set up already for some systems. For example, my institution has Single Sign-On (SSO) via Shibboleth set up for email, the staff portal, the student portal and even Alma. This can of course be extended to electronic resources, but that currently isn’t done here.

Let’s now look at an electronic resource example. Our user tries to sign in at, say, JSTOR, which is the service provider (SP). JSTOR then redirects the user to our identity provider (IdP) (how it knows which identity provider to use is explained later). In most cases, this identity provider is our local university’s authentication system, and you will be greeted with the familiar sign-on screen you use when signing on to your email (assuming you haven’t already signed on during this session).

 

Typical sign-in screen you might see when redirected back to your IdP

Once this is successfully done (any regular local authentication scheme can be used), the identity provider (IdP) redirects the user back to the service provider (SP), in this case JSTOR, with an assertion that the user is a valid user.

The service provider (SP), JSTOR, then uses this assertion to decide whether to give the user access.

To recap: SAML allows the service provider to pass the load of signing on to the IdP server of the local institution that the user belongs to. The IdP doesn’t need to share passwords with the SP either; all it does is send a digitally signed (and therefore trusted) assertion to the SP that the user has been verified. Best of all, the user doesn’t need to remember a password for JSTOR; he just uses his normal institutional one.

Depending on the timeout limit set, the user can now click on “login” at other SPs and be logged in instantly through the same IdP without having to authenticate again.

Which Identity providers should I send the user to? Or the WAYF/Discovery problem.

So we said

1. User tries to sign in/access resource from SP
2. SP redirects user to IdP
3. User signs in at IdP
4. IdP checks if sign-in is correct and then redirects user back to SP with a trusted assertion that the user is verified
5. SP gives access
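The steps above can be sketched in a few lines of Python. This is very much a toy under loud assumptions: real SAML assertions are XML documents signed with the IdP’s private key, whereas here I use a shared HMAC secret and a JSON payload purely to keep the example short. All names and URLs are invented.

```python
import hashlib
import hmac
import json
from typing import Optional

# Stand-in for the trust set up in advance between IdP and SP.
# ASSUMPTION: real SAML uses XML signatures and certificates, not a shared secret.
SHARED_SECRET = b"demo-only-secret"

def idp_sign_in(username: str, password: str, directory: dict) -> Optional[dict]:
    """Steps 3-4: the IdP checks the credentials and issues a signed assertion."""
    if directory.get(username) != password:
        return None
    claims = {"issuer": "https://idp.university-x.example", "authenticated": True}
    payload = json.dumps(claims, sort_keys=True)
    signature = hmac.new(SHARED_SECRET, payload.encode(), hashlib.sha256).hexdigest()
    return {"payload": payload, "signature": signature}

def sp_grant_access(assertion: Optional[dict]) -> bool:
    """Step 5: the SP verifies the signature; it never sees the password."""
    if assertion is None:
        return False
    expected = hmac.new(
        SHARED_SECRET, assertion["payload"].encode(), hashlib.sha256
    ).hexdigest()
    if not hmac.compare_digest(expected, assertion["signature"]):
        return False
    return json.loads(assertion["payload"]).get("authenticated") is True

# Steps 1-2 (user arrives at the SP and is redirected to the IdP) are implied.
campus_directory = {"alice": "correct horse"}
print(sp_grant_access(idp_sign_in("alice", "correct horse", campus_directory)))
print(sp_grant_access(idp_sign_in("alice", "wrong password", campus_directory)))
```

Notice that the assertion in this sketch says only that someone was authenticated by University X’s IdP. What else gets packed into it is the privacy question taken up later in this post.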

But here’s a question: how does the SP at step 2 know which IdP to ask?

To make it concrete: you land on JSTOR and try to access it or log in. How does JSTOR know which institution you are from, and hence which identity provider to redirect you to? There could be thousands of institutions that subscribe to JSTOR, and each has its own IdP.

This is a long-standing problem known as WAYF (Where are you from), or the discovery process. What are the solutions?

Firstly, where the user starts from a library home page or library discovery service before linking off to a service provider, this isn’t a problem, and it is easy to specify to the SP which IdP to use. How? Typically, the link from the library webpage or search will be a WAYFless URL. It provides information that lets the service provider know which identity provider to redirect to.

So for example this is a WAYFless URL in Shibboleth

 

WAYFless URL in UK Federation Shibboleth, ?entityID= specifies the IdP server

There are SP-side and IdP-side WAYFless URLs, but the main thing to know is that there’s a string in the URL that tells the SP which IdP is used.
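Here is a small Python sketch of how an SP-side WAYFless URL might be assembled. The hostnames and the function name are invented, and the login path and the target parameter follow a pattern I have seen for Shibboleth SPs, so treat them as assumptions; the only part taken from the example above is the entityID parameter.

```python
from urllib.parse import urlencode, urlparse, parse_qs

def build_wayfless_url(sp_login_endpoint: str, idp_entity_id: str, target: str) -> str:
    """Build a link that tells the SP up front which IdP to redirect the user to."""
    query = urlencode({"entityID": idp_entity_id, "target": target})
    return f"{sp_login_endpoint}?{query}"

url = build_wayfless_url(
    "https://sp.publisher.example/Shibboleth.sso/Login",  # hypothetical SP endpoint
    "https://idp.university-x.example/shibboleth",        # hypothetical IdP entity ID
    "https://sp.publisher.example/article/123",           # page to land on after login
)
print(url)
print(parse_qs(urlparse(url).query)["entityID"])
```

As with the proxy stem, somebody (usually the library) has to generate these links in advance.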

What if a WAYFless URL is not used?

Of course, you may be thinking that WAYFless URLs are in a sense not much different from adding a proxy stem to links, and you might be right. The really hard problem to solve is when the user just lands on the SP, and the SP has to figure out which of the potentially thousands of organizations, and hence IdPs, is the right one.

This is where the friction lies. Typically, with Shibboleth or OpenAthens, the user has to indicate somehow where he is from. With OpenAthens, one can use an OpenAthens username and password (I understand this is less common?) or browse through a list of institutions to select one’s own. This is a very confusing experience, even for a librarian like me who barely understands things like federations and Shibboleth.

Let me use the ScienceDirect login as a typical example. I click on sign in on ScienceDirect and I see this.

Straight away I see three sign in options, which should I use?

Clicking on other institutions I see yet another three options.

The “OpenAthens login” option brings you to the OpenAthens login screen (see below), while “Search for your institution and click the name to login” is an auto-complete box; alternatively, you can select manually from a drop-down menu below it.

Honestly, it is very confusing to me, even as a librarian. And as more organizations use OpenAthens or Shibboleth, it can get even worse.

Is there a way to speed up this process? Perhaps for the SP to automatically remember or intelligently guess which IdP you are likely to want?

RA21 is trying to fix the pain points in SAML based methods

So now that we have a rough understanding of what SAML-based methods are and where the friction exists (the Where are you from, or WAYF, problem), we can understand RA21 as trying out ideas to improve on this particular pain point.

RA21 currently has three pilots, most of which, I believe, focus on reducing the friction of the WAYF problem. I’ll describe two of them to give you a taste.

WAYF Cloud

For example, the WAYF Cloud proposal is for a cloud system in which, when a user logs in to one SP with one IdP, the following is stored in the cloud: firstly, a hash of the device ID (so the device can be identified) and, secondly, the IdP used.

This means other SPs will know which devices use which IdP even if the device has never been seen at the SP.

 

You may be wondering about any privacy concerns. In theory they should be minimal. The DeviceID is “pseudoanonymized (hashed) device IDs (randomly generated identifiers) ” so it’s not possible to personally identify the user as individual Mr X (not without other information anyway). Even without the WAYF cloud, the user would eventually choose the IdP he wants to use at the SP and the SP would eventually get that information anyway.

Concrete example. Some device visits JSTOR and the user goes through the normal WAYF/Discovery process and selects the IdP at institution X. The WAYF cloud stores information on the device (hashed) and IdP used.

Next he goes to ScienceDirect and clicks on login. Even though he has never been to ScienceDirect, ScienceDirect will grab his hashed device ID and query the WAYF cloud to ask which IdPs this device ID has used in the past (even at other SPs), and if it finds one or more, it will offer them to the user automatically for selection.

Possible UI, the WAYF cloud lets the SP know the user (or rather device) has used these 2 IdPs in the past at other SPs.

No need to hunt through long lists of institutions! This even works if the user has signed out of everything, as the information about which IdP was used is stored in the cloud, as long as the device can be recognised.
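My mental model of the WAYF cloud, written as a Python sketch. This is not the pilot’s actual design or API: the class, its methods, the choice of SHA-256 and the IdP URL are all my own assumptions, used only to show the “hashed device ID maps to previously used IdPs” idea.

```python
import hashlib
from collections import defaultdict
from typing import List

class WayfCloud:
    """Toy shared store: hashed device ID -> IdPs that device has used before."""

    def __init__(self) -> None:
        self._idps_by_device = defaultdict(list)

    @staticmethod
    def _hash(device_id: str) -> str:
        # Only the hash is kept, never the raw device ID.
        return hashlib.sha256(device_id.encode()).hexdigest()

    def record(self, device_id: str, idp: str) -> None:
        """Called after a device completes the normal WAYF process at some SP."""
        used = self._idps_by_device[self._hash(device_id)]
        if idp not in used:
            used.append(idp)

    def lookup(self, device_id: str) -> List[str]:
        """Called by any SP, even one this device has never visited."""
        return list(self._idps_by_device.get(self._hash(device_id), []))

cloud = WayfCloud()
cloud.record("device-1234", "https://idp.university-x.example/shibboleth")  # at SP 1
print(cloud.lookup("device-1234"))  # SP 2 asks the cloud
print(cloud.lookup("device-9999"))  # an unknown device gets no suggestions
```

The store holds nothing about who the user is, only which IdPs a given device has been seen with.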

You can try the demo here.

Notice that the above are institutions. The documentation notes that in theory personal account IDs stored by the SP (not by the WAYF cloud, which only stores IdPs) can be displayed in this UI as well.

This SP knows which IdPs this user uses (via the WAYF cloud) and also which individual accounts were used (via other methods).

Privacy Preserving Persistent WAYF

The Privacy Preserving Persistent WAYF or P3 WAYF is a wide ranging attempt to improve the WAYF process.

Part of it involves “various hints from federation metadata (such as email domain, IP address, GEO locations, to suggest to the user the most relevant IdPs to select from. From the user’s perspective, they will only be presented with IdPs that have worked for them in the past or which are known to work with a given SP.”

I understand part of it aims to create a registry that automatically checks to see which IdPs work with which SPs, so it can offer a stream-lined list of probable IdPs the user might want. It may also just use the domain part of emails to figure out which IdP to use.

Once selected, this information on the IdP will be stored in the user’s browser and reused when possible.
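A rough Python sketch of the email-domain hint, as I understand it. The registry contents, the domains and the function name are all invented, and the real P3 WAYF design may combine its hints quite differently.

```python
from typing import List

# Hypothetical registry mapping email domains to IdP entity IDs (all values invented).
IDP_BY_EMAIL_DOMAIN = {
    "university-x.example": "https://idp.university-x.example/shibboleth",
    "college-y.example": "https://idp.college-y.example/shibboleth",
}

def suggest_idps(email: str, remembered: List[str]) -> List[str]:
    """Return candidate IdPs: ones remembered in the browser first, then a domain guess."""
    suggestions = list(remembered)
    domain = email.rsplit("@", 1)[-1].lower()
    guess = IDP_BY_EMAIL_DOMAIN.get(domain)
    if guess and guess not in suggestions:
        suggestions.append(guess)
    return suggestions

print(suggest_idps("jane.doe@university-x.example", remembered=[]))
print(suggest_idps("someone@unknown.example", remembered=[]))
```

Either way, the user ends up with a short list of likely IdPs instead of a list of thousands.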

So what’s the issue? What is being shared with SP?

All this seems to be good for our users. Who doesn’t want more seamless access?

The question is this: when the IdP sends an assertion over to the SP to convince the SP to allow access, what exactly is asserted?

Is it

1. The user trying to access is a member of institution X

or

2. The user trying to access is a member of institution X and is from usergroup Faculty or  worse has email aarontay@smu.edu.sg

Obviously #1 is harmless, as it is the exact same information the SP can infer when allowing access via IP ranges. Arguably, SAML methods do allow the SP to get the exact IP address of the user, while EZproxy shields the actual IP of the user, but this isn’t too heinous.

#2 is what is probably alarming. In SAML speak, what user attributes are given up to the SP?
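To see the difference between the two cases, here is a Python sketch of an IdP filtering what it releases to each SP. The user record, the SP names and the policy table are invented; the attribute names (eduPersonScopedAffiliation, mail, displayName) are commonly used ones, but which of them any given SP actually receives is exactly what I don’t know.

```python
# Hypothetical directory entry held by the institution's IdP.
USER_RECORD = {
    "eduPersonScopedAffiliation": "member@university-x.example",
    "mail": "jane.doe@university-x.example",
    "displayName": "Jane Doe",
}

# Hypothetical per-SP release policy set by the institution.
RELEASE_POLICY = {
    "https://sp.publisher-a.example": ["eduPersonScopedAffiliation"],
    "https://sp.publisher-b.example": ["eduPersonScopedAffiliation", "mail", "displayName"],
}

def attributes_for(sp_entity_id: str, user: dict) -> dict:
    """Return only the attributes the policy allows for this SP (none if unlisted)."""
    allowed = RELEASE_POLICY.get(sp_entity_id, [])
    return {name: user[name] for name in allowed if name in user}

print(attributes_for("https://sp.publisher-a.example", USER_RECORD))  # case 1: membership only
print(attributes_for("https://sp.publisher-b.example", USER_RECORD))  # case 2: identifies the person
```

Publisher A learns no more than it would from an IP range, while publisher B can tie every request to a named individual.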

Think this is purely theory? That institutions would never share such information? Here’s Lisa Hinchliffe who has been trying to get the word out to librarians on this issue about what the InCommon Shibboleth shares.

Notice that ProQuest, as service provider, knows all sorts of details about Lisa, including her full name and email. According to Lisa, she has not given this information to ProQuest herself. Other tweets suggest that this is not uncommon.

OpenAthens does it too potentially

Want an OpenAthens example? Ebsco, which is a redistributor of OpenAthens, has been giving a series of webinars introducing OpenAthens. In most examples, the presenter talks about accessing resources via the library webpage and the discovery service, Ebsco Discovery Service (EDS). At first look, the results aren’t that impressive.

In the video below, he goes to the library webpage and logs in to OpenAthens, then in the same session starts clicking on various links on the website and in EDS to resources, and they are all immediately authenticated without the user having to sign in (because he already authenticated earlier).

You might think this isn’t very impressive. You can offer the same experience with ezproxy when starting from the library home page and you would be mostly right.

Except notice in some of the examples the login doesn’t just show the institution name after signing-on, but shows the name of the user!

In the solo example, where he shows the first common use case of someone going directly to the resource, in this case ScienceDirect, the same thing happens. Once the sign-in is done, the user is logged in to his individual account.

Think very carefully about what is happening in the background for this to occur. Basically, the information about who the user is has to be passed on to the service provider! SAML is telling the service provider, in this case ScienceDirect, who the user is (at a minimum, the email).

In SAML speak there are “attributes” that are asserted and passed on to the service provider.

This is of course necessary for the often talked about personalization.

Roger Schonfeld has talked about the improved ease of use of single user accounts and how decoupling access from being institution-based to individual-based brings a host of advantages to the user, such as the ability to choose between multiple affiliations for access.

Currently, using IP authentication or proxy methods, the user has to authenticate twice to get personalised service: once via the proxy for access and once within the platform to enjoy the use of an individual account. OpenAthens seems to have bypassed all this with one login.

 

Why talk about RA21?

 

Notice that the issues I mention above are already in play for institutions that use Shibboleth and OpenAthens. Even without RA21 this issue exists, and it’s unclear how much worse proposals like the WAYF cloud will make things in terms of privacy. So why “blame” RA21?

The main thing to be aware of is that RA21, if successful, is likely to speed up the push towards such federated identity solutions and away from IP access.

Who decides what user information to share?

RA21’s position is that it is up to each institution to decide what information to pass along to service providers. In other words, if an institution chooses to send along more information than the simple assertion that the user is from X, say that the user is aarontay@smu.edu.sg, to allow personalization, that is the choice of institution X, and hopefully it is careful about this.

One flaw with this argument is that, even if the institution decides thoughtfully, there is no provision giving individuals the right to decide. There is currently no way for individuals even to know what user information is passed on. Some commenters are also skeptical of institutions’ ability to properly manage such data flows and prefer the choice to be in the hands of individuals.

Are there systems that put users in control? I understand that some consideration was given to using OAuth or OpenID as a solution, as these give users control, but they were eventually rejected in favour of SAML-based methods, which are well used but work at the institution level.

Federated identity solutions everywhere or a hybrid solution

An interesting question posed by Lisa Hinchliffe at a recent UKSG webinar on RA21 was this: does RA21 envision a total, 100% push away from IP authentication, or is there room for a hybrid solution? Another interesting question was how RA21 would handle walk-in users of electronic resources, since they typically can’t authenticate with local systems.

I guess the question we are dancing around is: are federated identity solutions meant to totally replace IP solutions? Or are they meant to co-exist?

Let’s go back to the reasons why we want to replace IP solutions. From the user’s point of view, one is to make access more seamless and the other is to provide personalized services. Let’s leave aside the reasons why publishers want to replace IP solutions for now.

If seamless access is all that is wanted, arguably IP solutions should remain, at least for on-campus use. After all, even the most frictionless version of Shibboleth or OpenAthens involves more friction than IP access on campus.

Compare someone on campus clicking a generic link in a blog and landing on an article in JSTOR. IP authentication means he immediately has access, with no action required. If only OpenAthens is available, the user still has to click on the login button and then click on the IdP offered (hopefully the right ones have already been intelligently selected for him) before he has access!

That said, federated identity solutions provide the advantages of personalization and, coincidentally, make it easier for publishers to track users and shut down compromised accounts, and if that is the major reason for RA21, then it makes sense to eventually do away with IP authentication totally.

Conclusion

This has been a long complicated post. Hopefully I haven’t mangled things up too much.

There are still many things I don’t understand about SAML. For example who sets what level of attributes should be sent? Is it the service provider (Sciencedirect/JSTOR) or the IdP? What is generally asserted before SPs provide access?

It would be great to be able to get a list of which SPs are getting granular attributes like email, user group and names, and which are just getting a general assertion.

RA21 is of course a new developing issue so it might be too early to decide what you think of it.

Acknowledgement : Lisa Hinchliffe has been tireless in trying to build interest in RA21 among librarians. While I had heard of RA21 before she inspired me to look closer and this piece is the result.
