Separate "distribution" repo for CSL styles

Hi all,

One of the outstanding issues with the CSL styles repository is that
the timestamps in cs:updated are often not up-to-date (see
http://xbiblio-devel.2463403.n2.nabble.com/new-zotero-styles-page-td6363622.html
and https://github.com/citation-style-language/schema/issues/50 for
previous discussion). But Zotero for instance needs correct timestamps
so that style auto-updating works properly. They currently update
timestamps themselves for the Zotero Style Repository using a webhook
to the “styles” repo (see
https://github.com/zotero/styles-repo/blob/e0f07ea38eae76eaae05219e6e85175a67325f08/scripts/generate-index#L220
).

Obviously, it would be preferable if the CSL project could offer its
styles directly with correct timestamps. Carles Pina was kind enough
to follow up on a suggestion of mine to create a dedicated
“distribution” repo, and has just finished building a working
prototype. It consists of a Python script, currently run on his
private server (https://github.com/citation-style-language/utilities/blob/master/styles-distribution.py),
and a secondary repository (currently
https://github.com/cpina/styles-distribution, but this would move to
citation-style-language/styles-distribution).

Carles’s script listens to a Travis CI webhook. The
“styles-distribution” repo only gets updated after Travis CI clears a
commit made to the “styles” repo. No more broken styles in the
“styles-distribution” repo! The script also removes some of the Travis
CI files that clutter up the original “styles” repo. It’s pretty
speedy too. A commit to the “styles” repo made it to the
“styles-distribution” repo in 7 minutes, which includes 6 min of
Travis CI testing:


https://travis-ci.org/citation-style-language/styles/builds/16013811
https://github.com/cpina/styles-distribution/commit/24adeeb1681a1885e0b18917c40403a2240ca758

With this setup, we can keep using the “styles” repo in the exact same
way. The only change is that Zotero, Mendeley, etc. should pull their
styles from the “styles-distribution” repo.

There are some things that have to be decided:

  • Is everybody on board? Please speak up if you see problems with this solution.
  • Somebody will have to host the script. I think it makes most sense
    if Dan Stillman of Zotero can take care of this. Dan, how do you feel
    about this?
  • Any preferences on a name for the repo? I’m fine with “styles-distribution”.

Best,

Rintze

Just a question: why a different repo?

I guess the options for storing the timestamp-corrected styles are:

  1. different repo
  2. same repo, different branch
  3. same branch

The problem with using the same repo is that we get double commits in
the repo whenever we make a change, which makes the “styles” repo
harder to navigate. The problem with using the same branch is that
writing back the updated timestamp changes the file, which would
trigger another timestamp update, ad infinitum, unless you screen
against that.

Rintze

I would agree in this setup it makes sense to have a separate repo, to avoid “polluting” the main repo with automatic commits, and to avoid infinite loops.

I have just one comment / concern. Please understand that I really appreciate the work done here, and I am really in favor of having some automated testing + clean release of the styles is a critical feature of CSL nowadays. I am only commenting here to help, but please feel free to ignore my ramblings, as I am not doing much nowadays in terms of code etc…

I can’t help but notice that the only reason for proposing this extra repo and script is to fix the issue with the ‘updated’ field. I agree this needs to be fixed and enforced. One other option, then, would be to make that part of the validation itself: comparing the modification date of the file with the value of the field. With Travis, this might need to be set up based on the git commit date. This way, things could remain as they are now, where commits are not moved to the master branch until they pass the Travis validation. This would keep things under one consistent, simpler, model, that’s also consistent with how software projects are typically run.

Charles

I can’t help but notice that the only reason for proposing this extra repo
and script is to fix the issue with the ‘updated’ field.

As I wrote, there are two small additional benefits.

  • we would only update the second repo when the Travis tests pass. We do
    most things via pull request nowadays, but sometimes we still break a few
    styles when committing directly to the master branch and not running the
    tests locally (although we could just be more careful and not fail the
    tests, of course).
  • the script doesn’t copy over the files related to Travis testing, which
    makes the second repo a bit cleaner.

I agree this needs to be fixed and enforced. One other option, then,
would be to make that part of the validation itself: comparing the
modification date of the file with the value of the field.

But this would require us to demand that contributors provide correct
timestamps, right? That seems really burdensome. The timestamp needs to be
correct with more than day precision, since we not infrequently want to
push multiple updates on a day. And the ISO format isn’t the most
approachable to newcomers, especially when dealing with time zones.

In pull requests, we often ask contributors to make some changes. If we
require correct timestamps, they will have to correct the timestamp
multiple times.

I understand why you’d like to avoid a second repo, but I like the
set-and-forget nature of what Carles put together.

Rintze

As I wrote, there are two small additional benefits.

  • we would only update the second repo when the Travis tests pass. We do most things via pull request nowadays, but sometimes we still break a few styles when committing directly to the master branch and not running the tests locally (although we could just be more careful and not fail the tests, of course).

I thought we had discussed (implemented?) a trigger that would only merge into master commits from the dev branch that pass the validation?

  • the script doesn’t copy over the files related to Travis testing, which makes the second repo a bit cleaner.

I think it’s a very minor issue, but fair enough :slight_smile:

I agree this needs to be fixed and enforced. One other option, then, would be to make that part of the validation itself: comparing the modification date of the file with the value of the field.

But this would require us to demand that contributors provide correct timestamps, right? That seems really burdensome. The timestamp needs to be correct with more than day precision, since we not infrequently want to push multiple updates on a day. And the ISO format isn’t the most approachable to newcomers, especially when dealing with time zones.

In pull requests, we often ask contributors to make some changes. If we require correct timestamps, they will have to correct the timestamp multiple times.

I think the validation can be fairly lenient and allow e.g. +/- 1 h leeway. And yes, it’s an extra burden on contributors, but IMO, it’s easier than the rest of the CSL syntax that they have to deal with.

I understand why you’d like to avoid a second repo, but I like the set-and-forget nature of what Carles put together.

I see the appeal, and I like the simplicity and pragmatism of the setup. I am just worried that the "set-and-forget” is a double-edged sword. You are adding another clean-up/validation phase, separate from the other validation, that will confuse contributors and CSL clients. We just need to make sure its use is restricted to the goals you outlined. And the fact that it needs an extra script to be run somewhere on another server, that also needs to be maintained, makes the ‘forget’ part a bit misleading :wink:

With the above caveats, thumbs up.

Charles

I like that too. Why make us do things the computer could do for us?

FYI, I have in Textmate a trigger, where if I type ‘updated’ followed by tab, I get a fully formed field.

Charles

As I wrote, there are two small additional benefits.

  • we would only update the second repo when the Travis tests pass. We do most things via pull request nowadays, but sometimes we still break a few styles when committing directly to the master branch and not running the tests locally (although we could just be more careful and not fail the tests, of course).

I thought we had discussed (implemented?) a trigger that would only merge into master commits from the dev branch that pass the validation?

We currently ask for pull requests against “master”, which we merge
when they pass Travis validation.

I think the validation can be fairly lenient and allow e.g. +/- 1 h leeway. And yes, it’s an extra burden on contributors, but IMO, it’s easier than the rest of the CSL syntax that they have to deal with.

I just don’t think most contributors will get this right on the first
try. Which would mean we’d have several more comments and commits per
pull request just to get the timestamp right, which puts an additional
burden on Sebastian and me as well.

You are adding another clean-up/validation phase, separate from the other validation, that will confuse contributors and CSL clients.

We would only mention the “styles-distribution” repo to developers.
All contributors would be steered to the “styles” repo.

Rintze

What is “the script” here?

Rintze

I think he meant the validation script, but I was just thinking: how about a commit git hook? I have no experience with this, but it seems it could be a nice way to fix those on the client side even before it’s pushed to the repo, and it would be part of the repo by default, so it would run on everybody’s machine without them doing anything.

Most pull requests come from users who only use the GitHub.com website.
They won’t ever clone the repo to their computer.

Rintze

I would expect the commit hooks to still be executed, but I don’t know for sure. If only I had time to look into this… :slight_smile:

GitHub.com doesn’t run arbitrary code:
http://stackoverflow.com/questions/19041220/how-to-run-post-receive-hook-on-github

No response from Zotero? Dan, I’d really appreciate hearing your thoughts.

The other two alternatives I can come up with:

  • leave things as they are. As I mentioned, this way projects
    downstream of CSL can’t rely on our timestamps, and will have to roll
    out their own timestamp update strategy if they need correct
    timestamps.
  • keep things in a single repo and demand more from contributors. This
    could be done if we:
    • update all current styles to have accurate timestamps (based on
      the git log; I can just merge from Carles’s repo)
    • update the Travis CI script to require accurate timestamps (this
      needs to be based on the modification date of the (modified) files if
      the Travis CI tests are run locally, since in this case I run the
      tests before committing)
    • require contributors to provide accurate timestamps. We might be
      able to make this slightly easier for our users by linking to a
      timestamp generator like http://coderstoolbox.net/unixtimestamp/

In the latter case we need to be at least somewhat flexible with
timestamps since contributors can’t anticipate the exact time of their
git commit. We would also want to avoid contributors having to update
the timestamp each time they make a change to their pull request. And
there will be a discrepancy between file modification times and commit
times. E.g. we could require that the value of cs:updated matches the
git log timestamp or be at most x hours earlier (e.g., 24 hours). We
don’t have this kind of limitation if we use a separate repo, which
always will have more accurate timestamps.

Rintze

  • Somebody will have to host the script. I think it makes most sense
    if Dan Stillman of Zotero can take care of this. Dan, how do you feel
    about this?

Sorry, I didn’t see this thread until now. The proposed solution seems
good to me, and I think Zotero can host the script.

My one concern is the 6-minute delay on every commit. I assume that’s
because it’s testing every style on every commit? Is there a way to test
only recent styles (say, from the last week or month, in case something
isn’t fixed right away) in the CI script by checking the git log for
modified files? I’ve been meaning to hook up our styles page to a GitHub
commit hook in order to have instantaneous availability of styles (so
contributors can stop saying “30 minutes” when adding/fixing styles for
people in the Zotero forums), and it’d be nice to have something closer
to that here.

  • Any preferences on a name for the repo? I’m fine with “styles-distribution”.

Sounds good to me.

Sylvester already did some work to allow tests to be run on individual
styles. See https://github.com/citation-style-language/styles/wiki/Test-Environment#testing-styles-in-isolation
. However, a significant number of tests requires parsing of all
styles (e.g. we test against duplicate ISSNs and titles in the
repository, template and independent-parent links must point to
existing styles, etc.), so I’m not entirely sure how much this would
speed things up.

In addition, building the environment seems to take a few minutes
anyway. Travis CI is experimenting with cached environments, but this
is currently only available for the (pricey) private repos:
http://about.travis-ci.org/docs/user/caching/ . Not sure if this
restriction is also in place for WAD (see
http://about.travis-ci.org/docs/user/speeding-up-the-build/ ). I also
found https://coderwall.com/p/x8exja (“Speed up Travis-CI build
preparation time by 800%”), which might have some tips.

Rintze

Hi Dan,On 7 January 2014 20:25, Dan Stillman <@Dan_Stillman> wrote:

On 12/26/13, 9:54 PM, Rintze Zelle wrote:

  • Somebody will have to host the script. I think it makes most sense
    if Dan Stillman of Zotero can take care of this. Dan, how do you feel
    about this?

Sorry, I didn’t see this thread until now. The proposed solution seems
good to me, and I think Zotero can host the script.

Let me know, perhaps off-list, if you need any help setting up the
script and the callback method.

Regards,

The styles-distribution repo is now up and running:

As previously discussed, it’s updated automatically on every commit to
the main styles repo based on a Travis CI webhook, so it should contain
only valid styles. The corollary is that, if an invalid style is
committed to the main repo, the distribution repo won’t be updated at
all until the invalid style is fixed.

The code for this is here:

This is using a mostly rewritten version of Carles’s script, with more
efficient updates, consistent UTC timestamps (not based on commit time
zones), more logging, and some other improvements. The update code is
called internally by a WSGI webapp, which is running on a Zotero server.

Thanks to Carles for getting this started and to Rintze for helping to
set this up.

  • Dan

That’s great news, thank you!

I’ll monitor for a few days and then we will switch to the new
repository (probably in 2 weeks, I’m off for some time and I don’t
want to change it just before leaving).

Regards,